Solr cloud clusterstate.json update query ?

2015-05-05 Thread Sai Sreenivas K
Could you clarify on the following questions,
1. Is there a way to avoid all the nodes simultaneously getting into
recovery state when bulk indexing happens? Is there an API to disable
replication on one node for a while?

2. We recently changed the host name on nodes in solr.xml, but the old host
entries still exist in clusterstate.json, marked as active, even though
live_nodes has the correct information. Who updates clusterstate.json if a
node goes down ungracefully, without notifying its down state?

Thanks,
Sai Sreenivas K


Alternate ways to facet spatial data

2015-05-05 Thread James Sewell
Hello all,

I've just started using SOLR for spatial queries and it looks great so far.
I've mostly been investigating importing a large amount of point data,
indexing and searching it.

I've discovered the facet.heatmap functionality, which is great - but I
would like to ask if it is possible to get slightly different results from
this.

Essentially rather than a heatmap I would like either a polygon per cluster
(might be too much computation?) or a point per cluster (centroid would be
great, centre of grid would be ok), coupled with the point count.

Is this currently possible using faceting, or does it seem like a workable
feature I could implement?

Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect
__


 Level 2, 50 Queen St, Melbourne VIC 3000

*P *(+61) 3 8370 8000  *W* www.lisasoft.com  *F *(+61) 3 8370 8099





Re: Solr 5.1.0 Cloud and Zookeeper

2015-05-05 Thread shacky
Thank you very much for your answer.
I installed ZooKeeper 3.4.6 on my Debian (Wheezy) system, and it's working well.
The only problem is that I'm looking for an init script but cannot find
one. I've also tried to adapt the script from Debian's zookeeperd
package, but ran into some problems.
Do you know of any working init scripts for ZooKeeper on Debian?

2015-05-05 15:30 GMT+02:00 Mark Miller markrmil...@gmail.com:
 A bug fix version difference probably won't matter. It's best to use the
 same version everyone else uses and the one our tests use, but it's very
 likely 3.4.5 will work without a hitch.

 - Mark

 On Tue, May 5, 2015 at 9:09 AM shacky shack...@gmail.com wrote:

 Hi.

 I read on
 https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
 that Solr needs to use the same ZooKeeper version it owns (at the
 moment 3.4.6).
 Debian Jessie has ZooKeeper 3.4.5
 (https://packages.debian.org/jessie/zookeeper).

 Are you sure that this version won't work with Solr 5.1.0?

 Thank you very much for your help!
 Bye



SolrCloud indexing

2015-05-05 Thread Vincenzo D'Amore
Hi all,

I have 3 nodes and there are 3 shards but looking at solrcloud admin I see
that all the leaders are on the same node.

If I understood the Solr documentation correctly
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
:

 When a document is sent to a machine for indexing, the system first
 determines if the machine is a replica or a leader.
 If the machine is a replica, the document is forwarded to the leader for
 processing.
 If the machine is a leader, SolrCloud determines which shard the document
should go to, forwards the document to the leader for that shard, indexes the
 document for this shard, and forwards the index notation to itself and any
 replicas.


So I have 3 nodes, with 3 shards and 2 replicas of each shard.

http://picpaste.com/pics/Screen_Shot_2015-05-05_at_15.19.54-Xp8uztpt.1430832218.png

Does it mean that all the indexing is done by the leaders in one node? If
so, how do I distribute the indexing (the shard leaders) across nodes?


-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


Solr Wordpress - one server or two?

2015-05-05 Thread Robg50
I'm thinking of taking SOLR for a test drive and will probably keep it if it
works as I'm hoping, so I'd like to get it as right as possible the first
time out.

I'm running Wordpress on Ubuntu with PHP and MariaDB 10. The server is a
7-core, 4 GB Azure VM. The database is 4 GB. The data itself is mainly docs,
PDFs, and app descriptions/images from iTunes and the Google Play Store.

I have two questions:
1. Should I put SOLR on the same server that hosts my site and db, or
create a second VM just for SOLR? I'm looking for speed here, mainly.
2. If I install on a 2nd server, should I use Tomcat instead of Apache2?

Any advice is much appreciated!!! Rob





Solr 5.1.0 Cloud and Zookeeper

2015-05-05 Thread shacky
Hi.

I read on 
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
that Solr needs to use the same ZooKeeper version it ships with (at the
moment 3.4.6).
Debian Jessie has ZooKeeper 3.4.5
(https://packages.debian.org/jessie/zookeeper).

Are you sure that this version won't work with Solr 5.1.0?

Thank you very much for your help!
Bye


Re: Solr 5.1.0 Cloud and Zookeeper

2015-05-05 Thread Mark Miller
A bug fix version difference probably won't matter. It's best to use the
same version everyone else uses and the one our tests use, but it's very
likely 3.4.5 will work without a hitch.

- Mark

On Tue, May 5, 2015 at 9:09 AM shacky shack...@gmail.com wrote:

 Hi.

 I read on
 https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
 that Solr needs to use the same ZooKeeper version it owns (at the
 moment 3.4.6).
 Debian Jessie has ZooKeeper 3.4.5
 (https://packages.debian.org/jessie/zookeeper).

 Are you sure that this version won't work with Solr 5.1.0?

 Thank you very much for your help!
 Bye



Re: Multiple index.timestamp directories using up disk space

2015-05-05 Thread Rishi Easwaran
Being worried about data loss makes sense. If I understand the way Solr
behaves, the new directory should only have missing/changed segments.
I guess since our application is extremely write heavy, with lots of inserts
and deletes, almost every segment is touched even during a short window, so it
appears that for our deployment every segment is copied over when replicas get
out of sync.

Thanks for clarifying this behaviour of SolrCloud, so we can put external
steps in place to resolve the situation when it arises.
 

 

 

-Original Message-
From: Ramkumar R. Aiyengar andyetitmo...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 5, 2015 4:52 am
Subject: Re: Multiple index.timestamp directories using up disk space


Yes, data loss is the concern. If the recovering replica is not able to
retrieve the files from the leader, it at least has an older copy.

Also, the entire index is not fetched from the leader, only the segments
which have changed. The replica initially gets the file list from the
replica, checks against what it has, and then downloads the difference --
then moves it to the main index. Note that this process can fail sometimes
(say due to I/O errors, or due to a problem with the leader itself), in
which case the replica drops all accumulated files from the leader, and
starts from scratch. If that happens, it needs to look back at its old
index again to figure out what it needs to download on the next attempt.

May be with a fair number of assumptions which should usually hold good,
you can still come up with a mechanism to drop existing files, but those
won't hold good in case of serious issues with the cloud, you could end up
losing data. That's worse than using a bit more disk space!
On 4 May 2015 11:56, Rishi Easwaran rishi.easwa...@aol.com wrote:

Thanks for the responses Mark and Ramkumar.

 The question I had was, why does Solr need 2 copies at any given time,
leading to 2x disk space usage.
 Not sure if this information is not published anywhere, and makes HW
estimation almost impossible for large scale deployment. Even if the copies
are temporary, this becomes really expensive, especially when using SSD in
production, when the complex size is over 400TB indexes, running 1000's of
solr cloud shards.

 If a solr follower has decided that it needs to do replication from leader
and capture full copy snapshot. Why can't it delete the old information and
replicate from scratch, not requiring more disk space.
 Is the concern data loss (a case when both leader and follower lose data)?.

 Thanks,
 Rishi.

-Original Message-
From: Mark Miller markrmil...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Apr 28, 2015 10:52 am
Subject: Re: Multiple index.timestamp directories using up disk space

If copies of the index are not eventually cleaned up, I'd fill a JIRA to
address the issue. Those directories should be removed over time. At times
there will have to be a couple around at the same time and others may take
a while to clean up.

- Mark

On Tue, Apr 28, 2015 at 3:27 AM Ramkumar R. Aiyengar
andyetitmo...@gmail.com wrote:

 SolrCloud does need up to twice the amount of disk space as your usual
 index size during replication. Amongst other things, this ensures you have
 a full copy of the index at any point. There's no way around this, I would
 suggest you provision the additional disk space needed.
 On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote:

  Hi All,
 
  We are seeing this problem with solr 4.6 and solr 4.10.3.
  For some reason, solr cloud tries to recover and creates a new index
  directory - (ex:index.20150420181214550), while keeping the older index
  as is. This creates an issues where the disk space fills up and the shard
  never ends up recovering.
  Usually this requires a manual intervention of bouncing the instance and
  wiping the disk clean to allow for a clean recovery.
 
  Any ideas on how to prevent solr from creating multiple copies of index
  directory.
 
  Thanks,
  Rishi.


Re: Finding out optimal hash ranges for shard split

2015-05-05 Thread anand.mahajan
Looks like it's not possible to find out the optimal hash ranges for a split
before you actually split it. So the only way out is to keep splitting the
large subshards?





Re: SolrCloud indexing

2015-05-05 Thread Erick Erickson
bq: Does it mean that all the indexing is done by the leaders in one node?

no. The raw document is forwarded from the leader to the replica and
it's indexed on all the nodes. The leader has a little bit of extra
work to do routing the docs, but that's it. Shouldn't be a problem
with 3 shards.

bq: If so, how do I distribute the indexing (the shard leaders) across nodes?

You don't really need to bother, I don't think, especially if you don't
see significantly higher CPU utilization on the leader. If you
absolutely MUST distribute leadership, see the Collections API
REBALANCELEADERS and BALANCESHARDUNIQUE commands (Solr 5.1 only), but frankly
I wouldn't worry about it unless and until you have a demonstrated need.
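
For reference, those are Collections API calls; a rough sketch of what they
might look like (host, port and collection name are placeholders):

  http://localhost:8983/solr/admin/collections?action=BALANCESHARDUNIQUE&collection=collection1&property=preferredLeader
  http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=collection1

The first spreads a preferredLeader property evenly across the replicas, and
the second then asks Solr to try to make those preferred replicas the actual
shard leaders.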

Best,
Erick

On Tue, May 5, 2015 at 6:28 AM, Vincenzo D'Amore v.dam...@gmail.com wrote:
 Hi all,

 I have 3 nodes and there are 3 shards but looking at solrcloud admin I see
 that all the leaders are on the same node.

 If I understood well looking at  solr documentation
 https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
 :

 When a document is sent to a machine for indexing, the system first
 determines if the machine is a replica or a leader.
 If the machine is a replica, the document is forwarded to the leader for
 processing.
 If the machine is a leader, SolrCloud determines which shard the document
 should go to, forwards the document the leader for that shard, indexes the
 document for this shard, and forwards the index notation to itself and any
 replicas.


 So I have 3 nodes, with 3 shards and 2 replicas of each shard.

 http://picpaste.com/pics/Screen_Shot_2015-05-05_at_15.19.54-Xp8uztpt.1430832218.png

 Does it mean that all the indexing is done by the leaders in one node? If
 so, how do I distribute the indexing (the shard leaders) across nodes?


 --
 Vincenzo D'Amore
 email: v.dam...@gmail.com
 skype: free.dev
 mobile: +39 349 8513251


Re: Solr 5.0 - uniqueKey case insensitive ?

2015-05-05 Thread Erick Erickson
Well, "working fine" may be a bit of an overstatement. That has never
been officially supported, so it just happened to work in 3.6.

As Chris points out, if you're using SolrCloud then this will _not_
work, as routing happens early in the process, i.e. before the analysis
chain gets the token, so various copies of the doc will exist on
different shards.

Best,
Erick

On Mon, May 4, 2015 at 4:19 PM, Bruno Mannina bmann...@free.fr wrote:
 Hello Chris,

 Yes, I confirm that on my Solr 3.6 it has worked fine for several years, and
 each doc added with the same code is updated, not added.

 To be more clear, I receive docs with a field named pn; it's the
 uniqueKey, and it is always in uppercase,

 so I must define in my schema.xml

  <field name="id" type="string" multiValued="false" indexed="true"
    required="true" stored="true"/>
  <field name="pn" type="text_general" multiValued="true" indexed="true"
    stored="false"/>
  ...
  <uniqueKey>id</uniqueKey>
  ...
  <copyField source="id" dest="pn"/>

 but the application that uses Solr already exists, so it queries the pn
 field, not id; I cannot change that.
 And in each doc I receive there is no id field, just a pn field, and I
 cannot change that either.

 So there is a problem, no? I must import an id field and query a pn field,
 but I only have a pn field available at import time...



 On 05/05/2015 at 01:00, Chris Hostetter wrote:

 : On SOLR3.6, I defined a string_ci field like this:
 :
 : <fieldType name="string_ci" class="solr.TextField"
 :     sortMissingLast="true" omitNorms="true">
 :   <analyzer>
 :     <tokenizer class="solr.KeywordTokenizerFactory"/>
 :     <filter class="solr.LowerCaseFilterFactory"/>
 :   </analyzer>
 : </fieldType>
 :
 : <field name="pn" type="string_ci" multiValued="false" indexed="true"
 :     required="true" stored="true"/>


 I'm really surprised that field would have worked for you (reliably) as a
 uniqueKey field even in Solr 3.6.

 The best practice for something like what you describe has always (going
 back to Solr 1.x) been to use a copyField to create a case-insensitive
 copy of your uniqueKey for searching.

 If, for some reason, you really want case-insensitive *updates* (so a doc
 with id foo overwrites a doc with id FOO), then the only reliable way to
 make something like that work is to do the lowercasing in an
 UpdateProcessor to ensure it happens *before* the docs are distributed to
 the correct shard, and so the correct existing doc is overwritten (even if
 you aren't using SolrCloud).



 -Hoss
 http://www.lucidworks.com/
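
For illustration only, a rough sketch of that UpdateProcessor approach using
the stock StatelessScriptUpdateProcessorFactory (the chain name, field name
and script file name below are made up; a small custom Java UpdateProcessor
would work the same way). In solrconfig.xml:

  <updateRequestProcessorChain name="lowercase-pn">
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">lowercase-pn.js</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

and a lowercase-pn.js next to it in the conf directory:

  // lowercase the uniqueKey value before the doc is routed/indexed
  function processAdd(cmd) {
    var doc = cmd.solrDoc;                 // SolrInputDocument
    var v = doc.getFieldValue("pn");
    if (v != null) {
      doc.setField("pn", v.toString().toLowerCase());
    }
  }
  // depending on the Solr version, stubs for the other process* callbacks
  // may be needed as well
  function processDelete(cmd) { }
  function processCommit(cmd) { }
  function finish() { }

The chain then has to be referenced from the update handler (e.g. via an
update.chain default) so it actually runs on incoming documents.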







Re: Solr cloud clusterstate.json update query ?

2015-05-05 Thread Erick Erickson
about 1. This shouldn't be happening, so I wouldn't concentrate
there first. The most common reason is that you have a short Zookeeper
timeout and the replicas go into a stop-the-world garbage collection
that exceeds the timeout. So the first thing to do is to see if that's
happening. Here are a couple of good places to start:

http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
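
For reference, the timeout involved is zkClientTimeout; a sketch of where it
usually lives (the 30000 ms value is only an example, not a recommendation):

  <!-- solr.xml -->
  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    ...
  </solrcloud>

or, equivalently, as a startup property such as -DzkClientTimeout=30000.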

2. A partial answer is that ZK does a keep-alive type thing, and if the
solr nodes it knows about don't reply, it marks them as down.

Best,
Erick

On Tue, May 5, 2015 at 5:42 AM, Sai Sreenivas K sa...@myntra.com wrote:
 Could you clarify on the following questions,
 1. Is there a way to avoid all the nodes simultaneously getting into
 recovery state when a bulk indexing happens ? Is there an api to disable
 replication on one node for a while ?

 2. We recently changed the host name on nodes in solr.xml. But the old host
 entries still exist in the clusterstate.json marked as active state. Though
 live_nodes has the correct information. Who updates clusterstate.json if
 the node goes down in an ungraceful fashion without notifying its down
 state ?

 Thanks,
 Sai Sreenivas K


Re: SolrCloud collection properties

2015-05-05 Thread Erick Erickson
_What_ properties? Details matter.

And how do you do this now? Assuming you do this with separate conf
directories, these are then just configsets in Zookeeper, and you can
have as many of them as you want. The problem here is that each one of
them is a complete set of schema and config files; AFAIK the config
set is the finest granularity that you have OOB.
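
As an illustration of that pattern (ZK address, paths and names below are
placeholders), the usual flow is to upload one config set and point several
collections at it:

  # upload a single shared config set to Zookeeper
  cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig \
    -confdir /path/to/shared/conf -confname shared_conf

  # create collections that all reference it
  http://localhost:8983/solr/admin/collections?action=CREATE&name=coll_a&numShards=2&replicationFactor=2&collection.configName=shared_conf
  http://localhost:8983/solr/admin/collections?action=CREATE&name=coll_b&numShards=2&replicationFactor=2&collection.configName=shared_conf

The per-collection differences then have to come from somewhere other than
the shared files themselves, which is exactly the gap being discussed here.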

Best,
Erick

On Tue, May 5, 2015 at 6:55 AM, Markus Heiden markus.hei...@s24.com wrote:
 Hi,

 we are trying to migrate from Solr 4.10 to SolrCloud 4.10. I understood
 that SolrCloud uses collections as abstraction from the cores. What I am
 missing is a possibility to store collection-specific properties in
 Zookeeper. Using property.foo=bar in CREATE-URLs just sets core-specific
 properties which are not distributed, e.g. if I migrate a shard from one
 node to another.

 How do I define collection-specific properties (to be used in
 solrconfig.xml and schema.xml) which get distributed with the collection to
 all nodes?

 Why do I try that? Currently we have different cores which structure is
 identical, but have each having some specific properties. I would like to
 have a single configuration for them in Zookeeper from which I want to
 create different collections, which just differ in the value of some
 properties.

 Markus


Re: Solr cloud clusterstate.json update query ?

2015-05-05 Thread Gopal Jee
About 2: live_nodes under zookeeper is an ephemeral node (please see
zookeeper ephemeral nodes). So, once the connection from the Solr zkClient to
zookeeper is lost, these nodes will disappear automatically. AFAIK,
clusterstate.json is updated by the overseer based on messages published to a
queue in zookeeper by the Solr zkClients. In case a Solr node dies
ungracefully, I am not sure how this event is updated in clusterstate.json.
Can someone shed some light on ungraceful Solr shutdown and the consequent
status update in clusterstate? I guess there would be some way, because all
nodes in a cluster decide clusterstate based on the watched clusterstate.json
node. They will not be watching live_nodes to update their state.
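
If it helps, both pieces of state can be inspected directly with ZooKeeper's
own CLI (the address is a placeholder):

  zkCli.sh -server localhost:2181
  ls /live_nodes
  get /clusterstate.json

live_nodes entries vanish as soon as the owning session expires, while
clusterstate.json only changes when the overseer writes a new version.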

Gopal

On Wed, May 6, 2015 at 6:33 AM, Erick Erickson erickerick...@gmail.com
wrote:

 about 1. This shouldn't be happening, so I wouldn't concentrate
 there first. The most common reason is that you have a short Zookeeper
 timeout and the replicas go into a stop-the-world garbage collection
 that exceeds the timeout. So the first thing to do is to see if that's
 happening. Here are a couple of good places to start:

 http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
 http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr

 2 Partial answer is that ZK does a keep-alive type thing and if the
 solr nodes it knows about don't reply, it marks the nodes as down.

 Best,
 Erick

 On Tue, May 5, 2015 at 5:42 AM, Sai Sreenivas K sa...@myntra.com wrote:
  Could you clarify on the following questions,
  1. Is there a way to avoid all the nodes simultaneously getting into
  recovery state when a bulk indexing happens ? Is there an api to disable
  replication on one node for a while ?
 
  2. We recently changed the host name on nodes in solr.xml. But the old
 host
  entries still exist in the clusterstate.json marked as active state.
 Though
  live_nodes has the correct information. Who updates clusterstate.json if
  the node goes down in an ungraceful fashion without notifying its down
  state ?
 
  Thanks,
  Sai Sreenivas K




--


Re: Solr Wordpress - one server or two?

2015-05-05 Thread Shawn Heisey
On 5/5/2015 6:11 AM, Robg50 wrote:
 I'm thinking of taking SOLR for a test drive and will probably keep it if it
 works as I'm hoping so I'd like to get it as right as possible the first
 time out.

 I'm running Wordpress on Ubuntu with php and Mariadb 10. The server is a 7
 core, 4gb, Azure VM. The database is 4gb. The data itself is mainly docs,
 pdfs, and app descriptions/images from iTunes and Google Play Store.

 I have two questions: 1. Should I put SOLR on the same server that hosts my
 site and db or create a second VM just for SOLR? I'm looking for speed here,
 mainly.

 If I install on a 2nd server should I use Tomcat instead of Apache2?

It's nearly impossible to answer your question with the information
provided.  We can make an educated guess if we have more info, but it
would be a guess ... the only way to actually know is to prototype and
try it.

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Also, be aware of general advice regarding memory and Solr performance. 
Unless your index is tiny, which probably won't be the case with a 4GB
source database, 4GB of RAM may be too small for even a dedicated Solr
machine.  If the machine is NOT dedicated, the likelihood of 4GB being
enough RAM is even smaller:

http://wiki.apache.org/solr/SolrPerformanceProblems

I can say something about your last question ... Solr won't run in
Apache.  It requires a Java servlet container.  As long as I've been
using it, it has shipped with an example that includes Jetty, but in 4.x
and earlier versions there was a separately available war file for
deployment in another app like Tomcat.  The 5.x versions are shipped as
a complete package with startup scripts that start Jetty.  You *can*
still find the .war and put it in another container like Tomcat, but we
are discouraging that approach even more strongly than we did for past
versions.

Thanks,
Shawn



SolrCloud collection properties

2015-05-05 Thread Markus Heiden
Hi,

we are trying to migrate from Solr 4.10 to SolrCloud 4.10. I understand
that SolrCloud uses collections as an abstraction over cores. What I am
missing is a way to store collection-specific properties in
Zookeeper. Using property.foo=bar in CREATE URLs just sets core-specific
properties, which are not distributed, e.g. if I migrate a shard from one
node to another.

How do I define collection-specific properties (to be used in
solrconfig.xml and schema.xml) which get distributed with the collection to
all nodes?

Why do I try that? Currently we have different cores whose structure is
identical, but which each have some specific properties. I would like to
have a single configuration for them in Zookeeper, from which I want to
create different collections that just differ in the value of some
properties.

Markus


Re: Solr Exception The remote server returned an error: (400) Bad Request.

2015-05-05 Thread marotosg
Thanks for the answer, but I don't think that's going to solve my problem.
For instance, if I copy this query into the Chrome browser:
http://localhost:8080/solr48/person/select?q=CoreD:25
I get the detailed error response: status 400, q=CoreD:25, msg "undefined
field CoreD", code 400.
If I use wget from Linux:
wget http://localhost:8080/solr48/person/select?q=CoreD:25
I get only "ERROR 400: Bad Request."
Is there any reason why I am not getting the same error? Thanks




Re: Multiple index.timestamp directories using up disk space

2015-05-05 Thread Shawn Heisey
On 5/5/2015 7:29 AM, Rishi Easwaran wrote:
 Worried about data loss makes sense. If I get the way solr behaves, the new 
 directory should only have missing/changed segments. 
 I guess since our application is extremely write heavy, with lot of inserts 
 and deletes, almost every segment is touched even during a short window, so 
 it appears like for our deployment every segment is copied over when replicas 
 get out of sync.

Once a segment is written, it is *NEVER* updated again.  This aspect of
Lucene indexes makes Solr replication more efficient.  The ids of
deleted documents are written to separate files specifically for
tracking deletes.  Those files are typically quite small compared to the
index segments.  Any new documents are inserted into new segments.

When older segments are merged, the information in all of those segments
is copied to a single new segment (minus documents marked as deleted),
and then the old segments are erased.  Optimizing replaces the entire
index, and each replica of the index would be considered different, so
an index recovery that happens after optimization might copy the whole
thing.

If you are seeing a lot of index recoveries during normal operation,
chances are that your Solr servers do not have enough resources, and the
resource that has the most impact on performance is memory.  The amount
of memory required for good Solr performance is higher than most people
expect.  It's a normal expectation that programs require memory to run,
but Solr has an additional memory requirement that often surprises them
-- the need for a significant OS disk cache:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Solr/ Solr Cloud meetup at Aol

2015-05-05 Thread Rishi Easwaran

 Hi All,

Aol is hosting a meetup in Dulles VA. The topic this time is Solr/ Solr Cloud. 

 http://www.meetup.com/Code-Brew/events/53217/

Thanks,
Rishi.

Re: Multiple index.timestamp directories using up disk space

2015-05-05 Thread Shawn Heisey
On 5/5/2015 1:15 PM, Rishi Easwaran wrote:
 Thanks for clarifying lucene segment behaviour. We don't trigger optimize 
 externally, could it be internal solr optimize? Is there a setting/ knob to 
 control when optimize occurs. 

Optimize never happens automatically, but *merging* does.  An optimize
is nothing more than a forced merge down to one segment.  There is a
merge policy, consulted anytime a new segment is created, that decides
whether any automatic merges need to take place and what segments will
be merged.  That merge policy can be configured in solrconfig.xml.
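
For reference, a sketch of where that sits in a 4.x/5.x-era solrconfig.xml
(the values shown are just the usual defaults, not a recommendation):

  <indexConfig>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicy>
  </indexConfig>

Raising segmentsPerTier makes merges less frequent at the cost of more
segments per searcher; lowering it does the opposite.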

  The behaviour we see multiple huge directories for the same core. Till we 
 figure out what's going on, the only option we are left with it is to clean 
 up the entire index to free up disk space, and allow a replica to sync from 
 scratch.

If multiple index directories exist after replication, there was either
a problem that prevented the rename and deletion of the directories
(common on Windows, less common on UNIX variants like Linux), or you're
running into a bug.  Unless you are performing maintenance or a machine
goes down, index recovery (replication) should *not* be happening during
normal operation of a SolrCloud cluster.  Frequent index recoveries
usually mean that there's a performance problem.

Solr performs better on bare metal than on virtual machines.

Thanks,
Shawn



Re: Limit Results By Score?

2015-05-05 Thread Chris Hostetter

: We have implemented a custom scoring function and also need to limit the
: results by score.  How could we go about that?  Alternatively, can we
: suppress the results early using some kind of custom filter?

in general, limiting by score is a bad idea for all of the reasons 
outlined here...

https://wiki.apache.org/lucene-java/ScoresAsPercentages

...if you have defined a custom scoring function, then many of those 
issues may not apply, and you can use the frange parser to filter 
documents which do not have a score in a given range...

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser

How exactly you use frange with your custom score depends on how you 
implement it -- if you implemented it directly as a ValueSource (w/ a 
ValueSourceParser) then you can just call that function directly.

If you've implemented it as a custom similarity on regular query 
structures, you can still use frange and just wrap your query using the 
query() function.

Either way, you can use the frange parser as part of a filter query to 
limit the results based on the score of your function -- independent of 
what your main query / sort are.

If, in the latter case, you want to match & sort documents based on the 
same query, you can still do that using local params to refer to the query 
in both places...

q=your_custom_query&sort=score desc&fq={!frange l=90}query($q)
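
Spelled out as a full request against a placeholder core (spaces and braces
would need URL-encoding in practice), that looks something like:

  http://localhost:8983/solr/collection1/select?q=your_custom_query&sort=score desc&fq={!frange l=90}query($q)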


-Hoss
http://www.lucidworks.com/


Re: Schema API: add-field-type

2015-05-05 Thread Steve Rowe
Hi Steve, responses inline below:

 On Apr 29, 2015, at 6:50 PM, Steven White swhite4...@gmail.com wrote:
 
 Hi Everyone,
 
 When I pass the following:
 http://localhost:8983/solr/db/schema/fieldtypes?wt=xml
 
 I see this (as one example):
 
  <lst>
    <str name="name">date</str>
    <str name="class">solr.TrieDateField</str>
    <str name="precisionStep">0</str>
    <str name="positionIncrementGap">0</str>
    <arr name="fields">
      <str>last_modified</str>
    </arr>
    <arr name="dynamicFields">
      <str>*_dts</str>
      <str>*_dt</str>
    </arr>
  </lst>
 
 See how there is fields and dynamicfields?  However, when I look in
 schema.xml, I see this:
 
  <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
    positionIncrementGap="0"/>
 
 See how there is nothing about fields and dynamicfields.
 
 Now, when I look further into the schema.xml, I see they are coming from:
 
  <field name="last_modified" type="date" indexed="true" stored="true"/>
  <dynamicField name="*_dt"  type="date" indexed="true" stored="true"/>
  <dynamicField name="*_dts" type="date" indexed="true" stored="true"
    multiValued="true"/>
 
 So it all makes sense.
 
 Does this means the response of fieldtypes includes fields and
 dynamicfields as syntactic-sugar to let me know of the relationship this
 field-type has or is there more to it?

It’s FYI: this is the full list of fields and dynamic fields that use the given 
fieldtype.

 The reason why I care about this question is because I'm using Solr's
 Schema API (see: https://cwiki.apache.org/confluence/display/solr/Schema+API)
 to make changes to my schema.  Per this link:
 https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-AddaNewFieldType
 it shows how to add a field-type via add-field-type but there is no
 mention of fields or dynamicfields in this API.  My assumption is
 fields and dynamicfields need not be part of this API, instead it is
 done via add-field and add-dynamic-field, thus what I see in the XML of
 fieldtypes response is just syntactic-sugar.  Did I get all this right?
 

Yes, as you say, to add (dynamic) fields after adding a field type, you must 
use the “add-field” and “add-dynamic-field” commands.  Note that you can do so 
in a single request if you like, as long as “add-field-type” is ordered before 
any referencing “add-field”/“add-dynamic-field” command.

To be clear, the “add-field-type” command does not support passing in a set of 
fields and/or dynamic fields to be added with the new field type.
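
As a hedged example of that single-request form (endpoint, type and field
names below are illustrative, and a managed schema is assumed):

  curl -X POST -H 'Content-type:application/json' \
    http://localhost:8983/solr/db/schema --data-binary '{
    "add-field-type": {
      "name": "text_lower",
      "class": "solr.TextField",
      "analyzer": {
        "tokenizer": { "class": "solr.KeywordTokenizerFactory" },
        "filters": [ { "class": "solr.LowerCaseFilterFactory" } ]
      }
    },
    "add-field": {
      "name": "title_lower",
      "type": "text_lower",
      "indexed": true,
      "stored": true
    }
  }'

Because "add-field-type" appears before the "add-field" that references it,
both commands succeed in one round trip.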

Steve



Re: Editing the Solr Wiki

2015-05-05 Thread Chris Hostetter

you should be good to go, thanks (in advance) for helping out with your 
edits.

: http://www.manning.com/turnbull/. I have already set up an account with 
: the username NicoleButterfield.   Many thanks in advance for your help 


-Hoss
http://www.lucidworks.com/

Re: Multiple index.timestamp directories using up disk space

2015-05-05 Thread Rishi Easwaran
Hi Shawn, 

Thanks for clarifying the Lucene segment behaviour. We don't trigger optimize 
externally; could it be an internal Solr optimize? Is there a setting/knob to 
control when optimize occurs?

Thanks for pointing it out; we will monitor memory closely. Though I doubt 
memory is an issue: these are top-tier machines with 144GB RAM supporting 
12 x 4GB JVMs, out of which 9 JVMs are running in cloud mode writing to SSD. 
That should leave enough memory for the OS cache.


The behaviour we see is multiple huge directories for the same core. Till we 
figure out what's going on, the only option we are left with is to clean up 
the entire index to free up disk space and allow a replica to sync from 
scratch.

Thanks,
Rishi.  

 

-Original Message-
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 5, 2015 10:55 am
Subject: Re: Multiple index.timestamp directories using up disk space


On 5/5/2015 7:29 AM, Rishi Easwaran wrote:
 Worried about data loss makes sense. If I get the way solr behaves, the new
 directory should only have missing/changed segments.
 I guess since our application is extremely write heavy, with lot of inserts
 and deletes, almost every segment is touched even during a short window, so
 it appears like for our deployment every segment is copied over when
 replicas get out of sync.

Once a segment is written, it is *NEVER* updated again.  This aspect of
Lucene indexes makes Solr replication more efficient.  The ids of
deleted documents are written to separate files specifically for
tracking deletes.  Those files are typically quite small compared to the
index segments.  Any new documents are inserted into new segments.

When older segments are merged, the information in all of those segments
is copied to a single new segment (minus documents marked as deleted),
and then the old segments are erased.  Optimizing replaces the entire
index, and each replica of the index would be considered different, so
an index recovery that happens after optimization might copy the whole
thing.

If you are seeing a lot of index recoveries during normal operation,
chances are that your Solr servers do not have enough resources, and the
resource that has the most impact on performance is memory.  The amount
of memory required for good Solr performance is higher than most people
expect.  It's a normal expectation that programs require memory to run,
but Solr has an additional memory requirement that often surprises them
-- the need for a significant OS disk cache:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


 


Re: Multiple index.timestamp directories using up disk space

2015-05-05 Thread Ramkumar R. Aiyengar
Yes, data loss is the concern. If the recovering replica is not able to
retrieve the files from the leader, it at least has an older copy.

Also, the entire index is not fetched from the leader, only the segments
which have changed. The replica initially gets the file list from the
replica, checks against what it has, and then downloads the difference --
then moves it to the main index. Note that this process can fail sometimes
(say due to I/O errors, or due to a problem with the leader itself), in
which case the replica drops all accumulated files from the leader, and
starts from scratch. If that happens, it needs to look back at its old
index again to figure out what it needs to download on the next attempt.

May be with a fair number of assumptions which should usually hold good,
you can still come up with a mechanism to drop existing files, but those
won't hold good in case of serious issues with the cloud, you could end up
losing data. That's worse than using a bit more disk space!
On 4 May 2015 11:56, Rishi Easwaran rishi.easwa...@aol.com wrote:

Thanks for the responses Mark and Ramkumar.

 The question I had was, why does Solr need 2 copies at any given time,
leading to 2x disk space usage.
 Not sure if this information is not published anywhere, and makes HW
estimation almost impossible for large scale deployment. Even if the copies
are temporary, this becomes really expensive, especially when using SSD in
production, when the complex size is over 400TB indexes, running 1000's of
solr cloud shards.

 If a solr follower has decided that it needs to do replication from leader
and capture full copy snapshot. Why can't it delete the old information and
replicate from scratch, not requiring more disk space.
 Is the concern data loss (a case when both leader and follower lose data)?.

 Thanks,
 Rishi.







-Original Message-
From: Mark Miller markrmil...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Apr 28, 2015 10:52 am
Subject: Re: Multiple index.timestamp directories using up disk space


If copies of the index are not eventually cleaned up, I'd fill a JIRA to
address the issue. Those directories should be removed over time. At times
there will have to be a couple around at the same time and others may take
a while to clean up.

- Mark

On Tue, Apr 28, 2015 at 3:27 AM Ramkumar R. Aiyengar
andyetitmo...@gmail.com wrote:

 SolrCloud does need up to twice the amount of disk space as your usual
 index size during replication. Amongst other things, this ensures you have
 a full copy of the index at any point. There's no way around this, I would
 suggest you provision the additional disk space needed.
 On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote:

  Hi All,
 
  We are seeing this problem with solr 4.6 and solr 4.10.3.
  For some reason, solr cloud tries to recover and creates a new index
  directory - (ex:index.20150420181214550), while keeping the older index
  as is. This creates an issues where the disk space fills up and the shard
  never ends up recovering.
  Usually this requires a manual intervention of bouncing the instance and
  wiping the disk clean to allow for a clean recovery.
 
  Any ideas on how to prevent solr from creating multiple copies of index
  directory.
 
  Thanks,
  Rishi.



Re: Slow highlighting on Solr 5.0.0

2015-05-05 Thread Ere Maijala
I'm seeing the same with Solr 5.1.0 after upgrading from 4.10.2. Here 
are my timings:


4.10.2:
process: 1432.0
highlight: 723.0

5.1.0:
process: 9570.0
highlight: 8790.0

schema.xml and solrconfig.xml are available at 
https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf.


A couple of jstack outputs taken when the query was executing are 
available at http://pastebin.com/eJrEy2Wb


Any suggestions would be appreciated. Or would it make sense to just 
file a JIRA issue?


--Ere

On 3.3.2015 at 0:48, Matt Hilt wrote:

Short form:
While testing Solr 5.0.0 within our staging environment, I noticed that
highlight enabled queries are much slower than I saw with 4.10. Are
there any obvious reasons why this might be the case? As far as I can
tell, nothing has changed with the default highlight search component or
its parameters.


A little more detail:
The bulk of the collection config set was stolen from the basic 4.X
example config set. I changed my schema.xml and solrconfig.xml just
enough to get 5.0 to create a new collection (removed non-trie fields,
some other deprecated response handler definitions, etc). I can provide
my version of the solr.HighlightComponent config, but it is identical to
the sample_techproducts_configs example in 5.0.  Are there any other
config files I could provide that might be useful?


Number on “much slower”:
I indexed a very small subset of my data into the new collection and
used the /select interface to do a simple debug query. Solr 4.10 gives
the following pertinent info:
response: { numFound: 72628,
...
debug: {
timing: { time: 95, process: { time: 94, query: { time: 6 },
highlight: { time: 84 }, debug: { time: 4 } }
---
Whereas solr 5.0 is:
response: { numFound: 1093,
...
debug: {
timing: { time: 6551, process: { time: 6549, query: { time:
0 }, highlight: { time: 6524 }, debug: { time: 25 }






--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Proximity searching in percentage

2015-05-05 Thread Zheng Lin Edwin Yeo
Hi,

I would like to check: how do we implement character proximity searching
in terms of a percentage of the word length, instead of a fixed
edit distance (in characters)?

For example, if we have a proximity of 20%, a word with 5 characters will
have an edit distance of 1, and a word with 10 characters will
automatically have an edit distance of 2.

Will Solr be able to do that for us?

Regards,
Edwin


Solr Exception The remote server returned an error: (400) Bad Request.

2015-05-05 Thread marotosg
Hi,

I am having some difficulty knowing which exception I am getting on my
client for some queries. Malformed queries always come back to my SolrNet
client as "The remote server returned an error: (400) Bad Request."
Internally, Solr is actually logging issues like "undefined field fieldName".

Do you have any idea how to get more detailed info into the HTTP
response?

Thanks,
Sergio





Limit Results By Score?

2015-05-05 Thread Johannes Ruscheinski
Hi,

We have implemented a custom scoring function and also need to limit the
results by score.  How could we go about that?  Alternatively, can we
suppress the results early using some kind of custom filter?

--Johannes

-- 
Dr. Johannes Ruscheinski
Universitätsbibliothek Tübingen - IT-Abteilung -
Wilhelmstr. 32, 72074 Tübingen

Tel: +49 7071 29-72820
FAX: +49 7071 29-5069
Email: johannes.ruschein...@uni-tuebingen.de




Re: Solr Exception The remote server returned an error: (400) Bad Request.

2015-05-05 Thread Tomasz Borek
Take a look at query parameters and use debug and/or explain.

https://wiki.apache.org/solr/CommonQueryParameters

Also, perhaps change the parser from the default one to the less stringent
dismax.

Hard to say what fits your case as I don't know it, but those two are best
starting points I know of.
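
For instance (the host and core are from your earlier example, the field name
is a placeholder), something along these lines:

  http://localhost:8080/solr48/person/select?q=CoreD:25&debugQuery=true
  http://localhost:8080/solr48/person/select?q=25&defType=dismax&qf=some_field&debugQuery=true

debugQuery adds the parsed query and score explanations to successful
responses, and dismax with an explicit qf avoids querying a field that
doesn't exist.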

pozdrawiam,
LAFK

2015-05-05 10:39 GMT+02:00 marotosg marot...@gmail.com:

 Hi,

 I am having some difficulties knowing which one is the exception I am
 having
 on my client for some queries. Queries malformed are always coming back to
 my solrNet client as The remote server returned an error: (400) Bad
 Request.. Internally Solr is actually printing the log issues like
 undefined field fieldName.

 Do your have any idea about getting more detailed info into the http
 response?

 Thanks,
 Sergio


