RE: inconsistent number of results returned in solr cloud

2012-11-29 Thread Buttler, David
Sorry, yes, I had been using the BETA version.  I have deleted all of that, 
replaced the jars with the released versions (reduced my core count), and now I 
have consistent results.
I guess I missed that JIRA ticket, sorry for the false alarm.
Dave


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, November 23, 2012 4:25 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

Dave:

I should have asked this first. What version of Solr are you using? I'm not sure 
whether it was fixed in BETA or not (it certainly is in the 4.0 GA release). There 
was a problem with adding a doc list via SolrJ; here's one related JIRA, 
although it wasn't the main fix:
https://issues.apache.org/jira/browse/SOLR-3001. I suspect that's the known 
problem Mark mentioned.

Because what you're seeing _sure_ sounds similar...

Best
Erick


On Mon, Nov 19, 2012 at 12:49 PM, Buttler, David buttl...@llnl.gov wrote:

 Answers inline below

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Saturday, November 17, 2012 6:40 AM
 To: solr-user@lucene.apache.org
 Subject: Re: inconsistent number of results returned in solr cloud

 Hmmm, first an aside. If by commit after every batch of documents you 
 mean after every call to server.add(doclist), there's no real need 
 to do that unless you're striving for really low latency. The usual 
 recommendation is to use commitWithin when adding and commit only at 
 the very end of the run. This shouldn't actually be germane to your 
 issue, just an FYI.

 DB Good point.  The code for committing docs to solr is fairly old.  I 
 will update it since I don't have a latency requirement.

 So you're saying that the inconsistency is permanent? By that I mean 
 it keeps coming back inconsistently for minutes/hours/days?

 DB Yes, it is permanent.  I have collections that have been up for weeks, 
 and are still returning inconsistent results, and I haven't been 
 adding any additional documents.
 DB Related to this, I seem to have a discrepancy between the number of 
 documents I think I am sending to solr, and the number of documents it 
 is reporting.  I have tried reducing the number of shards for one of 
 my small collections, so I deleted all references to this collection, 
 and reloaded it. I think I have 260 documents submitted (counted from a 
 hadoop job).  Solr returns a count of ~430 (it varies), and the first 
 returned document is not consistent.

 I guess if I were trying to test this I'd need to know how you added 
 subsequent collections. In particular what you did re: zookeeper as 
 you added each collection.

 DB These are my steps:
 DB 1. Create the collection via the HTTP API:
 http://host:port/solr/admin/collections?action=CREATE&name=collection&numShards=6&collection.configName=collection
 DB 2. Relaunch one of my JVM processes, bootstrapping the collection:
 DB java -Xmx16g -Dcollection.configName=collection -Djetty.port=port 
 -DzkHost=zkhost -Dsolr.solr.home=solr home -DnumShards=6 
 -Dbootstrap_confdir=conf -jar start.jar
 DB 3. Load data

 DB Let me know if something is unclear.  I can run through the process 
 again and document it more carefully.
 DB
 DB Thanks for looking at it,
 DB Dave

 Best
 Erick


 On Fri, Nov 16, 2012 at 2:55 PM, Buttler, David buttl...@llnl.gov wrote:

  My typical way of adding documents is through SolrJ, where I commit 
  after every batch of documents (where the batch size is 
  configurable)
 
  I have now tried committing several times, from the command line 
  (curl) with and without openSearcher=true.  It does not affect anything.
 
  Dave
 
  -Original Message-
  From: Mark Miller [mailto:markrmil...@gmail.com]
  Sent: Friday, November 16, 2012 11:04 AM
  To: solr-user@lucene.apache.org
  Subject: Re: inconsistent number of results returned in solr cloud
 
  How did you do the final commit? Can you try a lone commit (with
  openSearcher=true) and see if that affects things?
 
  Trying to determine if this is a known issue or not.
 
  - Mark
 
  On Nov 16, 2012, at 1:34 PM, Buttler, David buttl...@llnl.gov wrote:
 
   Hi all,
   I buried an issue in my last post, so let me pop it up.
  
   I have a cluster with 10 collections on it.  The first collection 
   I
  loaded works perfectly.  But every subsequent collection returns an 
  inconsistent number of results for each query.  The queries can be 
  simply *:*, or more complex facet queries.  If I go to individual 
  cores and
 issue
  the query, with distrib=false, I get a consistent number of results.  
  I
 am
  wondering if there is some delay in returning results from my 
  shards, and the queried node just times out and displays the number 
  of results that
 it
  has received so far.  If there is such a timeout, it must be very 
  small,
 as
  my QTime is around 11 ms.
  
   Dave
 
 



RE: inconsistent number of results returned in solr cloud

2012-11-19 Thread Buttler, David
Answers inline below

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Saturday, November 17, 2012 6:40 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

Hmmm, first an aside. If by commit after every batch of documents you
mean after every call to server.add(doclist), there's no real need to do
that unless you're striving for really low latency. The usual
recommendation is to use commitWithin when adding and commit only at the
very end of the run. This shouldn't actually be germane to your issue, just
an FYI.

DB Good point.  The code for committing docs to solr is fairly old.  I will 
update it since I don't have a latency requirement.
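The pattern Erick recommends can be sketched from the command line (an illustrative sketch only: the host, port, collection name, documents, and the 10-second window are assumptions, not values from this thread):

```shell
# Hedged sketch of "commitWithin while adding, hard commit at the end".
# Assumes a local Solr 4.x node with the default /update handler;
# host, port, collection and the JSON docs are placeholders.
curl 'http://localhost:8983/solr/collection1/update?commitWithin=10000' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc1"},{"id":"doc2"}]'

# One explicit hard commit at the very end of the indexing run:
curl 'http://localhost:8983/solr/collection1/update?commit=true'
```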

So you're saying that the inconsistency is permanent? By that I mean it
keeps coming back inconsistently for minutes/hours/days?

DB Yes, it is permanent.  I have collections that have been up for weeks, and 
are still returning inconsistent results, and I haven't been adding any 
additional documents.
DB Related to this, I seem to have a discrepancy between the number of 
documents I think I am sending to solr, and the number of documents it is 
reporting.  I have tried reducing the number of shards for one of my small 
collections, so I deleted all references to this collection, and reloaded it. 
I think I have 260 documents submitted (counted from a hadoop job).  Solr 
returns a count of ~430 (it varies), and the first returned document is not 
consistent.

I guess if I were trying to test this I'd need to know how you added
subsequent collections. In particular what you did re: zookeeper as you
added each collection.

DB These are my steps
DB 1. Create the collection via the HTTP API: 
http://host:port/solr/admin/collections?action=CREATE&name=collection&numShards=6&collection.configName=collection
DB 2. Relaunch one of my JVM processes, bootstrapping the collection: 
DB java -Xmx16g -Dcollection.configName=collection -Djetty.port=port 
-DzkHost=zkhost -Dsolr.solr.home=solr home -DnumShards=6 
-Dbootstrap_confdir=conf -jar start.jar
DB load data
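Spelled out with shell quoting, step 1 looks like the following (an illustrative sketch; host, port, and the collection/config names are placeholders):

```shell
# Collections API CREATE call. Quote the URL so the shell does not
# swallow the '&' parameter separators (all names are placeholders).
curl 'http://host:port/solr/admin/collections?action=CREATE&name=collection&numShards=6&collection.configName=collection'
```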

DB Let me know if something is unclear.  I can run through the process again 
and document it more carefully.
DB
DB Thanks for looking at it,
DB Dave

Best
Erick


On Fri, Nov 16, 2012 at 2:55 PM, Buttler, David buttl...@llnl.gov wrote:

 My typical way of adding documents is through SolrJ, where I commit after
 every batch of documents (where the batch size is configurable)

 I have now tried committing several times, from the command line (curl)
 with and without openSearcher=true.  It does not affect anything.

 Dave

 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Friday, November 16, 2012 11:04 AM
 To: solr-user@lucene.apache.org
 Subject: Re: inconsistent number of results returned in solr cloud

 How did you do the final commit? Can you try a lone commit (with
 openSearcher=true) and see if that affects things?

 Trying to determine if this is a known issue or not.

 - Mark

 On Nov 16, 2012, at 1:34 PM, Buttler, David buttl...@llnl.gov wrote:

  Hi all,
  I buried an issue in my last post, so let me pop it up.
 
  I have a cluster with 10 collections on it.  The first collection I
 loaded works perfectly.  But every subsequent collection returns an
 inconsistent number of results for each query.  The queries can be simply
 *:*, or more complex facet queries.  If I go to individual cores and issue
 the query, with distrib=false, I get a consistent number of results.  I am
 wondering if there is some delay in returning results from my shards, and
 the queried node just times out and displays the number of results that it
 has received so far.  If there is such a timeout, it must be very small, as
 my QTime is around 11 ms.
 
  Dave




RE: Architecture Question

2012-11-19 Thread Buttler, David
If you just want to store the data, you can dump it into HDFS sequence files.  
While HBase is really nice if you want to process and serve data real-time, it 
adds overhead to use it as pure storage.
Dave

-Original Message-
From: Cool Techi [mailto:cooltec...@outlook.com] 
Sent: Friday, November 16, 2012 8:26 PM
To: solr-user@lucene.apache.org
Subject: RE: Architecture Question

Hi Otis,

Thanks for your reply. I just wanted to check which NoSQL store would be best 
suited to hold the data while using the least amount of memory, since for most of 
my work Solr itself is sufficient; I want to store the raw data only in case we 
need to reindex, and as a backup.

Regards,
Ayush

 Date: Fri, 16 Nov 2012 15:47:40 -0500
 Subject: Re: Architecture Question
 From: otis.gospodne...@gmail.com
 To: solr-user@lucene.apache.org
 
 Hello,
 
 
 
  I am not sure if this is the right forum for this question, but it would
  be great if I could be pointed in the right direction. We have been using a
  combination of MySql and Solr for all our company full text and query
  needs.  But as our customers have grown, so has the amount of data, and MySql
  is just not proving to be the right option for storing/querying.
 
  I have been looking at Solr Cloud and it looks really impressive, but I am
  not sure if we should give up our storage system. I have been exploring
  DataStax, but a commercial option is out of the question. So we were
  thinking of using HBase to store the data and at the same time index the
  data into Solr Cloud, but for many reasons this design doesn't seem
  convincing (I have also looked briefly at Lily).
 
  1) Would it be recommended to just use Solr Cloud with replication, or does
  HBase + Solr seem like the better option?
 
 
 If you trust SolrCloud with replication and keep all your fields stored
 then you could live without an external DB.  At this point I personally
 would still want an external DB.  Whether HBase is the right DB for the job
 I can't tell because I don't know anything about your data, volume, access
 patterns, etc.  I can tell you that HBase does scale well - we have tables
 with many billions of rows stored in them, for instance.
 
 
  2) How much strain would be to keep both Solr Shard and Hbase node on the
  same machine
 
 
 HBase loves memory.  So does Solr.  They both dislike disk IO (who
 doesn't!).  Solr can use a lot of CPU for indexing/searching, depending on
 the volume.  HBase RegionServers can use a lot of CPU if you run MapReduce
 on data in HBase.
 
 
  3) Is there a calculation for what kind of machine configuration I would
  need to store 500-1000 million records? Most of these will be social data
  (Twitter/Facebook/blogs etc.), and how many shards?
 
 
 No recipe here, unfortunately.  You'd have to experiment and test, do load
 and performance testing, etc.  If you need help with Solr + HBase, we
 happen to have a lot of experience with both and have even used them
 together for some of our clients.
 
 Otis
 --
 Performance Monitoring - http://sematext.com/spm/index.html
 Search Analytics - http://sematext.com/search-analytics/index.html
  


inconsistent number of results returned in solr cloud

2012-11-16 Thread Buttler, David
Hi all,
I buried an issue in my last post, so let me pop it up.

I have a cluster with 10 collections on it.  The first collection I loaded 
works perfectly.  But every subsequent collection returns an inconsistent 
number of results for each query.  The queries can be simply *:*, or more 
complex facet queries.  If I go to individual cores and issue the query, with 
distrib=false, I get a consistent number of results.  I am wondering if there 
is some delay in returning results from my shards, and the queried node just 
times out and displays the number of results that it has received so far.  If 
there is such a timeout, it must be very small, as my QTime is around 11 ms.

Dave


RE: inconsistent number of results returned in solr cloud

2012-11-16 Thread Buttler, David
My typical way of adding documents is through SolrJ, where I commit after every 
batch of documents (where the batch size is configurable)

I have now tried committing several times, from the command line (curl) with 
and without openSearcher=true.  It does not affect anything.
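The lone commit Mark asked about can be issued like this (an illustrative sketch; host, port, and collection name are placeholders):

```shell
# Explicit hard commit that also opens a new searcher, so the committed
# documents become visible to queries (names are placeholders).
curl 'http://localhost:8983/solr/collection1/update?commit=true&openSearcher=true'
```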

Dave

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, November 16, 2012 11:04 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

How did you do the final commit? Can you try a lone commit (with 
openSearcher=true) and see if that affects things?

Trying to determine if this is a known issue or not.

- Mark

On Nov 16, 2012, at 1:34 PM, Buttler, David buttl...@llnl.gov wrote:

 Hi all,
 I buried an issue in my last post, so let me pop it up.
 
 I have a cluster with 10 collections on it.  The first collection I loaded 
 works perfectly.  But every subsequent collection returns an inconsistent 
 number of results for each query.  The queries can be simply *:*, or more 
 complex facet queries.  If I go to individual cores and issue the query, with 
 distrib=false, I get a consistent number of results.  I am wondering if there 
 is some delay in returning results from my shards, and the queried node just 
 times out and displays the number of results that it has received so far.  If 
 there is such a timeout, it must be very small, as my QTime is around 11 ms.
 
 Dave



cores shards and disks in SolrCloud

2012-11-15 Thread Buttler, David
Hi,
I have a question about the optimal way to distribute solr indexes across a 
cloud.  I have a small number of collections (less than 10).  And a small 
cluster (6 nodes), but each node has several disks - 5 of which I am using for 
my solr indexes.  The cluster is also a hadoop cluster, so the disks are not 
RAIDed, they are JBOD.  So, on my 5 slave nodes, each with 5 disks, I was 
thinking of putting one shard per collection.  This means I end up with 25 
shards per collection.  If I had 10 collections, that would make it 250 shards 
total.  Given that Solr 4 supports multi-core, my first thought was to try one 
JVM for each node: for 10 collections per node, that means that each JVM would 
contain 50 shards.

So, I set up my first collection, with a modest 20M documents, and everything 
seems to work fine.  But, now my subsequent collections that I have added are 
having issues.  The first one is that every time I query for the document count 
(*:* with rows=0), a different number of documents is returned. The number can 
differ by as much as 10%.  Now if I query each shard individually (setting 
distrib=false), the number returned is always consistent.
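That comparison can be scripted (an illustrative sketch; the host, port, and core names are placeholders for this cluster's actual layout):

```shell
# Compare the distributed count against each core's local count
# (all hosts and core names are placeholders).
curl 'http://host1:8983/solr/collection2/select?q=*:*&rows=0'
for core in collection2_shard1 collection2_shard2; do
  curl "http://host1:8983/solr/$core/select?q=*:*&rows=0&distrib=false"
done
```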

I am not entirely sure this is related as I may have missed a step in my setup 
of subsequent collections (bootstrapping the config)

But, more related to the architecture question: is it better to have one JVM 
per disk, one JVM per shard, or one JVM per node.  Given the MMap of the 
indexes, how does memory play into the question?   There is a blog post 
(http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) that 
recommends minimizing the amount of JVM memory and maximizing the amount of 
OS-level file cache, but how does that impact sorting / boosting?

Sorry if I have missed some documentation: I have been through the cloud 
tutorial a couple of times, and I didn't see any discussion of these issues

Thanks,
Dave


RE: cores shards and disks in SolrCloud

2012-11-15 Thread Buttler, David
The main reason to split a collection into 25 shards is to reduce the impact of 
the loss of a disk.  I was running an older version of solr, a disk went down, 
and my entire collection was offline.  Solr 4 offers shards.tolerant to reduce 
the impact of the loss of a disk: fewer documents will be returned.  Obviously, 
I could replicate the data so that I wouldn't lose any documents while I 
replace my disk, but since I am already storing the original data in HDFS, 
(with a 3x replication), adding additional replication for solr eats into my 
disk budget a bit too much.

Also, my other collections have larger amounts of data / number of documents. 
For every TB of raw data, how much disk space do I want to be using? As little 
as possible.  Drives are cheap, but not free.  And, nodes only hold so many 
drives.  

Dave

-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk] 
Sent: Thursday, November 15, 2012 4:37 PM
To: solr-user@lucene.apache.org
Subject: Re: cores shards and disks in SolrCloud

Personally I see no benefit to have more than one JVM per node, cores
can handle it. I would say that splitting a 20m index into 25 shards
strikes me as serious overkill, unless you expect to expand
significantly. 20m would likely be okay with two or three shards. You
can store the indexes for each core on different disks which can give
some performance benefit.

Just some thoughts.

Upayavira



On Thu, Nov 15, 2012, at 11:04 PM, Buttler, David wrote:
 Hi,
 I have a question about the optimal way to distribute solr indexes across
 a cloud.  I have a small number of collections (less than 10).  And a
 small cluster (6 nodes), but each node has several disks - 5 of which I
 am using for my solr indexes.  The cluster is also a hadoop cluster, so
 the disks are not RAIDed, they are JBOD.  So, on my 5 slave nodes, each
 with 5 disks, I was thinking of putting one shard per collection.  This
 means I end up with 25 shards per collection.  If I had 10 collections,
 that would make it 250 shards total.  Given that Solr 4 supports
 multi-core, my first thought was to try one JVM for each node: for 10
 collections per node, that means that each JVM would contain 50 shards.
 
 So, I set up my first collection, with a modest 20M documents, and
 everything seems to work fine.  But, now my subsequent collections that I
 have added are having issues.  The first one is that every time I query
 for the document count (*:* with rows=0), a different number of documents
 is returned. The number can differ by as much as 10%.  Now if I query
 each shard individually (setting distrib=false), the number returned is
 always consistent.
 
 I am not entirely sure this is related as I may have missed a step in my
 setup of subsequent collections (bootstrapping the config)
 
 But, more related to the architecture question: is it better to have one
 JVM per disk, one JVM per shard, or one JVM per node.  Given the MMap of
 the indexes, how does memory play into the question?   There is a blog
 post
 (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html)
 that recommends minimizing the amount of JVM memory and maximizing the
 amount of OS-level file cache, but how does that impact sorting /
 boosting?
 
 Sorry if I have missed some documentation: I have been through the cloud
 tutorial a couple of times, and I didn't see any discussion of these
 issues
 
 Thanks,
 Dave


RE: Cloud assigning incorrect port to shards

2012-08-29 Thread Buttler, David
I think the issue was that I didn't have a solr.xml in the solr home.  I was a 
little confused by the example directory because there are actually 5 solr.xml 
files
% find . -name solr.xml
./multicore/solr.xml
./example-DIH/solr/solr.xml
./exampledocs/solr.xml
./contexts/solr.xml
./solr/solr.xml

Creating my own jetty installation directory without the example instances led 
to me deleting the solr/solr.xml file.

I have now created a new solr home and set up a solr.xml file there, and things 
look much better.

Thanks for the feedback,
Dave



-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Thursday, August 23, 2012 6:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Cloud assigning incorrect port to shards

Can you post your solr.xml file?

On Thursday, August 23, 2012, Buttler, David wrote:

 I am using the jetty container from the example.  The only thing I 
 have done is change the schema to match up my documents rather than 
 the example

 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Wednesday, August 22, 2012 5:50 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Cloud assigning incorrect port to shards

 What container are you using?

 Sent from my iPhone

 On Aug 22, 2012, at 3:14 PM, Buttler, David buttl...@llnl.gov wrote:

  Hi,
  I have set up a Solr 4 beta cloud cluster.  I have uploaded a config
 directory, and linked it with a configuration name.
 
  I have started two Solr instances on two computers and added a couple of 
  shards
 using the Core Admin function on the admin page.
 
  When I go to the admin cloud view, the shards all have the computer 
  name
 and port attached to them.  BUT, the port is the default port (8983), 
 and not the port that I assigned on the command line.  I can still 
 connect to the correct port, and not the reported port.  I anticipate 
 that this will lead to errors when I get to doing distributed query, 
 as zookeeper seems to be collecting incorrect information.
 
  Any thoughts as to why the incorrect port is being stored in zookeeper?
 
  Thanks,
  Dave



--
- Mark

http://www.lucidimagination.com


RE: Cloud assigning incorrect port to shards

2012-08-23 Thread Buttler, David
I am using the jetty container from the example.  The only thing I have done is 
change the schema to match my documents rather than the example's.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Wednesday, August 22, 2012 5:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Cloud assigning incorrect port to shards

What container are you using?

Sent from my iPhone

On Aug 22, 2012, at 3:14 PM, Buttler, David buttl...@llnl.gov wrote:

 Hi,
 I have set up a Solr 4 beta cloud cluster.  I have uploaded a config 
 directory, and linked it with a configuration name.
 
 I have started two Solr instances on two computers and added a couple of shards using 
 the Core Admin function on the admin page.
 
 When I go to the admin cloud view, the shards all have the computer name and 
 port attached to them.  BUT, the port is the default port (8983), and not the 
 port that I assigned on the command line.  I can still connect to the correct 
 port, and not the reported port.  I anticipate that this will lead to errors 
 when I get to doing distributed query, as zookeeper seems to be collecting 
 incorrect information.
 
 Any thoughts as to why the incorrect port is being stored in zookeeper?
 
 Thanks,
 Dave


RE: Co-existing solr cloud installations

2012-08-22 Thread Buttler, David
This is really nice.  Thanks for pointing it out.
Dave

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Tuesday, August 21, 2012 8:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Co-existing solr cloud installations

You can use a connect string of host:port/path to 'chroot' a path. I
think currently you have to manually create the path first though. See
the ZkCli tool (doc'd on SolrCloud wiki) for a simple way to do that.
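Concretely, that looks something like this (an illustrative sketch; the ensemble hosts and the /solr4 path are placeholders):

```shell
# 1) Create the chroot znode first (the ZkCLI tool has a makepath
#    command; classpath details omitted here):
#      ... org.apache.solr.cloud.ZkCLI -zkhost zk1:2181 -cmd makepath /solr4
# 2) Then append the path to the connect string when starting Solr:
java -DzkHost=zk1:2181,zk2:2181,zk3:2181/solr4 -jar start.jar
```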

I keep meaning to look into auto making it if it doesn't exist, but
have not gotten to it.

- Mark

On Tue, Aug 21, 2012 at 4:46 PM, Buttler, David buttl...@llnl.gov wrote:
 Hi all,
 I would like to use a single zookeeper cluster to manage multiple Solr cloud 
 installations.  However, the current design of how Solr uses zookeeper seems 
 to preclude that.  Have I missed a configuration option to set a zookeeper 
 prefix for all of a Solr cloud configuration directories?

 If I look at the zookeeper data it looks like:

  * /clusterstate.json
  * /collections
  * /configs
  * /live_nodes
  * /overseer
  * /overseer_elect
  * /zookeeper
 Is there a reason not to put all of these nodes under some user-configurable 
 higher-level node, such as /solr4?
 It could have a reasonable default value to make it just as easy to find as /.

 My current issue is that I have an old Solr cloud instance from back in the 
 Solr 1.5 days, and I don't expect that the new version and the old version 
 will play nice.

 Thanks,
 Dave



Cloud assigning incorrect port to shards

2012-08-22 Thread Buttler, David
Hi,
I have set up a Solr 4 beta cloud cluster.  I have uploaded a config directory, 
and linked it with a configuration name.

I have started two Solr instances on two computers and added a couple of shards using the 
Core Admin function on the admin page.

When I go to the admin cloud view, the shards all have the computer name and 
port attached to them.  BUT, the port is the default port (8983), and not the 
port that I assigned on the command line.  I can still connect to the correct 
port, and not the reported port.  I anticipate that this will lead to errors 
when I get to doing distributed query, as zookeeper seems to be collecting 
incorrect information.

Any thoughts as to why the incorrect port is being stored in zookeeper?

Thanks,
Dave


Co-existing solr cloud installations

2012-08-21 Thread Buttler, David
Hi all,
I would like to use a single zookeeper cluster to manage multiple Solr cloud 
installations.  However, the current design of how Solr uses zookeeper seems to 
preclude that.  Have I missed a configuration option to set a zookeeper prefix 
for all of a Solr cloud configuration directories?

If I look at the zookeeper data it looks like:

 * /clusterstate.json
 * /collections
 * /configs
 * /live_nodes
 * /overseer
 * /overseer_elect
 * /zookeeper
Is there a reason not to put all of these nodes under some user-configurable 
higher-level node, such as /solr4?
It could have a reasonable default value to make it just as easy to find as /.

My current issue is that I have an old Solr cloud instance from back in the 
Solr 1.5 days, and I don't expect that the new version and the old version will 
play nice.

Thanks,
Dave



solr 4 degraded behavior failure

2012-08-16 Thread Buttler, David
Hi all,
I am testing out the cloud features in Solr 4, and I have an observation about 
the behavior under failure.

Following the cloud tutorial, I set up a collection with 2 shards.  I started 4 
servers (so each shard is replicated twice).  I added the test documents, and 
everything works fine.  If I kill one or two servers, everything continues to 
work.  However, when three servers are killed, zero results are returned.  This 
is an improvement over previous versions of the cloud branch where having 
missing shards would result in an error, but I would have expected fewer 
results rather than zero results.

It turns out that there is a parameter that can be added to a query to get 
degraded results, but it is not described on the Solr cloud page.  It is on the 
DistributedSearch page, but it is poorly defined, and difficult to locate 
starting from the cloud page.

The way to get degraded results is to append:
shards.tolerant=true
to your Solr query.
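For example (illustrative; host, port, and collection name are placeholders):

```shell
# Query that returns partial results when some shards are down,
# instead of returning zero results or an error.
curl 'http://localhost:8983/solr/collection1/select?q=*:*&shards.tolerant=true'
```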

Dave






RE: solr 4 degraded behavior failure

2012-08-16 Thread Buttler, David
Is there a way to make the shards.tolerant=true behavior the default behavior?
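One way (a sketch, not confirmed in this thread: the handler name and the solrconfig.xml placement are assumptions) is to set it as a request-handler default:

```xml
<!-- Illustrative solrconfig.xml fragment; the /select handler name
     is an assumption about this installation's configuration. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards.tolerant">true</str>
  </lst>
</requestHandler>
```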

-Original Message-
From: Buttler, David [mailto:buttl...@llnl.gov] 
Sent: Thursday, August 16, 2012 11:01 AM
To: solr-user@lucene.apache.org
Subject: solr 4 degraded behavior failure

Hi all,
I am testing out the cloud features in Solr 4, and I have an observation about 
the behavior under failure.

Following the cloud tutorial, I set up a collection with 2 shards.  I started 4 
servers (so each shard is replicated twice).  I added the test documents, and 
everything works fine.  If I kill one or two servers, everything continues to 
work.  However, when three servers are killed, zero results are returned.  This 
is an improvement over previous versions of the cloud branch where having 
missing shards would result in an error, but I would have expected fewer 
results rather than zero results.

It turns out that there is a parameter that can be added to a query to get 
degraded results, but it is not described on the Solr cloud page.  It is on the 
DistributedSearch page, but it is poorly defined, and difficult to locate 
starting from the cloud page.

The way to get degraded results is to append:
shards.tolerant=true
to your Solr query.

Dave






RE: solr.xml entries got deleted when powered off

2012-08-15 Thread Buttler, David
You are not putting these files in /tmp are you?  That is sometimes wiped by 
different OS's on shutdown


-Original Message-
From: vempap [mailto:phani.vemp...@emc.com] 
Sent: Wednesday, August 15, 2012 3:31 PM
To: solr-user@lucene.apache.org
Subject: Re: solr.xml entries got deleted when powered off

It's happening when I'm not doing a clean shutdown. Are there any other
scenarios where it might happen?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001503.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Distributed Searching + unique Ids

2012-08-14 Thread Buttler, David
I just downloaded the Solr 4 beta and was running through the tutorial.  It 
seemed to me that I was getting duplicate counts in my facet fields when I had 
two shards and four cores running. For example, 
http://localhost:8983/solr/collection1/browse
reports 21 entries in the facet cat:electronics, but if I click on that facet, 
there are only 14 results, and it still reports 21 entries for cat:electronics.

Is this a known bug?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, August 14, 2012 7:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Distributed Searching + unique Ids

Don't do this. Many bits of sharding assume that a uniqueKey
exists on one and only one shard. Document counts may be
off. Faceting may be off.  Etc.

Why do you want to duplicate records across shards? What
benefit is this providing?

This feels like an XY problem...

Best
Erick

On Fri, Aug 10, 2012 at 1:10 PM, Eric Khoury ekhour...@hotmail.com wrote:




 hey guys, the spec mentions the following:


  The unique
  key field must be unique across all shards. If docs with
  duplicate unique keys are encountered, Solr will make an attempt to 
 return
  valid results, but the behavior may be non-deterministic.


 I'm actually looking to duplicate certain objects across shards, and hoping 
 to have duplicates removed when querying over all shards. If these duplicates 
 have the same ids, will that work?  Will this cause chaos with paging?  I 
 imagine that it might affect faceting as well?
 Thanks, Eric.


Duplicated facet counts in solr 4 beta: user error

2012-08-14 Thread Buttler, David
Here are my steps:

1)  Download apache-solr-4.0.0-BETA

2)  Untar into a directory

3)  cp -r example example2

4)  cp -r example exampleB

5)  cp -r example example2B

6)  cd example;  java -Dbootstrap_confdir=./solr/collection1/conf 
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

7)  cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar 
start.jar

8)  cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar 
start.jar

9)  cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar 
start.jar

10)   cd example/exampledocs; java 
-Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22
14 results returned

This is correct.  Let's try a slightly more circuitous route by running through 
the solr tutorial first


1)  Download apache-solr-4.0.0-BETA

2)  Untar into a directory

3)  cd example; java  -jar start.jar

4)  cd example/exampledocs; java 
-Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

5)  kill jetty server

6)  cp -r example example2

7)  cp -r example exampleB

8)  cp -r example example2B

9)  cd example;  java -Dbootstrap_confdir=./solr/collection1/conf 
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

10)   cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

11)   cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar

12)   cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar 
start.jar

13)   cd example/exampledocs; java 
-Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

With the same query as above, 22 results are returned.

Looking at this, it is clear what happened: the index from the tutorial run was 
copied along with the example directories and was never cleaned up before 
running the cloud examples.
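
The fix for a fresh start is simply to delete the leftover index data before launching the cloud nodes. A minimal sketch, assuming the stock apache-solr-4.0.0-BETA example layout (paths are illustrative):

```shell
# Wipe the tutorial's leftover index before starting the cloud examples,
# so every node begins with an empty core.
for d in example example2 exampleB example2B; do
  rm -rf "$d/solr/collection1/data"
done
```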

Adding the debug=query parameter to the query URL produces the following:
<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">MatchAllDocsQuery(*:*)</str>
<str name="parsedquery_toString">*:*</str>
<str name="QParser">LuceneQParser</str>
<arr name="filter_queries">
<str>cat:electronics</str>
</arr>
<arr name="parsed_filter_queries">
<str>cat:electronics</str>
</arr>
</lst>

So, Erick's diagnosis is correct: pilot error. However, the straightforward 
path through the tutorial and on to Solr Cloud makes it easy to make this 
mistake. Maybe a small warning on the Solr Cloud page would help?

Now, running a delete operation fixes things:
cd example/exampledocs;
java -Dcommit=false -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
causes the number of results to be zero.  So, let's reload the data:
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
now the number of results for our query
http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:electronics
is back to the correct 14 results.

Dave

PS: apologies for hijacking the thread earlier.


RE: DIH full-import failure, no real error message

2010-11-16 Thread Buttler, David
I am using the solr cloud branch on 6 machines.  I first load PubMed into 
HBase, and then push the fields I care about to solr.  Indexing from HBase to 
solr takes about 18 minutes.  Loading to hbase takes a little longer (2 
hours?), but it only happens once so I haven't spent much time trying to 
optimize.

This gives me the flexibility of a solr search as well as full document 
retrieval (and additional processing) from hbase.
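
The HBase-to-Solr step Dave describes can be sketched as a small mapping function that keeps only the fields of interest, plus the row key for fetching the full record back later. The column qualifiers and Solr field names below are illustrative assumptions, not the actual schema:

```python
# Select a subset of stored columns and map them onto Solr field names.
FIELDS_OF_INTEREST = {
    b"meta:pmid": "pmid",
    b"meta:title": "title",
    b"meta:abstract": "abstract",
}

def row_to_solr_doc(row_key: bytes, columns: dict) -> dict:
    # Keep the HBase row key so the full document can be retrieved later.
    doc = {"id": row_key.decode("utf-8")}
    for column, field in FIELDS_OF_INTEREST.items():
        if column in columns:
            doc[field] = columns[column].decode("utf-8")
    return doc

row = {b"meta:pmid": b"12345", b"meta:title": b"A title", b"raw:xml": b"<xml/>"}
print(row_to_solr_doc(b"12345", row))
# {'id': '12345', 'pmid': '12345', 'title': 'A title'}
```

Columns outside the mapping (like the raw XML) stay in HBase only, which is what keeps the index small while the full document remains retrievable.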

Dave

-Original Message-
From: Erik Fäßler [mailto:erik.faess...@uni-jena.de] 
Sent: Tuesday, November 16, 2010 9:16 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH full-import failure, no real error message

  Thank you very much, I will have a read on your links.

The full-text red flag is exactly why I'm testing this with 
Solr. As was said before by Dennis, I could also use a database as long 
as I don't need sophisticated query capabilities. To be honest, I don't 
know the performance gap between a Lucene index and a database in such a 
case. I guess I will have to test it.
This is intended as a substitute for holding every single file on disk. 
But I need the whole file's information because it's not clear which 
information will be required in the future. And we don't want to 
re-index every time we add a new field (not yet, that is ;)).

Best regards,

 Erik

Am 16.11.2010 16:27, schrieb Erick Erickson:
 The key is that Solr handles merges by copying, and only after
 the copy is complete does it delete the old index. So you'll need
 at least 2x your final index size before you start, especially if you
 optimize...

 Here's a handy matrix of what you need in your index depending
 upon what you want to do:
 http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase

 Leaving out what you don't use will help by shrinking your index.

 The thing that jumps out is that you're storing your entire XML document
 as well as indexing it. Are you expecting to return the document
 to the user? Storing the entire document is a red flag; you
 probably don't want to do this. If you need to return the entire
 document some time, one strategy is to index whatever you need
 to search, and index what you need to fetch the document from
 an external store. You can index the values of selected tags as fields in
 your documents. That would also give you far more flexibility
 when searching.
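
 The index-a-pointer approach described here might look like this in
 schema.xml; the field names are illustrative, not from the original setup:

```xml
<!-- Searchable fields: indexed but not stored, to keep the index small -->
<field name="title"   type="text_general" indexed="true"  stored="false"/>
<field name="body"    type="text_general" indexed="true"  stored="false"/>
<!-- Pointer for fetching the full XML from the external store (e.g. HBase) -->
<field name="doc_key" type="string"       indexed="false" stored="true"/>
```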

 Best
 Erick




 On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler erik.faess...@uni-jena.de wrote:

   Hello Erick,

 I guess I'm the one asking for pardon - but surely not you! It seems your
 first guess may already be the correct one. Disk space IS kind of short
 and I believe it could have run out; since Solr performs a rollback
 after the failure, I didn't notice (besides the fact that this is one of our
 server machines, but apparently the wrong mount point...).

 I'm not yet absolutely sure of this, but it would explain a lot and it really
 looks like it. So thank you for this maybe-not-so-obvious hint :)

 But you also mentioned the merging strategy. I left everything at the
 defaults that come with the Solr download concerning these things.
 Could it be that such a large index needs different treatment? Could you
 point me to a wiki page or something where I can get a few tips?

 Thanks a lot, I will try building the index on a partition with enough
 space, perhaps that will already do it.

 Best regards,

 Erik

 Am 16.11.2010 14:19, schrieb Erick Erickson:

   Several questions. Pardon me if they're obvious, but I've spent far
 too much of my life overlooking the obvious...

 1) Is it possible you're running out of disk? 40-50G could suck up
 a lot of disk, especially when merging. You may need that much again
 free when a merge occurs.
 2) Speaking of merging, what are your merge settings? How are you
 triggering merges? See mergeFactor and associated settings in solrconfig.xml.
 3) You might get some insight by removing the Solr indexing part: can
 you spin through your parsing from beginning to end? That would
 eliminate your questions about whether your XML parsing is the
 problem.


 40-50G is a large index, but it's certainly within Solr's capability,
 so you're not hitting any built-in limits.

 My first guess would be that you're running out of disk, at least
 that's the first thing I'd check next...

 Best
 Erick
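
 The "2x your index size" rule of thumb above can be turned into a quick
 pre-flight check before a big merge or optimize. A sketch, not anything
 Solr ships:

```python
import os
import shutil

def index_size_bytes(index_dir: str) -> int:
    # Sum the on-disk size of every file under the index directory.
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def safe_to_merge(index_dir: str, headroom: float = 2.0) -> bool:
    # A merge (and especially an optimize) writes the new segments before
    # deleting the old ones, so budget roughly headroom x the current
    # index size in free disk.
    free = shutil.disk_usage(index_dir).free
    return free >= headroom * index_size_bytes(index_dir)
```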

 On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler erik.faess...@uni-jena.de
 wrote:
Hey all,
  I'm trying to create a Solr index for the 2010 Medline baseline 
 (www.pubmed.gov, over 18 million XML documents). My goal is to be able to 
 retrieve single XML documents by their ID. Each document comes with a unique 
 ID, the PubMedID. So my schema (important portions) looks like this:

 <field name="pmid" type="string" indexed="true" stored="true"
 required="true" />
 field name=date type=tdate