RE: inconsistent number of results returned in solr cloud
Sorry, yes, I had been using the BETA version. I have deleted all of that, replaced the jars with the released versions (reduced my core count), and now I have consistent results. I guess I missed that JIRA ticket, sorry for the false alarm.
Dave

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, November 23, 2012 4:25 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

Dave:

I should have asked this first. What version of Solr are you using? I'm not sure whether it was fixed in BETA or not (it certainly is in the 4.0 GA release). There was a problem with adding a doclist via SolrJ; here's one related JIRA, although it wasn't the main fix: https://issues.apache.org/jira/browse/SOLR-3001. I suspect that's the known problem Mark mentioned, because what you're seeing _sure_ sounds similar.

Best,
Erick

On Mon, Nov 19, 2012 at 12:49 PM, Buttler, David <buttl...@llnl.gov> wrote:

Answers inline below

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, November 17, 2012 6:40 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

Hmmm, first an aside. If by "commit after every batch of documents" you mean after every call to server.add(doclist), there's no real need to do that unless you're striving for really low latency. The usual recommendation is to use commitWithin when adding and commit only at the very end of the run. This shouldn't actually be germane to your issue, just an FYI.

DB> Good point. The code for committing docs to Solr is fairly old.
DB> I will update it since I don't have a latency requirement.

So you're saying that the inconsistency is permanent? By that I mean it keeps coming back inconsistently for minutes/hours/days?

DB> Yes, it is permanent. I have collections that have been up for
DB> weeks, and are still returning inconsistent results, and I haven't been adding any additional documents.
DB> Related to this, I seem to have a discrepancy between the number
DB> of documents I think I am sending to Solr and the number of documents it is reporting. I have tried reducing the number of shards for one of my small collections, so I deleted all references to this collection and reloaded it. I think I have 260 documents submitted (counted from a hadoop job). Solr returns a count of ~430 (it varies), and the first returned document is not consistent.

I guess if I were trying to test this I'd need to know how you added subsequent collections. In particular what you did re: zookeeper as you added each collection.

DB> These are my steps:
DB> 1. Create the collection via the HTTP API:
DB>    http://host:port/solr/admin/collections?action=CREATE&name=collection&numShards=6%20collection.configName=collection
DB> 2. Relaunch one of my JVM processes, bootstrapping the collection:
DB>    java -Xmx16g -Dcollection.configName=collection -Djetty.port=<port> -DzkHost=<zkhost> -Dsolr.solr.home=<solr home> -DnumShards=6 -Dbootstrap_confdir=conf -jar start.jar
DB> Load data.
DB> Let me know if something is unclear. I can run through the process again and document it more carefully.
DB>
DB> Thanks for looking at it,
DB> Dave

Best,
Erick

On Fri, Nov 16, 2012 at 2:55 PM, Buttler, David <buttl...@llnl.gov> wrote:

My typical way of adding documents is through SolrJ, where I commit after every batch of documents (where the batch size is configurable). I have now tried committing several times, from the command line (curl), with and without openSearcher=true. It does not affect anything.
Dave

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Friday, November 16, 2012 11:04 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

How did you do the final commit? Can you try a lone commit (with openSearcher=true) and see if that affects things? Trying to determine if this is a known issue or not.
- Mark

On Nov 16, 2012, at 1:34 PM, Buttler, David <buttl...@llnl.gov> wrote:

Hi all,
I buried an issue in my last post, so let me pop it up. I have a cluster with 10 collections on it. The first collection I loaded works perfectly. But every subsequent collection returns an inconsistent number of results for each query. The queries can be simply *:*, or more complex facet queries. If I go to individual cores and issue the query with distrib=false, I get a consistent number of results.

I am wondering if there is some delay in returning results from my shards, and the queried node just times out and displays the number of results that it has received so far. If there is such a timeout, it must be very small, as my QTime is around 11 ms.
Dave
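Erick's aside above (use commitWithin while adding, then one hard commit at the end) can be sketched as update requests. This is only an illustration: the host, port, collection name, batch file, and the 60-second commitWithin window are assumptions, not values from the thread.

```shell
# Sketch of the commit strategy Erick recommends: pass commitWithin
# (milliseconds) on each add, and issue a single explicit commit at the
# very end of the run. Host/collection/batch file are placeholders.
SOLR=http://localhost:8983/solr/collection1
ADD_CMD="curl -s '$SOLR/update?commitWithin=60000' -H 'Content-Type: text/xml' --data-binary @batch.xml"
FINAL_COMMIT="curl -s '$SOLR/update?commit=true&openSearcher=true'"
# Print the commands rather than hitting a live server.
echo "$ADD_CMD"
echo "$FINAL_COMMIT"
```

In SolrJ the equivalent is the `server.add(docs, commitWithinMs)` overload followed by one `server.commit()` when the run finishes.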
RE: inconsistent number of results returned in solr cloud
Answers inline below

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, November 17, 2012 6:40 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

Hmmm, first an aside. If by "commit after every batch of documents" you mean after every call to server.add(doclist), there's no real need to do that unless you're striving for really low latency. The usual recommendation is to use commitWithin when adding and commit only at the very end of the run. This shouldn't actually be germane to your issue, just an FYI.

DB> Good point. The code for committing docs to Solr is fairly old. I will update it since I don't have a latency requirement.

So you're saying that the inconsistency is permanent? By that I mean it keeps coming back inconsistently for minutes/hours/days?

DB> Yes, it is permanent. I have collections that have been up for weeks, and are still returning inconsistent results, and I haven't been adding any additional documents.

DB> Related to this, I seem to have a discrepancy between the number of documents I think I am sending to Solr and the number of documents it is reporting. I have tried reducing the number of shards for one of my small collections, so I deleted all references to this collection and reloaded it. I think I have 260 documents submitted (counted from a hadoop job). Solr returns a count of ~430 (it varies), and the first returned document is not consistent.

I guess if I were trying to test this I'd need to know how you added subsequent collections. In particular what you did re: zookeeper as you added each collection.

DB> These are my steps:
DB> 1. Create the collection via the HTTP API:
DB>    http://host:port/solr/admin/collections?action=CREATE&name=collection&numShards=6%20collection.configName=collection
DB> 2. Relaunch one of my JVM processes, bootstrapping the collection:
DB>    java -Xmx16g -Dcollection.configName=collection -Djetty.port=<port> -DzkHost=<zkhost> -Dsolr.solr.home=<solr home> -DnumShards=6 -Dbootstrap_confdir=conf -jar start.jar
DB> Load data.
DB> Let me know if something is unclear. I can run through the process again and document it more carefully.
DB>
DB> Thanks for looking at it,
DB> Dave

Best,
Erick

On Fri, Nov 16, 2012 at 2:55 PM, Buttler, David <buttl...@llnl.gov> wrote:

My typical way of adding documents is through SolrJ, where I commit after every batch of documents (where the batch size is configurable). I have now tried committing several times, from the command line (curl), with and without openSearcher=true. It does not affect anything.
Dave

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Friday, November 16, 2012 11:04 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

How did you do the final commit? Can you try a lone commit (with openSearcher=true) and see if that affects things? Trying to determine if this is a known issue or not.

- Mark

On Nov 16, 2012, at 1:34 PM, Buttler, David <buttl...@llnl.gov> wrote:

Hi all,
I buried an issue in my last post, so let me pop it up. I have a cluster with 10 collections on it. The first collection I loaded works perfectly. But every subsequent collection returns an inconsistent number of results for each query. The queries can be simply *:*, or more complex facet queries. If I go to individual cores and issue the query with distrib=false, I get a consistent number of results.

I am wondering if there is some delay in returning results from my shards, and the queried node just times out and displays the number of results that it has received so far. If there is such a timeout, it must be very small, as my QTime is around 11 ms.
Dave
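The Collections API call in step 1 above takes its parameters joined with literal `&` separators (the archive tends to strip them). A sketch of how the CREATE URL is assembled; the host, port, and names are placeholders, not values confirmed by the thread:

```shell
# Sketch of the Collections API CREATE call from step 1. The parameter
# names (action, name, numShards, collection.configName) are from the
# Collections API; host/port and the collection/config names are made up.
HOST=localhost
PORT=8983
URL="http://$HOST:$PORT/solr/admin/collections?action=CREATE&name=collection&numShards=6&collection.configName=collection"
# Print the command rather than running it against a live cluster.
echo "curl -s '$URL'"
```

Note that in a shell the URL must be quoted, or the `&` characters will background the command instead of separating parameters.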
RE: Architecture Question
If you just want to store the data, you can dump it into HDFS sequence files. While HBase is really nice if you want to process and serve data in real time, it adds overhead to use it as pure storage.
Dave

-----Original Message-----
From: Cool Techi [mailto:cooltec...@outlook.com]
Sent: Friday, November 16, 2012 8:26 PM
To: solr-user@lucene.apache.org
Subject: RE: Architecture Question

Hi Otis,
Thanks for your reply. I just wanted to check what NoSql structure would be best suited to store data and use the least amount of memory, since for most of my work Solr would be sufficient, and I want to store data just in case we want to reindex and as a backup.
Regards,
Ayush

Date: Fri, 16 Nov 2012 15:47:40 -0500
Subject: Re: Architecture Question
From: otis.gospodne...@gmail.com
To: solr-user@lucene.apache.org

Hello,

> I am not sure if this is the right forum for this question, but it would be great if I could be pointed in the right direction. We have been using a combination of MySql and Solr for all our company full-text and query needs. But as our customers have grown, so has the amount of data, and MySql is just not proving to be the right option for storing/querying. I have been looking at Solr Cloud and it looks really impressive, but I am not sure if we should give away our storage system. I have been exploring DataStax, but a commercial option is out of the question. So we were thinking of using HBase to store the data and at the same time index the data into Solr cloud, but for many reasons this design doesn't seem convincing (I have also seen the basics of Lilly).
>
> 1) Would it be recommended to just use Solr cloud with multiple replication, or does hbase-solr seem like a good option?

If you trust SolrCloud with replication and keep all your fields stored, then you could live without an external DB. At this point I personally would still want an external DB. Whether HBase is the right DB for the job I can't tell, because I don't know anything about your data, volume, access patterns, etc.
I can tell you that HBase does scale well - we have tables with many billions of rows stored in it, for instance.

> 2) How much strain would it be to keep both a Solr shard and an HBase node on the same machine?

HBase loves memory. So does Solr. They both dislike disk IO (who doesn't!). Solr can use a lot of CPU for indexing/searching, depending on the volume. HBase RegionServers can use a lot of CPU if you run MapReduce on data in HBase.

> 3) Is there a calculation for what kind of machine configuration I would need to store 500-1000 million records, and how many shards? Most of these will be social data (Twitter/Facebook/blogs etc).

No recipe here, unfortunately. You'd have to experiment and test, do load and performance testing, etc. If you need help with Solr + HBase, we happen to have a lot of experience with both and have even used them together for some of our clients.

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html
inconsistent number of results returned in solr cloud
Hi all,
I buried an issue in my last post, so let me pop it up. I have a cluster with 10 collections on it. The first collection I loaded works perfectly. But every subsequent collection returns an inconsistent number of results for each query. The queries can be simply *:*, or more complex facet queries. If I go to individual cores and issue the query with distrib=false, I get a consistent number of results.

I am wondering if there is some delay in returning results from my shards, and the queried node just times out and displays the number of results that it has received so far. If there is such a timeout, it must be very small, as my QTime is around 11 ms.
Dave
RE: inconsistent number of results returned in solr cloud
My typical way of adding documents is through SolrJ, where I commit after every batch of documents (where the batch size is configurable). I have now tried committing several times, from the command line (curl), with and without openSearcher=true. It does not affect anything.
Dave

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Friday, November 16, 2012 11:04 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

How did you do the final commit? Can you try a lone commit (with openSearcher=true) and see if that affects things? Trying to determine if this is a known issue or not.

- Mark

On Nov 16, 2012, at 1:34 PM, Buttler, David <buttl...@llnl.gov> wrote:

Hi all,
I buried an issue in my last post, so let me pop it up. I have a cluster with 10 collections on it. The first collection I loaded works perfectly. But every subsequent collection returns an inconsistent number of results for each query. The queries can be simply *:*, or more complex facet queries. If I go to individual cores and issue the query with distrib=false, I get a consistent number of results.

I am wondering if there is some delay in returning results from my shards, and the queried node just times out and displays the number of results that it has received so far. If there is such a timeout, it must be very small, as my QTime is around 11 ms.
Dave
cores shards and disks in SolrCloud
Hi,
I have a question about the optimal way to distribute Solr indexes across a cloud. I have a small number of collections (fewer than 10) and a small cluster (6 nodes), but each node has several disks, 5 of which I am using for my Solr indexes. The cluster is also a hadoop cluster, so the disks are not RAIDed; they are JBOD.

So, on my 5 slave nodes, each with 5 disks, I was thinking of putting one shard per collection on each disk. This means I end up with 25 shards per collection. If I had 10 collections, that would make 250 shards total. Given that Solr 4 supports multi-core, my first thought was to try one JVM per node: for 10 collections per node, that means each JVM would contain 50 shards.

So, I set up my first collection, with a modest 20M documents, and everything seems to work fine. But now the subsequent collections that I have added are having issues. The first is that every time I query for the document count (*:* with rows=0), a different number of documents is returned. The number can differ by as much as 10%. If I query each shard individually (setting distrib=false), the number returned is always consistent. I am not entirely sure this is related, as I may have missed a step in my setup of subsequent collections (bootstrapping the config).

But, more related to the architecture question: is it better to have one JVM per disk, one JVM per shard, or one JVM per node? Given the MMap of the indexes, how does memory play into the question? There is a blog post (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) that recommends minimizing the amount of JVM memory and maximizing the amount of OS-level file cache, but how does that impact sorting / boosting?

Sorry if I have missed some documentation: I have been through the cloud tutorial a couple of times, and I didn't see any discussion of these issues.
Thanks,
Dave
RE: cores shards and disks in SolrCloud
The main reason to split a collection into 25 shards is to reduce the impact of the loss of a disk. I was running an older version of Solr, a disk went down, and my entire collection was offline. Solr 4 offers shards.tolerant to reduce the impact of the loss of a disk: fewer documents will be returned.

Obviously, I could replicate the data so that I wouldn't lose any documents while I replace my disk, but since I am already storing the original data in HDFS (with 3x replication), adding additional replication for Solr eats into my disk budget a bit too much. Also, my other collections have larger amounts of data / numbers of documents. For every TB of raw data, how much disk space do I want to be using? As little as possible. Drives are cheap, but not free. And nodes only hold so many drives.
Dave

-----Original Message-----
From: Upayavira [mailto:u...@odoko.co.uk]
Sent: Thursday, November 15, 2012 4:37 PM
To: solr-user@lucene.apache.org
Subject: Re: cores shards and disks in SolrCloud

Personally I see no benefit to having more than one JVM per node; cores can handle it. I would say that splitting a 20m index into 25 shards strikes me as serious overkill, unless you expect to expand significantly. 20m would likely be okay with two or three shards. You can store the indexes for each core on different disks, which can give some performance benefit.

Just some thoughts.

Upayavira

On Thu, Nov 15, 2012, at 11:04 PM, Buttler, David wrote:

Hi,
I have a question about the optimal way to distribute Solr indexes across a cloud. I have a small number of collections (fewer than 10) and a small cluster (6 nodes), but each node has several disks, 5 of which I am using for my Solr indexes. The cluster is also a hadoop cluster, so the disks are not RAIDed; they are JBOD. So, on my 5 slave nodes, each with 5 disks, I was thinking of putting one shard per collection on each disk. This means I end up with 25 shards per collection. If I had 10 collections, that would make 250 shards total.
Given that Solr 4 supports multi-core, my first thought was to try one JVM per node: for 10 collections per node, that means each JVM would contain 50 shards. So, I set up my first collection, with a modest 20M documents, and everything seems to work fine. But now the subsequent collections that I have added are having issues. The first is that every time I query for the document count (*:* with rows=0), a different number of documents is returned. The number can differ by as much as 10%. If I query each shard individually (setting distrib=false), the number returned is always consistent. I am not entirely sure this is related, as I may have missed a step in my setup of subsequent collections (bootstrapping the config).

But, more related to the architecture question: is it better to have one JVM per disk, one JVM per shard, or one JVM per node? Given the MMap of the indexes, how does memory play into the question? There is a blog post (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) that recommends minimizing the amount of JVM memory and maximizing the amount of OS-level file cache, but how does that impact sorting / boosting?

Sorry if I have missed some documentation: I have been through the cloud tutorial a couple of times, and I didn't see any discussion of these issues.
Thanks,
Dave
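One way to realize Upayavira's one-JVM-per-node suggestion while still spreading shards across the JBOD disks is to give each core its own dataDir. A hypothetical fragment in the Solr 4 legacy solr.xml format; every name and path here is made up for illustration:

```xml
<!-- Hypothetical solr.xml sketch: one JVM per node, one core per
     (collection, disk), with each core's dataDir on a different JBOD disk.
     Names and paths are placeholders, not taken from the thread. -->
<solr persistent="true">
  <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:}">
    <core name="collection1_shard1" instanceDir="collection1_shard1"
          collection="collection1" shard="shard1"
          dataDir="/data/disk1/solr/collection1_shard1"/>
    <core name="collection1_shard2" instanceDir="collection1_shard2"
          collection="collection1" shard="shard2"
          dataDir="/data/disk2/solr/collection1_shard2"/>
  </cores>
</solr>
```

This keeps the per-node JVM heap small (one heap instead of five), which fits the MMapDirectory advice in the blog post cited above: leave most RAM to the OS page cache rather than the JVM.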
RE: Cloud assigning incorrect port to shards
I think the issue was that I didn't have a solr.xml in the solr home. I was a little confused by the example directory, because there are actually 5 solr.xml files:

% find . -name solr.xml
./multicore/solr.xml
./example-DIH/solr/solr.xml
./exampledocs/solr.xml
./contexts/solr.xml
./solr/solr.xml

Creating my own jetty installation directory without the example instances led to me deleting the solr/solr.xml file. I have now created a new solr home and set up a solr.xml file there, and things look much better.
Thanks for the feedback,
Dave

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Thursday, August 23, 2012 6:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Cloud assigning incorrect port to shards

Can you post your solr.xml file?

On Thursday, August 23, 2012, Buttler, David wrote:

I am using the jetty container from the example. The only thing I have done is change the schema to match up with my documents rather than the example.

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Wednesday, August 22, 2012 5:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Cloud assigning incorrect port to shards

What container are you using?

Sent from my iPhone

On Aug 22, 2012, at 3:14 PM, Buttler, David <buttl...@llnl.gov> wrote:

Hi,
I have set up a Solr 4 beta cloud cluster. I have uploaded a config directory and linked it with a configuration name. I have started two Solr instances on two computers and added a couple of shards using the Core Admin function on the admin page. When I go to the admin cloud view, the shards all have the computer name and port attached to them. BUT, the port is the default port (8983), and not the port that I assigned on the command line. I can still connect to the correct port, and not the reported port. I anticipate that this will lead to errors when I get to doing distributed queries, as zookeeper seems to be collecting incorrect information.
Any thoughts as to why the incorrect port is being stored in zookeeper?
Thanks,
Dave

--
- Mark
http://www.lucidimagination.com
RE: Cloud assigning incorrect port to shards
I am using the jetty container from the example. The only thing I have done is change the schema to match up with my documents rather than the example.

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Wednesday, August 22, 2012 5:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Cloud assigning incorrect port to shards

What container are you using?

Sent from my iPhone

On Aug 22, 2012, at 3:14 PM, Buttler, David <buttl...@llnl.gov> wrote:

Hi,
I have set up a Solr 4 beta cloud cluster. I have uploaded a config directory and linked it with a configuration name. I have started two Solr instances on two computers and added a couple of shards using the Core Admin function on the admin page. When I go to the admin cloud view, the shards all have the computer name and port attached to them. BUT, the port is the default port (8983), and not the port that I assigned on the command line. I can still connect to the correct port, and not the reported port. I anticipate that this will lead to errors when I get to doing distributed queries, as zookeeper seems to be collecting incorrect information.

Any thoughts as to why the incorrect port is being stored in zookeeper?
Thanks,
Dave
RE: Co-existing solr cloud installations
This is really nice. Thanks for pointing it out.
Dave

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Tuesday, August 21, 2012 8:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Co-existing solr cloud installations

You can use a connect string of host:port/path to 'chroot' a path. I think currently you have to manually create the path first, though. See the ZkCli tool (doc'd on the SolrCloud wiki) for a simple way to do that. I keep meaning to look into auto-creating it if it doesn't exist, but have not gotten to it.

- Mark

On Tue, Aug 21, 2012 at 4:46 PM, Buttler, David <buttl...@llnl.gov> wrote:

Hi all,
I would like to use a single zookeeper cluster to manage multiple Solr cloud installations. However, the current design of how Solr uses zookeeper seems to preclude that. Have I missed a configuration option to set a zookeeper prefix for all of a Solr cloud's configuration directories? If I look at the zookeeper data, it looks like:
* /clusterstate.json
* /collections
* /configs
* /live_nodes
* /overseer
* /overseer_elect
* /zookeeper

Is there a reason not to put all of these nodes under some user-configurable higher-level node, such as /solr4? It could have a reasonable default value to make it just as easy to find as /. My current issue is that I have an old Solr cloud instance from back in the Solr 1.5 days, and I don't expect that the new version and the old version will play nice.
Thanks,
Dave
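Mark's chroot suggestion can be sketched as follows. The chroot name (/solr4), ZooKeeper ensemble addresses, and the classpath used to run ZkCLI are all assumptions for illustration; check the SolrCloud wiki's ZkCli documentation for the exact invocation in your install:

```shell
# Sketch: create the chroot path first (ZkCLI's makepath command), then
# point every Solr node at host:port/path. Ensemble addresses, chroot
# name, and the classpath are placeholders.
ZKHOST=zk1:2181,zk2:2181,zk3:2181
CHROOT=/solr4
MAKEPATH="java -classpath 'example/solr-webapp/webapp/WEB-INF/lib/*' org.apache.solr.cloud.ZkCLI -zkhost $ZKHOST -cmd makepath $CHROOT"
START="java -DzkHost=$ZKHOST$CHROOT -jar start.jar"
# Print the commands rather than running them against a live ensemble.
echo "$MAKEPATH"
echo "$START"
```

With a distinct chroot per installation (say /solr4 and /solr15), each cluster gets its own /clusterstate.json, /collections, etc. under its own subtree, so two Solr versions can share one ZooKeeper ensemble.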
Cloud assigning incorrect port to shards
Hi,
I have set up a Solr 4 beta cloud cluster. I have uploaded a config directory and linked it with a configuration name. I have started two Solr instances on two computers and added a couple of shards using the Core Admin function on the admin page. When I go to the admin cloud view, the shards all have the computer name and port attached to them. BUT, the port is the default port (8983), and not the port that I assigned on the command line. I can still connect to the correct port, and not the reported port. I anticipate that this will lead to errors when I get to doing distributed queries, as zookeeper seems to be collecting incorrect information.

Any thoughts as to why the incorrect port is being stored in zookeeper?
Thanks,
Dave
Co-existing solr cloud installations
Hi all,
I would like to use a single zookeeper cluster to manage multiple Solr cloud installations. However, the current design of how Solr uses zookeeper seems to preclude that. Have I missed a configuration option to set a zookeeper prefix for all of a Solr cloud's configuration directories? If I look at the zookeeper data, it looks like:
* /clusterstate.json
* /collections
* /configs
* /live_nodes
* /overseer
* /overseer_elect
* /zookeeper

Is there a reason not to put all of these nodes under some user-configurable higher-level node, such as /solr4? It could have a reasonable default value to make it just as easy to find as /. My current issue is that I have an old Solr cloud instance from back in the Solr 1.5 days, and I don't expect that the new version and the old version will play nice.
Thanks,
Dave
solr 4 degraded behavior failure
Hi all,
I am testing out the cloud features in Solr 4, and I have an observation about the behavior under failure. Following the cloud tutorial, I set up a collection with 2 shards. I started 4 servers (so each shard is replicated twice). I added the test documents, and everything works fine. If I kill one or two servers, everything continues to work. However, when three servers are killed, zero results are returned. This is an improvement over previous versions of the cloud branch, where having missing shards would result in an error, but I would have expected fewer results rather than zero results.

It turns out that there is a parameter that can be added to a query to get degraded results, but it is not described on the Solr cloud page. It is on the DistributedSearch page, but it is poorly defined and difficult to locate starting from the cloud page. The way to get degraded results is to append:

shards.tolerant=true

to your Solr query.
Dave
RE: solr 4 degraded behavior failure
Is there a way to make the shards.tolerant=true behavior the default behavior?

-----Original Message-----
From: Buttler, David [mailto:buttl...@llnl.gov]
Sent: Thursday, August 16, 2012 11:01 AM
To: solr-user@lucene.apache.org
Subject: solr 4 degraded behavior failure

Hi all,
I am testing out the cloud features in Solr 4, and I have an observation about the behavior under failure. Following the cloud tutorial, I set up a collection with 2 shards. I started 4 servers (so each shard is replicated twice). I added the test documents, and everything works fine. If I kill one or two servers, everything continues to work. However, when three servers are killed, zero results are returned. This is an improvement over previous versions of the cloud branch, where having missing shards would result in an error, but I would have expected fewer results rather than zero results.

It turns out that there is a parameter that can be added to a query to get degraded results, but it is not described on the Solr cloud page. It is on the DistributedSearch page, but it is poorly defined and difficult to locate starting from the cloud page. The way to get degraded results is to append:

shards.tolerant=true

to your Solr query.
Dave
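One plausible way to get the default-on behavior asked about above is to put the parameter in a request handler's defaults in solrconfig.xml. This is a sketch under the assumption that shards.tolerant behaves like any other query parameter for defaulting purposes; the handler name is just the stock /select handler:

```xml
<!-- Hypothetical solrconfig.xml fragment: make shards.tolerant the
     default for this handler so distributed queries degrade gracefully
     instead of failing when a shard is down. Clients can still override
     it per-request with shards.tolerant=false. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards.tolerant">true</str>
  </lst>
</requestHandler>
```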
RE: solr.xml entries got deleted when powered off
You are not putting these files in /tmp, are you? That directory is sometimes wiped by different OSs on shutdown.

-----Original Message-----
From: vempap [mailto:phani.vemp...@emc.com]
Sent: Wednesday, August 15, 2012 3:31 PM
To: solr-user@lucene.apache.org
Subject: Re: solr.xml entries got deleted when powered off

It's happening when I'm not doing a clean shutdown. Are there any more scenarios where it might happen?

--
View this message in context: http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001503.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Distributed Searching + unique Ids
I just downloaded the Solr 4 beta and was running through the tutorial. It seemed to me that I was getting duplicate counts in my facet fields when I had two shards and four cores running. For example,

http://localhost:8983/solr/collection1/browse

reports 21 entries in the facet cat:electronics, but if I click on that facet, there are only 14 results, and it still reports 21 entries for cat:electronics. Is this a known bug?

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, August 14, 2012 7:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Distributed Searching + unique Ids

Don't do this. Many bits of sharding assume that a uniqueKey exists on one and only one shard. Document counts may be off. Faceting may be off. Etc.

Why do you want to duplicate records across shards? What benefit is this providing? This feels like an XY problem...

Best,
Erick

On Fri, Aug 10, 2012 at 1:10 PM, Eric Khoury <ekhour...@hotmail.com> wrote:

Hey guys, the spec mentions the following: "The unique key field must be unique across all shards. If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic."

I'm actually looking to duplicate certain objects across shards, and hoping to have duplicates removed when querying over all shards. If these duplicates have the same ids, will that work? Will this cause chaos with paging? I imagine that it might affect faceting as well?
Thanks,
Eric.
Duplicated facet counts in solr 4 beta: user error
Here are my steps:
1) Download apache-solr-4.0.0-BETA
2) Untar into a directory
3) cp -r example example2
4) cp -r example exampleB
5) cp -r example example2B
6) cd example; java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
7) cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
8) cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
9) cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
10) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22

14 results returned. This is correct.

Let's try a slightly more circuitous route by running through the solr tutorial first:
1) Download apache-solr-4.0.0-BETA
2) Untar into a directory
3) cd example; java -jar start.jar
4) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
5) kill jetty server
6) cp -r example example2
7) cp -r example exampleB
8) cp -r example example2B
9) cd example; java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
10) cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
11) cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
12) cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
13) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

With the same query as above, 22 results are returned. Looking at this, it is somewhat obvious what is happening: the index was copied over from the tutorial and was not cleaned up before running the cloud examples.
Adding the debug=query parameter to the query URL produces the following:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>cat:electronics</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>cat:electronics</str>
  </arr>
</lst>

So, Erick's diagnosis is correct: pilot error. However, the straightforward path through the tutorial and on to solr cloud makes it easy to make this mistake. Maybe a small warning on the solr cloud page would help?

Now, running a delete operation fixes things:

cd example/exampledocs; java -Dcommit=false -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"

causes the number of results to be zero. So, let's reload the data:

java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

Now the number of results for our query

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22

is back to the correct 14 results.
Dave

PS apologies for hijacking the thread earlier.
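The cleanup step above can also be done with curl against the update handler; the post.jar invocation in the message does the same thing. Host and collection are placeholders:

```shell
# Sketch: delete-by-query over the whole collection, with an immediate
# commit so the empty index is visible right away. Host/collection are
# placeholders; in SolrCloud this is forwarded to every shard.
SOLR=http://localhost:8983/solr/collection1
DELETE_CMD="curl -s '$SOLR/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'"
# Print the command rather than hitting a live server.
echo "$DELETE_CMD"
```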
RE: DIH full-import failure, no real error message
I am using the solr cloud branch on 6 machines. I first load PubMed into HBase, and then push the fields I care about to Solr. Indexing from HBase to Solr takes about 18 minutes. Loading into HBase takes a little longer (2 hours?), but it only happens once, so I haven't spent much time trying to optimize it. This gives me the flexibility of Solr search as well as full document retrieval (and additional processing) from HBase.

Dave

-Original Message-
From: Erik Fäßler [mailto:erik.faess...@uni-jena.de]
Sent: Tuesday, November 16, 2010 9:16 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH full-import failure, no real error message

Thank you very much, I will have a read of your links. The full-text red flag is exactly why I'm testing this with Solr. As Dennis said before, I could also use a database as long as I don't need sophisticated query capabilities. To be honest, I don't know the performance gap between a Lucene index and a database in such a case; I guess I will have to test it. This is intended as a substitute for holding every single file on disk. I need the whole file's information because it's not clear which information will be required in the future, and we don't want to re-index every time we add a new field (not yet, that is ;)).

Best regards,
Erik

On 16.11.2010 16:27, Erick Erickson wrote:

The key is that Solr handles merges by copying, and only after the copy is complete does it delete the old index. So you'll need at least 2x your final index size before you start, especially if you optimize... Here's a handy matrix of what you need in your index depending upon what you want to do: http://wiki.apache.org/solr/FieldOptionsByUseCase Leaving out what you don't use will help by shrinking your index.
The thing that jumps out is that you're storing your entire XML document as well as indexing it. Are you expecting to return the document to the user? Storing the entire document is a red flag; you probably don't want to do this. If you need to return the entire document at some point, one strategy is to index whatever you need to search, plus index what you need to fetch the document from an external store. You can index the values of selected tags as fields in your documents. That would also give you far more flexibility when searching.

Best
Erick

On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler <erik.faess...@uni-jena.de> wrote:

Hello Erick,

I guess I'm the one asking for pardon - but surely not you! It seems your first guess could already be the correct one. Disk space IS kind of short and I believe it could have run out; since Solr performs a rollback after the failure, I didn't notice (besides the fact that this is one of our server machines, but apparently the wrong mount point...). I'm not yet absolutely sure of this, but it would explain a lot and it really looks like it. So thank you for this maybe-not-so-obvious hint :)

But you also mentioned the merging strategy. I left everything at the defaults that come with the Solr download. Could it be that such a large index needs different treatment? Could you point me to a wiki page or something where I can get a few tips?

Thanks a lot; I will try building the index on a partition with enough space, perhaps that will already do it.

Best regards,
Erik

On 16.11.2010 14:19, Erick Erickson wrote:

Several questions. Pardon me if they're obvious, but I've spent far too much of my life overlooking the obvious...

1) Is it possible you're running out of disk? 40-50G could suck up a lot of disk, especially when merging. You may need that much again free when a merge occurs.
2) Speaking of merging, what are your merge settings? How are you triggering merges? See mergeFactor and associated settings in solrconfig.xml.

3) You might get some insight by removing the Solr indexing part: can you spin through your parsing from beginning to end? That would eliminate your questions about whether your XML parsing is the problem.

40-50G is a large index, but it's certainly within Solr's capability, so you're not hitting any built-in limits. My first guess would be that you're running out of disk; at least, that's the first thing I'd check next...

Best
Erick

On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler <erik.faess...@uni-jena.de> wrote:

Hey all,

I'm trying to create a Solr index for the 2010 Medline baseline (www.pubmed.gov, over 18 million XML documents). My goal is to be able to retrieve single XML documents by their ID. Each document comes with a unique ID, the PubMedID. So my schema (important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true" required="true" />
<field name="date" type="tdate"
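Picking up Erick's advice above about not storing the whole document: a hypothetical schema.xml sketch of the index-for-search, store-only-the-key layout might look like the following. The field names beyond pmid are illustrative assumptions, not taken from the thread:

```xml
<!-- Hypothetical sketch: searchable text is indexed but NOT stored;
     only pmid is stored, and the full XML lives in an external store
     (e.g. HBase, as Dave describes) keyed by that ID. -->
<field name="pmid"          type="string" indexed="true" stored="true" required="true"/>
<field name="article_title" type="text"   indexed="true" stored="false"/>
<field name="abstract_text" type="text"   indexed="true" stored="false"/>
```

With stored="false" on the large text fields, the index only carries the inverted terms needed for search, which shrinks the on-disk footprint and the transient space needed during merges.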