Re: Solr cloud and auto shard timeline
Hi,

I think there is a mix-up here. SolrCloud has the same sharding capabilities as ES at this point, I believe, other than the manual moving of shards that Mark mentions.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Thu, Mar 21, 2013 at 7:08 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> I've seen that ElasticSearch has had auto-sharding capabilities for some
> time. Is there a timeline for when a similar capability is being targeted
> for SolrCloud?
Re: Writing new indexes from index readers slow!
Jed,

While this is something completely different, have you considered using SolrEntityProcessor instead? (assuming all your fields are stored)
http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Thu, Mar 21, 2013 at 2:25 PM, Jed Glazner <jglaz...@adobe.com> wrote:

> Hey Hey Everybody!
>
> I'm not sure if I should have posted this to the developers list... if I'm
> totally barking up the wrong tree here, please let me know!
>
> Anywho, I've developed a command-line utility based on the
> MultiPassIndexSplitter class from the Lucene library, but I'm finding that
> on our large index (350GB) it's taking WAY too long to write the newly
> split indexes! It took 20.5 hours for execution to finish. I should note
> that Solr is not running while I'm splitting the index. Because Solr can't
> really be running while I run this tool, performance is critical, as our
> service will be down!
>
> I am aware that there is an API currently under development on trunk in
> SolrCloud (https://issues.apache.org/jira/browse/SOLR-3755), but I need
> something now, as our large index is wreaking havoc on our service.
>
> Here is some basic context info:
>
> The Index:
> ==========
> Solr/Lucene 4.1
> Index Size: 350GB
> Documents: 185,194,528
>
> The Hardware (http://aws.amazon.com/ec2/instance-types/):
> =========================================================
> AWS High-Memory X-Large (m2.xlarge) instance
> CPU: 8 cores (2 virtual cores with 3.25 EC2 Compute Units each)
> 17.1 GB RAM
> 1.2TB EBS RAID
>
> The Process (splitting 1 index into 8):
> =======================================
> I'm trying to split this index into 8 separate indexes using this tool. To
> do this I create 8 worker threads. Each thread gets a new
> FakeDeleteIndexReader object, loops over every document, and uses a hash
> algorithm to decide if it should keep or delete the document. Note that
> the documents are not actually deleted at this point because (as I
> understand it) the FakeDeleteIndexReader emulates deletes without actually
> modifying the underlying index.
>
> After each worker has determined which documents it should keep, I create
> a new Directory object, instantiate a new IndexWriter, and pass the
> FakeDeleteIndexReader object to the addIndexes method. (This is the part
> that takes forever!)
>
> It only takes about an hour for all of the threads to hash/delete the
> documents they don't want. However, it takes 19+ hours to write all of the
> new indexes! Watching iowait, the disk doesn't look to be overworked
> (about 85% idle), so I'm baffled as to why it would take that long! I've
> tried running the write operations inside the worker threads, and
> serially, with no real difference!
>
> Here is the relevant code that I'm using to write the indexes:
>
>     /**
>      * Creates/merges a new index with a FakeDeleteIndexReader. The reader
>      * should have marked/deleted all of the documents that should not be
>      * included in this new index. When the index is written/committed,
>      * these documents will be removed.
>      *
>      * @param directory The directory object of the new index
>      * @param version   The Lucene version of the index
>      * @param reader    A FakeDeleteIndexReader that contains lots of
>      *                  uncommitted deletes.
>      * @throws IOException
>      */
>     private void writeToDisk(Directory directory, Version version,
>                              FakeDeleteIndexReader reader) throws IOException {
>         IndexWriterConfig cfg =
>             new IndexWriterConfig(version, new WhitespaceAnalyzer(version));
>         cfg.setOpenMode(OpenMode.CREATE);
>         IndexWriter w = new IndexWriter(directory, cfg);
>         w.addIndexes(reader);
>         w.commit();
>         w.close();
>         reader.close();
>     }
>
> Any ideas?? I'm happy to share more snippets of source code if that is
> helpful.
>
> --
> Jed Glazner
> Sr. Software Engineer
> Adobe Social
>
> 385.221.1072 (tel)
> 801.360.0181 (cell)
> jglaz...@adobe.com
>
> 550 East Timpanogus Circle
> Orem, UT 84097-6215, USA
> www.adobe.com
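As an aside for readers of this thread: the hash-and-mark pass each worker performs might look roughly like the sketch below. This is not Jed's actual code; the "id" field, the modulo hash, and the FakeDeleteIndexReader API (maxDoc/document/deleteDocument, as in Lucene's MultiPassIndexSplitter) are assumptions.

    import java.io.IOException;
    import org.apache.lucene.document.Document;

    // Sketch: mark every document that does NOT hash to this worker's
    // partition as deleted. The deletes are "fake" -- they live only in
    // the wrapper reader until addIndexes() writes the new index.
    void markUnwantedDocs(FakeDeleteIndexReader reader, int myPartition,
                          int numPartitions) throws IOException {
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            Document doc = reader.document(docId);        // stored fields only
            String key = doc.get("id");                   // assumes a stored unique key
            int target = Math.abs(key.hashCode() % numPartitions);
            if (target != myPartition) {
                reader.deleteDocument(docId);             // mark, don't rewrite
            }
        }
    }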
Re: Writing new indexes from index readers slow!
Thanks Otis. I had not considered that approach; however, not all of our fields are stored, so that's not going to work for me.

I'm wondering if it's slow because there is just the one reader getting passed to the index writer... I noticed today that the addIndexes method can take an array of readers. Maybe I can send in an array of readers for the individual segments in the index... I'll try that tomorrow.

Jed

Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Jed,
>
> While this is something completely different, have you considered using
> SolrEntityProcessor instead? (assuming all your fields are stored)
> http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
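Following up on the idea above: in Lucene 4.x, handing addIndexes() one reader per segment might look like this sketch. It is untested; it assumes the composite FakeDeleteIndexReader exposes its segments via leaves() and that the per-segment wrappers carry the pending deletes through.

    import java.util.List;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.IndexReader;

    // Hand addIndexes() one reader per segment instead of a single
    // composite reader, so segments can be merged/copied independently.
    List<AtomicReaderContext> leaves = reader.leaves();
    IndexReader[] segmentReaders = new IndexReader[leaves.size()];
    for (int i = 0; i < leaves.size(); i++) {
        segmentReaders[i] = leaves.get(i).reader();
    }
    w.addIndexes(segmentReaders);  // addIndexes(IndexReader...) is varargs in 4.x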
Re: Solr 4.2 - Slave Index version is higher than Master
The other odd thing here is that this should not stop replication at all. When the slave is ahead, it will still have its index replaced.

- Mark

On Mar 22, 2013, at 1:26 AM, Mark Miller <markrmil...@gmail.com> wrote:

> I'm working on testing to try and catch what you are seeing here:
> https://issues.apache.org/jira/browse/SOLR-4629
>
> - Mark
>
> On Mar 22, 2013, at 12:23 AM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Let me know if there is anything else you can add. A test with your setup
>> that indexes n docs randomly, commits, randomly updates a conf file or
>> not, and then replicates and repeats x times does not seem to fail, even
>> with very high values for n and x. On every replication, the versions are
>> compared. Is there anything else you are putting into this mix?
>>
>> - Mark
>>
>> On Mar 21, 2013, at 11:28 PM, Uomesh <uom...@gmail.com> wrote:
>>
>>> Thank you!! Attached is my master solrconfig.xml. I have a few custom
>>> handlers which you might need to remove. There is not much code in the
>>> custom handlers, just adding some custom data for the UI.
>>>
>>> Thanks,
>>> Umesh
>>>
>>> On Thu, Mar 21, 2013 at 9:59 PM, Mark Miller-3 [via Lucene]
>>> <ml-node+s472066n4049933...@n3.nabble.com> wrote:
>>>
>>>> Could you attach the master as well?
>>>>
>>>> - Mark
>>>>
>>>> On Mar 21, 2013, at 4:36 PM, Uomesh [hidden email] wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Attached is my solrconfig_slave.xml. My replication interval is
>>>>> 1 minute (the default). Please let me know if you need any more
>>>>> config details.
>>>>>
>>>>> Thanks,
>>>>> Umesh
>>>>>
>>>>> On Thu, Mar 21, 2013 at 3:19 PM, Mark Miller-3 [via Lucene] wrote:
>>>>>
>>>>>> Can you give more details about your configuration and setup? Our
>>>>>> best bet is to try and recreate this with a unit test.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Mar 21, 2013, at 4:08 PM, Uomesh [hidden email] wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am seeing an issue after upgrading from Solr 3.6.2 to Solr 4.2.
>>>>>>> My slave stops replicating after some time, and it seems the issue
>>>>>>> is that my slave index version is higher than the master's. How
>>>>>>> could the slave index version be higher than the master's? Please
>>>>>>> help me. Is there anything I need to remove from my slave
>>>>>>> solrconfig.xml?
>>>>>>>
>>>>>>>         Index Version   Gen  Size
>>>>>>> Master: 1363893820575   93   8.75 MB
>>>>>>> Slave:  1363896006624   94   8.75 MB
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Umesh
Solr 4.1 replicates whole index files from leader
Hi,

I use SolrCloud 4.1. I start up two Solr nodes, A and B, and then created a new collection on A using CoreAdmin with one shard, so node A is the leader. Then I indexed some docs to it. Then I created the same collection on B using CoreAdmin so it would become a replica. I found that Solr syncs all index files from A to B. Under B's data dir I have: an index.20130318083415358 folder which has all the synced index files, index.properties, replication.properties, and a tlog folder (empty inside).

Then I removed the collection from node B using CoreAdmin UNLOAD. I kept all files in B's data dir; I didn't delete them. Then I created the same collection on B again. I found that Solr syncs the files from A to B AGAIN!!! And another folder, index.20130318084514166, was created under B's data folder. I didn't index any docs to A after I UNLOADed the collection on B.

So I wonder: how do I let Solr know that B already has the correct index files so it doesn't sync again?
Re: DocValues and field requirements
Hi Shawn,

Thank you for your response. Yes, that's strange. By enabling DocValues, the information about missing fields is lost, which changes the way sorting works as well. Adding a default value to the fields can change the logic of an application dramatically (I can't set a default value of 0 for all Trie*Field fields, because it could impact the results displayed to the end user, which is not good). It's a pity that using DocValues is so limited.

Regards.

On 21 March 2013 22:29, Shawn Heisey <s...@elyograg.org> wrote:

> On 3/21/2013 3:07 PM, Shawn Heisey wrote:
>
>> This might be a requirement of the lower-level Lucene API, or it might be
>> a requirement that was instituted at the Solr level because a problem was
>> found when docs did not contain the field. Google seems reluctant to tell
>> me, and I haven't figured out the right way to ask.
>
> Some poking around the Lucene API has turned up an interesting notation on
> all the different types of DocValues:
>
> http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/index/DocValues.Type.html#FIXED_INTS_16
>
> They all say that if a value isn't present, zero (or an empty string) is
> assumed, and that there is no way to distinguish this from the same value
> being intentionally indexed. So it appears that Solr *could* use docValues
> without the required/default value restriction, but there is a strong
> possibility that the behavior would not be what the user expects.
>
> When docValues is not turned on, there is a clear difference between a
> default value and a missing field. The sort mechanism without docValues
> can sort documents with the field missing either before or after the other
> values. That would be impossible with docValues. By including the
> restriction, the dev team has made it less likely that the Solr admin will
> be surprised by the new behavior, because they have to change the field
> definition to make docValues work.
>
> Thanks,
> Shawn
RE: Solr 4.2 - Slave Index version is higher than Master
To add to the discussion: we're running classic master/slave replication (not SolrCloud) with 1 master and 2 slaves, and I noticed the slave having a higher version number than the master the other day as well. In our case, knock on wood, it hasn't stopped replication.

If you'd like a copy of our config I can provide it off-list.

Regards,

Phil.

From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Fri 22/03/2013 06:32
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.2 - Slave Index version is higher than Master

The other odd thing here is that this should not stop replication at all. When the slave is ahead, it will still have its index replaced.

- Mark
Re: Solr cloud and auto shard timeline
I am sorry for the confusion. I had assumed that there was a way to issue commands to ES to have it change its current shard layout (i.e., go from 2 to 4, for instance), but on further reading of their documentation I do not see that. That being said, is there a timeline for being able to add shards to SolrCloud by splitting an existing shard (or set of shards)? And does anyone have a good writeup of the different capabilities between the two at this point?

On Fri, Mar 22, 2013 at 2:01 AM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I think there is a mix-up here. SolrCloud has the same sharding
> capabilities as ES at this point, I believe, other than the manual moving
> of shards that Mark mentions.
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
Re: Don't cache filter queries
On Thu, Mar 21, 2013 at 6:22 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> : Just add {!cache=false} to the filter in your query
> : (http://wiki.apache.org/solr/SolrCaching#filterCache).
>       ...
> : I need to use the filter query feature to filter my results, but I
> : don't want the results cached, as documents are added to the index
> : several times per second and the results will be stale immediately. Is
> : there any way to disable filter query caching?
>
> Or remove the filterCache config option from your solrconfig.xml if you
> really don't want any caching of any filter queries.
>
> Frankly though: that's throwing the baby out with the bath water -- just
> because you are updating your index super-fast-like doesn't mean you
> aren't getting benefits from the caches, particularly from commonly reused
> filters which are applied to many queries which might get executed
> concurrently -- not to mention that a single filter might be reused
> multiple times within a single request to Solr.
>
> Disabling cache *warming* can make a lot of sense in NRT cases, but
> eliminating caching altogether rarely does.

Thanks. The problem is that the queries with filter queries are taking much longer to run (~60-80 ms) than the queries without (~1-4 ms). I figured that the problem may have been with the caching. In fact, running a query with a filter query and caching disabled runs in the range of 16-30 ms, which is quite an improvement. Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
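For readers skimming this thread, the per-filter switch discussed above is a local param on the fq itself; the field and value below are made up for illustration:

    fq={!cache=false}in_stock:true

or, as part of a full request:

    http://localhost:8983/solr/select?q=laptop&fq={!cache=false}in_stock:true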
Re: Solr 4.2 - Slave Index version is higher than Master
That issue was already present in Solr 4.1:
http://lucene.472066.n3.nabble.com/replication-problems-with-solr4-1-td4039647.html

Nice to know that it is still there in 4.2. With some luck it will make it into 4.2.1 ;-)

Regards,
Bernd

On 21.03.2013 21:08, Uomesh wrote:

> Hi,
>
> I am seeing an issue after upgrading from Solr 3.6.2 to Solr 4.2. My slave
> stops replicating after some time, and it seems the issue is that my slave
> index version is higher than the master's. How could the slave index
> version be higher than the master's? Please help me. Is there anything I
> need to remove from my slave solrconfig.xml?
>
>         Index Version   Gen  Size
> Master: 1363893820575   93   8.75 MB
> Slave:  1363896006624   94   8.75 MB
>
> Thanks,
> Umesh
Using Solr For a Real Search Engine
If I want to use Solr in a web search engine, what kind of strategies should I follow for running Solr? I mean, should I run it via embedded Jetty, or use the WAR and deploy it to a container? You should consider that I will have a heavy workload on my Solr.
RE: Logging inside a custom analyzer
Thanks a lot, it was exactly what I needed. Sorry for not being so clear with my question :).

Gian Maria.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, March 19, 2013 3:04 PM
To: solr-user@lucene.apache.org; alkamp...@nablasoft.com
Subject: Re: Logging inside a custom analyzer

What do you mean "log information into Solr from a custom analyzer"? Have info go from your custom analyzer into the Solr log? In which case, just do something like:

    private static final Logger log =
        LoggerFactory.getLogger(YourPrivateClass.class.getName());

and then in your code something like

    log.info("your message here");

Best
Erick

On Tue, Mar 19, 2013 at 1:32 AM, Gian Maria Ricci <alkamp...@nablasoft.com> wrote:

> Hi everyone,
>
> What is the best way to log information into Solr from a custom analyzer?
> Is there any way to integrate log4j, or is it better to use some Solr
> logging method?
>
> Thanks again for your invaluable help
>
> Gian Maria
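Put together, a custom token filter that logs from inside its analysis chain might look like this minimal sketch; the class name and log message are illustrative, and it assumes SLF4J is on the classpath (as it is for Solr 4.x):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public final class LoggingTokenFilter extends TokenFilter {
        private static final Logger log =
            LoggerFactory.getLogger(LoggingTokenFilter.class);
        private final CharTermAttribute termAtt =
            addAttribute(CharTermAttribute.class);

        public LoggingTokenFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // This ends up in the normal Solr log via the configured backend.
            log.info("saw token: {}", termAtt.toString());
            return true;
        }
    }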
Re: SOLR - Documents with large number of fields ~ 450
Hi,

I have a collection with more than 4K fields, mostly Trie*Field types. It is used for faceting, sorting, searching, and the StatsComponent. It works pretty well on Amazon EC2: 4 x m1.large boxes (7.5 GB RAM each), SolrCloud, a multi-AZ setup, and ephemeral storage. The index is managed by mmap, with 4 GB for the Java heap and CMS for GC. Currently there are 800K records, but there will be about 2M. Query response is much longer (a couple to a dozen seconds) during bulk loading, but this is rather typical, I think. Indexing takes much, much longer than in the case of records with fewer fields. I'm sending updates in 5 MB batches. No OOM issues.

Regarding DocValues: I believe they are a great improvement for faceting, but they are annoying because of their limitations. As far as I checked, a field has to be required or have a default value, which is not possible in my case (I can't set some figures to 0 by default, as it may impact other results displayed to the end user, which is not good). I wish that could change.

Regards.

On 21 March 2013 07:56, kobe.free.wo...@gmail.com <kobe.free.wo...@gmail.com> wrote:

> Hello All,
>
> Scenario: My data model consists of approx. 450 fields with different
> types of data. We want to include each field for indexing; as a result it
> will create a single Solr document with *450 fields*. The total number of
> records in the data set is *755K*. We will be using features like faceting
> and sorting on approx. 50 fields. We are planning to use Solr 4.1.
> Following is the hardware configuration of the web server that we plan to
> install Solr on:
>
> CPU: 2 x Dual Core (4 cores) | RAM: 12GB | Storage: 212 GB
>
> Questions:
>
> 1) What's the best approach when dealing with documents with a large
> number of fields? What's the drawback of having a single document with a
> very large number of fields? Does Solr support documents with a large
> number of fields, as in my case?
>
> 2) Will there be any performance issue if I define all of the 450 fields
> for indexing? Also if faceting is done on 50 fields, with documents having
> a large number of fields and a huge number of records?
>
> 3) The names of the fields in the data set are quite lengthy, around 60
> characters. Will it be a problem defining fields with such long names in
> the schema file? Is there any best practice to be followed related to
> naming conventions? Will big field names create problems during querying?
>
> Thanks!
Re: Solr cloud and auto shard timeline
Yes Anshum, exactly what I was looking for. Is this being targeted at a particular Solr release? I see that some of the related issues are targeted for 4.3; is that the goal for this as well?

On Fri, Mar 22, 2013 at 8:07 AM, Anshum Gupta <ans...@anshumgupta.net> wrote:

> Hi Jamie,
>
> There's progress on the shard-splitting JIRA that I believe you are
> talking about. You may have a look at this for more details:
> https://issues.apache.org/jira/browse/SOLR-3755
>
> On Fri, Mar 22, 2013 at 4:30 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>
>> I am sorry for the confusion. I had assumed that there was a way to issue
>> commands to ES to have it change its current shard layout (i.e., go from
>> 2 to 4, for instance), but on further reading of their documentation I do
>> not see that. That being said, is there a timeline for being able to add
>> shards to SolrCloud by splitting an existing shard (or set of shards)?
>> And does anyone have a good writeup of the different capabilities between
>> the two at this point?
>
> --
> Anshum Gupta
> http://www.anshumgupta.net
Re: Slow queries for common terms
Hi,

There might not be a final cure with more RAM if you are CPU-bound. Scoring 90M docs is some work. Can you check what's going on during those 15 seconds? Is your CPU at 100%?

Try an (foo OR bar OR baz) search which generates 100 million hits and see if that is slow too, even if you don't use frequent words. I'm sure you can find other frequent terms in your corpus which display similar behaviour, words which are even more frequent than "book".

Are you using AND as the default operator? You will benefit from limiting the number of results as much as possible. The real solution is to shard across N servers until you reach the desired performance for the desired indexing/querying load.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 22 March 2013 at 02:52, David Parks <davidpark...@yahoo.com> wrote:

I figured I was trying to pull a coup here, but this is a temporary configuration while we only run a few users through an early beta. The performance is perfectly good for most terms; it's just this "books" term. I'm curious how adding RAM will solve that. I can see how deploying SolrCloud and sharding should affect it, but would simply giving Solr 16GB of RAM improve query time with this one term that is common to 90M of the 300M documents? In due time I do plan to implement SolrCloud and run the whole thing through proper load testing. Right now I'm just trying to get it to work for a few users. If you could elaborate a bit on your thinking I'd be quite grateful.

David

-----Original Message-----
From: Jan Høydahl [mailto:jan@cominvent.com]
Sent: Thursday, March 21, 2013 8:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

Hi,

If you say that you try to index 300M docs on ONE single Solr server, with a few gigs of RAM, then that's the reason for some bad performance right there. You should benchmark to find the sweet spot of how many documents you want to fit per node/shard and still have acceptable indexing/query performance.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 21 March 2013 at 12:43, David Parks <davidpark...@yahoo.com> wrote:

We have 300M documents, each about a paragraph of text on average. The index is 140GB in size. I'm not sure how to find the IDF score; was that in the debug query below? It seems that any query with the word "book" in it triggers a 15 sec response time (unless it's the 2nd time we run the same query). Looking at terms, 'book' is the 2nd highest term, with 90M documents in the index. Calling 'book' a stop word doesn't seem reasonable, and while that article on bigrams and common grams is fascinating, I wonder if it addresses this situation, in which we aren't really likely to manage a bigram phrase match between the search "book sales improvement" and the terms in the document "category book marketing and sales today the real guide to improving", right? I think this is what's happening here: everything with the common phrase "category book" is getting included, which seems logical and correct.

-----Original Message-----
From: Jan Høydahl [mailto:jan@cominvent.com]
Sent: Thursday, March 21, 2013 5:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Slow queries for common terms

Hi,

I think you can start by reading this blog post
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
and try out the approach using a dictionary of the most common words in your index.

You don't say how many documents, avg. doc size, the IDF value of "book", how much RAM, whether you utilize disk caching well enough, and many other things which could affect this situation. But the pure fact that only a few common search words trigger such a delay would suggest commongrams as a possible way forward.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 21 March 2013 at 11:09, David Parks <davidpark...@yahoo.com> wrote:

I've got a query that takes 15 seconds to return whenever I have the term "book" in a query that isn't cached. That's a pretty common term in our search index. We're indexing about 120 GB of text data. We only store terms and IDs, no document data, and the disk is virtually unused; it's all CPU time. I haven't done much yet to optimize and scale Solr, as we're only trying to support a small number of users in a private beta. I currently only have a couple of gigs of RAM dedicated to Solr (we've ordered more hardware for it, but it's not in yet). I wonder if there's something I can do in the short term to alleviate the problem. Many searches work great, but these ones that take 15+ sec are a black eye. I'd be happy with a short-term fix followed in the near future by a more proper
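For reference, the commongrams approach from the blog post Jan links above is wired into schema.xml roughly like this; it is a sketch, with the field type name and words file as placeholders:

    <fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

At index time the filter emits bigrams joining common words to their neighbours (so "category book" becomes a single cheap term), and the query-side variant rewrites phrase queries to use those bigrams instead of the very frequent single terms.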
Solr 4.2 replication of whole index files
Hi,

I use SolrCloud 4.1. I start up two Solr nodes, A and B, and then created a new collection on A using CoreAdmin with one shard, so node A is the leader. Then I indexed some docs to it. Then I created the same collection on B using CoreAdmin so it would become a replica. I found that Solr syncs all index files from A to B. Under B's data dir I have: an index.20130318083415358 folder which has all the synced index files, index.properties, replication.properties, and a tlog folder (empty inside).

Then I removed the collection from node B using CoreAdmin UNLOAD. I kept all files in B's data dir; I didn't delete them. Then I created the same collection on B again. I found that Solr syncs the files from A to B AGAIN!!! And another folder, index.20130318084514166, was created under B's data folder. I didn't index any docs to A after I UNLOADed the collection on B.

So I wonder: how do I let Solr know that B already has the correct index files so it doesn't sync again?
Urgent: SolrCloud issue
Hi Shawn,

I have seen your post on SolrCloud master-master configuration on two servers. I have to use the same Solr structure, but for a long time I have not been able to configure it to communicate between two servers; on a single server it works fine. Can you please help me out with the required config changes, so that Solr can communicate between two servers?

http://grokbase.com/t/lucene/solr-user/132pb1pe34/solrcloud-master-master

Regards,
Anuj Vats
PatternReplaceFilterFactory -- what does this regex do?
I'm using the Solr Suggester for autocompletion, with the WFSTLookup suggest component and a text file with phrases and weights (http://wiki.apache.org/solr/Suggester).

I found that the following filter made it impossible to match on ampersands, so I removed it. But I'm sure it was there for a reason. What was this supposed to do?

    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\p{L}\p{M}\p{N}\p{Cs}]*[\p{L}\p{M}\p{N}\p{Cs}\_]+:)|([^\p{L}\p{M}\p{N}\p{Cs}])+"
            replacement=" "
            replace="all"/>

Thanks,
Eric Wilson
Re: Sort-field for ALL docs in FieldCache for sort queries - OOM on lots of docs
On 3/21/13 10:50 PM, Shawn Heisey wrote:

> On 3/21/2013 4:05 AM, Per Steffensen wrote:
>
>> Can anyone else elaborate? How to activate it? How to make sure, for
>> sorting, that the sort-field values for all docs are not read into memory
>> for sorting -- leading to OOM when you have a lot of docs? Can this
>> feature be activated on top of an existing 4.0 index, or do you have to
>> re-index everything?
>
> There is one requirement that may not be obvious -- every document must
> have a value in the field, so you must either make the field required or
> give it a default value in the schema. Solr 4.2 will refuse to start the
> core if this requirement is not met.

That is not a problem for us. The field exists on every document.

> The example schema hints that the value might need to be single-valued. I
> have not tested this. Sorting is already problematic on multi-valued
> fields, so I assume that this won't be the case for you.

That is not a problem for us either. The field is single-valued.

> To use docValues, add docValues="true" and then either set required="true"
> or default="somevalue" on the field definition in schema.xml, restart Solr
> or reload the core, and reindex. Your index will get bigger.

So the answer to "...or do you have to re-index everything?" is yes!?

> If the touted behavior of handling the sort mechanism in OS disk cache
> memory (or just reading the disk if there's not enough memory) rather than
> heap is correct, then it should solve your issues. I hope it does!

Me too. I will find out soon -- I hope! But re-indexing is kind of a problem for us; we will figure it out. Is there a guide to re-indexing all your stuff anywhere, so I do it the easiest way? I guess maybe there are some nice tricks about streaming data directly from one Solr running the old index into a new Solr running the new index, and then discarding the old index afterwards?

> Thanks,
> Shawn

Thanks a lot, Shawn!

Regards, Per Steffensen
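For concreteness, the field definition Shawn describes looks like this; the field name and type are illustrative, and required="true" may be used instead of a default:

    <field name="price" type="tlong" indexed="true" stored="true"
           docValues="true" default="0"/>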
Re: Sort-field for ALL docs in FieldCache for sort queries - OOM on lots of docs
On 3/22/2013 8:54 AM, Per Steffensen wrote:

> Me too. I will find out soon -- I hope! But re-indexing is kind of a
> problem for us; we will figure it out. Is there a guide to re-indexing all
> your stuff anywhere, so I do it the easiest way? I guess maybe there are
> some nice tricks about streaming data directly from one Solr running the
> old index into a new Solr running the new index, and then discarding the
> old index afterwards?

There is no guide to reindexing, because there are so many ways to index. The basic procedure is to repeat whatever you did the first time, possibly deleting the entire index first. Because Lucene and Solr indexes often require changes to deal with changing requirements, the full index procedure should be automated and repeatable.

The DataImportHandler has a SolrEntityProcessor that can index from another Solr instance. All fields must be stored for this to work, because it just retrieves documents and ignores the search index. Many people (including myself) do not store all fields, in an attempt to keep the index size down.

Thanks,
Shawn
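A minimal data-import config for the SolrEntityProcessor approach Shawn mentions might look like this sketch; the URL, core name, and query are placeholders (see the wiki page linked earlier in this digest):

    <dataConfig>
      <document>
        <entity name="fromOldSolr"
                processor="SolrEntityProcessor"
                url="http://oldhost:8983/solr/oldcore"
                query="*:*"
                rows="1000"/>
      </document>
    </dataConfig>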
Solr 4.2, reindexing, transaction logs, high memory usage
Dear List,

We are using Solr 4.2 to build an index of 5M docs, each limited to 6K in size. Conceptually we are modelling a stack of documents. Here is an excerpt from our schema.xml:

    <dynamicField name="publicationBody_*" type="string" indexed="false"
                  stored="true" multiValued="false" termVectors="false"/>
    <copyField source="publicationBody_*" dest="publicationBodies"/>

We have publicationBody_1: ..., publicationBody_2: ..., up to a maximum of 30, with max 10K of data in each.

We run this index sharded into 8 Solr cores on a single host, an m2.4xlarge EC2 instance. We do not use ZooKeeper (because of operational issues on our live indexes) and manage the sharding ourselves.

For this index we run with -Xmx30G and observe (in jconsole) that Solr runs with approximately 25G.

AutoCommit kills Solr: it sends heap memory usage to the max and kills Solr. The reason appears to be committing to all cores in parallel. Disabling autoCommit and running a loop like

    while true; do
      for i in $(seq 0 7); do
        curl -s "http://localhost:8085/solr/core${i}/update?commit=true&wt=json"
      done
    done

produces:

    {"responseHeader":{"status":0,"QTime":8297}}
    {"responseHeader":{"status":0,"QTime":8358}}
    {"responseHeader":{"status":0,"QTime":9552}}
    {"responseHeader":{"status":0,"QTime":8368}}
    {"responseHeader":{"status":0,"QTime":9296}}
    {"responseHeader":{"status":0,"QTime":8527}}
    {"responseHeader":{"status":0,"QTime":9458}}
    {"responseHeader":{"status":0,"QTime":8929}}

8 seconds to process a commit with no changes to the index!?!

Transaction Logs
----------------
    55M  /mnt/solr-stack/solr.data.0/tlog
    45M  /mnt/solr-stack/solr.data.1/tlog
    28M  /mnt/solr-stack/solr.data.2/tlog
    17M  /mnt/solr-stack/solr.data.3/tlog
    118M /mnt/solr-stack/solr.data.4/tlog
    123M /mnt/solr-stack/solr.data.5/tlog
    68M  /mnt/solr-stack/solr.data.6/tlog
    63M  /mnt/solr-stack/solr.data.7/tlog

Index
-----
    2.8G /mnt/solr-stack/solr.data.0/index
    2.7G /mnt/solr-stack/solr.data.1/index
    3.2G /mnt/solr-stack/solr.data.2/index
    2.7G /mnt/solr-stack/solr.data.3/index
    3.1G /mnt/solr-stack/solr.data.4/index
    2.7G /mnt/solr-stack/solr.data.5/index
    2.9G /mnt/solr-stack/solr.data.6/index
    3.0G /mnt/solr-stack/solr.data.7/index

Why does Solr need such a large heap for this index (it dies with 10G and 20G, and is constant at 28G in jconsole)? Why does running commits in parallel, via autoCommit or the command above, exhaust the memory? Are we using dynamic fields incorrectly?

We have also tried to run the same index on an SSD-backed hi1.4xlarge Amazon instance. There, autoCommit every 30 seconds works, rotating transaction log files correctly.

--
Raghav
Senior backend developer - www.issuu.com
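For readers unfamiliar with the setting being debated above: autoCommit lives in each core's solrconfig.xml, roughly as below. The values are illustrative; openSearcher=false is the usual pairing for heavy indexing, so hard commits truncate the tlog without also opening new searchers:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>30000</maxTime>            <!-- hard commit at most every 30 s -->
        <openSearcher>false</openSearcher>  <!-- flush/fsync without exposing a new view -->
      </autoCommit>
    </updateHandler>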
Re: Solr 4.2 replication of whole index files
There are a few things going on here that caused this, all resolved in 4.2 as far as I know.

- Mark

On Mar 22, 2013, at 3:56 AM, bradhill99 <bradhil...@yahoo.com> wrote:

> So I wonder: how do I let Solr know that B already has the correct index
> files so it doesn't sync again?
Re: Solr 4.2 - Slave Index version is higher than Master
Are you replicating configuration files as well?

- Mark

On Mar 22, 2013, at 6:38 AM, "John, Phil (CSS)" <philj...@capita.co.uk> wrote:

> To add to the discussion: we're running classic master/slave replication
> (not SolrCloud) with 1 master and 2 slaves, and I noticed the slave having
> a higher version number than the master the other day as well. In our
> case, knock on wood, it hasn't stopped replication.
>
> If you'd like a copy of our config I can provide it off-list.
>
> Regards,
>
> Phil.
Re: SOLR - Documents with large number of fields ~ 450
> with the on disk option

Could you elaborate on that?

On 22/03/2013 05:25, Mark Miller <markrmil...@gmail.com> wrote:

> You might try using docValues with the on-disk option and let the OS
> manage all the memory needed for the faceting/sorting. This would require
> Solr 4.2.
>
> - Mark
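If memory serves, Solr 4.2 exposes the on-disk option as a docValuesFormat attribute on the field type. The snippet below is an assumption to check against the 4.2 example schema, not a confirmed configuration:

    <!-- Hypothetical: keep this type's docValues on disk rather than in heap. -->
    <fieldType name="long_dv_disk" class="solr.TrieLongField" precisionStep="0"
               docValuesFormat="Disk"/>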
Re: Solr 4.2 - Slave Index version is higher than Master
Hi Mark,

I am replicating the config files below, but not replicating solrconfig.xml:

confFiles: schema.xml, elevate.xml, stopwords.txt, mapping-FoldToASCII.txt, mapping-ISOLatin1Accent.txt, protwords.txt, spellings.txt, synonyms.txt

Also, strangely, I am seeing a big gen difference between master and slave: my master's gen is 2 while the slave's is 56. If I do a full import, the master's gen gets higher than the slave's and it replicates. I have more than 30 cores on my Solr instance and all are scheduled to replicate at the same time.

        Index Version   Gen  Size
Master: 1363903243590   2    94 bytes
Slave:  1363967579193   56   94 bytes

Thanks,
Umesh

On Fri, Mar 22, 2013 at 10:42 AM, Mark Miller-3 [via Lucene] <ml-node+s472066n4050075...@n3.nabble.com> wrote:

> Are you replicating configuration files as well?
>
> - Mark
Re: Solr 4.2 - Slave Index version is higher than Master
Also, I am replicating only on commit and startup.

Thanks,
Umesh

On Fri, Mar 22, 2013 at 11:23 AM, Umesh Sharma <uom...@gmail.com> wrote:

> Hi Mark,
>
> I am replicating the config files below, but not replicating
> solrconfig.xml:
>
> confFiles: schema.xml, elevate.xml, stopwords.txt, mapping-FoldToASCII.txt,
> mapping-ISOLatin1Accent.txt, protwords.txt, spellings.txt, synonyms.txt
Our best bet is to try and recreate this with a unit test. - Mark On Mar 21, 2013, at 4:08 PM, Uomesh [hidden email] http://user/SendEmail.jtp?type=nodenode=4049832i=0 wrote: Hi, I am seeing an issue after upgrading from solr 3.6.2 to Solr 4.2. My Slave stop replicating after sometime. And it seems issue is because of my Slave Index version is higher than master. How could it be possible to Slave Index version is higher than master? Please help me. IS there anything i need to remove from my slave solrconfig.xml. Index Version Gen Size Master: 1363893820575 93 8.75 MB Slave: 1363896006624 94 8.75 MB Thanks, Umesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-2-Slave-Index-version-is-higher-than-Master-tp4049827.html Sent from the Solr - User mailing list archive at Nabble.com. -- If you reply to this email, your message will be added to the discussion below: http://lucene.472066.n3.nabble.com/Solr-4-2-Slave-Index-version-is-higher-than-Master-tp4049827p4049832.html To unsubscribe from Solr 4.2 - Slave Index version is higher than
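For reference, "replicate on commit and startup" with the conf files Umesh lists is configured on the master along these lines - a minimal sketch assembled from his description, not his actual solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,elevate.xml,stopwords.txt,mapping-FoldToASCII.txt,mapping-ISOLatin1Accent.txt,protwords.txt,spellings.txt,synonyms.txt</str>
  </lst>
</requestHandler>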
NoSuchMethodError updateDocument
I use Solr 4.1.0 and Nutch 2.1, Java 1.7.0_17, Tomcat 7.0, and IntelliJ IDEA 12, with CentOS 6.4 on my 64-bit computer. I run this command successfully: bin/nutch solrindex http://localhost:8080/solr -index However, when I run this command: bin/nutch solrindex http://localhost:8080/solr -reindex I get this error:

Mar 22, 2013 6:48:27 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Lorg/apache/lucene/index/IndexDocument;Lorg/apache/lucene/analysis/Analyzer;)V
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
  at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:936)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
  at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004)
  at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
  at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NoSuchMethodError: org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Lorg/apache/lucene/index/IndexDocument;Lorg/apache/lucene/analysis/Analyzer;)V
  at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:201)
  at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
  at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:451)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:587)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
  at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
  at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
  at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
  at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
  ... 16 more
Re: Solr 4.2 - Slave Index version is higher than Master
And you're also on 4.2? - Mark

On Mar 22, 2013, at 12:41 PM, Uomesh uom...@gmail.com wrote: Also, I am replicating only on commit and startup. [...]
Re: strange behaviour of wordbreak spellchecker in solr cloud
Hello, Further investigation shows the following pattern, for both the DirectIndex and wordbreak spellcheckers. Assume that in all cases there are spellchecker results when distrib=false. In distributed mode (distrib=true):

case when matches=0: 1. group=true, no spellcheck results; 2. group=false, there are spellcheck results

case when matches>0: 1. group=true, there are spellcheck results; 2. group=false, there are spellcheck results

Do these constitute a failing test case? Thanks. Alex.

-Original Message- From: alxsss alx...@aim.com To: solr-user solr-user@lucene.apache.org Sent: Thu, Mar 21, 2013 6:50 pm Subject: Re: strange behaviour of wordbreak spellchecker in solr cloud

Hello, I am debugging SpellCheckComponent#finishStage. From the responses I see that not only the wordbreak but also the direct spellchecker does not return some results in distributed mode. The request handler I was using had <str name="group">true</str>, so I decided to turn off grouping, and then I see spellcheck results in distributed mode.

curl 'server1:8983/solr/test/testhandler?q=paulusoles&indent=true&rows=10&shards.qt=testhandler' has no spellcheck results, but

curl 'server1:8983/solr/test/testhandler?q=paulusoles&indent=true&rows=10&shards.qt=testhandler&group=false' returns results.

So the conclusion is that grouping causes the distributed spellchecker to fail. Could you please point me to the class that may be responsible for this issue? Thanks. Alex.

-Original Message- From: Dyer, James james.d...@ingramcontent.com To: solr-user solr-user@lucene.apache.org Sent: Thu, Mar 21, 2013 11:23 am Subject: RE: strange behaviour of wordbreak spellchecker in solr cloud

The shard responses get combined in SpellCheckComponent#finishStage. I highly recommend you file a JIRA bug report for this at https://issues.apache.org/jira/browse/SOLR . If you write a failing unit test, it would make it much more likely that others would help you with a fix. Of course, if you solve the issue entirely, a patch would be much appreciated. James Dyer Ingram Content Group (615) 213-4311

-Original Message- From: alx...@aim.com Sent: Thursday, March 21, 2013 12:45 PM To: solr-user@lucene.apache.org Subject: Re: strange behaviour of wordbreak spellchecker in solr cloud

Hello, We need this feature fixed ASAP. So, please let me know which class is responsible for combining spellcheck results from all shards. I will try to debug the code. Thanks in advance. Alex.

-Original Message- From: alxsss alx...@aim.com To: solr-user solr-user@lucene.apache.org Sent: Tue, Mar 19, 2013 11:34 am Subject: Re: strange behaviour of wordbreak spellchecker in solr cloud

> distributed environment. But to nail it down, we probably need to see both
> the applicable requestHandler [...]

Not sure what this is?
I have:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">spell</str>
  <!-- Multiple Spell Checkers can be declared and used by this component -->
  <!-- a spellchecker built from a field of the main index -->
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
    <str name="distanceMeasure">internal</str>
    <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
    <float name="accuracy">0.5</float>
    <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
    <int name="maxEdits">2</int>
    <!-- the minimum shared prefix when enumerating terms -->
    <int name="minPrefix">1</int>
    <!-- maximum number of inspections per result -->
    <int name="maxInspections">5</int>
    <!-- minimum length of a query term to be considered for correction -->
    <int name="minQueryLength">4</int>
    <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
    <float name="maxQueryFrequency">0.01</float>
    <!-- uncomment this to require suggestions to occur in 1% of the documents
      <float name="thresholdTokenFrequency">.01</float>
    -->
  </lst>
  <!-- a spellchecker that can break or combine words. See "/spell" handler below for usage -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
  <!-- a spellchecker that uses a different distance measure -->
  <!--
  <lst name="spellchecker">
    <str name="name">jarowinkler</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
  </lst>
  -->
</searchComponent>
Re: Solr 4.2, reindexing, transaction logs, high memory usage
On 3/22/2013 9:24 AM, Raghav Karol wrote: We run this index sharded into 8 Solr cores on a single host, an m2.4xlarge EC2 instance. We do not use zookeeper (because of operational issues on our live indexes) and manage the sharding ourselves. For this index we run with -Xmx30G and observe in jconsole that Solr runs with approximately 25G. Autocommit kills Solr: it pushes heap usage to the max. The reason appears to be committing to all cores in parallel. Disabling autoCommit and running a loop like

while true; do for i in $(seq 0 7); do curl -s "http://localhost:8085/solr/core${i}/update?commit=true&wt=json"; done; done

produces:

{"responseHeader":{"status":0,"QTime":8297}}
{"responseHeader":{"status":0,"QTime":8358}}
{"responseHeader":{"status":0,"QTime":9552}}
{"responseHeader":{"status":0,"QTime":8368}}
{"responseHeader":{"status":0,"QTime":9296}}
{"responseHeader":{"status":0,"QTime":8527}}
{"responseHeader":{"status":0,"QTime":9458}}
{"responseHeader":{"status":0,"QTime":8929}}

8 seconds to process a commit with no changes to the index!?!

If this index is actively processing queries, then what you are experiencing here is probably cache warming - Solr looks at the entries in each of its caches and uses those entries to run queries against the new index to pre-populate the new caches. The number of entries that are used for warming queries is controlled by the autowarmCount value on the cache definition.

Why does Solr need such a large heap space for this index (it dies with 10G and 20G and is constant at 28G in jconsole)? Why does running commits in parallel via autoCommit or the command above exhaust the memory? Are we using dynamic fields incorrectly?

When you run a commit, Solr fires up a new index searcher object, complete with caches, which will then be autowarmed from the old caches as described above. Until the new object is fully warmed, the old searcher will exist and will continue to serve queries. If you issue another commit while a new searcher is already warming, then *another* searcher is likely to get fired up as well, depending on the value of maxWarmingSearchers in your solrconfig.xml file. The amount of memory required by a searcher can be very high, due in part to caches, especially the FieldCache, which is used internally by Lucene and is not configurable like the others. If you have 8 cores and you run commits on them in parallel that take several seconds, then for several seconds you will have at least sixteen searchers running. If your maxWarmingSearchers value is higher than 1, you might end up with even more searchers running at the same time. This is likely where your memory is going. By lowering the autowarmCount values on your caches, you can reduce the amount of time it takes to do a commit. You should also keep track of whether anything has actually changed on each core and not issue a commit when nothing has changed. Also, it would be a good idea to stagger the commits so that all your cores are not committing at the same time. Thanks, Shawn
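A minimal sketch of the tuning Shawn describes, against the standard cache definitions in the <query> section of solrconfig.xml (the sizes and counts here are illustrative, not recommendations):

<!-- lower autowarmCount so new searchers finish warming quickly after a commit -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
<!-- the documentCache cannot be autowarmed; its internal doc ids change on every commit -->
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<!-- allow only one warming searcher at a time -->
<maxWarmingSearchers>1</maxWarmingSearchers>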
Re: Slow queries for common terms
Hi David and Jan, I wrote the blog post, and David, you are right: the problem we had was with phrase queries, because our positions lists are so huge. Boolean queries don't need to read the positions lists. I think you need to determine whether you are CPU bound or I/O bound. It is possible that you are I/O bound and reading the term frequency postings for 90 million docs is taking a long time. In that case, more memory in the machine (but not dedicated to Solr) might help, because Solr relies on OS disk caching for caching the postings lists. You would still need to do some cache warming with your most common terms. On the other hand, as Jan pointed out, you may be CPU bound because Solr doesn't have early termination and has to rank all 90 million docs in order to show the top 10 or 25. Did you try the OR search to see if your CPU is at 100%? Tom

On Fri, Mar 22, 2013 at 10:14 AM, Jan Høydahl jan@cominvent.com wrote: Hi, There might not be a final cure with more RAM if you are CPU bound. Scoring 90M docs is some work. Can you check what's going on during those 15 seconds? Is your CPU at 100%? Try a (foo OR bar OR baz) search which generates 100 million hits and see if that is slow too, even if you don't use frequent words. I'm sure you can find other frequent terms in your corpus which display similar behaviour, words which are even more frequent than "book". Are you using AND as the default operator? You will benefit from limiting the number of results as much as possible. The real solution is to shard across N servers, until you reach the desired performance for the desired indexing/querying load. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com
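One way to run that experiment (the host, core name and terms below are placeholders): issue the query while watching the Solr process with top, and check the timing section that debugQuery=true adds to the response:

curl "http://localhost:8983/solr/collection1/select?q=foo+OR+bar+OR+baz&rows=10&debugQuery=true"

If the CPU pins at 100% for the whole request, the bottleneck is scoring rather than disk; if it mostly idles while the request runs, you are I/O bound and more OS cache memory is the likelier cure.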
Re: DocValues and field requirements
: Thank you for your response. Yes, that's strange. By enabling DocValues
: the information about missing fields is lost, which changes the way of
: sorting as well. Adding a default value to the fields can change the logic
: of the application dramatically (I can't set the default value to 0 for
: all Trie*Fields, because it could impact the results displayed to the end
: user, which is not good). It's a pity that using DocValues is so limited.

I'm not really up on docvalues, but I asked rmuir about this a bit on IRC. The crux of the issue is that there are two different docvalue impls: one that uses a fixed amount of space per doc (ie: exactly one value per doc) and one that allows an ordered set of values per doc (ie: multivalued). The multivalued docvals impl was wired into Solr for multivalued fields, and the single valued docvals impl was wired in for the single valued case -- but since the single valued docvals impl *has* to have a value for every doc, the schema error you encountered was added if you try to use it on a field that isn't required or doesn't have a default value -- to force you to be explicit about which default you want, instead of the low level Lucene 0 default coming into play without you knowing about it (as Shawn mentioned).

The multivalued docvals impl could conceivably be used instead for these types of single valued fields (ie: to support 0 or 1 values), but there is no sorting support for multivalued docvals, so it would cause other problems.

One possible workaround for people who want to take advantage of sort-missing-first/last type sorting on a docvals type field would be to manage the "missing" information yourself in a distinct field which you also leverage in any filtering or sorting on the docvals field. ie: have a docvalues field "myfield" which is single valued, with some configured default value, and then have a "myfield_exists" boolean field which is single valued and required. When indexing docs, if myfield does/doesn't have a value, set myfield_exists accordingly (this would be fairly trivial in an update processor), and then instead of sorting just on "myfield desc" you would sort on "myfield_exists (asc|desc), myfield desc" (where you pick asc or desc depending on whether you want docs w/o values first or last). You would likewise need to filter on myfield_exists:true anytime you did queries against the myfield field. (Perhaps someone could work on a patch to inject a synthetic field like this automatically for fields that are docValues=true multiValued=false required=false w/o a defaultValue?) -Hoss
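A concrete shape for that workaround, following Hoss's hypothetical myfield/myfield_exists pair (the field names and types here are illustrative):

<field name="myfield" type="tlong" indexed="true" stored="true" docValues="true" default="0"/>
<field name="myfield_exists" type="boolean" indexed="true" stored="true" default="false"/>

Then, at query time, instead of sort=myfield desc:

sort=myfield_exists desc,myfield desc    (docs with no real value land last)
fq=myfield_exists:true                   (whenever filtering on myfield)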
Re: Can we manipulate termfreq to count as 1 for multiple matches?
: parameter *omitTermFreqAndPositions*

The key thing to remember being: if you use this, then by omitting positions you can no longer do phrase queries.

: or you can use a custom similarity class that overrides the term freq and
: return one for only that field.
: http://wiki.apache.org/solr/SchemaXml#Similarity

There is actually a Similarity class already written, designed to target this specific problem of keyword spamming in text fields...

: Document_1
: Name = Blue Jeans
: Description = This jeans is very soft. Jeans is pretty nice.
:
: Now, if I search for Jeans then Jeans is found in 2 places in the
: Description field.

...first off, it's important to remember that 'tf' doesn't affect things in isolation -- usually there is also a lengthNorm factor that would penalize the score of that document compared to another one that had a short description that only included the word Jeans once (ie: "These are Red Jeans"). Using the SweetSpotSimilarity, you can specify target values identifying what ideal values (ie: sweet spot) you anticipate in a typical document for both the tf and lengthNorm...

https://lucene.apache.org/solr/4_2_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
https://lucene.apache.org/core/4_2_0/misc/org/apache/lucene/misc/SweetSpotSimilarity.html

...so if you want to say that 1 to 4 instances of the term are equally good, and only above that start to reward docs more, you could configure the tf function to do that. (If you really want the same tf() scoring factor for all docs, regardless of how many times the term is mentioned -- then you would need to write your own Similarity subclass at the moment.) -Hoss
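If you do want the flat "all counts score as one" behaviour, a minimal sketch of such a subclass might look like the following (the class name is hypothetical; wire it in via the <similarity> element described on the SchemaXml wiki page linked above):

import org.apache.lucene.search.similarities.DefaultSimilarity;

// Scores any number of occurrences of a term in a field the same as a single occurrence.
public class FlatTfSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f; // ignore how often the term appears
  }
}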
Re: transientCacheSize not working
I've created an issue and patch here that make it possible to specify transient and loadOnStartup on core creation: https://issues.apache.org/jira/browse/SOLR-4631

On Wed, Mar 20, 2013 at 10:14 AM, didier deshommes dfdes...@gmail.com wrote: Thanks. Is there a way to pass loadOnStartup and/or transient as parameters to the core admin http api? This doesn't seem to work: curl "http://localhost:8983/solr/admin/cores?action=CREATE&transient=true&name=c1"

On Tue, Mar 19, 2013 at 7:29 PM, Mark Miller markrmil...@gmail.com wrote: I don't think SolrCloud works with the transient stuff. - Mark

On Mar 19, 2013, at 8:04 PM, didier deshommes dfdes...@gmail.com wrote: Hi, I cannot get SolrCloud to respect transientCacheSize when creating multiple cores via the web api. I'm running Solr 4.2 like this:

java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=conf1 -DzkRun -DnumShards=1 -jar start.jar

I'm creating multiple cores via the core admin http api:

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=tmp1"
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=tmp2"
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=tmp3"

My solr.xml looks like:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores transientCacheSize="2" adminPath="/admin/cores" shareSchema="true"
         zkClientTimeout="${zkClientTimeout:15000}" hostPort="8983" hostContext="solr">
  </cores>
</solr>

When I list all cores currently loaded, via curl "http://localhost:8983/solr/admin/cores?action=status", I notice that all 3 cores are still running, even though transientCacheSize is 2. Can anyone tell me why that is? Also, is there a way to pass loadOnStartup and transient to the core admin http api? Specifying these when creating a core doesn't seem to work: curl "http://localhost:8983/solr/admin/cores?action=CREATE&transient=true" Thanks, didier
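With that patch applied, core creation should then accept both flags, along the lines of this sketch (the parameter names follow the issue description; whether they land unchanged depends on the final committed patch):

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=tmp1&transient=true&loadOnStartup=false"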
Re: how to get term vector information of sepcific word/position in field
: is there any way, if i can get term vector information of a specific word
: only, like i can pass the word, and it will just return term position and
: frequency for that word only?
:
: and also if i can pass the position e.g. startPosition=5 and endPosition=10;
: then it will return terms, positions and frequency of words which occurred
: in between the start and end position.

I don't think either of these is available out of the box, but you could probably modify the code in TermVectorComponent that iterates over terms to filter what it adds to the response based on explicitly passed-in term, startPos and endPos params. It would not only cut down on the total data being returned, but since you can do a seek on a TermsEnum, limiting that way should speed up the processing as well. I don't think you can seek on term positions, however, so you'd still have to iterate over all the positions until you found the startPos, but bailing out once you reach the endPos may save some time as well. If you do go this route, by all means please submit a patch in JIRA; it could be handy for other TVC users...

https://wiki.apache.org/solr/HowToContribute
https://issues.apache.org/jira/browse/SOLR

-Hoss
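A sketch of the single-term filtering idea against the Lucene 4.1/4.2 term vector APIs (this shows the shape of the change, not actual TermVectorComponent code; the method and names are illustrative):

import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Print the positions of a single term in one document's term vector.
static void printPositions(IndexReader reader, int docId, String field, String word) throws Exception {
  Terms vector = reader.getTermVector(docId, field);
  if (vector == null) return; // no term vector stored for this field
  TermsEnum te = vector.iterator(null);
  if (!te.seekExact(new BytesRef(word), true)) return; // jump straight to the requested term
  DocsAndPositionsEnum dpe = te.docsAndPositions(null, null);
  if (dpe == null) return; // positions were not stored
  dpe.nextDoc(); // a term vector behaves like a single-document index
  int freq = dpe.freq();
  for (int i = 0; i < freq; i++) {
    int pos = dpe.nextPosition();
    // a startPos <= pos <= endPos filter would go here before adding to the response
    System.out.println(word + " at position " + pos);
  }
}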
Re: Solr 4.2 - Slave Index version is higher than Master
That was to you, Phil. So it seems this is a problem with the configuration replication case, I would guess - I didn't really look at that path in the 4.2 fixes I worked on. I did add it to the new testing I'm doing since I've suspected it (it will prompt a core reload that doesn't happen when configs don't replicate). I'll see what I can do to try and get a test to catch it. - Mark

On Mar 22, 2013, at 1:49 PM, Mark Miller markrmil...@gmail.com wrote: And you're also on 4.2? - Mark [...]
RE: strange behaviour of wordbreak spellchecker in solr cloud
Alex, I added your comments to SOLR-3758 (https://issues.apache.org/jira/browse/SOLR-3758), which seems to me to be the very same issue. If you need this to work now and you cannot devise a fix yourself, then perhaps a workaround is: if the query returns 0 results, re-issue it with rows=0&group=false (you would omit all other optional components also). This will give you back just a spellcheck result. I realize this is not optimal because it requires the overhead of issuing 2 queries, but if you do it only in instances where the user gets nothing (or very little) back, maybe it would be tolerable? Then once a viable fix is devised you can remove the extra code from your application. James Dyer Ingram Content Group (615) 213-4311

-Original Message- From: alx...@aim.com Sent: Friday, March 22, 2013 12:53 PM To: solr-user@lucene.apache.org Subject: Re: strange behaviour of wordbreak spellchecker in solr cloud Hello, Further investigation shows the following pattern, for both the DirectIndex and wordbreak spellcheckers. [...]
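A sketch of that fallback against the handler used earlier in this thread (host, handler and query are from Alex's examples; the numFound check is left informal here):

# first try the normal grouped query
curl "server1:8983/solr/test/testhandler?q=paulusoles&rows=10&shards.qt=testhandler"
# if numFound is 0, re-issue just for suggestions: no rows, no grouping, no other optional components
curl "server1:8983/solr/test/testhandler?q=paulusoles&rows=0&group=false&shards.qt=testhandler"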
Re: strange behaviour of wordbreak spellchecker in solr cloud
Thanks. I can fix this, but going over the code it is not easy to figure out where the whole request and response come from. I followed SpellCheckComponent#finishStage and found out that SearchHandler#handleRequestBody calls this function. However, which part calls handleRequestBody, and how its arguments are constructed, is not clear. Thanks. Alex.

-Original Message- From: Dyer, James james.d...@ingramcontent.com To: solr-user solr-user@lucene.apache.org Sent: Fri, Mar 22, 2013 2:08 pm Subject: RE: strange behaviour of wordbreak spellchecker in solr cloud Alex, I added your comments to SOLR-3758 (https://issues.apache.org/jira/browse/SOLR-3758), which seems to me to be the very same issue. [...]
Re: Did something change with Payloads?
Ok, this is very bizarre. If I insert more than one document at a time using the update handler, like so:

[{"id":"1","foo_ap":"bar|50"},{"id":"2","foo_ap":"bar|75"}]

it actually stores the same payload value, 50, for both docs. That seems like a bug, no? There was a core change in 4.1 to how payloads were stored. I'm wondering if Solr is not handling them properly? Jim
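For context, a *_ap field like this is usually backed by a delimited-payload field type along the lines of the stock example schema (the type below is an assumption about Jim's schema, not taken from it):

<fieldType name="payloads" class="solr.TextField" indexed="true" stored="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "bar|50" indexes the token "bar" with 50 encoded as a float payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>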
Re: overseer queue clogged
Thanks, Mark! The core node names in solr.xml in Solr 4.2 are great! Maybe in 4.3 they can be supported via API? Also, I am glad you mentioned in another post the chance to namespace zookeeper by adding a path to the end of the comma-delimited zk hosts. That works out really well in our situation for having zk serve multiple amazon environments that go up and down independently of each other -- no issues w/ shared clusterstate.json or overseers. Regarding our original problem, we were able to restart all our shards but one, which wasn't getting past:

Mar 20, 2013 5:12:54 PM org.apache.solr.common.cloud.ZkStateReader$2 process
INFO: A cluster state change has occurred - updating...
Mar 20, 2013 5:12:54 PM org.apache.zookeeper.ClientCnxn$EventThread processEvent
SEVERE: Error while calling watcher
java.lang.NullPointerException
  at org.apache.solr.common.cloud.ZkStateReader$2.process(ZkStateReader.java:201)
  at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

We ended up upgrading to Solr 4.2 and rebuilding the whole index from our datastore. -Gary

On Sat, Mar 16, 2013 at 9:51 AM, Mark Miller markrmil...@gmail.com wrote: Yeah, I don't know that I've ever tried with 4.0, but I've done this with 4.1 and 4.2. - Mark

On Mar 16, 2013, at 12:19 PM, Gary Yngve gary.yn...@gmail.com wrote: Cool, I'll need to try this. I could have sworn that it didn't work that way in 4.0, but maybe my test was bunk. -g

On Fri, Mar 15, 2013 at 9:41 PM, Mark Miller markrmil...@gmail.com wrote: You can do this - just modify your starting Solr example to have no cores in solr.xml. You won't be able to make use of the admin UI until you create at least one core, but the core and collection apis will both work fine.
Re: overseer queue clogged
On Mar 22, 2013, at 5:54 PM, Gary Yngve gary.yn...@gmail.com wrote: Thanks, Mark! The core node names in solr.xml in Solr 4.2 are great! Maybe in 4.3 they can be supported via API?

It is with the core admin API - do you mean the collections API? Please make a JIRA for any feature requests so they don't get lost!

[...] We ended up upgrading to Solr 4.2 and rebuilding the whole index from our datastore.

Hmm… hopefully this issue has been addressed. Thanks for the stack trace, I'll use it to do some inspection. - Mark
RE: strange behaviour of wordbreak spellchecker in solr cloud
Alex, You may want to move over to the dev list now that you're working on code. Or, if you would rather not subscribe to the dev list, add yourself as a watcher to SOLR-3758 and comment further there. This will help us keep track of progress on the issue.

The short answer is that in a distributed set-up SpellCheckComponent (and others) work in 2 phases. In the first phase, each shard is sent the request almost as if it were a complete (non-distributed) index unto itself. The difference is that an additional parameter is added to the request indicating that this is the first phase of a distributed request. SpellCheckComponent uses this knowledge to include additional information in the response that normally wouldn't go out to an end client. The first phase calls the component's process() method, just as would be done if this were a non-distributed call. In the second phase, the initiating shard collects the responses from all of the shards' process() methods and combines them. This is where finishStage() is called. So while process() runs in parallel on all of the shards, finishStage() runs only on the initiating shard, after the various shards have returned their responses.

The code you found in SearchHandler is what coordinates all of these activities. It is very complicated code, but honestly you probably will not need to understand it to fix this. What you probably will find is that each shard's process() returns the correct result, just as you get with your hand-done testing, but somehow finishStage() does not properly combine the responses when grouping is involved. It might be that the responses come back just a little differently and finishStage() cannot cope, or something along those lines.

James Dyer Ingram Content Group (615) 213-4311

-Original Message- From: alx...@aim.com Sent: Friday, March 22, 2013 4:31 PM To: solr-user@lucene.apache.org Subject: Re: strange behaviour of wordbreak spellchecker in solr cloud Thanks. I can fix this, but going over the code it is not easy to figure out where the whole request and response come from. [...]
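The pattern James describes maps onto the SearchComponent API like this sketch (the pattern, not actual SpellCheckComponent code; ShardParams.IS_SHARD and the stage constants are standard Solr 4.x API):

import java.io.IOException;

import org.apache.solr.common.params.ShardParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class TwoPhaseSketchComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // nothing needed for this sketch
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Phase 1: runs on every shard; the coordinator marks shard requests with isShard=true.
    if (rb.req.getParams().getBool(ShardParams.IS_SHARD, false)) {
      // a real component adds extra merge information to the response here
    }
  }

  @Override
  public void finishStage(ResponseBuilder rb) {
    // Phase 2: runs only on the initiating shard, after the shard responses have arrived.
    if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS) {
      // combine the per-shard results collected in rb.finished here
    }
  }

  @Override
  public String getDescription() { return "two-phase sketch"; }

  @Override
  public String getSource() { return ""; }
}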
Re: Did something change with Payloads?
On Mar 22, 2013, at 5:54 PM, jimtronic jimtro...@gmail.com wrote: Ok, this is very bizarre. If I insert more than one document at a time using the update handler, it actually stores the same payload value for both docs. [...] There was a core change in 4.1 to how payloads were stored. I'm wondering if Solr is not handling them properly?

This could be - if you have compiled a lot of evidence (sorry, I have not had time to follow up on this myself), please create a JIRA issue for more prominence. - Mark
doc cache issues... query-time way to bypass cache?
I have a situation we just discovered in Solr 4.2 where there are previously cached results from a query with a limited field list, and when querying for the whole field list, Solr responds differently depending on which shard gets the query (no extra replicas): it returns the document with either the limited field list or the full field list. We're releasing tonight, so is there a query param to selectively bypass the cache, which I can use as a temp fix? Thanks, Gary
Boost query parameter with Lucid parser and using query FunctionQuery
I have been playing around with the bq/bf/boost query parameters available in dismax/edismax. I am using the Lucid parser as my default parser for the query. The Lucid parser is an extension of the DisMax parser and should contain everything that is available in that parser. My goal is to boost items that have the word "treatment" in the title field. I started with the bq parameter and this works, but it is an additive boost. I would prefer a multiplicative boost, so I started to look at using boost, which is part of edismax. This is my full query:

/lucid?q=cancer&sort=score+desc&fl=title,score&wt=xml&indent=true&debugQuery=true&boost=product(10,query({!dismax qf=title v=treatment},0))

What I see in the debug data:

<str name="parsedquery">BoostedQuery(boost((abstract:blood | author:blood | origtitle:blood | substance:blood | text_all:blood | title:blood^5.0)~0.01,product(const(10),query(+(title:treatment) (abstract:treatment | author:treatment | substance:treatment | title:treatment^5.0 | text_all:treatment | origtitle:treatment),def=0.0))))</str>
<str name="parsedquery_toString">boost((abstract:blood | author:blood | origtitle:blood | substance:blood | text_all:blood | title:blood^5.0)~0.01,product(const(10),query(+(title:treatment) (abstract:treatment | author:treatment | substance:treatment | title:treatment^5.0 | text_all:treatment | origtitle:treatment),def=0.0)))</str>

In the boost query I am specifying the field as title, but it is expanding to look in all of the fields. How do I restrict the boost query to just look in the title field? Thanks, Will
Re: NoSuchMethodError updateDocument
I just set this JVM parameter: -Dsolr.solr.home=/home/projects/lucene-solr/solr/solr_home where solr_home is where my config files etc. live. My solr.xml has these lines:

<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
  <core name="collection1" instanceDir="collection1/" />
</cores>

On the other hand, I run it from my Tomcat, without using the example's embedded Jetty start.jar. Any ideas?

2013/3/22 Furkan KAMACI furkankam...@gmail.com: I use Solr 4.1.0 and Nutch 2.1, Java 1.7.0_17, Tomcat 7.0, and IntelliJ IDEA 12, with CentOS 6.4 on my 64-bit computer. I run this command successfully: bin/nutch solrindex http://localhost:8080/solr -index However, when I run this command: bin/nutch solrindex http://localhost:8080/solr -reindex I get this error: java.lang.NoSuchMethodError: org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Lorg/apache/lucene/index/IndexDocument;Lorg/apache/lucene/analysis/Analyzer;)V [...]
Re: Boost query parameter with Lucid parser and using query FunctionQuery
You'll have to contact Lucid's support for questions about their code. (I've been away from that code too long to recall much about it.)
-- Jack Krupansky

-----Original Message----- From: Miller, Will Jr Sent: Friday, March 22, 2013 7:07 PM To: solr-user@lucene.apache.org Subject: Boost query parameter with Lucid parser and using query FunctionQuery

I have been playing around with the bq/bf/boost query parameters available in dismax/edismax. I am using the Lucid parser as my default parser for the query. The Lucid parser is an extension of the DisMax parser and should contain everything that is available in that parser. My goal is to boost items that have the word "treatment" in the title field. I started with the bq parameter and this works, but it is an additive boost. I would prefer a multiplicative boost, so I started to look at using boost, which is part of edismax. This is my full query:

/lucid?q=cancer&sort=score+desc&fl=title,score&wt=xml&indent=true&debugQuery=true&boost=product(10,query({!dismax qf=title v=treatment},0))

What I see in the debug data:

<str name="parsedquery">BoostedQuery(boost((abstract:blood | author:blood | origtitle:blood | substance:blood | text_all:blood | title:blood^5.0)~0.01,product(const(10),query(+(title:treatment) (abstract:treatment | author:treatment | substance:treatment | title:treatment^5.0 | text_all:treatment | origtitle:treatment),def=0.0</str>
<str name="parsedquery_toString">boost((abstract:blood | author:blood | origtitle:blood | substance:blood | text_all:blood | title:blood^5.0)~0.01,product(const(10),query(+(title:treatment) (abstract:treatment | author:treatment | substance:treatment | title:treatment^5.0 | text_all:treatment | origtitle:treatment),def=0.0)))</str>

In the boost query I am specifying the field as title, but it is expanding to look in all of the fields. How do I restrict the boost query to just look in the title field?
Thanks, Will
Re: Boost query parameter with Lucid parser and using query FunctionQuery
Why would you use dismax for the query() when you want to match a simple term to one field? If you share echoParams=all, the answer may lie somewhere therein.
-- Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

23. mars 2013 kl. 00:07 skrev Miller, Will Jr will.mil...@wolterskluwer.com:
[...]
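If the goal is just one term against one field, the stock field query parser avoids dismax entirely. A minimal sketch of that alternative, assuming the standard {!field} parser is registered alongside the Lucid parser:

    boost=product(10,query({!field f=title v=treatment},0))

{!field} produces a plain term (or phrase) query on the named field, so nothing in it can expand to other fields.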
Re: NoSuchMethodError updateDocument
Are you 100% sure you use the exact jars for 4.1.0 *everywhere*, and that you're not blending older versions from the Nutch distro into your classpath here? BTW: what was your question here regarding Jetty vs. Tomcat?
-- Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

23. mars 2013 kl. 00:50 skrev Furkan KAMACI furkankam...@gmail.com:
[...]
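One low-tech way to answer the jar question is to inventory the Lucene jars Tomcat actually deploys and print their manifest versions. A throwaway sketch; the webapp path below is hypothetical and must be pointed at the real deployment:

    import java.io.File;
    import java.util.jar.JarFile;
    import java.util.jar.Manifest;

    public class LuceneJarCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical location; adjust to where your solr war is exploded.
            File lib = new File("/usr/local/tomcat/webapps/solr/WEB-INF/lib");
            for (File f : lib.listFiles()) {
                if (f.getName().startsWith("lucene-") && f.getName().endsWith(".jar")) {
                    try (JarFile jar = new JarFile(f)) {
                        Manifest m = jar.getManifest();
                        String version = (m == null) ? "(no manifest)"
                                : m.getMainAttributes().getValue("Implementation-Version");
                        System.out.println(f.getName() + " -> " + version);
                    }
                }
            }
        }
    }

Any mismatch among the lucene-* jars, or a second lucene-core dragged in from the Nutch distribution, would produce exactly this kind of NoSuchMethodError.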
RE: Boost query parameter with Lucid parser and using query FunctionQuery
This is the echoParams output... It looks like it ignores the qf in the FunctionQuery and instead takes the qf of the main query.

<lst name="params">
  <str name="spellcheck">true</str>
  <str name="facet">true</str>
  <str name="sort">score desc</str>
  <str name="facet.limit">11</str>
  <str name="q.alt">*:*</str>
  <str name="showFindSimilarLinks">true</str>
  <str name="f.body.hl.alternateField">body</str>
  <str name="hl">true</str>
  <str name="stopwords.enabled">true</str>
  <str name="feedback">false</str>
  <str name="echoParams">all</str>
  <str name="fl">title,score</str>
  <str name="f.body.hl.maxAlternateFieldLength">250</str>
  <arr name="role">
    <str>DEFAULT</str>
    <str>DEFAULT</str>
  </arr>
  <arr name="facet.field">
    <str>author_display</str>
    <str>data_source_name</str>
    <str>keywords_display</str>
    <str>mimeType</str>
  </arr>
  <str name="synonyms.fields">abstract,body,comments,country,description,diseaseconcept,genesymbol,grant,institution,investigator,investigatoraffiliation,keywordheading,nlmjournalname,origtitle,otherabstract,personname,primaryauthor,protocolconcept,spaceflight,substance,text_all,title</str>
  <str name="auto-complete">true</str>
  <str name="likeDoc.fl">author,title</str>
  <str name="facet.mincount">1</str>
  <str name="feedback.emphasis">relevancy</str>
  <str name="qf">abstract author origtitle substance text_all title^5.0</str>
  <str name="hl.fl">abstract,author,authorfullname,authorlast,body,comments,country,diseaseconcept,genesymbol,grant,institution,investigator,investigatoraffiliation,keywordheading,nlmjournalname,origtitle,otherabstract,personname,primaryauthor,protocolconcept,substance,title</str>
  <str name="spellcheck.collate">true</str>
  <str name="spellcheck.onlyMorePopular">true</str>
  <str name="defType">lucid</str>
  <str name="pf">abstract substance author title^5.0 text_all origtitle</str>
  <str name="stopwords.fields">abstract,body,comments,country,description,diseaseconcept,genesymbol,grant,institution,investigator,investigatoraffiliation,keywordheading,keywords,nlmjournalname,origtitle,otherabstract,personname,primaryauthor,protocolconcept,spaceflight,substance,title</str>
  <str name="boost">product(10,query({!dismax qf=title v=treatment},0))</str>
  <str name="synonyms.enabled">true</str>
  <str name="debugQuery">true</str>
  <str name="indent">true</str>
  <str name="q">cancer</str>
  <str name="wt">xml</str>
</lst>

-----Original Message----- From: Jan Høydahl [mailto:jan@cominvent.com] Sent: Friday, March 22, 2013 8:07 PM To: solr-user@lucene.apache.org Subject: Re: Boost query parameter with Lucid parser and using query FunctionQuery
[...]
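Comparing the parsedquery debug output earlier in this thread with the echoParams list above: the stray disjunction (abstract:treatment | author:treatment | substance:treatment | title:treatment^5.0 | text_all:treatment | origtitle:treatment) matches the request-level pf list exactly, so the nested {!dismax} appears to honor the local qf (hence the +(title:treatment) clause) while still inheriting pf from the outer request. If that reading is right, overriding pf locally as well should confine the boost to the title field. A sketch, assuming the Lucid parser passes local params through like stock dismax:

    boost=product(10,query({!dismax qf=title pf=title v=treatment},0))

Parameter dereferencing is another way to keep the nested query's params self-contained (titleboost is just an illustrative parameter name):

    boost=product(10,query($titleboost,0))&titleboost={!dismax qf=title pf=title v=treatment}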
Question on highlighting of external fields
Some time ago I worked with a fellow developer to put together an add-on to the (then) current Solr Highlighter to support fetching fields from an external source (like a database, for instance). The general mechanics seem to work properly, but I am now seeing issues where the highlights do not line up with the matched terms in the query (i.e. the user enters "dragon" and the em tags land 10 characters after that word). A simple test I put together does not exhibit this, so I am at a bit of an impasse as to how exactly to track the issue down. Are there any general things that I should be aware of when attempting to do this? Is there any encoding/analysis that I need to consider (i.e. is it sufficient to store the text as it came in, or should it be the text after an analyzer has done something to it)? Any thoughts on this would be greatly appreciated.
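For what it's worth, the classic Highlighter derives its offsets from a TokenStream produced by re-analyzing whatever text you hand it, so the external source must return exactly the character sequence that was analyzed at index time; any trimming, entity decoding, or whitespace normalization applied on one side only will shift every highlight by a fixed-looking amount, much like the 10-character drift described. A minimal standalone sketch of the core call sequence, assuming the org.apache.lucene.search.highlight API and using "body" as a stand-in field name:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

    class ExternalHighlightSketch {
        // externalText must match, character for character, what was indexed
        // for this field; analyzer must be the same one used at index time.
        static String highlightExternal(Query query, Analyzer analyzer, String externalText)
                throws Exception {
            QueryScorer scorer = new QueryScorer(query, "body");
            Highlighter hl = new Highlighter(new SimpleHTMLFormatter("<em>", "</em>"), scorer);
            TokenStream ts = analyzer.tokenStream("body", new StringReader(externalText));
            // Offsets come from ts; the highlighted characters come from
            // externalText, so any divergence between the two misplaces the tags.
            String frag = hl.getBestFragment(ts, externalText);
            return (frag != null) ? frag : externalText;
        }
    }

A quick sanity check is to compare, for one failing document, the external value against the text that was actually sent to Solr at index time; the first divergence point usually identifies which transformation needs to move to the other side.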