Re: solr 3.5 taking long to index
There were some changes in solrconfig.xml between solr3.1 and solr3.5. Always read CHANGES.txt when switching to a new version. Also helpful is comparing both versions of solrconfig.xml from the examples. Are you sure you need a MaxPermSize of 5g? Use jvisualvm to see what you really need. The same goes for all the other JAVA_OPTS.
On 11.04.2012 19:42, Rohit wrote:
We recently migrated from solr3.1 to solr3.5; we have one master and one slave configured. The master has two cores,
1) Core1 - 44555972 documents
2) Core2 - 29419244 documents
We commit every 5000 documents, but lately the commit is taking very long, 15 minutes plus in some cases. What could have caused this? I have checked the logs and the only warning I can see is:
WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.
Memory details:
export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g
Solr Config:
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
<maxFieldLength>1</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>1</commitLockTimeout>
What could be causing this, as everything was running fine a few days back?
Regards, Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg
Re: Multi-words synonyms matching
Oh, that's right. Thanks a lot, Elisabeth
2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com:
Elisabeth - As you described, the mapping below might suit your need:
mairie => hotel de ville, mairie
mairie gets expanded to hotel de ville and mairie at index time. So both mairie and hotel de ville are searchable on the document. However, the white space tokenizer splitting at query time will still be a problem, as described by Markus.
--Jeevanandam
On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
Have you tried the '=>' mapping instead? Something like hotel de ville => mairie might work for you.
Yes, thanks, I've tried it, but from what I understand it doesn't solve my problem, since this means hotel de ville will be replaced by mairie at index time (I use synonyms only at index time). So when the user asks for hôtel de ville, it won't match. In fact, at index time I have mairie in my data, but I want the user to be able to request mairie or hôtel de ville and have mairie as an answer, and not have mairie as an answer when requesting hôtel.
To map `mairie` to `hotel de ville` as a single token you must escape your white space. mairie, hotel\ de\ ville This results in a problem if your tokenizer splits on white space at query time.
Ok, I guess this means I have a problem. No simple solution, since at query time my tokenizer does split on white spaces. I guess my problem is more or less one of the problems discussed in http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
Thanks a lot for your answers, Elisabeth
2012/4/10 Erick Erickson erickerick...@gmail.com:
Have you tried the '=>' mapping instead? Something like hotel de ville => mairie might work for you. Best Erick
On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit elisaelisael...@gmail.com wrote:
Hello, I've read several posts on this issue, but can't find a real solution to my multi-word synonyms matching problem. I have in my synonyms.txt an entry like
mairie, hotel de ville
and my index time analyzer is configured as follows for synonyms:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
The problem I have is that now mairie matches with hotel, and I would only want mairie to match with hotel de ville and mairie. When I look into the analyzer, I see that mairie is mapped onto hotel, and the words de ville are added in second and third position. To change that, I tried to do
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
(as I read in one post) and I can see now in the analyzer that mairie is mapped to hotel de ville, but now when I query hotel de ville, it doesn't match at all with mairie. Does anyone have a clue what I'm doing wrong? I'm using Solr 3.4.
Thanks, Elisabeth
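For readers landing on this thread, a minimal sketch of the index-time expansion Jeevanandam describes; only the SynonymFilterFactory line and the synonyms.txt entry come from the thread, the tokenizer shown is an assumption:
# synonyms.txt, applied at index time only
mairie => hotel de ville, mairie

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
As noted above, a query-time tokenizer that splits on whitespace will still break hôtel de ville into separate terms, so the multi-word side of the mapping remains the hard part.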
Re: Solr 3.5 takes very long to commit gradually
What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain?
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
On 12. apr. 2012, at 06:45, Rohit wrote:
We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores,
1) Core1 - 44555972 documents
2) Core2 - 29419244 documents
We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is,
WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.
Memory details:
export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g
Solr Config:
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
<maxFieldLength>1</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>1</commitLockTimeout>
Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back?
Regards, Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg
Re: Solr 3.5 takes very long to commit gradually
Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
Re: Solr 3.5 takes very long to commit gradually
Hi Rohit, Can you please check the solrconfig.xml in 3.5 and compare it with 3.1 if there are any warming queries specified while opening the searchers after a commit. Thanks, Tirthankar On Apr 12, 2012, at 3:30 AM, Tirthankar Chatterjee wrote: Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
RE: Solr 3.5 takes very long to commit gradually
Hi Tirthankar, The average size of documents would be a few Kb's this is mostly tweets which are being saved. The two cores are storing different kind of data and nothing else. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: 12 April 2012 13:14 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
RE: Solr 3.5 takes very long to commit gradually
Operating system in linux ubuntu. No not using spellchecker Only language detection in my update chain. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Jan Høydahl [mailto:jan@cominvent.com] Sent: 12 April 2012 12:50 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg
Re: Solr 3.5 takes very long to commit gradually
thanks Rohit.. for the information. On Apr 12, 2012, at 4:08 AM, Rohit wrote: Hi Tirthankar, The average size of documents would be a few Kb's this is mostly tweets which are being saved. The two cores are storing different kind of data and nothing else. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: 12 April 2012 13:14 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. * **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)
Hi, I'm using Apache Solr 3.5.0 and Jetty 8.1.2 with Windows 7. (Versions in the Book used... Solr 3.1, Jetty 6.1.26) I've tried to get Solr running with Jetty. - I copied the jetty.xml and the webdefault.xml from the example Solr. - I copied the solr.war to webapps - I copied the solr directory from the example dir to the jetty dir. When I try to start I get this error message: C:\\jetty-solrjava -jar start.jar java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.eclipse.jetty.start.Main.invokeMain(Main.java:457) at org.eclipse.jetty.start.Main.start(Main.java:602) at org.eclipse.jetty.start.Main.main(Main.java:82) Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.Server at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at org.eclipse.jetty.util.Loader.loadClass(Loader.java:92) at org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.nodeClass(XmlConfiguration.java:349) at org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.configure(XmlConfiguration.java:327) at org.eclipse.jetty.xml.XmlConfiguration.configure(XmlConfiguration.java:291) at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1203) at java.security.AccessController.doPrivileged(Native Method) at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138) ... 7 more Usage: java -jar start.jar [options] [properties] [configs] java -jar start.jar --help # for more information Thanks for your help, Bastian
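The ClassNotFoundException for org.mortbay.jetty.Server suggests the jetty.xml copied from the Solr example was written for Jetty 6, which lived in the org.mortbay packages, while Jetty 8 uses org.eclipse.jetty. A sketch of the difference in the top-level element of jetty.xml (the rest of the file needs the same treatment, class names below are the standard Jetty ones, not taken from this post):
<!-- Jetty 6 style, as shipped with the Solr example -->
<Configure id="Server" class="org.mortbay.jetty.Server">

<!-- Jetty 8 equivalent -->
<Configure id="Server" class="org.eclipse.jetty.server.Server">
An alternative is to keep running Solr with the Jetty bundled in the Solr example directory instead of porting its configuration files to a newer Jetty.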
Re: Facets involving multiple fields
Hi, Thanks for your answer. Let's say I have to fields : 'keywords' and 'short_title'. For these fields I'd like to make a faceted search : if 'Computer' is stored in at least one of these fields for a document I'd like to get it added in my results. doc1 = keywords : 'Computer' / short_title : 'Computer' doc2 = keywords : 'Computer' doc3 = short_title : 'Computer' In this case I'd like to have : Computer (3) I don't see how to solve this with facet.query. Thanks, Marc. On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson erickerick...@gmail.com wrote: Have you considered facet.query? You can specify an arbitrary query to facet on which might do what you want. Otherwise, I'm not sure what you mean by faceted search using two fields. How should these fields be combined into a single facet? What that means practically is not at all obvious from your problem statement. Best Erick On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER marc.schneide...@gmail.com wrote: Hi, I'd like to make a faceted search using two fields. I want to have a single result and not a result by field (like when using facet.field=f1,facet.field=f2). I don't want to use a copy field either because I want it to be dynamic at search time. As far as I know this is not possible for Solr 3.x... But I saw a new parameter named group.facet for Solr4. Could that solve my problem? If yes could somebody give me an example? Thanks, Marc.
Lexical analysis tools for German language data
Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set.
It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness.
Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out there (commercial or free) because I've seen lots of engines grokking German and the way it builds words.
Failing that, what are the proper terms to refer to these techniques so you can search more successfully?
Michael
Re: Large Index and OutOfMemoryError: Map failed
Your largest index has 66 segments (690 files) ... biggish but not insane. With 64K maps you should be able to have ~47 searchers open on each core. Enabling compound file format (not the opposite!) will mean fewer maps ... ie should improve this situation. I don't understand why Solr defaults to compound file off... that seems dangerous. Really we need a Solr dev here... to answer how long is a stale searcher kept open. Is it somehow possible 46 old searchers are being left open...? I don't see any other reason why you'd run out of maps. Hmm, unless MMapDirectory didn't think it could safely invoke unmap in your JVM. Which exact JVM are you using? If you can print the MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure. Yes, switching away from MMapDir will sidestep the too many maps issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if there really is a leak here (Solr not closing the old searchers or a Lucene bug or something...) then you'll eventually run out of file descriptors (ie, same problem, different manifestation).
Mike McCandless
http://blog.mikemccandless.com
2012/4/11 Gopal Patwa gopalpa...@gmail.com:
I have not change the mergefactor, it was 10. Compound index file is disable in my config but I read from below post, that some one had similar issue and it was resolved by switching from compound index file format to non-compound index file. and some folks resolved by changing lucene code to disable MMapDirectory. Is this best practice to do, if so is this can be done in configuration? http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html
I have index document of core1 = 5 million, core2=8million and core3=3million and all index are hosted in single Solr instance I am going to use Solr for our site StubHub.com, see attached ls -l list of index files for all core
SolrConfig.xml:
<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <ramBufferSizeMB>4096</ramBufferSizeMB>
  <maxThreadStates>10</maxThreadStates>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <double name="forceMergeDeletesPctAllowed">0.0</double>
    <double name="reclaimDeletesWeight">10.0</double>
  </mergePolicy>
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="keepOptimizedOnly">false</str>
    <str name="maxCommitsToKeep">0</str>
  </deletionPolicy>
</indexDefaults>
<updateHandler class="solr.DirectUpdateHandler2">
  <maxPendingDeletes>1000</maxPendingDeletes>
  <autoCommit>
    <maxTime>90</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${inventory.solr.softcommit.duration:1000}</maxTime>
  </autoSoftCommit>
</updateHandler>
Forwarded conversation
Subject: Large Index and OutOfMemoryError: Map failed
From: Gopal Patwa gopalpa...@gmail.com
Date: Fri, Mar 30, 2012 at 10:26 PM
To: solr-user@lucene.apache.org
I need help!! I am using Solr 4.0 nightly build with NRT and I often get this error during auto commit java.lang.OutOfMemoryError: Map failed. I have search this forum and what I found it is related to OS ulimit setting, please se below my ulimit settings. I am not sure what ulimit setting I should have? and we also get java.net.SocketException: Too many open files NOT sure how many open file we need to set?
I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3 - 15GB, with Single shard We update the index every 5 seconds, soft commit every 1 second and hard commit every 15 minutes Environment: Jboss 4.2, JDK 1.6 , CentOS, JVM Heap Size = 24GB ulimit: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 401408 max locked memory (kbytes, -l) 1024 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 401408 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited ERROR: 2012-03-29 15:14:08,560 [] priority=ERROR app_name= thread=pool-3-thread-1 location=CommitTracker line=93 auto
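A sketch of the configuration changes discussed above, in solrconfig.xml terms; the directoryFactory line is only included to show where the Directory implementation is chosen in configuration, and switching it away from MMapDirectory is exactly what Mike advises against:
<indexDefaults>
  <useCompoundFile>true</useCompoundFile>
  <!-- remaining settings unchanged -->
</indexDefaults>

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>
Compound files only take effect for newly written segments, so the number of mapped files drops gradually as segments are flushed and merged after the change.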
codecs for sorted indexes
Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
AW: Lexical analysis tools for German language data
Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary- backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. A simple approach would obviously be a word list and a regular expression. There will, however, be nuts and bolts to take care of. A more sophisticated and tested approach might be known to you. Michael
Re: Lexical analysis tools for German language data
Michael, I'm on this list and the lucene list since several years and have not found this yet. It's been one neglected topics to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the de-compounding words efficiently from a broad corpus but I have never seen it (and the experts at DFKI I asked for for also told me they didn't know of one). paul Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit : Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael
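For reference, the dictionary-based filter alluded to above ships with Solr/Lucene; a minimal sketch, assuming a hand-built list of common base words (the file name and thresholds are illustrative):
<filter class="solr.DictionaryCompoundWordTokenFilterFactory"
        dictionary="german-base-words.txt"
        minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
        onlyLongestMatch="true"/>
With jacke in the dictionary, indexing Windjacke also emits the subword jacke, so a query for Jacke matches; the quality then stands or falls with the word list, which is the gap Paul describes.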
Re: Lexical analysis tools for German language data
You might have a look at: http://www.basistech.com/lucene/ Am 12.04.2012 11:52, schrieb Michael Ludwig: Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael
Re: EmbeddedSolrServer and StreamingUpdateSolrServer
Hi Mikhail Khludnev, Thank you for the reply. I think the index is getting corrupted because StreamingUpdateSolrServer is keeping reference to some index files that are being deleted by EmbeddedSolrServer during commit/optimize process. As a result when I Index(Full) using EmbeddedSolrServer and then do Incremental index using StreamingUpdateSolrServer it fails with a FileNotFound exception. A special note: we don't optimize the index after Incremental indexing(StreamingUpdateSolrServer) but we do optimize it after the Full index(EmbeddedSolrServer). Please see the below log and let me know if you need further information. --- Mar 29, 2012 12:05:03 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {add=[035405]} 0 28 Mar 29, 2012 12:05:03 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update/extract params={stream.type=text/htmlliteral.stream_source_info=/snps/docs/customer/q_and_a/html/035405.htmlliteral.stream_name=035405.htmlwt=javabincollectionName=docsversion=2} status=0 QTime=28 Mar 29, 2012 12:05:03 AM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit(optimize=false,waitSearcher=true,expungeDeletes=false,softCommit=false) Mar 29, 2012 12:05:03 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {commit=} 0 10 Mar 29, 2012 12:05:03 AM org.apache.solr.common.SolrException log SEVERE: java.io.FileNotFoundException: /opt/solr/home/data/docs_index/index/_3d.cfs (No such file or directory) at java.io.RandomAccessFile.open(Native Method) at java.io.RandomAccessFile.init(RandomAccessFile.java:233) at org.apache.lucene.store.MMapDirectory.createSlicer(MMapDirectory.java:229) at org.apache.lucene.store.CompoundFileDirectory.init(CompoundFileDirectory.java:65) at org.apache.lucene.index.SegmentCoreReaders.init(SegmentCoreReaders.java:82) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:112) at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:700) at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:263) at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2852) at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2843) at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2616) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2731) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2719) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2703) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:325) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:84) at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154) at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:107) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:52) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1477) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) - Thanks, PC Rao. -- View this message in context: http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3905071.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Lexical analysis tools for German language data
If you want that query jacke matches a document containing the word windjacke or kinderjacke, you could use a custom update processor. This processor could search the indexed text for words matching the pattern .*jacke and inject the word jacke into an additional field which you can search against. You would need a whole list of possible suffixes, of course. It would slow down the update process but you don't need to split words during search. Best, Valeriy On Thu, Apr 12, 2012 at 12:39 PM, Paul Libbrecht p...@hoplahup.net wrote: Michael, I'm on this list and the lucene list since several years and have not found this yet. It's been one neglected topics to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the de-compounding words efficiently from a broad corpus but I have never seen it (and the experts at DFKI I asked for for also told me they didn't know of one). paul Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit : Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael
Re: Lexical analysis tools for German language data
Bernd, can you please say a little more? I think this list is ok to contain some description for commercial solutions that satisfy a request formulated on list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes, for which domain? The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's an amount of job done in this direction (e.g. Gärten to match Garten) but being precise for this question would be more helpful! paul Le 12 avr. 2012 à 12:46, Bernd Fehling a écrit : You might have a look at: http://www.basistech.com/lucene/ Am 12.04.2012 11:52, schrieb Michael Ludwig: Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael
Solr Scoring
Hi,
I have a field in my index called itemDesc which I am applying EnglishMinimalStemFilterFactory to. So if I index a value to this field containing Edges, the EnglishMinimalStemFilterFactory applies stemming and Edges becomes Edge. Now when I search for Edges, documents with Edge score better than documents with the actual search word - Edges. Is there a way I can make documents with the actual search word, in this case Edges, score better than documents with Edge? I am using Solr 3.5. My field definition is shown below:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
Thanks.
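One common way to get this behavior is to index the text twice, once stemmed and once unstemmed, and boost the unstemmed field at query time; a sketch under that assumption (the extra field and type names are illustrative, not from the post):
<field name="itemDesc" type="text_en" indexed="true" stored="true"/>
<field name="itemDescExact" type="text_en_exact" indexed="true" stored="false"/>
<copyField source="itemDesc" dest="itemDescExact"/>
Here text_en_exact would be the same analyzer chain as text_en minus the EnglishMinimalStemFilterFactory. Querying both fields, for example with dismax and qf=itemDesc itemDescExact^2, lets documents containing the literal Edges outscore those that only match the stem Edge.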
two structures in solr
Hi all,
I'm a solr newbie, so sorry if I do anything wrong ;) I want to use SOLR not only for fast text search, but mainly to create a very fast search engine for a high-traffic system (MySQL would not do the job if the db grows too big). I need to store *two big structures* in SOLR: projects and contractors. Contractors will search for available projects and project owners will search for contractors who would do it for them.
So far, I have found a solr tutorial for newbies http://www.solrtutorial.com, where I found the schema file which defines the data structure: http://www.solrtutorial.com/schema-xml.html. But my case is that *I want to have two structures*. I guess running two parallel solr instances is not the idea. I took a look at http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup and I can see that the schema goes like:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
  <types>
    ...
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
    ...
  </fields>
</schema>
But still, this is a single structure. And I need 2. Great thanks in advance for any help. There are not many tutorials for SOLR in the web.
--
View this message in context: http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3905143.html
Sent from the Solr - User mailing list archive at Nabble.com.
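The usual way to host two independent structures in one Solr instance is multiple cores, each with its own schema.xml; a minimal solr.xml sketch using the names from the post (the directory layout is illustrative):
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="projects" instanceDir="projects"/>
    <core name="contractors" instanceDir="contractors"/>
  </cores>
</solr>
Each core then gets its own conf/schema.xml and conf/solrconfig.xml and is addressed as /solr/projects and /solr/contractors.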
Re: Question about solr.WordDelimiterFilterFactory
WordDelimiterFilterFactory will _almost_ do what you want by setting things like catenateWords=0 and catenateNumbers=1, _except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd
Is that close enough? Otherwise, writing a simple Filter is probably the way to go.
Best
Erick
On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu joseph...@yahoo.com wrote:
Hello, I am new to solr/lucene. I am tasked to index a large number of documents. Some of these documents contain decimal points. I am looking for a way to index these documents so that adjacent numeric characters (such as [0-9.,]) are treated as single token. For example,
12.34 = 12.34
12,345 = 12,345
However, , and . should be treated as usual when around non-digital characters. For example, ab,cd = ab cd. It is so that searching for 12.34 will match 12.34 not 12 34. Searching for ab.cd should match both ab.cd and ab cd. After doing some research on solr, It seems that there is a build-in analyzer called solr.WordDelimiterFilter that supports a types attribute which map special characters as different delimiters. However, it isn't exactly what I want. It doesn't provide context check such as , or . must surround by digital characters, etc. Does anyone have any experience configuring solr to meet this requirements? Is writing my own plugin necessary for this simple thing? Thanks in advance!
-Jian
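A sketch of the filter settings Erick describes; everything other than the two catenate flags is a commonly used default rather than something from this thread:
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1"/>
With these settings 12.34 produces 12, 34 and the catenated 1234, while ab,cd produces only ab and cd, which matches the behavior described above minus the preserved punctuation.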
Dismax request handler differences Between Solr Version 3.5 and 1.4
Hi,
We are currently using solr (version 1.4.0.2010.01.13.08.09.44). We have a strange situation in the dismax request handler: when we search for a keyword and append qt=dismax, we are not getting any results. The solr request is as follows:
http://local:8983/solr/core2/select/?q=Bank&version=2.2&start=0&rows=10&indent=on&defType=dismax&debugQuery=on
The response is as follows:
<result name="response" numFound="0" start="0" />
<lst name="debug">
  <str name="rawquerystring">Bank</str>
  <str name="querystring">Bank</str>
  <str name="parsedquery">+() ()</str>
  <str name="parsedquery_toString">+() ()</str>
  <lst name="explain" />
  <str name="QParser">DisMaxQParser</str>
  <null name="altquerystring" />
  <null name="boostfuncs" />
  <lst name="timing">
    <double name="time">0.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
    </lst>
    <lst name="process">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
    </lst>
  </lst>
</lst>
</response>
We are currently testing Solr version 3.5, and the same request is working fine in that version. Also the query alternative params are not working properly in Solr 1.5 when compared with version 3.5. The request seems to be the same, but we don't know where the issue is coming from. Please help me out. Thanks in advance.
Regards,
Sivaganesh
siva_srm...@yahoo.co.in
--
View this message in context: http://lucene.472066.n3.nabble.com/Dismax-request-handler-differences-Between-Solr-Version-3-5-and-1-4-tp3905192p3905192.html
Sent from the Solr - User mailing list archive at Nabble.com.
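A parsed query of +() () from dismax usually means the handler ended up with no query fields at all; in 1.4 the qf parameter generally has to be supplied on the request or in the handler defaults, whereas later releases are more forgiving, which would explain why the same request works on 3.5. A sketch of a handler definition with explicit defaults, field names being placeholders:
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^2 description</str>
  </lst>
</requestHandler>
Adding qf=... directly to the failing 1.4 request is a quick way to confirm whether this is the cause.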
Re: Facets involving multiple fields
facet.query=keywords:computer short_title:computer seems like what you're asking for. On Thu, Apr 12, 2012 at 3:19 AM, Marc SCHNEIDER marc.schneide...@gmail.com wrote: Hi, Thanks for your answer. Let's say I have to fields : 'keywords' and 'short_title'. For these fields I'd like to make a faceted search : if 'Computer' is stored in at least one of these fields for a document I'd like to get it added in my results. doc1 = keywords : 'Computer' / short_title : 'Computer' doc2 = keywords : 'Computer' doc3 = short_title : 'Computer' In this case I'd like to have : Computer (3) I don't see how to solve this with facet.query. Thanks, Marc. On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson erickerick...@gmail.com wrote: Have you considered facet.query? You can specify an arbitrary query to facet on which might do what you want. Otherwise, I'm not sure what you mean by faceted search using two fields. How should these fields be combined into a single facet? What that means practically is not at all obvious from your problem statement. Best Erick On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER marc.schneide...@gmail.com wrote: Hi, I'd like to make a faceted search using two fields. I want to have a single result and not a result by field (like when using facet.field=f1,facet.field=f2). I don't want to use a copy field either because I want it to be dynamic at search time. As far as I know this is not possible for Solr 3.x... But I saw a new parameter named group.facet for Solr4. Could that solve my problem? If yes could somebody give me an example? Thanks, Marc.
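For completeness, the corresponding request and how the count comes back (counts illustrative; the space in the facet.query value needs URL-encoding in a real request):
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.query=keywords:computer short_title:computer
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="keywords:computer short_title:computer">3</int>
  </lst>
</lst>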
Re: Lexical analysis tools for German language data
Paul, nearly two years ago I requested an evaluation license and tested BASIS Tech Rosette for Lucene Solr. Was working excellent but the price was much, much too high. Yes, they also have compound analysis for several languages including German. Just configure your pipeline in solr and setup the processing pipeline in Rosette Language Processing (RLP) and thats it.
Example from my very old schema.xml config:
<fieldtype name="text_rlp" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
               rlpContext="solr/conf/rlp-index-context.xml"
               postPartOfSpeech="false"
               postLemma="true"
               postStem="true"
               postCompoundComponents="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
               rlpContext="solr/conf/rlp-query-context.xml"
               postPartOfSpeech="false"
               postLemma="true"
               postCompoundComponents="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
So you just point the tokenizer to RLP and have two RLP pipelines configured, one for indexing (rlp-index-context.xml) and one for querying (rlp-query-context.xml).
Example from my rlp-index-context.xml config:
<contextconfig>
  <properties>
    <property name="com.basistech.rex.optimize" value="false"/>
    <property name="com.basistech.ela.retokenize_for_rex" value="true"/>
  </properties>
  <languageprocessors>
    <languageprocessor>Unicode Converter</languageprocessor>
    <languageprocessor>Language Identifier</languageprocessor>
    <languageprocessor>Encoding and Character Normalizer</languageprocessor>
    <languageprocessor>European Language Analyzer</languageprocessor>
    <!--
    <languageprocessor>Script Region Locator</languageprocessor>
    <languageprocessor>Japanese Language Analyzer</languageprocessor>
    <languageprocessor>Chinese Language Analyzer</languageprocessor>
    <languageprocessor>Korean Language Analyzer</languageprocessor>
    <languageprocessor>Sentence Breaker</languageprocessor>
    <languageprocessor>Word Breaker</languageprocessor>
    <languageprocessor>Arabic Language Analyzer</languageprocessor>
    <languageprocessor>Persian Language Analyzer</languageprocessor>
    <languageprocessor>Urdu Language Analyzer</languageprocessor>
    -->
    <languageprocessor>Stopword Locator</languageprocessor>
    <languageprocessor>Base Noun Phrase Locator</languageprocessor>
    <!-- <languageprocessor>Statistical Entity Extractor</languageprocessor> -->
    <languageprocessor>Exact Match Entity Extractor</languageprocessor>
    <languageprocessor>Pattern Match Entity Extractor</languageprocessor>
    <languageprocessor>Entity Redactor</languageprocessor>
    <languageprocessor>REXML Writer</languageprocessor>
  </languageprocessors>
</contextconfig>
As you can see I used the European Language Analyzer.
Bernd
On 12.04.2012 12:58, Paul Libbrecht wrote:
Bernd, can you please say a little more? I think this list is ok to contain some description for commercial solutions that satisfy a request formulated on list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes, for which domain? The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's an amount of job done in this direction (e.g. Gärten to match Garten) but being precise for this question would be more helpful!
paul
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
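A minimal sketch of the Solr side of that setup, assuming HDFS is mounted via fuse at /mnt/hdfs (the path is illustrative):
<!-- solrconfig.xml -->
<dataDir>/mnt/hdfs/solr/data</dataDir>
To Solr the fuse mount is just a local filesystem, so no other configuration changes are needed; whether index write patterns perform acceptably over fuse/HDFS is worth benchmarking before committing to it.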
is there a downside to combining search fields with copyfield?
hello everyone, can people give me their thoughts on this. currently, my schema has individual fields to search on. are there advantages or disadvantages to taking several of the individual search fields and combining them in to a single search field? would this affect search times, term tokenization or possibly other things. example of individual fields brand category partno example of a single combined search field part_info (would combine brand, category and partno) thank you for any feedback mark -- View this message in context: http://lucene.472066.n3.nabble.com/is-there-a-downside-to-combining-search-fields-with-copyfield-tp3905349p3905349.html Sent from the Solr - User mailing list archive at Nabble.com.
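A sketch of the combined-field approach from the subject line, using the field names from the post (types and attributes are assumptions):
<field name="brand" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="partno" type="string" indexed="true" stored="true"/>
<field name="part_info" type="text_general" indexed="true" stored="false" multiValued="true"/>

<copyField source="brand" dest="part_info"/>
<copyField source="category" dest="part_info"/>
<copyField source="partno" dest="part_info"/>
copyField copies the raw field value before analysis, so part_info gets its own analysis chain, and the destination needs multiValued="true" because several sources feed it. Searching one combined field is usually simpler and at least as fast as querying several fields, at the cost of extra index size and of losing per-field boosting.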
AW: Lexical analysis tools for German language data
Von: Valeriy Felberg If you want that query jacke matches a document containing the word windjacke or kinderjacke, you could use a custom update processor. This processor could search the indexed text for words matching the pattern .*jacke and inject the word jacke into an additional field which you can search against. You would need a whole list of possible suffixes, of course. Merci, Valeriy - I agree on the feasability of such an approach. The list would likely have to be composed of the most frequently used terms for your specific domain. In our case, it's things people would buy in shops. Reducing overly complicated and convoluted product descriptions to proper basic terms - that would do the job. It's like going to a restaurant boasting fancy and unintelligible names for the dishes you may order when they are really just ordinary stuff like pork and potatoes. Thinking some more about it, giving sufficient boost to the attached category data might also do the job. That would shift the burden of supplying proper semantics to the guys doing the categorization. It would slow down the update process but you don't need to split words during search. Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit : Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. A query for Windjacke or Kinderjacke would probably not have to be de-specialized to Jacke because, well, that's the user input and users looking for specific things are probably doing so for a reason. If no matches are found you can still tell them to just broaden their search. Michael
Re: Lexical analysis tools for German language data
Hi, We've done a lot of tests with the HyphenationCompoundWordTokenFilter using a from TeX generated FOP XML file for the Dutch language and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the dictionary for the HunspellStemFilter. It does introduce a recall/precision problem but it at least returns results for those many users that do not properly use compounds in their search query. There seem to be a small issue with the filter where minSubwordSize=N yields subwords of size N-1. Cheers, On Thursday 12 April 2012 12:39:44 Paul Libbrecht wrote: Michael, I'm on this list and the lucene list since several years and have not found this yet. It's been one neglected topics to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the de-compounding words efficiently from a broad corpus but I have never seen it (and the experts at DFKI I asked for for also told me they didn't know of one). paul Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit : Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael -- Markus Jelsma - CTO - Openindex
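A sketch of the filter described above, assuming a TeX-derived FOP hyphenation file and a base-word dictionary (both file names and the thresholds are illustrative):
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="hyph_de.xml"
        dictionary="dictionary.txt"
        minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
        onlyLongestMatch="true"/>
The hyphenation patterns propose split points and the dictionary constrains which subwords are actually emitted; emitting the components is also what lets the stemmer work on compounds that the stemming dictionary alone would miss, as noted above.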
Further questions about behavior in ReversedWildcardFilterFactory
I asked the question in http://lucene.472066.n3.nabble.com/A-little-onfusion-with-maxPosAsterisk-tt3889226.html However, when I did some implementation work, I got further questions. 1. Suppose I don't use ReversedWildcardFilterFactory at index time; it seems that Solr doesn't allow a leading wildcard search and will return the error: org.apache.lucene.queryParser.ParseException: Cannot parse 'sequence:*A*': '*' or '?' not allowed as first character in WildcardQuery But when I use the ReversedWildcardFilterFactory, I can use *A* in the query. As far as I know, ReversedWildcardFilterFactory should only work on the index side and should not affect query behavior. If that is true, how does this happen? 2. Based on the question above, suppose I have these tokens in the index: 1.AB/MNO/UUFI 2.BC/MNO/IUYT 3.D/MNO/QEWA 4./MNO/KGJGLI 5.QOEOEF/MNO/ Suppose I use Lucene directly: I can set the QueryParser with setAllowLeadingWildcard(true), and a search for *MNO* should return the tokens above (1-5). But in Solr, when I run the *MNO* query with ReversedWildcardFilterFactory at index time but the StandardAnalyzer at query time, I don't know what happens. The leading wildcard *MNO should be fast to match 5 with ReversedWildcardFilterFactory, and the trailing wildcard MNO* should be fast to match 4. But what about *MNO* ? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Further-questions-about-behavior-in-ReversedWildcardFilterFactory-tp3905416p3905416.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Thanks Darren. Actually, I would like the system to be homogenous - i.e., use Hadoop based tools that already provide all the necessary scaling for the lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote: You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
AW: Lexical analysis tools for German language data
Von: Markus Jelsma We've done a lot of tests with the HyphenationCompoundWordTokenFilter using a from TeX generated FOP XML file for the Dutch language and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the dictionary for the HunspellStemFilter. Thank you for pointing me to these two filter classes. It does introduce a recall/precision problem but it at least returns results for those many users that do not properly use compounds in their search query. Could you define what the term recall should be taken to mean in this context? I've also encountered it on the BASIStech website. Okay, I found a definition: http://en.wikipedia.org/wiki/Precision_and_recall Dank je wel! Michael
RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
SolrCloud or any other tech-specific replication isn't going to 'just work' with Hadoop replication. But with some significant custom coding anything should be possible. Interesting idea. --- Original Message --- On 4/12/2012 09:21 AM Ali S Kureishy wrote: Thanks Darren. Actually, I would like the system to be homogenous - i.e., use Hadoop based tools that already provide all the necessary scaling for the lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar
Re: Question about solr.WordDelimiterFilterFactory
Erick, Thank you for your response! The problem with this approach is that searching for 12:34 will also match 12.34 which is not what I want. From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org; Jian Xu joseph...@yahoo.com Sent: Thursday, April 12, 2012 8:01 AM Subject: Re: Question about solr.WordDelimiterFilterFactory WordDelimiterFilterFactory will _almost_ do what you want by setting things like catenateWords=0 and catenateNumbers=1, _except_ that the punctuation will be removed. So 12.34 - 1234 ab,cd - ab cd is that close enough? Otherwise, writing a simple Filter is probably the way to go. Best Erick On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu joseph...@yahoo.com wrote: Hello, I am new to solr/lucene. I am tasked to index a large number of documents. Some of these documents contain decimal points. I am looking for a way to index these documents so that adjacent numeric characters (such as [0-9.,]) are treated as single token. For example, 12.34 = 12.34 12,345 = 12,345 However, , and . should be treated as usual when around non-digital characters. For example, ab,cd = ab cd. It is so that searching for 12.34 will match 12.34 not 12 34. Searching for ab.cd should match both ab.cd and ab cd. After doing some research on solr, It seems that there is a build-in analyzer called solr.WordDelimiterFilter that supports a types attribute which map special characters as different delimiters. However, it isn't exactly what I want. It doesn't provide context check such as , or . must surround by digital characters, etc. Does anyone have any experience configuring solr to meet this requirements? Is writing my own plugin necessary for this simple thing? Thanks in advance! -Jian
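If writing a custom filter feels heavy, another direction that might work is a pattern-based tokenizer (for instance solr.PatternTokenizerFactory with group=0) whose regex only keeps '.' and ',' when they sit between digits. Here is a plain-Java check of such a pattern; the regex is an illustration, not a tested schema fragment:
[code]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberAwareTokens {
    // runs of digits optionally joined by '.' or ',' stay together;
    // other punctuation is dropped, leaving plain word-character tokens
    private static final Pattern TOKEN = Pattern.compile("\\d+(?:[.,]\\d+)*|\\w+");

    public static void main(String[] args) {
        for (String input : new String[] {"12.34", "12,345", "ab,cd", "ab.cd"}) {
            Matcher m = TOKEN.matcher(input);
            StringBuilder out = new StringBuilder(input).append(" ->");
            while (m.find()) {
                out.append(' ').append(m.group());
            }
            System.out.println(out); // e.g. "12.34 -> 12.34" and "ab,cd -> ab cd"
        }
    }
}
[/code]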
RE: SOLR 3.3 DIH and Java 1.6
Thanks guys for all the help. We moved to an upgraded O.S. version and the java script worked. - Randolf -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-3-3-DIH-and-Java-1-6-tp3841355p3905583.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
Can anyone help me out with this? Is this too complicated / unclear? I could share more detail if needed. On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan dmitry@gmail.com wrote: Hello, Hopefully this question is not too complex to handle, but I'm currently stuck with it. We have a system with nTiers, that is: Solr front base --- Solr front -- shards Inside QueryComponent there is a method createRetrieveDocs(ResponseBuilder rb) which collects doc ids of each shard and sends them in different queries using the ids parameter: [code] sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ',')); [/code] This actually produces NPE (same as in https://issues.apache.org/jira/browse/SOLR-1477) in the first tier, because Solr front (on the second tier) fails to process such a query. I have tried to fix this by using a unique field with a value of ids ORed (the following code substitutes the code above): [code] StringBuffer idsORed = new StringBuffer(); for (IteratorString iterator = ids.iterator(); iterator.hasNext(); ) { String next = iterator.next(); if (iterator.hasNext()) { idsORed.append(next).append( OR ); } else { idsORed.append(next); } } sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(), idsORed.toString()); [/code] This works perfectly if for rows=n there is n or less hits from a distributed query. However, if there are more than 2*n hits, the querying fails with an NPE in a completely different component, which is HighlightComponent (highlights are requested in the same query with hl=truehl.fragsize=5hl.requireFieldMatch=truehl.fl=targetTextField): SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:619) It sounds like the ids of documents somehow get shuffled and the instruction (only a hypothesis) [code] ShardDoc sdoc = rb.resultIds.get(id); [/code] returns sdoc=null, which causes the next line of code to fail with an NPE: [code] int idx = sdoc.positionInResponse; [/code] Am I missing anything? Can something be done for solving this issue? Thanks. -- Regards, Dmitry Kan -- Regards, Dmitry Kan
Re: Error
Please review: http://wiki.apache.org/solr/UsingMailingLists You haven't said whether, for instance, you're using trunk which is the only version that supports the termfreq function. Best Erick On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari abhishek.tiwari@gmail.com wrote: http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*version=2.2start=0rows=10indent=onsort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc Error : HTTP Status 400 - Missing sort order. Why i am getting error ?
Import null values from XML file
We import an XML file directly to SOLR using a the script called post.sh in the exampledocs. This is the script: FILES=$* URL=http://localhost:8983/solr/update for f in $FILES; do echo Posting file $f to $URL curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8' echo done #send the commit command to make sure all the changes are flushed and visible curl $URL --data-binary 'commit/' -H 'Content-type:text/xml; charset=utf-8' echo Our XML file looks something like this: add doc field name=ProductGuidD22BF0B9-EE3A-49AC-A4D6-000B07CDA18A/field field name=SkuGuidD22BF0B9-EE3A-49AC-A4D6-000B07CDA18A/field field name=ProductGroupId1000/field field name=VendorSkuCodeCK4475/field field name=VendorSkuAltCodeCK4475/field field name=ManufacturerSkuCodeNULL/field field name=ManufacturerSkuAltCodeNULL/field field name=UpcEanSkuCode840655037330/field field name=VendorSupersededSkuCodeNULL/field field name=VendorProductDescriptionEBC CLUTCH KIT/field field name=VendorSkuDescriptionEBC CLUTCH KIT/field /doc /add How can I tell solr that the NULL value should be treated as null? Thanks, Randolf -- View this message in context: http://lucene.472066.n3.nabble.com/Import-null-values-from-XML-file-tp3905600p3905600.html Sent from the Solr - User mailing list archive at Nabble.com.
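If treating the literal string NULL as "no value" is acceptable, one option might be a small custom update processor that drops such fields before the document is indexed. This is only a sketch against the update-processor API as I understand it; the class and package names are made up:
[code]
package com.example.solr;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class DropNullValuesProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                List<String> toRemove = new ArrayList<String>();
                for (String name : doc.getFieldNames()) {
                    if ("NULL".equals(doc.getFieldValue(name))) {
                        toRemove.add(name); // collect first to avoid modifying while iterating
                    }
                }
                for (String name : toRemove) {
                    doc.removeField(name);
                }
                super.processAdd(cmd);
            }
        };
    }
}
[/code]
The factory would then be wired into an updateRequestProcessorChain in solrconfig.xml and attached to the /update handler.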
Re: Lexical analysis tools for German language data
German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the s in Weihnachtsbaum (Weihnachten/Baum). Internal nouns should be recapitalized, like Baum above. Some compounds probably should not be decompounded, like Fahrrad (fahren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Verbs get more complicated inflections, and might need to be decapitalized, like fahren above. Und so weiter. Note that highlighting gets pretty weird when you are matching only part of a word. Luckily, a lot of compounds are simple, and you could well get a measurable improvement with a very simple algorithm. There isn't anything complicated about compounds like Orgelmusik or Netzwerkbetreuer. The Basis Technology linguistic analyzers aren't cheap or small, but they work well. wunder On Apr 12, 2012, at 3:58 AM, Paul Libbrecht wrote: Bernd, can you please say a little more? I think this list is ok to contain some description for commercial solutions that satisfy a request formulated on list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes, for which domain? The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's a fair amount of work done in this direction (e.g. Gärten to match Garten) but being precise for this question would be more helpful! paul Le 12 avr. 2012 à 12:46, Bernd Fehling a écrit : You might have a look at: http://www.basistech.com/lucene/ Am 12.04.2012 11:52, schrieb Michael Ludwig: Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out there (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms to refer to these techniques, so you can search more successfully? Michael
[Solr 4.0] Is it possible to do soft commit from code and not configuration only
Hi, I need to configure Solr so that the open searcher will see a new document immediately after it is added to the index, and I don't want to perform a commit each time a new document is added. I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but it didn't help. Is there a way to perform a soft commit from code in Solr 4.0? Thank you in advance. Best regards, Lyuba
AW: Lexical analysis tools for German language data
Von: Walter Underwood German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the s in Weihnachtsbaum (Weihnachten/Baum). I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme). But there are not many of them - phrased as a regex, it's /e?[ns]/. The Weihnachtsbaum in the example above is formed from the singular (die Weihnacht), then s, then Baum. Still, it's much more complex than, say, English or Italian. Internal nouns should be recapitalized, like Baum above. Casing won't matter for indexing, I think. The way I would go about obtaining stems from compound words is by using a dictionary of stems and a regex. We'll see how far that'll take us. Some compounds probably should not be decompounded, like Fahrrad (fahren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Good point. Note that highlighting gets pretty weird when you are matching only part of a word. Guess it'll be weird when you get it wrong, like Noten in Notentriegelung. Luckily, a lot of compounds are simple, and you could well get a measurable improvement with a very simple algorithm. There isn't anything complicated about compounds like Orgelmusik or Netzwerkbetreuer. Exactly. The Basis Technology linguistic analyzers aren't cheap or small, but they work well. We will consider our needs and options. Thanks for your thoughts. Michael
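As a toy illustration of the stems-plus-regex idea Michael describes above - the stem list and the joint-morpheme handling are deliberately simplistic, so this is nothing more than a sketch:
[code]
import java.util.*;

public class ToyDecompounder {
    // illustrative stem list only; a real dictionary would hold thousands of entries
    private static final Set<String> STEMS = new HashSet<String>(Arrays.asList(
            "weihnacht", "baum", "wind", "jacke", "orgel", "musik", "netzwerk", "betreuer"));

    /** Returns {head, tail} if the word splits into two known stems, else null. */
    public static String[] decompound(String word) {
        String w = word.toLowerCase(Locale.GERMAN);
        for (int i = 1; i < w.length(); i++) {
            String head = w.substring(0, i);
            if (!STEMS.contains(head)) continue;
            // allow an optional joint morpheme (Fugenmorphem: e, s, n, es, en) after the head
            String tail = w.substring(i).replaceFirst("^(es|en|e|s|n)", "");
            if (STEMS.contains(tail)) return new String[] { head, tail };
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decompound("Weihnachtsbaum"))); // [weihnacht, baum]
        System.out.println(Arrays.toString(decompound("Windjacke")));      // [wind, jacke]
        System.out.println(Arrays.toString(decompound("Fahrrad")));        // null - not in the stem list
    }
}
[/code]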
Re: [Solr 4.0] Is it possible to do soft commit from code and not configuration only
On Apr 12, 2012, at 11:28 AM, Lyuba Romanchuk wrote: Hi, I need to configure the solr so that the opened searcher will see a new document immidiately after it was adding to the index. And I don't want to perform commit each time a new document is added. I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but it didn't help. Can you elaborate on didn't help? You couldn't find any docs unless you did an explicit commit? If that is true and there is no user error, this would be a bug. Is there way to perform soft commit from code in Solr 4.0 ? Yes - check out the wiki docs - I can't remember how it is offhand (I think it was slightly changed recently). Thank you in advance. Best regards, Lyuba - Mark Miller lucidimagination.com
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config schema.xml You must have a _version_ field defined: field name=_version_ type=long indexed=true stored=true/ On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote: I didn't have a _version_ field, since nothing in the schema says that it's required! On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote: Hard to say why its not working for you. Start with a fresh Solr and work forward from there or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push delete query*:*/query /delete followed by: commit/ I get no errors, the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems my relevant bit of solrconfig.xml. My URP only implements processAdd. updateRequestProcessorChain name=RNI !-- some day, add parameters when we have some -- processor class=com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory/ processor class=solr.LogUpdateProcessorFactory / processor class=solr.DistributedUpdateProcessorFactory/ processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain !-- activate RNI processing by adding the RNI URP to the chain for xml updates -- requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.chainRNI/str /lst /requestHandler - Mark Miller lucidimagination.com
Re: AW: Lexical analysis tools for German language data
Le 12 avr. 2012 à 17:46, Michael Ludwig a écrit : Some compounds probably should not be decompounded, like Fahrrad (farhren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Good point. More or less, Fahrrad is generally abbreviated as Rad. (even though Rad can mean wheel and bike) Note that highlighting gets pretty weird when you are matching only part of a word. Guess it'll be a weird when you get it wrong, like Noten in Notentriegelung. This decomposition should not happen because Noten-triegelung does not have a correct second term. The Basis Technology linguistic analyzers aren't cheap or small, but they work well. We will consider our needs and options. Thanks for your thoughts. My question remains as to which domain it aims at covering. We had such need for mathematics texts... I would be pleasantly surprised if, for example, Differenzen-quotient would be decompounded. paul
Re: Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)
On 4/12/2012 2:21 AM, Bastian Hepp wrote: When I try to start I get this error message: C:\\jetty-solrjava -jar start.jar java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.eclipse.jetty.start.Main.invokeMain(Main.java:457) at org.eclipse.jetty.start.Main.start(Main.java:602) at org.eclipse.jetty.start.Main.main(Main.java:82) Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.Server at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at org.eclipse.jetty.util.Loader.loadClass(Loader.java:92) at org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.nodeClass(XmlConfiguration.java:349) at org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.configure(XmlConfiguration.java:327) at org.eclipse.jetty.xml.XmlConfiguration.configure(XmlConfiguration.java:291) at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1203) at java.security.AccessController.doPrivileged(Native Method) at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138) Bastian, The jetty.xml included with Solr is littered with org.mortbay class references, which are appropriate for Jetty 6. Jetty 7 and 8 use the org.eclipse prefix, and from the very small amount of investigation I did a few weeks ago, have also made other changes to the package names, so you might not be able to simply replace org.mortbay with org.eclipse. The absolutely easiest option would be to just use the jetty included with Solr, not version 8. If you want to keep using Jetty 8, you will need to find/make a new jetty.xml file. If I were set on using Jetty 8 and had to make it work, I would check out trunk (Lucene/Solr 4.0) from the Apache SVN server, find the example jetty.xml there, and use it instead. It's possible that you may need to still make changes, but that is probably the path of least resistance. The jetty version has been upgraded in trunk. Another option would be to download Jetty 6, find its jetty.xml, and compare it with the one in Solr, to find out what the Lucene developers changed from default. Then you would have to take the default jetty.xml from Jetty 8 and make similar changes to make a new config. Apparently Jetty 8 no longer supports JSP with the JRE, so you're probably going to need the JDK. The developers have eliminated JSP from trunk, so it will still work with the JRE. Thanks, Shawn
Re: Large Index and OutOfMemoryError: Map failed
On Apr 12, 2012, at 6:07 AM, Michael McCandless wrote: Your largest index has 66 segments (690 files) ... biggish but not insane. With 64K maps you should be able to have ~47 searchers open on each core. Enabling compound file format (not the opposite!) will mean fewer maps ... ie should improve this situation. I don't understand why Solr defaults to compound file off... that seems dangerous. Really we need a Solr dev here... to answer how long is a stale searcher kept open. Is it somehow possible 46 old searchers are being left open...? Probably only if there is a bug. When a new Searcher is opened, any previous Searcher is closed as soon as there are no more references to it (eg all in flight requests to that Searcher finish). I don't see any other reason why you'd run out of maps. Hmm, unless MMapDirectory didn't think it could safely invoke unmap in your JVM. Which exact JVM are you using? If you can print the MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure. Yes, switching away from MMapDir will sidestep the too many maps issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if there really is a leak here (Solr not closing the old searchers or a Lucene bug or something...) then you'll eventually run out of file descriptors (ie, same problem, different manifestation). Mike McCandless http://blog.mikemccandless.com 2012/4/11 Gopal Patwa gopalpa...@gmail.com: I have not change the mergefactor, it was 10. Compound index file is disable in my config but I read from below post, that some one had similar issue and it was resolved by switching from compound index file format to non-compound index file. and some folks resolved by changing lucene code to disable MMapDirectory. Is this best practice to do, if so is this can be done in configuration? http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html I have index document of core1 = 5 million, core2=8million and core3=3million and all index are hosted in single Solr instance I am going to use Solr for our site StubHub.com, see attached ls -l list of index files for all core SolrConfig.xml: indexDefaults useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength-- ramBufferSizeMB4096/ramBufferSizeMB maxThreadStates10/maxThreadStates writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout lockTypesingle/lockType mergePolicy class=org.apache.lucene.index.TieredMergePolicy double name=forceMergeDeletesPctAllowed0.0/double double name=reclaimDeletesWeight10.0/double /mergePolicy deletionPolicy class=solr.SolrDeletionPolicy str name=keepOptimizedOnlyfalse/str str name=maxCommitsToKeep0/str /deletionPolicy /indexDefaults updateHandler class=solr.DirectUpdateHandler2 maxPendingDeletes1000/maxPendingDeletes autoCommit maxTime90/maxTime openSearcherfalse/openSearcher /autoCommit autoSoftCommit maxTime${inventory.solr.softcommit.duration:1000}/maxTime /autoSoftCommit /updateHandler Forwarded conversation Subject: Large Index and OutOfMemoryError: Map failed From: Gopal Patwa gopalpa...@gmail.com Date: Fri, Mar 30, 2012 at 10:26 PM To: solr-user@lucene.apache.org I need help!! I am using Solr 4.0 nightly build with NRT and I often get this error during auto commit java.lang.OutOfMemoryError: Map failed. I have search this forum and what I found it is related to OS ulimit setting, please se below my ulimit settings. I am not sure what ulimit setting I should have? 
and we also get java.net.SocketException: Too many open files NOT sure how many open file we need to set? I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3 - 15GB, with Single shard We update the index every 5 seconds, soft commit every 1 second and hard commit every 15 minutes Environment: Jboss 4.2, JDK 1.6 , CentOS, JVM Heap Size = 24GB ulimit: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 401408 max locked memory (kbytes, -l) 1024 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t)
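For reference, the constant Mike mentions earlier in this thread can be checked with a couple of lines, assuming the same Lucene jars that your Solr uses are on the classpath:
[code]
import org.apache.lucene.store.MMapDirectory;

public class CheckUnmap {
    public static void main(String[] args) {
        // true means the JVM lets Lucene unmap memory-mapped index files when closing them
        System.out.println("UNMAP_SUPPORTED = " + MMapDirectory.UNMAP_SUPPORTED);
    }
}
[/code]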
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
On Thu, Apr 12, 2012 at 11:56 AM, Mark Miller markrmil...@gmail.com wrote: Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config Did I fail to find this in google or did I just goad you into a writing job? I'm inclined to write a JIRA asking for _version_ to be configurable just like the uniqueKey in the schema. schema.xml You must have a _version_ field defined: field name=_version_ type=long indexed=true stored=true/ On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote: I didn't have a _version_ field, since nothing in the schema says that it's required! On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote: Hard to say why its not working for you. Start with a fresh Solr and work forward from there or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push delete query*:*/query /delete followed by: commit/ I get no errors, the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems my relevant bit of solrconfig.xml. My URP only implements processAdd. updateRequestProcessorChain name=RNI !-- some day, add parameters when we have some -- processor class=com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory/ processor class=solr.LogUpdateProcessorFactory / processor class=solr.DistributedUpdateProcessorFactory/ processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain !-- activate RNI processing by adding the RNI URP to the chain for xml updates -- requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.chainRNI/str /lst /requestHandler - Mark Miller lucidimagination.com
Re: AW: Lexical analysis tools for German language data
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote: I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme). That is some excellent linguistic jargon. I'll file that with hapax legomenon. If you don't highlight, you can get good results with pretty rough analyzers, but highlighting exposes those, even when they don't affect relevance. For example, you can get good relevance just indexing bigrams in Chinese, but it looks awful when you highlight them. As soon as you highlight, you need a dictionary-based segmenter. wunder -- Walter Underwood wun...@wunderwood.org
Re: AW: Lexical analysis tools for German language data
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote: Le 12 avr. 2012 à 17:46, Michael Ludwig a écrit : Some compounds probably should not be decompounded, like Fahrrad (farhren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Good point. More or less, Fahrrad is generally abbreviated as Rad. (even though Rad can mean wheel and bike) Note that highlighting gets pretty weird when you are matching only part of a word. Guess it'll be a weird when you get it wrong, like Noten in Notentriegelung. This decomposition should not happen because Noten-triegelung does not have a correct second term. The Basis Technology linguistic analyzers aren't cheap or small, but they work well. We will consider our needs and options. Thanks for your thoughts. My question remains as to which domain it aims at covering. We had such need for mathematics texts... I would be pleasantly surprised if, for example, Differenzen-quotient would be decompounded. The HyphenationCompoundWordTokenFilter can do those things but those words must be listed in the dictionary or you'll get strange results. It still yields strange results when it emits tokens that are subwords of a subword. paul -- Markus Jelsma - CTO - Openindex
Re: AW: Lexical analysis tools for German language data
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote: More or less, Fahrrad is generally abbreviated as Rad. (even though Rad can mean wheel and bike) A synonym could handle this, since fahren would not be a good match. It is a judgement call, but this seems more like an equivalence Fahrrad = Rad than decompounding. wunder -- Walter Underwood wun...@wunderwood.org
Re: codecs for sorted indexes
Do you mean you are pre-sorting the documents (by what criteria?) yourself, before adding them to the index? In which case... you should already be seeing some benefits (smaller index size) than had you randomly added them (ie the vInts should take fewer bytes), I think. (Probably the savings would be greater for better intblock codecs like PForDelta, SimpleX, but I'm not sure...). Or do you mean having a codec re-sort the documents (on flush/merge)? I think this should be possible w/ the Codec API... but nobody has tried it yet that I know of. Note that the bulkpostings branch is effectively dead (nobody is iterating on it, and we've removed the old bulk API from trunk), but there is likely a GSoC project to add a PForDelta codec to trunk: https://issues.apache.org/jira/browse/LUCENE-3892 Mike McCandless http://blog.mikemccandless.com On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
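As a rough illustration of the first point above - smaller docid gaps take fewer vInt bytes - here is a standalone Java comparison of a posting list whose ids are clustered (as they tend to be when similar documents are indexed adjacently) versus spread across the whole index. The toy vInt routine and the numbers are only for illustration, not Lucene's actual encoder:
[code]
import java.util.Arrays;
import java.util.Random;

public class VIntGapDemo {
    // bytes a vInt-style encoding (7 payload bits per byte) needs for a non-negative value
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) { value >>>= 7; bytes++; }
        return bytes;
    }

    // postings store ascending doc ids as gaps from the previous id
    static long postingSize(int[] docIds) {
        int[] sorted = docIds.clone();
        Arrays.sort(sorted);
        long total = 0;
        int prev = 0;
        for (int id : sorted) { total += vIntBytes(id - prev); prev = id; }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int maxDoc = 1000000000;
        int[] clustered = new int[100000];
        int[] scattered = new int[100000];
        for (int i = 0; i < clustered.length; i++) {
            clustered[i] = i * 3;                // ids close together, small gaps
            scattered[i] = rnd.nextInt(maxDoc);  // ids spread over the whole index, large gaps
        }
        System.out.println("clustered gaps: " + postingSize(clustered) + " bytes");
        System.out.println("scattered gaps: " + postingSize(scattered) + " bytes");
    }
}
[/code]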
Re: Error
i am using 3.4 solr version... please assist... On Thu, Apr 12, 2012 at 8:41 PM, Erick Erickson erickerick...@gmail.comwrote: Please review: http://wiki.apache.org/solr/UsingMailingLists You haven't said whether, for instance, you're using trunk which is the only version that supports the termfreq function. Best Erick On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari abhishek.tiwari@gmail.com wrote: http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*version=2.2start=0rows=10indent=onsort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc Error : HTTP Status 400 - Missing sort order. Why i am getting error ?
Re: EmbeddedSolrServer and StreamingUpdateSolrServer
On 4/12/2012 4:52 AM, pcrao wrote: I think the index is getting corrupted because StreamingUpdateSolrServer is keeping reference to some index files that are being deleted by EmbeddedSolrServer during commit/optimize process. As a result when I Index(Full) using EmbeddedSolrServer and then do Incremental index using StreamingUpdateSolrServer it fails with a FileNotFound exception. A special note: we don't optimize the index after Incremental indexing(StreamingUpdateSolrServer) but we do optimize it after the Full index(EmbeddedSolrServer). Please see the below log and let me know if you need further information. I am a relative newbie to all this, and I've never used EmbeddedSolrServer, only CommonsHttpSolrServer and StreamingUpdateSolrServer. I'm not even sure the embedded object is an option unless your program is running in the same JVM as Solr. Mine is separate. If I am right about ESS needing to be in the same JVM as Solr, then that means it can do a more direct interaction with Solr and therefore might not be coordinated with the HTTP access that SUSS uses. I have read multiple times that the developers don't recommend using ESS. If you are going to use it, you probably have to do everything with it. SUSS does everything in the background, so you have no guarantees as to when it will happen, as well as no ability to check for completion or errors. Because of the lack of error detection, I had to stop using SUSS. Thanks, Shawn
Re: [Solr 4.0] Is it possible to do soft commit from code and not configuration only
Hi Mark, Thank you for the reply. I tried to normalize the data as in relational databases: - there are some types of documents, where: - documents with the same type have the same fields - documents with different types may have different fields - all documents have a type field and a unique key field id - there is a main type (all records with this type contain pointers to the corresponding records of other types) There is a configuration that defines what information should be stored for each type. When I get new data for indexing, first of all I check whether such a document is already in the index, using facets on the corresponding fields and a query on the relevant type. I add documents to the Solr index without committing from the code, but with autocommit and autoSoftCommit with maxDocs=1 in solrconfig.xml. But here there is a problem: if I add a new record for some type, the searcher doesn't see it immediately. This causes me to get several equal records with the same type but different ids (unique key). If I do a commit from code after each document is added it works OK, but that's not a solution. So I wanted to try to do a soft commit from code after adding documents with a non-main type. I searched the wiki documents but found only commit without parameters and commit with parameters that don't seem to be what I need. Best regards, Lyuba On Thu, Apr 12, 2012 at 6:55 PM, Mark Miller markrmil...@gmail.com wrote: On Apr 12, 2012, at 11:28 AM, Lyuba Romanchuk wrote: Hi, I need to configure the solr so that the opened searcher will see a new document immidiately after it was adding to the index. And I don't want to perform commit each time a new document is added. I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but it didn't help. Can you elaborate on didn't help? You couldn't find any docs unless you did an explicit commit? If that is true and there is no user error, this would be a bug. Is there way to perform soft commit from code in Solr 4.0 ? Yes - check out the wiki docs - I can't remember how it is offhand (I think it was slightly changed recently). Thank you in advance. Best regards, Lyuba - Mark Miller lucidimagination.com
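For what it's worth, my reading of the 4.0 SolrJ API is that the three-argument commit overload can issue a soft commit from code; a minimal sketch, with the server URL and field names made up:
[code]
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SoftCommitSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        doc.addField("type", "secondary");
        server.add(doc);

        // waitFlush=true, waitSearcher=true, softCommit=true:
        // makes the document visible to searchers without a full hard commit
        server.commit(true, true, true);
    }
}
[/code]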
Re: Error
The termfreq function is only valid for trunk. You're using 3.4. Since 'termfreq' is not recognized, Solr gets confused. Best Erick On Thu, Apr 12, 2012 at 10:20 AM, Abhishek tiwari abhishek.tiwari@gmail.com wrote: i am using 3.4 solr version... please assist... On Thu, Apr 12, 2012 at 8:41 PM, Erick Erickson erickerick...@gmail.comwrote: Please review: http://wiki.apache.org/solr/UsingMailingLists You haven't said whether, for instance, you're using trunk which is the only version that supports the termfreq function. Best Erick On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari abhishek.tiwari@gmail.com wrote: http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*version=2.2start=0rows=10indent=onsort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc Error : HTTP Status 400 - Missing sort order. Why i am getting error ?
Re: is there a downside to combining search fields with copyfield?
On 4/12/2012 7:27 AM, geeky2 wrote: currently, my schema has individual fields to search on. are there advantages or disadvantages to taking several of the individual search fields and combining them in to a single search field? would this affect search times, term tokenization or possibly other things. example of individual fields brand category partno example of a single combined search field part_info (would combine brand, category and partno) You end up with one multivalued field, which means that you can only have one analyzer chain. With separate fields, each field can be analyzed differently. Also, if you are indexing and/or storing the individual fields, you may have data duplication in your index, making it larger and increasing your disk/RAM requirements. That field will have a higher termcount than the individual fields, which means that searches against it will naturally be just a little bit slower. Your application will not have to do as much work to construct a query, though. If you are already planning to use dismax/edismax, then you don't need the overhead of a copyField. You can simply provide access to (e)dismax search with the qf (and possibly pf) parameters predefined, or your application can provide these parameters. http://wiki.apache.org/solr/ExtendedDisMax Thanks, Shawn
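To make the edismax suggestion concrete, here is a SolrJ sketch of querying the separate fields instead of a combined copyField; the query text and boost values are arbitrary examples:
[code]
import org.apache.solr.client.solrj.SolrQuery;

public class EdismaxAcrossFields {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery("bosch 18v drill");
        query.set("defType", "edismax");
        // search the individual fields directly, each with its own weight
        query.set("qf", "brand^2.0 category partno^3.0");
        query.set("pf", "brand category partno");
        System.out.println(query); // prints the encoded request parameters
    }
}
[/code]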
Re: Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)
Thanks Shawn, I think I'll stay with the build in. I had problems with Solr Cell, but I could fix it. Greetings, Bastian Am 12. April 2012 18:02 schrieb Shawn Heisey s...@elyograg.org: Bastian, The jetty.xml included with Solr is littered with org.mortbay class references, which are appropriate for Jetty 6. Jetty 7 and 8 use the org.eclipse prefix, and from the very small amount of investigation I did a few weeks ago, have also made other changes to the package names, so you might not be able to simply replace org.mortbay with org.eclipse. The absolutely easiest option would be to just use the jetty included with Solr, not version 8. If you want to keep using Jetty 8, you will need to find/make a new jetty.xml file. If I were set on using Jetty 8 and had to make it work, I would check out trunk (Lucene/Solr 4.0) from the Apache SVN server, find the example jetty.xml there, and use it instead. It's possible that you may need to still make changes, but that is probably the path of least resistance. The jetty version has been upgraded in trunk. Another option would be to download Jetty 6, find its jetty.xml, and compare it with the one in Solr, to find out what the Lucene developers changed from default. Then you would have to take the default jetty.xml from Jetty 8 and make similar changes to make a new config. Apparently Jetty 8 no longer supports JSP with the JRE, so you're probably going to need the JDK. The developers have eliminated JSP from trunk, so it will still work with the JRE. Thanks, Shawn
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
google must not have found it - i put that in a month or so ago I believe - at least weeks. As you can see, there is still a bit to fill in, but it covers the high level. I'd like to add example snippets for the rest soon. On Thu, Apr 12, 2012 at 12:04 PM, Benson Margulies bimargul...@gmail.comwrote: On Thu, Apr 12, 2012 at 11:56 AM, Mark Miller markrmil...@gmail.com wrote: Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config Did I fail to find this in google or did I just goad you into a writing job? I'm inclined to write a JIRA asking for _version_ to be configurable just like the uniqueKey in the schema. schema.xml You must have a _version_ field defined: field name=_version_ type=long indexed=true stored=true/ On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote: I didn't have a _version_ field, since nothing in the schema says that it's required! On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote: Hard to say why its not working for you. Start with a fresh Solr and work forward from there or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push delete query*:*/query /delete followed by: commit/ I get no errors, the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems my relevant bit of solrconfig.xml. My URP only implements processAdd. updateRequestProcessorChain name=RNI !-- some day, add parameters when we have some -- processor class=com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory/ processor class=solr.LogUpdateProcessorFactory / processor class=solr.DistributedUpdateProcessorFactory/ processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain !-- activate RNI processing by adding the RNI URP to the chain for xml updates -- requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.chainRNI/str /lst /requestHandler - Mark Miller lucidimagination.com -- - Mark http://www.lucidimagination.com
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
: Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config : : schema.xml : : You must have a _version_ field defined: : : field name=_version_ type=long indexed=true stored=true/ Seems like this is the kind of thing that should make Solr fail hard and fast on SolrCore init if it sees you are running in cloud mode and yet it doesn't find this -- similar to how some other features fail hard and fast if you don't have uniqueKey. -Hoss
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
Dmitry, The last NPE in HighlightingComponent is just a sad coding issue. few rows later we can see that developer expected to have some docs not found // remove nulls in case not all docs were able to be retrieved rb.rsp.add(highlighting, SolrPluginUtils.removeNulls(new SimpleOrderedMap(arr))); But as you already know he forgot to check if(sdoc!=null){. Is there anything that stopping you from contributing the patch, beside of the lack of time, of course? about the core issue I can't get into it and, particularly, how the using disjunction query in place of IDS can help you. Could you please provide more detailed info like stacktraces, etc. Btw, have you checked trunk for your case? On Thu, Apr 12, 2012 at 7:08 PM, Dmitry Kan dmitry@gmail.com wrote: Can anyone help me out with this? Is this too complicated / unclear? I could share more detail if needed. On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan dmitry@gmail.com wrote: Hello, Hopefully this question is not too complex to handle, but I'm currently stuck with it. We have a system with nTiers, that is: Solr front base --- Solr front -- shards Inside QueryComponent there is a method createRetrieveDocs(ResponseBuilder rb) which collects doc ids of each shard and sends them in different queries using the ids parameter: [code] sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ',')); [/code] This actually produces NPE (same as in https://issues.apache.org/jira/browse/SOLR-1477) in the first tier, because Solr front (on the second tier) fails to process such a query. I have tried to fix this by using a unique field with a value of ids ORed (the following code substitutes the code above): [code] StringBuffer idsORed = new StringBuffer(); for (IteratorString iterator = ids.iterator(); iterator.hasNext(); ) { String next = iterator.next(); if (iterator.hasNext()) { idsORed.append(next).append( OR ); } else { idsORed.append(next); } } sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(), idsORed.toString()); [/code] This works perfectly if for rows=n there is n or less hits from a distributed query. 
However, if there are more than 2*n hits, the querying fails with an NPE in a completely different component, which is HighlightComponent (highlights are requested in the same query with hl=truehl.fragsize=5hl.requireFieldMatch=truehl.fl=targetTextField): SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:619) It sounds like the ids of documents somehow get shuffled and the instruction (only a hypothesis) [code] ShardDoc sdoc = rb.resultIds.get(id); [/code] returns sdoc=null, which causes the next line of code to fail with an NPE: [code] int idx = sdoc.positionInResponse; [/code] Am I missing anything? Can something be done for solving this issue? Thanks. -- Regards, Dmitry Kan -- Regards, Dmitry Kan -- Sincerely yours Mikhail Khludnev ge...@yandex.ru http://www.griddynamics.com mkhlud...@griddynamics.com
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
On Wed, Apr 11, 2012 at 8:16 AM, Dmitry Kan dmitry@gmail.com wrote: We have a system with nTiers, that is: Solr front base --- Solr front -- shards Although the architecture had this in mind (multi-tier), all of the pieces are not yet in place to allow it. The errors you see are a direct result of that. -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
RE: solr 3.5 taking long to index
Thanks for pointing these out, but I still have one concern, why is the Virtual Memory running in 300g+? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] Sent: 12 April 2012 11:58 To: solr-user@lucene.apache.org Subject: Re: solr 3.5 taking long to index There were some changes in solrconfig.xml between solr3.1 and solr3.5. Always read CHANGES.txt when switching to a new version. Also helpful is comparing both versions of solrconfig.xml from the examples. Are you sure you need a MaxPermSize of 5g? Use jvisualvm to see what you really need. This is also for all other JAVA_OPTS. Am 11.04.2012 19:42, schrieb Rohit: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit is taking very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg
RE: Solr 3.5 takes very long to commit gradually
Thanks for pointing these out, but I still have one concern, why is the Virtual Memory running in 300g+? Regards, Rohit -Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: 12 April 2012 13:43 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually thanks Rohit.. for the information. On Apr 12, 2012, at 4:08 AM, Rohit wrote: Hi Tirthankar, The average size of documents would be a few Kb's this is mostly tweets which are being saved. The two cores are storing different kind of data and nothing else. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: 12 April 2012 13:14 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. * **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
Re: term frequency outweighs exact phrase match
In that case documents 1 and 2 will not be in the results. We need them also be shown in the results but be ranked after those docs with exact match. I think omitting term frequency in calculating ranking in phrase queries will solve this issue, but I do not see that such a parameter in configs. I see omitTermFreqAndPositions=true but not sure if it is the setting I need, because its description is too vague. Thanks. Alex. -Original Message- From: Erick Erickson erickerick...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Wed, Apr 11, 2012 8:23 am Subject: Re: term frequency outweighs exact phrase match Consider boosting on phrase with a SHOULD clause, something like field:apache solr^2.. Best Erick On Tue, Apr 10, 2012 at 12:46 PM, alx...@aim.com wrote: Hello, I use solr 3.5 with edismax. I have the following issue with phrase search. For example if I have three documents with content like 1.apache apache 2. solr solr 3.apache solr then search for apache solr displays documents in the order 1,.2,3 instead of 3, 2, 1 because term frequency in the first and second documents is higher than in the third document. We want results be displayed in the order as 3,2,1 since the third document has exact match. My request handler is as follows. requestHandler name=search class=solr.SearchHandler lst name=defaults str name=defTypeedismax/str str name=echoParamsexplicit/str float name=tie0.01/float str name=qfhost^30 content^0.5 title^1.2/str str name=pfhost^30 content^20 title^22 /str str name=flurl,id, site ,title/str str name=mm2lt;-1 5lt;-2 6lt;90%/str int name=ps1/int bool name=hltrue/bool str name=q.alt*:*/str str name=hl.flcontent/str str name=f.title.hl.fragsize0/str str name=hl.fragsize165/str str name=f.title.hl.alternateFieldtitle/str str name=f.url.hl.fragsize0/str str name=f.url.hl.alternateFieldurl/str str name=f.content.hl.fragmenterregex/str str name=spellchecktrue/str str name=spellcheck.collatetrue/str str name=spellcheck.count5/str str name=grouptrue/str str name=group.fieldsite/str str name=group.ngroupstrue/str /lst arr name=last-components strspellcheck/str /arr /requestHandler Any ideas how to fix this issue? Thanks in advance. Alex.
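For reference, omitTermFreqAndPositions is set per field (or per field type) in schema.xml; a minimal sketch, assuming a content field of a hypothetical text_general type:

    <!-- stops term frequency from contributing to the score, but also discards positions -->
    <field name="content" type="text_general" indexed="true" stored="true"
           omitTermFreqAndPositions="true"/>

The catch is that without positions, phrase queries (including the pf boosts in the handler above) will no longer work against that field. Erick's alternative of adding the phrase as an optional boosting clause, e.g. content:"apache solr"^2 alongside the plain terms, keeps positions intact and only changes the ranking.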
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
Mikhail, Thanks for sharing your thoughts. Yes I have tried checking for NULL and the entire chain of queries between tiers seems to work. But I suspect, that some docs will be missing. In principle, unless there is an OutOfMemory or a shard down, the doc ids should be retrieving valid documents. So this is just a design, as Yonik pointed out. I would be willing to contribute a patch, it is just an issue of understanding what exactly should be fixed in the architecture, and I suspect it isn't a small change.. Dmitry On Thu, Apr 12, 2012 at 9:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Dmitry, The last NPE in HighlightingComponent is just a sad coding issue. few rows later we can see that developer expected to have some docs not found // remove nulls in case not all docs were able to be retrieved rb.rsp.add(highlighting, SolrPluginUtils.removeNulls(new SimpleOrderedMap(arr))); But as you already know he forgot to check if(sdoc!=null){. Is there anything that stopping you from contributing the patch, beside of the lack of time, of course? about the core issue I can't get into it and, particularly, how the using disjunction query in place of IDS can help you. Could you please provide more detailed info like stacktraces, etc. Btw, have you checked trunk for your case? On Thu, Apr 12, 2012 at 7:08 PM, Dmitry Kan dmitry@gmail.com wrote: Can anyone help me out with this? Is this too complicated / unclear? I could share more detail if needed. On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan dmitry@gmail.com wrote: Hello, Hopefully this question is not too complex to handle, but I'm currently stuck with it. We have a system with nTiers, that is: Solr front base --- Solr front -- shards Inside QueryComponent there is a method createRetrieveDocs(ResponseBuilder rb) which collects doc ids of each shard and sends them in different queries using the ids parameter: [code] sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ',')); [/code] This actually produces NPE (same as in https://issues.apache.org/jira/browse/SOLR-1477) in the first tier, because Solr front (on the second tier) fails to process such a query. I have tried to fix this by using a unique field with a value of ids ORed (the following code substitutes the code above): [code] StringBuffer idsORed = new StringBuffer(); for (IteratorString iterator = ids.iterator(); iterator.hasNext(); ) { String next = iterator.next(); if (iterator.hasNext()) { idsORed.append(next).append( OR ); } else { idsORed.append(next); } } sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(), idsORed.toString()); [/code] This works perfectly if for rows=n there is n or less hits from a distributed query. 
However, if there are more than 2*n hits, the querying fails with an NPE in a completely different component, which is HighlightComponent (highlights are requested in the same query with hl=truehl.fragsize=5hl.requireFieldMatch=truehl.fl=targetTextField): SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at
Wildcard searching
Hi, I am using the edismax query handler with solr 3.5. From the Solr admin interface when i do a wildcard search with the string: edge*, all documents are returned with exactly the same score. When i do the same search from my application using SolrJ to the same solr instance, only a few documents have the same maximum score and all the rest have the minimum score. I was expecting all to have the same score just like in the Solr Admin. Any pointers why this is happening? Thanks.
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
Thanks Yonik, This is what I expected. How big the change would be, if I'd start just with Query and Highlight components? Did the change to QueryComponent I made make any sense to you? It would of course mean a custom solution, which I'm willing to contribute as a patch (in case anyone interested). To make it part of a releasable trunk, one would most probably need to provide some way to configure 1st tier level. Thanks, Dmitry On Thu, Apr 12, 2012 at 9:34 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Wed, Apr 11, 2012 at 8:16 AM, Dmitry Kan dmitry@gmail.com wrote: We have a system with nTiers, that is: Solr front base --- Solr front -- shards Although the architecture had this in mind (multi-tier), all of the pieces are not yet in place to allow it. The errors you see are a direct result of that. -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10 -- Regards, Dmitry Kan
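As an aside, the workaround snippet Dmitry posted earlier in this thread lost its generics in the archive (IteratorString should read Iterator<String>). Restored, the idea of swapping the ids parameter for an OR over the unique key looks like this (a fragment from inside QueryComponent.createRetrieveDocs, as in the original post):

    // requires java.util.Iterator; ids, sreq and rb come from the surrounding method
    StringBuffer idsORed = new StringBuffer();
    for (Iterator<String> iterator = ids.iterator(); iterator.hasNext(); ) {
        String next = iterator.next();
        if (iterator.hasNext()) {
            idsORed.append(next).append(" OR ");
        } else {
            idsORed.append(next);
        }
    }
    sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(), idsORed.toString());

As Yonik says above, this only changes how the second-tier request is phrased; the remaining multi-tier gaps (the HighlightComponent NPE among them) are elsewhere.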
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
I think someone already made a JIRA issue like that. I think Yonik might have had an opinion about it that I cannot remember right now. On Thu, Apr 12, 2012 at 2:21 PM, Chris Hostetter hossman_luc...@fucit.orgwrote: : Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config : : schema.xml : : You must have a _version_ field defined: : : field name=_version_ type=long indexed=true stored=true/ Seems like this is the kind of thing that should make Solr fail hard and fast on SolrCore init if it sees you are running in cloud mode and yet it doesn't find this -- similar to how some other features fail hard and fast if you don't have uniqueKey. -Hoss -- - Mark http://www.lucidimagination.com
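For reference, the required config quoted above reads like this once its markup is restored, together with the update log that the later messages in this thread mention (the dir value is the stock example placeholder):

    <!-- schema.xml -->
    <field name="_version_" type="long" indexed="true" stored="true"/>

    <!-- solrconfig.xml -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.data.dir:}</str>
      </updateLog>
    </updateHandler>

Per Yonik's note further down, _version_ backs update distribution from leaders to replicas as well as realtime-get and optimistic locking.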
Re: Wildcard searching
Correction, this difference between Solr admin scores and SolrJ scores happens with leading wildcard queries, e.g. *edge. On Thu, Apr 12, 2012 at 8:13 PM, Kissue Kissue kissue...@gmail.com wrote: Hi, I am using the edismax query handler with solr 3.5. From the Solr admin interface when i do a wildcard search with the string: edge*, all documents are returned with exactly the same score. When i do the same search from my application using SolrJ to the same solr instance, only a few documents have the same maximum score and all the rest have the minimum score. I was expecting all to have the same score just like in the Solr Admin. Any pointers why this is happening? Thanks.
Re: is there a downside to combining search fields with copyfield?
You end up with one multivalued field, which means that you can only have one analyzer chain. actually two of the three fields being considered for combination in to a single field ARE multivalued fields. would this be an issue? With separate fields, each field can be analyzed differently. Also, if you are indexing and/or storing the individual fields, you may have data duplication in your index, making it larger and increasing your disk/RAM requirements. this makes sense That field will have a higher termcount than the individual fields, which means that searches against it will naturally be just a little bit slower. ok Your application will not have to do as much work to construct a query, though. actually this is the primary reason this came up. If you are already planning to use dismax/edismax, then you don't need the overhead of a copyField. You can simply provide access to (e)dismax search with the qf (and possibly pf) parameters predefined, or your application can provide these parameters. http://wiki.apache.org/solr/ExtendedDisMax can you elaborate on this and how EDisMax would preclude the need for copyfield? i am using extended dismax now in my response handlers. here is an example of one of my requestHandlers requestHandler name=partItemNoSearch class=solr.SearchHandler default=false lst name=defaults str name=defTypeedismax/str str name=echoParamsall/str int name=rows5/int str name=qfitemNo^1.0/str str name=q.alt*:*/str /lst lst name=appends str name=fqitemType:1/str str name=sortrankNo asc, score desc/str /lst lst name=invariants str name=facetfalse/str /lst /requestHandler Thanks, Shawn -- View this message in context: http://lucene.472066.n3.nabble.com/is-there-a-downside-to-combining-search-fields-with-copyfield-tp3905349p3906265.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Suggester not working for digit starting terms
Well now I am really lost... 1. yes I want to suggest whole sentences too, I want the tokenizer to be taken into account, and apparently it is working for me in 3.5.0?? I get suggestions that are like foo bar abc. Maybe what you mention is only for file based dictionaries? I am using the field itself. 2. but for the digit issue, in that case nothing is suggested, not even the term 500 that is there cause I can find it with this query http://localhost:8983/solr/select/?q={!prefix f=a_suggest}500 I tried to set threshold to 0 in case the term was being removed, and is not that. Moving to 3.6.0 is not a problem (I had already downloaded the rc actually) but I still see weird things here. xab -- View this message in context: http://lucene.472066.n3.nabble.com/Suggester-not-working-for-digit-starting-terms-tp3893433p3906303.html Sent from the Solr - User mailing list archive at Nabble.com.
searching across multiple fields using edismax - am i setting this up right?
hello all, i just want to check to make sure i have this right. i was reading on this page: http://wiki.apache.org/solr/ExtendedDisMax, thanks to shawn for educating me. *i want the user to be able to fire a requestHandler but search across multiple fields (itemNo, productType and brand) WITHOUT them having to specify in the query url what fields they want / need to search on* this is what i have in my request handler requestHandler name=partItemNoSearch class=solr.SearchHandler default=false lst name=defaults str name=defTypeedismax/str str name=echoParamsall/str int name=rows5/int *str name=qfitemNo^1.0 productType^.8 brand^.5/str* str name=q.alt*:*/str /lst lst name=appends str name=sortrankNo asc, score desc/str /lst lst name=invariants str name=facetfalse/str /lst /requestHandler this would be an example of a single term search going against all three of the fields http://bogus:bogus/somecore/select?qt=partItemNoSearchq=*dishwasher*debugQuery=onrows=100 this would be an example of a multiple term search across all three of the fields http://bogus:bogus/somecore/select?qt=partItemNoSearchq=*dishwasher 123-xyz*debugQuery=onrows=100 do i understand this correctly? thank you, mark -- View this message in context: http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3906334.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Responding to Requests with Chunks/Streaming
Hello Developers, I just want to ask don't you think that response streaming can be useful for things like OLAP, e.g. is you have sharded index presorted and pre-joined by BJQ way you can calculate counts in many cube cells in parallel? Essential distributed test for response streaming just passed. https://github.com/m-khl/solr-patches/blob/ec4db7c0422a5515392a7019c5bd23ad3f546e4b/solr/core/src/test/org/apache/solr/response/RespStreamDistributedTest.java branch is https://github.com/m-khl/solr-patches/tree/streaming Regards On Mon, Apr 2, 2012 at 10:55 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello, Small update - reading streamed response is done via callback. No SolrDocumentList in memory. https://github.com/m-khl/solr-patches/tree/streaming here is the test https://github.com/m-khl/solr-patches/blob/d028d4fabe0c20cb23f16098637e2961e9e2366e/solr/core/src/test/org/apache/solr/response/ResponseStreamingTest.java#L138 no progress in distributed search via streaming yet. Pls let me know if you don't want to have updates from my playground. Regards On Thu, Mar 29, 2012 at 1:02 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: @All Why nobody desires such a pretty cool feature? Nicholas, I have a tiny progress: I'm able to stream in javabin codec format while searching, It implies sorting by _docid_ here is the diff https://github.com/m-khl/solr-patches/commit/2f9ff068c379b3008bb983d0df69dff714ddde95 The current issue is that reading response by SolrJ is done as whole. Reading by callback is supported by EmbeddedServer only. Anyway it should not a big deal. ResponseStreamingTest.java somehow works. I'm stuck on introducing response streaming in distributes search, it's actually more challenging - RespStreamDistributedTest fails Regards On Fri, Mar 16, 2012 at 3:51 PM, Nicholas Ball nicholas.b...@nodelay.com wrote: Mikhail Ludovic, Thanks for both your replies, very helpful indeed! Ludovic, I was actually looking into just that and did some tests with SolrJ, it does work well but needs some changes on the Solr server if we want to send out individual documents a various times. This could be done with a write() and flush() to the FastOutputStream (daos) in JavBinCodec. I therefore think that a combination of this and Mikhail's solution would work best! Mikhail, you mention that your solution doesn't currently work and not sure why this is the case, but could it be that you haven't flushed the data (os.flush()) you've written in the collect method of DocSetStreamer? I think placing the output stream into the SolrQueryRequest is the way to go, so that we can access it and write to it how we intend. However, I think using the JavaBinCodec would be ideal so that we can work with SolrJ directly, and not mess around with the encoding of the docs/data etc... At the moment the entry point to JavaBinCodec is through the BinaryResponseWriter which calls the highest level marshal() method which decodes and sends out the entire SolrQueryResponse (line 49 @ BinaryResponseWriter). What would be ideal is to be able to break up the response and call the JavaBinCodec for pieces of it with a flush after each call. Did a few tests with a simple Thread.sleep and a flush to see if this would actually work and looks like it's working out perfectly. Just trying to figure out the best way to actually do it now :) any ideas? An another note, for a solution to work with the chunked transfer encoding (and therefore web browsers), a lot more development is going to be needed. 
Not sure if it's worth trying yet but might look into it later down the line. Nick On Fri, 16 Mar 2012 07:29:20 +0300, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Ludovic, I looked through. First of all, it seems to me you don't amend regular servlet solr server, but the only embedded one. Anyway, the difference is that you stream DocList via callback, but it means that you've instantiated it in memory and keep it there until it will be completely consumed. Think about a billion numfound. Core idea of my approach is keep almost zero memory for response. Regards On Fri, Mar 16, 2012 at 12:12 AM, lboutros boutr...@gmail.com wrote: Hi, I was looking for something similar. I tried this patch : https://issues.apache.org/jira/browse/SOLR-2112 it's working quite well (I've back-ported the code in Solr 3.5.0...). Is it really different from what you are trying to achieve ? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Responding-to-Requests-with-Chunks-Streaming-tp3827316p3829909.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev ge...@yandex.ru http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev ge...@yandex.ru
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
On Thu, Apr 12, 2012 at 2:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config : : schema.xml : : You must have a _version_ field defined: : : field name=_version_ type=long indexed=true stored=true/ Seems like this is the kind of thing that should make Solr fail hard and fast on SolrCore init if it sees you are running in cloud mode and yet it doesn't find this -- similar to how some other features fail hard and fast if you don't have uniqueKey. Off the top of my head: _version_ is needed for solr cloud where a leader forwards updates to replicas, unless you're handing update distribution yourself or providing pre-built shards. _version_ is needed for realtime-get and optimistic locking We should document for sure... but at this point it's not clear what we should enforce. (not saying we shouldn't enforce anything... just that I haven't really thought about it) -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
[ANNOUNCE] Apache Solr 3.6 released
12 April 2012, Apache Solr™ 3.6.0 available The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0. Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see note below). See the CHANGES.txt file included with the release for a full list of details. Solr 3.6.0 Release Highlights: * New SolrJ client connector using Apache Http Components http client (SOLR-2020) * Many analyzer factories are now multi term query aware allowing for things like field type aware lowercasing when building prefix wildcard queries. (SOLR-2438) * New Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation. (SOLR-3056) * Range Faceting (Dates Numbers) is now supported in distributed search (SOLR-1709) * HTMLStripCharFilter has been completely re-implemented, fixing many bugs and greatly improving the performance (LUCENE-3690) * StreamingUpdateSolrServer now supports the javabin format (SOLR-1565) * New LFU Cache option for use in Solr's internal caches. (SOLR-2906) * Memory performance improvements to all FST based suggesters (SOLR-2888) * New WFSTLookupFactory suggester supports finer-grained ranking for suggestions. (LUCENE-3714) * New options for configuring the amount of concurrency used in distributed searches (SOLR-3221) * Many bug fixes Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Lucene/Solr developers
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
: Off the top of my head: : _version_ is needed for solr cloud where a leader forwards updates to : replicas, unless you're handing update distribution yourself or : providing pre-built shards. : _version_ is needed for realtime-get and optimistic locking : : We should document for sure... but at this point it's not clear what : we should enforce. (not saying we shouldn't enforce anything... just : that I haven't really thought about it) well ... it may eventually make sense to globally enforce it for consistency, but in the meantime the individual components that depend on it can certainly enforce it (just like my uniqueKey example; the search components that require it check for themselves on init and fail fast) (ie: sounds like the RealTimeGetHandler and the existing DistributedUpdateProcessor should fail fast on init if the schema doesn't have it) -Hoss
RE: [ANNOUNCE] Apache Solr 3.6 released
I think this page needs updating... it says it's not out yet. https://wiki.apache.org/solr/Solr3.6 -Original Message- From: Robert Muir [mailto:rm...@apache.org] Sent: Thursday, April 12, 2012 1:33 PM To: d...@lucene.apache.org; solr-user@lucene.apache.org; Lucene mailing list; announce Subject: [ANNOUNCE] Apache Solr 3.6 released
Re: [ANNOUNCE] Apache Solr 3.6 released
Hi, Just edit it! its a wiki page anyone can edit! There are probably other out of date ones too On Thu, Apr 12, 2012 at 5:57 PM, Robert Petersen rober...@buy.com wrote: I think this page needs updating... it says it's not out yet. https://wiki.apache.org/solr/Solr3.6 -- lucidimagination.com
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
I'm probably confused, but it seems to me that the case I hit does not meet any of Yonik's criteria. I have no replicas. I'm running SolrCloud in the simple mode where each doc ends up in exactly one place. I think that it's just a bug that the code refuses to do the local deletion when there's no version info. However, if I am confused, it sure seems like a candidate for the 'at least throw instead of failing silently' policy.
Re: codecs for sorted indexes
Hello Michael, Yes, we are pre-sorting the documents before adding them to the index. We have a score associated to every document (not an IR score but a document-related score that reflects its importance). Therefore, the document with the biggest score will have the lowest docid (we add it first to the index). We do this in order to apply early termination effectively. With the actual coded, we haven't seen much of a difference in terms of space when we have the index sorted vs not sorted. So, the question would be: if we force the docids to be sorted, what is the best way to encode them?. We don't really care if the codec doesn't work for cases where the documents are not sorted (i.e. if it throws an exception if documents are not ordered when creating the index). Our idea here is that it may be possible to trade off generality but achieve very significant improvements for the specific case. Would something along the lines of RLE coding work? i.e. if we have to store docids 1 to 1500, we can represent it as 1::1499 (it would be 2 ints to represent 1500 docids). Thanks a lot for your help, Carlos On Thu, Apr 12, 2012 at 6:19 PM, Michael McCandless luc...@mikemccandless.com wrote: Do you mean you are pre-sorting the documents (by what criteria?) yourself, before adding them to the index? In which case... you should already be seeing some benefits (smaller index size) than had you randomly added them (ie the vInts should take fewer bytes), I think. (Probably the savings would be greater for better intblock codecs like PForDelta, SimpleX, but I'm not sure...). Or do you mean having a codec re-sort the documents (on flush/merge)? I think this should be possible w/ the Codec API... but nobody has tried it yet that I know of. Note that the bulkpostings branch is effectively dead (nobody is iterating on it, and we've removed the old bulk API from trunk), but there is likely a GSoC project to add a PForDelta codec to trunk: https://issues.apache.org/jira/browse/LUCENE-3892 Mike McCandless http://blog.mikemccandless.com On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
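As a rough illustration of the run-length idea Carlos describes (a standalone sketch of the encoding only, not a Lucene codec), a sorted, duplicate-free postings block can be written as (start, length) pairs, so a dense run like docids 1 to 1500 collapses to two ints:

    // toy RLE over a sorted docid array; needs java.util.List and java.util.ArrayList
    static List<int[]> runLengthEncode(int[] sortedDocIds) {
        List<int[]> runs = new ArrayList<int[]>();
        int i = 0;
        while (i < sortedDocIds.length) {
            int start = sortedDocIds[i];
            int len = 1;
            // extend the run while the next docid is consecutive
            while (i + len < sortedDocIds.length && sortedDocIds[i + len] == start + len) {
                len++;
            }
            runs.add(new int[] { start, len });
            i += len;
        }
        return runs;
    }

For docids 1..1500 this yields the single pair {1, 1500}; how much it saves in practice depends entirely on how dense the runs are after sorting, which matches the later reply in this thread about grouping related documents so the deltas stay small.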
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
On Thu, Apr 12, 2012 at 2:14 PM, Mark Miller markrmil...@gmail.com wrote: google must not have found it - i put that in a month or so ago I believe - at least weeks. As you can see, there is still a bit to fill in, but it covers the high level. I'd like to add example snippets for the rest soon. Mark, is it all true? I don't have an update log or a replication handler, and neither does the default, and it all works fine in the simple case from the top of that wiki page.
Re: is there a downside to combining search fields with copyfield?
On 4/12/2012 1:37 PM, geeky2 wrote: can you elaborate on this and how EDisMax would preclude the need for copyfield? i am using extended dismax now in my response handlers. here is an example of one of my requestHandlers requestHandler name=partItemNoSearch class=solr.SearchHandler default=false lst name=defaults str name=defTypeedismax/str str name=echoParamsall/str int name=rows5/int str name=qfitemNo^1.0/str str name=q.alt*:*/str /lst lst name=appends str name=fqitemType:1/str str name=sortrankNo asc, score desc/str /lst lst name=invariants str name=facetfalse/str /lst /requestHandler I'm not sure whether or not you can use a multiValued field as the source for copyField. This is the sort of thing that the devs tend to think of, so my initial thought would be that it should work, though I would definitely test it to be absolutely sure. Your request handler above has qf set to include the field called itemNo. If you made another that had the following in it, you could do without a copyField, by using that request handler. You would want to customize the field boosts: str name=qfbrand^2.0 category^3.0 partno/str To really leverage edismax, assuming that you are using a tokenizer that splits any of these fields into multiple tokens, and that you want to use relevancy ranking, you might want to consider defining pf as well. Some observations about your handler above... you are free to ignore this: I believe that you don't really need the ^1.0 that's in qf, because there's only one field, and 1.0 is the default boost. Also, from what I can tell, because you are only using one qf field and are not using any of the dismax-specific goodies like pf or mm, you don't really need edismax at all here. If I'm right, to remove edismax, just specify itemNo as the value for the df parameter (default field) and remove the defType. The q.alt parameter might also need to come out. Solr 3.6 (should be released soon) has deprecated the defaultSearchField and defaultOperator parameters in schema.xml, the df and q.op handler parameters are the replacement. This will be enforced in Solr 4.0. http://wiki.apache.org/solr/SearchHandler#Query_Params Thanks, Shawn
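A sketch of the single-field simplification Shawn describes at the end, using the same handler and field names from the earlier post (whether q.alt stays depends on whether the match-all fallback is still wanted, and this assumes the df default is honored as Shawn describes):

    <requestHandler name="partItemNoSearch" class="solr.SearchHandler" default="false">
      <lst name="defaults">
        <str name="echoParams">all</str>
        <int name="rows">5</int>
        <str name="df">itemNo</str>   <!-- replaces defType=edismax plus qf=itemNo^1.0 -->
      </lst>
      <lst name="appends">
        <str name="fq">itemType:1</str>
        <str name="sort">rankNo asc, score desc</str>
      </lst>
      <lst name="invariants">
        <str name="facet">false</str>
      </lst>
    </requestHandler>

For the multi-field case, keeping edismax with a qf along the lines of Shawn's brand/category/partno example is the simpler route, since df only names a single field.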
Re: Solr Scoring
No, I don't think there's an OOB way to make this happen. It's a recurring theme, make exact matches score higher than stemmed matches. Best Erick On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com wrote: Hi, I have a field in my index called itemDesc which i am applying EnglishMinimalStemFilterFactory to. So if i index a value to this field containing Edges, the EnglishMinimalStemFilterFactory applies stemming and Edges becomes Edge. Now when i search for Edges, documents with Edge score better than documents with the actual search word - Edges. Is there a way i can make documents with the actual search word in this case Edges score better than document with Edge? I am using Solr 3.5. My field definition is shown below: fieldType name=text_en class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer /fieldType Thanks.
Re: Solr Scoring
It is easy. Create two fields, text_exact and text_stem. Don't use the stemmer in the first chain, do use the stemmer in the second. Give the text_exact a bigger weight than text_stem. wunder On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote: No, I don't think there's an OOB way to make this happen. It's a recurring theme, make exact matches score higher than stemmed matches. Best Erick On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com wrote: Hi, I have a field in my index called itemDesc which i am applying EnglishMinimalStemFilterFactory to. So if i index a value to this field containing Edges, the EnglishMinimalStemFilterFactory applies stemming and Edges becomes Edge. Now when i search for Edges, documents with Edge score better than documents with the actual search word - Edges. Is there a way i can make documents with the actual search word in this case Edges score better than document with Edge? I am using Solr 3.5. My field definition is shown below: fieldType name=text_en class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer /fieldType Thanks.
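A sketch of the two-field setup Walter describes, applied to the itemDesc case above (names are made up; text_en_exact would be the posted text_en chain with the EnglishMinimalStemFilterFactory line removed):

    <field name="itemDescStem"  type="text_en"       indexed="true" stored="false"/>
    <field name="itemDescExact" type="text_en_exact" indexed="true" stored="false"/>
    <copyField source="itemDesc" dest="itemDescStem"/>
    <copyField source="itemDesc" dest="itemDescExact"/>

Then weight the exact field higher at query time, for example qf=itemDescExact^4 itemDescStem with (e)dismax. A document containing Edges picks up the extra exact-field match and outscores one that only matches through the stem Edge, while both still match.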
Re: two structures in solr
You have to take off your DB hat when using Solr G... There is no problem at all having documents in the same index that are of different types. There is no penalty for field definitions that aren't used. That is, you can easily have two different types of documents in the same index. It's all about simply populating the two types of documents with different fields. in your case, I suspect you'll have a type field with two valid values, project and contractor or some such. Then just attach a filter query depending on what you want, i.e. fq=type:project or fq=type:contractor and your searches will be restricted to the proper documents. Best Erick On Thu, Apr 12, 2012 at 5:41 AM, tkoomzaaskz tomasz.du...@gmail.com wrote: Hi all, I'm a solr newbie, so sorry if I do anything wrong ;) I want to use SOLR not only for fast text search, but mainly to create a very fast search engine for a high-traffic system (MySQL would not do the job if the db grows too big). I need to store *two big structures* in SOLR: projects and contractors. Contractors will search for available projects and project owners will search for contractors who would do it for them. So far, I have found a solr tutorial for newbies http://www.solrtutorial.com, where I found the schema file which defines the data structure: http://www.solrtutorial.com/schema-xml.html. But my case is that *I want to have two structures*. I guess running two parallel solr instances is not the idea. I took a look at http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup and I can see that the schema goes like: ?xml version=1.0 encoding=UTF-8 ? schema name=example version=1.5 types ... /types fields field name=id type=string indexed=true stored=true required=true / field name=sku type=text_en_splitting_tight indexed=true stored=true omitNorms=true/ field name=name type=text_general indexed=true stored=true/ field name=alphaNameSort type=alphaOnlySort indexed=true stored=false/ ... /fields /schema But still, this is a single structure. And I need 2. Great thanks in advance for any help. There are not many tutorials for SOLR in the web. -- View this message in context: http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3905143.html Sent from the Solr - User mailing list archive at Nabble.com.
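A sketch of the single-schema layout Erick describes, with hypothetical field names (fields a given document type doesn't use are simply left unset and cost nothing):

    <field name="id"     type="string"       indexed="true" stored="true" required="true"/>
    <field name="type"   type="string"       indexed="true" stored="true"/>
    <field name="title"  type="text_general" indexed="true" stored="true"/>
    <field name="skills" type="text_general" indexed="true" stored="true" multiValued="true"/>

Index each project with type=project and each contractor with type=contractor, then filter per use case: /select?q=plumbing&fq=type:project for contractors browsing projects, and /select?q=plumbing&fq=type:contractor for project owners looking for people. The filter query is cached separately, so the restriction stays cheap.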
Re: solr 3.5 taking long to index
On 4/12/2012 12:42 PM, Rohit wrote: Thanks for pointing these out, but I still have one concern, why is the Virtual Memory running in 300g+? Solr 3.5 uses MMapDirectoryFactory by default to read the index. This does an mmap on the files that make up your index, so their entire contents are simply accessible to the application as virtual memory (over 300GB in your case); the OS automatically takes care of swapping disk pages in and out of real RAM as required. This approach has less overhead and tends to make better use of the OS disk cache than other methods. It does lead to confused questions and scary numbers in memory usage reporting, though. You have mentioned that you are giving 36GB of RAM to Solr. How much total RAM does the machine have? Thanks, Shawn
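For anyone wanting to confirm or change this, the implementation is picked by the directoryFactory element in solrconfig.xml; a sketch along the lines of the stock example (the system-property indirection is just the example's convention):

    <!-- the default resolves to a memory-mapped directory on 64-bit JVMs, hence the large
         virtual size reported by top; naming solr.MMapDirectoryFactory or another factory
         explicitly pins the choice -->
    <directoryFactory name="DirectoryFactory"
                      class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

Either way, the 300GB+ figure is mapped address space, not resident memory, so on its own it is not a problem.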
Re: Dismax request handler differences Between Solr Version 3.5 and 1.4
Then I suspect your solrconfig is different or you're using a *slightly* different URL. When you specify defType=dismax, you're NOT going to the dismax requestHandler. You're specifying a dismax style parser, and Solr expects that you're going to provide all the parameters on the URL. To whit: qf. If you add qf=field1 field2 field3... you'll see output. I found this extremely confusing when I started using Solr. If you use qt=dismax, _then_ you're specifying that you should use the requestHandler defined in your solrconfig.xml _named_ dismax. And this kind of thing was changed because it was so confusing, but I suspect your 3.5 installation is not quite the same URL. I think 3.5 was changed to use the default field in this case. BTW, 3.6 has just been released, if you're upgrading anyway you might want to jump to 3.6 Best Erick On Thu, Apr 12, 2012 at 6:08 AM, mechravi25 mechrav...@yahoo.co.in wrote: Hi, We are currently using solr (version 1.4.0.2010.01.13.08.09.44). we have a strange situation in dismax request handler. when we search for a keyword and append qt=dismax, we are not getting the any results. The solr request is as follows: http://local:8983/solr/core2/select/?q=Bankversion=2.2start=0rows=10indent=ondefType=dismaxdebugQuery=on The Response is as follows : result name=response numFound=0 start=0 / - lst name=debug str name=rawquerystringBank/str str name=querystringBank/str str name=parsedquery+() ()/str str name=parsedquery_toString+() ()/str lst name=explain / str name=QParserDisMaxQParser/str null name=altquerystring / null name=boostfuncs / - lst name=timing double name=time0.0/double - lst name=prepare double name=time0.0/double - lst name=org.apache.solr.handler.component.QueryComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.FacetComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.MoreLikeThisComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.HighlightComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.StatsComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.DebugComponent double name=time0.0/double /lst /lst - lst name=process double name=time0.0/double - lst name=org.apache.solr.handler.component.QueryComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.FacetComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.MoreLikeThisComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.HighlightComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.StatsComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.DebugComponent double name=time0.0/double /lst /lst /lst /lst /response We are currently testing the Solr Version 3.5, But the same is working fine in that version. Also the Query alternative params are not working properly in SOlr 1.5 when compared with version 3.5. The request seems to be the same, but dono where its making the issue. Please help me out. Thanks i advance. Regards, Sivaganesh siva_srm...@yahoo.co.in -- View this message in context: http://lucene.472066.n3.nabble.com/Dismax-request-handler-differences-Between-Solr-Version-3-5-and-1-4-tp3905192p3905192.html Sent from the Solr - User mailing list archive at Nabble.com.
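Side by side, the two requests Erick is distinguishing look roughly like this (the field names after qf are placeholders):

    # dismax as a query parser: every dismax parameter, qf above all, must come on the URL
    http://local:8983/solr/core2/select?q=Bank&defType=dismax&qf=name^2 description

    # dismax as a named requestHandler: the defaults in solrconfig.xml supply qf and friends
    http://local:8983/solr/core2/select?q=Bank&qt=dismax

With defType=dismax and no qf from anywhere, the parser has nothing to search against, which is exactly the empty parsed query (+() ()) and numFound=0 shown in the debug output above.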
Re: Further questions about behavior in ReversedWildcardFilterFactory
There is special handling build into Solr (but not Lucene I don't think) that deals with the reversed case, that's probably the source of your differences. Leading wildcards are extremely painful if you don't do some trick like Solr does with the reversed stuff. In order to run, you have to spin through _every_ term in the field to see which ones match. It won't be performant on any very large index. So I would stick with using the Solr stuff unless you have a specific need to do things at the Lucene level. In which case I'd look carefully at the Solr implementation to see what I could glean from that implementation. Best Erick On Thu, Apr 12, 2012 at 8:01 AM, neosky neosk...@yahoo.com wrote: I ask the question in http://lucene.472066.n3.nabble.com/A-little-onfusion-with-maxPosAsterisk-tt3889226.html However, when I do some implementation, I get a further questions. 1. Suppose I don't use ReversedWildcardFilterFactory in the index time, it seems that Solr doesn't allow the leading wildcard search, it will return the error: org.apache.lucene.queryParser.ParseException: Cannot parse 'sequence:*A*': '*' or '?' not allowed as first character in WildcardQuery But when I use the ReversedWildcardFilterFactory, I can use the *A* in the query. But as I know, the ReversedWildcardFilterFactory should work in the index part, should not affect the query behavior. If it is true, how does this happen? 2.Based on the question above suppose I have those tokens in index. 1.AB/MNO/UUFI 2.BC/MNO/IUYT 3.D/MNO/QEWA 4./MNO/KGJGLI 5.QOEOEF/MNO/ suppose I use the lucene, I can set the QueryParser with AllowLeadingWildcard(true), to search *MNO* it should return the tokens above(1-5) But in solr, when I conduct the *MNO* with the ReversedWildcardFilterFactory in the index, but use the StandardAnalyzer in the query, I don't know what happens here. The leading *MNO should be fast to match the 5 with ReversedWildcardFilterFactory The tailer MNO* should be fast to match 4 But What about *MNO* ? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Further-questions-about-behavior-in-ReversedWildcardFilterFactory-tp3905416p3905416.html Sent from the Solr - User mailing list archive at Nabble.com.
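As a concrete reference, the reversal is configured only on the index-time analyzer; an illustrative field type (attribute values here follow the example schema, so treat them as a starting point):

    <fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The special handling Erick mentions is on the query side: Solr sees the factory on the index analyzer, allows the otherwise-rejected leading wildcard, and rewrites a leading-wildcard term to run against the reversed tokens (conceptually ONM* for *MNO), while MNO* keeps using the original tokens (hence withOriginal=true). A double-ended pattern like *MNO* cannot be anchored at either end, so it effectively still has to scan terms and gets no speed-up from the reversal.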
Re: Suggester not working for digit starting terms
On Thu, Apr 12, 2012 at 3:52 PM, jmlucjav jmluc...@gmail.com wrote: Well now I am really lost... 1. yes I want to suggest whole sentences too, I want the tokenizer to be taken into account, and apparently it is working for me in 3.5.0?? I get suggestions that are like foo bar abc. Maybe what you mention is only for file based dictionaries? I am using the field itself. It doesn't use *JUST* your tokenizer. It splits and applies identifier rules. Such identifier rules include things like 'cannot start with a digit'. That's why I recommend you configure a SuggestQueryConverter so you have complete control of what is going on rather than dealing with the spellchecking one. Moving to 3.6.0 is not a problem (I had already downloaded the rc actually) but I still see weird things here. Installing 3.6 isn't going to do anything magical: as mentioned above you have to configure the SuggestQueryConverter like the example in the link if you want to have total control on how the input is treated before going to the suggester. -- lucidimagination.com
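If it helps, the converter Robert refers to is registered alongside the spellcheck/suggest component in solrconfig.xml, along these lines (the class name is as of 3.6 and worth confirming against the release actually deployed):

    <!-- replaces the default SpellingQueryConverter, whose identifier-style rules
         drop tokens that start with a digit, such as 500 -->
    <queryConverter name="queryConverter" class="org.apache.solr.spelling.SuggestQueryConverter"/>

The default converter is what applies the identifier rules described above; swapping it out hands the raw query text to the suggester.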
Re: Import null values from XML file
What does treated as null mean? Deleted from the doc? The problem here is that null-ness is kind of tricky. What behaviors do you want out of Solr in the NULL case? You can drop this out of the document by writing a custom updateHandler. It's actually quite simple to do. Best Erick On Thu, Apr 12, 2012 at 9:14 AM, randolf.julian randolf.jul...@dominionenterprises.com wrote: We import an XML file directly to SOLR using a the script called post.sh in the exampledocs. This is the script: FILES=$* URL=http://localhost:8983/solr/update for f in $FILES; do echo Posting file $f to $URL curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8' echo done #send the commit command to make sure all the changes are flushed and visible curl $URL --data-binary 'commit/' -H 'Content-type:text/xml; charset=utf-8' echo Our XML file looks something like this: add doc field name=ProductGuidD22BF0B9-EE3A-49AC-A4D6-000B07CDA18A/field field name=SkuGuidD22BF0B9-EE3A-49AC-A4D6-000B07CDA18A/field field name=ProductGroupId1000/field field name=VendorSkuCodeCK4475/field field name=VendorSkuAltCodeCK4475/field field name=ManufacturerSkuCodeNULL/field field name=ManufacturerSkuAltCodeNULL/field field name=UpcEanSkuCode840655037330/field field name=VendorSupersededSkuCodeNULL/field field name=VendorProductDescriptionEBC CLUTCH KIT/field field name=VendorSkuDescriptionEBC CLUTCH KIT/field /doc /add How can I tell solr that the NULL value should be treated as null? Thanks, Randolf -- View this message in context: http://lucene.472066.n3.nabble.com/Import-null-values-from-XML-file-tp3905600p3905600.html Sent from the Solr - User mailing list archive at Nabble.com.
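A rough sketch of the custom update-chain route Erick mentions, implemented as an UpdateRequestProcessor (class, chain wiring and the exact-match on the literal string NULL are all assumptions to adjust as needed):

    import java.io.IOException;
    import java.util.ArrayList;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class NullStrippingProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            // copy the field names first, since fields are removed while iterating
            for (String name : new ArrayList<String>(doc.getFieldNames())) {
              if ("NULL".equals(doc.getFieldValue(name))) {
                doc.removeField(name);
              }
            }
            super.processAdd(cmd);
          }
        };
      }
    }

Wired into an updateRequestProcessorChain ahead of RunUpdateProcessorFactory and referenced from the /update handler, the NULL placeholders then simply never reach the index.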
Re: codecs for sorted indexes
On Thu, Apr 12, 2012 at 6:35 PM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello Michael, Yes, we are pre-sorting the documents before adding them to the index. We have a score associated to every document (not an IR score but a document-related score that reflects its importance). Therefore, the document with the biggest score will have the lowest docid (we add it first to the index). We do this in order to apply early termination effectively. With the actual coded, we haven't seen much of a difference in terms of space when we have the index sorted vs not sorted. I wouldn't expect that you will see space savings when you sort this way. The techniques I was mentioning involve sorting documents by other factors instead (such as grouping related documents from the same website together: idea being they probably share many of the same terms): this hopefully creates smaller document deltas that require less bits to represent. -- lucidimagination.com
Re: searching across multiple fields using edismax - am i setting this up right?
Looks good on a quick glance. There are a couple of things... 1 there's no need for the qt param _if_ you specify the name as /partItemNoSearch, just use blahblah/solr/partItemNoSearch There's a JIRA about when/if you need at. Either will do, it's up to you which you prefer. 2 I'd consider moving the sort from the appends section to the defaults section on the theory that you may want to override sorting sometime. 3 Simple way to see the effects of this is to simply append debugQuery=on to your URL. You'll see the results of the query, including the parsed results. It's a little hard to read, but you should be seeing your search terms spread across all three fields. Best Erick On Thu, Apr 12, 2012 at 2:06 PM, geeky2 gee...@hotmail.com wrote: hello all, i just want to check to make sure i have this right. i was reading on this page: http://wiki.apache.org/solr/ExtendedDisMax, thanks to shawn for educating me. *i want the user to be able to fire a requestHandler but search across multiple fields (itemNo, productType and brand) WITHOUT them having to specify in the query url what fields they want / need to search on* this is what i have in my request handler requestHandler name=partItemNoSearch class=solr.SearchHandler default=false lst name=defaults str name=defTypeedismax/str str name=echoParamsall/str int name=rows5/int *str name=qfitemNo^1.0 productType^.8 brand^.5/str* str name=q.alt*:*/str /lst lst name=appends str name=sortrankNo asc, score desc/str /lst lst name=invariants str name=facetfalse/str /lst /requestHandler this would be an example of a single term search going against all three of the fields http://bogus:bogus/somecore/select?qt=partItemNoSearchq=*dishwasher*debugQuery=onrows=100 this would be an example of a multiple term search across all three of the fields http://bogus:bogus/somecore/select?qt=partItemNoSearchq=*dishwasher 123-xyz*debugQuery=onrows=100 do i understand this correctly? thank you, mark -- View this message in context: http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3906334.html Sent from the Solr - User mailing list archive at Nabble.com.
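Putting Erick's suggestions together, the handler from the question might end up looking like this (the boosts and the leading-slash name are illustrative):

    <requestHandler name="/partItemNoSearch" class="solr.SearchHandler" default="false">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="echoParams">all</str>
        <int name="rows">5</int>
        <str name="qf">itemNo^1.0 productType^0.8 brand^0.5</str>
        <str name="q.alt">*:*</str>
        <str name="sort">rankNo asc, score desc</str>  <!-- in defaults so it can be overridden -->
      </lst>
      <lst name="invariants">
        <str name="facet">false</str>
      </lst>
    </requestHandler>

called as /somecore/partItemNoSearch?q=dishwasher 123-xyz&debugQuery=on, with the debug output showing each term expanded across itemNo, productType and brand.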
Re: Solr Scoring
GAH! I had my head in make this happen in one field when I wrote my response, without being explicit. Of course Walter's solution is pretty much the standard way to deal with this. Best Erick On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood wun...@wunderwood.org wrote: It is easy. Create two fields, text_exact and text_stem. Don't use the stemmer in the first chain, do use the stemmer in the second. Give the text_exact a bigger weight than text_stem. wunder On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote: No, I don't think there's an OOB way to make this happen. It's a recurring theme, make exact matches score higher than stemmed matches. Best Erick On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com wrote: Hi, I have a field in my index called itemDesc which i am applying EnglishMinimalStemFilterFactory to. So if i index a value to this field containing Edges, the EnglishMinimalStemFilterFactory applies stemming and Edges becomes Edge. Now when i search for Edges, documents with Edge score better than documents with the actual search word - Edges. Is there a way i can make documents with the actual search word in this case Edges score better than document with Edge? I am using Solr 3.5. My field definition is shown below: fieldType name=text_en class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer /fieldType Thanks.
Re: solr hangs
Thanks for the response. I have given a size of 8gb for the instance and has only around few thousands of documents (with 15 fields each having small amount of data)..apparently the problem is the process (solr jetty instance) is consuming lots of threads...one time it consumed around 50k threads and the process maxed out the allowable thread allocated by the OS (centos) for the process..and in the admin page is see tons of threads under Thread Dump...it's lik solr is waiting for somethingi have two leader and replica cores/shards in two instances...and i send the documents to one of the shard through the csv update handler... On Wed, Apr 11, 2012 at 7:39 AM, Pawel Rog pawelro...@gmail.com wrote: You wrote that you can see such error OutOfMemoryError. I had such problems when my caches were to big. It means that there is no more free memory in JVM and probably full gc starts running. How big is your Java heap? Maybe cache sizes in yout solr are to big according to your JVM settings. -- Regards, Pawel On Tue, Apr 10, 2012 at 9:51 PM, Peter Markey sudoma...@gmail.com wrote: Hello, I have a solr cloud setup based on a blog ( http://outerthought.org/blog/491-ot.html) and am able to bring up the instances and cores. But when I start indexing data (through csv update), the core throws a out of memory exception (null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread). The thread dump from new solr ui is below: cmdDistribExecutor-8-thread-777 (827) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1bd11b79 - sun.misc.Unsafe.park(Native Method) - java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await (AbstractQueuedSynchronizer.java:2043) - org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:158) - org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking (ConnPoolByRoute.java:403) - org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry (ConnPoolByRoute.java:300) - org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection (ThreadSafeClientConnManager.java:224) - org.apache.http.impl.client.DefaultRequestDirector.execute (DefaultRequestDirector.java:401) - org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:820) - org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:754) - org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:732) - org.apache.solr.client.solrj.impl.HttpSolrServer.request (HttpSolrServer.java:304) - org.apache.solr.client.solrj.impl.HttpSolrServer.request (HttpSolrServer.java:209) - org.apache.solr.update.SolrCmdDistributor$1.call (SolrCmdDistributor.java:320) - org.apache.solr.update.SolrCmdDistributor$1.call (SolrCmdDistributor.java:301) - java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) - java.util.concurrent.FutureTask.run(FutureTask.java:166) - java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) - java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) - java.util.concurrent.FutureTask.run(FutureTask.java:166) - java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1110) - java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:603) - java.lang.Thread.run(Thread.java:679) Apparently I do see lots of threads like above in the thread dump. I'm using latest build from the trunk (Apr 10th). 
Any insights into this issue would be really helpful. Thanks a lot.
Re: Solr Http Caching
: Are any of you using Solr Http caching? I am interested to see how people
: use this functionality. I have an index that basically changes once a day
: at midnight. Is it okay to enable Solr Http caching for such an index and
: set the max age to 1 day? Any potential issues?
:
: I am using solr 3.5 with SolrJ.
In a past life I put Squid in front of Solr as an accelerator. I didn't bother configuring Solr to output expiration info in the Cache-Control header; I just took advantage of the ETag generated from the index version (as well as lastModifiedFrom=openTime) to ensure that Solr would short-circuit and return a 304 without doing any processing (or wasting a lot of bandwidth returning data) any time it got an If-Modified-Since or If-None-Match request indicating that the cache already had a current copy. If you know your index only changes every 24 hours, then setting a max-age would probably make sense, to eliminate even those conditional requests, but I wouldn't set it to 24H (what if a request happens 1 minute before your daily rebuild?). Set it to the longest amount of time you are willing to serve stale results. -Hoss
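A minimal solrconfig.xml sketch of the setup discussed above, assuming stale results are acceptable for up to 12 hours; the max-age value is illustrative, and the 304 behaviour comes from the ETag/Last-Modified handling rather than from max-age:

<requestDispatcher handleSelect="true">
  <!-- derive Last-Modified and the ETag from the time the current searcher
       was opened, so conditional requests can be answered with 304 Not Modified -->
  <httpCaching never304="false" lastModFrom="openTime" etagSeed="Solr">
    <!-- only set max-age if you are willing to serve results this stale -->
    <cacheControl>max-age=43200, public</cacheControl>
  </httpCaching>
</requestDispatcher>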
Re: Does the lucene can read the index file from solr?
Hi neosky, how did you do it? I need this too. Thanks.
On Thu, Apr 12, 2012 at 9:35 PM, neosky neosk...@yahoo.com wrote: Thanks! I will try again.
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Hello Ali,
I'm trying to set up a large scale *Crawl + Index + Search* infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks*, with a search latency of less than 0.5 seconds.
That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things.
Needless to mention, the search index needs to scale to 5 Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index.
Yup, OK.
However, I would like such a system to be homogeneous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer that the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment were flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server restarts).
There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase.
However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc., but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above.
Here is a summary of all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good.
* Lily - data stored in an HBase cluster gets indexed to separate Solr instance(s) on the side. Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a many-billion-document index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, and I'm not sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again not integrated.
If I were you and I had to pick today, I'd pick ElasticSearch if I were completely open. If I had a Solr bias I'd give SolrCloud a try first.
Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate I would need with this setup, for regular web data (HTML text) at this scale?
I don't know off the top of my head, but I'm guessing several hundred for serving search requests.
HTH, Otis
-- Search Analytics - http://sematext.com/search-analytics/index.html Scalable Performance Monitoring - http://sematext.com/spm/index.html
Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :).
Many many thanks in advance. Thanks, Safdar
Re: term frequency outweighs exact phrase match
: I use solr 3.5 with edismax. I have the following issue with phrase
: search. For example if I have three documents with content like
:
: 1. apache apache
: 2. solr solr
: 3. apache solr
:
: then a search for apache solr displays documents in the order 1, 2, 3
: instead of 3, 2, 1, because term frequency in the first and second
: documents is higher than in the third document. We want results to be
: displayed in the order 3, 2, 1, since the third document has an exact
: match.
You need to give us a lot more info, like what other data is in the various fields for those documents, exactly what your query URL looks like, and what debugQuery=true gives you back in terms of score explanations for each document, because if that sample content is the only thing you've got indexed (even if it's in multiple fields), then documents #1 and #2 shouldn't even match your query using the mm you've specified...
: <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
...because docs #1 and #2 will only contain one clause. Otherwise it should work fine. I used the example 3.5 schema, and created 3 docs matching what you described (with name copyField'ed into text)...
<add>
  <doc><field name="id">1</field><field name="name">apache apache</field></doc>
  <doc><field name="id">2</field><field name="name">solr solr</field></doc>
  <doc><field name="id">3</field><field name="name">apache solr</field></doc>
</add>
...and then used this similar query (note mm=1) to get the results you would expect...
http://localhost:8983/solr/select/?fl=name,score&debugQuery=true&defType=edismax&qf=name+text&pf=name^10+text^5&q=apache%20solr&mm=1
<result name="response" numFound="3" start="0" maxScore="1.309231">
  <doc>
    <float name="score">1.309231</float>
    <str name="name">apache solr</str>
  </doc>
  <doc>
    <float name="score">0.022042051</float>
    <str name="name">apache apache</str>
  </doc>
  <doc>
    <float name="score">0.022042051</float>
    <str name="name">solr solr</str>
  </doc>
</result>
-Hoss
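For completeness, the same parameters Hoss passes on the URL can be baked into a handler as defaults; this is a sketch assuming the example schema's name and text fields and a handler name of /select, not configuration taken from the thread:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name text</str>
    <!-- pf rescores documents where the whole query matches as a phrase,
         which is what pushes the exact match above the high-tf documents -->
    <str name="pf">name^10 text^5</str>
    <str name="mm">1</str>
  </lst>
</requestHandler>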
RE: solr 3.5 taking long to index
The machine has a total RAM of around 46GB. My biggest concern is the Solr index time gradually increasing, and then the commit stops because of timeouts; our commit rate is very high, but I am not able to find the root cause of the issue. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg
-Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: 13 April 2012 05:15 To: solr-user@lucene.apache.org Subject: Re: solr 3.5 taking long to index
On 4/12/2012 12:42 PM, Rohit wrote: Thanks for pointing these out, but I still have one concern, why is the Virtual Memory running in 300g+?
Solr 3.5 uses MMapDirectoryFactory by default to read the index. This does an mmap on the files that make up your index, so their entire contents are simply accessible to the application as virtual memory (over 300GB in your case), and the OS automatically takes care of swapping disk pages in and out of real RAM as required. This approach has less overhead and tends to make better use of the OS disk cache than other methods. It does lead to confused questions and scary numbers in memory usage reporting, though. You have mentioned that you are giving 36GB of RAM to Solr. How much total RAM does the machine have? Thanks, Shawn
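For reference, the directory implementation can also be pinned explicitly in solrconfig.xml; this is a sketch rather than configuration from the thread, and solr.NIOFSDirectoryFactory or solr.StandardDirectoryFactory could be substituted if memory-mapping is not wanted:

<!-- solrconfig.xml: MMapDirectoryFactory maps the index files into virtual
     memory, which is why the process shows 300GB+ of virtual size; the OS
     pages the data in and out of physical RAM on demand. -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>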
Re: solr 3.5 taking long to index
On 4/12/2012 8:42 PM, Rohit wrote: The machine has a total RAM of around 46GB. My biggest concern is the Solr index time gradually increasing, and then the commit stops because of timeouts; our commit rate is very high, but I am not able to find the root cause of the issue.
For good performance, Solr relies on the OS having enough free RAM to keep critical portions of the index in the disk cache. Some numbers that I have collected from your information so far are listed below. Please let me know if I've got any of this wrong:
46GB total RAM
36GB RAM allocated to Solr
300GB total index size
This leaves only 10GB of RAM free to cache 300GB of index, assuming that this server is dedicated to Solr. The critical portions of your index are very likely considerably larger than 10GB, which causes constant reading from the disk for queries and updates. With a high commit rate and a relatively low mergeFactor of 10, your index will be doing a lot of merging during updates, and some of those merges are likely to be quite large, further complicating the I/O situation. Another thing that can lead to increasing index update times is cache warming, which is also greatly affected by high I/O levels. If you visit the /solr/corename/admin/stats.jsp#cache URL, you can see the warmupTime for each cache in milliseconds. Adding more memory to the server would probably help things. You'll want to carefully check all the server and Solr statistics you can to make sure that memory is the root of the problem before you actually spend the money. At the server level, look for things like a high iowait CPU percentage. For Solr, you can turn the logging level up to INFO in the admin interface as well as turn on the infoStream in solrconfig.xml for extensive debugging. I hope this is helpful. If not, I can try to come up with more specific things you can look at. Thanks, Shawn
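A hedged solrconfig.xml sketch of the cache-warming tuning implied above; the cache names are the standard Solr caches, but the sizes and autowarmCount values are illustrative starting points, not recommendations for this particular index:

<!-- solrconfig.xml: with a high commit rate, large autowarmCount values make
     every new searcher replay cached filters/queries, which shows up as a long
     warmupTime on the stats page. Lowering them (or setting them to 0) trades
     warm caches for faster commits. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>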