Re: Query modification
Hi, I am also using the QueryComponent to perform a similar modification to the query, in the process() method of the component. The problem I am facing is that after modifying the query and setting it on the response builder, I call super.process(rb). This call takes around 100ms and degrades the component's performance. I wanted to know: is process() the right place to do this, and do we need to call super.process()? Regards, Sidharth.
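A note for readers of the archive: super.process(rb) is where QueryComponent actually executes the search, so the ~100ms is most likely the query itself rather than overhead, and skipping the call would mean no results are produced. Query rewriting is usually done in prepare() instead. A minimal sketch, assuming the component is registered in solrconfig.xml in place of the standard query component; the status:active clause is an invented example:

  import java.io.IOException;

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause.Occur;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;
  import org.apache.solr.handler.component.QueryComponent;
  import org.apache.solr.handler.component.ResponseBuilder;

  public class QueryRewritingComponent extends QueryComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
      super.prepare(rb); // let QueryComponent parse q, fq, etc. first
      // wrap the parsed query with an extra required clause
      BooleanQuery rewritten = new BooleanQuery();
      rewritten.add(rb.getQuery(), Occur.MUST);
      rewritten.add(new TermQuery(new Term("status", "active")), Occur.MUST);
      rb.setQuery(rewritten);
    }
    // no process() override: the search itself still runs in
    // QueryComponent.process(), which cannot be skipped
  }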
caching HTML pages in SOLR
Hi, Google stores HTML pages as *cached* documents; is there a similar provision in Solr? I am using Solr 4.4.0. Thanks, Shailendra
Re: caching HTML pages in SOLR
Not in Solr itself, no. Solr is all about Search. Caching (and rewriting resource links, etc.) should probably be part of whatever does the document fetching. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: caching HTML pages in SOLR
Thanks Alex. I was wondering whether something of this sort already exists.
Re: caching HTML pages in SOLR
I have not used it myself, but perhaps something like http://www.crawl-anywhere.com/ is along the lines of what you were looking for. Regards, Alex.
Class name of parsing the fq clause
Hi, I am querying Solr with an fq clause like: fq=BEGINTIME:[2013-08-25T16:00:00Z TO *] AND BUSID:(M3 OR M9). I am curious about the parsing process and want to study it. Which Java file describes the parsing of the fq clause? Thanks. Regards.
Re: XLSB files not indexed
Hi Otis, In our case there is no exception raised by Tika or Solr; a Lucene document is created, but the content field contains only a few white spaces, as for ODF files. Roland. On Sat, Oct 19, 2013 at 3:54 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Roland, It looks like: Tika - yes, Solr - no? Based on http://search-lucene.com/?q=xlsb ODF != XLSB though, I think... Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 7:36 AM, Roland Everaert reveatw...@gmail.com wrote: Hi, Can someone tell me whether Tika is supposed to extract data from XLSB files (the new MS Office format in binary form)? If so, then it seems that Solr is not able to index them, just as it is not able to index ODF files (a JIRA is already open for ODF: https://issues.apache.org/jira/browse/SOLR-4809). Can someone confirm the problem, or tell me what to do to make Solr work with XLSB files. Regards, Roland.
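One way to narrow down whether the problem is in Tika itself or in Solr's use of it is to run the standalone tika-app jar against a sample file; if plain text comes out here but not through Solr Cell, the problem is on the Solr side. A sketch, assuming a Tika 1.4 tika-app jar and a hypothetical sample file name:

  # extract plain text directly with Tika, bypassing Solr entirely
  java -jar tika-app-1.4.jar --text sample.xlsb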
RE: Facet performance
On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: Unfortunately the enum solution is normally quite slow when there are enough unique values to trigger the "too many values" exception. [...] [...] And yes, the fc method was terribly slow in a case where it did work. Something like 20 minutes, whereas enum returned within a few seconds. Err.. what? That sounds _very_ strange. You have millions of unique values, so fc should be a lot faster than enum, not the other way around. I assume the 20 minutes was for the first call. How fast do subsequent calls return for fc? Maybe you could provide some approximate numbers?
- Documents in your index
- Unique values in the CONTENT field
- Hits returned from a typical query
- Xmx
Regards, Toke Eskildsen, State and University Library, Denmark
how to debug my own analyzer in solr
Dear Solr experts, I would like to write my own analyzer (a Chinese analyzer) and integrate it into Solr as a plugin. From the log information, the custom analyzer is loaded into Solr successfully, and I define my fieldType with this custom analyzer. Now the problem is that when I try this analyzer from http://localhost:8983/solr/#/collection1/analysis (click Analysis, choose my field type, input some text, then click the Analyse Value button), Solr hangs there; I cannot get any result or response within a few minutes. I also tried to add some data by curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml", and by post.sh in the exampledocs folder. Same issue: Solr hangs there, no result and no response. Can anybody give me some suggestions on how to debug Solr with my own custom analyzer? By the way, when I write a standalone Java program to call my custom analyzer, the result is okay; for example, the following code works well:
==
Analyzer analyzer = new MyAnalyzer();
TokenStream ts = analyzer.tokenStream("text", new StringReader("..."));
CharTermAttribute ta = ts.getAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(ta.toString());
}
==
Thanks, -Mingz
Ordering Results
Hi, I have a situation where, when a user searches for anything, the suggestions should come first from exact matches and then from fuzzy matches. Suppose we show 15 suggestions: the first 10 results are exact matches, and the remaining 5 results come from fuzzy matches. Can anybody give me suggestions on how to achieve this? Regards, kumar
how to avoid recovery? how to ensure a recovery succeeds?
Hi, guys: I have an online application with SolrCloud 4.1, but I get syncpeer errors every 2 or 3 weeks... As I understand it, a recovery occurs when a replica cannot sync data from its leader successfully. I have seen the topic http://lucene.472066.n3.nabble.com/SolrCloud-5x-Errors-while-recovering-td4022542.html and https://issues.apache.org/jira/i#browse/SOLR-4032, but why do I still get similar errors in SolrCloud 4.1? Is there any setting for syncpeer? How can I reduce the probability of this error, and when a recovery happens, how can I ensure it succeeds? The errors I got look like this:

[2013.10.21 10:39:13.482]2013-10-21 10:39:13,482 WARN [org.apache.solr.handler.SnapPuller] - Error in fetching packets
[2013.10.21 10:39:13.482]java.io.EOFException
[2013.10.21 10:39:13.482] at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:154)
[2013.10.21 10:39:13.482] at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:146)
[2013.10.21 10:39:13.482] at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1136)
[2013.10.21 10:39:13.482] at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1099)
[2013.10.21 10:39:13.482] at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:738)
[2013.10.21 10:39:13.482] at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:395)
[2013.10.21 10:39:13.482] at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:274)
[2013.10.21 10:39:13.482] at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:153)
[2013.10.21 10:39:13.482] at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:409)
[2013.10.21 10:39:13.482] at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
[2013.10.21 10:39:13.485]2013-10-21 10:39:13,485 WARN [org.apache.solr.handler.SnapPuller] - Error in fetching packets
[2013.10.21 10:39:13.485]java.io.EOFException
[2013.10.21 10:39:13.485] at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:154)
[2013.10.21 10:39:13.485] at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:146)
[2013.10.21 10:39:13.485] at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1136)
[2013.10.21 10:39:13.485] at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1099)
[2013.10.21 10:39:13.485] at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:738)
[2013.10.21 10:39:13.485] at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:395)
[2013.10.21 10:39:13.485] at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:274)
[2013.10.21 10:39:13.485] at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:153)
[2013.10.21 10:39:13.485] at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:409)
[2013.10.21 10:39:13.485] at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
[2013.10.21 10:41:08.461]2013-10-21 10:41:08,461 ERROR [org.apache.solr.handler.ReplicationHandler] - SnapPull failed :org.apache.solr.common.SolrException: Unable to download _fi05_Lucene41_0.pos completely. Downloaded 0!=1485
[2013.10.21 10:41:08.461] at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1230)
[2013.10.21 10:41:08.461] at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1110)
[2013.10.21 10:41:08.461] at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:738)
[2013.10.21 10:41:08.461] at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:395)
[2013.10.21 10:41:08.461] at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:274)
[2013.10.21 10:41:08.461] at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:153)
[2013.10.21 10:41:08.461] at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:409)
[2013.10.21 10:41:08.461] at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
[2013.10.21 10:41:08.461]
[2013.10.21 10:41:08.461]2013-10-21 10:41:08,461 ERROR [org.apache.solr.cloud.RecoveryStrategy] - Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
[2013.10.21 10:41:08.461] at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:156)
[2013.10.21 10:41:08.461] at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:409)
[2013.10.21 10:41:08.461] at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
[2013.10.21 10:41:08.461]
[2013.10.21 10:41:08.555]2013-10-21 10:41:08,462 ERROR
Re: Solr timeout after reboot
Thank you, Otis! I've integrated SPM on my Solr instances and now I have access to monitoring data. Could you give me some hints on which metrics I should watch? Below I've added my query configs. Is there anything I could tweak here?

<query>
  <maxBooleanClauses>1024</maxBooleanClauses>
  <filterCache class="solr.FastLRUCache" size="1000" initialSize="1000" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="1000" initialSize="1000" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="1000" initialSize="1000" autowarmCount="0"/>
  <fieldValueCache class="solr.FastLRUCache" size="1000" initialSize="1000" autowarmCount="0"/>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <queryResultWindowSize>20</queryResultWindowSize>
  <queryResultMaxDocsCached>100</queryResultMaxDocsCached>
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">active:true</str>
      </lst>
    </arr>
  </listener>
  <useColdSearcher>false</useColdSearcher>
  <maxWarmingSearchers>10</maxWarmingSearchers>
</query>

- Thanks, Michael
Re: solrconfig.xml carrot2 params
Thanks, I'm new to the clustering libraries. I finally made this connection when I started browsing through the Carrot2 source. I had pulled down a smaller MM document collection from our test environment. It was not ideal, as it was mostly structured, but small. I foolishly thought I could cluster on the text copy field before realizing that it was index-only. Doh!

That is correct -- for the time being, clustering can only be applied to stored Solr fields.

Our documents are indexed in SolrCloud but stored in HBase. I want to allow users to page through Solr hits, but would like to cluster on all (or at least several thousand) of the top search hits. Now I'm puzzling over how to efficiently cluster over possibly several thousand Solr hits when the documents are in HBase. I thought of an HBase coprocessor, but Carrot2 isn't designed for distributed computation. Mahout, in the Hadoop M/R context, seems slow and heavy-handed for this scale; maybe I just need to dig deeper into their library. Or I could just be missing something fundamental? :)

Carrot2 algorithms were not designed to be distributed, but you can still use them in a single-threaded scenario. To do this, you'd probably need to write a bit of code that gets the text of your documents from HBase and runs Carrot2 clustering on it. If you use the STC clustering algorithm, you should be able to process several thousand documents in a reasonable time (on the order of seconds). The clustering side of the code should be a matter of a few lines of code (http://download.carrot2.org/stable/javadoc/overview-summary.html#clustering-documents). The tricky bit of the setup may be efficiently getting the text for clustering -- it can happen that fetching takes longer than the actual clustering. S.
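For readers who want the shape of that "few lines of code": a minimal sketch against the Carrot2 3.x Java API, with the HBase fetch reduced to a hard-coded placeholder document. Everything outside the Carrot2 calls (class name, example text, the query string) is invented for illustration:

  import java.util.ArrayList;
  import java.util.List;

  import org.carrot2.clustering.stc.STCClusteringAlgorithm;
  import org.carrot2.core.Cluster;
  import org.carrot2.core.Controller;
  import org.carrot2.core.ControllerFactory;
  import org.carrot2.core.Document;
  import org.carrot2.core.ProcessingResult;

  public class HBaseHitClustering {
    public static void main(String[] args) {
      // in real use, build this list from the top-N Solr hits
      // fetched out of HBase; Document takes (title, body text)
      List<Document> docs = new ArrayList<Document>();
      docs.add(new Document("Example title", "Example body text fetched from HBase"));

      // single-threaded, in-process clustering with STC
      Controller controller = ControllerFactory.createSimple();
      ProcessingResult result =
          controller.process(docs, "user query", STCClusteringAlgorithm.class);

      for (Cluster cluster : result.getClusters()) {
        System.out.println(cluster.getLabel()
            + " (" + cluster.getAllDocuments().size() + " docs)");
      }
    }
  }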
Re: how to debug my own analyzer in solr
More information about this: the custom analyzer just implements createComponents() of Analyzer, and my configuration in schema.xml is something like:

<fieldType name="text_cn" class="solr.TextField">
  <analyzer class="my.package.CustomAnalyzer"/>
</fieldType>

In the log I cannot see any error information; however, when I analyze or add document data, it always hangs. Any way to debug or narrow down the problem? Thanks in advance. -Mingz
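An editorial note for readers hitting the same symptom: one frequent cause of "works in a standalone test, hangs inside Solr" is that Solr reuses analysis components. createComponents() is called once per thread, and the Tokenizer is then re-fed new input through reset(); if the custom Tokenizer consumes its input in the constructor, or its incrementToken() never returns false after reuse, the analysis loop can spin forever. This is a guess at the cause, not a diagnosis from the thread. A minimal sketch of the expected structure for Lucene/Solr 4.x (MyTokenizer is hypothetical):

  import java.io.Reader;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Tokenizer;

  public final class CustomAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
      // called once per thread; the Tokenizer is reused and re-fed input
      // via reset(), so per-document state must be (re)initialized in the
      // Tokenizer's reset()/incrementToken(), not in its constructor
      Tokenizer source = new MyTokenizer(reader);
      return new TokenStreamComponents(source);
    }
  }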
Error: Repeated service interruptions - failure processing document: Read timed out
Hi, I just installed Solr and when running a job I get the following problem: "Error: Repeated service interruptions - failure processing document: Read timed out". Like I said, I just installed Solr and so am very new to the topic. (On Windows 2008 R2: Solr 4.4, Tomcat 7.0.42, ManifoldCF 1.3, PostgreSQL 9.1.1.) In the Tomcat log I find the following error:

ERROR - 2013-10-21 09:35:16.551; org.apache.solr.common.SolrException; null:org.apache.commons.fileupload.FileUploadBase$IOFileUploadException: Processing of multipart/form-data request failed. null
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:492)
at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:626)
at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:143)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:342)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException
at org.apache.coyote.http11.InternalAprInputBuffer.fill(InternalAprInputBuffer.java:607)
at org.apache.coyote.http11.InternalAprInputBuffer$SocketInputBuffer.doRead(InternalAprInputBuffer.java:642)
at org.apache.coyote.http11.filters.ChunkedInputFilter.readBytes(ChunkedInputFilter.java:275)
at org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:377)
at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:147)
at org.apache.coyote.http11.InternalAprInputBuffer.doRead(InternalAprInputBuffer.java:534)
at org.apache.coyote.Request.doRead(Request.java:422)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:290)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:449)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:315)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:200)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:125)
at org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977)
at org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887)
at java.io.InputStream.read(Unknown Source)
at org.apache.commons.fileupload.util.Streams.copy(Streams.java:94)
at org.apache.commons.fileupload.util.Streams.copy(Streams.java:64)
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
... 21 more
Re: how to debug my own analyzer in solr
Thread dump and/or remote debugging?! Cheers, Siegfried Goeschl
Re: how to debug my own analyzer in solr
Hi Mingz, If you use Eclipse, you can debug Solr with your plugin like this:

# go to the Solr install directory
$ cd $SOLR
$ ant run-example -Dexample.debug=true

Then connect to the JVM from Eclipse via remote debug port 5005. Good luck! koji -- http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
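If you are not building from source with ant, the equivalent is to start the example Jetty with the standard JDWP flags yourself and attach Eclipse (or any debugger) to port 5005; these are ordinary JVM debug options, not anything Solr-specific:

  java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 -jar start.jar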
Re: Ordering Results
Do two searches. Why do you want to do this, though? It seems a bit strange. Presumably your users want the best matches possible, whether exact or fuzzy? Wouldn't it be best to return both exact and fuzzy matches, but score the exact ones above the fuzzy ones? Upayavira
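To make the "score exact above fuzzy" idea concrete: a single query can OR the two forms together with a large boost on the exact clause, so exact hits sort first and fuzzy hits fill the remainder. A sketch only; the field name title and the boost/edit-distance values are invented:

  q=title:"okkadu telugu movie"^100 OR title:(okkadu~2 telugu movie)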
Re: SolrCloud Performance Issue
Shamik: You're right, the use of NOW shouldn't be making that much of a difference between versions. FYI, though, here's a way to use NOW and re-use fq clauses: http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/ It may well be this setting:

<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

Every second (assuming you're indexing), you're throwing away all your top-level caches and executing any autowarm queries, etc. And if you _don't_ have any autowarming queries, you may not be filling the caches at all, an expensive omission. Try lengthening that out to, say, a minute (60000 ms) or even longer and see if that makes a difference. If that's the culprit, you at least have a place to start. If that's not it, it's also possible you're seeing decompression. How many documents are you returning, and how big are they? There are some anecdotal comments that the default stored-field decompression for either a large number of docs or very large docs may be playing a role here. Try setting fl=id (don't return any stored fields). If that is faster, this might be your problem. queryResultCache is often not very high re: hit ratio. It's usually used for paging, so if your users aren't hitting the next page you may not hit it much. Best, Erick On Sat, Oct 19, 2013 at 4:12 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, What happens if you have just 1 shard - no distributed search, like before? SPM for Solr or any other monitoring tool that captures OS and Solr metrics should help you find the source of the problem faster. Is disk IO the same? Utilization of caches? JVM version, heap, etc.? CPU usage? Network? I'd look at each of these things side by side and look for big differences. Otis -- Solr ElasticSearch Support -- http://sematext.com/ SOLR Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 1:38 AM, shamik sham...@gmail.com wrote: I tried commenting out NOW in bq, but it didn't make any difference in the performance. I do see a minor entry in the queryfiltercache rate, which is a meager 0.02. I'm really struggling to figure out the bottleneck; any known pain points I should be checking?
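The trick in the linked article, in one line: a raw NOW makes every fq textually unique, so it can never be served from the filter cache, while rounding with date math keeps the clause stable for the whole rounding window. The field name timestamp is illustrative:

  fq=timestamp:[NOW/DAY-7DAYS TO NOW/DAY]    <- same cache entry all day
  fq=timestamp:[NOW-7DAYS TO NOW]            <- new cache entry on every request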
Re: caching HTML pages in SOLR
You can also try: https://www.varnish-cache.org/
Re: ExtractRequestHandler, skipping errors
Guido, can you point us to the Commons-Compress JIRA issue which reports your particular problem? Perhaps uncompress works just fine? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 18. okt. 2013 kl. 14:48 skrev Guido Medina guido.med...@temetra.com: Don't; commons-compress 1.5 is broken, either use 1.4.1 or later. Our app stopped compressing properly after a Maven update. Guido. On 18/10/13 12:40, Roland Everaert wrote: I will open a JIRA issue; I suppose I just have to create an account first? Regards, Roland. On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi, I think the flag cannot ignore NoSuchMethodError. There may be something wrong here? ... I've just checked my Solr 4.5 directories and I found the Tika version is 1.4. Tika 1.4 seems to use commons-compress 1.5: http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup But I see commons-compress-1.4.1.jar in the solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get commons-compress 1.5 and put it in the directory (don't forget to remove the 1.4.1 jar file). koji (13/10/18 16:37), Roland Everaert wrote: Hi, We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. Following is the error from solr.log. The file type that caused this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.

ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:673)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:383)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:102)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
... 16 more

On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi Roland, (13/10/17 20:44), Roland Everaert wrote:
Question about docvalues
Hi, If I have a field (named dv_field) configured as indexed, stored, and with docValues=true, how do I know that when I run a query like q=*:*&facet=true&facet.field=dv_field I'm really using the docValues and not the normal way? Is it necessary to duplicate the field and set indexed and stored to false, leaving only the docValues property set to true? - Best regards
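For reference, the single-field setup the question describes would look like this in schema.xml (the string type here is an assumption; any docValues-capable type behaves the same way):

  <field name="dv_field" type="string" indexed="true" stored="true" docValues="true"/>

As I understand Solr 4.x, no index-only duplicate is needed for a single-valued field: when docValues exist for a field, the field-cache lookup that faceting and sorting go through picks up the on-disk docValues instead of un-inverting the indexed terms.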
Re: Solr timeout after reboot
Have you tried this old trick to warm the FS cache? cat .../core/data/index/* > /dev/null Peter
Re: Solr timeout after reboot
Hmm, no, I haven't... What would be the effect of this? - Thanks, Michael
Re: Solr timeout after reboot
To put the file data into the file system cache, which would make for faster access. François On Oct 21, 2013, at 8:33 AM, michael.boom my_sky...@yahoo.com wrote: Hmm, no, I haven't... What would be the effect of this? - Thanks, Michael
Exact Match Results
I am querying Solr for exact-match results, but it is showing some other results also. Example: user query string: Okkadu telugu movie. Results:
1. Okkadu telugu movie
2. Okkadunnadu telugu movie
3. YuganikiOkkadu telugu movie
4. Okkadu telugu movie stills
How can we order these results so that the 4th result comes second? Can anyone give me an idea, please?
Re: Exact Match Results
Kumar, You might want to look into the 'pf' parameter: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser François
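In short, pf re-boosts documents in which the whole query appears as a phrase. A sketch, assuming the movie names live in a field called title (the field name and boost are illustrative):

  q=Okkadu telugu movie&defType=edismax&qf=title&pf=title^10

With that, "Okkadu telugu movie stills" contains the full phrase and receives the phrase boost, so it should rank above near-matches like "Okkadunnadu telugu movie", given the same analysis that produced the matches shown above.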
Re: Class name of parsing the fq clause
Start with org.apache.solr.handler.component.QueryComponent#prepare, which fetches the fq parameters and indirectly invokes the query parser(s):

String[] fqs = req.getParams().getParams(CommonParams.FQ);
if (fqs != null && fqs.length != 0) {
  List<Query> filters = rb.getFilters();
  // if filters already exist, make a copy instead of modifying the original
  filters = filters == null ? new ArrayList<Query>(fqs.length) : new ArrayList<Query>(filters);
  for (String fq : fqs) {
    if (fq != null && fq.trim().length() != 0) {
      QParser fqp = QParser.getParser(fq, null, req);
      filters.add(fqp.getQuery());
    }
  }
  // only set the filters if they are not empty, otherwise
  // fq=&someotherParam= will trigger the all-docs filter for every request
  // if the filter cache is disabled
  if (!filters.isEmpty()) {
    rb.setFilters(filters);
  }
}

Note that this line actually invokes the parser:

filters.add(fqp.getQuery());

Then in org.apache.solr.search.QParser#getParser:

QParserPlugin qplug = req.getCore().getQueryPlugin(parserName);
QParser parser = qplug.createParser(qstr, localParams, req.getParams(), req);

And for the common case of the Lucene query parser, org.apache.solr.search.LuceneQParserPlugin#createParser:

public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
  return new LuceneQParser(qstr, localParams, params, req);
}

And then in org.apache.solr.search.QParser#getQuery:

public Query getQuery() throws SyntaxError {
  if (query == null) {
    query = parse();

And then in org.apache.solr.search.LuceneQParser#parse:

lparser = new SolrQueryParser(this, defaultField);
lparser.setDefaultOperator(QueryParsing.getQueryParserDefaultOperator(getReq().getSchema(), getParam(QueryParsing.OP)));
return lparser.parse(qstr);

And then in org.apache.solr.parser.SolrQueryParserBase#parse:

Query res = TopLevelQuery(null); // pass null so we can tell later if an explicit field was provided or not

And then in org.apache.solr.parser.QueryParser#TopLevelQuery, the parsing begins. org.apache.solr.parser.QueryParser.jj is the grammar for a basic Solr/Lucene query, org.apache.solr.parser.QueryParser.java is generated from it by JavaCC, and a lot of the logic is in the base class of the generated class, org.apache.solr.parser.SolrQueryParserBase.java. Good luck! Happy hunting! -- Jack Krupansky
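A practical companion to this walkthrough: you can watch the end product of the whole chain without a debugger by adding debugQuery=true to a request. The debug section of the response lists each raw fq under filter_queries and the toString() of each parsed Query under parsed_filter_queries, so the request that started this thread:

  q=*:*&fq=BEGINTIME:[2013-08-25T16:00:00Z TO *] AND BUSID:(M3 OR M9)&debugQuery=true

would show roughly +BEGINTIME:[2013-08-25T16:00:00 TO *] +(BUSID:M3 BUSID:M9) as the parsed form; the exact rendering depends on the field types in the schema.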
Re: Solr timeout after reboot
I found this warming to be especially necessary after starting an instance on those m3.xlarge servers; otherwise the response times for the first few minutes were terrible. Peter
Re: Solr timeout after reboot
I'm using the m3.xlarge server with 15G RAM, but my index size is over 100G, so I guess running the above command would eat all available memory. - Thanks, Michael
Re: Solr timeout after reboot
Well no, the OS is smarter than that: it manages the file system cache along with other memory requirements. If applications need more memory, then the file system cache will likely be reduced. The command is a cheap trick to get the OS to fill the file system cache as quickly as possible; not sure how much it will help, though, with a 100GB index on a 15GB machine. This might work if you 'cat' only the index files other than the '.fdx' and '.fdt' files. François
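A sketch of that selective warming with standard shell tools; the index path is a placeholder, and skipping .fdt/.fdx (the stored-field data and its index) is the suggestion from the message above:

  # warm everything except the stored-field files
  find /path/to/core/data/index -type f ! -name '*.fdt' ! -name '*.fdx' -exec cat {} + > /dev/null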
Re: Class name of parsing the fq clause
Hi Jack, Thanks a lot for your explanation.
RE: Facet performance
On Mon, October 21, 2013 10:04 AM, Toke Eskildsen wrote: On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote: Toke Eskildsen wrote: Unfortunately the enum solution is normally quite slow when there are enough unique values to trigger the "too many values" exception. [...] [...] And yes, the fc method was terribly slow in a case where it did work. Something like 20 minutes whereas enum returned within a few seconds. Err.. What? That sounds _very_ strange. You have millions of unique values so fc should be a lot faster than enum, not the other way around. I assume the 20 minutes was for the first call. How fast does subsequent calls return for fc?

QTime enum: 1st call: 1200; subsequent calls: 200
QTime fc: never returns; the web server restarts itself after 30 min at 100% CPU load

This is on the test system; the production system managed to return with "Too many values for UnInvertedField faceting". However, I also have different faceting queries I played with today. One complete example:

q=ottomotor&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

These are the results, all with facet.method=enum (fc doesn't work). They were executed in the sequence shown, on an otherwise unused server:

QTime=41205   facet.prefix=    q=frequent_word         numFound=44532
Same query repeated:
QTime=225810  facet.prefix=    q=ottomotor             numFound=909
QTime=199839  facet.prefix=    q=ottomotor             numFound=909
QTime=0       facet.prefix=    q=ottomotor jkdhwjfh    numFound=0
QTime=0       facet.prefix=    q=jkdhwjfh              numFound=0
QTime=185948  facet.prefix=    q=ottomotor             numFound=909
QTime=3344    facet.prefix=d   q=ottomotor             numFound=909
QTime=3078    facet.prefix=d   q=ottomotor             numFound=909
QTime=3141    facet.prefix=d   q=ottomotor             numFound=909

The response time is obviously not dependent on the number of documents found. Caching doesn't kick in either. Maybe you could provide some approximate numbers? I'll try, see below. Thanks for asking and having a closer look.
- Documents in your index: 13,434,414
- Unique values in the CONTENT field: Not sure how to get this. In Luke I find 21,797,514 "term count CONTENT". Is that what you mean?
- Hits returned from a typical query: Hm, that can be anything between 0 and 40,000 or more. Or do you mean from the facets? Or do my tests above answer it?
- Xmx: The maximum the system allows me to get: 1612m.
Maybe I have a hopelessly under-dimensioned server for this sort of thing? Thanks a lot for your help, Michael
Re: Solr timeout after reboot
On 10/21/2013 8:03 AM, michael.boom wrote: I'm using the m3.xlarge server with 15G RAM, but my index size is over 100G, so I guess putting running the above command would bite all available memory. With a 100GB index, I would want a minimum server memory size of 64GB, and I would much prefer 128GB. If you shard your index, then each machine will require less memory, because each one will have less of the index onboard. Running a big Solr install is usually best handled on bare metal, because it loves RAM, and getting a lot of memory in a virtual environment is quite expensive. It's also expensive on bare metal too, but unlike Amazon, more memory doesn't increase your monthly cost. With only 15GB total RAM and an index that big, you're probably giving at least half of your RAM to Solr, leaving *very* little for the OS disk cache, compared to your index size. The ideal cache size is the same as your index size, but you can almost always get away with less. http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache If you try the cat trick with your numbers, it's going to take forever every time you run it, it will kill your performance while it's happening, and only the last few GB that it reads will remain in the OS disk cache. Chances are that it will be the wrong part of the index, too. You only want to cat your entire index if you have enough free RAM to *FIT* your entire index. If you *DO* have that much free memory (which for you would require a total RAM size of about 128GB), then the first time will take quite a while, but every time you do it after that, it will happen nearly instantly, because it will not have to actually read the disk at all. You could try only doing the cat on certain index files, but when you don't have enough cache for the entire index, running queries will do a better job of filling the cache intelligently. The first bunch of queries will be slow. Summary: You need more RAM. Quite a bit more RAM. Thanks, Shawn
Re: Local Solr and Webserver-Solr act differently ("and" treated like "or")
Did you completely reindex your data after emptying the stop words file? -- Jack Krupansky -----Original Message----- From: Stavros Delisavas Sent: Monday, October 21, 2013 10:05 AM To: solr-user@lucene.apache.org Subject: Re: Local Solr and Webserver-Solr act differently ("and" treated like "or")

Okay, I emptied the stopword file. I don't know where the word list came from; I have never seen it and never touched that file. Anyway... Now my queries do work with one stopword, like "in" or "to", but they still do not work when I use more than one stopword within one query. Instead of too many results I now get NO results at all. What could be the problem?

On 17.10.2013 15:02, Jack Krupansky wrote: The default Solr stopwords.txt file is empty, so SOMEBODY created that non-empty stop words file. The StopFilterFactory token filter in the field type analyzer controls stop word processing. You can remove that step entirely, or different field types can reference different stop word files, or some field type analyzers can use the stop filter and some can omit it. This does mean that you would have to use different field types for fields that want different stop word processing. -- Jack Krupansky -----Original Message----- From: Stavros Delisavas Sent: Thursday, October 17, 2013 3:27 AM To: solr-user@lucene.apache.org Subject: Re: Local Solr and Webserver-Solr act differently ("and" treated like "or")

Thank you, I found the file with the stopwords and noticed that my local file is empty (comments only) and the one on my webserver has a big list of English stopwords. That seems to be the problem. I think in general it is a good idea to use stopwords for random searches, but it is not useful in my special case. Is there a way to (de)activate stopwords query-wise? I would like to ignore stopwords when searching in titles, but use stopwords when users do a fulltext search on whole articles, etc. Thanks again, Stavros

On 17.10.2013 09:13, Upayavira wrote: Stopwords are small words such as "and", "the" or "is" that we might choose to exclude from our documents and queries because they are such common terms. Once you have stripped stop words from your above query, all that is left is the word "wild", or so is being suggested. Somewhere in your config, close to solrconfig.xml, you will find a file called something like stopwords.txt. Compare these files between your two systems. Upayavira

On Thu, Oct 17, 2013, at 07:18 AM, Stavros Delsiavas wrote: Unfortunately, I don't really know what stopwords are. I would like it to not ignore any words of my query. How/where can I change this stopwords behaviour?

On 16.10.2013 23:45, Jack Krupansky wrote: So, the stopwords.txt file is different between the two systems - the first has stop words but the second does not. Did you expect stop words to be removed, or not? -- Jack Krupansky -----Original Message----- From: Stavros Delsiavas Sent: Wednesday, October 16, 2013 5:02 PM To: solr-user@lucene.apache.org Subject: Re: Local Solr and Webserver-Solr act differently ("and" treated like "or")

Okay, I understand; here's the rawquerystring. It was at about line 3000:

<lst name="debug">
  <str name="rawquerystring">title:(into AND the AND wild*)</str>
  <str name="querystring">title:(into AND the AND wild*)</str>
  <str name="parsedquery">+title:wild*</str>
  <str name="parsedquery_toString">+title:wild*</str>

At this place the debug output DOES differ from the one on my local system, but I don't understand why. This is the local debug output:

<lst name="debug">
  <str name="rawquerystring">title:(into AND the AND wild*)</str>
  <str name="querystring">title:(into AND the AND wild*)</str>
  <str name="parsedquery">+title:into +title:the +title:wild*</str>
  <str name="parsedquery_toString">+title:into +title:the +title:wild*</str>

Why is that? Any ideas?

On 16.10.2013 21:03, Shawn Heisey wrote: On 10/16/2013 4:46 AM, Stavros Delisavas wrote: My local Solr gives me http://pastebin.com/Q6d9dFmZ and my webserver this: http://pastebin.com/q87WEjVA. I copied only the first few hundred lines (of more than 8000) because the webserver output was too big even for pastebin. On 16.10.2013 12:27, Erik Hatcher wrote: What does the debug output from debugQuery=true say between the two? What's really needed here is the first part of the debug section, which has rawquerystring, querystring, parsedquery, and parsedquery_toString. The info from your local Solr has this part, but what you pasted from the webserver one didn't include those parts, because it's further down than the first few hundred lines. Thanks, Shawn
Re: Local Solr and Webserver-Solr act differently ("and" treated like "or")
I did a full-import again. That solved the issue. I didn't know that the stopwords apply on the indexing itself too. Thanks a lot, Stavros Am 21.10.2013 17:13, schrieb Jack Krupansky: Did you completely reindex your data after emptying the stop words file? -- Jack Krupansky -Original Message- From: Stavros Delisavas Sent: Monday, October 21, 2013 10:05 AM To: solr-user@lucene.apache.org Subject: Re: Local Solr and Webserver-Solr act differently (and treated like or) Okay, I emtpied the stopword file. I don't know where the wordlist came from. I have never seen this and never touched that file. Anyways... Now my queries do work with one word, like in or to but the queries still do not work when I use more than one stopword within one query. Instead of too many results I now get NO results at all. What could be the problem? On 17.10.2013 15:02, Jack Krupansky wrote: The default Solr stopwords.txt file is empty, so SOMEBODY created that non-empty stop words file. The StopFilterFactory token filter in the field type analyzer controls stop word processing. You can remove that step entirely, or different field types can reference different stop word files, or some field type analyzers can use the stop filter and some would not have it. This does mean that you would have to use different field types for fields that want different stop word processing. -- Jack Krupansky -Original Message- From: Stavros Delisavas Sent: Thursday, October 17, 2013 3:27 AM To: solr-user@lucene.apache.org Subject: Re: Local Solr and Webserver-Solr act differently (and treated like or) Thank you, I found the file with the stopwords and noticed that my local file is empty (comments only) and the one on my webserver has a big list of english stopwords. That seems to be the problem. I think in general it is a good idea to use stopwords for random searches, but it is not usefull in my special case. Is there a way to (de)activate stopwords query-wise? Like I would like to ignore stopwords when searching in titles but I would like to use stopwords when users do a fulltext-search on whole articles, etc. Thanks again, Stavros On 17.10.2013 09:13, Upayavira wrote: Stopwords are small words such as and, the or is,that we might choose to exclude from our documents and queries because they are such common terms. Once you have stripped stop words from your above query, all that is left is the word wild, or so is being suggested. Somewhere in your config, close to solr config.xml, you will find a file called something like stopwords.txt. Compare these files between your two systems. Upayavira On Thu, Oct 17, 2013, at 07:18 AM, Stavros Delsiavas wrote: Unfortunatly, I don't really know what stopwords are. I would like it to not ignore any words of my query. How/Where can I change this stopwords-behaviour? Am 16.10.2013 23:45, schrieb Jack Krupansky: So, the stopwords.txt file is different between the two systems - the first has stop words but the second does not. Did you expect stop words to be removed, or not? -- Jack Krupansky -Original Message- From: Stavros Delsiavas Sent: Wednesday, October 16, 2013 5:02 PM To: solr-user@lucene.apache.org Subject: Re: Local Solr and Webserver-Solr act differently (and treated like or) Okay I understand, here's the rawquerystring. 
It was at about line 3000:
<lst name="debug">
  <str name="rawquerystring">title:(into AND the AND wild*)</str>
  <str name="querystring">title:(into AND the AND wild*)</str>
  <str name="parsedquery">+title:wild*</str>
  <str name="parsedquery_toString">+title:wild*</str>
At this place the debug output DOES differ from the one on my local system. But I don't understand why... This is the local debug output:
<lst name="debug">
  <str name="rawquerystring">title:(into AND the AND wild*)</str>
  <str name="querystring">title:(into AND the AND wild*)</str>
  <str name="parsedquery">+title:into +title:the +title:wild*</str>
  <str name="parsedquery_toString">+title:into +title:the +title:wild*</str>
Why is that? Any ideas? Am 16.10.2013 21:03, schrieb Shawn Heisey: On 10/16/2013 4:46 AM, Stavros Delisavas wrote: My local solr gives me: http://pastebin.com/Q6d9dFmZ and my webserver this: http://pastebin.com/q87WEjVA I copied only the first few hundred lines (of more than 8000) because the webserver output was too big even for pastebin. On 16.10.2013 12:27, Erik Hatcher wrote: What does the debug output from debugQuery=true say between the two? What's really needed here is the first part of the debug section, which has rawquerystring, querystring, parsedquery, and parsedquery_toString. The info from your local solr has this part, but what you pasted from the webserver one didn't include those parts, because it's further down than the first few hundred lines. Thanks, Shawn
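To see the stop filter doing this in isolation, here is a minimal, self-contained Lucene 4.x sketch; it assumes the webserver's analyzer uses a stop list like StopAnalyzer's default English set, which contains both "into" and "the":

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class StopwordDemo {
      public static void main(String[] args) throws Exception {
        // StopAnalyzer ships with a default English stop set containing "into" and "the"
        StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_44);
        TokenStream ts = analyzer.tokenStream("title", new StringReader("into the wild"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term.toString()); // prints only "wild"
        }
        ts.end();
        ts.close();
      }
    }

Run against a field type whose analyzer has no stop filter, the same input keeps all three tokens, which matches the local/webserver difference Stavros saw.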
SolrCloud performance in VM environment
Hi everyone, I've been working on an installation recently which uses SolrCloud to index 45M documents into 8 shards on 2 VMs running 64-bit Ubuntu (with another 2 identical VMs set up for replicas). The reason we're using so many shards for a relatively small index is that there are complex filtering requirements at search time, to restrict users to items they are licensed to view. Initial tests demonstrated that multiple shards would be required. The total size of the index is about 140GB, and each VM has 16GB RAM (32GB total) and 4 CPU units. I know this is far under what would normally be recommended for an index of this size, and I'm working on persuading the customer to increase the RAM (basically, telling them it won't work otherwise). Performance is currently pretty poor and I would expect more RAM to improve things. However, there are a couple of other oddities which concern me. The first is that I've been reindexing a fixed set of 500 docs to test indexing and commit performance (with soft commits within 60s). The time taken to complete a hard commit after this is longer than I'd expect, and highly variable - from 10s to 70s. This makes me wonder whether the SAN (which provides all the storage for these VMs and the customer's several other VMs) is being saturated periodically. I grabbed some iostat output on different occasions to (possibly) show the variability:
Device:  tps    Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdb      64.50  0.00        2476.00     0         4952
...
sdb      8.90   0.00        348.00      0         6960
...
sdb      1.15   0.00        43.20       0         864
The other thing that confuses me is that after a Solr restart or hard commit, search times average about 1.2s under light load. After searching the same set of queries for 5-6 iterations this improves to 0.1s. However, in either case - cold or warm - iostat reports no device reads at all:
Device:  tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdb      0.40  0.00        8.00        0         160
...
sdb      0.30  0.00        10.40       0         104
(the writes are due to logging). This implies to me that the 'hot' blocks are being completely cached in RAM - so why the variation in search time and the number of iterations required to speed it up? The Solr caches are only being used lightly by these tests and there are no evictions. GC is not a significant overhead. Each Solr shard runs in a separate JVM with 1GB heap. I don't have a great deal of experience in low-level performance tuning, so please forgive any naivety. Any ideas of what to do next would be greatly appreciated. I don't currently have details of the VM implementation but can get hold of this if it's relevant. thanks, Tom
RE: SolrCloud performance in VM environment
some basic tips.
- try to create enough shards that you can get the size of each index portion on the shard closer to the amount of RAM you have on each node (e.g. if you have a ~140GB index on 16GB nodes, try doing 12-16 shards)
- start with just the initial shards, add replicas later when you have dialed things in a bit more
- try to leave some memory for the OS as well as the JVM
- try starting with 1/2 of the total RAM on each VM allocated to the JVM as the Xmx value
- try setting Xms in the range of 0.75 to 1.0 of Xmx
- do all the normal JVM tuning, especially the part about capturing the GC events in a log so that you can see what is going on with Java itself... this will probably lead you to adjust your GC type, etc.
- make sure you aren't hammering your storage devices (or the interconnects between your servers and your storage)... the OS internal tools on the guest are helpful, but you probably want to look at the hypervisor and storage device layer directly as well. On VMware the built-in perf graphs for datastore latency and network throughput are easily observed; esxtop is the CLI tool which provides the same.
- if you are using a SAN, you probably want to make sure you have some sort of MPIO in place (especially if you are using 1Gb iSCSI)
From: Tom Mortimer tom.m.f...@gmail.com Sent: Monday, October 21, 2013 08:48 To: solr-user@lucene.apache.org Subject: SolrCloud performance in VM environment Hi everyone, I've been working on an installation recently which uses SolrCloud to index 45M documents into 8 shards on 2 VMs running 64-bit Ubuntu (with another 2 identical VMs set up for replicas). The reason we're using so many shards for a relatively small index is that there are complex filtering requirements at search time, to restrict users to items they are licensed to view. Initial tests demonstrated that multiple shards would be required. The total size of the index is about 140GB, and each VM has 16GB RAM (32GB total) and 4 CPU units. I know this is far under what would normally be recommended for an index of this size, and I'm working on persuading the customer to increase the RAM (basically, telling them it won't work otherwise). Performance is currently pretty poor and I would expect more RAM to improve things. However, there are a couple of other oddities which concern me. The first is that I've been reindexing a fixed set of 500 docs to test indexing and commit performance (with soft commits within 60s). The time taken to complete a hard commit after this is longer than I'd expect, and highly variable - from 10s to 70s. This makes me wonder whether the SAN (which provides all the storage for these VMs and the customer's several other VMs) is being saturated periodically. I grabbed some iostat output on different occasions to (possibly) show the variability:
Device:  tps    Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdb      64.50  0.00        2476.00     0         4952
...
sdb      8.90   0.00        348.00      0         6960
...
sdb      1.15   0.00        43.20       0         864
The other thing that confuses me is that after a Solr restart or hard commit, search times average about 1.2s under light load. After searching the same set of queries for 5-6 iterations this improves to 0.1s. However, in either case - cold or warm - iostat reports no device reads at all:
Device:  tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdb      0.40  0.00        8.00        0         160
...
sdb      0.30  0.00        10.40       0         104
(the writes are due to logging). This implies to me that the 'hot' blocks are being completely cached in RAM - so why the variation in search time and the number of iterations required to speed it up?
The Solr caches are only being used lightly by these tests and there are no evictions. GC is not a significant overhead. Each Solr shard runs in a separate JVM with 1GB heap. I don't have a great deal of experience in low-level performance tuning, so please forgive any naivety. Any ideas of what to do next would be greatly appreciated. I don't currently have details of the VM implementation but can get hold of this if it's relevant. thanks, Tom
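One way to put numbers on commit latency like this is a small SolrJ timing harness. A rough sketch against the Solr 4.x API - the URL and core name are placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class CommitTimer {
      public static void main(String[] args) throws Exception {
        // URL and core name are placeholders
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        long start = System.currentTimeMillis();
        // waitFlush=true, waitSearcher=true: returns once the new searcher is registered
        server.commit(true, true);
        System.out.println("hard commit took " + (System.currentTimeMillis() - start) + " ms");
        server.shutdown();
      }
    }

Running this repeatedly, during and outside the suspected SAN-contention windows, would show whether the 10s-70s variability tracks storage load.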
Re: Question about docvalues
I really don't understand the question. What behavior are you seeing that leads you to ask? bq: Is it necessary to duplicate the field and set indexed and stored to false and If this means setting _both_ indexed and stored to false, then you effectively throw the field completely away, there's no point in doing this. FWIW, Erick On Mon, Oct 21, 2013 at 1:39 PM, yriveiro yago.rive...@gmail.com wrote: Hi, If I have a field (named dv_field) configured to be indexed, stored and with docValues=true, how do I know that when I do a query like q=*:*&facet=true&facet.field=dv_field I'm really using the docValues and not the normal way? Is it necessary to duplicate the field and set indexed and stored to false, leaving the docValues property set to true? - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-docvalues-tp4096802.html Sent from the Solr - User mailing list archive at Nabble.com.
Pivot faceting not working after upgrading to 4.5
Hello, We have a rather weird behavior I don't really understand. As written in a few other threads, we're migrating from a master/slave setup running 4.3 to a SolrCloud setup running 4.5. Both run on the same data set (the 4.5 instances have been re-indexed under 4.5, obviously). The following query works fine under our 4.3 setup: ?q=*:*&facet.pivot=facet_category,facet_platform&facet=true&rows=0 However, in our 4.5 setup, the facet_pivot entry in the facet_count is straight up missing from the response. I've been digging around the logs for a bit, but I'm unable to find anything relating to this. If I remove one of the facet.pivot elements (i.e. only having facet.pivot=facet_category) I get an error as expected, so that part of the component is at least working. Does anyone have an idea of something obvious I might have missed? I've been unable to find any change logs suggesting changes to this part of the facet component. Thanks. Regards, Henrik
Re: Pivot faceting not working after upgrading to 4.5
I realise now that distributed pivot faceting is not implemented yet in SolrCloud, after some digging through the internet. Apologies :) Den 21/10/2013 kl. 18.20 skrev Henrik Ossipoff Hansen h...@entertainment-trading.com: Hello, We have a rather weird behavior I don't really understand. As written in a few other threads, we're migrating from a master/slave setup running 4.3 to a SolrCloud setup running 4.5. Both run on the same data set (the 4.5 instances have been re-indexed under 4.5, obviously). The following query works fine under our 4.3 setup: ?q=*:*&facet.pivot=facet_category,facet_platform&facet=true&rows=0 However, in our 4.5 setup, the facet_pivot entry in the facet_count is straight up missing from the response. I've been digging around the logs for a bit, but I'm unable to find anything relating to this. If I remove one of the facet.pivot elements (i.e. only having facet.pivot=facet_category) I get an error as expected, so that part of the component is at least working. Does anyone have an idea of something obvious I might have missed? I've been unable to find any change logs suggesting changes to this part of the facet component. Thanks. Regards, Henrik
Re: Question about docvalues
Sorry if I didn't make myself understood; my English is not too good. My goal is to remove pressure from the heap: my indexes are too big, the heap gets full very quickly, and I get an OOM. I read about docValues stored on disk, but I don't know how to configure it. I read this link: https://cwiki.apache.org/confluence/display/solr/DocValues#DocValues-HowtoUseDocValues which has an example of how to configure a field to use docValues: <field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" /> With this configuration it is obvious that I will use docValues. Q: With this configuration, can I retrieve the field value in a normal search, or does it still need to be stored? If I have a field configured as: <field name="manu_exact" type="string" indexed="true" stored="true" docValues="true" /> and I do a facet query on the manu_exact field: q=*:*&facet=true&facet.field=manu_exact Q: Do I leverage the docValues feature? That is, does docValues always take precedence over the regular faceting method when it is set? Q: Does it make sense for the field to be indexed if I have docValues? -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, October 21, 2013 at 5:10 PM, Erick Erickson wrote: I really don't understand the question. What behavior are you seeing that leads you to ask? bq: Is it necessary to duplicate the field and set indexed and stored to false and If this means setting _both_ indexed and stored to false, then you effectively throw the field completely away, there's no point in doing this. FWIW, Erick On Mon, Oct 21, 2013 at 1:39 PM, yriveiro yago.rive...@gmail.com (mailto:yago.rive...@gmail.com) wrote: Hi, If I have a field (named dv_field) configured to be indexed, stored and with docValues=true, how do I know that when I do a query like q=*:*&facet=true&facet.field=dv_field I'm really using the docValues and not the normal way? Is it necessary to duplicate the field and set indexed and stored to false, leaving the docValues property set to true? - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-docvalues-tp4096802.html Sent from the Solr - User mailing list archive at Nabble.com (http://Nabble.com).
Re: Question about docvalues
Hello Yago, To my knowledge, in facet calculations docValues take precedence over other methods. So, even if your field is also stored and indexed, your facets won't use the inverted index or fieldValueCache when docValues are present. I think you will still have to store and index the field to maintain your other functionality. DocValues are helpful only for facets and sorting, to my knowledge. Hope this helps, Gun Akkor www.carbonblack.com Sent from my iPhone On Oct 21, 2013, at 12:41 PM, Yago Riveiro yago.rive...@gmail.com wrote: Sorry if I didn't make myself understood; my English is not too good. My goal is to remove pressure from the heap: my indexes are too big, the heap gets full very quickly, and I get an OOM. I read about docValues stored on disk, but I don't know how to configure it. I read this link: https://cwiki.apache.org/confluence/display/solr/DocValues#DocValues-HowtoUseDocValues which has an example of how to configure a field to use docValues: <field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" /> With this configuration it is obvious that I will use docValues. Q: With this configuration, can I retrieve the field value in a normal search, or does it still need to be stored? If I have a field configured as: <field name="manu_exact" type="string" indexed="true" stored="true" docValues="true" /> and I do a facet query on the manu_exact field: q=*:*&facet=true&facet.field=manu_exact Q: Do I leverage the docValues feature? That is, does docValues always take precedence over the regular faceting method when it is set? Q: Does it make sense for the field to be indexed if I have docValues? -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, October 21, 2013 at 5:10 PM, Erick Erickson wrote: I really don't understand the question. What behavior are you seeing that leads you to ask? bq: Is it necessary to duplicate the field and set indexed and stored to false and If this means setting _both_ indexed and stored to false, then you effectively throw the field completely away, there's no point in doing this. FWIW, Erick On Mon, Oct 21, 2013 at 1:39 PM, yriveiro yago.rive...@gmail.com (mailto:yago.rive...@gmail.com) wrote: Hi, If I have a field (named dv_field) configured to be indexed, stored and with docValues=true, how do I know that when I do a query like q=*:*&facet=true&facet.field=dv_field I'm really using the docValues and not the normal way? Is it necessary to duplicate the field and set indexed and stored to false, leaving the docValues property set to true? - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-docvalues-tp4096802.html Sent from the Solr - User mailing list archive at Nabble.com (http://Nabble.com).
Re: Exact Match Results
You need to provide us with the fieldType information. If you just want to match the phrase entered by the user, you can use KeywordTokenizerFactory. Reference: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Creates org.apache.lucene.analysis.core.KeywordTokenizer. Treats the entire field as a single token, regardless of its content. Example: "http://example.com/I-am+example?Text=-Hello" == "http://example.com/I-am+example?Text=-Hello" -- View this message in context: http://lucene.472066.n3.nabble.com/Exact-Match-Results-tp4096816p4096846.html Sent from the Solr - User mailing list archive at Nabble.com.
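For illustration, a minimal Lucene 4.x sketch of that behaviour - the input string is just an example, not from the thread:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class KeywordTokenizerDemo {
      public static void main(String[] args) throws Exception {
        // KeywordTokenizer emits the entire input as one token, spaces and all
        Tokenizer tok = new KeywordTokenizer(new StringReader("Okkadu telugu movie stills"));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
          System.out.println(term.toString()); // prints the whole string exactly once
        }
        tok.end();
        tok.close();
      }
    }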
Re: Question about docvalues
Hi Gun, Thanks for the response. Indeed, I only want docValues to do facets. IMHO, a reference to the fact that docValues take precedence over other methods is needed; it is not always obvious. -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, October 21, 2013 at 5:53 PM, Gun Akkor wrote: Hello Yago, To my knowledge, in facet calculations docValues take precedence over other methods. So, even if your field is also stored and indexed, your facets won't use the inverted index or fieldValueCache when docValues are present. I think you will still have to store and index the field to maintain your other functionality. DocValues are helpful only for facets and sorting, to my knowledge. Hope this helps, Gun Akkor www.carbonblack.com (http://www.carbonblack.com) Sent from my iPhone On Oct 21, 2013, at 12:41 PM, Yago Riveiro yago.rive...@gmail.com (mailto:yago.rive...@gmail.com) wrote: Sorry if I didn't make myself understood; my English is not too good. My goal is to remove pressure from the heap: my indexes are too big, the heap gets full very quickly, and I get an OOM. I read about docValues stored on disk, but I don't know how to configure it. I read this link: https://cwiki.apache.org/confluence/display/solr/DocValues#DocValues-HowtoUseDocValues which has an example of how to configure a field to use docValues: <field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" /> With this configuration it is obvious that I will use docValues. Q: With this configuration, can I retrieve the field value in a normal search, or does it still need to be stored? If I have a field configured as: <field name="manu_exact" type="string" indexed="true" stored="true" docValues="true" /> and I do a facet query on the manu_exact field: q=*:*&facet=true&facet.field=manu_exact Q: Do I leverage the docValues feature? That is, does docValues always take precedence over the regular faceting method when it is set? Q: Does it make sense for the field to be indexed if I have docValues? -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, October 21, 2013 at 5:10 PM, Erick Erickson wrote: I really don't understand the question. What behavior are you seeing that leads you to ask? bq: Is it necessary to duplicate the field and set indexed and stored to false and If this means setting _both_ indexed and stored to false, then you effectively throw the field completely away, there's no point in doing this. FWIW, Erick On Mon, Oct 21, 2013 at 1:39 PM, yriveiro yago.rive...@gmail.com (mailto:yago.rive...@gmail.com) wrote: Hi, If I have a field (named dv_field) configured to be indexed, stored and with docValues=true, how do I know that when I do a query like q=*:*&facet=true&facet.field=dv_field I'm really using the docValues and not the normal way? Is it necessary to duplicate the field and set indexed and stored to false, leaving the docValues property set to true? - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-docvalues-tp4096802.html Sent from the Solr - User mailing list archive at Nabble.com (http://Nabble.com).
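For completeness, the facet query discussed in this thread looks like the following in SolrJ 4.x. This is only a sketch - the URL and collection name are placeholders - and, per the above, the faceting reads docValues whenever the field has them:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DocValuesFacetDemo {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);                  // facet counts only, no documents
        query.setFacet(true);
        query.addFacetField("manu_exact"); // equivalent to facet.field=manu_exact
        QueryResponse response = server.query(query);
        for (FacetField.Count count : response.getFacetField("manu_exact").getValues()) {
          System.out.println(count.getName() + ": " + count.getCount());
        }
        server.shutdown();
      }
    }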
Re: Exact Match Results
Hi, I am using the following field type configuration:
<field name="fsw_title" type="text_full_startwith_match" indexed="true" stored="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />
<fieldType name="text_full_startwith_match" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
    <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms_fsw.txt" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
-- View this message in context: http://lucene.472066.n3.nabble.com/Exact-Match-Results-tp4096816p4096847.html Sent from the Solr - User mailing list archive at Nabble.com.
Custom FunctionQuery Guide/Tutorial (4.3.0+) ?
Does anyone have a good link to a guide / tutorial /etc. for writing a custom function query in Solr 4? The tutorials I've seen vary from showing half the code to being written for older versions of Solr. Any type of pointers would be appreciated, thanks.
Re: Solr timeout after reboot
Hi Michael, I agree with Shawn, don't listen to Peter ;) but only this once - he's a smart guy, as you can see in list archives. And I disagree with Shawn, again, only just this once and only somewhat. :) Because: In general, Shawn's advice is correct, but we have no way of knowing your particular details. To illustrate the point, let me use an extreme case where you have just one query that you hammer your servers with. Your Solr caches will be well utilized and your servers will not really need a lot of memory to cache your 100GB index, because only a small portion of it will ever be accessed. Of course, this is an extreme case and not realistic, but I think it helps one understand how, as the number of distinct queries grows (and thus also the number of distinct documents being matched and returned), the need for more and more memory goes up. So the question is where exactly your particular application falls. You mentioned stress testing. Just as you have a real index there (I am assuming), you need to have your real queries, too - real volume, real diversity, real rate, real complexity, real or as close to real everything. Since you are using SPM, you should be able to go to various graphs in SPM and look for a little ambulance icon above each graph. Use that to assemble a message with the N graphs you want us to look at and we'll be able to help more. Graphs that may be of interest here are your Solr cache graphs, disk IO, and memory graphs -- taken during your realistic stress testing, of course. You can then send that message directly to solr-user, assuming your SPM account email address is subscribed to the list. Or you can paste it into a new email, up to you. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/ On Mon, Oct 21, 2013 at 11:07 AM, Shawn Heisey s...@elyograg.org wrote: On 10/21/2013 8:03 AM, michael.boom wrote: I'm using the m3.xlarge server with 15G RAM, but my index size is over 100G, so I guess running the above command would bite all available memory. With a 100GB index, I would want a minimum server memory size of 64GB, and I would much prefer 128GB. If you shard your index, then each machine will require less memory, because each one will have less of the index onboard. Running a big Solr install is usually best handled on bare metal, because it loves RAM, and getting a lot of memory in a virtual environment is quite expensive. It's also expensive on bare metal, but unlike Amazon, more memory doesn't increase your monthly cost. With only 15GB total RAM and an index that big, you're probably giving at least half of your RAM to Solr, leaving *very* little for the OS disk cache compared to your index size. The ideal cache size is the same as your index size, but you can almost always get away with less. http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache If you try the cat trick with your numbers, it's going to take forever every time you run it, it will kill your performance while it's happening, and only the last few GB that it reads will remain in the OS disk cache. Chances are that it will be the wrong part of the index, too. You only want to cat your entire index if you have enough free RAM to *FIT* your entire index.
If you *DO* have that much free memory (which for you would require a total RAM size of about 128GB), then the first time will take quite a while, but every time you do it after that, it will happen nearly instantly, because it will not have to actually read the disk at all. You could try only doing the cat on certain index files, but when you don't have enough cache for the entire index, running queries will do a better job of filling the cache intelligently. The first bunch of queries will be slow. Summary: You need more RAM. Quite a bit more RAM. Thanks, Shawn
Re: Custom FunctionQuery Guide/Tutorial (4.3.0+) ?
Take a look at the unit tests for various value sources, and find a Jira that added some value source and look at the patch for what changes had to be made. -- Jack Krupansky -Original Message- From: JT Sent: Monday, October 21, 2013 1:17 PM To: solr-user@lucene.apache.org Subject: Custom FunctionQuery Guide/Tutorial (4.3.0+) ? Does anyone have a good link to a guide / tutorial /etc. for writing a custom function query in Solr 4? The tutorials I've seen vary from showing half the code to being written for older versions of Solr. Any type of pointers would be appreciated, thanks.
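As a starting skeleton (not from any official tutorial - the class and function names here are invented), a custom function query in Solr 4.3+ boils down to a ValueSourceParser that returns a ValueSource. The sketch below simply returns the string length of its argument:

    import org.apache.lucene.queries.function.FunctionValues;
    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.queries.function.valuesource.SimpleFloatFunction;
    import org.apache.solr.search.FunctionQParser;
    import org.apache.solr.search.SyntaxError;
    import org.apache.solr.search.ValueSourceParser;

    public class StrLengthValueSourceParser extends ValueSourceParser {
      @Override
      public ValueSource parse(FunctionQParser fp) throws SyntaxError {
        ValueSource source = fp.parseValueSource(); // the wrapped field or function
        return new SimpleFloatFunction(source) {
          @Override
          protected String name() {
            return "strlength";
          }
          @Override
          protected float func(int doc, FunctionValues vals) {
            String value = vals.strVal(doc);
            return value == null ? 0f : value.length();
          }
        };
      }
    }

Assuming it compiles against your Solr version, it would be registered in solrconfig.xml with a valueSourceParser element (e.g. name="strlength" pointing at the class), making strlength(somefield) usable in function queries and sorts. Verify the details against the value-source unit tests mentioned above before relying on it.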
Re: SolrCloud performance in VM environment
On 10/21/2013 9:48 AM, Tom Mortimer wrote: Hi everyone, I've been working on an installation recently which uses SolrCloud to index 45M documents into 8 shards on 2 VMs running 64-bit Ubuntu (with another 2 identical VMs set up for replicas). The reason we're using so many shards for a relatively small index is that there are complex filtering requirements at search time, to restrict users to items they are licensed to view. Initial tests demonstrated that multiple shards would be required. The total size of the index is about 140GB, and each VM has 16GB RAM (32GB total) and 4 CPU units. I know this is far under what would normally be recommended for an index of this size, and I'm working on persuading the customer to increase the RAM (basically, telling them it won't work otherwise). Performance is currently pretty poor and I would expect more RAM to improve things. However, there are a couple of other oddities which concern me. Running multiple shards like you are, where each operating system is handling more than one shard, is only going to perform better if your query volume is low and you have lots of CPU cores. If your query volume is high or you only have 2-4 CPU cores on each VM, you might be better off with fewer shards or not sharded at all. The way that I read this is that you've got two physical machines with 32GB RAM, each running two VMs that have 16GB. Each VM houses 4 shards, or 70GB of index. There's a scenario that might be better if all of the following are true:
1) I'm right about how your hardware is provisioned.
2) You or the client owns the hardware.
3) You have an extremely low-end third machine available - single CPU with 1GB of RAM would probably be enough.
In this scenario, you run one Solr instance and one zookeeper instance on each of your two big machines, and use the third wimpy machine as a third zookeeper node. No virtualization. For the rest of my reply, I'm assuming that you haven't taken this step, but it will probably apply either way. The first is that I've been reindexing a fixed set of 500 docs to test indexing and commit performance (with soft commits within 60s). The time taken to complete a hard commit after this is longer than I'd expect, and highly variable - from 10s to 70s. This makes me wonder whether the SAN (which provides all the storage for these VMs and the customer's several other VMs) is being saturated periodically. I grabbed some iostat output on different occasions to (possibly) show the variability:
Device:  tps    Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdb      64.50  0.00        2476.00     0         4952
...
sdb      8.90   0.00        348.00      0         6960
...
sdb      1.15   0.00        43.20       0         864
There are two likely possibilities for this. One or both of them might be in play.
1) Because the OS disk cache is small, not much of the index can be cached. This can result in a lot of disk I/O for a commit, slowing things way down. Increasing the size of the OS disk cache is really the only solution for that.
2) Cache autowarming, particularly the filter cache. In the cache statistics, you can see how long each cache took to warm up after the last searcher was opened. The solution for that is to reduce the autowarmCount values.
The other thing that confuses me is that after a Solr restart or hard commit, search times average about 1.2s under light load. After searching the same set of queries for 5-6 iterations this improves to 0.1s. However, in either case - cold or warm - iostat reports no device reads at all:
Device:  tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdb      0.40  0.00        8.00        0         160
...
sdb      0.30  0.00        10.40       0         104
(the writes are due to logging). This implies to me that the 'hot' blocks are being completely cached in RAM - so why the variation in search time and the number of iterations required to speed it up? Linux is pretty good about making limited OS disk cache resources work. Sounds like the caching is working reasonably well for queries. It might not be working so well for updates or commits, though. Running multiple Solr JVMs per machine, virtual or not, causes more problems than it solves. Solr has no limits on the number of cores (shard replicas) per instance, assuming there are enough system resources. There should be exactly one Solr JVM per operating system. Running more than one results in quite a lot of overhead, and your memory is precious. When you create a collection, you can give the collections API the maxShardsPerNode parameter to create more than one shard per instance. I don't have a great deal of experience in low-level performance tuning, so please forgive any naivety. Any ideas of what to do next would be greatly appreciated. I don't currently have details of the VM implementation but can get hold of this if it's relevant. thanks, Tom
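For reference, maxShardsPerNode is passed on the CREATE call. An illustrative invocation (host and collection name are hypothetical) that packs 8 shards onto 2 nodes: http://host:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=8&replicationFactor=1&maxShardsPerNode=4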
Re: Custom FunctionQuery Guide/Tutorial (4.3.0+) ?
Hi Jack, Do you have a date for the new version of your book: solr_4x_deep_dive_early_access? Thanks, Fudong On Mon, Oct 21, 2013 at 10:39 AM, Jack Krupansky j...@basetechnology.com wrote: Take a look at the unit tests for various value sources, and find a Jira that added some value source and look at the patch for what changes had to be made. -- Jack Krupansky -Original Message- From: JT Sent: Monday, October 21, 2013 1:17 PM To: solr-user@lucene.apache.org Subject: Custom FunctionQuery Guide/Tutorial (4.3.0+) ? Does anyone have a good link to a guide / tutorial / etc. for writing a custom function query in Solr 4? The tutorials I've seen vary from showing half the code to being written for older versions of Solr. Any type of pointers would be appreciated, thanks.
Re: Custom FunctionQuery Guide/Tutorial (4.3.0+) ?
Hopefully at the end of the week. -- Jack Krupansky -Original Message- From: fudong li Sent: Monday, October 21, 2013 1:45 PM To: solr-user@lucene.apache.org Subject: Re: Custom FunctionQuery Guide/Tutorial (4.3.0+) ? Hi Jack, Do you have a date for the new version of your book: solr_4x_deep_dive_early_access? Thanks, Fudong On Mon, Oct 21, 2013 at 10:39 AM, Jack Krupansky j...@basetechnology.comwrote: Take a look at the unit tests for various value sources, and find a Jira that added some value source and look at the patch for what changes had to be made. -- Jack Krupansky -Original Message- From: JT Sent: Monday, October 21, 2013 1:17 PM To: solr-user@lucene.apache.org Subject: Custom FunctionQuery Guide/Tutorial (4.3.0+) ? Does anyone have a good link to a guide / tutorial /etc. for writing a custom function query in Solr 4? The tutorials I've seen vary from showing half the code to being written for older versions of Solr. Any type of pointers would be appreciated, thanks.
reindexing data
In Solr 4.5, I'm trying to create a new collection on the fly. I have a data dir with the index that should be in there, but the CREATE command makes the directory be: collection name_shard1_replicant# I was hoping that making a collection named something would use a directory with that name, to let me use the data that I already have to fill the collection. I could go and just make each one (name_shard_replicant[1,2,3]), but I was hoping there may be an easier way of doing this. Sorry if this is confusing (it is Monday); I can try to clarify if needed. Thanks. -- Chris
Re: Questions developing custom functionquery
I would agree the right way to do this is probably just to add the information I wish to sort on directly, as a date field or something like that. The issue is we currently have ~300M documents that are already indexed. Not all of the fields have stored=true (for good reason: we maintain the documents externally, about 7TB worth, and I didn't want to replicate 7TB of data twice), so we cannot update these indexed values. I was hoping to spend 2-3 days writing a custom query to avoid 2+ months of indexing everything all over again. So let me just ask this question: given my current situation, let's say you had the following field: <str name="resourcename">/path/to/file/month/day/year/file.txt</str> I simply want to extract the month/day/year and sort based on that. My current plan was to convert the month, day, year into seconds from right now, and return that number. Thus sorting ascending, it should return newest documents first. -JT On Fri, Oct 18, 2013 at 3:14 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Field-Type: org.apache.solr.schema.TextField ... : DocTermsIndexDocValues http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-queries/4.3.0/org/apache/lucene/queries/function/docvalues/DocTermsIndexDocValues.java#DocTermsIndexDocValues . : Calling getVal() on a DocTermsIndexDocValues does some really weird stuff : that I really don't understand. Your TextField is being analyzed in some way you haven't clarified, and the DocTermsIndexDocValues you get contains the details of each term in that TextField : It's possible I'm going about this wrong and need to re-do my approach. I'm : just currently at a loss for what that approach is. Based on your initial goal, you are most certainly going about this in a much more complicated way than you need to... : My goal is to be able to implement a custom sorting technique. : Example: <str name="resname">/some example/data/here/2013/09/12/testing.text</str> : I would like to do a custom sort based on this resname field. : Basically, I would like to parse out that date there (2013/09/12) and sort : on that date. You are going to be *MUCH* happier (both in terms of effort, and in terms of performance) if, instead of writing a custom function to parse strings at query time when sorting, you implement the parsing logic when indexing the doc and index it up front as a date field that you can sort on. I would suggest something like CloneFieldUpdateProcessorFactory + RegexReplaceProcessorFactory could save you the work of needing to implement any custom logic -- but as Jack pointed out in SOLR-4864 it doesn't currently allow you to do capture group replacements (but maybe you could contribute a patch to fix that instead of needing to write completely custom code for yourself). Or maybe, as is, you could use RegexReplaceProcessorFactory to throw away non-digits and then use ParseDateFieldUpdateProcessorFactory to get what you want? (I'm not certain - i haven't played with ParseDateFieldUpdateProcessorFactory much) https://issues.apache.org/jira/browse/SOLR-4864 https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html -Hoss
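If the custom-function route wins out, the heart of it is just turning the path segment into a number that sorts correctly. A rough sketch of that piece, assuming the yyyy/MM/dd layout from the quoted example (class and method names are invented; this logic would sit inside the ValueSource's per-document method):

    import java.text.SimpleDateFormat;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PathDateSortKey {
      // matches .../2013/09/12/testing.text style paths
      private static final Pattern DATE = Pattern.compile("/(\\d{4})/(\\d{2})/(\\d{2})/[^/]+$");

      /** Seconds elapsed from the path's date to now; smaller = newer, so ascending sort returns newest first. */
      public static long secondsSinceDocDate(String resourcename) throws java.text.ParseException {
        Matcher m = DATE.matcher(resourcename);
        if (!m.find()) {
          return Long.MAX_VALUE; // undated documents sort last in ascending order
        }
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
        long docMillis = fmt.parse(m.group(1) + "/" + m.group(2) + "/" + m.group(3)).getTime();
        return (System.currentTimeMillis() - docMillis) / 1000L;
      }
    }

Hoss's point stands, though: doing this once at index time, into a real date field, is far cheaper than re-parsing the string for every document on every sorted query.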
How to extract a field with a prefixed dimension?
Hi, I'm new to Solr. I use the content field to extract the text of Solr documents, but this field is too long. Is there a way to extract only a substring of this field? I make my query in Java as follows:
SolrQuery querySolr = new SolrQuery();
querySolr.setQuery("*:*");
querySolr.setRows(3);
querySolr.setParam("wt", "json");
querySolr.addField("content");
querySolr.addField("title");
querySolr.addField("url");
Any ideas? Thanks, Danilo -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-extract-a-field-with-a-prefixed-dimension-tp4096877.html Sent from the Solr - User mailing list archive at Nabble.com.
External Zookeeper and JBOSS
When I use the Zookeeper CLI utility, I'm not sure if the configuration is uploading correctly. How can I tell? This is the command I am issuing - ./zkCli.sh -cmd upconfig -server 127.0.0.1:2181 -confdir /data/v8p/solr/root/conf -confname defaultconfig -solrhome /data/v8p/solr Then checking with this - [zk: localhost:2181(CONNECTED) 0] ls / [aliases.json, live_nodes, overseer, overseer_elect, collections, zookeeper, clusterstate.json] But I don't see any config node. One thing to note - I have multiple cores, but the configs are located in a common dir. Maybe that is causing a problem. solr.xml [simplified by removing additional cores]:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib" zkHost="192.168.1.101:2181">
  <cores adminPath="/admin/cores">
    <core schema="/data/v8p/solr/root/schema/schema.xml" instanceDir="/data/v8p/solr/root/" name="wdsp" dataDir="/data/v8p/solr/wdsp2/data/" />
    <core schema="/data/v8p/solr/root/schema/schema.xml" instanceDir="/data/v8p/solr/root/" name="wdsp2" dataDir="/data/v8p/solr/wdsp/data/" />
  </cores>
</solr>
Am I overlooking something obvious? Thanks! Jeremy D. Branham Performance Technologist II Sprint University Performance Support Fort Worth, TX | Tel: **DOTNET http://JeremyBranham.Wordpress.com http://www.linkedin.com/in/jeremybranham This e-mail may contain Sprint proprietary information intended for the sole use of the recipient(s). Any use by others is prohibited. If you are not the intended recipient, please contact the sender and delete all copies of the message.
RE: External Zookeeper and JBOSS
I've made progress... Rather than using the zkCli.sh in the Zookeeper bin folder, I used the Java libs from Solr, and the config now shows up. Jeremy D. Branham Performance Technologist II Sprint University Performance Support Fort Worth, TX | Tel: **DOTNET http://JeremyBranham.Wordpress.com http://www.linkedin.com/in/jeremybranham -Original Message- From: Branham, Jeremy [HR] Sent: Monday, October 21, 2013 2:20 PM To: SOLR User distro (solr-user@lucene.apache.org) Subject: External Zookeeper and JBOSS When I use the Zookeeper CLI utility, I'm not sure if the configuration is uploading correctly. How can I tell? This is the command I am issuing - ./zkCli.sh -cmd upconfig -server 127.0.0.1:2181 -confdir /data/v8p/solr/root/conf -confname defaultconfig -solrhome /data/v8p/solr Then checking with this - [zk: localhost:2181(CONNECTED) 0] ls / [aliases.json, live_nodes, overseer, overseer_elect, collections, zookeeper, clusterstate.json] But I don't see any config node. One thing to note - I have multiple cores, but the configs are located in a common dir. Maybe that is causing a problem. solr.xml [simplified by removing additional cores]:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib" zkHost="192.168.1.101:2181">
  <cores adminPath="/admin/cores">
    <core schema="/data/v8p/solr/root/schema/schema.xml" instanceDir="/data/v8p/solr/root/" name="wdsp" dataDir="/data/v8p/solr/wdsp2/data/" />
    <core schema="/data/v8p/solr/root/schema/schema.xml" instanceDir="/data/v8p/solr/root/" name="wdsp2" dataDir="/data/v8p/solr/wdsp/data/" />
  </cores>
</solr>
Am I overlooking something obvious? Thanks! Jeremy D. Branham Performance Technologist II Sprint University Performance Support Fort Worth, TX | Tel: **DOTNET http://JeremyBranham.Wordpress.com http://www.linkedin.com/in/jeremybranham This e-mail may contain Sprint proprietary information intended for the sole use of the recipient(s). Any use by others is prohibited. If you are not the intended recipient, please contact the sender and delete all copies of the message.
Re: External Zookeeper and JBOSS
On 10/21/2013 1:19 PM, Branham, Jeremy [HR] wrote: solr.xml [simplified by removing additional cores]:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib" zkHost="192.168.1.101:2181">
  <cores adminPath="/admin/cores">
    <core schema="/data/v8p/solr/root/schema/schema.xml" instanceDir="/data/v8p/solr/root/" name="wdsp" dataDir="/data/v8p/solr/wdsp2/data/" />
    <core schema="/data/v8p/solr/root/schema/schema.xml" instanceDir="/data/v8p/solr/root/" name="wdsp2" dataDir="/data/v8p/solr/wdsp/data/" />
  </cores>
</solr>
These cores that you have listed here do not look like SolrCloud-related cores, because they do not reference a collection or a shard. Here's what I've got on a 4.2.1 box where all cores were automatically created by the CREATE action on the collections API:
<core schema="schema.xml" loadOnStartup="true" shard="shard1" instanceDir="eatatjoes_shard1_replica2/" transient="false" name="eatatjoes_shard1_replica2" config="solrconfig.xml" collection="eatatjoes"/>
<core schema="schema.xml" loadOnStartup="true" shard="shard1" instanceDir="test3_shard1_replica1/" transient="false" name="test3_shard1_replica1" config="solrconfig.xml" collection="test3"/>
<core schema="schema.xml" loadOnStartup="true" shard="shard1" instanceDir="smb2_shard1_replica1/" transient="false" name="smb2_shard1_replica1" config="solrconfig.xml" collection="smb2"/>
On the command-line script -- the zkCli.sh script comes with Zookeeper, but it is not aware of anything having to do with SolrCloud. There is another script named zkcli.sh (note the lowercase C) that comes with the Solr example (in example/cloud-scripts) - it's a very different script and will accept the options that you tried to give. I do wonder how much pain would be caused by renaming the Solr zkcli script so it's not so similar to the one that comes with Zookeeper. Thanks, Shawn
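For anyone finding this thread later, the Solr variant of the command would look roughly like the following; the flags are per the Solr 4.x cloud-scripts version, and the paths and config name are simply Jeremy's values re-used: example/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd upconfig -confdir /data/v8p/solr/root/conf -confname defaultconfig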
Major GC does not reduce the old gen size
Hello everyone, We are using Solr 4.4 in production with 4 shards. These are our memory settings:
-d64 -server -Xms8192m -Xmx12288m -XX:MaxPermSize=256m \
-XX:NewRatio=1 -XX:SurvivorRatio=6 \
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:CMSIncrementalDutyCycleMin=0 \
-XX:CMSIncrementalDutyCycle=10 -XX:+CMSIncrementalPacing \
-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC \
-XX:+CMSClassUnloadingEnabled -XX:+DisableExplicitGC \
-XX:+UseLargePages \
-XX:+UseParNewGC \
-XX:ConcGCThreads=10 \
-XX:ParallelGCThreads=10 \
-XX:MaxGCPauseMillis=3 \
I notice in production that the old generation becomes full and no amount of garbage collection will free up the memory. This is similar to the issue discussed in this link: http://grokbase.com/t/lucene/solr-user/12bwydq5jr/permanently-full-old-generation Did anyone have this problem? Can you please point out anything wrong with the GC configuration? -- View this message in context: http://lucene.472066.n3.nabble.com/Major-GC-does-not-reduce-the-old-gen-size-tp4096880.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: measure result set quality
Thanks for your valuable answers. As a first approach I will evaluate (manually :( ) hits that are outside the intersection set for every query in each system. Anyway, I will keep searching for literature in the field. Regards. On Sun, Oct 20, 2013 at 10:55 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: That's exactly what we advocate for in our Solr work. We call it Test Driven Relevancy. We work closely with content experts to help build collaboration around search quality. (Disclaimer: yes, we build a product around this, but the advice still stands regardless.) http://www.opensourceconnections.com/2013/10/14/what-is-test-driven-search-relevancy/ Cheers -Doug Turnbull Search Relevancy Expert OpenSource Connections On Sun, Oct 20, 2013 at 4:21 PM, Furkan KAMACI furkankam...@gmail.com wrote: Let's assume that you have keywords to search and different configurations for indexing. A/B testing is one of the techniques that you can use, as Erick mentioned. If you want to have an automated comparison and do not have an oracle for A/B testing, there is another way. If you have an ideal result list, you can compare the similarity between your different configurations' results and that ideal result list. The ideal result list can be created by an expert, just once. If you are developing a search engine, you can search the same keywords on one of the existing search engines and use those results as the ideal result list to measure your result lists' similarities. Kendall's tau is one of the methods to use for such situations. If you do not have any document duplication in your index (without any other versions) I suggest using tau-a. If you explain your system, and what is good or ideal for you, I can explain more. Thanks; Furkan KAMACI 2013/10/18 Erick Erickson erickerick...@gmail.com bq: How do you compare the quality of your search result in order to decide which schema is better? Well, that's actually a hard problem. There's the various TREC data, but that's a generic solution and most every individual application of this generic thing called search has its own version of good results. Note that scores are NOT comparable across different queries even in the same data set, so don't go down that path. I'd fire the question back at you: can you define what good (or better) results are in such a way that you can program an evaluation? Often the answer is no... One common technique is to have knowledgeable users do what's called A/B testing. You fire the query at two separate Solr instances and display the results side-by-side, and the user says A is more relevant, or B is more relevant. Kind of like an eye doctor. In sophisticated A/B testing, the program randomly changes which side the results go to, so you remove sidedness bias. FWIW, Erick On Thu, Oct 17, 2013 at 11:28 AM, Alvaro Cabrerizo topor...@gmail.com wrote: Hi, Imagine the next situation. You have a corpus of documents and a list of queries extracted from a production environment. The corpus hasn't been manually annotated with relevant/non-relevant tags for every query. Then you configure various Solr instances, changing the schema (adding synonyms, stopwords...). After indexing, you prepare and execute the test over the different schema configurations. How do you compare the quality of your search results in order to decide which schema is better? Regards. -- Doug Turnbull Search Big Data Architect OpenSource Connections http://o19s.com
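To make Furkan's suggestion concrete, here is a small self-contained Java sketch of Kendall's tau-a over two rankings. It is only an illustration (class and method names are made up), and it assumes both rankings contain exactly the same items with no ties - e.g. the intersection set Alvaro plans to evaluate:

    import java.util.Arrays;
    import java.util.List;

    public class KendallTau {
      /** Kendall's tau-a: (concordant - discordant) / (n(n-1)/2), no tie handling. */
      public static double tauA(List<String> rankingA, List<String> rankingB) {
        int n = rankingA.size();
        long concordant = 0, discordant = 0;
        for (int i = 0; i < n; i++) {
          for (int j = i + 1; j < n; j++) {
            // the pair (i, j) is concordant when ranking B orders it the same way as ranking A
            int posI = rankingB.indexOf(rankingA.get(i));
            int posJ = rankingB.indexOf(rankingA.get(j));
            if (posI < posJ) concordant++; else discordant++;
          }
        }
        return (concordant - discordant) / (n * (n - 1) / 2.0);
      }

      public static void main(String[] args) {
        List<String> ideal = Arrays.asList("doc1", "doc2", "doc3", "doc4");
        List<String> system = Arrays.asList("doc2", "doc1", "doc3", "doc4");
        System.out.println(tauA(ideal, system)); // one swapped pair out of six -> 0.666...
      }
    }

A score of 1.0 means the two rankings are identical and -1.0 means completely reversed, so the configuration whose tau against the ideal list is higher wins.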
Re: Exact Match Results
For exact phrase match you can wrap the query inside quotes, but this will perform the exact match and it won't match other results. The query below will match only: "Okkadu telugu movie stills" http://localhost:8983/solr/core1/select?q=%22okkadu%20telugu%20movie%20stills%22 Since you are using the EdgeNGram filter, it produces many tokens (as below), so you might not get the desired output. You can try using the shingle factory with a standard analyzer instead of the EdgeNGram filter.
o [6f] 0 26 1 1 word
ok [6f 6b] 0 26 1 1 word
okk [6f 6b 6b] 0 26 1 1 word
okka [6f 6b 6b 61] 0 26 1 1 word
okkad [6f 6b 6b 61 64] 0 26 1 1 word
okkadu [6f 6b 6b 61 64 75] 0 26 1 1 word
okkadu [6f 6b 6b 61 64 75 20] 0 26 1 1 word
okkadu t [6f 6b 6b 61 64 75 20 74] 0 26 1 1 word
okkadu te [6f 6b 6b 61 64 75 20 74 65] 0 26 1 1 word
okkadu tel [6f 6b 6b 61 64 75 20 74 65 6c] 0 26 1 1 word
-- View this message in context: http://lucene.472066.n3.nabble.com/Exact-Match-Results-tp4096816p4096906.html Sent from the Solr - User mailing list archive at Nabble.com.
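The same query expressed through SolrJ 4.x, as a minimal sketch; the core name matches the URL above, and the rest is boilerplate:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ExactPhraseSearch {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/core1");
        // quoting the whole phrase makes the parser require the terms as a sequence
        SolrQuery query = new SolrQuery("\"okkadu telugu movie stills\"");
        QueryResponse response = server.query(query);
        System.out.println("hits: " + response.getResults().getNumFound());
        server.shutdown();
      }
    }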