Re: Fast Vector Highlighter Working for some records only
Hi Koji,

Thanks for your guidance. I have looked into the analysis page of Solr and it's working fine, but highlighting is still not working for a few documents. Here is the configuration for the highlighter I am using; I have specified this in solrconfig.xml. Can you please tell me what I should change for the highlighter to work for all documents? For your information, I am not using any kind of filter for the custom field, I am just using my custom tokeniser:

<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    <fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
      <lst name="defaults">
        <int name="hl.snippets">1000</int>
        <int name="hl.fragsize">7</int>
        <int name="hl.maxAnalyzedChars">7</int>
      </lst>
    </fragmenter>
    <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">70</int>
        <float name="hl.regex.slop">0.5</float>
        <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
      </lst>
    </fragmenter>
    <formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
      <lst name="defaults">
        <str name="hl.simple.pre"/>
        <str name="hl.simple.post"/>
      </lst>
    </formatter>
    <encoder name="html" class="solr.highlight.HtmlEncoder"/>
    <fragListBuilder name="simple" default="true" class="solr.highlight.SimpleFragListBuilder"/>
    <fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
    <fragmentsBuilder name="default" default="true" class="solr.highlight.ScoreOrderFragmentsBuilder"/>
    <fragmentsBuilder name="colored" class="solr.highlight.ScoreOrderFragmentsBuilder">
      <lst name="defaults">
        <str name="hl.tag.pre"/>
        <str name="hl.tag.post"/>
      </lst>
    </fragmentsBuilder>
    <boundaryScanner name="default" default="true" class="solr.highlight.SimpleBoundaryScanner">
      <lst name="defaults">
        <str name="hl.bs.maxScan">10</str>
        <str name="hl.bs.chars">.,!? &#9;&#10;&#13;</str>
      </lst>
    </boundaryScanner>
    <boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
      <lst name="defaults">
        <str name="hl.bs.type">WORD</str>
        <str name="hl.bs.language">en</str>
        <str name="hl.bs.country">US</str>
      </lst>
    </boundaryScanner>
  </highlighting>
</searchComponent>

Koji Sekiguchi wrote:
> Hi dhaivat,
> I think you may want to use analysis.jsp: http://localhost:8983/solr/admin/analysis.jsp
> Go to the URL and look into how your custom tokenizer produces tokens, and compare with the output of Solr's inbuilt tokenizer.
> koji
> --
> Query Log Visualizer for Apache Solr
> http://soleami.com/
>
> (12/02/22 21:35), dhaivat wrote:
>> Koji Sekiguchi wrote:
>>> (12/02/22 11:58), dhaivat wrote:
>>>> Thanks for the reply. But can you please tell me why it's working for some documents and not for others?
>>> As Solr 1.4.1 cannot recognize the hl.useFastVectorHighlighter flag, Solr just ignores it; but because hl=true is there, Solr tries to create highlight snippets using the existing (traditional, i.e. not FVH) Highlighter. When the Highlighter (including FVH) cannot produce snippets, which happens sometimes for various reasons, you can use the hl.alternateField parameter. http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField
>>> koji
>>> --
>>> Query Log Visualizer for Apache Solr
>>> http://soleami.com/
>> Thank you so much for the explanation. I have updated my Solr version and am using 3.5. Could you please tell me: when I am using a custom Tokenizer on a field, do I need to make any changes related to the Solr highlighter?
Here is my custom analyser:

<fieldType name="custom_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="ns.solr.analyser.CustomIndexTokeniserFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="ns.solr.analyser.CustomSearcherTokeniserFactory"/>
  </analyzer>
</fieldType>

and here is the field info:

<field name="contents" type="custom_text" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>

I am creating tokens with my custom analyser, and when I try to use the highlighter it does not work properly for the contents field. But when I tried Solr's inbuilt tokeniser, I found the word highlighted for the same query. Can you please help me out with this? Thanks in advance.
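For what it's worth, the FastVectorHighlighter reads the start/end character offsets stored in the term vectors, so a custom tokenizer that reports wrong offsets (or none at all) will silently produce no snippets for the affected documents. Below is a plain-Java sketch, not the actual Lucene Tokenizer API, of the offset bookkeeping that a tokenizer's OffsetAttribute must get right; the class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// A token carrying the character offsets a highlighter needs.
class Token {
    final String text;
    final int startOffset;
    final int endOffset;

    Token(String text, int startOffset, int endOffset) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

class OffsetDemo {
    // Whitespace tokenization that records offsets into the ORIGINAL input.
    // A highlighter uses exactly these offsets to place the <em> tags, so
    // they must point at the raw text, not at any transformed token text.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) tokens.add(new Token(input.substring(start, i), start, i));
        }
        return tokens;
    }
}
```

Comparing the offsets your factory produces (e.g. via analysis.jsp) against this kind of expected bookkeeping is often the quickest way to spot why some documents highlight and others don't.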
Re: 'location' fieldType indexation impossible
You totally get it :) I've deleted those dynamicFields (though it was just an example). Why didn't I read the comment above the line! Thanks a lot ;) Best regards, Xavier. -- View this message in context: http://lucene.472066.n3.nabble.com/location-fieldType-indexation-impossible-tp3766136p3769065.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to merge an autofacet with a predefined facet
Thank you for this information, I'll keep it in mind. But I'm sorry, I don't quite get the process for doing it. Em wrote: Well, you could create a keyword-file out of your database and join it with your self-maintained keywordslist. By that do you mean: - the 'self-maintained keywordslist' is my 'predefined_facet', already filled in the database, that I'll still import with DIH? - Isn't the keyword-file the same thing I've created with the synonyms/keepwords combination? And I still don't get how to 'merge' both of these ways of getting facet values into one single facet! Thanks in advance, Xavier -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-merge-an-autofacet-with-a-predefined-facet-tp3763988p3769121.html
Re: solr 3.5 and indexing performance
OK, I found it. It's because of Hunspell, which is now part of Solr. Somehow, when I was using it by myself in 3.4, it was a lot faster than the one from 3.5. I don't know about the differences, but is there any way I can use my old Google Hunspell jar? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
Can this type of sorting/boosting be done by solr
Hi, I have a journal article citation schema like this: { AT - article_title AID - article_id (Unique id) AREFS - article_references_list (List of article id's referred/cited in this article. Multi-valued) AA - Article Abstract --- other_article_stuff ... } So for example, in order to search for all those articles that refer(cite) article id 51643, I simply need to search for AREFS:51643 and it will give me the list of articles that have 51643 listed in AREFS. Now, I want to be able to search in the text of articles and sort the results by most referred articles. How can I do this ? Say if my search query is q=AT:metal and it gives me 1700 results. How can I sort 1700 results by those that have received maximum number of citations by others. I have been researching function queries to solve this but have been unable to do so. Thanks in advance. Ritesh -- View this message in context: http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769315.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Can this type of sorting/boosting be done by solr
Hi Ritesh, you could add another field that contains the size of the list in the AREFS field. This way you'd simply sort by that field in descending order. Should you update AREFS dynamically, you'd have to update the field with the size, as well, of course. Chantal On Thu, 2012-02-23 at 11:27 +0100, rks_lucene wrote: Hi, I have a journal article citation schema like this: { AT - article_title AID - article_id (Unique id) AREFS - article_references_list (List of article id's referred/cited in this article. Multi-valued) AA - Article Abstract --- other_article_stuff ... } So for example, in order to search for all those articles that refer(cite) article id 51643, I simply need to search for AREFS:51643 and it will give me the list of articles that have 51643 listed in AREFS. Now, I want to be able to search in the text of articles and sort the results by most referred articles. How can I do this ? Say if my search query is q=AT:metal and it gives me 1700 results. How can I sort 1700 results by those that have received maximum number of citations by others. I have been researching function queries to solve this but have been unable to do so. Thanks in advance. Ritesh -- View this message in context: http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769315.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: String search in Dismax handler
Hi Erick,

Thanks for the response. I am currently using the Solr 1.5 version. We get the following query when we give the search term Pass By Value without quotes, with qt=dismax in the request:

webapp=/solr path=/select/ params={facet=true&f.typeFacet.facet.mincount=1&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&hl.fl=*&hl=true&f.rFacet.facet.mincount=1&rows=10&debugQuery=true&fl=*&start=0&q=pass+by+value&facet.field=typeFacet&facet.field=rFacet&qt=dismax} hits=0 status=0 QTime=63

and the response for it in the UI is as follows:

<result name="response" numFound="0" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="typeFacet"/>
    <lst name="rFacet"/>
  </lst>
  <lst name="facet_dates"/>
</lst>
<lst name="highlighting"/>
<lst name="debug">
  <str name="rawquerystring">pass by value</str>
  <str name="querystring">pass by value</str>
  <str name="parsedquery">+((DisjunctionMaxQuery((uxid:pass^0.3 | id:pass^0.3 | x_name:pass^0.3 | text:loan | name:pass^2.3)) DisjunctionMaxQuery((uxid:by^0.3 | id:by^0.3)) DisjunctionMaxQuery((uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 | text:value | name:value^2.3)))~3) ()</str>
  <str name="parsedquery_toString">+(((uxid:pass^0.3 | id:loan^0.3 | x_name:pass^0.3 | text:loan | name:pass^2.3) (uxid:by^0.3 | id:by^0.3) (uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 | text:value | name:value^2.3))~3) ()</str>
  <lst name="explain"/>
  <str name="QParser">DisMaxQParser</str>
  <null name="altquerystring"/>
  <null name="boostfuncs"/>
  <lst name="timing">
    <double name="time">3.0</double>
    <lst name="prepare">
      <double name="time">1.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">1.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
    </lst>
    <lst name="process">
      <double name="time">2.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">1.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">1.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
    </lst>
  </lst>
</lst>

Whereas we get the following query when we remove the qt=dismax parameter from the request, and this one does fetch the required results:

webapp=/solr path=/select/ params={facet=true&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&f.typeFacet.facet.mincount=1&hl.fl=*&f.rFacet.facet.mincount=1&hl=true&rows=10&fl=*&debugQuery=true&start=0&q=pass+by+value&facet.field=typeFacet&facet.field=rFacet} hits=9203 status=0 QTime=1158

In another case, where we use "Pass by Value" with quotes and also with qt=dismax in the request handler, the search query fetches the right values. The following is the concerned query:

webapp=/solr path=/select/ params={facet=true&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&f.typeFacet.facet.mincount=1&hl.fl=*&f.rFacet.facet.mincount=1&hl=true&rows=10&fl=*&debugQuery=true&start=0&q=pass+by+value&facet.field=typeFacet&facet.field=rFacet} hits=18 status=0 QTime=213

and the response for it from the UI is:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">578</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="f.typeFacet.facet.mincount">1</str>
      <str name="qf">name^2.3 text x_name^0.3 id^0.3 xid^0.3</str>
      <str name="hl.fl">*</str>
      <str name="hl">true</str>
      <str name="f.rFacet.facet.mincount">1</str>
      <str name="rows">10</str>
      <str name="debugQuery">true</str>
      <str name="fl">*</str>
      <str name="start">0</str>
      <str name="q">pass by value</str>
      <arr name="facet.field">
        <str>typeFacet</str>
        <str>rFacet</str>
      </arr>
      <str name="qt">dismax</str>
    </lst>
  </lst>
  <result name="response" numFound="18" start="0">...</result>
  <lst name="facet_counts">...</lst>
  <lst name="highlighting">...</lst>
  <lst name="debug">
    <str name="rawquerystring">pass by value</str>
    <str name="querystring">pass by value</str>
    <str name="parsedquery">+DisjunctionMaxQuery((xid:"pass by value"^0.3 | id:"pass by value"^0.3 | x_name:"pass ? value"^0.3 | text:"pass ? value" | name:"pass ? value"^2.3)) ()</str>
    <str name="parsedquery_toString">+(xid:"pass by value"^0.3 | id:"pass by value"^0.3 | x_name:"pass ? value"^0.3 | text:"pass ? value" | name:"pass ? value"^2.3) ()</str>
    <lst name="explain"> <str
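One thing worth double-checking when the quoted and unquoted forms behave this differently: the quotes have to survive URL encoding on the client, otherwise the request degenerates into the unquoted three-term dismax query. A small sketch of building the q parameter with java.net.URLEncoder (the parameter value is just an example from this thread):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryParamBuilder {
    // Encode a phrase query so the surrounding quotes reach Solr intact:
    // '"' becomes %22 and spaces become '+', both of which Solr accepts.
    static String phraseParam(String phrase) {
        try {
            return "q=" + URLEncoder.encode("\"" + phrase + "\"", "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }
}
```

If the quotes are dropped before encoding, the logged params will show q=pass+by+value instead of q=%22pass+by+value%22, which matches the difference between the two log lines above.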
Re: Can this type of sorting/boosting be done by solr
Dear Chantal,

Thanks for your reply, but that's not what I was asking. Let me explain. The size of the list in AREFS would give me how many records are *referred to by* an article, NOT how many records *refer to* that article. Say an article with id 51463 was published in 2002 and refers to 10 articles dating from 1990-2002. Then the count of AREFS would be 10, which is static once the journal has been published. However, if the same article is *referred to* by 20 articles published from 2003-2012, then it is this count of 20 that I am talking about. This count is dynamic: as we keep adding records to the index, more articles will list 51463 in their AREFS field in the future. (Obviously, when we add article 51463 to the index we have no clue who will refer to it in the future, so we cannot have another field in it for this, nor can we update 51463 every time someone refers to it.)

So today, if I want to know who all refer to 51463, I actually search for this id in the AREFS field. The query is as simple as q=AREFS:51463, and it will give the list of articles from 2003 to 2012, with a result count of 20.

So back to the question: say my search query is q=AT:metal and it gives me 1700 results. How can I sort those 1700 results by the number of citations each has received (to date) from others, i.e. by the number of results I would get if I individually searched their ids in the AREFS field? Hope this makes it clear. I feel this is a sort/boost-by-function-query candidate, but I am not able to figure it out. Thanks Ritesh -- View this message in context: http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769475.html
Range Query with sensitive Scoring
Hello,

I have an Integer field which carries a value between 0 and 18. Is there a way to query this field fuzzily? For example, search for field:5 and also match documents near it (like documents containing field:4 or field:6)? And if this is possible, is it also possible to boost exact matches and lower the boost for the fuzzy matches?

Thanks in advance and kind regards,
Hannes
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
It looks like it works with the patch; after a couple of hours of testing under the same conditions I didn't see it happen (without the patch, it happened approx. every 15 minutes). I do not think it will happen again with this patch. Thanks again, and my respect for your debugging capacity; my bug report was really thin.

On Thu, Feb 23, 2012 at 8:47 AM, eks dev eks...@yahoo.co.uk wrote:
thanks Mark, I will give it a go and report back...

On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller markrmil...@gmail.com wrote:
Looks like an issue around the replication IndexWriter reboot, soft commits and hard commits. I think I've got a workaround for it:

Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
===================================================================
--- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision 1292344)
+++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working copy)
@@ -499,6 +499,17 @@
       // reboot the writer on the new index and get a new searcher
       solrCore.getUpdateHandler().newIndexWriter();
+      Future[] waitSearcher = new Future[1];
+      solrCore.getSearcher(true, false, waitSearcher, true);
+      if (waitSearcher[0] != null) {
+        try {
+          waitSearcher[0].get();
+        } catch (InterruptedException e) {
+          SolrException.log(LOG, e);
+        } catch (ExecutionException e) {
+          SolrException.log(LOG, e);
+        }
+      }
       // update our commit point to the right dir
       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));

That should allow the searcher that the following commit command prompts to see the *new* IndexWriter.

On Feb 22, 2012, at 10:56 AM, eks dev wrote:
We started observing strange failures from ReplicationHandler when we commit on master (trunk version, 4-5 days old). It works sometimes and sometimes not; didn't dig deeper yet. Looks like the real culprit hides behind:

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

Look familiar to somebody?
120222 154959 SEVERE SnapPull failed: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
    at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
    at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
    at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
    at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
    ... 15 more

- Mark Miller
lucidimagination.com
Re: Range Query with sensitive Scoring
> I have an Integer field which carries a value between 0 and 18. Is there a way to query this field fuzzily? For example, search for field:5 and also match documents near it (like documents containing field:4 or field:6)? And if this is possible, is it also possible to boost exact matches and lower the boost for the fuzzy matches?

Yes, it is possible with query manipulation. If you are using the lucene query parser:

q=+field:[4 TO 6] field:5^10

If you are using the edismax query parser:

q=field:[4 TO 6]&bq=field:5^10

(Note the inclusive brackets: an exclusive range {4 TO 6} would match only 5, not the neighbours.)
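An alternative to enumerating the neighbour values is to keep the range filter and let a function query score by distance from the target value. An untested sketch (recip's arguments are (x, m, a, b), yielding a/(m*x+b), so the boost is largest at distance 0 and falls off for 4 and 6):

```
q=field:[4 TO 6]&defType=edismax&bf=recip(abs(sub(field,5)),1,10,1)
```

With these constants an exact match on 5 contributes a boost of 10 and the neighbours contribute 5, which gives the "exact matches first, fuzzy matches after" ordering asked about.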
Re: Can this type of sorting/boosting be done by solr
Have you looked at external fields? http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles you will need a process to do the counts and note the limitation of updates only after a commit, but i think it would fit your usecase. On 23 February 2012 12:04, rks_lucene ppro.i...@gmail.com wrote: Dear Chantal, Thanks for your reply, but thats not what I was asking. Let me explain. The size of the list in AREFS would give me how many records are *referred by* an article and NOT how many records *refer to* an article. Say if an article id - 51463 has been published in 2002 and refers to 10 articles dating from 1990-2002. Then the count of AREFS would be 10 which is static once the journal has been published. However if the same article is being *referred to* by 20 articles published from 2003-2012 then I am talking about this 20 count. This count is dynamic and as we keep adding records to the index, there are more articles that will refer to article 51463 it in their AREFS field in the future. /(Obviously when we are adding article 51463 to the index we have no clue who will be referring to it in the future, so we can have another field in it for this, nor can be update 51463 everytime someone refers to it)/ So today, if I want to know who all are referring to 51463, by actually searching for this id in the AREFS field. The query is as simple as q=AREFS:51463 and it will given the list of articles from 2003 to 2012 and the result count would be 20. So back to the question, say if my search query is q=AT:metal and it gives me 1700 results. How can I sort 1700 results by those that have received maximum number of citations (till date) by others. (i.e., that have maximum number of results if I individually search their ids in the AREFS field). Hope this makes it clear. I feel this is a sort/boost by function query candidate. But I am not able to figure it out. 
Thanks Ritesh
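To flesh out the external-file-field route: the offline process would count, for each article id, how many times it appears in any other article's AREFS list, then write those counts as id=count lines into an external_<fieldname> file in the index data directory, where a solr.ExternalFileField can pick them up for sorting or function queries. A sketch of the counting step (the field and file names are assumptions, not something from this thread):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CitationCounter {
    // refs maps an article id to the list of ids it cites (its AREFS field).
    // Returns how many articles cite each id, i.e. the "referred to" count
    // the original question asks to sort by.
    static Map<String, Integer> citedByCounts(Map<String, List<String>> refs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> cited : refs.values()) {
            for (String id : cited) {
                counts.merge(id, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Each entry would then be emitted as a line like 51643=20 into, say, external_citationCount in the data directory, after which something like sort=citationCount desc becomes possible (subject to the commit-to-reload limitation mentioned above).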
Re: Can this type of sorting/boosting be done by solr
Sorry to have misunderstood. It seems the new Relevance Functions in Solr 4.0 might help - unless you need to use an official release. http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions On Thu, 2012-02-23 at 13:04 +0100, rks_lucene wrote: Dear Chantal, Thanks for your reply, but thats not what I was asking. Let me explain. The size of the list in AREFS would give me how many records are *referred by* an article and NOT how many records *refer to* an article. Say if an article id - 51463 has been published in 2002 and refers to 10 articles dating from 1990-2002. Then the count of AREFS would be 10 which is static once the journal has been published. However if the same article is being *referred to* by 20 articles published from 2003-2012 then I am talking about this 20 count. This count is dynamic and as we keep adding records to the index, there are more articles that will refer to article 51463 it in their AREFS field in the future. /(Obviously when we are adding article 51463 to the index we have no clue who will be referring to it in the future, so we can have another field in it for this, nor can be update 51463 everytime someone refers to it)/ So today, if I want to know who all are referring to 51463, by actually searching for this id in the AREFS field. The query is as simple as q=AREFS:51463 and it will given the list of articles from 2003 to 2012 and the result count would be 20. So back to the question, say if my search query is q=AT:metal and it gives me 1700 results. How can I sort 1700 results by those that have received maximum number of citations (till date) by others. (i.e., that have maximum number of results if I individually search their ids in the AREFS field). Hope this makes it clear. I feel this is a sort/boost by function query candidate. But I am not able to figure it out. 
Thanks Ritesh
Re: Solr Performance Improvement and degradation Help
It's pretty hard to say, even with the data you've provided. But try adding debugQuery=on and look particularly down near the bottom: there'll be a <lst name="timing"> section. That section lists the time taken by all the components of a search, not just the QTime; things like highlighting etc. can often give a clue where the time's spent. What sort of wildcards are you using? Did you have to bump maxBooleanClauses? This is a bit puzzling though. Best Erick

On Wed, Feb 22, 2012 at 3:16 PM, naptowndev naptowndev...@gmail.com wrote:
As an update to this... I tried running a query against the 4.0.0.2010.12.10.08.54.56 version and the newer 4.0.0.2012.02.16 (both on the same box). The query params were the same and the returned results were the same, but 4.0.0.2010.12.10.08.54.56 returned the results in about 1.6 seconds and the newer version (4.0.0.2012.02.16) returned them in about 4 seconds. If I add the wildcard field list to the newer version, the time increases anywhere from .5-1 second. These are all averages after running the queries several times over a 30-minute period (allowing for warming and cache). Anybody have any insight into why the newer versions are performing a bit slower? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Performance-Improvement-and-degradation-Help-tp3767015p3767725.html
Re: Same id on two shards
I really think you'll be in a world of hurt if you have the same ID on different shards. I just wouldn't go there. The statement "may be non-deterministic" should be taken to mean that this is just unsupported. Why is this the case? What is the use-case for putting the same ID on different shards? Because this seems like an XY problem... Best Erick

On Wed, Feb 22, 2012 at 4:43 PM, jerry.min...@gmail.com jerry.min...@gmail.com wrote:
Hi, I stumbled across this thread after running into the same question. The answers presented here seem a little vague and I was hoping to renew the discussion. I am using a branch of Solr 4, distributed searching over 12 shards. I want the documents in the first shard to always be selected over documents that appear in the other 11 shards. The queries to these shards look something like this:

http://solrserver/shard_1_app/select?shards=solr_server:/shard_1_app/,solr_server:/shard_2_app, ... ,solr_server:/shard_12_app&q=id:

When I execute a query for an ID that I know exists in shard_1 and another shard, I do always get the result from shard 1. Here are some questions that I have:
1. Has anyone rigorously tested the comment in the wiki, "If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic"?
2. Who is relying on this behavior (the document of the first shard is returned) today? When do you notice the wrong document is selected? Do you have a feeling for how frequently your distributed search returns the document from a shard other than the first?
3. Is there a good web source other than the Solr wiki for information about Solr distributed queries?
Thanks, Jerry M.
On Mon, Aug 8, 2011 at 7:41 PM, simon mtnes...@gmail.com wrote: I think the first one to respond is indeed the way it works, but that's only deterministic up to a point (if your small index is in the throes of a commit and everything required for a response happens to be cached on the larger shard ... who knows ?) On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote: On 8/8/2011 4:07 PM, simon wrote: Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations I had heard it was based on which one responded first. This is part of why we have a small index that contains the newest content and only distribute content to the other shards once a day. The hope is that the small index (less than 1GB, fits into RAM on that virtual machine) will always respond faster than the other larger shards (over 18GB each). Is this an incorrect assumption on our part? The build system does do everything it can to ensure that periods of overlap are limited to the time it takes to commit a change across all of the shards, which should amount to just a few seconds once a day. There might be situations when the index gets out of whack and we have duplicate id values for a longer time period, but in practice it hasn't happened yet. Thanks, Shawn
Re: Trunk build errors
There was recently some work done to get better about checking on licenses, when did you last get trunk? About 9 days ago was the last go-round. And did you do an 'ant clean'? It works on my machine with a fresh pull this morning. Best Erick On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I am getting numerous errors preventing a build of solrcloud trunk. [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
Re: Can this type of sorting/boosting be done by solr
Hi Chantal,

Yes, I have thought about the docfreq(field_name,'search_text') function, but somehow I will have to dereference the article ids (AID) from the result of the query into the sort. The query below does not work:

q=AT:metal&sort=docfreq(AREFS,$q.AID)

Is there a mistake in the query that I am missing, or is dereferencing not supported in relevance functions?

Thanks,
Ritesh
-- View this message in context: http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769779.html
Re: Trunk build errors
I updated yesterday and did an ant clean, ant test. I will try a clean pull next. I'm on linux. Perhaps an ant version issue? There was recently some work done to get better about checking on licenses, when did you last get trunk? About 9 days ago was the last go-round. And did you do an 'ant clean'? It works on my machine with a fresh pull this morning. Best Erick On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I am getting numerous errors preventing a build of solrcloud trunk. [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
Re: Unique key constraint and optimistic locking (versioning)
Em wrote:
> Hi Per,
> Solr provides the so-called UniqueKey field. Refer to the Wiki to learn more: http://wiki.apache.org/solr/UniqueKey

I believe the uniqueKey does not enforce a unique key constraint, i.e. it does not prevent you from creating a document with an id when a document with the same id already exists. So it is not the whole solution.

> Optimistic locking (versioning) ... is not provided by Solr out of the box. If you add a new document with the same UniqueKey it replaces the old one. You have to do the versioning on your own (and keep in mind concurrent updates).
>
> Kind regards,
> Em
>
> On 21.02.2012 13:50, Per Steffensen wrote:
>> Hi
>> Does solr/lucene provide any mechanism for a unique key constraint and optimistic locking (versioning)?
>> Unique key constraint: that a client will not succeed in creating a new document in solr/lucene if a document already exists having the same value in some field (e.g. an id field). Of course implemented right, so that even though two or more threads are concurrently trying to create a new document with the same value in this field, only one of them will succeed.
>> Optimistic locking (versioning): that a client will only succeed in updating a document if the updated document is based on the version of the document currently stored in solr/lucene. Implemented in the optimistic way: clients, during an update, have to tell which version of the document they fetched from Solr and therefore used as the starting point for their updated document. So basically, have a version field on the document that clients increase by one before sending it to Solr for update, and some code in Solr that only lets the update succeed if the version number of the updated document is exactly one higher than the version number of the document already stored.
>> Of course, again implemented right, so that even though two or more threads are concurrently trying to update a document, and they all have their updated document based on the current version in solr/lucene, only one of them will succeed.
>> Or do I have to do stuff like this myself outside solr/lucene, e.g. in the client using Solr?
>> Regards, Per Steffensen
Re: Unique key constraint and optimistic locking (versioning)
Hi Per, well, Solr has no Update method like an RDBMS. It is a re-insert of the whole document. Therefore a document with an existing UniqueKey marks the old document as deleted and inserts the new one. However this is not the whole story, since this constraint only works per index/SolrCore/Shard (depending on your use-case). Does this help you? Kind regards, Em On 23.02.2012 15:34, Per Steffensen wrote: Em wrote: Hi Per, Solr provides the so-called UniqueKey field. Refer to the Wiki to learn more: http://wiki.apache.org/solr/UniqueKey I believe the uniqueKey does not enforce a unique key constraint, so that you are not allowed to create a document with an id when a document with the same id already exists. So it is not the whole solution. Optimistic locking (versioning) ... is not provided by Solr out of the box. If you add a new document with the same UniqueKey it replaces the old one. You have to do the versioning on your own (and keep in mind concurrent updates). Kind regards, Em On 21.02.2012 13:50, Per Steffensen wrote: Hi Does solr/lucene provide any mechanism for unique key constraints and optimistic locking (versioning)? Unique key constraint: that a client will not succeed in creating a new document in solr/lucene if a document already exists having the same value in some field (e.g. an id field). Of course implemented right, so that even though two or more threads are concurrently trying to create a new document with the same value in this field, only one of them will succeed. Optimistic locking (versioning): that a client will only succeed in updating a document if the updated document is based on the version of the document currently stored in solr/lucene. Implemented in the optimistic way that clients during an update have to tell which version of the document they fetched from Solr and therefore used as a starting point for their updated document.
So basically having a version field on the document that clients increase by one before sending to solr for update, and some code in Solr that only makes the update succeed if the version number of the updated document is exactly one higher than the version number of the document already stored. Of course again implemented right, so that even though two or more threads are concurrently trying to update a document, and they all have their updated document based on the current version in solr/lucene, only one of them will succeed. Or do I have to do stuff like this myself outside solr/lucene - e.g. in the client using solr? Regards, Per Steffensen
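Neither constraint exists in Solr 3.x out of the box, but the rule described above is a plain compare-and-swap that a client (or a custom update handler) would have to enforce itself. A toy in-memory sketch of just that rule — class and method names are hypothetical, and no Solr API is involved:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy illustration of the optimistic-locking rule from the thread:
 * an insert succeeds only if no document with the id exists yet, and
 * an update succeeds only if the incoming version is exactly one
 * higher than the stored version, even under concurrent callers.
 */
class VersionedStore {
    private final Map<String, Long> versions = new ConcurrentHashMap<>();

    /** INSERT semantics: fail if a document with this id already exists. */
    boolean insert(String id) {
        // putIfAbsent is atomic: only one concurrent inserter wins
        return versions.putIfAbsent(id, 1L) == null;
    }

    /** UPDATE semantics: succeed only if newVersion == storedVersion + 1. */
    boolean update(String id, long newVersion) {
        Long stored = versions.get(id);
        if (stored == null) {
            return false; // nothing to update
        }
        // atomic compare-and-set: of several concurrent updaters all
        // starting from the same stored version, exactly one succeeds
        return versions.replace(id, newVersion - 1, newVersion);
    }
}
```

A real system would keep the version in a Solr field and do the fetch/compare/re-add in the client or in a custom update handler, which is where races sneak back in unless the check is made atomic server-side.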
Re: Solr Performance Improvement and degradation Help
Erick - Agreed, it is puzzling. What I've found is that it doesn't matter if I pass in wildcards for the field list or not...but that the overall response time from the newer builds of Solr that we've tested (e.g. 4.0.0.2012.02.16) is slower than the older (4.0.0.2010.12.10.08.54.56) build. If I run the exact same query against those two cores, bringing back a payload of just over 13MB (xml), the older build brings it back in about 1.6 seconds and the newer build brings it back in about 8.4 seconds. Implementing the field list wildcard allows us to reduce the payload in the newer build (not an option in the older build). The payload is reduced to 1.8MB but takes over 3.5 seconds to come back as compared to the full payload (13MB) in the older build at about 1.6 seconds. With everything else remaining the same (machine/processors/memory/network and the code base calling Solr) it seems to point to something in the newer builds that's causing the slowdown, but I'm not intimate enough with Solr to be able to figure that out. We are using debugQuery=on in our test to see timings and they aren't showing any anomalies, so that makes it even more confusing. From a wildcard perspective, it's on the fl parameter... here's a 'snippet' of part of our fl parameter for the query fl=id, CategoryGroupTypeID, MedicalSpecialtyDescription, TermsMisspelled, DictionarySource, timestamp, Category_*_MemberReports, Category_*_MemberReportRange, Category_*_NonMemberReports, Category_*_Grade, Category_*_GradeDisplay, Category_*_GradeTier, Category_*_ReportLocations, Category_*_ReportLocationCoordinates, Category_*_coordinate, score Please note that the fl param is greatly reduced from our full query; we have over 100 static fields and a slew of dynamic fields - but that should give you an idea of how we are using wildcards. I'm not sure about the maxBooleanClauses...not being all that familiar with Solr, does that apply to wildcards used in the fl list? Thanks!
-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Performance-Improvement-and-degradation-Help-tp3767015p3769995.html Sent from the Solr - User mailing list archive at Nabble.com.
How to retrieve tokens?
Hi to everybody, My name is Thiago and I'm new to Apache Solr and NoSQL databases. At the moment, I'm working with Solr for document indexing. My question is: is there any way to retrieve the tokens in place of the original data? For example: I have a field using the fieldtype text_general from the original schema.xml. If I insert a document with the following string in this field: "All you need is love", the tokens that I get are: all, you, need, love. When I search in this base, I want to get the tokens (all, you, need, love) in place of the indexed string. I searched for this on the web and in this forum too, and I saw some people saying to use TermVectorsComponent. Is there an easier way to do it? As I saw, TermVectorsComponent is more difficult and uses more memory. Thanks to everybody. Thiago -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-retrieve-tokens-tp3770007p3770007.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Unique key constraint and optimistic locking (versioning)
Em wrote: Hi Per, well, Solr has no Update method like an RDBMS. It is a re-insert of the whole document. Therefore a document with an existing UniqueKey marks the old document as deleted and inserts the new one. Yes I understand. But it is not always what I want to achieve. I want an error to occur if a document with the same id already exists, when my intent is to INSERT a new document. When my intent is to UPDATE a document in solr/lucene I want the old document already in solr/lucene deleted and the new version of this document added (exactly as you explain). It will not be possible for solr/lucene to decide what to do unless I give it some information about my intent - whether it is INSERT or UPDATE semantics I want. I guess solr/lucene always gives me INSERT semantics when a document with the same id does not already exist, and that it always gives me UPDATE semantics when a document with the same id does exist? I cannot decide? However this is not the whole story, since this constraint only works per index/SolrCore/Shard (depending on your use-case). Yes I know. But with the right routing strategy based on ids I will be able to achieve what I want if the feature was just there per index/core/shard. Does this help you? Yes, it helps me be sure that what I am looking for is not there. There is no built-in way to make solr/lucene give me an error if I try to insert a new document with an id equal to a document already in the index/core/shard. The existing document will always be updated (implemented as old deleted and new added). Correct? Kind regards, Em Regards, Per Steffensen
RE: Trunk build errors
Hi Darren, I use Ant 1.7.1. There have been some efforts to make the build work with Ant 1.8.X, but it is not (yet) the required version. So if you're not using Ant 1.7.1, I suggest you try it. Steve -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, February 23, 2012 8:59 AM To: solr-user@lucene.apache.org Subject: Re: Trunk build errors I updated yesterday and did an ant clean, ant test. I will try a clean pull next. I'm on linux. Perhaps an ant version issue? There was recently some work done to get better about checking on licenses, when did you last get trunk? About 9 days ago was the last go-round. And did you do an 'ant clean'? It works on my machine with a fresh pull this morning. Best Erick On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I am getting numerous errors preventing a build of solrcloud trunk. [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
Re: String search in Dismax handler
OK, I really don't get this. The quoted bit gives: +DisjunctionMaxQuery((xid:pass by value^0.3 | id:pass by value^0.3 | x_name:pass ? value^0.3 | text:pass ? value | name:pass ? value^2.3)) The bare bit gives: +((DisjunctionMaxQuery((uxid:pass^0.3 | id:pass^0.3 | x_name:pass^0.3 | text:loan | name:pass^2.3)) DisjunctionMaxQuery((uxid:by^0.3 | id:by^0.3)) DisjunctionMaxQuery((uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 | text:value | name:value^2.3)))~3 In the one case you're searching on xid, in the other uxid. The unquoted case also has text:loan and id:by and id:value. Is that where you're getting your hits? Erick On Thu, Feb 23, 2012 at 6:52 AM, mechravi25 mechrav...@yahoo.co.in wrote: HI Erick, Thanks for the response. I am currently using solr 1.5 version. We are getting the following query when we give the search query as Pass By Value without quotes and by using qt=dismax in the request query. webapp=/solr path=/select/ params={facet=truef.typeFacet.facet.mincount=1qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3hl.fl=*hl=truef.rFacet.facet.mincount=1rows=10debugQuery=truefl=*start=0q=pass+by+valuefacet.field=typeFacetfacet.field=rFacetqt=dismax} hits=0 status=0 QTime=63 and the response for it in the UI is as follows result name=response numFound=0 start=0 / - lst name=facet_counts lst name=facet_queries / - lst name=facet_fields lst name=typeFacet / lst name=rFacet / /lst lst name=facet_dates / /lst lst name=highlighting / - lst name=debug str name=rawquerystringpass by value/str str name=querystringpass by value/str str name=parsedquery+((DisjunctionMaxQuery((uxid:pass^0.3 | id:pass^0.3 | x_name:pass^0.3 | text:loan | name:pass^2.3)) DisjunctionMaxQuery((uxid:by^0.3 | id:by^0.3)) DisjunctionMaxQuery((uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 | text:value | name:value^2.3)))~3) ()/str str name=parsedquery_toString+(((uxid:pass^0.3 | id:loan^0.3 | x_name:pass^0.3 | text:loan | name:pass^2.3) (uxid:by^0.3 | id:by^0.3) (uxid:value^0.3 |
id:value^0.3 | x_name:value^0.3 | text:value | name:value^2.3))~3) ()/str lst name=explain / str name=QParserDisMaxQParser/str null name=altquerystring / null name=boostfuncs / - lst name=timing double name=time3.0/double - lst name=prepare double name=time1.0/double - lst name=org.apache.solr.handler.component.QueryComponent double name=time1.0/double /lst - lst name=org.apache.solr.handler.component.FacetComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.MoreLikeThisComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.HighlightComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.StatsComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.DebugComponent double name=time0.0/double /lst /lst - lst name=process double name=time2.0/double - lst name=org.apache.solr.handler.component.QueryComponent double name=time1.0/double /lst - lst name=org.apache.solr.handler.component.FacetComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.MoreLikeThisComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.HighlightComponent double name=time1.0/double /lst - lst name=org.apache.solr.handler.component.StatsComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.DebugComponent double name=time0.0/double /lst /lst /lst /lst /response whereas we get the following query when we remove the parameter qt=dismax from the request query and this is fetching the required results. 
webapp=/solr path=/select/ params={facet=trueqf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3f.typeFacet.facet.mincount=1hl.fl=*f.rFacet.facet.mincount=1hl=truerows=10fl=*debugQuery=truestart=0q=pass+by+valuefacet.field=typeFacetfacet.field=rFacet} hits=9203 status=0 QTime=1158 In another case where we use Pass by Value with quotes and also with qt=dismax in the request handler, the search query is fetching the right values. The following is the concerned query. webapp=/solr path=/select/ params={facet=trueqf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3f.typeFacet.facet.mincount=1hl.fl=*f.rFacet.facet.mincount=1hl=truerows=10fl=*debugQuery=truestart=0q=pass+by+valuefacet.field=typeFacetfacet.field=rFacet} hits=18 status=0 QTime=213 and the response for it from UI is ?xml version=1.0 encoding=UTF-8 ? - response - lst name=responseHeader int name=status0/int int name=QTime578/int - lst name=params str name=facettrue/str str name=f.typeFacet.facet.mincount1/str str name=qfname^2.3 text x_name^0.3 id^0.3 xid^0.3/str str name=hl.fl*/str str name=hltrue/str str name=f.rFacet.facet.mincount1/str str name=rows10/str str name=debugQuerytrue/str str name=fl*/str str name=start0/str str
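A side note for readers following the debug output above: the trailing ~3 on the parsed dismax query is the minimum-should-match (mm) setting in action. With three per-word clauses and all three required, a document fails the query whenever one word — here "by" — is only searched against fields (uxid, id) that don't contain it, which is consistent with the zero hits. A sketch of the kind of handler defaults involved (handler name and values are illustrative, mirroring the qf from the log):

```xml
<!-- solrconfig.xml: a dismax handler whose mm produces the ")~3" clause -->
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">name^2.3 text x_name^0.3 id^0.3 uxid^0.3</str>
    <!-- mm controls how many of the optional word clauses must match;
         a three-word query with mm=3 (or 100%) requires all three -->
    <str name="mm">3</str>
  </lst>
</requestHandler>
```

Relaxing mm (e.g. to 2 or a percentage) would let documents match without every word.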
Re: Solr Performance Improvement and degradation Help
Ah, no, my mistake. The wildcards for the fl list won't matter re: maxBooleanClauses, I didn't read carefully enough. I assume that just returning a field or two doesn't slow things down. But one possible culprit, especially since you say this kicks in after a while, is garbage collection. Here's an excellent intro: http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/ Especially look at the "getting a view into garbage collection" section and try specifying those options. The result should be that your solr log gets stats dumped every time GC kicks in. If this is a problem, look at the times in the logfile after your system slows down. You'll see a bunch of GC dumps that collect very little unused memory. You can also connect to the process using jConsole (should be in the Java distro) and watch the memory tab, especially after your server has slowed down. You can also connect jConsole remotely... This is just an experiment, but any time I see "and it slows down after ### minutes", GC is the first thing I think of. Best Erick On Thu, Feb 23, 2012 at 10:16 AM, naptowndev naptowndev...@gmail.com wrote: Erick - Agreed, it is puzzling. What I've found is that it doesn't matter if I pass in wildcards for the field list or not...but that the overall response time from the newer builds of Solr that we've tested (e.g. 4.0.0.2012.02.16) is slower than the older (4.0.0.2010.12.10.08.54.56) build. If I run the exact same query against those two cores, bringing back a payload of just over 13MB (xml), the older build brings it back in about 1.6 seconds and the newer build brings it back in about 8.4 seconds. Implementing the field list wildcard allows us to reduce the payload in the newer build (not an option in the older build). The payload is reduced to 1.8MB but takes over 3.5 seconds to come back as compared to the full payload (13MB) in the older build at about 1.6 seconds.
With everything else remaining the same (machine/processors/memory/network and the code base calling Solr) it seems to point to something in the newer builds that's causing the slowdown, but I'm not intimate enough with Solr to be able to figure that out. We are using debugQuery=on in our test to see timings and they aren't showing any anomalies, so that makes it even more confusing. From a wildcard perspective, it's on the fl parameter... here's a 'snippet' of part of our fl parameter for the query fl=id, CategoryGroupTypeID, MedicalSpecialtyDescription, TermsMisspelled, DictionarySource, timestamp, Category_*_MemberReports, Category_*_MemberReportRange, Category_*_NonMemberReports, Category_*_Grade, Category_*_GradeDisplay, Category_*_GradeTier, Category_*_ReportLocations, Category_*_ReportLocationCoordinates, Category_*_coordinate, score Please note that the fl param is greatly reduced from our full query; we have over 100 static fields and a slew of dynamic fields - but that should give you an idea of how we are using wildcards. I'm not sure about the maxBooleanClauses...not being all that familiar with Solr, does that apply to wildcards used in the fl list? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Performance-Improvement-and-degradation-Help-tp3767015p3769995.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to retrieve tokens?
Essentially, you're talking about reconstructing the field from the tokens, and that's pretty difficult in general and lossy. For instance, if you use stemming and "running" gets stemmed to "run", you get back just "run" from the index. Is that acceptable? But otherwise, you've got to go into the low levels of Lucene to get this info, and reassembling it is lengthy; I suspect you'd find that performance was unacceptable. Why do you want to do this? This may be an XY problem. http://people.apache.org/~hossman/#xyproblem Best Erick On Thu, Feb 23, 2012 at 10:22 AM, Thiago thiagosousasilve...@gmail.com wrote: Hi to everybody, My name is Thiago and I'm new to Apache Solr and NoSQL databases. At the moment, I'm working with Solr for document indexing. My question is: is there any way to retrieve the tokens in place of the original data? For example: I have a field using the fieldtype text_general from the original schema.xml. If I insert a document with the following string in this field: "All you need is love", the tokens that I get are: all, you, need, love. When I search in this base, I want to get the tokens (all, you, need, love) in place of the indexed string. I searched for this on the web and in this forum too, and I saw some people saying to use TermVectorsComponent. Is there an easier way to do it? As I saw, TermVectorsComponent is more difficult and uses more memory. Thanks to everybody. Thiago -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-retrieve-tokens-tp3770007p3770007.html Sent from the Solr - User mailing list archive at Nabble.com.
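For completeness, if TermVectorComponent does turn out to be the route, the wiring is small. A sketch of the usual configuration — the field name is illustrative, the component and handler setup follows the TermVectorComponent documentation:

```xml
<!-- schema.xml: store term vectors for the field whose tokens you want back -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true"/>

<!-- solrconfig.xml: expose the component on a dedicated handler -->
<searchComponent name="tvComponent"
                 class="org.apache.solr.handler.component.TermVectorComponent"/>
<requestHandler name="/tvrh"
                class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>
```

A request like /tvrh?q=id:1&tv.fl=content&tv.tf=true then returns the per-document term vector, i.e. the tokens after analysis — at the storage cost Erick alludes to, since the vectors must be kept in the index.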
Re: Unique key constraint and optimistic locking (versioning)
Per: Yep, you've got it. You could write a custom update handler that queried (via TermDocs or something) for the ID when your intent was to INSERT, but it'll have to be custom work. I suppose you could query with a divide-and-conquer approach, that is, query for id:(1 2 58 90... all your insert IDs) and go/no-go based on whether your return had any hits, but that supposes you have some idea whether pre-existing documents are likely. But Solr doesn't have anything like you're looking for. Best Erick On Thu, Feb 23, 2012 at 10:32 AM, Per Steffensen st...@designware.dk wrote: Em wrote: Hi Per, well, Solr has no Update method like an RDBMS. It is a re-insert of the whole document. Therefore a document with an existing UniqueKey marks the old document as deleted and inserts the new one. Yes I understand. But it is not always what I want to achieve. I want an error to occur if a document with the same id already exists, when my intent is to INSERT a new document. When my intent is to UPDATE a document in solr/lucene I want the old document already in solr/lucene deleted and the new version of this document added (exactly as you explain). It will not be possible for solr/lucene to decide what to do unless I give it some information about my intent - whether it is INSERT or UPDATE semantics I want. I guess solr/lucene always gives me INSERT semantics when a document with the same id does not already exist, and that it always gives me UPDATE semantics when a document with the same id does exist? I cannot decide? However this is not the whole story, since this constraint only works per index/SolrCore/Shard (depending on your use-case). Yes I know. But with the right routing strategy based on ids I will be able to achieve what I want if the feature was just there per index/core/shard. Does this help you? Yes, it helps me be sure that what I am looking for is not there.
There is no built-in way to make solr/lucene give me an error if I try to insert a new document with an id equal to a document already in the index/core/shard. The existing document will always be updated (implemented as old deleted and new added). Correct? Kind regards, Em Regards, Per Steffensen
Problem with unicode query
hello, I'm using Solr 3.5 over Tomcat 6 and I have some problems with unicode queries. Here is my text field configuration:

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>

When I perform this request: select/?q=hygiene sécurité&debugQuery=true here is the debug info:

<str name="rawquerystring">hygiene sécurité</str>
<str name="querystring">hygiene sécurité</str>
<str name="parsedquery">searchText:hygien (searchText:sa searchText:curit)</str>
<str name="parsedquery_toString">searchText:hygien (searchText:sa searchText:curit)</str>

As you can see, the unicode request failed: searchText:sa searchText:curit instead of searchText:securite. I've tried with ISOLatin1AccentFilterFactory, and I've changed the order, but no difference :( Any ideas ? Thanks Frederic
Re: Unique key constraint and optimistic locking (versioning)
Hi Per, I want an error to occur if a document with the same id already exists, when my intent is to INSERT a new document. When my intent is to UPDATE a document in solr/lucene I want the old document already in solr/lucene deleted and the new version of this document added (exactly as you explain). It will not be possible for solr/lucene to decide what to do unless I give it some information about my intent - whether it is INSERT or UPDATE semantics I want. I guess solr/lucene always give me INSERT sematics when a document with the same id does not already exist, and that it always give me UPDATE semantics when a document with the same id does exist? I cannot decide? Given that you've set a uniqueKey-field and there already exists a document with that uniqueKey, it will delete the old one and insert the new one. There is really no difference between the semantics - updates do not exist. To create a UNIQUE-constraint as you know it from a database you have to check whether a document is already in the index *or* whether it is already pending (waiting for getting flushed to the index). Fortunately Solr manages a so called pending-set with all those documents waiting for beeing flushed to disk (Solr 3.5). I think you have to write your own DirectUpdateHandler to achieve what you want on the Solr-level or to extend Lucenes IndexWriter to do it on the Lucene-Level. While doing so, keep track of what is going on in the trunk and how Near-Real-Time-Search will change the current way of handling updates. There is not built-in way to make solr/lucene give me an error if I try to insert a new document with an id equal to a document already in the index/core/shard. The existing document will always be updated (implemented as old deleted and new added). Correct? Exactly. 
If you really want to get your hands on that topic I suggest you to learn more about Lucene's IndexWriter: http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/index.html?org/apache/lucene/index/IndexWriter.html Kind Regards, Em
Re: Problem with unicode query
Hi Frederic, I saw similar issues when sending such a request without proper URL encoding. It is important to note that the URL-encoded string already has to be a UTF-8 string. What happens if you send that query via Solr's admin panel? Have a look at this page for troubleshooting: http://wiki.apache.org/solr/SolrTomcat Kind regards, Em On 23.02.2012 18:15, Frederic Bouchery wrote: hello, I'm using Solr 3.5 over Tomcat 6 and I have some problems with unicode queries. Here is my text field configuration:

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>

When I perform this request: select/?q=hygiene sécurité&debugQuery=true here is the debug info:

<str name="rawquerystring">hygiene sécurité</str>
<str name="querystring">hygiene sécurité</str>
<str name="parsedquery">searchText:hygien (searchText:sa searchText:curit)</str>
<str name="parsedquery_toString">searchText:hygien (searchText:sa searchText:curit)</str>

As you can see, the unicode request failed: searchText:sa searchText:curit instead of searchText:securite. I've tried with ISOLatin1AccentFilterFactory, and I've changed the order, but no difference :( Any ideas ? Thanks Frederic
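Following up on that SolrTomcat page: the usual fix for exactly this symptom (an accented query arriving as separately mangled characters) is to set URIEncoding on Tomcat's HTTP connector so GET query strings are decoded as UTF-8. A sketch of the relevant conf/server.xml fragment — port and timeout values are illustrative:

```xml
<!-- conf/server.xml: decode GET query strings as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8"/>
```

Without this, Tomcat decodes URI bytes as ISO-8859-1 by default, which splits a UTF-8 "é" into two latin-1 characters before the query ever reaches the analyzer.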
undefined field on CSV db import
I am trying to import a csv file of values via curl (PHP) and am receiving an 'undefined field' error, but I am not sure why, as I am defining the field. Can someone lend some insight as to what I am missing / doing wrong? Thank you in advance. Sample of CSV file (tab-separated):

Product_ID Product_Name Product_ManufacturerPart Product_Img ImageURL Manufacturer_Name lowestPrice vendorCount
-2121813476 Over-the-Sink Dish Rack 123478 http://image10.bizrate-images.com/resize?sq=60uid=2511766107mid=18900; WALTERDRAKE 24.99 1
-2121813460 Oregon Nike NCAA Twill Shorts - Mens - Green 00025305XODR http://image10.bizrate-images.com/resize?sq=60uid=2564249353mid=23598; Nike 44.99 3
-2121813456 Sudden Change Under Eye Firming Serum 091777 http://image10.bizrate-images.com/resize?sq=60uid=2564994087mid=18900; WALTERDRAKE 19.99 1
-2121813445 Global Keratin Leave-In Conditioner Cream 005248 http://image10.bizrate-images.com/resize?sq=60uid=2101271875mid=21473; Global Keratin 24 1
-2121813443 Oregon Nike NCAA Twill Shorts - Mens - White 00025305XODH http://image10.bizrate-images.com/resize?sq=60uid=2564226023mid=17345; Nike 59.99 3
-2121813441 Paul Brown Hawaii Shine Amplifier 4 oz. 000684 http://image10.bizrate-images.com/resize?sq=60uid=1171412855mid=21473; Paul Brown 20.1 1
-2121813437 Dish Drying Mat Large 077608 http://image10.bizrate-images.com/resize?sq=60uid=1371997268mid=18900; WALTERDRAKE 14.99 1

Solr Update URL: http://localhost:8983/solr/db/update/csv?commit=true&header=true&separator=%09&escape=\\&fieldNames=Product_ID,Product_Name,Product_ManufacturerPart,Product_Img,ImageURL,Manufacturer_Name,lowestPrice,vendorCount

Error Output:

HTTP ERROR 400
Problem accessing /solr/db/update/csv.
Reason: undefined field Product_ID
Powered by Jetty://

-- View this message in context: http://lucene.472066.n3.nabble.com/undefined-field-on-CSV-db-import-tp3770552p3770552.html Sent from the Solr - User mailing list archive at Nabble.com.
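For what it's worth, a 400 "undefined field" from the CSV handler usually means the target core's schema.xml has no such field: the fieldNames parameter only maps CSV columns onto fields, it does not create them. A sketch of the kind of declarations that would need to exist — the type names are guesses and depend on the fieldTypes defined in your schema:

```xml
<!-- schema.xml: one declaration per CSV column named in fieldNames -->
<field name="Product_ID"   type="string"       indexed="true" stored="true"/>
<field name="Product_Name" type="text_general" indexed="true" stored="true"/>
<field name="lowestPrice"  type="float"        indexed="true" stored="true"/>
<field name="vendorCount"  type="int"          indexed="true" stored="true"/>
<!-- ...and likewise for the remaining columns -->
```

Alternatively, a dynamicField pattern can absorb columns that are not individually declared.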
autoGeneratePhraseQueries sort of silently set to false
Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do with results when there were hyphenated words: aaa-bbb. Erik Hatcher pointed me to the autoGeneratePhraseQueries attribute now available on fieldtype definitions in schema.xml. This is a great feature, and everything is peachy if you start with Solr 3.4. But many of us started earlier and are upgrading, and that's a different story. It was surprising to me that a. the default for this new feature caused different search results than Solr 1.4 b. it wasn't documented clearly, IMO (http://wiki.apache.org/solr/SchemaXml makes no mention of it). In the schema.xml example, there is this at the top:

<!-- attribute "name" is the name of this schema and is only used for display purposes. Applications should change this to reflect the nature of the search collection. version="1.4" is Solr's version number for the schema syntax and semantics. It should not normally be changed by applications.
  1.0: multiValued attribute did not exist, all fields are multiValued by nature
  1.1: multiValued attribute introduced, false by default
  1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields
  1.3: removed optional field compress feature
  1.4: default auto-phrase (QueryParser feature) to off
-->

And there was this in a couple of field definitions:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">

But that was it.
Re: Multiple Property Substitution
*bump* I'm also curious if something like this is possible. Being able to nest property substitution variables, especially when using multiple cores, would be a really slick feature.

Zach Friedland wrote: Has anyone found a way to have multiple properties (override default)? What I'd like to create is a default property with an override property that usually wouldn't be set, but would be set as a JVM parameter if I want to turn off replication on a particular index on a particular server. I tried this syntax but it didn't work...

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="enable">${Solr.enable.slave.core.override:${Solr.enable.slave.default:false}}</str>
  </lst>
</requestHandler>

Thanks
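[Editorial aside: the semantics the poster is after can be stated as "use the override property if set, else the default property if set, else a literal fallback". Whether this Solr version resolves nested defaults is exactly the open question in the thread; the sketch below is not Solr code, just a small Python illustration of the intended resolution rule, innermost placeholder first:]

```python
import re

def resolve(text, props):
    """Resolve ${name:fallback} placeholders, innermost first.
    Illustrates the nested-default rule asked about above.
    No cycle detection; fallbacks may themselves be placeholders."""
    # Matches a placeholder whose body contains no nested ${...},
    # so the innermost one is always substituted first.
    pattern = re.compile(r"\$\{([^${}:]+)(?::([^${}]*))?\}")
    while True:
        m = pattern.search(text)
        if m is None:
            return text
        name, fallback = m.group(1), m.group(2)
        value = props.get(name, fallback if fallback is not None else "")
        text = text[:m.start()] + value + text[m.end():]

expr = "${Solr.enable.slave.core.override:${Solr.enable.slave.default:false}}"
print(resolve(expr, {}))                                      # neither property set
print(resolve(expr, {"Solr.enable.slave.default": "true"}))   # only the default set
print(resolve(expr, {"Solr.enable.slave.core.override": "false",
                     "Solr.enable.slave.default": "true"}))   # override wins
```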
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Robert, You found it! It is the phrase slop. What do I do now? I am using Solr from trunk from December, and all those JIRA tickets are marked fixed ... - Naomi

Solr 1.4, lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
final query: all_search:"the beatl as musician revolv through the antholog"~3
got result

Solr 3.5, lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
final query: all_search:"the beatl as musician revolv through the antholog"~3
NO result

Solr 3.5, lucene QueryParser, no slop:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote: On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote: Jonathan has brought it to my attention that BOTH of my failing searches happen to have 8 terms, and one of the terms is repeated: "The Beatles as musicians : Revolver through the Anthology" and "Color-blindness [print/digital]; its dangers and its detection" -- but this is a PHRASE search. Can you take your same phrase queries, and simply add some slop to them (e.g. ~3) and ensure they still match with the lucene queryparser? SloppyPhraseQuery has a bit of a history with repeats since Lucene 2.9 that you were using. https://issues.apache.org/jira/browse/LUCENE-3068 https://issues.apache.org/jira/browse/LUCENE-3215 https://issues.apache.org/jira/browse/LUCENE-3412 -- lucidimagination.com
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Is it possible to also provide your document? If you could attach the document and the analysis config and queries to a JIRA issue, that would be most ideal.

On Thu, Feb 23, 2012 at 2:05 PM, Naomi Dushay ndus...@stanford.edu wrote:

[quoted text snipped; see the earlier messages in this thread]

-- lucidimagination.com
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Robert, I will create a jira issue with the documentation. FYI, I tried ps values of 3, 2, 1 and 0 and none of them worked with dismax; for the lucene QueryParser, only the value of 0 got results. - Naomi

On Feb 23, 2012, at 11:12 AM, Robert Muir [via Lucene] wrote:

[quoted text snipped; see the earlier messages in this thread]
Re: Solr HBase - Re: How is Data Indexed in HBase?
regarding your question on hbase support for high performance and consistency - i would say hbase is highly scalable and performant. how it does what it does can be understood by reading relevant chapters around architecture and design in the hbase book. with regards to ranking, i see your problem. but if you split the problem into hbase specific solution and solr based solution, you can achieve the results probably. may be you do the ranking and store the rank in hbase and then use solr to get the results and then use hbase as a lookup to get the rank. or you can put the rank as part of the document schema and index the rank too for range queries and such. is my understanding of your scenario wrong? thanks On Wed, Feb 22, 2012 at 9:51 AM, Bing Li lbl...@gmail.com wrote: Mr Gupta, Thanks so much for your reply! In my use cases, retrieving data by keyword is one of them. I think Solr is a proper choice. However, Solr does not provide a complex enough support to rank. And, frequent updating is also not suitable in Solr. So it is difficult to retrieve data randomly based on the values other than keyword frequency in text. In this case, I attempt to use HBase. But I don't know how HBase support high performance when it needs to keep consistency in a large scale distributed system. Now both of them are used in my system. I will check out ElasticSearch. Best regards, Bing On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta tvi...@readypulse.comwrote: Bing, Its a classic battle on whether to use solr or hbase or a combination of both. both systems are very different but there is some overlap in the utility. they also differ vastly when it compares to computation power, storage needs, etc. so in the end, it all boils down to your use case. you need to pick the technology that it best suited to your needs. im still not clear on your use case though. btw, if you haven't started using solr yet - then you might want to checkout ElasticSearch. 
I spent over a week researching between solr and ES and eventually chose ES due to its cool merits. thanks On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu yuzhih...@gmail.com wrote: There is no secondary index support in HBase at the moment. It's on our road map. FYI On Wed, Feb 22, 2012 at 9:28 AM, Bing Li lbl...@gmail.com wrote: Jacques, Yes. But I still have questions about that. In my system, when users search with a keyword arbitrarily, the query is forwarded to Solr. No any updating operations but appending new indexes exist in Solr managed data. When I need to retrieve data based on ranking values, HBase is used. And, the ranking values need to be updated all the time. Is that correct? My question is that the performance must be low if keeping consistency in a large scale distributed environment. How does HBase handle this issue? Thanks so much! Bing On Thu, Feb 23, 2012 at 1:17 AM, Jacques whs...@gmail.com wrote: It is highly unlikely that you could replace Solr with HBase. They're really apples and oranges. On Wed, Feb 22, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote: Dear all, I wonder how data in HBase is indexed? Now Solr is used in my system because data is managed in inverted index. Such an index is suitable to retrieve unstructured and huge amount of data. How does HBase deal with the issue? May I replaced Solr with HBase? Thanks so much! Best regards, Bing
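[Editorial aside: the split suggested above -- Solr for keyword retrieval, HBase as a lookup for a frequently updated rank -- can be sketched as below. The thread names no concrete client API, so both "clients" here are plain dicts standing in for a Solr result list and an HBase table; the names and shapes are assumptions, not the poster's actual system:]

```python
# Pattern discussed above: query Solr for matching document ids,
# then look up an externally maintained rank (e.g. stored in HBase)
# and re-order the hits by it.

def rerank(solr_hits, rank_table, default_rank=0.0):
    """Re-order Solr hits by an externally stored rank, highest first.
    Documents with no stored rank sort last via default_rank."""
    return sorted(solr_hits,
                  key=lambda doc_id: rank_table.get(doc_id, default_rank),
                  reverse=True)

solr_hits = ["doc3", "doc1", "doc2"]      # ids as returned by a keyword query
rank_table = {"doc1": 0.9, "doc2": 0.4}   # ranks maintained outside Solr
print(rerank(solr_hits, rank_table))
```

The alternative mentioned in the same message -- indexing the rank as a Solr field -- avoids the second lookup but requires re-indexing documents whenever the rank changes, which is why the external lookup suits frequently updated ranks.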
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Please attach your docs if you don't mind. I worked up tests for this (in general, for ANY phrase query, increasing the slop should never remove results, only potentially enlarge them). It fails already... but it's good to also have your test case too...

On Thu, Feb 23, 2012 at 2:20 PM, Naomi Dushay ndus...@stanford.edu wrote:

[quoted text snipped; see the earlier messages in this thread]

-- lucidimagination.com
Re: Solr HBase - Re: How is Data Indexed in HBase?
Dear Mr Gupta, Your understanding of my solution is correct. Now both HBase and Solr are used in my system. I hope it will work. Thanks so much for your reply! Best regards, Bing

On Fri, Feb 24, 2012 at 3:30 AM, T Vinod Gupta tvi...@readypulse.com wrote:

[quoted text snipped; see the earlier messages in this thread]
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Robert - Did you mean for me to attach my docs to an existing ticket (which one?) or just want to make sure I attach the docs to the new issue? - Naomi

On Feb 23, 2012, at 11:39 AM, Robert Muir [via Lucene] wrote:

[quoted text snipped; see the earlier messages in this thread]
RE: autoGeneratePhraseQueries sort of silently set to false
Seems like a change in default behavior like this should be included in the changes.txt for Solr 3.5. Not sure how to do that. Tom

-----Original Message-----
From: Naomi Dushay [mailto:ndus...@stanford.edu]
Sent: Thursday, February 23, 2012 1:57 PM
To: solr-user@lucene.apache.org
Subject: autoGeneratePhraseQueries sort of silently set to false

[quoted text of Naomi's original message snipped; see the earlier message in this thread]
Re: autoGeneratePhraseQueries sort of silently set to false
there's this (for 3.1, but in the 3.x CHANGES.txt):

* SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField. autoGeneratePhraseQueries="true" (the default) causes the query parser to generate phrase queries if multiple tokens are generated from a single non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11 will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11). Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace delimited languages. (yonik)

with a ton of useful, though back and forth, commentary here: https://issues.apache.org/jira/browse/SOLR-2015

Note that the behavior, as Naomi pointed out so succinctly, is adjustable based off the *schema* version setting (look at your schema line in schema.xml). The code is simply this:

if (schema.getVersion() > 1.3f) {
  autoGeneratePhraseQueries = false;
} else {
  autoGeneratePhraseQueries = true;
}

on TextField. Specifying autoGeneratePhraseQueries explicitly on a field type overrides whatever the default may be.

Erik

On Feb 23, 2012, at 14:45, Burton-West, Tom wrote: Seems like a change in default behavior like this should be included in the changes.txt for Solr 3.5. Not sure how to do that. Tom

[quoted text of Naomi's original message snipped; see the earlier message in this thread]
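[Editorial aside: the version-gated default quoted above can be restated in one line -- schema versions after 1.3 turn auto-phrase off, earlier ones leave it on. A trivial Python restatement, purely illustrative:]

```python
def default_auto_phrase(schema_version):
    """Mirrors the TextField snippet quoted above: schema versions
    greater than 1.3 default autoGeneratePhraseQueries to false."""
    return not (schema_version > 1.3)

# A version="1.4" schema silently changes query behavior vs. "1.3".
print(default_auto_phrase(1.3), default_auto_phrase(1.4))
```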
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Please make a new one if you don't mind!

On Thu, Feb 23, 2012 at 2:45 PM, Naomi Dushay ndus...@stanford.edu wrote:

[quoted text snipped; see the earlier messages in this thread]

-- lucidimagination.com
Re: DataImportHandler running out of memory
On 2/20/2012 6:49 AM, v_shan wrote: DIH still running out of memory for me, with Full Import on a database of size 1.5 GB. Solr version: 3_5_0. Note that I have already added batchSize="-1" but am getting the same error.

A few questions:
- How much memory have you given to the JVM running this Solr instance?
- How much memory does your server have?
- What is the size of all your index cores, and how many documents are in them?
- How large are your Solr caches (filterCache, documentCache, queryResultCache)?
- What is your ramBufferSizeMB set to in the indexDefaults section?

Thanks, Shawn
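[Editorial aside: with the MySQL Connector/J driver, DIH's batchSize="-1" maps to a fetch size of Integer.MIN_VALUE, which makes the driver stream rows instead of buffering the whole result set in heap -- which is why it is the usual first fix for DIH out-of-memory errors. A sketch of the relevant data-config.xml fragment; the connection details are placeholders:]

```xml
<!-- Placeholder connection details. batchSize="-1" requests row
     streaming from MySQL Connector/J so the full result set is not
     buffered in the JVM heap during a full import. -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/mydb"
            user="solr"
            password="..."
            batchSize="-1"
            readOnly="true"/>
```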
RE: autoGeneratePhraseQueries sort of silently set to false
Thanks Erik, The 3.1 changes document the ability to set this, with the default being true. However, apparently between 3.4 and 3.5 the default was changed to false. Since this will change the behavior of any field where autoGeneratePhraseQueries is not explicitly set, it could easily surprise users who update to 3.5. That's why I think the change of the default behavior (i.e. when not explicitly set) should be called out explicitly in the changes.txt for 3.5. True, everyone should read the notes in the example schema.xml, but I think it would help if the change was also noted in changes.txt. Is it possible to revise the changes.txt for 3.5? Do you by any chance know where the change in the default behavior was discussed? I know it has been a contentious issue. Tom -----Original Message----- From: Erik Hatcher [mailto:erik.hatc...@gmail.com] Sent: Thursday, February 23, 2012 2:53 PM To: solr-user@lucene.apache.org Subject: Re: autoGeneratePhraseQueries sort of silently set to false there's this (for 3.1, but in the 3.x CHANGES.txt): * SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField. autoGeneratePhraseQueries="true" (the default) causes the query parser to generate phrase queries if multiple tokens are generated from a single non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11 will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11). Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace delimited languages. (yonik) with a ton of useful, though back and forth, commentary here: https://issues.apache.org/jira/browse/SOLR-2015 Note that the behavior, as Naomi pointed out so succinctly, is adjustable based off the *schema* version setting (look at your schema line in schema.xml). The code is simply this:

  if (schema.getVersion() > 1.3f) { autoGeneratePhraseQueries = false; }
  else { autoGeneratePhraseQueries = true; }

on TextField. 
Specifying autoGeneratePhraseQueries explicitly on a field type overrides whatever the default may be. Erik On Feb 23, 2012, at 14:45 , Burton-West, Tom wrote: Seems like a change in default behavior like this should be included in the changes.txt for Solr 3.5. Not sure how to do that. Tom -----Original Message----- From: Naomi Dushay [mailto:ndus...@stanford.edu] Sent: Thursday, February 23, 2012 1:57 PM To: solr-user@lucene.apache.org Subject: autoGeneratePhraseQueries sort of silently set to false Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do with results when there were hyphenated words: aaa-bbb. Erik Hatcher pointed me to the autoGeneratePhraseQueries attribute now available on fieldtype definitions in schema.xml. This is a great feature, and everything is peachy if you start with Solr 3.4. But many of us started earlier and are upgrading, and that's a different story. It was surprising to me that a. the default for this new feature caused different search results than Solr 1.4 b. it wasn't documented clearly, IMO; http://wiki.apache.org/solr/SchemaXml makes no mention of it. In the schema.xml example, there is this at the top:

  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       Applications should change this to reflect the nature of the search collection.
       version="1.4" is Solr's version number for the schema syntax and semantics.
       It should not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
       1.3: removed optional field compress feature
       1.4: default auto-phrase (QueryParser feature) to off -->

And there was this in a couple of field definitions:

  <fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">

But that was it.
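Given the version-dependent default discussed above, the safe upgrade path is to pin the behavior explicitly on each text field type rather than rely on the schema version attribute. A sketch (the field-type name is illustrative):

```xml
<!-- an explicit setting overrides whatever default the schema
     version= attribute would otherwise imply -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With this in place, upgrading the schema version (or Solr itself) no longer silently flips the query-parser behavior for that field type.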
Re: Solr Performance Improvement and degradation Help
Erick - Thanks. We've actually worked with Sematext to optimize the GC settings and saw initial (and continued) performance boosts as a result... The situation we're seeing now has both versions of Solr running on the same box under the same JVM, but we are undeploying an instance at a time so as to prevent any outlying performance hits in the tests... So, that being said, both instances of Solr on the same box are running under the optimized settings. I'd assume that if GC was impacting the results of the newer version of Solr, we'd see a similar decrease in performance on the older version. Aside from the QTime and other timings (highlight, etc.) - which are all faster in the new version - the overall response time/delivery of the results is significantly slower under the new version. I've unfortunately exhausted my knowledge of Solr and what may or may not have changed between the nightly builds. I do appreciate your insight and hope you'll continue to throw out some ideas... and maybe someone else out there has seen these inconsistencies as well. The last set of tests I ran consistently showed the older build of Solr bringing back a result set of 13.1MB with 1200 records in 2.3 seconds, whereas the newer build was bringing back the same result set in about 17.4 seconds. The catch is that the qtime and highlighting component time in the newer version are faster than in the older version. Again, if you have any more ideas, let me know. Thanks! Brian On Thu, Feb 23, 2012 at 11:51 AM, Erick Erickson [via Lucene] ml-node+s472066n377030...@n3.nabble.com wrote: Ah, no, my mistake. The wildcards for the fl list won't matter re: maxBooleanClauses, I didn't read carefully enough. I assume that just returning a field or two doesn't slow down. But one possible culprit, especially since you say this kicks in after a while, is garbage collection. 
Here's an excellent intro: http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/ Especially look at the getting a view into garbage collection section and try specifying those options. The result should be that your solr log gets stats dumped every time GC kicks in. If this is a problem, look at the times in the logfile after your system slows down. You'll see a bunch of GC dumps that collect very little unused memory. You can also connect to the process using jConsole (should be in the Java distro) and watch the memory tab, especially after your server has slowed down. You can also connect jConsole remotely... This is just an experiment, but any time I see and it slows down after ### minutes, GC is the first thing I think of. Best Erick On Thu, Feb 23, 2012 at 10:16 AM, naptowndev [hidden email]http://user/SendEmail.jtp?type=nodenode=3770307i=0 wrote: Erick - Agreed, it is puzzling. What I've found is that it doesn't matter if I pass in wildcards for the field list or not...but that the overall response time from the newer builds of Solr that we've tested (e.g. 4.0.0.2012.02.16) is slower than the older (4.0.0.2010.12.10.08.54.56) build. If I run the exact same query against those two cores, bringing back a payload of just over 13MB (xml), the older build brings it back in about 1.6 seconds and the newer build brings it back in about 8.4 seconds. Implementing the field list wildcard allows us to reduce the payload in the newer build (not an option in the older build). They payload is reduced to 1.8MB but takes over 3.5 seconds to come back as compared to the full payload (13MB) in the older build at about 1.6 seconds. With everything else remaining the same (machine/processors/memory/network and the code base calling Solr) it seems to point to something in the newer builds that's causing the slowdown, but I'm not intimate enough with Solr to be able to figure that out. 
We are using debugQuery=on in our tests to see timings and they aren't showing any anomalies, so that makes it even more confusing. From a wildcard perspective, it's on the fl parameter... here's a 'snippet' of part of our fl parameter for the query: fl=id, CategoryGroupTypeID, MedicalSpecialtyDescription, TermsMisspelled, DictionarySource, timestamp, Category_*_MemberReports, Category_*_MemberReportRange, Category_*_NonMemberReports, Category_*_Grade, Category_*_GradeDisplay, Category_*_GradeTier, Category_*_ReportLocations, Category_*_ReportLocationCoordinates, Category_*_coordinate, score. Please note that that fl param is greatly reduced from our full query; we have over 100 static fields and a slew of dynamic fields - but that should give you an idea of how we are using wildcards. I'm not sure about maxBooleanClauses... not being all that familiar with Solr, does that apply to wildcards used in the fl list? Thanks!
Backporting Wildcard fieldlist Features to 3.x versions
We are currently running tests against some of the more recent nightly builds of Solr 4, but have noticed some significant performance decreases recently. One of the reasons we are using Solr 4 is that we needed geofiltering and highlighting, which were not originally available in 3, from my understanding. It appears, however, that those features have been backported to 3.x. One other feature that we are very interested in, because we have very large payloads returning in our search, is the wildcard field list for return fields. We've seen it work in the later builds of 4.x, but again, the gain we are getting from the smaller payload by leaving out some fields (out of hundreds) is negated by some poor performance in the response times. Are there any plans to backport the wildcard fieldlist feature to 3.x? -- View this message in context: http://lucene.472066.n3.nabble.com/Backporting-Wildcard-fieldlist-Features-to-3-x-versions-tp3770953p3770953.html Sent from the Solr - User mailing list archive at Nabble.com.
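For readers following the thread, the 4.x wildcard field-list syntax referred to above looks like this (host and field names follow the poster's earlier examples):

```
http://localhost:8983/solr/select?q=*:*&fl=id,score,Category_*_Grade,Category_*_MemberReports
```

Each `*` expands against the matching dynamic fields at response-writing time, so only the matched stored fields are serialized into the payload.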
Re: need to support bi-directional synonyms
Honestly, I'd just map em both the same thing in the index. sprayer, washer = sprayer or sprayer, washer = sprayer_washer At both index and query time. Now if the source document includes either 'sprayer' or 'washer', it'll get indexed as 'sprayer_washer'. And if the user enters either 'sprayer' or 'washer', it'll search the index for 'sprayer_washer', and find source documents that included either 'sprayer' or 'washer'. Of course, if you really use sprayer_washer, then if the user actually enters sprayer_washer they'll also find sprayer, washer, and sprayer_washer. So it's probably best to actually use either 'sprayer' or 'washer' as the destination, even though it seems odd: sprayer, washer = washer Will do what you want, pretty sure. On 2/23/2012 1:03 AM, remi tassing wrote: Same question here... On Wednesday, February 22, 2012, geeky2gee...@hotmail.com wrote: hello all, i need to support the following: if the user enters sprayer in the desc field - then they get results for BOTH sprayer and washer. and in the other direction if the user enters washer in the desc field - then they get results for BOTH washer and sprayer. would i set up my synonym file like this? assuming expand = true.. sprayer = washer washer = sprayer thank you, mark -- View this message in context: http://lucene.472066.n3.nabble.com/need-to-support-bi-directional-synonyms-tp3767990p3767990.html Sent from the Solr - User mailing list archive at Nabble.com.
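A sketch of the setup Erick describes - the same mapping applied at both index and query time (the file name and filter placement are illustrative):

```
# synonyms.txt: collapse both terms to a single indexed form
sprayer, washer => washer
```

```xml
<!-- placed in BOTH the index-time and query-time analyzer chains;
     expand="false" keeps only the right-hand side of the mapping -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="false"/>
```

With this, documents containing either term are indexed as "washer", and queries for either term search for "washer", giving the bi-directional behavior Mark asked for without a special joined token.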
Date search by specific month and day
Hello all! We have a situation involving date searching that I could use some seasoned opinions on. What we have is a collection of records, each containing a Solr date field that we want to search on. The catch is that we want to be able to search for items that match a specific day/month. Essentially, we're trying to implement a "this day in history" feature for our dataset, so that users would be able to put in a date and we'd return all matching records from the past 100 years or so. Is there a way to perform this kind of search with only the basic Solr date field? Or would I have to parse out the month and day and store them in separate fields at indexing time? Thanks for the help! -Kurt
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Ticket created: https://issues.apache.org/jira/browse/SOLR-3158 (perhaps it's a lucene problem, not a Solr one -- feel free to move it or whatever.) - Naomi On Feb 23, 2012, at 11:55 AM, Robert Muir [via Lucene] wrote: Please make a new one if you dont mind! On Thu, Feb 23, 2012 at 2:45 PM, Naomi Dushay [hidden email] wrote: Robert - Did you mean for me to attach my docs to an existing ticket (which one?) or just want to make sure I attach the docs to the new issue? - Naomi On Feb 23, 2012, at 11:39 AM, Robert Muir [via Lucene] wrote: Please attach your docs if you dont mind. I worked up tests for this (in general for ANY phrase query, increasing the slop should never remove results, only potentially enlarge them). It fails already... but its good to also have your test case too... On Thu, Feb 23, 2012 at 2:20 PM, Naomi Dushay [hidden email] wrote: Robert, I will create a jira issue with the documentation. FYI, I tried ps values of 3, 2, 1 and 0 and none of them worked with dismax; For lucene QueryParser, only the value of 0 got results. - Naomi On Feb 23, 2012, at 11:12 AM, Robert Muir [via Lucene] wrote: Is it possible to also provide your document? If you could attach the document and the analysis config and queries to a JIRA issue, that would be most ideal. On Thu, Feb 23, 2012 at 2:05 PM, Naomi Dushay [hidden email] wrote: Robert, You found it! it is the phrase slop. What do I do now? 
I am using Solr from trunk from December, and all those JIRA tickets are marked fixed … - Naomi

Solr 1.4, lucene QueryParser:
  URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
  final query: all_search:"the beatl as musician revolv through the antholog"~3
  got result

Solr 3.5, lucene QueryParser:
  URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
  final query: all_search:"the beatl as musician revolv through the antholog"~3
  NO result

lucene QueryParser (no slop):
  URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
  final query: all_search:"the beatl as musician revolv through the antholog"

On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote: On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote: Jonathan has brought it to my attention that BOTH of my failing searches happen to have 8 terms, and one of the terms is repeated: "The Beatles as musicians : Revolver through the Anthology" / "Color-blindness [print/digital]; its dangers and its detection" - but this is a PHRASE search. Can you take your same phrase queries, and simply add some slop to them (e.g. ~3) and ensure they still match with the lucene queryparser? SloppyPhraseQuery has a bit of a history with repeats since the Lucene 2.9 that you were using. https://issues.apache.org/jira/browse/LUCENE-3068 https://issues.apache.org/jira/browse/LUCENE-3215 https://issues.apache.org/jira/browse/LUCENE-3412 -- lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770665.html Sent from the Solr - User mailing list archive at Nabble.com. 
Preferred file system for Solr
We are using a VeloDrive (SSD) to store and search our Solr index. The system is running on SLES 11. Right now we are using ext3, but I am wondering if anyone has any experience using XFS/ext3 on SSD or FusionIO for Solr. Does Solr have any preference for the underlying file system? Our index will be big (around 250M docs) to start with, adding 5M docs every week; 50 to 60% of that will be updates. -- View this message in context: http://lucene.472066.n3.nabble.com/Preferred-file-system-for-Solr-tp3771250p3771250.html Sent from the Solr - User mailing list archive at Nabble.com.
how to ignore cases while querying with a field with type=string?
hi all, I am storing a list of tags in a field using type="string" with the multiValued setting: <field name="pageKeywords" type="string" indexed="true" stored="true" multiValued="true"/> It works OK when I query with pageKeywords:"The ones", and when I search for "ones" no record will come up, as desired. However, it appears that the query is case sensitive, so the queries pageKeywords:"The ones" and pageKeywords:"The Ones" give different results, which is not desirable in my case. Is there some setting in the query to let it ignore case? Or do I have to correct the data by keeping everything lower case? Thank you. Yuhan Zhang
Re: undefined field on CSV db import
What does your schema.xml file look like? Is Product_ID defined as a field? Best Erick On Thu, Feb 23, 2012 at 1:24 PM, pmcgovern pmcgov...@portal63.com wrote: I am trying to import a CSV file of values via curl (PHP) and am receiving an 'undefined field' error, but I am not sure why, as I am defining the field. Can someone lend some insight as to what I am missing / doing wrong? Thank you in advance.

Sample of CSV File:
---
Product_ID Product_Name Product_ManufacturerPart Product_Img ImageURL Manufacturer_Name lowestPrice vendorCount
-2121813476 Over-the-Sink Dish Rack 123478 http://image10.bizrate-images.com/resize?sq=60uid=2511766107mid=18900; WALTERDRAKE 24.99 1
-2121813460 Oregon Nike NCAA Twill Shorts - Mens - Green 00025305XODR http://image10.bizrate-images.com/resize?sq=60uid=2564249353mid=23598; Nike 44.99 3
-2121813456 Sudden Change Under Eye Firming Serum 091777 http://image10.bizrate-images.com/resize?sq=60uid=2564994087mid=18900; WALTERDRAKE 19.99 1
-2121813445 Global Keratin Leave-In Conditioner Cream 005248 http://image10.bizrate-images.com/resize?sq=60uid=2101271875mid=21473; Global Keratin 24 1
-2121813443 Oregon Nike NCAA Twill Shorts - Mens - White 00025305XODH http://image10.bizrate-images.com/resize?sq=60uid=2564226023mid=17345; Nike 59.99 3
-2121813441 Paul Brown Hawaii Shine Amplifier 4 oz. 000684 http://image10.bizrate-images.com/resize?sq=60uid=1171412855mid=21473; Paul Brown 20.1 1
-2121813437 Dish Drying Mat Large 077608 http://image10.bizrate-images.com/resize?sq=60uid=1371997268mid=18900; WALTERDRAKE 14.99 1

Solr Update URL: http://localhost:8983/solr/db/update/csv?commit=true&header=true&separator=%09&escape=\\&fieldNames=Product_ID,Product_Name,Product_ManufacturerPart,Product_Img,ImageURL,Manufacturer_Name,lowestPrice,vendorCount

Error Output:
---
HTTP ERROR 400
Problem accessing /solr/db/update/csv. Reason:
    undefined field Product_ID
Powered by Jetty://

-- View this message in context: http://lucene.472066.n3.nabble.com/undefined-field-on-CSV-db-import-tp3770552p3770552.html Sent from the Solr - User mailing list archive at Nabble.com.
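The 400 means the first CSV column name has no matching field in schema.xml; every column must resolve to a static field, a dynamicField pattern, or be explicitly skipped. A sketch of what schema.xml would need (the field types here are guesses from the sample data, not confirmed in the thread):

```xml
<field name="Product_ID" type="string" indexed="true" stored="true"/>
<field name="Product_Name" type="text_general" indexed="true" stored="true"/>
<field name="Product_ManufacturerPart" type="string" indexed="true" stored="true"/>
<field name="Product_Img" type="string" indexed="false" stored="true"/>
<field name="ImageURL" type="string" indexed="false" stored="true"/>
<field name="Manufacturer_Name" type="text_general" indexed="true" stored="true"/>
<field name="lowestPrice" type="tfloat" indexed="true" stored="true"/>
<field name="vendorCount" type="tint" indexed="true" stored="true"/>
```

If a column should not be indexed at all, mapping it to the empty name in the fieldnames list tells the CSV handler to skip it.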
Re: Solr Performance Improvement and degradation Help
It's still worth looking at the GC characteristics, there's a possibility that the newer build uses memory such that you're tripping over some threshold, but that's grasping at straws. I'd at least hook up jConsole for a sanity check... But if your QTimes are fast, the next thing that comes to mind is that you're spending (for some reason I can't name) more time gathering your fields off disk. Which, with 1,200 records is a possibility. Again, the why is a mystery. But you can do some triage by returning just a few fields to see if that's the issue. Wild stab: Did you re-index the data for your new version of Solr? The index format changed not too long ago, so it's at least possible. But why that would slow things down so much is another mystery but it's worth testing. Another wild bit would be your documentCache. Is it sized large enough? As I remember, the figure is (max docs returned) * (possible number of simultaneous requests), see: http://wiki.apache.org/solr/SolrCaching#documentCache Is there any chance that enableLazyFieldLoading is false in solrconfig.xml? That could account for it. But I'm afraid it's a matter of trying to remove stuff from your process until something changes because this is pretty surprising... Best Erick On Thu, Feb 23, 2012 at 4:44 PM, naptowndev naptowndev...@gmail.com wrote: Erick - Thanks. We've actually worked with Sematext to optimize the GC settings and saw initial (and continued) performance boosts as a result... The situation we're seeing now, has both versions of Solr running on the same box under the same JVM, but we are undeploying an instance at a time so as to prevent any outlying performance hits in the tests... So, that being said, both instances of solr, on the same box are running under the optimized settings. I'd assume if GC was impacting the results of the newer version of Solr, we'd see similar decrease in performance on the older version. 
Aside from the QTime and other timings (highlight, etc) - which are all faster in the new version - the overall response time/delivery of the results is significantly slower under the new version. I've unfortunately exhausted my knowledge of Solr and what may or may not have changed between the nightly builds. I do appreciate your insight and hope you'll continue to throw out some ideas...and maybe someone else out there has seen these inconsistencies as well. The last set of tests I ran consistently showed the older build of Solr bringing back a result set of 13.1MB with 1200 records in 2.3 seconds, whereas the newer build was bringing back the same result set in about 17.4 seconds. The catch is that the qtime and highlighting component time in the newer version are faster than the older version. Again, if you have any more ideas, let me know. Thanks! Brian On Thu, Feb 23, 2012 at 11:51 AM, Erick Erickson [via Lucene] ml-node+s472066n377030...@n3.nabble.com wrote: Ah, no, my mistake. The wildcards for the fl list won't matter re: maxBooleanClauses, I didn't read carefully enough. I assume that just returning a field or two doesn't slow down. But one possible culprit, especially since you say this kicks in after a while, is garbage collection. Here's an excellent intro: http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/ Especially look at the "getting a view into garbage collection" section and try specifying those options. The result should be that your solr log gets stats dumped every time GC kicks in. If this is a problem, look at the times in the logfile after your system slows down. You'll see a bunch of GC dumps that collect very little unused memory. You can also connect to the process using jConsole (should be in the Java distro) and watch the memory tab, especially after your server has slowed down. You can also connect jConsole remotely... 
This is just an experiment, but any time I see and it slows down after ### minutes, GC is the first thing I think of. Best Erick On Thu, Feb 23, 2012 at 10:16 AM, naptowndev [hidden email]http://user/SendEmail.jtp?type=nodenode=3770307i=0 wrote: Erick - Agreed, it is puzzling. What I've found is that it doesn't matter if I pass in wildcards for the field list or not...but that the overall response time from the newer builds of Solr that we've tested (e.g. 4.0.0.2012.02.16) is slower than the older (4.0.0.2010.12.10.08.54.56) build. If I run the exact same query against those two cores, bringing back a payload of just over 13MB (xml), the older build brings it back in about 1.6 seconds and the newer build brings it back in about 8.4 seconds. Implementing the field list wildcard allows us to reduce the payload in the newer build (not an option in the older build). They payload is reduced to 1.8MB but takes over 3.5 seconds to come back as compared to the full payload (13MB) in the older build at about 1.6 seconds.
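The two solrconfig.xml knobs Erick points at - documentCache sizing and lazy field loading - look roughly like this; the sizes are illustrative, derived from the thread's 1200-doc result sets times a guessed handful of concurrent requests:

```xml
<!-- should hold roughly (max docs returned) x (concurrent requests),
     e.g. 1200 x 8; an undersized cache forces repeated stored-field
     reads from disk for large result sets -->
<documentCache class="solr.LRUCache"
               size="10240"
               initialSize="10240"
               autowarmCount="0"/>

<!-- load only the stored fields named in fl, instead of materializing
     every stored field for each returned document -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
```

With hundreds of stored fields per document, enableLazyFieldLoading=false would make even a trimmed fl list pay the cost of reading the full stored document, which matches the symptom of fast QTimes but slow delivery.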
Re: Date search by specific month and day
I think your best bet is to parse out the relevant units and index them independently. But this is probably only a few ints per record, so it shouldn't be much of a resource hog Best Erick On Thu, Feb 23, 2012 at 5:24 PM, Kurt Nordstrom kurt.nordst...@unt.edu wrote: Hello all! We have a situation involving date searching that I could use some seasoned opinions on. What we have is a collection of records, each containing a Solr date field by which we want search on. The catch is that we want to be able to search for items that match a specific day/month. Essentially, we're trying to implement a this day in history feature for our dataset, so that users would be able to put in a date and we'd return all matching records from the past 100 years or so. Is there a way to perform this kind of search with only the basic Solr date field? Or would I have parse out the month and day and store them in separate fields at indexing time? Thanks for the help! -Kurt
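Erick's suggestion can be sketched at indexing time. Here is a minimal Python version (the `_i` dynamic-field names and the indexing-in-Python setup are assumptions for illustration, not from the thread):

```python
from datetime import datetime

def add_day_in_history_fields(doc, iso_date):
    """Derive integer month/day fields from a Solr-style date string so a
    'this day in history' query can filter on them directly."""
    dt = datetime.strptime(iso_date, "%Y-%m-%dT%H:%M:%SZ")
    doc["event_month_i"] = dt.month  # hypothetical dynamic int fields
    doc["event_day_i"] = dt.day
    return doc

doc = add_day_in_history_fields({"id": "doc1"}, "1912-04-15T00:00:00Z")
```

The query side is then a plain filter, e.g. `fq=event_month_i:4 AND event_day_i:15`, which stays cheap regardless of how many years the data spans.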
Re: how to ignore cases while querying with a field with type=string?
I think your best bet is to NOT use string, use something like:

<fieldType name="lowercase" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire input string is preserved as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be when you want your sorting to be case insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

The TrimFilterFactory is optional here. This will do what you need. Of course you'll have to re-index. Best Erick On Thu, Feb 23, 2012 at 6:29 PM, Yuhan Zhang yzh...@onescreen.com wrote: hi all, I am storing a list of tags in a field using type=string with multiValued setting: field name=pageKeywords type=string indexed=true stored=true multiValued=true/ It works ok, when I query with pageKeyword:The ones. and when I search for ones no record will come up as desired. However, it appears that the query is case sensitive. so the query pageKeyword:The ones and pageKeyword:The Ones give different results, which is not desirable in my case. Is there some setting in the query to let it ignore the cases? or I have to correct the data by keeping everything lower case. Thank you. Yuhan Zhang
TikaLanguageIdentifierUpdateProcessorFactory(since Solr3.5.0) to be used in Solr3.3.0?
Hi, all, I am using org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory (since Solr 3.5.0) to do language detection, and it's cool. One issue: if I deploy Solr 3.3.0, is it possible to import that factory from Solr 3.5.0 and use it in Solr 3.3.0? The reason I am stuck on Solr 3.3.0 is that I am working on DSpace (discovery), which calls Solr, and for now the highest version that Solr can be upgraded to is 3.3.0. I would hope to do this while keeping the DSpace + Solr setup mostly as it is. Does anyone happen to know a way to solve this? Best Regards, Bing -- View this message in context: http://lucene.472066.n3.nabble.com/TikaLanguageIdentifierUpdateProcessorFactory-since-Solr3-5-0-to-be-used-in-Solr3-3-0-tp3771620p3771620.html Sent from the Solr - User mailing list archive at Nabble.com.
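For context, the factory is wired up as an update request processor chain in solrconfig.xml; the open question for a 3.3 backport is whether the 3.5 jar and its dependencies (Tika, the language-detection libraries) resolve against the 3.3 update-processor APIs. A sketch of the 3.5-style configuration (the chain name and field names are illustrative):

```xml
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <!-- fields examined for language detection -->
    <str name="langid.fl">title,text</str>
    <!-- field that receives the detected language code -->
    <str name="langid.langField">language_s</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is then referenced from the update request handler via the `update.chain` (or, in older releases, `update.processor`) parameter.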
How to increase Size of Document in solr
Hello friends, I am facing a problem during indexing with Solr. Indexing worked successfully when the data size was around 300 MB, but now my data size has increased to around 50 GB. Indexing takes 8 hours, and after that I found that the data had not been committed. I have tried 2 times but the same issue occurred. Is there any setting that needs to be done in the solrconfig.xml file to increase the data capacity, or is it some other problem? Please suggest; this will be very helpful to me. Thanks, Regards - Suneel Pandey Sr. Software Developer -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-increase-Size-of-Document-in-solr-tp3771813p3771813.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to increase Size of Document in solr
Hi, Suneel, There is a configuration in solrconfig.xml that you might need to look at. Below I set the limit to 2 GB:

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000" />

Best Regards, Bing -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-increase-Size-of-Document-in-solr-tp3771813p3771931.html Sent from the Solr - User mailing list archive at Nabble.com.
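Since the reported symptom is a long import finishing without a commit, it may also be worth bounding commits in solrconfig.xml so 50 GB of documents is not left as one giant uncommitted batch. A sketch with illustrative thresholds (not values from the thread):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically every 100k docs or every 10 minutes,
       whichever comes first, so a crash late in the run does not
       lose the whole import -->
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <maxTime>600000</maxTime>
  </autoCommit>
</updateHandler>
```

An explicit commit=true on the final update request (or a standalone `<commit/>`) is still needed to make the last partial batch visible.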
Re: Fast Vector Highlighter Working for some records only
Hi Koji, I am using Solr 3.5 and I want to highlight a multivalued field. When I supply a single value for the multivalued field, the highlighter works fine. But when I index multiple values for the field and try to highlight it, I get the following error with the Fast Vector Highlighter: java.lang.StringIndexOutOfBoundsException: String index out of range: -1099

I have set the following parameters using SolrJ:

query.add("hl.q", term);
query.add("hl.fl", "contents");
query.add("hl", "true");
query.add("hl.useFastVectorHighlighter", "true");
query.add("hl.snippets", "100");
query.add("hl.fragsize", "7");
query.add("hl.maxAnalyzedChars", "7");

can you please tell me the cause of this error? Thanks in advance Dhaivat Koji Sekiguchi wrote Hi dhaivat, I think you may want to use analysis.jsp: http://localhost:8983/solr/admin/analysis.jsp Go to the URL and look into how your custom tokenizer produces tokens, and compare with the output of Solr's inbuilt tokenizer. koji -- Query Log Visualizer for Apache Solr http://soleami.com/ (12/02/22 21:35), dhaivat wrote: Koji Sekiguchi wrote (12/02/22 11:58), dhaivat wrote: Thanks for reply, But can you please tell me why it's working for some documents and not for other. As Solr 1.4.1 cannot recognize hl.useFastVectorHighlighter flag, Solr just ignore it, but due to hl=true is there, Solr tries to create highlight snippets by using (existing; traditional; I mean not FVH) Highlighter. Highlighter (including FVH) cannot produce snippets sometime for some reasons, you can use hl.alternateField parameter. http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField koji -- Query Log Visualizer for Apache Solr http://soleami.com/ Thank you so much explanation, I have updated my solr version and using 3.5, Could you please tell me when i am using custom Tokenizer on the field,so do i need to make any changes related Solr highlighter. 
here is my custom analyser:

<fieldType name="custom_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="ns.solr.analyser.CustomIndexTokeniserFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="ns.solr.analyser.CustomSearcherTokeniserFactory"/>
  </analyzer>
</fieldType>

here is the field info:

<field name="contents" type="custom_text" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>

I am creating tokens using my custom analyser, and when I try to use the highlighter it does not work properly for the contents field. But when I tried Solr's inbuilt tokeniser, I found the word highlighted for the particular query. Please can you help me out with this? Thanks in advance Dhaivat -- View this message in context: http://lucene.472066.n3.nabble.com/Fast-Vector-Highlighter-Working-for-some-records-only-tp3763286p3766335.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://lucene.472066.n3.nabble.com/Fast-Vector-Highlighter-Working-for-some-records-only-tp3763286p3771933.html Sent from the Solr - User mailing list archive at Nabble.com.