mm, tie, qs, ps and CJKBigramFilter and edismax and dismax
When I have a field using CJKBigramFilter, parsed CJK chars have a different parsedQuery than non-CJK queries. (旧小说 is 3 chars, so 2 bigrams)

args sent in:  q={!qf=bi_fld}旧小说&pf=&pf2=&pf3=

debugQuery:
  <str name="rawquerystring">{!qf=bi_fld}旧小说</str>
  <str name="querystring">{!qf=bi_fld}旧小说</str>
  <str name="parsedquery">(+DisjunctionMaxQuery((((bi_fld:旧小 bi_fld:小说)~2))~0.01) ())/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:旧小 bi_fld:小说)~2))~0.01 ()</str>

If I use a non-CJK query string, with the same field:

args sent in:  q={!qf=bi_fld}foo bar&pf=&pf2=&pf3=

debugQuery:
  <str name="rawquerystring">{!qf=bi_fld}foo bar</str>
  <str name="querystring">{!qf=bi_fld}foo bar</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:foo)~0.01) DisjunctionMaxQuery((bi_fld:bar)~0.01))~2))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:foo)~0.01 (bi_fld:bar)~0.01)~2)</str>

Why are the parsedquery_toString formulas different? And is there any difference in the actual relevancy formula? How can you tell the difference between the minNrShouldMatch and a qs or ps or tie value, if they are all represented as ~n in the parsedquery string?

To try to get a handle on qs, ps, tie and mm:

args:  q={!qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"c d"~4)~0.01</str>

I get that qs, the query slop, is for explicit phrases in the query, so "a b"~5 makes sense. I also get that ps is for boosting of phrases, so I get (bi_fld:"c d"~4) … but where is (cjk_uni_pub_search:"a b c d"~4)?
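For reference, the bigramming behavior behind the 旧小说 example (3 chars producing 2 overlapping bigrams) can be sketched in a few lines of Python. This is an illustration only, not the real Lucene filter, which also classifies scripts and handles mixed CJK/non-CJK input:

```python
def cjk_bigrams(text):
    """Emit overlapping bigrams for a run of CJK characters,
    mimicking CJKBigramFilter with outputUnigrams=false.
    Illustrative sketch only -- the real filter also handles
    script classification and mixed-script token streams."""
    if len(text) < 2:
        return [text]  # a lone char has no bigram partner
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("旧小说"))  # 3 chars -> 2 bigrams: ['旧小', '小说']
```

So a 3-character query term yields exactly the two SHOULD clauses seen in the parsed query, which is why the ~2 (minNrShouldMatch over the bigrams) appears only for the CJK input.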
Using dismax (instead of edismax):

args:  q={!dismax qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"a b c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"a b c d"~4)~0.01</str>

So is this an edismax bug? FYI, I am running Solr 4.4.

I have fields defined like so:

  <fieldtype name="text_cjk_bi" class="solr.TextField" positionIncrementGap="1" autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory" />
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" />
    </analyzer>
  </fieldtype>

The request handler uses edismax:

  <requestHandler name="search" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="q.alt">*:*</str>
      <str name="mm">6<-1 6<90%</str>
      <int name="qs">1</int>
      <int name="ps">0</int>
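On telling the ~n values apart: the ~3 attached to the top-level BooleanQuery is the minNrShouldMatch computed from the mm spec, while ~0.01 on each DisjunctionMaxQuery is the tie value and ~5/~4 on the phrase queries are qs/ps slop. Below is a rough Python sketch of how an mm spec such as "6<-1 6<90%" maps a clause count to a minimum-match number. This is my reading of the documented mm semantics (the real logic is in Solr's SolrPluginUtils.calculateMinShouldMatch), and it assumes conditional rules are listed in ascending order of their threshold:

```python
def min_should_match(spec, clause_count):
    """Sketch of Solr's mm-spec evaluation (see Solr's
    SolrPluginUtils.calculateMinShouldMatch for the real thing).
    A conditional rule 'n<v' applies only when there are MORE than
    n optional clauses; the last applicable rule wins.  v may be a
    positive integer, a negative integer ('all but |v|'), or a
    percentage (truncated toward zero)."""
    def apply(value, count):
        if value.endswith('%'):
            pct = int(value[:-1])
            calc = int(count * abs(pct) / 100)  # truncate
            return count - calc if pct < 0 else calc
        v = int(value)
        return count + v if v < 0 else v

    result = clause_count  # default: every optional clause required
    for rule in spec.split():
        if '<' in rule:
            threshold, value = rule.split('<', 1)
            if clause_count > int(threshold):
                result = apply(value, clause_count)
        else:
            result = apply(rule, clause_count)
    return result

# 4 clauses: neither '6<' rule fires, so all 4 are required (the ~3
# in the "a b" c d example reflects 3 top-level clauses, all required)
print(min_should_match('6<-1 6<90%', 4))   # -> 4
# 10 clauses: the 90% rule applies -> int(9.0) = 9
print(min_should_match('6<-1 6<90%', 10))  # -> 9
```

That is why the edismax example above shows ~3: the query produced three top-level optional clauses, and with fewer than seven clauses this mm spec requires all of them.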
Re: mm, tie, qs, ps and CJKBigramFilter and edismax and dismax
Re the relevancy changes I note in my original message, there are already some issues filed:

- pertaining to the difference in how the phrase queries are merged into the main query: see Michael Dodsworth's comment of 25/Sep/12 on https://issues.apache.org/jira/browse/SOLR-2058 -- the ticket is closed, but this issue is not addressed.
- pertaining to skipping terms in phrase boosting when part of the query is a phrase: https://issues.apache.org/jira/browse/SOLR-4130

- Naomi

On Sep 3, 2013, at 5:54 PM, Naomi Dushay wrote:
[original message quoted in full; trimmed here]
Re: ICUTokenizer class not found with Solr 4.4
Hi Tom,

Sorry - I was meeting with the East-Asia librarians …

Perhaps you are missing the following from your solrconfig:

  <lib dir="/home/blacklight/solr-home/lib" />

(this is the top of my solrconfig.xml:)

  <config>
    <!-- NOTE: various comments and unused configuration possibilities have been purged
         from this file.  Please refer to http://wiki.apache.org/solr/SolrConfigXml,
         as well as the default solrconfig file included with Solr -->
    <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
    <luceneMatchVersion>4.4</luceneMatchVersion>
    <lib dir="/home/blacklight/solr-home/lib" />
    <dataDir>/data/solr/cjk-icu</dataDir>
    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
    <codecFactory class="solr.SchemaCodecFactory"/>
    <schemaFactory class="ClassicIndexSchemaFactory"/>
    <indexConfig>
  …

and here is my solr.xml, if it matters (note the sharedLib value):

  <?xml version="1.0" encoding="UTF-8" ?>
  <solr persistent="true" sharedLib="lib">
    <cores defaultCoreName="current" adminPath="/admin/cores">
      <core name="current" collection="current" dataDir="/data/solr/" loadOnStartup="true" instanceDir="./" transient="false"/>
    </cores>
  </solr>

On Aug 27, 2013, at 3:29 PM, Tom Burton-West wrote:

Hello all,

According to the README.txt in solr-4.4.0/solr/example/solr/collection1, all we have to do is create a collection1/lib directory and put whatever jars we want in there:

  "... /lib. If it exists, Solr will load any Jars found in this directory and use them to resolve any plugins specified in your solrconfig.xml or schema.xml"

I did so (see below). However, I keep getting a class not found error (see below). Has the default changed from what is documented in the README.txt file? Is there something I have to change in solrconfig.xml or solr.xml to make this work?

I looked at SOLR-4852, but don't understand. It sounds like maybe there is a problem if the collection1/lib directory is also specified in solrconfig.xml. But I didn't do that (i.e. out of the box solrconfig.xml). Does this mean that by following what it says in the README.txt, I am making some kind of a configuration error? I also don't understand the workaround in SOLR-4852. Is this an ICU issue? A Java 7 issue? A Solr 4.4 issue? Or did I simply not understand the README.txt?

Tom

--
org.apache.solr.common.SolrException; null:java.lang.NoClassDefFoundError: org/apache/lucene/analysis/icu/segmentation/ICUTokenizer

  ls collection1/lib
  icu4j-49.1.jar
  lucene-analyzers-icu-4.4-SNAPSHOT.jar
  solr-analysis-extras-4.4-SNAPSHOT.jar

https://issues.apache.org/jira/browse/SOLR-4852

collection1/README.txt excerpt:

  lib/
      This directory is optional.  If it exists, Solr will load any Jars
      found in this directory and use them to resolve any plugins
      specified in your solrconfig.xml or schema.xml (ie: Analyzers,
      Request Handlers, etc...).  Alternatively you can use the <lib>
      syntax in conf/solrconfig.xml to direct Solr to your plugins.  See
      the example conf/solrconfig.xml file for details.
Re: [solrmarc-tech] apostrophe / ayn / alif
The alif and ayn can also be used as diacritic-like characters in Korean; this is a known practice. But thanks anyway.

On May 24, 2012, at 9:30 AM, Charles Riley wrote:

Hi Naomi,

I don't have a conclusive answer for you on this yet, but let me pick up on a few points. First, the apostrophe is probably being handled through ignoring punctuation in the ICUCollationKeyFilterFactory. Alif isn't a diacritic but a letter, and its character properties would be handled as such, apparently also outside the scope of what the folding filter factory does unless it's tailored.

From the Solr wiki, this looks like a helpful rule of thumb:

"When to use a CharFilter vs a TokenFilter: There are several pairs of CharFilters and TokenFilters that have related (ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical (ie: PatternReplaceCharFilterFactory and PatternReplaceFilterFactory) functionality, and it may not always be obvious which is the best choice. The ultimate decision depends largely on what Tokenizer you are using, and whether you need to 'out smart' it by preprocessing the stream of characters. For example, maybe you have a tokenizer such as StandardTokenizer and you are pretty happy with how it works overall, but you want to customize how some specific characters behave. In such a situation you could modify the rules and re-build your own tokenizer with javacc, but perhaps it's easier to simply map some of the characters before tokenization with a CharFilter."

Charles

On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay <ndus...@stanford.edu> wrote:
[original message quoted in full; trimmed here]

- Naomi

--
Charles L. Riley
Catalog Librarian for Africana
Sterling Memorial Library, Yale University
zenodo...@gmail.com
203-432-7566
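The CharFilter-vs-TokenFilter advice above can be illustrated with a tiny pre-tokenization character mapping in the spirit of MappingCharFilter. The mapping table here is hypothetical (a real deployment would configure the real filter via a mapping file and tune it per cataloging practice), but the code points are the modifier letters commonly used for ayn and alif in romanized text:

```python
# Hypothetical MappingCharFilter-style mapping: fold the modifier
# letters used as ayn/alif in romanizations to a plain apostrophe,
# BEFORE tokenization, so the tokenizer never sees them.
FOLD_MAP = {
    '\u02bb': "'",  # MODIFIER LETTER TURNED COMMA
    '\u02bc': "'",  # MODIFIER LETTER APOSTROPHE
    '\u02be': "'",  # MODIFIER LETTER RIGHT HALF RING (alif)
    '\u02bf': "'",  # MODIFIER LETTER LEFT HALF RING (ayn)
}

def fold_chars(text):
    """Apply the character map ahead of tokenization, the way a
    CharFilter preprocesses the stream for the tokenizer."""
    return ''.join(FOLD_MAP.get(ch, ch) for ch in text)

print(fold_chars('Pak\u02bcs'))  # -> Pak's (modifier apostrophe folded)
```

Doing this as a CharFilter (rather than a TokenFilter) means the folding happens before the tokenizer decides token boundaries, which matters when the unfolded character would otherwise change how a word is split.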
apostrophe / ayn / alif
We are using the ICUFoldingFilterFactory with great success to fold diacritics so searches with and without the diacritics get the same results. We recently discovered we have some Korean records that use an alif diacritic instead of an apostrophe, and this diacritic is NOT getting folded. Has anyone experienced this for alif or ayn characters? Do you have a solution? - Naomi
autoGeneratePhraseQueries sort of silently set to false
Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do with results when there were hyphenated words: aaa-bbb. Erik Hatcher pointed me to the autoGeneratePhraseQueries attribute now available on fieldtype definitions in schema.xml. This is a great feature, and everything is peachy if you start with Solr 3.4. But many of us started earlier and are upgrading, and that's a different story.

It was surprising to me that
a. the default for this new feature caused different search results than Solr 1.4
b. it wasn't documented clearly, IMO -- http://wiki.apache.org/solr/SchemaXml makes no mention of it.

In the schema.xml example, there is this at the top:

  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       Applications should change this to reflect the nature of the search collection.
       version="1.4" is Solr's version number for the schema syntax and semantics.  It should
       not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
       1.3: removed optional field compress feature
       1.4: default auto-phrase (QueryParser feature) to off
    -->

And there was this in a couple of field definitions:

  <fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">

But that was it.
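To make the behavioral difference concrete: when the analyzer turns one whitespace-delimited query term (like aaa-bbb run through WordDelimiterFilter) into several tokens, the query parser either wraps them in an implicit phrase or leaves them as independent terms. A simplified sketch of that decision, not the real query parser:

```python
def build_query(tokens, auto_phrase):
    """Simplified sketch: when one query term analyzes into several
    tokens, autoGeneratePhraseQueries=true wraps them in an implicit
    PhraseQuery, while false leaves them as separate term clauses.
    (The real logic lives in Lucene's QueryParser/QueryBuilder.)"""
    if auto_phrase and len(tokens) > 1:
        return '"%s"' % ' '.join(tokens)  # implicit phrase query
    return ' '.join(tokens)               # independent term clauses

# 'aaa-bbb' split by WordDelimiterFilter into two tokens:
print(build_query(['aaa', 'bbb'], auto_phrase=True))   # -> "aaa bbb"
print(build_query(['aaa', 'bbb'], auto_phrase=False))  # -> aaa bbb
```

With auto-phrase on, a document containing "bbb aaa" no longer matches aaa-bbb, which is exactly the kind of result change that surprised people upgrading from 1.4.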
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Robert,

You found it! It is the phrase slop. What do I do now? I am using Solr from trunk from December, and all those JIRA tixes are marked fixed …

- Naomi

Solr 1.4, lucene QueryParser:
  URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
  final query:  all_search:"the beatl as musician revolv through the antholog"~3
  got result

Solr 3.5, lucene QueryParser:
  URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
  final query:  all_search:"the beatl as musician revolv through the antholog"~3
  NO result

Solr 3.5, lucene QueryParser (no slop):
  URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"

On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote:

On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay <[hidden email]> wrote:

Jonathan has brought it to my attention that BOTH of my failing searches happen to have 8 terms, and one of the terms is repeated:
  "The Beatles as musicians : Revolver through the Anthology"
  "Color-blindness [print/digital]; its dangers and its detection"
but this is a PHRASE search.

Can you take your same phrase queries, and simply add some slop to them (e.g. ~3) and ensure they still match with the lucene queryparser? SloppyPhraseQuery has a bit of a history with repeats since Lucene 2.9 that you were using.

https://issues.apache.org/jira/browse/LUCENE-3068
https://issues.apache.org/jira/browse/LUCENE-3215
https://issues.apache.org/jira/browse/LUCENE-3412
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Robert,

I will create a jira issue with the documentation. FYI, I tried ps values of 3, 2, 1 and 0 and none of them worked with dismax; for lucene QueryParser, only the value of 0 got results.

- Naomi

On Feb 23, 2012, at 11:12 AM, Robert Muir [via Lucene] wrote:

Is it possible to also provide your document? If you could attach the document and the analysis config and queries to a JIRA issue, that would be most ideal.

On Thu, Feb 23, 2012 at 2:05 PM, Naomi Dushay <[hidden email]> wrote:
[earlier messages in the thread quoted in full; trimmed here]
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Robert -

Did you mean for me to attach my docs to an existing ticket (which one?), or did you just want to make sure I attach the docs to the new issue?

- Naomi

On Feb 23, 2012, at 11:39 AM, Robert Muir [via Lucene] wrote:

Please attach your docs if you dont mind. I worked up tests for this (in general, for ANY phrase query, increasing the slop should never remove results, only potentially enlarge them). It fails already... but its good to also have your test case too...

On Thu, Feb 23, 2012 at 2:20 PM, Naomi Dushay <[hidden email]> wrote:
[earlier messages in the thread quoted in full; trimmed here]
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Ticket created: https://issues.apache.org/jira/browse/SOLR-3158 (perhaps it's a lucene problem, not a Solr one -- feel free to move it or whatever.)

- Naomi

On Feb 23, 2012, at 11:55 AM, Robert Muir [via Lucene] wrote:

Please make a new one if you dont mind!

On Thu, Feb 23, 2012 at 2:45 PM, Naomi Dushay <[hidden email]> wrote:
[earlier messages in the thread quoted in full; trimmed here]
result present in Solr 1.4, but missing in Solr 3.5, dismax only
I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from lucene QueryParser to prove the document exists and is found. I am completely stumped.

Here are the debugQuery details:

*** Solr 3.5 ***

lucene QueryParser:
  URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"

  6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
    1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of:
      48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
      0.02063975 = queryNorm
    6.0562754 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
      1.0 = tf(phraseFreq=1.0)
      48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
      0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
  final query:  +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01
  (no matches)

*** Solr 1.4 ***

lucene QueryParser:
  URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"

  5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
    1.0 = tf(phraseFreq=1.0)
    48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
    0.109375 = fieldNorm(field=all_search, doc=3469163)

dismax QueryParser:
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
  final query:  +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01

  score:  7.449651 = (MATCH) sum of:
    3.7248254 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
      0.7071068 = queryWeight(all_search:"the beatl as musician revolv through the antholog"~1), product of:
        48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
        0.014681898 = queryNorm
      5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
        1.0 = tf(phraseFreq=1.0)
        48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
        0.109375 = fieldNorm(field=all_search, doc=3469163)
    3.7248254 = weight(all_search:"the beatl as musician revolv through the antholog"~3 in 3469163), product of:
      0.7071068 = queryWeight(all_search:"the beatl as musician revolv through the antholog"~3), product of:
        48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
        0.014681898 = queryNorm
      5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
        1.0 = tf(phraseFreq=1.0)
        48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
        0.109375 = fieldNorm(field=all_search, doc=3469163)
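A side note on the ~0.01 attached to each dismax clause: that is the tie parameter of the DisjunctionMaxQuery, which combines per-field scores as the best score plus tie times the remaining scores. A sketch of that standard DisMax combination (it only matters when qf lists more than one field; here qf is just all_search):

```python
def dismax_score(field_scores, tie=0.01):
    """Standard DisMax combination across fields: take the maximum
    per-field score, then add tie * (sum of the other field scores).
    tie=0 -> pure max; tie=1 -> plain sum."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# e.g. three qf fields scoring 2.0, 1.0 and 0.5 for the same clause:
print(round(dismax_score([2.0, 1.0, 0.5], tie=0.01), 3))  # -> 2.015
```

So in the debug output above, ~0.01 is a tiebreak multiplier on score combination, while the ~1 and ~3 on the phrase queries are slop values -- same notation, very different meanings.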
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
I forgot to include the field definition information:

schema.xml:
  <field name="all_search" type="text" indexed="true" stored="false" />

solr 3.5:
  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" catenateWords="1" splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
  </fieldtype>

solr 1.4:
  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
      <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" catenateWords="1" splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
  </fieldtype>

And the analysis page shows the same results for Solr 3.5 and 1.4:

Solr 3.5:
  position:     1    2      3   4         5       6        7    8
  term text:    the  beatl  as  musician  revolv  through  the  antholog
  keyword:      false for every term
  startOffset:  0    4      12  15        27      36       44   48
  endOffset:    3    11     14  24        35      43       47   57
  type:         word for every term

Solr 1.4:
  term position:     1    2      3      4      5      6      7      8
  term text:         the  beatl  as     musician  revolv  through  the  antholog
  term type:         word for every term
  source start,end:  0,3  4,11   12,14  15,24  27,35  36,43  44,47  48,57

- Naomi
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Jonathan, I have the same problem without the colon - I tested that, but didn't mention it. mm can't be the issue either: in Solr 3.5, if I remove one of the occurrences of "the" (doesn't matter which), I get results. Removing any other word does NOT get results. And if the query isn't a phrase query, it gets results. And no, it can't be related to what you refer to as the dismax stopwords problem, since I can demonstrate the problem with a single field. I have run into problems in the past with a non-alpha character surrounded by spaces tanking my search results for dismax ... but I fixed that with this fieldType:

    <!-- single token with punctuation terms removed so dismax doesn't look for punctuation terms in these fields -->
    <!-- On client side, Lucene query parser breaks things up by whitespace *before* field analysis for dismax -->
    <!-- so punctuation terms ( : ;) are stopwords to allow results from other fields when these chars are surrounded by spaces in query -->
    <!-- do not lowercase -->
    <fieldType name="string_punct_stop" class="solr.TextField" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
        <!-- removing punctuation for Lucene query parser issues -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_punctuation.txt" enablePositionIncrements="true" />
      </analyzer>
    </fieldType>

My stopwords_punctuation.txt file is:

    # Punctuation characters we want to ignore in queries
    :
    ;
    /

and I used this type instead of string for fields in my dismax qf. Thus, the punctuation terms in the query are not present for the fields that were formerly string fields.
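The "varying field analysis" gotcha that this fieldType works around can be simulated outside Solr. A minimal Python sketch, with hypothetical analyzer functions standing in for real field analysis chains (the field names and analysis rules here are illustrative, not Naomi's actual schema):

```python
# Sketch of the dismax gotcha: dismax splits the query on whitespace BEFORE
# per-field analysis, so a bare ":" token becomes a query clause. A text-style
# analyzer drops punctuation-only tokens, but a keyword-style field keeps them,
# leaving a clause that only one field can ever match -- which interacts badly
# with a high mm. Both analyzers below are toy stand-ins.

import re

def text_field_analyze(token):
    """Keep only alphanumeric content; punctuation-only tokens vanish."""
    cleaned = re.sub(r"[^\w]", "", token).lower()
    return [cleaned] if cleaned else []

def string_field_analyze(token):
    """Keyword-style field: the token survives verbatim."""
    return [token]

def dismax_clauses(query, analyzers):
    # whitespace split happens first, then each token is analyzed per field
    clauses = []
    for tok in query.split():
        per_field = {name: fn(tok) for name, fn in analyzers.items()}
        if any(per_field.values()):  # token becomes a clause if ANY field keeps it
            clauses.append((tok, per_field))
    return clauses

analyzers = {"text": text_field_analyze, "string_raw": string_field_analyze}
for tok, per_field in dismax_clauses("Revolver : Anthology", analyzers):
    print(tok, per_field)
```

The ":" clause exists but can only match via the keyword-style field; with a high mm, documents lacking ":" in that field can never satisfy the query. Naomi's fix above makes ":" a stopword at query time so the clause never forms.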
- Naomi On Feb 22, 2012, at 3:41 PM, Jonathan Rochkind wrote: So I don't really know what I'm talking about, and I'm not really sure if it's related or not, but your particular query: The Beatles as musicians : Revolver through the Anthology With the lone word that's a ':', reminds me of a dismax stopwords-type problem I ran into. Now, I ran into it on 1.4. I don't know why it would be different on 1.4 and 3.x. And I see you aren't even using a multi-field dismax in your sample query, so it couldn't possibly be what I ran into... I don't think. But I'll write this anyway in case it gives someone some ideas. The problem I ran into is caused by different analysis in two fields both used in a dismax, one that ends up keeping : as a token, and one that doesn't. Which ends up having the same effect as the famous 'dismax stopwords problem'. Maybe somehow your schema changed such to produce this problem in 3.x but not in 1.4? Although again I realize the fact that you are only using a single field in your demo dismax query kind of suggests it's not this problem. Wonder if you try the query without the :, if the problem goes away, that might be a hint. Or, maybe someone more skilled at understanding what's in those Solr debug statements than I am (it's kind of all greek to me) will be able to take this hint and rule out or confirm that it may have something to do with your problem. Here I write up the issue I ran into (which may or may not have anything to do with what you ran into) http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/ Also, you don't say what your 'mm' is in your dismax queries, that could be relevant if it's got anything to do with anything similar to the issue I'm talking about. 
Hmm, I wonder if Solr 3.x changes the way dismax calculates number of tokens for 'mm' in such a way that the 'varying field analysis dismax gotcha' can manifest with only one field, if the way dismax counts tokens for 'mm' differs from number of tokens the single field's analysis produces? Jonathan On 2/22/2012 2:55 PM, Naomi Dushay wrote: I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from lucene QueryParser to prove the document exists and is found I am completely stumped. Here are the debugQuery details: ***Solr 3.5*** lucene QueryParser: URL: q=all_search:The Beatles as musicians : Revolver through the Anthology final query: all_search:the beatl as musician revolv through the antholog 6.0562754 = (MATCH) weight(all_search:the beatl as musician revolv through the antholog in 1064395), product of: 1.0 = queryWeight(all_search:the beatl as musician revolv through the antholog), product of: 48.450203 = idf(all_search
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Jonathan has brought it to my attention that BOTH of my failing searches happen to have 8 terms, and one of the terms is repeated:

    The Beatles as musicians : Revolver through the Anthology
    Color-blindness [print/digital]; its dangers and its detection

but this is a PHRASE search. In case it's relevant, both Solr 1.4 and Solr 3.5:

    do NOT use stopwords in the fieldtype
    mm is 6-1 690% for dismax
    qs is 1
    ps is 3

And both use this filter last:

    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />

... but I believe that filter only applies to consecutive tokens. Lastly, "Color-blindness [print/digital]; its and its detection" works ("dangers" is removed, rather than one of the repeated "its").

- Naomi
On 2/22/2012 2:55 PM, Naomi Dushay wrote: I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4 but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from the lucene QueryParser to prove the document exists and is found. I am completely stumped.
Here are the debugQuery details:

*** Solr 3.5 ***

lucene QueryParser:

    URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
    final query: all_search:"the beatl as musician revolv through the antholog"

    6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
      1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of:
        48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
        0.02063975 = queryNorm
      6.0562754 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
        1.0 = tf(phraseFreq=1.0)
        48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
        0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:

    URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
    final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01
    (no matches)

*** Solr 1.4 ***

lucene QueryParser:

    URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
    final query: all_search:"the beatl as musician revolv through the antholog"

    5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
      1.0 = tf(phraseFreq=1.0)
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.109375 = fieldNorm(field
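For readers puzzling over how dismax turns an mm spec into a required clause count (the "~n" on the outer boolean query), here is a rough Python sketch of the usual Solr mm mini-language: space-separated "N&lt;expr" conditions, negative values meaning "all but that many", percentages rounded down. This approximates the behavior of SolrPluginUtils.calculateMinShouldMatch and is not Solr's actual code:

```python
# Rough sketch of dismax mm interpretation. Assumes conditions appear in
# ascending threshold order, as in typical specs like "2<-1 5<-2 6<90%":
# the last condition whose threshold is below the clause count wins.

def apply_expr(num_clauses, expr):
    if expr.endswith("%"):
        calc = (num_clauses * int(expr[:-1])) // 100
    else:
        calc = int(expr)
    # negative numbers mean "all but that many"
    result = num_clauses + calc if calc < 0 else calc
    return min(max(result, 0), num_clauses)

def calc_min_should_match(num_clauses, spec):
    spec = spec.strip()
    if "<" in spec:
        result = num_clauses  # default: every optional clause required
        for part in spec.split():
            threshold, expr = part.split("<", 1)
            if num_clauses > int(threshold):
                result = apply_expr(num_clauses, expr)
        return result
    return apply_expr(num_clauses, spec)

# mm = "2<-1 5<-2 6<90%" with 8 optional clauses: 90% of 8, rounded down
print(calc_min_should_match(8, "2<-1 5<-2 6<90%"))  # -> 7
print(calc_min_should_match(4, "2<-1 5<-2 6<90%"))  # -> 3  (all but one)
```

With an 8-term query, this is why dropping one occurrence of a repeated term (down to 7 matchable positions) can flip a query between matching and not.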
Re: hierarchical faceting in Solr?
Chris Beer just did a revamp of the wiki page at: http://wiki.apache.org/solr/HierarchicalFaceting Yay Chris! - Naomi ( ... and I helped!)
hierarchical faceting in Solr?
Chris, Is there a document somewhere on how to do this? If not, might you create one? I could even imagine such a document living on the Solr wiki ... this one has mostly ancient content: http://wiki.apache.org/solr/HierarchicalFaceting - Naomi
Re: defType argument weirdness
qf_dismax and pf_dismax are irrelevant -- I shouldn't have included that info. They are passed in the url and they work; they do not affect this problem. Your reminder of debugQuery was a good one - I use that a lot but forgot in this case. Regardless, I thought that defType=dismax&q=*:* is supposed to be equivalent to q={!defType=dismax}*:* and also equivalent to q={!dismax}*:*

defType=dismax&q=*:* DOESN'T WORK:

    <str name="rawquerystring">*:*</str>
    <str name="querystring">*:*</str>
    <str name="parsedquery">+() ()</str>
    <str name="parsedquery_toString">+() ()</str>

leaving out the explicit query, defType=dismax WORKS:

    <null name="rawquerystring"/>
    <null name="querystring"/>
    <str name="parsedquery">+MatchAllDocsQuery(*:*)</str>
    <str name="parsedquery_toString">+*:*</str>

q={!dismax}*:* DOESN'T WORK:

    <str name="rawquerystring">*:*</str>
    <str name="querystring">*:*</str>
    <str name="parsedquery">+() ()</str>
    <str name="parsedquery_toString">+() ()</str>

leaving out the explicit query, q={!dismax} WORKS:

    <str name="rawquerystring">{!dismax}</str>
    <str name="querystring">{!dismax}</str>
    <str name="parsedquery">+MatchAllDocsQuery(*:*)</str>
    <str name="parsedquery_toString">+*:*</str>

q={!defType=dismax}*:* WORKS:

    <str name="rawquerystring">{!defType=dismax}*:*</str>
    <str name="querystring">{!defType=dismax}*:*</str>
    <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
    <str name="parsedquery_toString">*:*</str>

leaving out the explicit query, q={!defType=dismax} DOESN'T WORK:

    org.apache.lucene.queryParser.ParseException: Cannot parse '': Encountered "<EOF>" at line 1, column 0.

On Jul 18, 2011, at 5:44 PM, Erick Erickson wrote: What are qf_dismax and pf_dismax? They are meaningless to Solr. Try adding debugQuery=on to your URL and you'll see the parsed query, which helps a lot here. If you change these to the proper dismax values (qf and pf) you'll get better results.
As it is, I think you'll see output like:

    <str name="parsedquery">+() ()</str>

showing that your query isn't actually going against any fields.

Best, Erick
defType argument weirdness
I found a weird behavior with the Solr defType argument, perhaps with respect to default queries?

    defType=dismax&q=*:*      no hits
    q={!defType=dismax}*:*    hits
    defType=dismax            hits

Here is the request handler, which I explicitly indicate:

    <requestHandler name="search" class="solr.SearchHandler" default="true">
      <lst name="defaults">
        <str name="defType">lucene</str>
        <!-- lucene params -->
        <str name="df">has_model_s</str>
        <str name="q.op">AND</str>
        <!-- dismax params -->
        <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
        <str name="q.alt">*:*</str>
        <str name="qf_dismax">id^0.8 id_t^0.8 title_t^0.3 mods_t^0.2 text</str>
        <str name="pf_dismax">id^0.9 id_t^0.9 title_t^0.5 mods_t^0.2 text</str>
        <int name="ps">100</int>
        <float name="tie">0.01</float>
      </lst>
    </requestHandler>

Solr Specification Version: 1.4.0
Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

- Naomi
a Solr search recall problem you probably don't even know you're having
(sorry for cross postings - I think this is important information to disseminate)

Executive Summary: you probably need to increase your query slop. A lot.

We recently had a feedback ticket that a title search with a hyphen wasn't working properly. This is especially curious because we solved a bunch of problems with hyphen searching AND WROTE TESTS in the process, and all the existing hyphen tests pass. Tests like hyphens with no spaces before or after, 3 significant terms, 2 stopwords pass. Our metadata contains:

    record A with title: Red-rose chain.
    record B with title: Prisoner in a red-rose chain.

A title search: prisoner in a red-rose chain returns no results. Further exploration (the following are all title searches):

    red-rose chain == record A only
    red rose chain == record A only
    red rose chain == record A only
    red-rose chain == record A only
    red rose chain == records A and B
    red rose chain == records A and B (!!)

For more details and more about the solution, see http://discovery-grindstone.blogspot.com/2010/11/solr-and-hyphenated-words.html

- Naomi Dushay
Senior Developer
Stanford University Libraries
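Why increasing slop rescues these searches can be sketched with toy position lists: index-time word splitting gives a hyphenated word extra positions, so the indexed positions of the title drift away from the query's token positions and an exact (slop 0) phrase match fails. The positions below are illustrative, and the matcher is a naive approximation of Lucene's sloppy phrase matching, not the real algorithm:

```python
# Naive sketch: a phrase "matches with slop s" here if every query term can be
# aligned to an indexed position such that the positional drift fits within s.
# This is an illustration of the idea, not Lucene's actual edit-distance check.

from itertools import product

def phrase_matches(indexed, query, slop):
    """indexed/query: lists of (term, position)."""
    candidates = []
    for term, qpos in query:
        positions = [ipos for iterm, ipos in indexed if iterm == term]
        if not positions:
            return False
        candidates.append([(qpos, ipos) for ipos in positions])
    for combo in product(*candidates):
        shifts = [ipos - qpos for qpos, ipos in combo]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False

# hypothetical: the catenated form of "red-rose" consumed a position at index
# time, so "chain" sits one position further along than the query expects
indexed = [("red", 1), ("rose", 2), ("chain", 4)]
query   = [("red", 1), ("rose", 2), ("chain", 3)]

print(phrase_matches(indexed, query, 0))  # False
print(phrase_matches(indexed, query, 1))  # True
```

One position of drift per hyphenated word is why a generous query slop recovers recall without noticeably hurting precision.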
Re: a Solr search recall problem you probably don't even know you're having
Robert, Thanks! I've been using Solr 1.5 from trunk back in March - time to upgrade! I also like the put the stopword filter after the WDF filter fix. - Naomi On Nov 5, 2010, at 12:36 PM, Robert Muir wrote: On Fri, Nov 5, 2010 at 3:04 PM, Naomi Dushay ndus...@stanford.edu wrote: (sorry for cross postings - I think this is important information to disseminate) Executive Summary: you probably need to increase your query slop. A lot. I looked at your example, and it really looks a lot like https://issues.apache.org/jira/browse/SOLR-1852 This was fixed, and released in Solr 1.4.1... and of course from the upgrading notes: However, a reindex is needed for some of the analysis fixes to take effect. Your example Prisoner in a red-rose chain in Solr 1.4.1 no longer has the positions 1,4,7,8, but instead 1,4,5,6. I recommend upgrading to this bugfix release and re-indexing if you are having problems like this
facet data cleanup
Hi folks, We have a data cleanup effort going on here, and I thought I would share some information about how to poke around your facet values. Most of this comes from: http://wiki.apache.org/solr/SimpleFacetParameters

Exploring Facet Values:
---
facet field to examine: facet.field=
number of values to return: facet.limit=n
offset into the values: facet.offset=n
sort the facets alphabetically: facet.sort=index

    http://your.solr.baseurl/select?rows=0&facet.field=ffldname&facet.sort=index&facet.limit=250&facet.offset=0

Missing Facet Values:
---
to find how many documents are missing values: facet.missing=true&facet.mincount=really big

    http://your.solr.baseurl/select?rows=0&facet.field=ffldname&facet.mincount=1000&facet.missing=true

to find the documents with missing values:

    http://your.solr.baseurl/select?qt=standard&q=+uniquekey:[* TO *] -ffldname:[* TO *]

number of rows: rows=
offset: start=

- Naomi Dushay
Stanford University Libraries
http://searchworks.stanford.edu -- Blacklight on top of Solr
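The exploration URLs above can be generated programmatically when paging through a large field. A small Python sketch; the base URL and field name are placeholders, and note it also sets facet=true, which a live Solr needs for faceting to kick in:

```python
# Sketch: build facet-exploration URLs, paging through a field's values
# 250 at a time via facet.limit/facet.offset. Placeholder host and field.

from urllib.parse import urlencode

def facet_page_url(base_url, field, limit=250, offset=0):
    params = {
        "q": "*:*",
        "rows": 0,               # we only want the facet counts, not docs
        "facet": "true",
        "facet.field": field,
        "facet.sort": "index",   # alphabetical order, as in the examples above
        "facet.limit": limit,
        "facet.offset": offset,
    }
    return base_url + "/select?" + urlencode(params)

# walk the first three pages of values
for page in range(3):
    print(facet_page_url("http://your.solr.baseurl", "ffldname", offset=page * 250))
```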
indexversion not updating on master
I'm having trouble with replication, and I believe it's because the indexversion isn't updating on master. My solrconfig.xml on master:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">startup</str>
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">optimize</str>
        <!-- <str name="backupAfter">optimize</str> -->
        <str name="confFiles">solrconfig-slave.xml:solrconfig.xml,schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

BTW, I am certain that this does NOT work: <str name="replicateAfter">startup,commit,optimize</str> -- it MUST be separate elements.

My solrconfig.xml on slave:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://my_host:8983/solr/replication</str>
        <!-- Format is HH:mm:ss -->
        <str name="pollInterval">00:15:00</str>
      </lst>
    </requestHandler>

/replication?command=details on master (I don't understand why there are two indexVersion and two generation entries in this data):

    <lst name="details">
      <str name="indexSize">19.91 GB</str>
      <str name="indexPath">/data/solr/index</str>
      <arr name="commits">
        <lst>
          <long name="indexVersion">1270535894533</long>
          <long name="generation">32</long>
          <arr name="filelist">
            <str>_1xv.fdt</str>
            ...
            <str>_1xv.frq</str>
            <str>segments_w</str>
          </arr>
        </lst>
      </arr>
      <str name="isMaster">true</str>
      <str name="isSlave">false</str>
      <long name="indexVersion">1270535894534</long>
      <long name="generation">33</long>
    </lst>

master log shows the commit:

    INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=false)
    Apr 12, 2010 4:00:54 PM org.apache.solr.search.SolrIndexSearcher init
    INFO: Opening Searcher@31dd7736 main
    Apr 12, 2010 4:00:54 PM org.apache.solr.update.DirectUpdateHandler2 commit
    INFO: end_commit_flush
    Apr 12, 2010 4:00:54 PM org.apache.solr.search.SolrIndexSearcher warm

but indexversion is the OLD one, not the NEW one:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
      </lst>
      <long name="indexversion">1270535894533</long>
      <long name="generation">32</long>
    </response>

What's going on? - Naomi
Re: indexversion not updating on master
Does it matter that my last index update did NOT add any new documents and did NOT delete any existing documents? (For testing, I just re-ran the last update) - Naomi
termsComponent and filter queries
I have a field that has millions of values, and I need to get the next X values in alpha order. The terms component works fabulously for this. Here is a cooked up example of the terms a b f q r rr rrr y z zzz So if I ask for the 3 terms after r, I get rr, rrr and y. But now I'd like to apply a filter query on a different field. After the filter, my terms might be: b q r y z zzz So the 3 terms after r, given the filter, become y z and zzz Given that I have millions of terms, and they are not predictable for range queries ... how can I get the next X values of my field after one or more filters are applied? - Naomi
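The desired behavior can be modeled client-side against a sorted term list: take the next X terms strictly after a start value, with a filter applied first. (TermsComponent itself operates on raw index terms and does not honor filter queries, which is exactly the gap here.) A hedged sketch of the goal, using the email's own example data:

```python
# Sketch of "next X terms after a value, with a filter applied first".
# This simulates the desired result over an in-memory sorted list; it is
# NOT how TermsComponent works internally.

import bisect

def next_terms(sorted_terms, after, count, keep=lambda t: True):
    filtered = [t for t in sorted_terms if keep(t)]
    start = bisect.bisect_right(filtered, after)  # strictly after `after`
    return filtered[start:start + count]

terms = ["a", "b", "f", "q", "r", "rr", "rrr", "y", "z", "zzz"]

# unfiltered: the 3 terms after "r"
print(next_terms(terms, "r", 3))  # ['rr', 'rrr', 'y']

# after a hypothetical filter leaves only these terms
allowed = {"b", "q", "r", "y", "z", "zzz"}
print(next_terms(terms, "r", 3, keep=lambda t: t in allowed))  # ['y', 'z', 'zzz']
```

One common workaround consistent with this sketch: run a regular (filterable) query sorted by the field, rather than TermsComponent, when filters must apply.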
java doc error local params syntax for dismax
The javadoc for DisMaxQParserPlugin states that {!dismax qf=myfield,mytitle^2}foo creates a dismax query, but actually that gives an error. The correct syntax is {!dismax qf="myfield mytitle^2"}foo (could use single quotes instead of double quotes). - Naomi
Re: java doc error local params syntax for dismax
It's not just the spaces - it's that the quotes (single or double flavor) are required as well. On Sep 23, 2009, at 3:10 PM, Yonik Seeley wrote: On Wed, Sep 23, 2009 at 5:59 PM, Naomi Dushay ndus...@stanford.edu wrote: The javadoc for DisMaxQParserPlugin states that {!dismax qf=myfield,mytitle^2}foo creates a dismax query, but actually that gives an error. The correct syntax is {!dismax qf="myfield mytitle^2"}foo (could use single quotes instead of double quotes). Thanks, I always forget that dismax uses space separated, not comma separated lists. -Yonik
Re: java doc error local params syntax for dismax
Okay, but {!dismax qf="myfield mytitle^2"}foo works, while {!dismax qf=myfield mytitle^2}foo does NOT work. - Naomi

On Sep 23, 2009, at 5:52 PM, Yonik Seeley wrote: On Wed, Sep 23, 2009 at 8:24 PM, Naomi Dushay ndus...@stanford.edu wrote: It's not just the spaces - it's that the quotes (single or double flavor) are required as well. LocalParams are space delimited, so the original example would have worked if the dismax parser accepted comma delimited fields. -Yonik http://www.lucidimagination.com
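Why the quotes matter follows from local params being space-delimited key=value pairs: an unquoted space ends the value. A toy Python parser (deliberately simplified; Solr's real local-params parser handles more, e.g. the bare type name and `v=`):

```python
# Toy local-params parser: values are either quoted ('...' or "...") or run
# to the next whitespace. Illustration only, not Solr's implementation.

import re

def parse_local_params(s):
    assert s.startswith("{!") and s.endswith("}")
    body = s[2:-1]
    pairs = re.findall(r"""(\w+)=(?:'([^']*)'|"([^"]*)"|(\S+))""", body)
    return {k: (sq or dq or bare) for k, sq, dq, bare in pairs}

print(parse_local_params('{!dismax qf="myfield mytitle^2"}'))
# -> {'qf': 'myfield mytitle^2'}   (quotes keep both fields in the value)

print(parse_local_params('{!dismax qf=myfield mytitle^2}'))
# -> {'qf': 'myfield'}   (value stops at the space; "mytitle^2" is left dangling)
```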
Re: range queries on string field with millions of values
Hi Hoss, Thanks for this. The terms component approach, if I understand it correctly, will be problematic. I need to present not only the next X call numbers in sequence, but other fields in those documents (e.g. title, author). I assume the TermsComponent approach will only give me the next X call number values, not the documents. It sounds like Glen Newton's suggestion of mapping the call numbers to a float number is the most likely solution. I know it sounds ridiculous to do all this for a call number browse, but our faculty have explicitly asked for this. For humanities scholars especially, they know the call numbers that are of interest to them, and they browse the stacks that way (ML 1500s are opera, V35 is Verdi ...). They are using the research methods that have been successful for their entire careers. Plus, library materials are going to off-site, high-density storage, so the only way for them to browse all materials, regardless of location, via call number is online. I doubt they'll find this feature as useful as they expect, but it behooves us to give the users what they ask for. So yeah, our user needs are perhaps a little outside of your expectations. :-) - Naomi

On Nov 29, 2008, at 2:58 PM, Chris Hostetter wrote:

: The results are correct. But the response time sucks.
:
: Reading the docs about caches, I thought I could populate the query result
: cache with an autowarming query and the response time would be okay. But that
: hasn't worked. (See excerpts from my solrConfig file below.)
:
: A repeated query is very fast, implying caching happens for a particular
: starting point (42 above).
:
: Is there a way to populate the cache with the ENTIRE sorted list of values for
: the field, so any arbitrary starting point will get results from the cache,
: rather than grabbing all results from (x) to the end, then sorting all these
: results, then returning the first 10?

there's two caches that come into play for something like this...
the first cache is a low level Lucene cache called the FieldCache that is completely hidden from you (and for the most part: from Solr). anytime you sort on a field, it gets built, and reused for all sorts on that field. My original concern was that it wasn't getting warmed on newSearcher (because you have to be explicit about that). the second cache is the queryResultsCache which caches a window of an ordered list of documents based on a query, and a sort. you can see this cache in your Solr stats, and yes: these two requests result in different cache keys for the queryResultsCache...

    q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
    q=yourField:[52+TO+*]&sort=yourField+asc&rows=10

...BUT! ... the two queries below will result in the same cache key, and the second will be a cache hit, provided a sufficient value for queryResultWindowSize ...

    q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
    q=yourField:[42+TO+*]&sort=yourField+asc&rows=10&start=10

so perhaps the key to your problem is to just make sure that once the user gives you an id to start with, you scroll by increasing the start param (not altering the id) ... the first query might be slow but every query after that should be a cache hit (depending on your page size, and how far you expect people to scroll, you should consider increasing queryResultWindowSize). But as Yonik said: the new TermsComponent may actually be a better option for you -- doing two requests for every page (the first to get the N terms in your id field starting with your input, the second to do a query for docs matching any of those N ids) might actually be faster even though there won't likely even be any cache hits. My opinion: Your use case sounds like a waste of effort. I can't imagine anyone using a library catalog system ever wanting to look up a callnumber, and then scroll through all possible books with similar call numbers -- it seems much more likely that I'd want to look at other books with similar authors, or keywords, or tags ...
all things that are actually *easier* to do with Solr. (but then again: i don't work in a library. i trust that you know something i don't about what your users want.) -Hoss
Re: range queries on string field with millions of values
The point isn't really how the exact sort works - it's the performance issues, coupled with an unpredictable distribution along the entire possible sort space. The sort works. The range queries work. The performance sucks, and I haven't thought of a clever workaround. - Naomi

On Nov 27, 2008, at 9:41 AM, Alexander Ramos Jardim wrote: I did not even understand what you are considering to be the order on your call numbers.
Re: range queries on string field with millions of values
Gosh, I'm sorry to be so unclear. Hmm. Trying to clarify below:

On Nov 28, 2008, at 3:52 PM, Chris Hostetter wrote: Having read through this thread, i'm not sure i understand what exactly the problem is. my naive understanding is... 1) you want to sort by a field 2) you want to be able to paginate through all docs in order of this field. 3) you want to be able to start your pagination at any arbitrary value for this field. so (assuming the field is a simple number for now) you could use something like q=yourField:[42 TO *]&sort=yourField+asc&rows=10&start=0 where 42 is the arbitrary ID someone wants to start at.

perfect. This is the query I'm using. The results are correct. But the response time sucks. Reading the docs about caches, I thought I could populate the query result cache with an autowarming query and the response time would be okay. But that hasn't worked. (See excerpts from my solrConfig file below.) A repeated query is very fast, implying caching happens for a particular starting point (42 above). Is there a way to populate the cache with the ENTIRE sorted list of values for the field, so any arbitrary starting point will get results from the cache, rather than grabbing all results from (x) to the end, then sorting all these results, then returning the first 10?

This sentence below seems to imply that you have a solution which produces correct results, but doesn't produce results quickly...

right.

: I have a performance problem and I haven't thought of a clever way around it.

...however this line seems to suggest that you're having trouble getting at least 10 results from any query (?)

: Call numbers are squirrelly, so we can't predict the string that will
: appropriately grab at least 10 subsequent documents. They are certainly not
: consecutive!
> : so from
> : A123 B34 1970
> :
> : we're unable to predict if any of these will return at least 10 results:

I was trying to express that I couldn't do this:

  myfield:[X TO Y]

because I can't algorithmically compute Y. Glen Newton suggested a workaround, whereby I represent my squirrelly, but sortable, field values as floating point numbers, and then I can compute Y.

> ...but i'm not sure what exactly that means. for any given field,
> there is always going to be some value X such that myField:[X TO *]
> won't return at least 10 docs ... they are the last values in the
> index in order -- surely it's okay for your app to have an end state
> when you run out of data? :)

yes. Understood. This is not an issue.

> Oh, and BTW...
>
> : numbers in sort order. I have also mucked about with the cache
> : initialization, but that's not working either:
> :
> : listener event=firstSearcher class=solr.QuerySenderListener
>
> ...make sure you also do a newSearcher listener that does the same
> thing, otherwise your FieldCache (used for sorting) may not be warmed
> when commits happen

Yup yup yup. From solrconfig:

  <filterCache class="solr.LRUCache" size="2000" initialSize="1000"
               autowarmCount="50"/>

  <queryResultCache class="solr.LRUCache" size="1000" initialSize="500"
                    autowarmCount="500"/>

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- populate query result cache for sorted queries -->
      <lst>
        <str name="q">shelfkey:[0 TO *]</str>
        <str name="sort">shelfkey asc</str>
      </lst>
    </arr>
  </listener>

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- populate query result cache for sorted queries -->
      <lst>
        <str name="q">shelfkey:[0 TO *]</str>
        <str name="sort">shelfkey asc</str>
      </lst>
    </arr>
  </listener>
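To make the "populate the cache with the ENTIRE sorted list" idea from this exchange concrete: if the full sorted key list were held in memory once, any arbitrary starting point becomes a binary search plus a slice, with no per-request sort of the tail. A toy sketch in Python (the names and the in-memory list are assumptions for illustration, not a Solr API):

```python
import bisect

# Assume all_keys is the complete, pre-sorted list of shelfkeys for the
# index.  (In Lucene/Solr the indexed terms of a field are already stored
# in sorted order, which is what makes this plausible to build once.)
all_keys = sorted(["A123", "AZ27", "B99", "B999", "BBB11", "C12"])

def browse_from(start_key: str, rows: int = 10) -> list:
    # Binary-search the insertion point of start_key, then slice:
    # O(log n) per browse request instead of collecting and sorting
    # every key from start_key to the end of the index.
    i = bisect.bisect_left(all_keys, start_key)
    return all_keys[i:i + rows]

print(browse_from("B", rows=3))  # ['B99', 'B999', 'BBB11']
print(browse_from("Z"))          # [] -- the natural end state
```

The appeal is that the expensive part (building the sorted list) happens once per index version, so warming pays off for every starting point rather than only for previously seen ones.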
range queries on string field with millions of values
I have a performance problem and I haven't thought of a clever way around it.

I work at the Stanford University Libraries. We have a collection of over 8 million items. Each item has a call number. I have been asked to provide a way to browse forward and backward from an arbitrary call number.

I have managed to create fields that present the call numbers in appropriate sorts, both forward and reverse. (This is necessary because raw call numbers don't sort properly: A123 AZ27 B99 B999 BBB11.) We can ignore the reverse sorted range query problem; it's the same as the forward sorted range query. So I use a query like this:

  sortCallNum:[A123 B34 1970 TO *]&rows=10

Call numbers are squirrelly, so we can't predict the string that will appropriately grab at least 10 subsequent documents. They are certainly not consecutive! So from

  A123 B34 1970

we're unable to predict if any of these will return at least 10 results:

  A123 B34 1980
  A123 B34 V.8
  A123 B44
  A123 B67
  A123 C27
  A124*
  A22*
  AA*

You get the idea. I have not figured out a way to efficiently query for the next 10 call numbers in sort order.

I have also mucked about with the cache initialization, but that's not working either:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- populate query result cache for sorted queries -->
      <lst>
        <str name="q">shelfkey:[0 TO *]</str>
        <str name="sort">shelfkey asc</str>
      </lst>
    </arr>
  </listener>

Can anyone help me with this?

- Naomi
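One way to turn "the next 10 call numbers in sort order" into a cheaply bounded query is to precompute, at index time, each distinct key's ordinal position in the sorted key list and store it in a numeric field; the browse then becomes a bounded numeric range rather than an open-ended string range. A hedged sketch of the idea (the field name "shelfpos" is made up, and the ordinals would have to be recomputed whenever documents are added or removed):

```python
# Sketch: assign each distinct sorted key an ordinal and index it as a
# numeric field (called "shelfpos" here, an invented name).  The mapping
# must be rebuilt on every reindex, which is this scheme's main cost.
sorted_keys = ["A123", "AZ27", "B99", "B999", "BBB11"]
ordinal = {key: i for i, key in enumerate(sorted_keys)}

def numeric_range_for(start_key: str, rows: int = 10):
    # "The next `rows` call numbers starting at start_key" becomes a
    # bounded numeric range (lo inclusive, hi exclusive), so there is
    # no need to guess an upper-bound *string* Y.
    lo = ordinal[start_key]
    return (lo, lo + rows)

print(numeric_range_for("B99", rows=2))  # (2, 4)
```

This is essentially the "represent the sortable values as numbers" workaround mentioned elsewhere in the thread, with ordinals standing in for floats.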
single character terms in index - why?
I'm experienced with Lucene, less so with SOLR. I am looking at two systems built on top of SOLR for a library discovery service: Blacklight and VuFind. I checked the raw Lucene index using Luke and noticed that both of these indexes have single character terms in the index, such as "d" or "f".

I asked about this on the VuFind list, and was told I didn't understand SOLR and why it would need these. So I'm now asking: why would SOLR want single character terms? "a" is usually a stopword.

I know the library MARC data from which the index is derived has a lot of these characters because they denote subfields in the data. But why would we want them to be searchable?

Naomi Dushay
[EMAIL PROTECTED]
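A plausible source of those stray terms (an assumption, not a diagnosis of Blacklight or VuFind specifically): if a MARC field's value is extracted without stripping subfield structure, the one-letter subfield codes survive as free-standing tokens. A sketch of extracting only the subfield values (the sample record data is invented):

```python
# In raw MARC, each subfield is introduced by the delimiter chr(0x1F)
# followed by a one-letter code ("a", "c", ...).  Naive extraction
# leaves those codes behind as single-character tokens in the index.
raw_field = "\x1faFundamentals of physics /\x1fcDavid Halliday."

def subfield_values(field: str) -> list:
    # Split on the subfield delimiter, drop the one-letter code,
    # keep only the value text.
    return [chunk[1:].strip() for chunk in field.split("\x1f") if chunk]

print(subfield_values(raw_field))
# ['Fundamentals of physics /', 'David Halliday.']
```

Alternatively, a solr.LengthFilterFactory (with min="2") in the analysis chain would drop one-character tokens at index time, at the cost of also losing any legitimate single-letter terms.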