Re: Index-time synonyms and trailing wildcard issue
Hello Jack,

Thanks for your answer. It helped me gain a deeper understanding of what happens at index time and find a solution myself.

It seems that putting the synonym filter in both filter chains (index and query), setting expand=false, and putting the desired synonym first in the row does the trick:

Synonyms line (reversed order!):
orange, apfelsine

All documents containing apfelsine are now mapped to orange, so there are no more documents containing apfelsine that would match a wildcard query for apfel*. (Apfelsine is a true synonym for Orange in German, meaning "Chinese apple". Apfel = apple, which shouldn't match oranges.)

Problem solved, thanks again for the help!

Johannes Rodenwald

----- Original Message -----
From: Jack Krupansky j...@basetechnology.com
To: solr-user@lucene.apache.org
Sent: Wednesday, 13 February 2013 17:17:40
Subject: Re: Index-time synonyms and trailing wildcard issue

By doing synonyms at index time, you cause apfelsin to be added to documents that contain only orang, so of course documents that previously only contained orang will now match for apfelsin or any term query that matches apfelsin, such as a wildcard.

At query time, Lucene cannot tell whether your original document contained apfelsin or if apfelsin was added when the document was indexed due to an index-time synonym.

Solution: Either disable index time synonyms, or have a parallel field (via copyField) that does not have the index-time synonyms.

But... perhaps you should clarify what you really intend to happen with these pseudo-synonyms.

-- Jack Krupansky
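The effect of the expand=false setup described above can be sketched in a few lines of Python. This is only a toy model, not Solr code: build_mapping, reduce_terms, term_hits and wildcard_hits are hypothetical helpers, the document ids are made up, and the sketch assumes that wildcard terms are matched against the index without going through the synonym filter.

```python
from fnmatch import fnmatchcase


def build_mapping(synonym_line):
    """Map every entry of a comma-separated synonym line to its first entry.

    This imitates the reduction described in the mail (expand=false):
    all synonyms collapse onto the first term in the row.
    """
    entries = [entry.strip() for entry in synonym_line.split(",") if entry.strip()]
    canonical = entries[0]
    return {entry: canonical for entry in entries}


def reduce_terms(terms, mapping):
    """Replace each term by its canonical synonym, if it has one."""
    return [mapping.get(term, term) for term in terms]


# Synonyms line from the mail, desired term first.
MAPPING = build_mapping("orange, apfelsine")

# Hypothetical documents, one term each for brevity.
DOCS = {
    "doc_apfelkuchen": ["apfelkuchen"],
    "doc_apfelsine": ["apfelsine"],
    "doc_orange": ["orange"],
}

# Index-time chain: the synonym mapping is applied to every document term.
INDEX = {doc_id: set(reduce_terms(terms, MAPPING)) for doc_id, terms in DOCS.items()}


def term_hits(term):
    """Plain term query: the query term goes through the same mapping."""
    analyzed = MAPPING.get(term, term)
    return sorted(doc_id for doc_id, terms in INDEX.items() if analyzed in terms)


def wildcard_hits(pattern):
    """Wildcard query: the pattern is matched against indexed terms as-is."""
    return sorted(
        doc_id
        for doc_id, terms in INDEX.items()
        if any(fnmatchcase(term, pattern) for term in terms)
    )


print(wildcard_hits("apfel*"))
print(term_hits("apfelsine"))
```

In this model the wildcard query only reaches the Apfelkuchen document, because apfelsine no longer exists as an indexed term, while a plain query for apfelsine still finds both the Apfelsine and the Orange document via the shared canonical term.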
Index-time synonyms and trailing wildcard issue
Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using a list of stemmed terms. When I do a wildcard search that matches part of an entry on the synonym list, Solr uses the synonyms it finds to generate the search results. I am trying to disable that behaviour, but with no success.

Example:

Stemmed synonyms: apfelsin, orang
Search term: apfel*
Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
Orange (bad, I don't want this match)

My questions are:

- Why does the synonym filter react to a wildcard query? It is not a multiterm-aware component (see http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html).
- How can I disable this behaviour, so that Orange is no longer returned by the query for apfel*?

Regards,
Johannes
Re: Index-time synonyms and trailing wildcard issue
By doing synonyms at index time, you cause apfelsin to be added to documents that contain only orang, so of course documents that previously only contained orang will now match for apfelsin or any term query that matches apfelsin, such as a wildcard.

At query time, Lucene cannot tell whether your original document contained apfelsin or if apfelsin was added when the document was indexed due to an index-time synonym.

Solution: Either disable index time synonyms, or have a parallel field (via copyField) that does not have the index-time synonyms.

But... perhaps you should clarify what you really intend to happen with these pseudo-synonyms.

-- Jack Krupansky

-----Original Message-----
From: Johannes Rodenwald
Sent: Wednesday, February 13, 2013 10:25 AM
To: solr-user@lucene.apache.org
Subject: Index-time synonyms and trailing wildcard issue

Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using a list of stemmed terms. When i do a wildcard search that matches a part of an entry on the synonym list, the synonyms found are used by solr to generate the search results. I am trying to disable that behaviour, but with no success.

Example:

Stemmed synonyms: apfelsin, orang
Search term: apfel*
Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, i want these matches)
Orange (bad, i dont want this match)

My questions are:

- Why does the synonym filter react on a wildcard query? For it is not a multiterm-aware component (see http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html)
- How can i disable this behaviour, so that Orange is no longer returned by the query for apfel*?

Regards,
Johannes
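The explanation above can be made concrete with a small Python model. It is a sketch under stated assumptions, not Lucene's implementation: SYNONYMS, index_terms and wildcard_hits are hypothetical names, the documents are invented, and wildcard matching is approximated with fnmatch on the set of indexed terms.

```python
from fnmatch import fnmatchcase

# Stemmed synonym group taken from the example in this thread.
SYNONYMS = [{"apfelsin", "orang"}]


def index_terms(doc_terms, expand=True):
    """Return the set of terms stored in the index for one document.

    With expand=True every member of a matching synonym group is added,
    which roughly mimics an index-time synonym filter with expansion.
    """
    indexed = set(doc_terms)
    if expand:
        for group in SYNONYMS:
            if indexed & group:
                indexed |= group
    return indexed


def wildcard_hits(pattern, index):
    """Return ids of documents with at least one indexed term matching pattern."""
    return sorted(
        doc_id
        for doc_id, terms in index.items()
        if any(fnmatchcase(term, pattern) for term in terms)
    )


# Hypothetical documents, already stemmed, one term each for brevity.
DOCS = {
    "doc_apfelkuchen": ["apfelkuch"],
    "doc_apfelsine": ["apfelsin"],
    "doc_orange": ["orang"],
}

with_synonyms = {d: index_terms(t, expand=True) for d, t in DOCS.items()}
without_synonyms = {d: index_terms(t, expand=False) for d, t in DOCS.items()}

print(wildcard_hits("apfel*", with_synonyms))
print(wildcard_hits("apfel*", without_synonyms))
```

With expansion, the Orange document carries the extra term apfelsin, so apfel* reaches it. Nothing in the index records that the term was injected rather than present in the text, which is why the wildcard query cannot exclude it.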
Re: Wildcard ? issue?
It seems it is applicable to Solr 3.6 and 4.0. My version is 3.5.

Regards!
Dalius Sidlauskas

On 08/02/12 17:26, Ahmet Arslan wrote:
>> I have already tried this and it did not help because it does not
>> highlight matches if wild-card is used. The field configuration turns data to:
>
> This writeup should explain your scenario:
> http://wiki.apache.org/solr/MultitermQueryAnalysis
Re: Wildcard ? issue?
Okay, I get it, 3.6 is not released yet. Thanks for help fellas!

Regards!
Dalius Sidlauskas

On 09/02/12 10:19, Dalius Sidlauskas wrote:
> It seems it is applicable to Solr 3.6 and 4.0. My version is 3.5.
>
> Regards!
> Dalius Sidlauskas
Re: Wildcard ? issue?
You can pull down 3.5 (aka 3.x) from the nightly build if you want, see:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/

The "last successful artifacts" link will probably be what you want.

Best
Erick

On Thu, Feb 9, 2012 at 5:35 AM, Dalius Sidlauskas dalius.sidlaus...@semantico.com wrote:
> Okay, I get it, 3.6 is not released yet. Thanks for help fellas!
>
> Regards!
> Dalius Sidlauskas
Wildcard ? issue?
Sorry for the inaccurate title.

I have 3 fields (dc_title, dc_title_unicode, dc_title_unicode_full) containing the same value:

<title xmlns="http://www.tei-c.org/ns/1.0">cal.lígraf</title>

and these fields are configured accordingly:

<fieldType name="xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="xml_unicode" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="xml_unicode_full" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

And finally my search configuration:

<requestHandler name="dictionary" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="mm">2&lt;-25%</str>
    <str name="qf">dc_title_unicode_full^2 dc_title_unicode^2 dc_title</str>
    <int name="rows">10</int>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

I am trying to match the field with various search phrases (all of them valid). These are the results:

#  search phrase  match?  Comment
1  cal.lígra?     yes
2  cal.ligra?     no      Changed í to i
3  cal.ligraf     yes
4  calligra?      no

The problem is attempt #2, which fails to match the data. Attempt #3 works once ? is replaced with f.

One more thing: if * is used instead of ?, other data such as cal.lígrafia is matched, but not cal.lígraf...

I have also spotted a logic mismatch in the debug parsedQuery field:

cal·lígraf:
+DisjunctionMaxQuery((dc_title:*calligraf*^2.0 | dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))

cal·lígra?:
+DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 | dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Should the second be *calligra?* instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10

--
Regards!
Dalius Sidlauskas
Re: Wildcard ? issue?
If you can not read this mail easily check this ticket: https://issues.apache.org/jira/browse/SOLR-3106
This is a copy.

Regards!
Dalius Sidlauskas

On 08/02/12 15:44, Dalius Sidlauskas wrote:
> Sorry for inaccurate title.
> [...]
Re: Wildcard ? issue?
Hi Dalius,

If you have not already tried it, check http://localhost:8983/solr/admin/analysis.jsp for your queries (enable verbose output for both the index and the query Field Value to get details) and see which filters and tokenizers are being applied.

Hope it helps!

-param

On 2/8/12 10:48 AM, Dalius Sidlauskas dalius.sidlaus...@semantico.com wrote:
> If you can not read this mail easily check this ticket:
> https://issues.apache.org/jira/browse/SOLR-3106
> This is a copy.
>
> Regards!
> Dalius Sidlauskas
Re: Wildcard ? issue?
I have already tried this and it did not help, because it does not highlight matches if a wildcard is used. The field configuration turns the data into:

dc_title: calligraf
dc_title_unicode: cal·lígraf
dc_title_unicode_full: cal·lígraf

Debug parsedquery says:

[Search for *cal·ligraf*]
+DisjunctionMaxQuery((dc_title:*calligraf* | dc_title_unicode:cal·ligraf^2.0 | dc_title_unicode_full:cal·ligraf^2.0))

[Search for *cal·ligra?*]
+DisjunctionMaxQuery((dc_title:*cal·ligra?* | dc_title_unicode:cal·ligra?^2.0 | dc_title_unicode_full:cal·ligra?^2.0))

Why is the *dc_title* field handled differently? The analysis looks fine:

Index Analyzer

org.apache.solr.analysis.HTMLStripCharFilterFactory {luceneMatchVersion=LUCENE_34}
  text: cal·lígraf
org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement=, pattern=-, maxBlockChars=1, luceneMatchVersion=LUCENE_34, blockDelimiters=}
  text: cal·lígraf
org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
  position 1, term text cal·lígraf, startOffset 43, endOffset 53
org.apache.solr.analysis.ICUFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
  position 1, term text calligraf, startOffset 43, endOffset 53

Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
  position 1, term text cal·ligra?, startOffset 0, endOffset 10
org.apache.solr.analysis.ICUFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
  position 1, term text calligra?, startOffset 0, endOffset 10

Is this a Solr or Lucene bug?

Regards!
Dalius Sidlauskas

On 08/02/12 16:03, Sethi, Parampreet wrote:
> Hi Dalius,
>
> If not already tried, Check http://localhost:8983/solr/admin/analysis.jsp
> (enable verbose output for both Field Value index and query for details)
> for your queries and see what all filters/tokenizers are being applied.
>
> Hope it helps!
>
> -param
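The difference between the analysis page and the parsed query can be illustrated with a rough Python model. It rests on an assumption that is consistent with the debug output above: in Solr 3.5 a term containing a wildcard is not passed through the query analyzer, while a plain term is. The fold function is only a crude stand-in for ICUFoldingFilterFactory, and INDEX and matching_fields are hypothetical names; the sketch covers attempts #1 to #3 only.

```python
import unicodedata
from fnmatch import fnmatchcase


def fold(text):
    """Crude stand-in for ICU folding: lowercase, strip accents, drop the middle dot."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch != "\u00b7"
    )


# Indexed terms per field, following the values reported in the mail.
INDEX = {
    "dc_title": {fold("cal·lígraf")},     # folded at index time -> calligraf
    "dc_title_unicode": {"cal·lígraf"},   # stored without folding
}


def matching_fields(pattern, analyze_wildcards):
    """Return the fields whose indexed terms match the wildcard pattern.

    analyze_wildcards=False models the assumed 3.5 behaviour (pattern used raw).
    analyze_wildcards=True models folding the pattern for the folded field,
    which is the idea behind multiterm query analysis.
    """
    hits = []
    for field, terms in INDEX.items():
        field_pattern = pattern
        if analyze_wildcards and field == "dc_title":
            field_pattern = fold(pattern)  # ? and * survive the folding
        if any(fnmatchcase(term, field_pattern) for term in terms):
            hits.append(field)
    return hits


print(matching_fields("cal·lígra?", analyze_wildcards=False))
print(matching_fields("cal·ligra?", analyze_wildcards=False))
print(matching_fields("cal·ligra?", analyze_wildcards=True))
```

Under these assumptions the raw pattern cal·ligra? matches neither field: dc_title holds the folded calligraf, and dc_title_unicode holds cal·lígraf with an accented í. Folding the pattern itself before matching is what makes it reach dc_title.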
Re: Wildcard ? issue?
> I have already tried this and it did not help because it does not
> highlight matches if wild-card is used. The field configuration turns data to:

This writeup should explain your scenario:
http://wiki.apache.org/solr/MultitermQueryAnalysis