Re: Index-time synonyms and trailing wildcard issue

2013-02-14 Thread Johannes Rodenwald
Hello Jack,

Thanks for your answer. It helped me gain a deeper understanding of what
happens at index time and find a solution myself:

It seems that putting the synonym filter in both filter chains (index and 
query), setting expand=false, and putting the desired synonym first in the 
row, does the trick:
Synonyms line (reversed order!):
orange, apfelsine

All documents containing apfelsine are now mapped to orange, so there are
no longer any documents containing apfelsine that would match a wildcard query
for apfel*. (Apfelsine is a true synonym for Orange in German, literally
meaning "Chinese apple". Apfel = apple, which shouldn't match oranges.)
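
The effect described above can be modelled with a minimal Python sketch. It is not Solr code: the mapping, the whitespace analysis, and the three documents are assumptions made for illustration, and stemming is left out. It only shows why reducing every synonym to the first entry on the line keeps apfelsine out of the index while term queries for either word still find the same documents.

```python
# Minimal model of a synonym filter with expand=false: every term in a
# synonym group is rewritten to the first term on the synonyms line.
SYNONYMS_LINE = "orange, apfelsine"


def build_mapping(line):
    terms = [term.strip() for term in line.split(",")]
    canonical = terms[0]
    return {term: canonical for term in terms}


MAPPING = build_mapping(SYNONYMS_LINE)


def analyze(text):
    """Lowercase, split on whitespace, then apply the synonym mapping."""
    return [MAPPING.get(token, token) for token in text.lower().split()]


# Hypothetical documents; the same analysis runs at index and query time.
DOCS = {1: "Apfelsine", 2: "Apfelkuchen", 3: "Orange"}
INDEX = {doc_id: analyze(text) for doc_id, text in DOCS.items()}


def prefix_search(prefix):
    """Wildcard query such as apfel*: compared against indexed terms as-is."""
    return sorted(d for d, terms in INDEX.items()
                  if any(t.startswith(prefix) for t in terms))


def term_search(word):
    """Ordinary term query: the query word goes through the same analysis."""
    wanted = set(analyze(word))
    return sorted(d for d, terms in INDEX.items() if wanted & set(terms))


print(prefix_search("apfel"))     # [2]
print(term_search("apfelsine"))   # [1, 3]
print(term_search("orange"))      # [1, 3]
```

Because the term apfelsine never reaches the index in this model, a prefix query for apfel has nothing to match in the orange documents.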

Problem solved, thanks again for the help!

Johannes Rodenwald 





Index-time synonyms and trailing wildcard issue

2013-02-13 Thread Johannes Rodenwald
Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using
a list of stemmed terms. When I do a wildcard search that matches part of an
entry in the synonym list, Solr uses the synonyms it finds to generate the
search results. I am trying to disable that behaviour, but without success.

Example:

Stemmed synonyms: 
apfelsin, orang

Search term:
apfel*

Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query, given that it is not
a multiterm-aware component (see
http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html)?
- How can I disable this behaviour, so that Orange is no longer returned by
the query for apfel*?

Regards,

Johannes


Re: Index-time synonyms and trailing wildcard issue

2013-02-13 Thread Jack Krupansky
By doing synonyms at index time, you cause apfelsin to be added to 
documents that contain only orang, so of course documents that previously 
only contained orang will now match for apfelsin or any term query that 
matches apfelsin, such as a wildcard. At query time, Lucene cannot tell 
whether your original document contained apfelsin or if apfelsin was 
added when the document was indexed due to an index-time synonym.
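
A small Python sketch of this mechanism follows. It is a toy inverted index rather than Lucene itself; the document names and the stem apfelkuch are invented for illustration. Once index-time expansion has run, the index holds only terms and document IDs, with no record of which terms came from the original text.

```python
from collections import defaultdict

# Stemmed synonym group from the thread; expansion adds every member.
SYNONYMS = {
    "apfelsin": ["apfelsin", "orang"],
    "orang": ["apfelsin", "orang"],
}


def index_terms(stemmed_tokens):
    """Index-time synonym expansion: each token becomes its whole group."""
    expanded = []
    for token in stemmed_tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


# Hypothetical documents, already tokenized and stemmed.
documents = {
    "doc_orange": ["orang"],
    "doc_apfelkuchen": ["apfelkuch"],
}

inverted = defaultdict(set)
for doc_id, tokens in documents.items():
    for term in index_terms(tokens):
        inverted[term].add(doc_id)


def wildcard_prefix(prefix):
    """A trailing-wildcard query enumerates every indexed term with the prefix."""
    hits = set()
    for term, doc_ids in inverted.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return sorted(hits)


print(wildcard_prefix("apfel"))  # ['doc_apfelkuchen', 'doc_orange']
```

The orange document is returned because the injected term apfelsin starts with apfel, which is the unwanted match from the original question.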


Solution: Either disable index time synonyms, or have a parallel field (via 
copyField) that does not have the index-time synonyms.
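
The parallel-field idea can be sketched in Python as below. The field names text_syn and text_plain, the documents, and the stems are hypothetical. Each document is indexed twice, once with synonym expansion and once without, so that term queries can use the first field and wildcard queries the second.

```python
SYNONYM_GROUP = ["apfelsin", "orang"]


def with_synonyms(tokens):
    """Analysis for the synonym field: expand any group member to the group."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYM_GROUP if token in SYNONYM_GROUP else [token])
    return expanded


def index_document(tokens):
    """Mimic a copyField: the same input feeds two differently analyzed fields."""
    return {
        "text_syn": set(with_synonyms(tokens)),   # for ordinary term queries
        "text_plain": set(tokens),                # for wildcard queries
    }


# Hypothetical, already stemmed documents.
INDEX = {
    "orange_doc": index_document(["orang"]),
    "apfelsine_doc": index_document(["apfelsin"]),
    "apfelkuchen_doc": index_document(["apfelkuch"]),
}


def term_query(field, term):
    return sorted(d for d, fields in INDEX.items() if term in fields[field])


def prefix_query(field, prefix):
    return sorted(d for d, fields in INDEX.items()
                  if any(t.startswith(prefix) for t in fields[field]))


print(term_query("text_syn", "apfelsin"))   # ['apfelsine_doc', 'orange_doc']
print(prefix_query("text_plain", "apfel"))  # ['apfelkuchen_doc', 'apfelsine_doc']
```

The cost of this design is a second indexed copy of the text, plus the need for the client or request handler to route wildcard queries to the field without synonyms.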


But... perhaps you should clarify what you really intend to happen with 
these pseudo-synonyms.


-- Jack Krupansky




Re: Wildcard ? issue?

2012-02-09 Thread Dalius Sidlauskas

It seems that writeup applies to Solr 3.6 and 4.0. My version is 3.5.

Regards!
Dalius Sidlauskas




Re: Wildcard ? issue?

2012-02-09 Thread Dalius Sidlauskas

Okay, I get it: 3.6 is not released yet. Thanks for the help, fellas!

Regards!
Dalius Sidlauskas




Re: Wildcard ? issue?

2012-02-09 Thread Erick Erickson
You can pull down 3.6 (aka 3.x) from the nightly build if you want; see:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/
The "Last Successful Artifacts" link will probably be what you want.

Best
Erick



Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas

Sorry for the inaccurate title.

I have three fields (dc_title, dc_title_unicode, dc_title_unicode_full)
containing the same value:


<title xmlns="http://www.tei-c.org/ns/1.0">cal.lígraf</title>

and these fields are configured accordingly:

<fieldType name="xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="xml_unicode" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="xml_unicode_full" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

And finally my search configuration:

<requestHandler name="dictionary" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="mm">2&lt;-25%</str>
    <str name="qf">dc_title_unicode_full^2 dc_title_unicode^2 dc_title</str>
    <int name="rows">10</int>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

I am trying to match the field with various search phrases (all of which are
valid). Here are the results:

#  Search phrase  Match?  Comment
1  cal.lígra?     yes
2  cal.ligra?     no      Changed í to i
3  cal.ligraf     yes
4  calligra?      no


The problem is attempt #2, which fails to match the data. Attempt #3 works
when ? is replaced with f.


One more thing: if * is used instead of ?, other data such as cal.lígrafia is
matched, but not cal.lígraf...


Also, I have spotted a logic mismatch in the debug parsedQuery field:

cal·lígraf:
+DisjunctionMaxQuery((dc_title:calligraf^2.0 |
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))

cal·lígra?:
+DisjunctionMaxQuery((dc_title:cal·lígra?^2.0 |
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))


Should the dc_title term in the second one be calligra? instead?
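
The parsedQuery output above is consistent with the wildcard term being sent to dc_title without passing through the query analyzer, which is how Solr releases before 3.6 are generally understood to treat wildcard terms. The Python sketch below illustrates that reading only. The fold function is a crude, assumed stand-in for ICUFoldingFilterFactory (it strips accents and the middle dot), and fnmatch stands in for Lucene's wildcard matching.

```python
import fnmatch
import unicodedata


def fold(text):
    """Crude stand-in for ICU folding: strip accents and the middle dot."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [ch for ch in decomposed
            if not unicodedata.combining(ch) and ch != "\u00b7"]
    return "".join(kept).lower()


# Term stored in dc_title after index-time folding.
indexed_term = fold("cal·lígraf")


def wildcard_match(pattern, analyze_pattern):
    """Match a ?/* pattern against the indexed term, optionally folding it."""
    if analyze_pattern:
        pattern = fold(pattern)
    return fnmatch.fnmatchcase(indexed_term, pattern)


query = "cal·lígra?"
print(indexed_term)                    # calligraf
print(wildcard_match(query, False))    # False: raw pattern vs folded term
print(wildcard_match(query, True))     # True: pattern folded like the index
```

In this model the unfolded pattern can never match the folded term in dc_title, while folding the pattern first, as multiterm-aware analysis would, makes it match.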

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10

--
Regards!
Dalius Sidlauskas



Re: Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas
If you cannot read this mail easily, check this ticket, which is a copy:
https://issues.apache.org/jira/browse/SOLR-3106


Regards!
Dalius Sidlauskas





Re: Wildcard ? issue?

2012-02-08 Thread Sethi, Parampreet
Hi Dalius,

If you have not already tried it, check
http://localhost:8983/solr/admin/analysis.jsp for your queries (enable verbose
output for both the index and query field values) and see which filters and
tokenizers are being applied.

Hope it helps!

-param





Re: Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas
I have already tried this, and it did not help because it does not
highlight matches if a wildcard is used. The field configuration turns the
data into:


dc_title: calligraf
dc_title_unicode: cal·lígraf
dc_title_unicode_full: cal·lígraf

Debug parsedquery says:

[Search for cal·ligraf]

+DisjunctionMaxQuery((dc_title:calligraf |
dc_title_unicode:cal·ligraf^2.0 | dc_title_unicode_full:cal·ligraf^2.0))


[Search for cal·ligra?]

+DisjunctionMaxQuery((dc_title:cal·ligra? |
dc_title_unicode:cal·ligra?^2.0 | dc_title_unicode_full:cal·ligra?^2.0))


Why is the dc_title field handled differently? The analysis looks fine:


Index Analyzer

org.apache.solr.analysis.HTMLStripCharFilterFactory
{luceneMatchVersion=LUCENE_34}
  text:        cal·lígraf

org.apache.solr.analysis.PatternReplaceCharFilterFactory
{replacement=, pattern=-, maxBlockChars=1,
luceneMatchVersion=LUCENE_34, blockDelimiters=}
  text:        cal·lígraf

org.apache.solr.analysis.WhitespaceTokenizerFactory
{luceneMatchVersion=LUCENE_34}
  position:    1
  term text:   cal·lígraf
  startOffset: 43
  endOffset:   53

org.apache.solr.analysis.ICUFoldingFilterFactory
{luceneMatchVersion=LUCENE_34}
  position:    1
  term text:   calligraf
  startOffset: 43
  endOffset:   53


Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory
{luceneMatchVersion=LUCENE_34}
  position:    1
  term text:   cal·ligra?
  startOffset: 0
  endOffset:   10

org.apache.solr.analysis.ICUFoldingFilterFactory
{luceneMatchVersion=LUCENE_34}
  position:    1
  term text:   calligra?
  startOffset: 0
  endOffset:   10


Is this a Solr or Lucene bug?

Regards!
Dalius Sidlauskas





Re: Wildcard ? issue?

2012-02-08 Thread Ahmet Arslan
> I have already tried this, and it did not help because it does not
> highlight matches if a wildcard is used. The field configuration turns the
> data into:

This writeup should explain your scenario :
http://wiki.apache.org/solr/MultitermQueryAnalysis