Re: Wildcard ? issue?

Dalius Sidlauskas Wed, 08 Feb 2012 08:24:10 -0800

I have already tried this and it did not help, because it does not highlight matches if a wildcard is used. The field configuration turns the data into:


dc_title: calligraf
dc_title_unicode: cal·lígraf
dc_title_unicode_full: cal·lígraf


Debug parsedquery says:

[Search for *cal·ligraf*]

+DisjunctionMaxQuery((dc_title:*calligraf* | dc_title_unicode:cal·ligraf^2.0 | dc_title_unicode_full:cal·ligraf^2.0))


[Search for *cal·ligra?*]

+DisjunctionMaxQuery((dc_title:*cal·ligra?* | dc_title_unicode:cal·ligra?^2.0 | dc_title_unicode_full:cal·ligra?^2.0))


Why is the *dc_title* field handled differently? The analysis looks fine:


     Index Analyzer


       org.apache.solr.analysis.HTMLStripCharFilterFactory
       {luceneMatchVersion=LUCENE_34}

text    cal·lígraf


       org.apache.solr.analysis.PatternReplaceCharFilterFactory
       {replacement=, pattern=-, maxBlockChars=10000,
       luceneMatchVersion=LUCENE_34, blockDelimiters=}

text    cal·lígraf


       org.apache.solr.analysis.WhitespaceTokenizerFactory
       {luceneMatchVersion=LUCENE_34}

position        1
term text       cal·lígraf
startOffset     43
endOffset       53


       org.apache.solr.analysis.ICUFoldingFilterFactory
       {luceneMatchVersion=LUCENE_34}

position        1
term text       calligraf
startOffset     43
endOffset       53


     Query Analyzer


       org.apache.solr.analysis.WhitespaceTokenizerFactory
       {luceneMatchVersion=LUCENE_34}

position        1
term text       cal·ligra?
startOffset     0
endOffset       10


       org.apache.solr.analysis.ICUFoldingFilterFactory
       {luceneMatchVersion=LUCENE_34}

position        1
term text       calligra?
startOffset     0
endOffset       10


Is this a Solr or Lucene bug?
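
To illustrate what the parsed queries above seem to show, here is a minimal sketch in Python (not Solr code). It assumes that a wildcard term skips the query analyzer and is compared unfolded against the folded indexed term, while a plain term is folded first. The fold function is a rough, hypothetical stand-in for ICUFoldingFilterFactory, and fnmatch is only a stand-in for Lucene wildcard matching.

```python
import fnmatch
import unicodedata

MIDDLE_DOT = "\u00b7"


def fold(text: str) -> str:
    """Hypothetical approximation of ICU folding: lowercase, strip
    combining accents, and drop the middle dot. Not the real filter."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(
        ch
        for ch in decomposed
        if not unicodedata.combining(ch) and ch != MIDDLE_DOT
    )


def wildcard_match(indexed_term: str, pattern: str) -> bool:
    """Match an indexed term against a pattern using ? and * wildcards."""
    return fnmatch.fnmatchcase(indexed_term, pattern)


# Term as stored in dc_title after index-time folding.
indexed = fold("cal·lígraf")

# Plain term: folded at query time, so it equals the indexed term.
plain_hit = fold("cal·ligraf") == indexed

# Wildcard term: assumed to bypass folding, so the middle dot survives
# in the pattern and cannot match the folded indexed term.
raw_wildcard_hit = wildcard_match(indexed, "cal·ligra?")

# Same wildcard term folded by hand before matching.
folded_wildcard_hit = wildcard_match(indexed, fold("cal·ligra?"))

print(indexed, plain_hit, raw_wildcard_hit, folded_wildcard_hit)
```

If that assumption holds, folding the wildcard term on the client side before sending the query would be one workaround to try.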

Regards!
Dalius Sidlauskas


On 08/02/12 16:03, Sethi, Parampreet wrote:

Hi Dalius,

If you have not already tried it, check http://localhost:8983/solr/admin/analysis.jsp
(enable verbose output for both Field Value index and query for details)
for your queries and see which filters/tokenizers are being applied.

Hope it helps!

-param

On 2/8/12 10:48 AM, "Dalius Sidlauskas"<dalius.sidlaus...@semantico.com>
wrote:

If you cannot read this mail easily, check this ticket:
https://issues.apache.org/jira/browse/SOLR-3106 This is a copy.

Regards!
Dalius Sidlauskas


On 08/02/12 15:44, Dalius Sidlauskas wrote:

Sorry for the inaccurate title.

I have three fields (dc_title, dc_title_unicode, dc_title_unicode_full)
containing the same value:

<title xmlns="http://www.tei-c.org/ns/1.0">cal.lígraf</title>

and these fields are configured accordingly:

<fieldType name="xml"  class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldType>

<fieldType name="xml_unicode"  class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

<fieldType name="xml_unicode_full"  class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

And finally my search configuration:

<requestHandler name="dictionary"  class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">all</str>
<str name="defType">edismax</str>
<str name="mm">2&lt;-25%</str>
<str name="qf">dc_title_unicode_full^2 dc_title_unicode^2 dc_title</str>
<int  name="rows">10</int>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

I am trying to match the field with various (valid) search phrases.
Here are the results:


#   Search phrase   Match?   Comment
1   cal.lígra?      yes
2   cal.ligra?      no       Changed í to i
3   cal.ligraf      yes
4   calligra?       no


The problem is attempt #2, which fails to match the data. Attempt #3
works when ? is replaced with f.

One more thing: if * is used instead of ?, other data such as
cal.lígrafia is matched, but not cal.lígraf...

Also, I have spotted a logic mismatch in the debug parsedQuery field:

*cal·lígraf:* +DisjunctionMaxQuery((dc_title:*calligraf*^2.0 |
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
*cal·lígra?:* +DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 |
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Should the second be "*calligra?*" instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10
