Re: WordDelimiter filter, expanding to multiple words, unexpected results

Alexandre Rafalovitch Mon, 29 Dec 2014 14:28:28 -0800

> splitOnCaseChange="1"

So the term does not get split during indexing, because "delalain" contains
no case change. But "deLALAIN" does get split at query time, so you are now
looking for partial tokens against a single combined token in the index,
and that does not match.
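
If the filter is to be kept, one possible way around this is to split on
case change only at index time. A rough, untested sketch, assuming the same
tokenizer and filters as in the field type quoted below (the name
"text_split_index_only" is made up for illustration):

  <fieldType name="text_split_index_only" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <!-- index side: same chain as in the quoted field type -->
    <analyzer type="index">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
    <!-- query side: assumption, do not split on case change here -->
    <analyzer type="query">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

With that, "deLALAIN" should stay a single token at query time and fold to
"delalain". The trade-off is that a query for "MacBook" would no longer be
split either, so it would only match the catenated "macbook" token and not
text that was indexed as the two words "mac book".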


The WordDelimiterFilterFactory is meant more for things like product IDs
that get written in many different ways. Your use case seems to be much more
about simply matching while ignoring case (looking at the last email only).
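
For plain case-insensitive matching, a field type without the
WordDelimiterFilterFactory may be enough. A minimal, untested sketch (the
name "text_folded" is hypothetical; the tokenizer and folding filter are the
ones from the field type quoted below):

  <fieldType name="text_folded" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- ICU folding includes lowercasing, so no separate lowercase step -->
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

With this chain, "delalain", "DELALAIN" and "deLALAIN" should all analyze to
the single token "delalain" at both index and query time. Since this changes
index-time analysis, the documents would need to be reindexed.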

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 29 December 2014 at 17:12, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> Okay, some months later I've come back to this with an isolated reproduction
> case. Thanks very much for any advice or debugging help you can give.
>
> The WordDelimiter filter is making a mixed-case query NOT match the
> single-case source, when it ought to.
>
> I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no
> sense to debug here, and I need to install and try to reproduce on a more
> recent version).
>
> I have an index that includes ONE document (deleted and reindexed after
> index change), with content in only one field ("text") other than 'id', and
> that content is one word: "delalain".
>
> My analysis (both index and query, I don't have different ones) for the
> 'text' field is simply:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
> autoGeneratePhraseQueries="true">
>       <analyzer>
>         <tokenizer class="solr.ICUTokenizerFactory" />
>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
>
>         <filter class="solr.ICUFoldingFilterFactory" />
>       </analyzer>
> </fieldType>
>
> I am querying simply with eg /select?defType=lucene&q=text%3Adelalain
>
> Querying for "delalain" finds this document, as expected. Querying for
> "DELALAIN" finds this document, as expected (note the ICUFoldingFilterFactory).
>
> However, querying for "deLALAIN" does not find this document, which is
> unexpected.
>
> INDEX analysis of the source, "delalain", ends up as this in the index, which
> seems pretty straightforward, so I'll only bother pasting in the final index
> analysis:
>
> ######
> text    delalain
> raw_bytes       [64 65 6c 61 6c 61 69 6e]
> position        1
> start   0
> end     8
> type    <ALPHANUM>
> script  Latin
> #######
>
>
>
>
> QUERY analysis of the problematic query, "deLALAIN", looks like this:
>
> #####
> ICUT    text    deLALAIN
>         raw_bytes       [64 65 4c 41 4c 41 49 4e]
>         start   0
>         end     8
>         type    <ALPHANUM>
>         script  Latin
>         position        1
>
>
> WDF     text    de      LALAIN  deLALAIN
>         raw_bytes       [64 65] [4c 41 4c 41 49 4e]     [64 65 4c 41 4c 41 49 4e]
>         start   0       2       0
>         end     2       8       8
>         type    <ALPHANUM>      <ALPHANUM>      <ALPHANUM>
>         position        1       2       2
>         script  Common  Common  Common
>
>
> ICUFF   text    de      lalain  delalain
>         raw_bytes       [64 65] [6c 61 6c 61 69 6e]     [64 65 6c 61 6c 61 69 6e]
>         position        1       2       2
>         start   0       2       0
>         end     2       8       8
>         type    <ALPHANUM>      <ALPHANUM>      <ALPHANUM>
>         script  Common  Common  Common
> #######
>
>
>
> It's obviously the WordDelimiterFilter that is messing things up -- but
> how/why, and is it a bug?
>
> It wants to search for both "de lalain" as a phrase, as well as alternately
> "delalain" as one word -- that's the intended supported point of the WDF
> with this configuration, right? And should work?
>
> The problem is that it is not successfully matching "delalain" as one word --
> so, how to figure out why not and what to do about it?
>
> Previously, Erick and Diego asked for the info from &debug=query, so here is
> that as well:
>
> ####
> <lst name="debug">
>   <str name="rawquerystring">text:deLALAIN</str>
>   <str name="querystring">text:deLALAIN</str>
>   <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
>   <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
>   <str name="QParser">LuceneQParser</str>
> </lst>
> ####
>
> Hmm, that does not look quite right. If I interpret it correctly, it's
> looking for "de" followed by either "lalain" or "delalain". I.e., it would
> match "de delalain"? But that's not right at all.
>
> So, what's gone wrong? Something with WDF with configuration to
> generateWordParts/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a
> bug, one that might be fixed in a more recent Solr?)
>
> Thanks!
>
> Jonathan
>
>
>
>
>
> On 9/3/14 7:15 PM, Erick Erickson wrote:
>>
>> Jonathan:
>>
>> If at all possible, delete your collection/data directory (the whole
>> directory, including data) between runs after you've changed
>> your schema (at least any of your analysis that pertains to indexing).
>> Mixing old and new schema definitions can add to the confusion!
>>
>> Good luck!
>> Erick
>>
>> On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind <rochk...@jhu.edu>
>> wrote:
>>>
>>> Thanks Erick and Diego. Yes, I noticed in my last message I'm not actually
>>> using defaults, not sure why I chose non-defaults originally.
>>>
>>> I still need to find time to make a smaller isolation/reproduction case, I'm
>>> getting confusing results that suggest some other part of my field def may
>>> be pertinent.
>>>
>>> I'll come back when I've done that (hopefully next week), and include the
>>> _parsed_ from &debug=query then. Thanks!
>>>
>>> Jonathan
>>>
>>>
>>>
>>> On 9/2/14 4:26 PM, Erick Erickson wrote:
>>>>
>>>>
>>>> What happens if you append &debug=query to your query? IOW, what does the
>>>> _parsed_ query look like?
>>>>
>>>> Also note that the defaults for WDFF are _not_ identical. catenateWords and
>>>> catenateNumbers are 1 in the index portion and 0 in the query section.
>>>> Still, this shouldn't be a problem all other things being equal.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>
>>>> On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind <rochk...@jhu.edu>
>>>> wrote:
>>>>
>>>>> On 9/2/14 1:51 PM, Erick Erickson wrote:
>>>>>
>>>>>> bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
>>>>>> not "macbook"
>>>>>>
>>>>>> I suspect your query parameters for WordDelimiterFilterFactory don't have
>>>>>> catenateWords set.
>>>>>>
>>>>>> What do you see when you enter these in both the index and query portions
>>>>>> of the admin/analysis page?
>>>>>>
>>>>>
>>>>> Thanks Erick!
>>>>>
>>>>> Our WordDelimiterFilterFactory does have catenateWords set, in both index
>>>>> and query phases (is that right?):
>>>>>
>>>>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>>>> catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>> It's hard to cut and paste the results of the analysis page into email (or
>>>>> anywhere!), so I'll give you screenshots, sorry -- and I'll give them for
>>>>> our whole real-world app's complex field definition. I'll also paste in our
>>>>> entire field definition below. But I realize my next step is probably
>>>>> creating a simpler isolation/reproduction case (unless you have a magic
>>>>> answer from this!).
>>>>>
>>>>> Again, the problem is that "MacBook" seems to be matching only on indexed
>>>>> "mac book" and not indexed "macbook".
>>>>>
>>>>>
>>>>> "MacBook" query analysis:
>>>>> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
>>>>>
>>>>> "MacBook" index analysis:
>>>>> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
>>>>>
>>>>> "mac book" index analysis:
>>>>> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
>>>>>
>>>>>
>>>>> Our entire actual field definition:
>>>>>
>>>>>     <fieldType name="text" class="solr.TextField"
>>>>> positionIncrementGap="100"
>>>>> autoGeneratePhraseQueries="true">
>>>>>         <analyzer>
>>>>>          <!-- the rulefiles thing is to keep ICUTokenizerFactory from
>>>>> stripping punctuation,
>>>>>               so our synonym filter involving C++ etc can still work.
>>>>>               From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
>>>>>               the rbbi file is in our local ./conf, copied from lucene
>>>>> source tree -->
>>>>>          <tokenizer class="solr.ICUTokenizerFactory"
>>>>> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>>>>>
>>>>>          <filter class="solr.SynonymFilterFactory"
>>>>> synonyms="punctuation-whitelist.txt"
>>>>> ignoreCase="true"/>
>>>>>
>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>>
>>>>>           <!-- folding needs to be after WordDelimiter, so WordDelimiter
>>>>>                can do its thing with full cases and such -->
>>>>>           <filter class="solr.ICUFoldingFilterFactory" />
>>>>>
>>>>>
>>>>>           <!-- ICUFolding already includes lowercasing, no
>>>>>                need for separate lowercasing step
>>>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>>>           -->
>>>>>
>>>>>           <filter class="solr.SnowballPorterFilterFactory"
>>>>> language="English" protected="protwords.txt"/>
>>>>>           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>         </analyzer>
>>>>>       </fieldType>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>
