> splitOnCaseChange="1"

So, it does not get split during indexing because there is no case
change. But it does get split during search, so now you are looking for
partial tokens against a single combined token in the index, and they do
not match.

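A minimal sketch of one way to act on this, assuming the field only needs
case-insensitive matching and not splitting on case changes: switch off
splitOnCaseChange so that a mixed-case query such as "deLALAIN" stays a
single token and is folded to "delalain" by the ICU folding filter. The
field type name and the other attributes are copied from the definition
quoted further down in this thread. This has not been tested against
Solr 4.3, and it would also stop "MacBook" from being split into
"mac book".

```xml
<!-- Sketch only: same analyzer chain as in the quoted message, with
     splitOnCaseChange switched off (an assumption about the use case). -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- "deLALAIN" is no longer split into "de" + "LALAIN" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
    <!-- folding lowercases the single token to "delalain" -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain the query no longer produces the phrase "de (lalain
delalain)" seen in the debug output below, so the single indexed token
"delalain" can match directly.
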
The WordDelimiterFilterFactory is more for product IDs that have many
variant spellings. Your use case seems to be much more about simply
matching while ignoring case (looking at the last email only).

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 29 December 2014 at 17:12, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> Okay, some months later I've come back to this with an isolated reproduction
> case. Thanks very much for any advice or debugging help you can give.
>
> The WordDelimiter filter is making a mixed-case query NOT match the
> single-case source, when it ought to.
>
> I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no
> sense to debug here, and I need to install and try to reproduce on a more
> recent version).
>
> I have an index that includes ONE document (deleted and reindexed after
> index change), with content in only one field ("text") other than 'id', and
> that content is one word: "delalain".
>
> My analysis (both index and query, I don't have different ones) for the
> 'text' field is simply:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
>            autoGeneratePhraseQueries="true">
>   <analyzer>
>     <tokenizer class="solr.ICUTokenizerFactory" />
>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
>
>     <filter class="solr.ICUFoldingFilterFactory" />
>   </analyzer>
> </fieldType>
>
> I am querying simply with eg /select?defType=lucene&q=text%3Adelalain
>
> Querying for "delalain" finds this document, as expected. Querying for
> "DELALAIN" finds this document, as expected (note the
> ICUFoldingFilterFactory).
>
> However, querying for "deLALAIN" does not find this document, which is
> unexpected.
>
> INDEX analysis of the source, "delalain", ends in this in the index, which
> seems pretty straightforward, so I'll only bother pasting in the final index
> analysis:
>
> ######
> text       delalain
> raw_bytes  [64 65 6c 61 6c 61 69 6e]
> position   1
> start      0
> end        8
> type       <ALPHANUM>
> script     Latin
> #######
>
>
> QUERY analysis of the problematic query, "deLALAIN", looks like this:
>
> #####
> ICUT   text       deLALAIN
>        raw_bytes  [64 65 4c 41 4c 41 49 4e]
>        start      0
>        end        8
>        type       <ALPHANUM>
>        script     Latin
>        position   1
>
> WDF    text       de          LALAIN               deLALAIN
>        raw_bytes  [64 65]     [4c 41 4c 41 49 4e]  [64 65 4c 41 4c 41 49 4e]
>        start      0           2                    0
>        end        2           8                    8
>        type       <ALPHANUM>  <ALPHANUM>           <ALPHANUM>
>        position   1           2                    2
>        script     Common      Common               Common
>
> ICUFF  text       de          lalain               delalain
>        raw_bytes  [64 65]     [6c 61 6c 61 69 6e]  [64 65 6c 61 6c 61 69 6e]
>        position   1           2                    2
>        start      0           2                    0
>        end        2           8                    8
>        type       <ALPHANUM>  <ALPHANUM>           <ALPHANUM>
>        script     Common      Common               Common
> #######
>
>
> It's obviously the WordDelimiterFilter that is messing things up -- but
> how/why, and is it a bug?
>
> It wants to search for both "de lalain" as a phrase, as well as alternately
> "delalain" as one word -- that's the intended supported point of the WDF
> with this configuration, right? And should work?
>
> The problem is that it is not successfully matching "delalain" as one word --
> so, how to figure out why not and what to do about it?
>
> Previously, Erick and Diego asked for the info from &debug=query, so here is
> that as well:
>
> ####
> <lst name="debug">
>   <str name="rawquerystring">text:deLALAIN</str>
>   <str name="querystring">text:deLALAIN</str>
>   <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
>   <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
>   <str name="QParser">LuceneQParser</str>
> </lst>
> ####
>
> Hmm, that does not necessarily look quite right: if I interpret it
> correctly, it's looking for "de" followed by either "lalain" or "delalain".
> Ie, it would match "de delalain"? But that's not right at all.
>
> So, what's gone wrong? Something with WDF with configuration to
> generateWords/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a
> bug, one that might be fixed in a more recent Solr?).
>
> Thanks!
>
> Jonathan
>
>
> On 9/3/14 7:15 PM, Erick Erickson wrote:
>>
>> Jonathan:
>>
>> If at all possible, delete your collection/data directory (the whole
>> directory, including data) between runs after you've changed
>> your schema (at least any of your analysis that pertains to indexing).
>> Mixing old and new schema definitions can add to the confusion!
>>
>> Good luck!
>> Erick
>>
>> On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind <rochk...@jhu.edu>
>> wrote:
>>>
>>> Thanks Erick and Diego. Yes, I noticed in my last message I'm not
>>> actually using defaults, not sure why I chose non-defaults originally.
>>>
>>> I still need to find time to make a smaller isolation/reproduction case,
>>> I'm getting confusing results that suggest some other part of my field
>>> def may be pertinent.
>>>
>>> I'll come back when I've done that (hopefully next week), and include the
>>> _parsed_ from &debug=query then. Thanks!
>>>
>>> Jonathan
>>>
>>>
>>> On 9/2/14 4:26 PM, Erick Erickson wrote:
>>>>
>>>> What happens if you append &debug=query to your query? IOW, what does
>>>> the _parsed_ query look like?
>>>>
>>>> Also note that the defaults for WDFF are _not_ identical. catenateWords
>>>> and catenateNumbers are 1 in the index portion and 0 in the query
>>>> section. Still, this shouldn't be a problem all other things being equal.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>
>>>> On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind <rochk...@jhu.edu>
>>>> wrote:
>>>>
>>>>> On 9/2/14 1:51 PM, Erick Erickson wrote:
>>>>>
>>>>>> bq: In my actual index, query "MacBook" is matching ONLY "mac book",
>>>>>> and not "macbook"
>>>>>>
>>>>>> I suspect your query parameters for WordDelimiterFilterFactory don't
>>>>>> have catenate words set.
>>>>>>
>>>>>> What do you see when you enter these in both the index and query
>>>>>> portions of the admin/analysis page?
>>>>>>
>>>>>
>>>>> Thanks Erick!
>>>>>
>>>>> Our WordDelimiterFilterFactory does have catenate words set, in both
>>>>> index and query phases (is that right?):
>>>>>
>>>>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>>         generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>>>>         catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>> It's hard to cut and paste the results of the analysis page into email
>>>>> (or anywhere!), I'll give you screenshots, sorry -- and I'll give them
>>>>> for our whole real world app complex field definition. I'll also paste
>>>>> in our entire field definition below. But I realize my next step is
>>>>> probably creating a simpler isolation/reproduction case (unless you
>>>>> have a magic answer from this!).
>>>>>
>>>>> Again, the problem is that "MacBook" seems to be only matching on
>>>>> indexed "macbook" and not indexed "mac book".
>>>>>
>>>>>
>>>>> "MacBook" query analysis:
>>>>> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
>>>>>
>>>>> "MacBook" index analysis:
>>>>> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
>>>>>
>>>>> "mac book" index analysis:
>>>>> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
>>>>>
>>>>>
>>>>> Our entire actual field definition:
>>>>>
>>>>> <fieldType name="text" class="solr.TextField"
>>>>>            positionIncrementGap="100"
>>>>>            autoGeneratePhraseQueries="true">
>>>>>   <analyzer>
>>>>>     <!-- the rulefiles thing is to keep ICUTokenizerFactory from
>>>>>          stripping punctuation, so our synonym filter involving
>>>>>          C++ etc can still work.
>>>>>          From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
>>>>>          the rbbi file is in our local ./conf, copied from lucene
>>>>>          source tree -->
>>>>>     <tokenizer class="solr.ICUTokenizerFactory"
>>>>>                rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>>>>>
>>>>>     <filter class="solr.SynonymFilterFactory"
>>>>>             synonyms="punctuation-whitelist.txt"
>>>>>             ignoreCase="true"/>
>>>>>
>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>             generateWordParts="1" generateNumberParts="1"
>>>>>             catenateWords="1" catenateNumbers="1"
>>>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>>     <!-- folding needs to be after WordDelimiter, so WordDelimiter
>>>>>          can do its thing with full cases and such -->
>>>>>     <filter class="solr.ICUFoldingFilterFactory" />
>>>>>
>>>>>     <!-- ICUFolding already includes lowercasing, no
>>>>>          need for separate lowercasing step
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     -->
>>>>>
>>>>>     <filter class="solr.SnowballPorterFilterFactory"
>>>>>             language="English" protected="protwords.txt"/>
>>>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>   </analyzer>
>>>>> </fieldType>