Hi Jonathan,

I'm a little confused by this line:
> And, what I think it's trying to do, is match text indexed as "d elalain" as well as text indexed by "delalain".

In this case, I don't know how WordDelimiterFilter will help, as you're likely tokenizing on spaces somewhere, and that input text has a space. I could be wrong. It's probably best if you post your field definition from your schema.

Also, is this a free-text field, or something that's more like a short string?

Thanks,

Michael Della Bitta
Applications Developer
o: +1 646 532 3062

appinions inc.
“The Science of Influence Marketing”
18 East 41st Street
New York, NY 10017
t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Hello, I'm running into a case where a query is not returning the results
> I expect, and I'm hoping someone can offer some explanation that might help
> me fine tune things or understand what's up.
>
> I am running Solr 4.3.
>
> My filter chain includes a WordDelimiterFilter and, later a filter that
> downcases everything for case-insensitive searching. It includes many other
> things too, but I think these are the pertinent facts.
>
> For query "dELALAIN", the WordDelimiterFilter splits into:
>
> text: d
> start: 0
> position: 1
>
> text: ELALAIN
> start: 1
> position: 2
>
> text: dELALAIN
> start: 0
> position: 2
>
> Note the duplication/overlap of the tokens -- one version with "d" and
> "ELALAIN" split into two tokens, and another with just one token.
>
> Later, all the tokens are lowercased by another filter in the chain.
> (actually an ICU filter which is doing something more complicated than just
> lowercasing, but I think we can consider it lowercasing for the purposes of
> this discussion).
>
> If I understand right what the WordDelimiterFilter is trying to do here,
> it's probably doing something special because of the lowercase "d" followed
> by an uppercase letter, a special case for that. (I don't get this behavior
> with other mixed case queries not beginning with 'd').
>
> And, what I think it's trying to do, is match text indexed as "d elalain"
> as well as text indexed by "delalain".
>
> The problem is, it's not accomplishing that -- it is NOT matching text
> that was indexed as "delalain" (one token).
>
> I don't entirely understand what the "position" attribute is for -- but I
> wonder if in this case, the position on "dELALAIN" is really supposed to be
> 1, not 2? Could that be responsible for the bug? Or is position
> irrelevant in this case?
>
> If that's not it, then I'm at a loss as to what may be causing this bug --
> or even if it's a bug at all, or I'm just not understanding intended
> behavior. I expect a query for "dELALAIN" to match text indexed as
> "delalain" (because of the forced lowercasing in the filter chain). But
> it's not doing so. Are my expectations wrong? Bug? Something else?
>
> Thanks for any advice,
>
> Jonathan
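
On the position question in the quoted message, here is a rough Python sketch of one way overlapping tokens could be interpreted. It is a toy model, not Solr's or Lucene's actual query-building code. It assumes that tokens sharing a position are treated as alternatives and that every position must match something in the document. Whether that holds for this setup depends on the query parser and default operator, which have not been posted in this thread. The names QUERY_TOKENS, group_by_position and matches_all_positions are made up for the illustration.

```python
from collections import defaultdict

# Tokens from the quoted message, after lowercasing: (text, position).
QUERY_TOKENS = [("d", 1), ("elalain", 2), ("delalain", 2)]


def group_by_position(tokens):
    """Collect the terms that share each position, in position order."""
    groups = defaultdict(set)
    for text, position in tokens:
        groups[position].add(text)
    return [groups[p] for p in sorted(groups)]


def matches_all_positions(query_tokens, indexed_terms):
    """Toy rule (an assumption, not Solr's code): every position must
    have at least one of its terms present in the document."""
    indexed = set(indexed_terms)
    return all(group & indexed for group in group_by_position(query_tokens))


print(matches_all_positions(QUERY_TOKENS, ["delalain"]))      # False
print(matches_all_positions(QUERY_TOKENS, ["d", "elalain"]))  # True
```

Under this assumed rule, position 1 holds only "d", so a document indexed as the single token "delalain" fails to match. Moving "delalain" to position 1 does not help in the toy model either, because position 2 would then require "elalain". If the real parser treats the clauses as optional instead, the outcome would differ, which is one more reason the field definition and query settings would be useful to see.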