(Resending with ALL Asian characters removed from the example, since they
apparently trip a filter.)
I'm getting phrase queries instead of implicit "OR" queries with Asian
text.  I first noticed it with the Dismax query handler, but it also happens
with the Standard query handler.

I understand that Asian text is broken up into overlapping N-gram pairs
(bigrams).  But after analysis (via the Web UI) the 2-character "words" still
have spaces between them, so I'd expect results similar to those for an
English sentence, which also has spaces.

English: (default field title_en)
User Query: I need help with my iPod
Generates: title_en:i title_en:need title_en:help title_en:with title_en:my
title_en:ipod

Japanese: (default field title_cjk)
User Query: iPodC1C2C3C4C5C6C7...
Generates: PhraseQuery(title_cjk:"ipod C1C2 C2C3 C3C4 C4C5 C5C6 C6C7")

The problem is that the cjk phrase queries are too rigid: every bigram has to
match.  Although setting phrase slop helps with proximity, I don't think you
can tell it not to require 100% of the bigrams to be present.

What I'd like is just: title_cjk:ipod title_cjk:C1C2 title_cjk:C2C3
title_cjk:C3C4 etc...
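
To make the difference concrete, here is a small Python sketch. It is not
Solr or Lucene code, and the helper names (bigrams, phrase_query, or_query)
are made up for illustration. It only builds the two query strings from the
same bigram terms, using C1..C7 as stand-ins for the CJK characters as in the
example above.

```python
def bigrams(chars):
    """Overlapping pairs of adjacent items: [a, b, c] -> [ab, bc]."""
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def phrase_query(field, terms):
    """What I get today: one phrase, so every term must match in order."""
    return '{}:"{}"'.format(field, " ".join(terms))


def or_query(field, terms):
    """What I'd like: one optional clause per term (implicit OR)."""
    return " ".join("{}:{}".format(field, t) for t in terms)


# C1..C7 are placeholders for single CJK characters, not real text.
cjk_chars = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
terms = ["ipod"] + bigrams(cjk_chars)

print(phrase_query("title_cjk", terms))
print(or_query("title_cjk", terms))
```

The first line printed matches the PhraseQuery shown above; the second is the
per-term form I'd expect, by analogy with the English case.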

The only theory I have so far, from looking through the code and mailing list
comments, is that this might have something to do with token offsets.  Though
the start of each token is 1 past the previous one, the tokens do overlap by
1 char each time.  I'm not sure that's it, nor what the logic would be.  If
that's all there is to it, bumping the increments from 1 to 3 or 4 would make
them no longer overlap.
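
The overlap I mean can be sketched like this in Python. Again this is only an
illustration with made-up helper names, using an ASCII string as a stand-in
for a CJK run. It models character start/end offsets only; it does not model
Lucene's position increments, and whether the query parser looks at either of
them is exactly what I'm unsure about.

```python
def bigrams_with_offsets(text):
    """Return (term, start, end) for each overlapping 2-char window of text."""
    return [(text[i:i + 2], i, i + 2) for i in range(len(text) - 1)]


def overlaps(tokens):
    """For each adjacent pair, True if the next token starts before the current one ends."""
    return [nxt[1] < cur[2] for cur, nxt in zip(tokens, tokens[1:])]


tokens = bigrams_with_offsets("abcd")  # ASCII stand-in for a CJK run
for term, start, end in tokens:
    print(term, start, end)

# Each start is 1 past the previous start, so adjacent tokens share 1 char.
print(overlaps(tokens))
```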

Ideally I'd like the cjk queries to be structured the same as the English
ones.  Also it'd be better if this could be done with just schema or config
changes, though I realize that's not as likely.

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
