The analysis chain on some of my Solr fieldType entries includes CJKBigramFilterFactory on both the index and query sides. I had outputUnigrams enabled on the index side but disabled on the query side, and this caused a problem with phrase queries. The screenshot linked here shows a subset of the index analysis for three terms, which appear separated by spaces in the ICUNF step. One word has been replaced with 'redacted' ... it's in Latin script and there's nothing unusual about it:
https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png

Note that in the CJKBF step, the second unigram is output at position 2, pushing the English terms to positions 3 and 4.

Imagine that the customer is doing a phrase search. What ends up getting sent to Solr is a filter query like this:

field:"綾瀬 haruka"

The query analysis on this, which doesn't output unigrams, puts "haruka" at position 2. As already shown, the index analysis puts "haruka" at position 3. The query doesn't match, because it's a phrase query with no phrase slop.

I would have expected both unigrams to be at position 1. To me, it's a bug ... or at least something that I should be able to configure on the filter.

If this gets sent via the main query (edismax), it all works, because I have phrase slop enabled by default.

The customer does not like what happens when the index and query analyzers match, either with or without outputUnigrams. When outputUnigrams is disabled on both sides, searching for a single character doesn't match multi-character strings, and when it is enabled on both, they get matches they did not want.

I've already been pointed at an awesome blog series, which will hopefully help me improve things, but I think the customer will still want outputUnigrams disabled on the query side, so I still have this problem.

http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

If I file an issue, should it be bug or improvement?

Thanks,
Shawn
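
For anyone who wants to see the position arithmetic spelled out, here is a rough sketch in Python. It is only a simplified model, not Lucene's actual phrase-matching code. The token positions are the ones described in this message, phrase_matches is a hypothetical helper written for this illustration, and the slop value of 1 is an assumption (the message does not say which value edismax is configured with).

```python
from itertools import product

# Index-side positions as described in this message (1-based).
# "redacted" stands in for the word that was replaced.
index_tokens = [
    ("綾瀬", 1),       # bigram
    ("綾", 1),         # first unigram, stacked on the bigram
    ("瀬", 2),         # second unigram, advanced to position 2
    ("haruka", 3),
    ("redacted", 4),
]

# What I expected instead: both unigrams stacked at position 1.
expected_tokens = [
    ("綾瀬", 1), ("綾", 1), ("瀬", 1), ("haruka", 2), ("redacted", 3),
]

# Query side with outputUnigrams disabled: bigram only.
query_tokens = [("綾瀬", 1), ("haruka", 2)]


def phrase_matches(indexed, query, slop=0):
    """Simplified phrase check (hypothetical helper, not Lucene code).

    For every way of picking one indexed position per query term, compute
    how far each term is shifted relative to its query position. The phrase
    matches if the shifts differ by no more than `slop`.
    """
    positions = {}
    for term, pos in indexed:
        positions.setdefault(term, []).append(pos)

    candidates = []
    for term, _ in query:
        if term not in positions:
            return False
        candidates.append(positions[term])

    for combo in product(*candidates):
        shifts = [ipos - qpos for ipos, (_, qpos) in zip(combo, query)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


exact_on_actual = phrase_matches(index_tokens, query_tokens, slop=0)
sloppy_on_actual = phrase_matches(index_tokens, query_tokens, slop=1)
exact_on_expected = phrase_matches(expected_tokens, query_tokens, slop=0)
```

With the positions the filter actually produces, the exact phrase fails and only a sloppy phrase succeeds. With both unigrams at position 1, the exact phrase would succeed.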