Hi Shawn,

I may still be missing your point.  Below is an example where the
ICUTokenizer splits the text into unigrams.
Now I'm beginning to wonder whether I really understand what those flags on
the CJKBigramFilter do.
The ICUTokenizer emits unigrams and the CJKBigramFilter puts them back
together into bigrams.

I thought that if you set han=true and hiragana=true, you would get this
kind of result, where the third bigram is composed of one hiragana and one
han character:

いろは革命歌   =>   “いろ” “ろは” “は革” “革命” “命歌”
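
To make that expectation concrete, here is a small Python sketch of the behaviour I had in mind. It is only a toy model of my assumption, not the CJKBigramFilter's actual logic: the names script_of and expected_bigrams are made up for illustration, and scripts are guessed from Unicode character names rather than from the token types the ICUTokenizer assigns.

```python
import unicodedata

# Hypothetical mapping from Unicode character-name prefixes to the script
# flags discussed above. It only mirrors the flag names; it is not how
# Lucene classifies tokens.
SCRIPT_PREFIXES = {
    "HIRAGANA": "hiragana",
    "KATAKANA": "katakana",
    "HANGUL": "hangul",
    "CJK UNIFIED IDEOGRAPH": "han",
}


def script_of(ch):
    """Return the script flag name for one character, or None if unknown."""
    name = unicodedata.name(ch, "")
    for prefix, script in SCRIPT_PREFIXES.items():
        if name.startswith(prefix):
            return script
    return None


def expected_bigrams(text, enabled):
    """Pair each character with its successor when both scripts are enabled."""
    bigrams = []
    for left, right in zip(text, text[1:]):
        if script_of(left) in enabled and script_of(right) in enabled:
            bigrams.append(left + right)
    return bigrams


print(expected_bigrams("いろは革命歌", {"han", "hiragana"}))
# ['いろ', 'ろは', 'は革', '革命', '命歌']
```

Under this toy model, enabling only "han" yields just 革命 and 命歌, because every pair that contains a hiragana character is skipped.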

Hopefully the e-mail hasn't munged the output of the Solr analysis panel
below:

I can see this in our query processing, where outputUnigrams=false:
org.apache.solr.analysis.ICUTokenizerFactory {luceneMatchVersion=LUCENE_36}
Splits into unigrams
term text い ろ は 革 命 歌
org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false,
outputUnigrams=false, katakana=false, han=true, hiragana=true,
luceneMatchVersion=LUCENE_36}
Makes bigrams, including the middle one, which is one hiragana character and
one han character
term text いろ ろは は革 革命 命歌

It appears that if you set outputUnigrams=true (as we both do in the
indexing configuration), this doesn't happen:
org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false,
outputUnigrams=true, katakana=false, han=true, hiragana=true,
luceneMatchVersion=LUCENE_36}
term text い ろ は 革 命 歌 革命 命歌
type <HIRAGANA> <HIRAGANA> <HIRAGANA> <SINGLE> <SINGLE> <SINGLE> <DOUBLE> <DOUBLE>

Not sure what happens for katakana as the ICUTokenizer doesn't convert it
to unigrams and our configuration is set to katakana=false.   I'll play
around on the test machine when I have time.

Tom
