Add flag to CJKBigramFilter to also output unigrams (Single character Han queries)

Tom Burton-West Fri, 03 Aug 2012 15:20:01 -0700

Hello all,

About 10% of our queries that contain Han characters are single-character
queries. It looks like the CJKBigramFilter only outputs single characters
when there are no adjacent bigrammable characters in the input. This means
we have to create a separate field to index Han unigrams in order to
handle single-character queries, and then write application code to search
that separate field whenever we detect a single-character Han query. This
is rather kludgy. As an alternative approach to dealing with
single-character Han queries, would it be possible to add an optional flag
to the CJKBigramFilter to tell it to also output unigrams?


That way, at index time we could set the flag so that both unigrams and
bigrams would be indexed. At query time we would leave the flag unset, so
the current logic (output bigrams, unless a Han character has no
bigrammable neighbor, in which case output it as a unigram) would take
care of queries containing a single Han character.
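
A toy sketch of the proposed behavior, in Python rather than Lucene code.
Everything here is an assumption made for illustration: the names
cjk_tokens, output_unigrams, han_runs and is_han are hypothetical, the Han
check covers only the basic CJK Unified Ideographs block, non-Han text is
simply dropped, and the token order is not meant to match what the real
filter emits.

```python
def is_han(ch):
    # Simplification: only the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def han_runs(text):
    """Yield maximal runs of adjacent Han characters; other text is skipped."""
    run = []
    for ch in text:
        if is_han(ch):
            run.append(ch)
        elif run:
            yield "".join(run)
            run = []
    if run:
        yield "".join(run)


def cjk_tokens(text, output_unigrams=False):
    """Emit bigrams for each Han run.

    A lone Han character is always emitted as a unigram (current behavior).
    With output_unigrams=True (the proposed flag), every Han character is
    also emitted as a unigram.
    """
    tokens = []
    for run in han_runs(text):
        if output_unigrams or len(run) == 1:
            tokens.extend(run)  # one token per character
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


# Index time: flag set, so unigrams and bigrams are both indexed.
index_terms = set(cjk_tokens("中国人", output_unigrams=True))

# Query time: flag unset; a single Han character still comes out as a unigram.
query_terms = cjk_tokens("国")
print(all(term in index_terms for term in query_terms))
```

Without the flag at index time, the same document would yield only the
bigrams 中国 and 国人, so the single-character query 国 would find nothing
in that field.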

This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter.

If this makes sense I'll open a JIRA issue.

Tom Burton-West
