The ICUTokenizer now adds a script attribute to tokens, for example "Tibetan" 
or "Han" (as do StandardTokenizer and a couple of others; see LUCENE-2911). 
If the Shingle filter had some provision to make token n-grams only when the 
script attribute matched a specified script, it would address both the need to 
produce character bigrams for CJK (Han) and the need for syllable bigrams for 
Tibetan. We have already opened an issue to create overlapping bigrams for CJK 
(LUCENE-2906).
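
To make the idea concrete, here is a rough sketch of the behavior I have in 
mind. It is plain Python, not Lucene code: the (text, script) token tuples, 
the function name script_bigrams, and the bigram_scripts parameter are all 
made up for illustration. It also ignores position gaps and sentence 
boundaries, which a real filter would have to handle.

```python
from typing import List, Set, Tuple

# Hypothetical token representation: (term text, script name).
# In Lucene the script would come from a token attribute instead.
Token = Tuple[str, str]


def script_bigrams(tokens: List[Token], bigram_scripts: Set[str]) -> List[str]:
    """Emit overlapping bigrams for runs of tokens whose script is in
    bigram_scripts; pass every other token through unchanged."""
    out: List[str] = []
    i = 0
    n = len(tokens)
    while i < n:
        text, script = tokens[i]
        if script not in bigram_scripts:
            # Script not configured for bigramming: emit the token as-is.
            out.append(text)
            i += 1
            continue
        # Collect the run of consecutive tokens sharing this script.
        j = i
        while j < n and tokens[j][1] == script:
            j += 1
        run = [t for t, _ in tokens[i:j]]
        if len(run) == 1:
            # A lone token has no neighbor to pair with; keep it as a unigram.
            out.append(run[0])
        else:
            # Overlapping bigrams: (t0 t1), (t1 t2), ...
            out.extend(a + b for a, b in zip(run, run[1:]))
        i = j
    return out


tokens = [("the", "Latin"), ("中", "Han"), ("国", "Han"),
          ("人", "Han"), ("book", "Latin")]
print(script_bigrams(tokens, {"Han", "Tibetan"}))
```

With "Han" in the configured set, the three Han tokens become two overlapping 
bigrams while the Latin tokens pass through untouched. Adding "Tibetan" to the 
same set would give syllable bigrams, assuming the tokenizer emits one token 
per syllable.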

Would it make sense to open an issue for modifying the Shingle filter to have 
configurable script-specific behavior, or is this just another use case for 
LUCENE-2906?

If it is another use case for LUCENE-2906, then perhaps we need to change the 
summary of the issue to generalize it beyond CJK.

Any suggestions?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
