The ICUTokenizer now adds a script attribute to tokens, for example "Tibetan" or "Han", as do StandardTokenizer and a couple of others (LUCENE-2911). If the Shingle filter had some provision to make token n-grams only when the script attribute matched a specified script, it would address both the need to produce character bigrams for CJK (Han) and the need to produce syllable bigrams for Tibetan. We have already opened an issue to create overlapping bigrams for CJK (LUCENE-2906).
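To make the idea concrete, here is a rough sketch of the behavior I have in mind, written in plain Python rather than against the Lucene API. Everything in it is hypothetical: `script_bigrams`, the `(text, script)` tuples standing in for a token plus its script attribute, and the choice to emit a lone token as a unigram are my assumptions for illustration, not anything the Shingle filter does today.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical stand-in for a token and its script attribute: (term text, script label).
Token = Tuple[str, str]


def script_bigrams(tokens: Iterable[Token], bigram_scripts: Set[str]) -> List[str]:
    """Emit overlapping bigrams for runs of tokens whose script is in
    bigram_scripts; pass every other token through unchanged."""
    out: List[str] = []
    run: List[str] = []          # current run of adjacent same-script tokens
    run_script: Optional[str] = None

    def flush() -> None:
        if len(run) == 1:
            # Assumption: a lone token in a bigram script is emitted as a unigram.
            out.append(run[0])
        else:
            # Overlapping bigrams: (t0 t1), (t1 t2), ...
            out.extend(a + b for a, b in zip(run, run[1:]))
        run.clear()

    for text, script in tokens:
        if script in bigram_scripts:
            # A change of script ends the current run, so bigrams never mix scripts.
            if run and script != run_script:
                flush()
            run.append(text)
            run_script = script
        else:
            if run:
                flush()
            out.append(text)
    if run:
        flush()
    return out


if __name__ == "__main__":
    sample = [("search", "Latin"), ("中", "Han"), ("文", "Han"), ("字", "Han")]
    print(script_bigrams(sample, {"Han"}))
```

Passing `{"Han", "Tibetan"}` as the second argument would apply the same treatment to Tibetan syllables. Whether the joined terms need a separator, and what to do with lone tokens, are open design questions.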
Would it make sense to open an issue for modifying the Shingle filter to have configurable script-specific behavior, or is this just another use case for LUCENE-2906? If it is another use case for LUCENE-2906, then perhaps we need to change the summary of that issue to generalize it beyond CJK. Any suggestions?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search