On Fri, Dec 16, 2011 at 5:44 PM, Burton-West, Tom <[email protected]> wrote:
> The ICUTokenizer now adds a script attribute for tokens (as do Standard
> Tokenizer and a couple of others (LUCENE-2911)), for example “Tibetan” or
> “Han”. If the Shingle filter had some provision to only make token n-grams
> when the script attribute matched some specified script, it would solve both
> the need to produce character bigrams for CJK (Han) and syllable bigrams
> for Tibetan. We already opened an issue to create overlapping bigrams for
> CJK (LUCENE-2906).

Not sure it totally would, because there are some important differences
and a few complications:
1. CJKTokenizer today creates bigrams in runs of "cjk" text, where a run
is something like [IHK]+ (a run of ideographic, hiragana, katakana).
There are different variations on this available too, like only
bigramming I+ and doing something else with the katakana (like keeping
it as a word). It seems the verdict from previous studies is that there
are options there, and both tend to work well. But one thing is still
for sure: I think it would be bad here to form bigrams across what was
not contiguous text (e.g. across sentence boundaries). Finally, some CJK
normalization (such as halfwidth/fullwidth conversion) is not a 1:1
replacement, so really the process here should at least be aware of
this and consider some sequences of half-width kana as a single
'character'.
2. Unlike the CJK case, where you bigram a "run", Tibetan separates
syllables with special punctuation (tsheg, among other things). That is
why you get syllables as output from these tokenizers. So this is
already a fundamentally different bigram algorithm, because it's no
longer contiguous runs: instead, syllables often have something in
between, and what that something is tells you whether it's e.g. a
syllable separator or something more like a phrase separator. I suppose
to inhibit stupid bigrams you would *not* shingle across shad as
well... how to generalize that? The verdict for this language
definitely isn't in yet: I've only seen some very initial rough work on
it, and we aren't totally sure this works well on average.
3. Other "complex" languages besides these are also emitting syllables
"at best", too: Thai, Lao, Myanmar, Khmer? Shouldn't we bigram those too?
Except, one implementation (ICUTokenizer) is emitting syllables here
(what type of syllable depends upon the current implementation, too!),
and the other (StandardTokenizer) is emitting whole phrases as words.
It would be great to bigram the former (we think!), but it would be even
more horrible to do it to the latter. I put "we think" here because
there has really been no work done here, so it's just
intuition/guessing. And to make matters worse, we have a filter in
contrib (ThaiWordFilter) that relies upon the specifics of how
StandardTokenizer screws up Thai tokenization so it can 'retokenize'.
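
To make the difference between points 1 and 2 concrete, here is a rough
sketch in plain Python (not Lucene code, and not how CJKTokenizer or
ICUTokenizer are actually implemented). The function names, the code
point ranges standing in for [IHK], and the choice to keep a lone
character as a unigram are all assumptions made for illustration. The
first function bigrams inside contiguous runs only; the second bigrams
syllables across tsheg but never across shad.

```python
TSHEG = "\u0F0B"  # Tibetan intersyllabic mark: separates syllables
SHAD = "\u0F0D"   # Tibetan shad: closer to a phrase/sentence separator

# Rough stand-in for [IHK]: hiragana, katakana, CJK unified ideographs.
# These ranges are an assumption for this sketch, not Lucene's definition.
IHK_RANGES = ((0x3040, 0x309F), (0x30A0, 0x30FF), (0x4E00, 0x9FFF))


def is_ihk(ch):
    return any(lo <= ord(ch) <= hi for lo, hi in IHK_RANGES)


def cjk_bigrams(text):
    """Overlapping character bigrams inside each contiguous [IHK]+ run."""
    out, run = [], []
    for ch in text + "\0":  # the sentinel flushes the final run
        if is_ihk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            out.append(run[0])  # lone character: keep it as a unigram
        out.extend(a + b for a, b in zip(run, run[1:]))
        run = []
    return out


def tibetan_bigrams(text):
    """Syllable bigrams that may span a tsheg but never a shad."""
    out = []
    for phrase in text.split(SHAD):
        syllables = [s.strip() for s in phrase.split(TSHEG) if s.strip()]
        out.extend(zip(syllables, syllables[1:]))
    return out


print(cjk_bigrams("東京都。大阪"))  # ['東京', '京都', '大阪'], nothing across "。"
```

Note what the sketch leaves out: it does no halfwidth/fullwidth
normalization, so a half-width kana sequence is not treated as one
'character', and it knows only one phrase separator for Tibetan. Those
gaps are exactly the complications above.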

>
> Would it make sense to open an issue for modifying the Shingle filter to
> have configurable script-specific behavior, or is this just another use case
> for LUCENE-2906?
>
> If it is another use case for LUCENE-2906, then perhaps we need to change
> the summary of the issue to generalize it beyond CJK.
>
> Any suggestions?
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>



-- 
lucidimagination.com
