Hi Robert,

Thanks for the quick and thoughtful response.
I didn't realize these complexities and thought maybe there was an easy solution :)

We may be involved in a project that involves Tibetan text, and given our current resources and priorities, we would stick it in the same field as the other 400+ languages. I was hoping that with the script attribute output by the ICUTokenizer, we could figure out a way to do script/language-specific processing for Tibetan without adversely affecting anything else.

>> I suppose to inhibit stupid bigrams you would *not* shingle across shad as
>> well

Unfortunately, it sounds like the ICUTokenizer will segment on the Tibetan phrase separators but downstream filters won't know that, so we couldn't have a downstream filter that avoids bigramming across a phrase separator.

On the other hand, it might be that "stupid" overlapping bigrams don't hurt retrieval compared to treating syllables as if they were words, i.e. syllable unigrams. (I've not been able to find much published research in English on the issue, and many of the references are to articles in Chinese-language publications. I'm pretty much relying on the article by Hackett and Oard, cited below.)
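Just to make concrete the kind of thing I was hoping for, here is a rough sketch of a TokenFilter that reads the ScriptAttribute from the ICU module and glues adjacent Tibetan syllables into overlapping bigrams, passing everything else through untouched. This is only a sketch, not working code: the class name is made up, offsets and position increments are glossed over, and as you point out it has no way to tell whether a tsheg or a shad originally sat between two syllables, since the tokenizer has already discarded that character.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.AttributeSource;

    import com.ibm.icu.lang.UScript;

    /** Sketch: syllable bigrams for Tibetan runs; all other tokens pass through. */
    public final class TibetanBigramFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);

      private String held;                 // last Tibetan syllable seen in the current run
      private boolean heldEmitted;         // has 'held' appeared in any output token yet?
      private AttributeSource.State pending; // buffered non-Tibetan token to replay

      public TibetanBigramFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pending != null) {             // replay the token we buffered last call
          restoreState(pending);
          pending = null;
          return true;
        }
        while (input.incrementToken()) {
          if (scriptAtt.getCode() == UScript.TIBETAN) {
            String current = termAtt.toString();
            if (held != null) {            // second or later syllable of a run: emit a bigram
              termAtt.setEmpty().append(held).append(current);
              held = current;
              heldEmitted = true;          // 'current' just appeared as the bigram's 2nd half
              return true;
            }
            held = current;                // first syllable of a run: hold it and read ahead
            heldEmitted = false;
          } else {
            if (held != null && !heldEmitted) {
              // a run of exactly one syllable: emit it alone, replay this token next call
              // (the other attributes still describe the buffered token; a real
              // filter would also fix up offsets, script, etc.)
              pending = captureState();
              termAtt.setEmpty().append(held);
              held = null;
              return true;
            }
            held = null;                   // run is over: never bigram across the boundary
            return true;
          }
        }
        if (held != null && !heldEmitted) { // stream ended on a one-syllable run
          termAtt.setEmpty().append(held);
          held = null;
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        held = null;
        heldEmitted = false;
        pending = null;
      }
    }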
Tom

Hackett, P. G., & Oard, D. W. (2000). Comparison of word-based and syllable-based retrieval for Tibetan (poster session). In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL '00), Hong Kong, China, pp. 197-198. doi:10.1145/355214.355242
http://dl.acm.org/citation.cfm?doid=355214.355242

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, December 16, 2011 6:45 PM
To: dev@lucene.apache.org
Subject: Re: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

On Fri, Dec 16, 2011 at 5:44 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> The ICUTokenizer now adds a script attribute for tokens (as do
> StandardTokenizer and a couple of others; see LUCENE-2911), for example
> “Tibetan” or “Han”. If the Shingle filter had some provision to only make
> token n-grams when the script attribute matched some specified script, it
> would solve both the need to produce character bigrams for CJK (Han) and
> syllable bigrams for Tibetan. We already opened an issue to create
> overlapping bigrams for CJK (LUCENE-2906).

Not sure it totally would, because there are some important differences and a few complications:

1. CJKTokenizer today creates bigrams over runs of "cjk" text, where a run is something like [IHK]+ (a run of ideographic, hiragana, and katakana). Different variations on this are available too, like bigramming only I+ and doing something else with the katakana (like keeping each run as a word). The verdict from previous studies seems to be that both options work well. But one thing is certain: I think it would be bad here to form bigrams across text that was not contiguous (e.g. across sentence boundaries). Finally, some CJK normalization (such as halfwidth/fullwidth conversion) is not a 1:1 replacement, so the process here should at least be aware of that and treat some sequences of half-width kana as a single 'character'.

2. Unlike the CJK case, where you bigram a "run", Tibetan separates syllables with special punctuation (the tsheg, among other things); that punctuation is the reason these tokenizers output syllables in the first place. So this is already a fundamentally different bigram algorithm, because it's no longer about contiguous runs: the syllables usually have something in between, and what that something is tells you whether it's e.g. a syllable separator or something more like a phrase separator. I suppose to inhibit stupid bigrams you would *not* shingle across a shad as well... but how do you generalize that? The verdict for this language definitely isn't in yet; I've only seen some very initial rough work on it, and we aren't totally sure this works well on average.

3. Other "complex" languages besides these also come out of the tokenizers as syllables at best: Thai, Lao, Myanmar, Khmer? Shouldn't we bigram those too? Except that one implementation (ICUTokenizer) emits syllables here (and what kind of syllable depends on the current implementation, too!), while the other (StandardTokenizer) emits whole phrases as words. It would be great to bigram the former (we think!), but even more horrible to do it to the latter. I say "we think" because there has really been no work done here, so it's just intuition/guessing. And to make matters worse, we have a filter in contrib (ThaiWordFilter) that relies upon the specifics of how StandardTokenizer screws up Thai tokenization so it can 'retokenize'.

> Would it make sense to open an issue for modifying the Shingle filter to
> have configurable script-specific behavior, or is this just another use
> case for LUCENE-2906?
>
> If it is another use case for LUCENE-2906, then perhaps we need to change
> the summary of the issue to generalize it beyond CJK.
>
> Any suggestions?
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search

--
lucidimagination.com