Hi Robert,

Thanks for the quick and thoughtful response.
I didn't realize these complexities and thought maybe there was an easy solution :)

We may be involved in a project that involves Tibetan text, and given our current resources and priorities, we would stick it in the same field as the other 400+ languages. I was hoping that with the script attribute output by the ICUTokenizer, we could figure out a way to do script/language-specific processing for Tibetan without adversely affecting anything else.

>> I suppose to inhibit stupid bigrams you would *not* shingle across shad as
>> well

Unfortunately, it sounds like the ICUTokenizer will segment on the Tibetan phrase separators but downstream filters won't know that, so we couldn't have a downstream filter that avoids bigramming across a phrase separator.

On the other hand, it might be that "stupid" overlapping bigrams don't hurt retrieval compared to treating syllables as if they were words, i.e. syllable unigrams. (I've not been able to find much published research in English on the issue, and many of the references are to articles in Chinese-language publications. I'm pretty much relying on the article by Hackett and Oard, cited below.)
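Just to make concrete the kind of thing I was hoping for, here is a rough sketch of a TokenFilter that reads the ScriptAttribute from the ICU module and glues adjacent Tibetan syllables into overlapping bigrams, passing everything else through untouched. This is only a sketch, not working code: the class name is made up, offsets and position increments are glossed over, and as you point out it has no way to tell whether a tsheg or a shad originally sat between two syllables, since the tokenizer has already discarded that character.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.AttributeSource;

    import com.ibm.icu.lang.UScript;

    /** Sketch: syllable bigrams for Tibetan runs; all other tokens pass through. */
    public final class TibetanBigramFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);

      private String held;                 // last Tibetan syllable seen in the current run
      private boolean heldEmitted;         // has 'held' appeared in any output token yet?
      private AttributeSource.State pending; // buffered non-Tibetan token to replay

      public TibetanBigramFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pending != null) {             // replay the token we buffered last call
          restoreState(pending);
          pending = null;
          return true;
        }
        while (input.incrementToken()) {
          if (scriptAtt.getCode() == UScript.TIBETAN) {
            String current = termAtt.toString();
            if (held != null) {            // second or later syllable of a run: emit a bigram
              termAtt.setEmpty().append(held).append(current);
              held = current;
              heldEmitted = true;          // 'current' just appeared as the bigram's 2nd half
              return true;
            }
            held = current;                // first syllable of a run: hold it and read ahead
            heldEmitted = false;
          } else {
            if (held != null && !heldEmitted) {
              // a run of exactly one syllable: emit it alone, replay this token next call
              // (the other attributes still describe the buffered token; a real
              // filter would also fix up offsets, script, etc.)
              pending = captureState();
              termAtt.setEmpty().append(held);
              held = null;
              return true;
            }
            held = null;                   // run is over: never bigram across the boundary
            return true;
          }
        }
        if (held != null && !heldEmitted) { // stream ended on a one-syllable run
          termAtt.setEmpty().append(held);
          held = null;
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        held = null;
        heldEmitted = false;
        pending = null;
      }
    }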
Tom

Hackett, P. G., & Oard, D. W. (2000). Comparison of word-based and syllable-based retrieval for Tibetan (poster session). In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL '00), Hong Kong, China, pp. 197-198. doi:10.1145/355214.355242
http://dl.acm.org/citation.cfm?doid=355214.355242

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, December 16, 2011 6:45 PM
To: dev@lucene.apache.org
Subject: Re: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

On Fri, Dec 16, 2011 at 5:44 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> The ICUTokenizer now adds a script attribute for tokens (as do
> StandardTokenizer and a couple of others; see LUCENE-2911), for example
> “Tibetan” or “Han”. If the Shingle filter had some provision to only make
> token n-grams when the script attribute matched some specified script, it
> would solve both the need to produce character bigrams for CJK (Han) and
> syllable bigrams for Tibetan. We already opened an issue to create
> overlapping bigrams for CJK (LUCENE-2906).

Not sure it totally would, because there are some important differences and a few complications:

1. CJKTokenizer today creates bigrams over runs of "cjk" text, where a run is something like [IHK]+ (a run of ideographic, hiragana, and katakana). Different variations on this are available too, like bigramming only I+ and doing something else with the katakana (like keeping each run as a word). The verdict from previous studies seems to be that both options work well. But one thing is certain: I think it would be bad here to form bigrams across text that was not contiguous (e.g. across sentence boundaries). Finally, some CJK normalization (such as halfwidth/fullwidth conversion) is not a 1:1 replacement, so the process here should at least be aware of that and treat some sequences of half-width kana as a single 'character'.

2. Unlike the CJK case, where you bigram a "run", Tibetan separates syllables with special punctuation (the tsheg, among other things); that punctuation is the reason these tokenizers output syllables in the first place. So this is already a fundamentally different bigram algorithm, because it's no longer about contiguous runs: the syllables usually have something in between, and what that something is tells you whether it's e.g. a syllable separator or something more like a phrase separator. I suppose to inhibit stupid bigrams you would *not* shingle across a shad as well... but how do you generalize that? The verdict for this language definitely isn't in yet; I've only seen some very initial rough work on it, and we aren't totally sure this works well on average.

3. Other "complex" languages besides these also come out of the tokenizers as syllables at best: Thai, Lao, Myanmar, Khmer? Shouldn't we bigram those too? Except that one implementation (ICUTokenizer) emits syllables here (and what kind of syllable depends on the current implementation, too!), while the other (StandardTokenizer) emits whole phrases as words. It would be great to bigram the former (we think!), but even more horrible to do it to the latter. I say "we think" because there has really been no work done here, so it's just intuition/guessing. And to make matters worse, we have a filter in contrib (ThaiWordFilter) that relies upon the specifics of how StandardTokenizer screws up Thai tokenization so it can 'retokenize'.

> Would it make sense to open an issue for modifying the Shingle filter to
> have configurable script-specific behavior, or is this just another use
> case for LUCENE-2906?
>
> If it is another use case for LUCENE-2906, then perhaps we need to change
> the summary of the issue to generalize it beyond CJK.
>
> Any suggestions?
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search

--
lucidimagination.com