Re: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

Robert Muir Fri, 16 Dec 2011 16:53:26 -0800

On Fri, Dec 16, 2011 at 7:32 PM, Burton-West, Tom <[email protected]> wrote:


> Unfortunately, it sounds like the ICUTokenizer will segment on the Tibetan 
> phrase separators but downstream filters won't know that, so we couldn't have 
> a downstream filter that avoided bigramming across a phrase separator. On the 
> other hand it might be that "stupid" overlapping bigrams don't hurt retrieval 
> compared to treating syllables as if they were words i.e. syllable unigrams. 
> ( I've not been able to find much published research in English on the issue, 
> and many of the references are to articles in Chinese language publications.  
> I'm pretty much relying on the article by Hackett and Oard)
>

Yeah, that's the one I was referring to. I think it's a good article, but
the methods there are "rough", so we don't know for sure.

Again, my intuition agrees with it, and the solution you mention
might be good, but my general opinion is that it's not simple to make
this a general thing where you just supply a list of scripts and it
'does its thing'.

Another idea, apart from your solution, would be to add a tailoring for
Tibetan that sets some special attribute indicating 'word-final
syllable'. Then this information is not 'lost' and downstream filters
can do the right thing.
It's not a difficult thing to do for the tokenizer, but we would need
more details: a quick glance at some material on Tibetan punctuation
indicates it's not 'this simple': for some syllables the punctuation
is sometimes omitted. Honestly, I don't know why this is; maybe it
means there are some syllables that only appear in word-final
position? If so, such important clues should also trigger this
attribute. So essentially, before doing anything like that, it would be
best to know 'the rules of the game' before thinking about any design.
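
To make the attribute idea concrete, here is a rough, language-neutral
sketch in Python. It is not Lucene or ICUTokenizer code, and every name in
it (Syllable, final, tokenize, bigrams) is hypothetical. It assumes the
simplest possible rule: the tsheg (U+0F0B) separates syllables and the shad
(U+0F0D) marks the boundary that bigrams should not cross. As noted above,
the real rules are probably not this simple, so treat it only as an
illustration of carrying the boundary downstream.

```python
from dataclasses import dataclass
from typing import List

TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG
SHAD = "\u0f0d"   # TIBETAN MARK SHAD


@dataclass
class Syllable:
    text: str
    final: bool = False  # hypothetical "boundary follows this syllable" flag


def tokenize(text: str) -> List[Syllable]:
    """Split on tsheg/shad/whitespace; flag the syllable before a shad."""
    tokens: List[Syllable] = []
    buf: List[str] = []

    def flush(final: bool) -> None:
        if buf:
            tokens.append(Syllable("".join(buf), final))
            buf.clear()
        elif final and tokens:
            # Shad directly after a tsheg: flag the syllable already emitted.
            tokens[-1].final = True

    for ch in text:
        if ch == SHAD:
            flush(True)
        elif ch == TSHEG or ch.isspace():
            flush(False)
        else:
            buf.append(ch)
    flush(False)
    return tokens


def bigrams(tokens: List[Syllable]) -> List[str]:
    """Overlapping syllable bigrams that never span a flagged boundary."""
    out: List[str] = []
    for left, right in zip(tokens, tokens[1:]):
        if left.final:
            continue  # boundary between left and right: no bigram here
        out.append(left.text + right.text)
    return out
```

With placeholder syllables a..e and a shad after c, bigrams(tokenize(...))
gives ab, bc, de and skips cd. A syllable that stands alone between two
boundaries produces no bigram in this sketch; a real filter would have to
decide whether to emit it as a unigram.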


-- 
lucidimagination.com

