Here is the ElisionFilter of Lucene:

https://lucene.apache.org/core/4_8_0/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html

This one only works with apostrophe elisions (' and U+2019), so it may not
apply to Tibetan. But it should give you some ideas.
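
As a rough illustration of the idea (not Lucene's actual source), here is a
minimal sketch in plain Java with no Lucene dependency. It assumes the logic is:
find the first apostrophe in the token and, if the part before it is a known
article, drop that part together with the apostrophe. The class name
ElisionSketch, the method stripElision, and the short ARTICLES list are all
hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ElisionSketch {
    // Hypothetical article list for illustration only; not Lucene's default set.
    public static final Set<String> ARTICLES =
            new HashSet<>(Arrays.asList("l", "d", "j", "qu"));

    // Returns the token without its elided prefix, or the token unchanged
    // when there is no apostrophe or the prefix is not a known article.
    public static String stripElision(String token, Set<String> articles) {
        for (int i = 0; i < token.length(); i++) {
            char ch = token.charAt(i);
            if (ch == '\'' || ch == '\u2019') {
                // Only the first apostrophe is considered.
                String prefix = token.substring(0, i).toLowerCase(Locale.ROOT);
                return articles.contains(prefix) ? token.substring(i + 1) : token;
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripElision("l\u2019arbre", ARTICLES)); // arbre
        System.out.println(stripElision("aujourd'hui", ARTICLES));  // aujourd'hui
    }
}
```

In a real analyzer this logic would live inside a TokenFilter and run after the
Tokenizer. For Tibetan, both the separator test and the set of removable parts
would need to change; the sketch leaves that open.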

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Monday, April 28, 2014 10:36 PM
> To: java-user@lucene.apache.org
> Cc: 'Chris Tomlinson'
> Subject: RE: What is the proper use of stop words in Lucene?
> 
> Hi,
> 
> > > What you intend to do is not a "stopword" use case. You want to "ignore"
> > > some words - Lucene has no support for this, because in native
> > > language processing this makes no sense.
> >
> > Thank you for the information. I was unaware that ignoring some words
> > "makes no sense". I thought I gave a reasonable example of exactly
> > this situation in the native processing of Tibetan. Perhaps I am still
> > not understanding.
> 
> Elisions are a bit different from stopwords (although I don't know how they
> work in Tibetan). The Tokenizer should *not* split elisions from the terms
> (initially the term is the full word including the elision). In most languages
> those are separated by (for example) an apostrophe (e.g. French: le + arbre
> → l’arbre). The Tokenizer would keep those parts together (l’arbre). A later
> TokenFilter would then edit the token and remove the elision (if needed):
> arbre. This is how the French Analyzer in Lucene works.
> 
> Lucene currently does not have a Tibetan Analyzer, so you have to write
> your own (I think this is what you tried to do). You should carefully
> choose the Tokenizer and add something like a TibetanElisionFilter that
> removes the unwanted parts from the tokens.
> 
> Uwe
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

