Re: What is the proper use of stop words in Lucene?

Chris Tomlinson Mon, 28 Apr 2014 15:02:38 -0700

On Apr 28, 2014, at 3:36 PM, Uwe Schindler <u...@thetaphi.de> wrote:


> Hi,
> 
>>> What you intend to do is not a "stopword" use case. You want to "ignore"
>>> some words - Lucene has no support for this, because in native language
>>> processing this makes no sense.
>> 
>> Thank you for the information. I was unaware that ignoring some words
>> "makes no sense". I thought I gave a reasonable example of exactly this
>> situation in the native processing of Tibetan. Perhaps I am still not
>> understanding.
> 
> Elisions are a bit different from stopwords (although I don't know about them
> in the Tibetan language). The Tokenizer should *not* split elisions from the terms
> (initially the term is the full word including the elision). In most 
> languages those are separated by (for example) an apostrophe (e.g. French: le 
> + arbre → l’arbre). The Tokenizer would keep those parts together (l’arbre). 
> A later TokenFilter would then edit the token and remove the elision (if 
> needed): arbre. This is how the French Analyzer in Lucene works.
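
For reference, the French case described above can be sketched in plain Java. This is only an illustration of the idea (keep the token whole, then strip the elided article afterwards); it does not use Lucene's classes, and the class name, method name and article list are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

public class ElisionSketch {
    // Hypothetical article list for illustration; not Lucene's actual list.
    private static final Set<String> ARTICLES = Set.of("l", "d", "j", "qu");

    // If the token looks like article + apostrophe + rest, return only the rest.
    static String stripElision(String token) {
        int idx = -1;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            // Accept both the ASCII apostrophe and the typographic one.
            if (c == '\'' || c == '\u2019') {
                idx = i;
                break;
            }
        }
        boolean hasPrefixAndRest = idx > 0 && idx < token.length() - 1;
        if (hasPrefixAndRest
                && ARTICLES.contains(token.substring(0, idx).toLowerCase(Locale.ROOT))) {
            return token.substring(idx + 1);
        }
        // No recognised elision: leave the token unchanged.
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripElision("l\u2019arbre")); // prints "arbre"
        System.out.println(stripElision("arbre"));        // prints "arbre"
    }
}
```

This only works because the apostrophe marks where the elision ends, which is exactly what Tibetan lacks.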

Tibetan has no markers for elisions, and elisions can be quite idiosyncratic to
a school or tradition. It seems that the most flexible approach is to ignore
stop words and their positions, however incorrect that may be; otherwise, one
gets into ever more complex analysis that may not yield cost-effective results.
We'll work with the pre-4.4 setEnablePositionIncrements(false) approach for now.
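
To make concrete what "ignoring stop words and their positions" means, here is a minimal plain-Java sketch. It does not call Lucene; it only models my understanding that with position increments enabled a removed stop word leaves a gap in the position sequence, and with them disabled the remaining tokens become adjacent. The class name, the keepGaps flag and the placeholder tokens "s1", "x", "s2" are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopPositionSketch {
    // Returns "term@position" for every token that is not a stop word.
    // keepGaps = true  : a removed stop word still consumes a position.
    // keepGaps = false : removed stop words leave no trace in the positions.
    static List<String> analyze(List<String> tokens, Set<String> stopWords, boolean keepGaps) {
        List<String> out = new ArrayList<>();
        int position = -1;   // position of the last emitted token
        int skipped = 0;     // stop words removed since the last emitted token
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            position += 1 + (keepGaps ? skipped : 0);
            skipped = 0;
            out.add(token + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("s1", "x", "s2");
        Set<String> stops = Set.of("x");
        System.out.println(analyze(tokens, stops, true));  // prints [s1@0, s2@2]
        System.out.println(analyze(tokens, stops, false)); // prints [s1@0, s2@1]
    }
}
```

In the second case an exact phrase search for "s1 s2" can match text that contained the ignored syllable between them, which is the behaviour we are after.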

Tibetan has no sentence or phrase markers per se, and no word boundary
markers. Essentially, Tibetan is a sequence of syllables with occasional
markers indicating some break in a thought or the like.


> Lucene currently does not have a Tibetan Analyzer, so you have to make your
> own one (I think this is what you tried to do). You should carefully choose
> the Tokenizer and add something like a TibetanElisionFilter that removes the
> unwanted parts from the tokens.

We have developed a pair of analyzers, with associated filters and tokenizers,
for both Tibetan Unicode and the Extended Wylie transliteration system. They
handle the myriad punctuation characters, stemming and so on. If there is
interest, we will be happy to donate this work to Apache Lucene.

Thank you,
Chris

> 
> Uwe
> 


