Re: What is the proper use of stop words in Lucene?

Chris Tomlinson Mon, 28 Apr 2014 12:47:31 -0700

Hi,

On Apr 28, 2014, at 11:45 AM, Uwe Schindler <u...@thetaphi.de> wrote:


>> Hello Uwe,
>> 
>> Thank you for the reply. I see that there is a version check for the use of
>> setEnablePositionIncrements(false); and, I think I may be able to use an
>> earlier api with the eXist-db embedding of Lucene 4.4 to avoid the version
>> check.
> 
> Hi,
> 
> you don't need an older version of the Lucene library. It is enough to pass 
> the older version constant, which also works with Lucene 4.7 or 4.8 (to be
> released in a moment):
> sf = new StopFilter(Version.LUCENE_43, ...);
> sf.setEnablePositionIncrements(false);
> 
> The version constant exists exactly so that components that changed in an 
> incompatible way can still be used in later versions, preserving 
> index/behavior compatibility.

Thank you for the explanation.


> About stop words: What you are doing is not really "stop words". The main 
> reason for stop words is the following:
> - Stop words are in almost every document, so it makes no sense to query for 
> them.

This was my understanding.


> - The only relevant information behind the stop word is "there was a word at 
> this position that"

I didn't realize that this was a "necessary" aspect. I can certainly understand 
that it may be relevant in some (most) cases and it makes sense to me that it 
would be appropriate to always preserve the information in indexing. I was looking 
for a solution that would essentially work at query time and had initially 
thought that the CommonQueryParserConfiguration#setEnablePositionIncrements() 
was intended to work this way but it does not.


> If the second item were not taken care of, this information would get lost, 
> too.
> 
> If every document really contains a specific stop word (which is almost 
> always the case), there must be no difference between a phrase query 
> containing that stop word run against an index with all stop words indexed 
> and one run against an index with the stop words left out. This is only 
> possible if the stop word reserves a position.
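
To make the point about reserved positions concrete, here is a minimal
plain-Java sketch. It is not Lucene code: the class StopWordPositionSketch,
its methods, and the sample text are all hypothetical, and the whitespace
tokenizer is only an assumption for illustration. It models what happens to a
phrase query when dropped stop words either keep their position (a gap) or
have their position collapsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of token positions. Not Lucene code; all names are hypothetical.
class StopWordPositionSketch {

    // A term together with the position it would be indexed at.
    static final class Tok {
        final String term;
        final int position;
        Tok(String term, int position) { this.term = term; this.position = position; }
    }

    // Splits on whitespace and drops stop words. With keepGaps=true a dropped
    // word still consumes a position; with keepGaps=false positions are compacted.
    static List<Tok> analyze(String text, Set<String> stopWords, boolean keepGaps) {
        List<Tok> tokens = new ArrayList<>();
        int position = 0;
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (stopWords.contains(word)) {
                if (keepGaps) position++; // the dropped word reserves its position
                continue;
            }
            tokens.add(new Tok(word, position++));
        }
        return tokens;
    }

    static boolean hasTermAt(List<Tok> doc, String term, int position) {
        for (Tok t : doc) {
            if (t.position == position && t.term.equals(term)) return true;
        }
        return false;
    }

    // Exact phrase match: every query token must occur in the document at the
    // same offset relative to the first query token.
    static boolean phraseMatches(List<Tok> doc, List<Tok> query) {
        if (query.isEmpty()) return false;
        Tok first = query.get(0);
        for (Tok start : doc) {
            if (!start.term.equals(first.term)) continue;
            int shift = start.position - first.position;
            boolean all = true;
            for (Tok q : query) {
                if (!hasTermAt(doc, q.term, q.position + shift)) { all = false; break; }
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("in", "the"));
        for (boolean keepGaps : new boolean[] {true, false}) {
            List<Tok> query = analyze("cat in the hat", stop, keepGaps);
            System.out.println("keepGaps=" + keepGaps
                + " matches 'cat in the hat': "
                + phraseMatches(analyze("cat in the hat", stop, keepGaps), query)
                + ", matches 'cat hat': "
                + phraseMatches(analyze("cat hat", stop, keepGaps), query));
        }
    }
}
```

In this model, with gaps kept the phrase "cat in the hat" matches only the
document that really had two words between "cat" and "hat", which is the same
result an index containing the stop words would give. With gaps collapsed it
also matches "cat hat", which is the "ignore these words" behavior rather than
stop-word behavior.
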
> 
> What you intend to do is not a "stopword" use case. You want to "ignore" some 
> words - Lucene has no support for this, because in native language processing 
> this makes no sense.

Thank you for the information. I was unaware that ignoring some words "makes no 
sense". I thought I gave a reasonable example of exactly this situation in the 
native processing of Tibetan. Perhaps I am still not understanding.


> One way to do this is to:
> a) write your own TokenFilter, violating the TokenStream contracts
> b) use the Backwards compatibility layer with matchVersion=LUCENE_43
> c) maybe remove the words before tokenizing (e.g. MappingCharFilter, mapping 
> the "ignore words" to empty string)

Thank you for these useful approaches to solving the use case.
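
For the record, option (c) can be sketched in plain Java without any Lucene
classes. This is only a model of the idea, not MappingCharFilter itself: the
class name, the method names, the whitespace tokenizer, and the English sample
words are all assumptions made for illustration. The ignore words are mapped
to the empty string in the raw text before tokenizing, so the remaining tokens
end up at consecutive positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "remove the words before tokenizing".
// Not Lucene code; all names here are hypothetical.
class IgnoreWordsBeforeTokenizingSketch {

    // Replaces each ignore sequence (matched as a literal substring) with "".
    static String stripIgnoreWords(String text, List<String> ignoreSequences) {
        String result = text;
        for (String sequence : ignoreSequences) {
            result = result.replace(sequence, "");
        }
        return result;
    }

    // Whitespace tokenizer; a token's list index serves as its position.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The trailing space is part of each sequence so that the word
        // separator is removed together with the word.
        List<String> ignore = Arrays.asList("in ", "the ");
        System.out.println(tokenize(stripIgnoreWords("cat in the hat", ignore)));
        System.out.println(tokenize(stripIgnoreWords("cat hat", ignore)));
    }
}
```

Both inputs yield the same token list, so a phrase query and a document
processed this way would line up regardless of the ignored words. The caveat
is that a literal substring mapping knows nothing about word boundaries: in
this sketch "the " would also be stripped from the end of "bathe ", so the
mapped sequences need to be chosen with care for the script in question.
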

ciao,
Chris



> 
> Uwe
> 


