Hi,

On Apr 28, 2014, at 11:45 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>> Hello Uwe,
>>
>> Thank you for the reply. I see that there is a version check for the use of
>> setEnablePositionIncrements(false); and, I think I may be able to use an
>> earlier api with the eXist-db embedding of Lucene 4.4 to avoid the version
>> check.
>
> Hi,
>
> you don't need an older version of the Lucene library. It is enough to pass
> the constant, also with Lucene 4.7 or 4.8 (release in a moment):
> sf = new StopFilter(Version.LUCENE_43, ...);
> sf.setEnablePositionIncrements(false);
>
> The version constant is exactly to use some components that changed in an
> incompatible way still in later versions, and preserve index/behavior
> compatibility.

Thank you for the explanation.

> About stop words: What you are doing, is not really "stop words". The main
> reason for stop words is the following:
> - Stop words are in almost every document, so it makes no sense to query for
> them.

This was my understanding.

> - The only relevant information behind the stop word is "there was a word at
> this position that"

I didn't realize that this was a "necessary" aspect. I can certainly
understand that it may be relevant in some (most) cases, and it makes sense to
me that it would be appropriate to always preserve the information in
indexing. I was looking for a solution that would essentially work at query
time and had initially thought that
CommonQueryParserConfiguration#setEnablePositionIncrements() was intended to
work this way, but it does not.

> If the second item would not be taken care, this information would get lost,
> too.
>
> If every document really contains a specific stop word (which is almost
> always the case), there must be no difference between a phrase query with
> mentioned stop word, using an index with all stop words indexed and one with
> stop words left out. This can only be done, if the stop word reserves a
> position.
>
> What you intend to do is not a "stopword" use case. You want to "ignore" some
> words - Lucene has no support for this, because in native language processing
> this makes no sense.

Thank you for the information. I was unaware that ignoring some words "makes
no sense". I thought I gave a reasonable example of exactly this situation in
the native processing of Tibetan. Perhaps I am still not understanding.

> One way to do this is to:
> a) write your own TokenFilter, violating the TokenStream contracts
> b) use the Backwards compatibility layer with matchVersion=LUCENE_43
> c) maybe remove the words before tokenizing (e.g. MappingCharFilter, mapping
> the "ignore words" to empty string)

Thank you for these useful approaches to solving the use case.

ciao,
Chris

>
> Uwe
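
P.S. To make the position argument in this thread concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene: the class and method names (PositionIncrementSketch, analyze, phraseMatches) and the sample words are hypothetical, and the model is a simplification of how a filter may or may not leave a position gap for a dropped word. It only illustrates why "stop word" filtering (gap kept) and "ignore word" filtering (gap removed) answer exact phrase queries differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PositionIncrementSketch {

    // Assigns a position to every kept token. With keepGaps == true a dropped
    // word still consumes a position; with keepGaps == false it leaves no trace.
    static Map<Integer, String> analyze(List<String> tokens, Set<String> dropped,
                                        boolean keepGaps) {
        Map<Integer, String> byPosition = new LinkedHashMap<>();
        int next = 0;
        for (String token : tokens) {
            if (dropped.contains(token)) {
                if (keepGaps) {
                    next++; // reserve the position of the dropped word
                }
                continue;
            }
            byPosition.put(next++, token);
        }
        return byPosition;
    }

    // Exact phrase match: document and phrase go through the same analysis, and
    // every phrase term must sit at the same relative offset in the document.
    static boolean phraseMatches(List<String> doc, List<String> phrase,
                                 Set<String> dropped, boolean keepGaps) {
        Map<Integer, String> d = analyze(doc, dropped, keepGaps);
        Map<Integer, String> q = analyze(phrase, dropped, keepGaps);
        if (q.isEmpty()) {
            return false;
        }
        int first = q.keySet().iterator().next(); // insertion order: lowest position
        for (int start : d.keySet()) {
            boolean all = true;
            for (Map.Entry<Integer, String> e : q.entrySet()) {
                String term = d.get(start + e.getKey() - first);
                if (!e.getValue().equals(term)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dropped = new HashSet<>(Arrays.asList("of"));
        List<String> doc = Arrays.asList("stone", "of", "wisdom");
        List<String> shortPhrase = Arrays.asList("stone", "wisdom");

        // Gap kept (stop-word behavior): the short phrase is not adjacent in the doc.
        System.out.println(phraseMatches(doc, shortPhrase, dropped, true));
        // Gap removed (ignore-word behavior): the short phrase becomes adjacent.
        System.out.println(phraseMatches(doc, shortPhrase, dropped, false));
        // Gap kept, phrase includes the dropped word: positions line up again.
        System.out.println(phraseMatches(doc, doc, dropped, true));
    }
}
```

In this toy model the "ignore" behavior I was after corresponds to keepGaps == false, which is roughly what options (b) and (c) above would achieve in a real analysis chain, while keepGaps == true models the default stop-word handling that Uwe describes.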