[ https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867446#comment-13867446 ]
Robert Muir commented on LUCENE-5386:
-------------------------------------

{quote}
TokenStream itself takes care of PositionIncrementAttribute.
{quote}

Well, it doesn't really "take care" of it at all. Any tokenizer or token filter that removes tokens and therefore has the concept of "holes" (e.g. StopFilter, or StandardTokenizer when it drops a too-long token) has to provide that information itself, or it is lost forever.

Let's make a concrete example: pages of a book. Say each page is a value in a multi-valued field. Page 1 ends with "the quick brown fox jumps over the" and page 2 starts with "lazy dog." As you know, the two values are each analyzed independently, but they are essentially concatenated together in IndexWriter, with analyzer.getPositionIncrementGap() [default=0] in between. In this case, a phrase query for that sentence will not work correctly unless the analyzer propagates in end() that the trailing "the" was removed, so that the position increment before "lazy" is increased. Only the component that removed the token (e.g. StopFilter) knows this, so it must provide it in end(). A rough sketch of what that looks like is at the bottom of this mail.

But yeah, I agree with your assessment: I think we want the Tokenizer class to be very simple and very flexible, just "a TokenStream that works on a Reader". On the other hand, it sucks for the simple, typical use cases (the ones that happen 99% of the time) to be so difficult, when honestly most tokenizers only need to worry about offsets and positions in end(). So a compromise that makes easy things easy while still leaving sophisticated things possible is to provide some subclass, even if it is "limited" in the sense that it can't do all the crazy stuff. If it makes the ordinary use cases that happen 99% of the time easier, I think it would be really helpful. It is at least a good step in the right direction that won't hurt anyone.

> Make Tokenizers deliver their final offsets
> -------------------------------------------
>
>            Key: LUCENE-5386
>            URL: https://issues.apache.org/jira/browse/LUCENE-5386
>        Project: Lucene - Core
>     Issue Type: Improvement
>       Reporter: Benson Margulies
>
> Tokenizers _must_ have an implementation of #end() in which they set up the
> final offset. Currently, nothing enforces this. end() has a useful
> implementation in TokenStream, so just making it abstract is not attractive.
> Proposal: add
>
>     abstract int finalOffset();
>
> to Tokenizer, and then make
>
>     public void end() throws IOException {
>       super.end();
>       int fo = finalOffset();
>       offsetAtt.setOffset(fo, fo);
>     }
>
> or something to that effect.
> Other alternatives to be considered depending on how this looks.
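
To make the book-pages example above concrete, here is a rough sketch of the kind of filter that has to report a trailing hole in end(). It follows the same pattern StopFilter/FilteringTokenFilter use; the class name and the hard-coded stopword are made up for illustration only.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Drops the token "the" and reports any trailing hole in end(). */
public final class DropTheFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private int skippedPositions;

  public DropTheFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    skippedPositions = 0;
    while (input.incrementToken()) {
      if (!"the".contentEquals(termAtt)) {
        // fold the positions of any dropped tokens into this token's increment
        posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
        return true;
      }
      skippedPositions += posIncAtt.getPositionIncrement();
    }
    return false;
  }

  @Override
  public void end() throws IOException {
    super.end();
    // the dropped trailing "the" of page 1 ends up here: report it so the
    // position before "lazy" (the first token of page 2) is bumped accordingly
    posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    skippedPositions = 0;
  }
}
{code}

Without the bump in end(), "lazy" would be indexed directly adjacent to "over", and phrase/sloppy queries would match across the page boundary as if the dropped "the" had never been there.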
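
And for the proposal quoted above, a minimal sketch of what such an "easy" subclass could look like, assuming only the existing OffsetAttribute and Tokenizer APIs. The class name is made up, and a real version would also have to think about position increments for trailing holes, per the comment above.

{code:java}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/**
 * Sketch of the proposed "easy" base class: subclasses just report where the
 * final offset is, and end() is taken care of for them.
 */
public abstract class SimpleTokenizer extends Tokenizer {
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  protected SimpleTokenizer(Reader input) {
    super(input);
  }

  /** Offset one past the last character consumed from the reader. */
  protected abstract int finalOffset();

  @Override
  public void end() throws IOException {
    super.end();
    // run the reported offset through correctOffset() so CharFilters are
    // accounted for, the same way existing tokenizers do in end()
    final int fo = correctOffset(finalOffset());
    offsetAtt.setOffset(fo, fo);
  }
}
{code}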