[ https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863584#comment-13863584 ]
Robert Muir commented on LUCENE-5386:
-------------------------------------

I don't have a clear idea how it can relate, just some vague thoughts at the moment.

The other attribute that is most commonly manipulated is the position increment. Take a look at StandardTokenizer.end() for an example (when it removes a too-long token), and FilteringTokenFilter.end() (the base class for StopFilter & co).

{code}
@Override
public final void end() throws IOException {
  super.end();
  // set final offset
  int finalOffset = correctOffset(scanner.yychar() + scanner.yylength());
  offsetAtt.setOffset(finalOffset, finalOffset);
  // adjust any skipped tokens
  posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement()+skippedPositions);
}
{code}

Otherwise skipped tokens at the end of a multi-valued field get lost forever.

As for other attributes, I guess it depends on what they do (e.g. they could be some custom one where some end() logic makes sense too). But it would be good if we could make the basics easier, while still allowing crazy custom classes to do whatever they need.

As for how to tie this in and make it simpler? I'm not sure. One idea would be a TokenizerBase that takes care of these two things, or maybe even some other things (e.g. it provides a skipToken() method for subclasses to call and implements the final position adjustments, and it wraps the reader and implements the final offset). It sucks to add a new class, but it's one idea.

Making a tokenizer is really hard. One that is easier to subclass but can't do 100% of the crazy possibilities could be a nice balance: experts could still subclass Tokenizer directly.

> Make Tokenizers deliver their final offsets
> -------------------------------------------
>
>                 Key: LUCENE-5386
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5386
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Benson Margulies
>
> Tokenizers _must_ have an implementation of #end() in which they set up the
> final offset. Currently, nothing enforces this.
> end() has a useful implementation in TokenStream, so just making it abstract
> is not attractive.
> Proposal: add
>   abstract int finalOffset();
> to tokenizer, and then make
>   void end() {
>     super.end();
>     int fo = finalOffset();
>     offsetAttr.setOffsets(fo, fo);
>   }
> or something to that effect.
> Other alternative to be considered depending on how this looks.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
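
A rough, self-contained sketch of the TokenizerBase idea floated in the comment, written in plain Java with no Lucene on the classpath. Everything in it is hypothetical: SketchTokenizerBase, nextToken(), WhitespaceSketchTokenizer and the plain int fields standing in for the offset and position-increment attributes are illustrative names, not Lucene API. Only skipToken() and finalOffset() are names taken from the thread. It also assumes the end-of-stream position increment starts at 0 before the skipped positions are added, as the quoted end() snippet suggests.

```java
/** Hypothetical base class; NOT part of Lucene. Plain fields stand in for attributes. */
abstract class SketchTokenizerBase {
    private String term;
    private int startOffset;
    private int endOffset;
    private int positionIncrement = 1;
    private int skippedPositions = 0;

    /** Subclass hook: find the next kept token via emit(), or return false at end of input. */
    protected abstract boolean nextToken();

    /** Subclass hook: the offset just past the last character consumed. */
    protected abstract int finalOffset();

    /** Subclasses call this for every token they drop (e.g. a too-long token). */
    protected final void skipToken() {
        skippedPositions++;
    }

    /** Subclasses call this to publish the token they decided to keep. */
    protected final void emit(String text, int start, int end) {
        term = text;
        startOffset = start;
        endOffset = end;
    }

    /** Final, so subclasses cannot forget the position bookkeeping. */
    public final boolean incrementToken() {
        skippedPositions = 0;
        if (!nextToken()) {
            return false; // trailing skips stay recorded for end()
        }
        positionIncrement = 1 + skippedPositions;
        return true;
    }

    /** Final, so the final offset and trailing skipped positions are always set. */
    public final void end() {
        int fo = finalOffset();
        startOffset = fo;
        endOffset = fo;
        // Assumption: the increment starts at 0 at end of stream; add trailing skips.
        positionIncrement = skippedPositions;
    }

    public String getTerm() { return term; }
    public int getStartOffset() { return startOffset; }
    public int getEndOffset() { return endOffset; }
    public int getPositionIncrement() { return positionIncrement; }
}

/** Hypothetical subclass: splits on whitespace and drops tokens longer than a limit. */
final class WhitespaceSketchTokenizer extends SketchTokenizerBase {
    private final String input;
    private final int maxTokenLength;
    private int pos = 0;

    WhitespaceSketchTokenizer(String input, int maxTokenLength) {
        this.input = input;
        this.maxTokenLength = maxTokenLength;
    }

    @Override
    protected boolean nextToken() {
        while (true) {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos >= input.length()) {
                return false;
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos - start > maxTokenLength) {
                skipToken(); // dropped, but its position is remembered
                continue;
            }
            emit(input.substring(start, pos), start, pos);
            return true;
        }
    }

    @Override
    protected int finalOffset() {
        return pos;
    }
}
```

With this shape a subclass only decides which tokens to keep and reports how far it read; the base class owns end(). The trade-off is the one named in the comment: anything that needs different end() behaviour would have to bypass the base class.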