[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Woodward updated LUCENE-8650:
----------------------------------
    Attachment: LUCENE-8650.patch

> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
>                 Key: LUCENE-8650
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8650
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Dan Meehl
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, LUCENE-8650-3.patch, LUCENE-8650.patch
>
> All (I think) TokenStream implementations set a "final offset" after calling super.end() in their end() methods. ConcatenatingTokenStream fails to do this. Because of this, its final offset is not readable, and DefaultIndexingChain in turn fails to set lastStartOffset properly. This results in indexing problems, which can include unsearchable content or IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it does not set its currentSource and offsetIncrement back to 0. Because of this, copyField directives (in the schema) do not work and content becomes unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for finalOffset, which, as you can see, ends up being 0.
> I created the next patch separately because it includes extra classes used for testing that Lucene may or may not want to merge in. This patch adds an integration test that loads some content into the 'text' field. The schema then copies it to 'content' using a copyField directive. The test searches the 'content' field for the loaded text and fails to find it, even though the field does contain the content. Flip the debug flag to see a nicer printout of the response and what's in the index.
> Notice that the added class I alluded to is KeywordTokenStream. This class had to be added because of another (ultimately unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenizers. This is because Tokenizer violates the contract put forth by TokenStream.reset(). That separate problem warrants its own ticket; however, KeywordTokenStream may ultimately be useful to others and could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting a finalOffset as the last task in the end() method, and by resetting currentSource, offsetIncrement, and finalOffset when reset() is called.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
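The end()/reset() pattern the patch describes can be illustrated with a simplified, Lucene-free model. The class and field names below (SimpleConcatStream, Token) are hypothetical stand-ins, not the real org.apache.lucene.analysis.miscellaneous.ConcatenatingTokenStream API; the point is only the state-management fix: end() must record a readable finalOffset, and reset() must restore currentSource and offsetIncrement so the stream can be consumed again (as copyField requires).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an offset-carrying token.
class Token {
    final String term;
    final int start, end;
    Token(String term, int start, int end) { this.term = term; this.start = start; this.end = end; }
}

// Simplified model of a concatenating stream with the fix applied.
class SimpleConcatStream {
    private final List<List<Token>> sources;
    private int currentSource;    // which source we are consuming
    private int currentToken;     // position within that source
    private int offsetIncrement;  // shift applied to offsets of later sources
    private int finalOffset;      // recorded by end(), readable afterwards

    SimpleConcatStream(List<List<Token>> sources) { this.sources = sources; }

    /** Returns the next token with shifted offsets, or null when exhausted. */
    Token incrementToken() {
        while (currentSource < sources.size()) {
            List<Token> src = sources.get(currentSource);
            if (currentToken < src.size()) {
                Token t = src.get(currentToken++);
                return new Token(t.term, t.start + offsetIncrement, t.end + offsetIncrement);
            }
            // Source exhausted: shift subsequent sources past its last offset.
            if (!src.isEmpty()) {
                offsetIncrement += src.get(src.size() - 1).end;
            }
            currentSource++;
            currentToken = 0;
        }
        return null;
    }

    /** The first bug: end() must record the final offset so consumers can read it. */
    void end() { finalOffset = offsetIncrement; }

    int getFinalOffset() { return finalOffset; }

    /** The second bug: reset() must restore all per-use state, or reuse breaks. */
    void reset() { currentSource = 0; currentToken = 0; offsetIncrement = 0; finalOffset = 0; }
}

public class ConcatDemo {
    public static void main(String[] args) {
        SimpleConcatStream s = new SimpleConcatStream(Arrays.asList(
                Arrays.asList(new Token("foo", 0, 3)),
                Arrays.asList(new Token("bar", 0, 3))));
        s.reset();
        Token t;
        while ((t = s.incrementToken()) != null) {
            System.out.println(t.term + " " + t.start + "-" + t.end);
        }
        s.end();
        System.out.println("finalOffset=" + s.getFinalOffset());
        // A second pass works only because reset() restores currentSource etc.
        s.reset();
        System.out.println("after reset: " + s.incrementToken().term);
    }
}
```

In the real class the fix also has to merge each source's own finalOffset via OffsetAttribute rather than tracking a plain int, but the lifecycle obligation is the same: record final state in end(), zero it in reset().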