[ https://issues.apache.org/jira/browse/LUCENE-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-4127:
---------------------------------------

    Attachment: LUCENE-4127.patch

I think we should also strongly check posIncr coming into IndexWriter. The attached patch does that and fixes a couple of tests that were sending posInc=0 for the first token.

> negative offsets/deltas corrumption
> -----------------------------------
>
>                 Key: LUCENE-4127
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4127
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>   Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-4127.patch, LUCENE-4127_test.patch
>
>
> If offsets go negative or backwards, it can corrupt the index with
> DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: the offsets will have wrong values
> (different from the term vectors) or even crazy values like -2147483645.
> The problem is that this is not just theoretical: it's too easy to do
> this with Lucene's own analyzer chains (e.g. NGramTokenizer).
>
> See issues such as LUCENE-3920 and some discussion on LUCENE-3738.
> The question is how to fix this, e.g. should we:
> # start enforcing that offsets cannot be crazy values in
> OffsetAttributeImpl/IndexWriter and fix the broken analyzers
> # leave offsets as a pair of opaque integers, declaring this a limitation of
> the current codec, and either work around it or throw UOE from the postings
> writer.
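
For readers who want the proposed checks in concrete form, here is a minimal sketch of the kind of validation discussed in this issue: rejecting posInc=0 on the first token, negative position increments, and offsets that are negative or go backwards. It is not the code from LUCENE-4127.patch and does not use Lucene's API. The class OffsetChecks, the Tok holder and the firstViolation method are hypothetical names made up for illustration, and the endOffset >= startOffset rule is an assumption about what a "negative delta" would mean.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the checks discussed in this issue; not Lucene's actual code.
final class OffsetChecks {

    // Minimal token view (illustrative): position increment plus start/end character offsets.
    static final class Tok {
        final int posIncr;
        final int startOffset;
        final int endOffset;

        Tok(int posIncr, int startOffset, int endOffset) {
            this.posIncr = posIncr;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Returns a description of the first invalid token, or empty if every token passes.
    static Optional<String> firstViolation(List<Tok> tokens) {
        int lastStartOffset = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.posIncr < 0) {
                return Optional.of("token " + i + ": negative posIncr");
            }
            if (i == 0 && t.posIncr == 0) {
                return Optional.of("token 0: posIncr=0 on first token");
            }
            if (t.startOffset < 0) {
                return Optional.of("token " + i + ": negative startOffset");
            }
            // Assumption: an end offset before the start offset counts as a negative delta.
            if (t.endOffset < t.startOffset) {
                return Optional.of("token " + i + ": endOffset before startOffset");
            }
            if (t.startOffset < lastStartOffset) {
                return Optional.of("token " + i + ": startOffset goes backwards");
            }
            lastStartOffset = t.startOffset;
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up token stream whose third token starts before the second one.
        List<Tok> tokens = Arrays.asList(new Tok(1, 0, 2), new Tok(1, 1, 3), new Tok(1, 0, 3));
        System.out.println(firstViolation(tokens).orElse("ok"));
    }
}
```

In terms of the two options listed in the issue, a check like this corresponds to the first one (enforce sane values where tokens enter the indexer and fix the analyzers that trip it) rather than treating offsets as opaque integers.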