Hi Mike,

Yes, they are not zero offsets - I was instinctively avoiding "negative offsets"; but they are indeed backward offsets.

Here is the token stream as produced by the analyzer chain indexing "THE HUBBLE constant: a summary of the hubble space telescope program"

term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
term=constant pos=1 type=word offsetStart=11 offsetEnd=20
term=summary pos=1 type=word offsetStart=23 offsetEnd=30
term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
term=space pos=1 type=word offsetStart=45 offsetEnd=50
term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
term=program pos=1 type=word offsetStart=61 offsetEnd=68

Sometimes, we'll even have a situation where synonyms overlap: for example "anti de sitter space time"

"anti de sitter space time" -> "antidesitter space" (one token spanning offsets 0-26; it gets emitted with the first token "anti" right now)
"space time" -> "spacetime" (synonym 16-26)
"space" -> "universe" (25-26)

Yes, weird, but useful if people want to search for `universe NEAR anti` -- but it is another use case that would be prohibited by the "new" rule.

DefaultIndexingChain checks each new token's offset against the last emitted token, so I don't see a way to emit the multi-token synonym with offsets spanning multiple tokens if even one of these tokens was already emitted. And the complement is equally true: if the multi-token synonym is emitted last in the group, it trips over `startOffset < invertState.lastStartOffset`

https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915

-roman

On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless <luc...@mikemccandless.com> wrote:
>
> Hi Roman,
>
> Hmm, this is all very tricky!
>
> First off, why do you call this "zero offsets"?
> Isn't it "backwards offsets" that your analysis chain is trying to produce?
>
> Second, in your first example, if you output the tokens in the right order,
> they would not violate the "offsets do not go backwards" check in
> IndexWriter? I thought IndexWriter is just checking that the startOffset for
> a token is not lower than the previous token's startOffset? (And that the
> token's endOffset is not lower than its startOffset).
>
> So I am confused why your first example is tripping up on IW's offset checks.
> Could you maybe redo the example, listing a single token per line with the
> start/end offsets they are producing?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> wrote:
>>
>> Hello devs,
>>
>> I wanted to create an issue but the helpful message in red letters
>> reminded me to ask first.
>>
>> While porting from Lucene 6.x to 7.x I'm struggling with a change that
>> was introduced in LUCENE-7626
>> (https://issues.apache.org/jira/browse/LUCENE-7626)
>>
>> It is believed that zero offset tokens are bad bad - Mike McCandless
>> made the change which made me automatically doubt myself. I must be
>> wrong, hell, I was living in sin the past 5 years!
>>
>> Sadly, we have been indexing and searching large volumes of data
>> without any corruption in the index whatsoever, but also without this new
>> change:
>>
>> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>>
>> With that change, our multi-token synonyms house of cards is falling.
>>
>> Mike has this wonderful blogpost explaining troubles with multi-token
>> synonyms:
>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>
>> The recommended way to index multi-token synonyms appears to be this:
>> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>>
>> BUT, but!
>> We don't want to place multi-token synonyms into the same
>> position as the other words. We want to preserve their positions! We
>> want to preserve information about offsets!
>>
>> Here is an example:
>>
>> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>>
>> This is how it gets indexed
>>
>> [(0, []),
>> (1, ['acr::hubble']),
>> (2, ['constant']),
>> (3, ['summary']),
>> (4, []),
>> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
>> (6, ['acr::space', 'space']),
>> (7, ['acr::telescope', 'telescope']),
>> (8, ['program'])]
>>
>> Notice position 5: the multi-token synonym `syn::hubble space
>> telescope` is on the first token which started the group
>> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> index the 'hubble' word there.
>>
>> If you were to search for a phrase "HST program" it will be found
>> because our search parser will search for ("HST ? ? program" | "Hubble
>> Space Telescope program")
>>
>> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>>
>> And because of those funny 'syn::' prefixes, we don't suffer from the
>> other problem that Mike described -- "hst space" phrase search will
>> NOT find this paper (and that is the correct behaviour)
>>
>> But all of this is possible only because Lucene was indexing tokens
>> with offsets that can be lower than the last emitted token; for
>> example 'hubble space telescope' will have offset 21-45; and the next
>> emitted token "space" will have offset 28-33
>>
>> And it just works (Lucene 6.x)
>>
>> Here is another proof with the appropriate verbiage ("crazy"):
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>>
>> Zero offsets have been working wonderfully for us so far. And I
>> actually cannot imagine how it can work without them - i.e.
>> without the ability to emit a token stream with offsets that are lower than
>> the last seen token.
>>
>> I haven't tried the SynonymFlatten filter, but because of this line in the
>> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>> to do what we need (as seen in the example above)
>>
>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>
>> What would you say? Is it a bug, is it not a bug but just some special
>> use case? If it is a special use case, what do we need to do? Plug in
>> our own indexing chain?
>>
>> Thanks!
>>
>> -roman
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
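
Addendum: to make the ordering constraint discussed in this thread concrete, here is a minimal sketch. It is a plain-Python simulation, not Lucene code; the function name `first_offset_violation` and the tuple layout are made up for illustration. It assumes the check consists only of the two conditions described in the thread (a token's startOffset must not be lower than the previous token's startOffset, and its endOffset must not be lower than its startOffset). The first list is the token stream from the reply; the second is a hypothetical reordering in which the multi-token synonym is emitted last in its group.

```python
from typing import List, Optional, Tuple

# (term, startOffset, endOffset); a hypothetical layout for this sketch only
Token = Tuple[str, int, int]


def first_offset_violation(tokens: List[Token]) -> Optional[str]:
    """Return the first term that breaks the assumed offset rules, or None."""
    last_start = 0
    for term, start, end in tokens:
        # Assumed rule 1: endOffset must not be lower than startOffset.
        # Assumed rule 2: startOffset must not go backwards.
        if end < start or start < last_start:
            return term
        last_start = start
    return None


# Token stream from the reply, in the order the analyzer chain emits it.
hubble_stream: List[Token] = [
    ("hubble", 4, 10),
    ("acr::hubble", 4, 10),
    ("constant", 11, 20),
    ("summary", 23, 30),
    ("hubble", 38, 44),
    ("syn::hubble space telescope", 38, 60),
    ("syn::hst", 38, 60),
    ("acr::hst", 38, 60),
    ("space", 45, 50),
    ("telescope", 51, 60),
    ("program", 61, 68),
]

# Hypothetical reordering: the multi-token synonym comes last in its group.
synonym_last: List[Token] = [
    ("hubble", 38, 44),
    ("space", 45, 50),
    ("telescope", 51, 60),
    ("syn::hubble space telescope", 38, 60),
]

if __name__ == "__main__":
    print(first_offset_violation(hubble_stream))
    print(first_offset_violation(synonym_last))
```

In the second list the synonym's startOffset (38) is lower than the startOffset of the preceding "telescope" token (51), which is the `startOffset < invertState.lastStartOffset` case mentioned in the reply.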