Hi Roman,

Can you share the full exception / stack trace that IndexWriter throws on that one *'d token in your first example? I thought IndexWriter checks 1) startOffset >= last token's startOffset, and 2) endOffset >= startOffset for the current token.
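For reference, a rough paraphrase of the per-token check being described here (this is a sketch of the enforcement around DefaultIndexingChain, not a verbatim quote of the Lucene source):

  // per emitted token, inside the indexing chain (paraphrased)
  int startOffset = offsetAttribute.startOffset();
  int endOffset = offsetAttribute.endOffset();
  if (startOffset < lastStartOffset || endOffset < startOffset) {
    throw new IllegalArgumentException(
        "startOffset must be non-negative, and endOffset must be >= startOffset, "
            + "and offsets must not go backwards");
  }
  lastStartOffset = startOffset;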
But you seem to be hitting an exception due to an endOffset check across tokens, which I didn't remember/realize IW was enforcing. Could you share a small standalone test case showing the first example? Maybe attach it to the issue (http://issues.apache.org/jira/browse/LUCENE-8776)?

Thanks,

Mike McCandless

http://blog.mikemccandless.com


On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla <roman.ch...@gmail.com> wrote:
> Hi Mike,
>
> Thanks for the question! And sorry for the delay, I didn't manage to
> get to it yesterday. I have generated better output, marked with (*)
> where it currently fails the first time, and also included one extra
> case to illustrate the PositionLength attribute.
>
> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
>         "title", "THE HUBBLE constant: a summary of the hubble space telescope program"));
>
> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68
>
> * - fails because offsetEnd < lastToken.offsetEnd; if reordered (the
> multi-token synonym emitted as the last token of the group) it would fail
> as well, because of the check that the current token's beginOffset must
> not be lower than the last token's beginOffset. Basically, any reordering
> would result in a failure (unless offsets are trimmed).
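A minimal standalone sketch of the kind of test being asked for here, reusing the token values around the *'d spot above; this assumes lucene-core, lucene-analyzers-common, and the test framework's CannedTokenStream, and it should throw only if IW really does enforce the cross-token endOffset check:

  import org.apache.lucene.analysis.CannedTokenStream;   // lucene-test-framework
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.FieldType;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexOptions;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.ByteBuffersDirectory;
  import org.apache.lucene.store.Directory;

  public class BackwardsOffsetsTest {
    public static void main(String[] args) throws Exception {
      // the three tokens around the *'d spot: original "hubble", the multi-token
      // synonym spanning offsets 38-60, then "space" whose endOffset (50) is
      // lower than the previously emitted endOffset (60)
      Token hubble = new Token("hubble", 38, 44);
      Token syn = new Token("syn::hubble space telescope", 38, 60);
      syn.setPositionIncrement(0);
      syn.setPositionLength(3);
      Token space = new Token("space", 45, 50);

      FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
      ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

      try (Directory dir = new ByteBuffersDirectory();
           IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
        Document doc = new Document();
        doc.add(new Field("title", new CannedTokenStream(hubble, syn, space), ft));
        iw.addDocument(doc);  // expected to throw IllegalArgumentException if IW rejects these offsets
      }
    }
  }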
>
> The following example has an additional twist because of `space-time`;
> the tokenizer first splits the word and generates two new tokens --
> those alternative tokens are then used to find synonyms (space == universe)
>
> assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
>         "title", "MIT and anti de sitter space-time"));
>
> term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
> term=syn::massachusetts institute of technology posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
> term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33
>
> So far, all of these cases could be handled with the new position
> length attribute. But let us look at a case where that would fail too.
>
> assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
>         "title", "Massachusetts Institute of Technology and antidesitter space-time"));
>
> term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
> term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
> term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
> term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
> term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
> term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
> term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
> term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
> term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
> term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
> term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
> term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
> term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
> term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
> term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
>
> Notice the posLen=4 of MIT; it would cover the tokens `massachusetts
> institute technology antidesitter` while the offsets are still correct.
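One way to make that visible is to decode the stream into graph edges, where the node is the absolute position and the edge length is posLen. A small sketch, not from the thread, with illustrative names:

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

  /** Walks a TokenStream and prints each token as a graph edge (fromNode -> toNode). */
  final class GraphDumper {
    static void dump(TokenStream ts) throws IOException {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
      PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
      OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
      ts.reset();
      int pos = -1;
      while (ts.incrementToken()) {
        pos += posInc.getPositionIncrement();
        System.out.printf("%s: node %d -> node %d, offsets %d-%d%n",
            term, pos, pos + posLen.getPositionLength(),
            offset.startOffset(), offset.endOffset());
      }
      ts.end();
      ts.close();
    }
  }

Run over the third example, syn::mit would print as node 0 -> node 4: with the stopword "of" removed, its posLen=4 edge ends at the node after `antidesitter` rather than after `technology`, which is exactly the mismatch described above.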
>
> This would, I think, affect not only highlighting, but also search
> (which is, at least for us, more important). But I can imagine that in
> more NLP-related domains, the ability to identify the source of a
> transformation could be more than a highlighting problem.
>
> Admittedly, most users would not care to notice, but it might be
> important to some. Fundamentally, I think, the problem translates to an
> inability to reconstruct the DAG (under certain circumstances)
> because of the lost pieces of information.
>
> ~roman
>
> On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
> <luc...@mikemccandless.com> wrote:
> >
> > Hi Roman,
> >
> > Sorry for the late reply!
> >
> > I think there remains substantial confusion about multi-token synonyms
> > and IW's enforcement of offsets. It really is worth thoroughly
> > iterating/understanding your examples so we can get to the bottom of this.
> > It looks to me like it is possible to emit tokens whose offsets do not go
> > backwards and that properly model your example synonyms, so I do not yet
> > see what the problem is. Maybe I am being blind/tired ...
> >
> > What do you mean by pos=2, pos=0, etc.? I think that is really the
> > position increment? Can you re-do the examples with posInc instead?
> > (Alternatively, you could keep "pos" but make it the absolute position,
> > not the increment?)
> >
> > Could you also add posLength to each token? This helps (me?) visualize
> > the resulting graph, even though IW does not enforce it today.
> >
> > Looking at your first example, "THE HUBBLE constant: a summary of the
> > hubble space telescope program", it looks to me like those tokens would all
> > be accepted by IW's checks as they are: startOffset never goes backwards,
> > and for every token, endOffset >= startOffset. Where in that first example
> > does IW throw an exception? Maybe insert a "** IW fails here" under the
> > problematic token? Or, maybe write a simple test case using e.g.
> > CannedTokenStream?
> >
> > Your second example should also be fine, and not at all weird, but could
> > you enumerate it into the specific tokens with posInc, posLength, start/end
> > offsets, "** IW fails here", etc., so we have a concrete example to discuss?
> >
> > Lucene's TokenStreams are really serializing a directed acyclic graph
> > (DAG), in a specific order, one transition at a time.
> > Ironically/strangely, it is similar to the graph that git history
> > maintains, and how "git log" then serializes that graph into an ordered
> > series of transitions. The simple int position in Lucene's TokenStream
> > corresponds to git's hashes, to uniquely identify each "node", though I
> > do not think there is an analog in git to Lucene's offsets. Hmm, maybe a
> > timestamp?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <roman.ch...@gmail.com> wrote:
> >>
> >> Hi Mike,
> >>
> >> Yes, they are not zero offsets -- I was instinctively avoiding
> >> "negative offsets"; but they are indeed backward offsets.
> >>
> >> Here is the token stream as produced by the analyzer chain indexing
> >> "THE HUBBLE constant: a summary of the hubble space telescope program"
> >>
> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> >> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> >> term=program pos=1 type=word offsetStart=61 offsetEnd=68
> >>
> >> Sometimes, we'll even have a situation where synonyms overlap: for
> >> example "anti de sitter space time"
> >>
> >> "anti de sitter space time" -> "antidesitter space" (one token
> >> spanning offsets 0-26; it gets emitted with the first token "anti" right now)
> >> "space time" -> "spacetime" (synonym 16-26)
> >> "space" -> "universe" (25-26)
> >>
> >> Yes, weird, but useful if people want to search for `universe NEAR
> >> anti` -- but another usecase which would be prohibited by the "new" rule.
> >>
> >> DefaultIndexingChain checks the new token's offset against the last emitted
> >> token, so I don't see a way to emit the multi-token synonym with
> >> offsets spanning multiple tokens if even one of those tokens was
> >> already emitted. And the complement is equally true: if the multi-token
> >> synonym is emitted as the last of the group, it trips over
> >> `startOffset < invertState.lastStartOffset`
> >>
> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>
> >> -roman
> >>
> >> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> >> <luc...@mikemccandless.com> wrote:
> >> >
> >> > Hi Roman,
> >> >
> >> > Hmm, this is all very tricky!
> >> >
> >> > First off, why do you call this "zero offsets"? Isn't it "backwards
> >> > offsets" that your analysis chain is trying to produce?
> >> >
> >> > Second, in your first example, if you output the tokens in the right
> >> > order, they would not violate the "offsets do not go backwards" check in
> >> > IndexWriter? I thought IndexWriter is just checking that the startOffset
> >> > for a token is not lower than the previous token's startOffset? (And that
> >> > the token's endOffset is not lower than its startOffset.)
> >> >
> >> > So I am confused why your first example is tripping up on IW's offset
> >> > checks. Could you maybe redo the example, listing a single token per line
> >> > with the start/end offsets they are producing?
> >> >
> >> > Mike McCandless
> >> >
> >> > http://blog.mikemccandless.com
> >> >
> >> >
> >> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> wrote:
> >> >>
> >> >> Hello devs,
> >> >>
> >> >> I wanted to create an issue but the helpful message in red letters
> >> >> reminded me to ask first.
> >> >>
> >> >> While porting from Lucene 6.x to 7.x I'm struggling with a change that
> >> >> was introduced in LUCENE-7626
> >> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >> >>
> >> >> It is believed that zero offset tokens are bad bad -- Mike McCandless
> >> >> made the change, which made me automatically doubt myself.
> >> >> I must be wrong; hell, I was living in sin for the past 5 years!
> >> >>
> >> >> Sadly, we have been indexing and searching large volumes of data
> >> >> without any corruption in the index whatsoever, but also without this
> >> >> new change:
> >> >>
> >> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >> >>
> >> >> With that change, our multi-token synonyms house of cards is falling.
> >> >>
> >> >> Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
> >> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >> >>
> >> >> The recommended way to index multi-token synonyms appears to be this:
> >> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >> >>
> >> >> BUT, but! We don't want to place the multi-token synonym into the same
> >> >> position as the other words. We want to preserve their positions! We
> >> >> want to preserve information about offsets!
> >> >>
> >> >> Here is an example:
> >> >>
> >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
> >> >>
> >> >> This is how it gets indexed:
> >> >>
> >> >> [(0, []),
> >> >> (1, ['acr::hubble']),
> >> >> (2, ['constant']),
> >> >> (3, ['summary']),
> >> >> (4, []),
> >> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
> >> >> (6, ['acr::space', 'space']),
> >> >> (7, ['acr::telescope', 'telescope']),
> >> >> (8, ['program'])]
> >> >>
> >> >> Notice position 5 -- the multi-token synonym `syn::hubble space
> >> >> telescope` token is on the first token which started the group
> >> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
> >> >> index the word 'hubble' there.
> >> >>
> >> >> If you were to search for the phrase "HST program", it will be found,
> >> >> because our search parser will search for ("HST ? ? program" | "Hubble
> >> >> Space Telescope program") -- see the query sketch appended at the end
> >> >> of this thread.
> >> >>
> >> >> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
> >> >>
> >> >> And because of those funny 'syn::' prefixes, we don't suffer from the
> >> >> other problem that Mike described -- an "hst space" phrase search will
> >> >> NOT find this paper (and that is correct behaviour).
> >> >>
> >> >> But all of this is possible only because Lucene was indexing tokens
> >> >> with offsets that can be lower than the last emitted token; for
> >> >> example, 'hubble space telescope' will have offsets 21-45, and the next
> >> >> emitted token "space" will have offsets 28-33.
> >> >>
> >> >> And it just works (Lucene 6.x).
> >> >>
> >> >> Here is another proof with the appropriate verbiage ("crazy"):
> >> >>
> >> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >> >>
> >> >> Zero offsets have been working wonderfully for us so far. And I
> >> >> actually cannot imagine how it can work without them -- i.e. without
> >> >> the ability to emit a token stream with offsets that are lower than
> >> >> the last seen token.
> >> >>
> >> >> I haven't tried the SynonymFlatten filter, but because of this line in
> >> >> the DefaultIndexingChain I'm convinced the flatten filter is not going
> >> >> to do what we need (as seen in the example above):
> >> >>
> >> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >> >>
> >> >> What would you say?
> >> >> Is it a bug, is it not a bug but just some special
> >> >> usecase? If it is a special usecase, what do we need to do? Plug in
> >> >> our own indexing chain?
> >> >>
> >> >> Thanks!
> >> >>
> >> >> -roman
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >>
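A sketch of how the ("HST ? ? program" | "Hubble Space Telescope program") rewrite mentioned above could be expressed with Lucene's PhraseQuery against the positions in the example; the field name and syn:: terms are taken from the example, the rest is illustrative and not the actual query parser from the thread:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.PhraseQuery;
  import org.apache.lucene.search.Query;

  public class SynonymPhraseQueryExample {
    public static void main(String[] args) {
      // "HST ? ? program": syn::hst sits at the position of "hubble" (5 in the
      // example) and "program" three positions later, so they are anchored at
      // relative positions 0 and 3 inside one phrase.
      PhraseQuery hstWithGaps = new PhraseQuery.Builder()
          .add(new Term("title", "syn::hst"), 0)
          .add(new Term("title", "program"), 3)
          .build();

      // The expanded surface form as a plain consecutive phrase.
      PhraseQuery expanded = new PhraseQuery.Builder()
          .add(new Term("title", "hubble"), 0)
          .add(new Term("title", "space"), 1)
          .add(new Term("title", "telescope"), 2)
          .add(new Term("title", "program"), 3)
          .build();

      // Either alternative may match.
      Query q = new BooleanQuery.Builder()
          .add(hstWithGaps, BooleanClause.Occur.SHOULD)
          .add(expanded, BooleanClause.Occur.SHOULD)
          .build();

      System.out.println(q);
    }
  }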