Hi Roman,

Can you share the full exception / stack trace that IndexWriter throws on
that one *'d token in your first example?  I thought IndexWriter checks 1)
startOffset >= last token's startOffset, and 2) endOffset >= startOffset
for the current token.

But you seem to be hitting an exception due to an endOffset check across
tokens, which I didn't remember/realize IW was enforcing.
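
For reference, the two checks described above can be sketched in plain Java. This is an illustration only, not Lucene's code: the class `OffsetCheckSketch`, its `Token` holder and `firstViolation` are hypothetical names, and it assumes these two checks are the only offset checks applied. Run over the first example's tokens, in the order listed in the quoted message below, it finds no violation, which is why the full stack trace would help.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only, not Lucene code: models just the two checks named above. */
final class OffsetCheckSketch {

    /** Hypothetical token holder: term text plus start/end character offsets. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Returns the index of the first token failing either check, or -1 if all pass. */
    static int firstViolation(List<Token> tokens) {
        int lastStartOffset = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            boolean startWentBackwards = t.startOffset < lastStartOffset; // check 1
            boolean endBeforeStart = t.endOffset < t.startOffset;         // check 2
            if (startWentBackwards || endBeforeStart) {
                return i;
            }
            lastStartOffset = t.startOffset;
        }
        return -1;
    }

    /** Tokens of the first example, in the order given in the quoted message. */
    static List<Token> firstExample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("hubble", 4, 10));
        tokens.add(new Token("acr::hubble", 4, 10));
        tokens.add(new Token("constant", 11, 20));
        tokens.add(new Token("summary", 23, 30));
        tokens.add(new Token("hubble", 38, 44));
        tokens.add(new Token("syn::hubble space telescope", 38, 60));
        tokens.add(new Token("syn::hst", 38, 60));
        tokens.add(new Token("acr::hst", 38, 60));
        tokens.add(new Token("space", 45, 50)); // the token marked with * below
        tokens.add(new Token("telescope", 51, 60));
        tokens.add(new Token("program", 61, 68));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println("first violation index: " + firstViolation(firstExample()));
    }
}
```

Under these assumptions `firstViolation` returns -1 for this stream, so whatever rejects the starred token would have to be a different check.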

Could you share a small standalone test case showing the first example?
Maybe attach it to the issue (
http://issues.apache.org/jira/browse/LUCENE-8776)?

Thanks,

Mike McCandless

http://blog.mikemccandless.com


On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla <roman.ch...@gmail.com> wrote:

> Hi Mike,
>
> Thanks for the question! And sorry for the delay, I didn't manage to
> get to it yesterday. I have generated better output, marked with (*)
> where it currently fails the first time, and also included one extra
> case to illustrate the PositionLength attribute.
>
> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
>         "title", "THE HUBBLE constant: a summary of the hubble space
> telescope program"));
>
>
> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM
> offsetStart=38 offsetEnd=60
> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68
>
> * - fails because of offsetEnd < lastToken.offsetEnd; if reordered
> (the multi-token synonym emitted as the last token) it would fail as
> well, because of the check that currentToken.startOffset >=
> lastToken.startOffset. Basically, any reordering would result in a
> failure (unless offsets are trimmed).
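
The two orderings discussed above can be compared with a small plain-Java sketch. It is an illustration under assumptions, not Lucene code: `ReorderingSketch` and its methods are hypothetical names, the start-offset rule is the one Mike describes, and the end-offset rule is only the behaviour reported here (whether IndexWriter enforces it is the open question of this thread). Only the four tokens around the synonym are modelled.

```java
/** Illustration only, not Lucene code. Each row is {startOffset, endOffset}. */
final class ReorderingSketch {

    /** Order as emitted in the example above. */
    static final int[][] SYNONYM_FIRST = {
        {38, 44}, // hubble
        {38, 60}, // syn::hubble space telescope
        {45, 50}, // space (the token marked with *)
        {51, 60}, // telescope
    };

    /** Hypothetical reordering: the multi-token synonym is emitted last. */
    static final int[][] SYNONYM_LAST = {
        {38, 44}, // hubble
        {45, 50}, // space
        {51, 60}, // telescope
        {38, 60}, // syn::hubble space telescope
    };

    /** Index of the first token whose startOffset is lower than the previous one, or -1. */
    static int firstStartOffsetViolation(int[][] tokens) {
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i][0] < tokens[i - 1][0]) {
                return i;
            }
        }
        return -1;
    }

    /** Index of the first token whose endOffset is lower than the previous one, or -1. */
    static int firstEndOffsetViolation(int[][] tokens) {
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i][1] < tokens[i - 1][1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("synonym first, start rule: " + firstStartOffsetViolation(SYNONYM_FIRST));
        System.out.println("synonym first, end rule:   " + firstEndOffsetViolation(SYNONYM_FIRST));
        System.out.println("synonym last,  start rule: " + firstStartOffsetViolation(SYNONYM_LAST));
        System.out.println("synonym last,  end rule:   " + firstEndOffsetViolation(SYNONYM_LAST));
    }
}
```

Under these assumptions the emitted order trips only the end-offset rule (at `space`), while the reordered stream trips only the start-offset rule (at the synonym).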
>
>
>
> The following example has an additional twist because of `space-time`;
> the tokenizer first splits the word and generates two new tokens --
> those alternative tokens are then used to find synonyms (space ==
> universe)
>
> assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
>         "title", "MIT and anti de sitter space-time"));
>
>
> term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
> term=syn::massachusetts institute of technology posInc=0 posLen=1
> type=SYNONYM offsetStart=0 offsetEnd=3
> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM
> offsetStart=8 offsetEnd=28
> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM
> offsetStart=8 offsetEnd=28
> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM
> offsetStart=8 offsetEnd=28
> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23
> offsetEnd=28
> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
> term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33
>
> So far, all of these cases could be handled with the new position
> length attribute. But let us look at a case where that would fail too.
>
> assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
>         "title", "Massachusetts Institute of Technology and
> antidesitter space-time"));
>
>
> term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
> term=syn::massachusetts institute of technology posInc=0 posLen=4
> type=SYNONYM offsetStart=0 offsetEnd=36
> term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
> term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
> term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
> term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
> term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
> term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
> term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
> term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM
> offsetStart=41 offsetEnd=59
> term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM
> offsetStart=41 offsetEnd=59
> term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM
> offsetStart=41 offsetEnd=59
> term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54
> offsetEnd=59
> term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
> term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
>
> Notice the posLen=4 of MIT; it would cover tokens `massachusetts
> institute technology antidesitter` while offsets are still correct.
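
To make the position arithmetic concrete, here is a minimal plain-Java sketch (hypothetical class and method names, not Lucene code). It assumes positions start at -1 and advance by posInc, and that a token with posLen=n spans positions [pos, pos+n). It uses a subset of the tokens listed above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only, not Lucene code: which tokens does a posLen span cover? */
final class PosLengthSketch {

    // Subset of the token stream above, in emitted order.
    static final String[] TERMS =
        {"massachusetts", "syn::mit", "institute", "technology", "antidesitter", "space"};
    static final int[] POS_INC = {1, 0, 1, 1, 1, 1};
    static final int[] POS_LEN = {1, 4, 1, 1, 1, 1};

    /** Returns the other terms whose position falls inside the span of spanningTerm. */
    static List<String> coveredWords(String spanningTerm) {
        int[] positions = new int[TERMS.length];
        int pos = -1;        // assumed starting position before the first token
        int spanStart = -1;
        int spanEnd = -1;    // exclusive; stays empty if the term is not found
        for (int i = 0; i < TERMS.length; i++) {
            pos += POS_INC[i];
            positions[i] = pos;
            if (TERMS[i].equals(spanningTerm)) {
                spanStart = pos;
                spanEnd = pos + POS_LEN[i];
            }
        }
        List<String> covered = new ArrayList<>();
        for (int i = 0; i < TERMS.length; i++) {
            boolean inside = positions[i] >= spanStart && positions[i] < spanEnd;
            if (inside && !TERMS[i].equals(spanningTerm)) {
                covered.add(TERMS[i]);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        System.out.println(coveredWords("syn::mit"));
    }
}
```

In the listed stream `technology` and `antidesitter` both have posInc=1, so the removed words leave no position gap and the four-position span reaches `antidesitter`.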
>
> This would, I think, affect not only highlighting, but also search
> (which is, at least for us, more important). But I can imagine that in
> more NLP-related domains, the ability to identify the source of a
> transformation could be more than a highlighting problem.
>
> Admittedly, most users would not care to notice, but it might be
> important to some. Fundamentally, I think, the problem translates to
> an inability to reconstruct the DAG (under certain circumstances)
> because of the lost pieces of information.
>
> ~roman
>
> On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
> <luc...@mikemccandless.com> wrote:
> >
> > Hi Roman,
> >
> > Sorry for the late reply!
> >
> > I think there remains substantial confusion about multi-token synonyms
> and IW's enforcement of offsets.  It really is worth thoroughly
> iterating/understanding your examples so we can get to the bottom of this.
> It looks to me like it is possible to emit tokens whose offsets do not go
> backwards and that properly model your example synonyms, so I do not yet
> see what the problem is.  Maybe I am being blind/tired ...
> >
> > What do you mean by pos=2, pos=0, etc.?  I think that is really the
> position increment?  Can you re-do the examples with posInc instead?
> (Alternatively, you could keep "pos" but make it the absolute position, not
> the increment?).
> >
> > Could you also add posLength to each token?  This helps (me?) visualize
> the resulting graph, even though IW does not enforce it today.
> >
> > Looking at your first example, "THE HUBBLE constant: a summary of the
> hubble space telescope program", it looks to me like those tokens would all
> be accepted by IW's checks as they are?  startOffset never goes backwards,
> and for every token, endOffset >= startOffset.  Where in that first example
> does IW throw an exception?  Maybe insert a "** IW fails here" under the
> problematic token?  Or, maybe write a simple test case using e.g.
> CannedTokenStream?
> >
> > Your second example should also be fine, and not at all weird, but could
> you enumerate it into the specific tokens with posInc, posLength, start/end
> offset, "** IW fails here", etc., so we have a concrete example to discuss?
> >
> > Lucene's TokenStreams are really serializing a directed acyclic graph
> (DAG), in a specific order, one transition at a time.
> Ironically/strangely, it is similar to the graph that git history
> maintains, and how "git log" then serializes that graph into an ordered
> series of transitions.  The simple int position in Lucene's TokenStream
> corresponds to git's githashes, to uniquely identify each "node", though, I
> do not think there is an analog in git to Lucene's offsets.  Hmm, maybe a
> timestamp?
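
One way to picture this serialization is a plain-Java sketch that turns (posInc, posLen) pairs into graph arcs. The names are hypothetical and this is not Lucene code; it assumes each token is an arc from node `pos` to node `pos + posLen`, with positions starting at -1. The token values are illustrative, loosely based on the "hubble space telescope program" part of the first example, and the posLen values are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only, not Lucene code: tokens as arcs of a position graph. */
final class TokenGraphSketch {

    /** Formats each token as "from -> to : term", in emitted order. */
    static List<String> toArcs(String[] terms, int[] posInc, int[] posLen) {
        List<String> arcs = new ArrayList<>();
        int pos = -1; // assumed starting position before the first token
        for (int i = 0; i < terms.length; i++) {
            pos += posInc[i];                 // node the arc leaves from
            int to = pos + posLen[i];         // node the arc arrives at
            arcs.add(pos + " -> " + to + " : " + terms[i]);
        }
        return arcs;
    }

    /** Illustrative stream: a three-position synonym stacked on "hubble". */
    static List<String> example() {
        String[] terms = {"hubble", "syn::hst", "space", "telescope", "program"};
        int[] posInc = {1, 0, 1, 1, 1};
        int[] posLen = {1, 3, 1, 1, 1};
        return toArcs(terms, posInc, posLen);
    }

    public static void main(String[] args) {
        for (String arc : example()) {
            System.out.println(arc);
        }
    }
}
```

Here `syn::hst` becomes the arc 0 -> 3, running parallel to the path hubble, space, telescope, and both paths rejoin at the node where `program` starts.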
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <roman.ch...@gmail.com>
> wrote:
> >>
> >> Hi Mike,
> >>
> >> Yes, they are not zero offsets - I was instinctively avoiding
> >> "negative offsets"; but they are indeed backward offsets.
> >>
> >> Here is the token stream as produced by the analyzer chain indexing
> >> "THE HUBBLE constant: a summary of the hubble space telescope program"
> >>
> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38
> offsetEnd=60
> >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> >> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> >> term=program pos=1 type=word offsetStart=61 offsetEnd=68
> >>
> >> Sometimes, we'll even have a situation when synonyms overlap: for
> >> example "anti de sitter space time"
> >>
> >> "anti de sitter space time" -> "antidesitter space" (one token
> >> spanning offsets 0-26; it gets emitted with the first token "anti"
> >> right now)
> >> "space time" -> "spacetime" (synonym 16-26)
> >> "space" -> "universe" (25-26)
> >>
> >> Yes, weird, but useful if people want to search for `universe NEAR
> >> anti` -- but another use case which would be prohibited by the "new"
> >> rule.
> >>
> >> DefaultIndexingChain checks new token offset against the last emitted
> >> token, so I don't see a way to emit the multi-token synonym with
> >> offsets spanning multiple tokens if even one of these tokens was
> >> already emitted. And the complement is equally true: if the multi-token
> >> synonym is emitted as the last of the group, it trips over `startOffset <
> >> invertState.lastStartOffset`
> >>
> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>
> >>
> >>   -roman
> >>
> >> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> >> <luc...@mikemccandless.com> wrote:
> >> >
> >> > Hi Roman,
> >> >
> >> > Hmm, this is all very tricky!
> >> >
> >> > First off, why do you call this "zero offsets"?  Isn't it "backwards
> offsets" that your analysis chain is trying to produce?
> >> >
> >> > Second, in your first example, if you output the tokens in the right
> order, they would not violate the "offsets do not go backwards" check in
> IndexWriter?  I thought IndexWriter is just checking that the startOffset
> for a token is not lower than the previous token's startOffset?  (And that
> the token's endOffset is not lower than its startOffset).
> >> >
> >> > So I am confused why your first example is tripping up on IW's offset
> checks.  Could you maybe redo the example, listing a single token per line
> with the start/end offsets they are producing?
> >> >
> >> > Mike McCandless
> >> >
> >> > http://blog.mikemccandless.com
> >> >
> >> >
> >> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com>
> wrote:
> >> >>
> >> >> Hello devs,
> >> >>
> >> >> I wanted to create an issue but the helpful message in red letters
> >> >> reminded me to ask first.
> >> >>
> >> >> While porting from Lucene 6.x to 7.x I'm struggling with a change that
> >> >> was introduced in LUCENE-7626
> >> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >> >>
> >> >> It is believed that zero offset tokens are bad bad - Mike McCandless
> >> >> made the change which made me automatically doubt myself. I must be
> >> >> wrong, hell, I was living in sin the past 5 years!
> >> >>
> >> >> Sadly, we have been indexing and searching large volumes of data
> >> >> without any corruption in the index whatsoever, but also without this new
> >> >> change:
> >> >>
> >> >>
> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >> >>
> >> >> With that change, our multi-token synonyms house of cards is falling.
> >> >>
> >> >> Mike has this wonderful blogpost explaining troubles with
> multi-token synonyms:
> >> >>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >> >>
> >> >> Recommended way to index multi-token synonyms appears to be this:
> >> >>
> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >> >>
> >> >> BUT, but! We don't want to place multi-token synonym into the same
> >> >> position as the other words. We want to preserve their positions! We
> >> >> want to preserve information about offsets!
> >> >>
> >> >> Here is an example:
> >> >>
> >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE
> program
> >> >>
> >> >> This is how it gets indexed
> >> >>
> >> >> [(0, []),
> >> >> (1, ['acr::hubble']),
> >> >> (2, ['constant']),
> >> >> (3, ['summary']),
> >> >> (4, []),
> >> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope',
> 'hubble']),
> >> >> (6, ['acr::space', 'space']),
> >> >> (7, ['acr::telescope', 'telescope']),
> >> >> (8, ['program'])]
> >> >>
> >> >> Notice the position 5 - multi-token synonym `syn::hubble space
> >> >> telescope` token is on the first token which started the group
> >> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
> >> >> index the 'hubble' word there.
> >> >>
> >> >> If you were to search for a phrase "HST program" it will be found
> >> >> because our search parser will search for ("HST ? ? program" |
> "Hubble
> >> >> Space Telescope program")
> >> >>
> >> >> It simply found that by looking at synonyms: HST -> Hubble Space
> Telescope
> >> >>
> >> >> And because of those funny 'syn::' prefixes, we don't suffer from the
> >> >> other problem that Mike described -- "hst space" phrase search will
> >> >> NOT find this paper (and that is a correct behaviour)
> >> >>
> >> >> But all of this is possible only because lucene was indexing tokens
> >> >> with offsets that can be lower than the last emitted token; for
> >> >> example 'hubble space telescope' will have offset 21-45; and the next
> >> >> emitted token "space" will have offset 28-33
> >> >>
> >> >> And it just works (lucene 6.x)
> >> >>
> >> >> Here is another proof with the appropriate verbiage ("crazy"):
> >> >>
> >> >>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >> >>
> >> >> Zero offsets have been working wonderfully for us so far. And I
> >> >> actually cannot imagine how it can work without them - i.e. without
> >> >> the ability to emit a token stream with offsets that are lower than
> >> >> the last seen token.
> >> >>
> >> >> I haven't tried SynonymFlatten filter, but because of this line in
> the
> >> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
> >> >> to do what we need (as seen in the example above)
> >> >>
> >> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >> >>
> >> >> What would you say? Is it a bug, is it not a bug but just some
> special
> >> >> use case? If it is a special use case, what do we need to do? Plug in
> >> >> our own indexing chain?
> >> >>
> >> >> Thanks!
> >> >>
> >> >>   -roman
> >> >>
> >> >>
>
