Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

David Smiley Thu, 06 Aug 2020 08:08:45 -0700

I sympathize with your pain, Roman.

It appears we can't really do index-time multi-word synonyms because of the
offset ordering rule.  But it's not just synonyms, it's other forms of
multi-token expansion.  Where I work, I've seen an interesting approach to
mixed language text analysis in which a sophisticated Tokenizer effectively
re-tokenizes an input multiple ways by producing a token stream that is a
concatenation of different interpretations of the input.  On a Lucene
upgrade, we had to "coarsen" the offsets to the point of having highlights
that point to a whole sentence instead of the words in that sentence :-(.
I need to do something to fix this; I'm trying hard to resist modifying our
Lucene fork for this constraint.  Maybe instead of being concatenated, the
interpretations could be interleaved / overlapped, but they aren't
necessarily aligned well enough to make this possible without risking
breaking position-sensitive queries.
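
To make the conflict concrete, here is a minimal sketch in plain Python (not
Lucene code).  It assumes a hypothetical tokenizer that emits two
interpretations of the same text back to back, and it models the offset rule
discussed in this thread as "start offsets must never decrease".  The names
offsets_go_backwards, by_word, by_bigram, concatenated and coarsened are
invented for illustration only.

```python
# Toy model: a token is (term, start_offset, end_offset).
# Assumption: the indexing check rejects any token whose start offset is
# lower than the start offset of the token emitted just before it.

def offsets_go_backwards(tokens):
    """Return True if any token starts before the previously emitted token."""
    last_start = 0
    for _term, start, _end in tokens:
        if start < last_start:
            return True
        last_start = start
    return False

text = "new york"

# Two hypothetical interpretations of the same input.
by_word = [("new", 0, 3), ("york", 4, 8)]
by_bigram = [("ne", 0, 2), ("ew", 1, 3), ("yo", 4, 6), ("or", 5, 7), ("rk", 6, 8)]

# Concatenating them restarts the offsets at 0 in the middle of the stream.
concatenated = by_word + by_bigram

# Coarsening: every token gets the offsets of the whole sentence, which
# satisfies the rule but loses word-level highlighting.
coarsened = [(term, 0, len(text)) for term, _s, _e in concatenated]

print(offsets_go_backwards(concatenated))  # True
print(offsets_go_backwards(coarsened))     # False
```

Under this model the concatenated stream is rejected as soon as the second
interpretation begins, while the coarsened stream passes only because every
token carries identical offsets.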


So... I'm not a fan of this constraint on offsets.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <[email protected]> wrote:

> Hi Mike,
>
> Yes, they are not zero offsets - I was instinctively avoiding
> "negative offsets"; but they are indeed backward offsets.
>
> Here is the token stream as produced by the analyzer chain indexing
> "THE HUBBLE constant: a summary of the hubble space telescope program"
>
> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>
> Sometimes, we'll even have a situation when synonyms overlap: for
> example "anti de sitter space time"
>
> "anti de sitter space time" -> "antidesitter space" (one token
> spanning offsets 0-26; it gets emitted with the first token "anti"
> right now)
> "space time" -> "spacetime" (synonym 16-26)
> "space" -> "universe" (25-26)
>
> Yes, weird, but useful if people want to search for `universe NEAR
> anti` -- yet another use case that would be prohibited by the "new"
> rule.
>
> DefaultIndexingChain checks the new token's offsets against the last
> emitted token, so I don't see a way to emit the multi-token synonym with
> offsets spanning multiple tokens if even one of these tokens was
> already emitted. And the complement is equally true: if the multi-token
> synonym is emitted as the last of the group, it trips over
> `startOffset < invertState.lastStartOffset`
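
The condition quoted above can be modelled outside Lucene.  The sketch below
is a toy in plain Python, not the actual DefaultIndexingChain code: it keeps
only the comparison against the last start offset, plus the "endOffset is not
lower than startOffset" requirement mentioned further down the thread, and
ignores everything else the indexing chain does.  The function name
first_rejected and the two orderings are assumptions made for illustration;
the offsets come from the token listing above.

```python
# Toy model of the quoted check. A token is (term, start_offset, end_offset).

def first_rejected(tokens):
    """Return the first token that violates the modelled rule, or None."""
    last_start = 0
    for token in tokens:
        _term, start, end = token
        # Same shape as `startOffset < invertState.lastStartOffset`,
        # plus the end >= start requirement.
        if start < last_start or end < start:
            return token
        last_start = start
    return None

synonym = ("syn::hubble space telescope", 38, 60)
words = [("hubble", 38, 44), ("space", 45, 50), ("telescope", 51, 60)]

# Synonym emitted together with the first word of the group.
synonym_first = [words[0], synonym] + words[1:]
# Synonym emitted after the last word of the group.
synonym_last = words + [synonym]

print(first_rejected(synonym_first))  # None
print(first_rejected(synonym_last))   # ('syn::hubble space telescope', 38, 60)
```

In this simplified model the synonym is accepted only while no later-starting
token of its group has been emitted yet; once "space" or "telescope" has gone
out, the synonym's start offset of 38 is lower than the last start offset.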
>
>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>
>
>   -roman
>
> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> <[email protected]> wrote:
> >
> > Hi Roman,
> >
> > Hmm, this is all very tricky!
> >
> > First off, why do you call this "zero offsets"?  Isn't it "backwards
> > offsets" that your analysis chain is trying to produce?
> >
> > Second, in your first example, if you output the tokens in the right
> > order, they would not violate the "offsets do not go backwards" check in
> > IndexWriter?  I thought IndexWriter is just checking that the startOffset
> > for a token is not lower than the previous token's startOffset?  (And that
> > the token's endOffset is not lower than its startOffset).
> >
> > So I am confused why your first example is tripping up on IW's offset
> > checks.  Could you maybe redo the example, listing single token per line
> > with the start/end offsets they are producing?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <[email protected]> wrote:
> >>
> >> Hello devs,
> >>
> >> I wanted to create an issue but the helpful message in red letters
> >> reminded me to ask first.
> >>
> >> While porting from Lucene 6.x to 7.x I'm struggling with a change that
> >> was introduced in LUCENE-7626
> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >>
> >> It is believed that zero offset tokens are bad bad - Mike McCandless
> >> made the change which made me automatically doubt myself. I must be
> >> wrong, hell, I was living in sin the past 5 years!
> >>
> >> Sadly, we have been indexing and searching large volumes of data
> >> without any corruption in the index whatsoever, but also without this
> >> new change:
> >>
> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >>
> >> With that change, our multi-token synonyms house of cards is falling.
> >>
> >> Mike has this wonderful blogpost explaining troubles with multi-token
> >> synonyms:
> >>
> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >>
> >> The recommended way to index multi-token synonyms appears to be this:
> >>
> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >>
> >> BUT, but! We don't want to place a multi-token synonym into the same
> >> position as the other words. We want to preserve their positions! We
> >> want to preserve information about offsets!
> >>
> >> Here is an example:
> >>
> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
> >>
> >> This is how it gets indexed
> >>
> >> [(0, []),
> >> (1, ['acr::hubble']),
> >> (2, ['constant']),
> >> (3, ['summary']),
> >> (4, []),
> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
> >> (6, ['acr::space', 'space']),
> >> (7, ['acr::telescope', 'telescope']),
> >> (8, ['program'])]
> >>
> >> Notice position 5: the multi-token synonym token `syn::hubble space
> >> telescope` sits at the position of the first token that started the
> >> group (emitted by Lucene's synonym filter). hst is another synonym; we
> >> also index the 'hubble' word there.
> >>
> >> If you were to search for the phrase "HST program", it would be found
> >> because our search parser will search for ("HST ? ? program" | "Hubble
> >> Space Telescope program")
> >>
> >> It simply finds that by looking at synonyms: HST -> Hubble Space Telescope
> >>
> >> And because of those funny 'syn::' prefixes, we don't suffer from the
> >> other problem that Mike described -- an "hst space" phrase search will
> >> NOT find this paper (and that is the correct behaviour)
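
The phrase matching described above can be illustrated with a toy positional
index in plain Python.  This is only a sketch under stated assumptions: the
index copies the position listing from the message, the "?" slots are
modelled as wildcards, and the idea that the parser pads `syn::hst` with two
wildcard slots (because the synonym stands for three tokens) is inferred from
the "HST ? ? program" example, not taken from the actual parser.  The names
index, ANY and phrase_matches are hypothetical.

```python
# Toy positional index built from the listing above: position -> terms.
index = {
    1: ["acr::hubble"],
    2: ["constant"],
    3: ["summary"],
    5: ["acr::hubble", "syn::hst", "syn::hubble space telescope", "hubble"],
    6: ["acr::space", "space"],
    7: ["acr::telescope", "telescope"],
    8: ["program"],
}

ANY = None  # wildcard slot, written "?" in the message above

def phrase_matches(index, slots):
    """True if some start position matches every non-wildcard slot in order."""
    for start in index:
        if all(
            term is ANY or term in index.get(start + i, [])
            for i, term in enumerate(slots)
        ):
            return True
    return False

# "HST program" rewritten as "syn::hst ? ? program"
print(phrase_matches(index, ["syn::hst", ANY, ANY, "program"]))              # True
# "Hubble Space Telescope program" as plain words
print(phrase_matches(index, ["hubble", "space", "telescope", "program"]))    # True
# "hst space": no unprefixed "hst" is indexed
print(phrase_matches(index, ["hst", "space"]))                               # False
# "hst space" with the assumed padding: "syn::hst ? ? space"
print(phrase_matches(index, ["syn::hst", ANY, ANY, "space"]))                # False
```

Note that the padding matters in this model: an unpadded `syn::hst` followed
directly by `space` would match positions 5 and 6.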
> >>
> >> But all of this is possible only because Lucene was indexing tokens
> >> with offsets that can be lower than the last emitted token; for
> >> example 'hubble space telescope' will have offset 21-45; and the next
> >> emitted token "space" will have offset 28-33
> >>
> >> And it just works (lucene 6.x)
> >>
> >> Here is another proof with the appropriate verbiage ("crazy"):
> >>
> >>
> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >>
> >> Zero offsets have been working wonderfully for us so far. And I
> >> actually cannot imagine how it can work without them - i.e. without
> >> the ability to emit a token stream with offsets that are lower than
> >> the last seen token.
> >>
> >> I haven't tried SynonymFlatten filter, but because of this line in the
> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
> >> to do what we need (as seen in the example above)
> >>
> >>
> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>
> >> What would you say? Is it a bug, or is it not a bug but just some
> >> special use case? If it is a special use case, what do we need to do?
> >> Plug in our own indexing chain?
> >>
> >> Thanks!
> >>
> >>   -roman
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
>
