Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Roman Chyla Mon, 10 Aug 2020 10:31:12 -0700

I'll have to find a solution for this somehow; giving up offsets seems
like too big a price to pay. Overriding DefaultIndexingChain is not
exactly easy. Aside from forking Lucene and editing it, which I
decidedly don't want to do, the only thing I can think of is to trick
the classloader into loading a different version of the chain (praying
this can be done without compromising security; I have not followed
JDK evolution for some time...). Monkey-patching, OK, I can live with
that... :-)


It *seems* to me that the original reason for the negative offset
checks was that a negative vInt (and possibly vLong) could have been
written: https://issues.apache.org/jira/browse/LUCENE-3738

Some of the patches on that issue seem to have addressed the
underlying problem, but a much shorter version of the patch was
committed, even though the perf results were not conclusive (i.e. the
longer patch might have performed fine). To really understand it, one
would have to spend more than 10 minutes reading the comments.
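
For context on why a negative value is awkward in a variable-length
integer: in the usual 7-bits-per-byte scheme, small non-negative values
take one byte, but a negative 32-bit value always takes the maximum of
five. A minimal Python sketch of that scheme follows. It is an
illustration only, not Lucene's code; `write_vint` and the sample deltas
are made up for this example.

```python
def write_vint(value: int) -> bytes:
    """Encode a 32-bit int, 7 bits per byte, high bit = 'more bytes follow'."""
    v = value & 0xFFFFFFFF  # view the value as unsigned 32-bit (two's complement)
    out = bytearray()
    while v & ~0x7F:                   # more than 7 significant bits remain
        out.append((v & 0x7F) | 0x80)  # emit low 7 bits, set continuation bit
        v >>= 7
    out.append(v)                      # final byte, continuation bit clear
    return bytes(out)


# Hypothetical offset deltas: one forward, one backward.
print(len(write_vint(7)))    # 1
print(len(write_vint(-13)))  # 5
```

So a hypothetical forward delta of 7 costs one byte, while a backward
delta of -13 costs five, no matter how small its magnitude.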

Further to the point, I think negative offsets can be produced only on
the very first token, unless there is a bug in a filter (there was/is
a separate check for that in 6.x and perhaps it is still there in 7.x).
Checking only for that would be much less restrictive than the current
condition, which disallows all backward offsets. We never ran into
index corruption in Lucene 4.x-6.x, so I really wonder if the "forbid
all backwards offsets" approach might be too restrictive.
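
To make the difference between the two rules concrete, here is a small
Python sketch. It is not Lucene code. The strict rule is modelled on the
`startOffset < invertState.lastStartOffset` condition quoted further
down, plus the "endOffset is not lower than startOffset" check Mike
mentions. The lenient rule is only one possible reading of "reject
negative offsets, allow backward ones". The function names are
hypothetical.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, int, int]  # (term, start offset, end offset)


def first_rejected_strict(tokens: List[Token]) -> Optional[int]:
    """Model of the current rule: start offsets must never go backwards."""
    last_start = 0
    for i, (_term, start, end) in enumerate(tokens):
        if start < last_start or end < start:
            return i
        last_start = start
    return None


def first_rejected_lenient(tokens: List[Token]) -> Optional[int]:
    """Hypothetical looser rule: only negative or inverted offsets fail."""
    for i, (_term, start, end) in enumerate(tokens):
        if start < 0 or end < start:
            return i
    return None


# Offsets taken from the "hubble space telescope" tokens quoted below.
SYNONYM_FIRST = [
    ("hubble", 38, 44),
    ("syn::hubble space telescope", 38, 60),
    ("space", 45, 50),
    ("telescope", 51, 60),
]
SYNONYM_LAST = [
    ("hubble", 38, 44),
    ("space", 45, 50),
    ("telescope", 51, 60),
    ("syn::hubble space telescope", 38, 60),
]

print(first_rejected_strict(SYNONYM_FIRST))   # None
print(first_rejected_strict(SYNONYM_LAST))    # 3
print(first_rejected_lenient(SYNONYM_LAST))   # None
```

Under the strict model the synonym passes only when it is emitted
together with the first token of its group; emitted last, it is
rejected at index 3 because 38 < 51. The lenient model accepts both
orderings.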

Looks like I should create an issue...

On Thu, Aug 6, 2020 at 11:28 AM Gus Heck <gus.h...@gmail.com> wrote:
>
> I've had a nearly identical experience to what Dave describes; I also chafe
> under this restriction.
>
> On Thu, Aug 6, 2020 at 11:07 AM David Smiley <dsmi...@apache.org> wrote:
>>
>> I sympathize with your pain, Roman.
>>
>> It appears we can't really do index-time multi-word synonyms because of the 
>> offset ordering rule.  But it's not just synonyms, it's other forms of 
>> multi-token expansion.  Where I work, I've seen an interesting approach to 
>> mixed language text analysis in which a sophisticated Tokenizer effectively 
>> re-tokenizes an input multiple ways by producing a token stream that is a 
>> concatenation of different interpretations of the input.  On a Lucene 
>> upgrade, we had to "coarsen" the offsets to the point of having highlights 
>> that point to a whole sentence instead of the words in that sentence :-(.  I 
>> need to do something to fix this; I'm trying hard to resist modifying our 
>> Lucene fork for this constraint.  Maybe instead of concatenating, it might 
>> be interleaved / overlapped but the interpretations aren't necessarily 
>> aligned to make this possible without risking breaking position-sensitive 
>> queries.
>>
>> So... I'm not a fan of this constraint on offsets.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <roman.ch...@gmail.com> wrote:
>>>
>>> Hi Mike,
>>>
>>> Yes, they are not zero offsets - I was instinctively avoiding
>>> "negative offsets"; but they are indeed backward offsets.
>>>
>>> Here is the token stream as produced by the analyzer chain indexing
>>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>>>
>>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>>>
>>> Sometimes, we'll even have a situation when synonyms overlap: for
>>> example "anti de sitter space time"
>>>
>>> "anti de sitter space time" -> "antidesitter space" (one token
>>> spanning offsets 0-26; it gets emitted with the first token "anti"
>>> right now)
>>> "space time" -> "spacetime" (synonym 16-26)
>>> "space" -> "universe" (25-26)
>>>
>>> Yes, weird, but useful if people want to search for `universe NEAR
>>> anti` -- but another usecase which would be prohibited by the "new"
>>> rule.
>>>
>>> DefaultIndexingChain checks each new token's offset against the last
>>> emitted token, so I don't see a way to emit the multi-token synonym
>>> with offsets spanning multiple tokens if even one of those tokens was
>>> already emitted. And the converse is equally true: if the multi-token
>>> synonym is emitted last in the group, it trips over
>>> `startOffset < invertState.lastStartOffset`
>>>
>>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>>
>>>
>>>   -roman
>>>
>>> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>>> <luc...@mikemccandless.com> wrote:
>>> >
>>> > Hi Roman,
>>> >
>>> > Hmm, this is all very tricky!
>>> >
>>> > First off, why do you call this "zero offsets"?  Isn't it "backwards 
>>> > offsets" that your analysis chain is trying to produce?
>>> >
>>> > Second, in your first example, if you output the tokens in the right 
>>> > order, they would not violate the "offsets do not go backwards" check in 
>>> > IndexWriter?  I thought IndexWriter is just checking that the startOffset 
>>> > for a token is not lower than the previous token's startOffset?  (And 
>>> > that the token's endOffset is not lower than its startOffset).
>>> >
>>> > So I am confused why your first example is tripping up on IW's offset
>>> > checks.  Could you maybe redo the example, listing a single token per line
>>> > with the start/end offsets they are producing?
>>> >
>>> > Mike McCandless
>>> >
>>> > http://blog.mikemccandless.com
>>> >
>>> >
>>> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> wrote:
>>> >>
>>> >> Hello devs,
>>> >>
>>> >> I wanted to create an issue but the helpful message in red letters
>>> >> reminded me to ask first.
>>> >>
>>> >> While porting from lucene 6.x to 7x I'm struggling with a change that
>>> >> was introduced in LUCENE-7626
>>> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>>> >>
>>> >> It is believed that zero offset tokens are bad, bad - Mike McCandless
>>> >> made the change, which made me automatically doubt myself. I must be
>>> >> wrong, hell, I was living in sin the past 5 years!
>>> >>
>>> >> Sadly, we have been indexing and searching large volumes of data
>>> >> without any corruption in the index whatsoever, but also without this
>>> >> new change:
>>> >>
>>> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>>> >>
>>> >> With that change, our multi-token synonyms house of cards is falling.
>>> >>
>>> >> Mike has this wonderful blogpost explaining troubles with multi-token 
>>> >> synonyms:
>>> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>> >>
>>> >> Recommended way to index multi-token synonyms appears to be this:
>>> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>>> >>
>>> >> BUT, but! We don't want to place multi-token synonym into the same
>>> >> position as the other words. We want to preserve their positions! We
>>> >> want to preserve information about offsets!
>>> >>
>>> >> Here is an example:
>>> >>
>>> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>>> >>
>>> >> This is how it gets indexed
>>> >>
>>> >> [(0, []),
>>> >> (1, ['acr::hubble']),
>>> >> (2, ['constant']),
>>> >> (3, ['summary']),
>>> >> (4, []),
>>> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
>>> >> (6, ['acr::space', 'space']),
>>> >> (7, ['acr::telescope', 'telescope']),
>>> >> (8, ['program'])]
>>> >>
>>> >> Notice the position 5 - multi-token synonym `syn::hubble space
>>> >> telescope` token is on the first token which started the group
>>> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>>> >> index the 'hubble' word there.
>>> >>
>>> >> If you were to search for the phrase "HST program", it will be found
>>> >> because our search parser will search for ("HST ? ? program" | "Hubble
>>> >> Space Telescope program")
>>> >>
>>> >> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>>> >>
>>> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>>> >> other problem that Mike described -- "hst space" phrase search will
>>> >> NOT find this paper (and that is a correct behaviour)
>>> >>
>>> >> But all of this is possible only because lucene was indexing tokens
>>> >> with offsets that can be lower than the last emitted token; for
>>> >> example 'hubble space telescope' will have offset 21-45; and the next
>>> >> emitted token "space" will have offset 28-33
>>> >>
>>> >> And it just works (lucene 6.x)
>>> >>
>>> >> Here is another proof with the appropriate verbiage ("crazy"):
>>> >>
>>> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>>> >>
>>> >> Zero offsets have been working wonderfully for us so far. And I
>>> >> actually cannot imagine how it can work without them - i.e. without
>>> >> the ability to emit a token stream with offsets that are lower than
>>> >> the last seen token.
>>> >>
>>> >> I haven't tried SynonymFlatten filter, but because of this line in the
>>> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>>> >> to do what we need (as seen in the example above)
>>> >>
>>> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>> >>
>>> >> What would you say? Is it a bug, is it not a bug but just some special
>>> >> usecase? If it is a special usecase, what do we need to do? Plug in
>>> >> our own indexing chain?
>>> >>
>>> >> Thanks!
>>> >>
>>> >>   -roman
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> >>
>>>
>>>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)

