Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Roman Chyla Thu, 06 Aug 2020 07:50:10 -0700

Hi Mike,

Yes, they are not zero offsets - I was instinctively avoiding
"negative offsets"; but they are indeed backward offsets.


Here is the token stream as produced by the analyzer chain indexing
"THE HUBBLE constant: a summary of the hubble space telescope program"

term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
term=constant pos=1 type=word offsetStart=11 offsetEnd=20
term=summary pos=1 type=word offsetStart=23 offsetEnd=30
term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
term=space pos=1 type=word offsetStart=45 offsetEnd=50
term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
term=program pos=1 type=word offsetStart=61 offsetEnd=68
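
As a rough illustration (not part of the analyzer chain itself), the sketch below turns the dump above into absolute positions. It assumes that `pos` is a position increment and that the position counter starts at -1 and is advanced by each token's increment, which is how Lucene accumulates positions. The names `TOKENS`, `group_by_position` and `POSITIONS` are hypothetical and exist only for this example.

```python
# Token dump copied from the listing above:
# (term, position increment, startOffset, endOffset)
TOKENS = [
    ("hubble", 2, 4, 10),
    ("acr::hubble", 0, 4, 10),
    ("constant", 1, 11, 20),
    ("summary", 1, 23, 30),
    ("hubble", 1, 38, 44),
    ("syn::hubble space telescope", 0, 38, 60),
    ("syn::hst", 0, 38, 60),
    ("acr::hst", 0, 38, 60),
    ("space", 1, 45, 50),
    ("telescope", 1, 51, 60),
    ("program", 1, 61, 68),
]


def group_by_position(tokens):
    """Accumulate position increments and group terms by absolute position.

    Assumption: the counter starts at -1, so a first token with
    increment 1 would land on position 0.
    """
    position = -1
    grouped = {}
    for term, increment, _start, _end in tokens:
        position += increment
        grouped.setdefault(position, []).append(term)
    return grouped


POSITIONS = group_by_position(TOKENS)
for pos, terms in sorted(POSITIONS.items()):
    print(pos, terms)
```

An increment of 0 stacks a token on the position of the token before it, which is why the synonyms share a position with the second "hubble" while "space" and "telescope" keep positions of their own.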

Sometimes, we'll even have a situation when synonyms overlap: for
example "anti de sitter space time"

"anti de sitter space time" -> "antidesitter space" (one token
spanning offsets 0-26; it gets emitted with the first token "anti"
right now)
"space time" -> "spacetime" (synonym 16-26)
"space" -> "universe" (25-26)

Yes, weird, but useful if people want to search for `universe NEAR
anti` -- and it is another use case that would be prohibited by the
"new" rule.

DefaultIndexingChain checks each new token's start offset against that
of the last emitted token, so I don't see a way to emit the multi-token
synonym with offsets spanning multiple tokens if even one of those
tokens was already emitted. And the converse is equally true: if the
multi-token synonym is emitted last in the group, it trips over
`startOffset < invertState.lastStartOffset`

https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
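
The check discussed above can be modelled in a few lines. This is a simplified sketch and not Lucene's code: it mirrors only the two conditions mentioned in this thread (a token's start offset must not be lower than the previous token's start offset, and its end offset must not be lower than its start offset) and ignores everything else the indexing chain does. The function name `first_offset_violation` and the two sample streams are hypothetical.

```python
def first_offset_violation(tokens, last_start_offset=0):
    """Return the index of the first token that fails the modelled check.

    tokens is a list of (term, startOffset, endOffset). Returns None
    when every token passes.
    """
    for index, (_term, start, end) in enumerate(tokens):
        if start < last_start_offset or end < start:
            return index
        last_start_offset = start
    return None


# The synonym is emitted together with the first token of the group.
SYNONYM_FIRST = [
    ("hubble", 38, 44),
    ("syn::hubble space telescope", 38, 60),
    ("space", 45, 50),
    ("telescope", 51, 60),
    ("program", 61, 68),
]

# The synonym is emitted after the last token of the group.
SYNONYM_LAST = [
    ("hubble", 38, 44),
    ("space", 45, 50),
    ("telescope", 51, 60),
    ("syn::hubble space telescope", 38, 60),
    ("program", 61, 68),
]

print(first_offset_violation(SYNONYM_FIRST))
print(first_offset_violation(SYNONYM_LAST))
```

Under this model the first stream passes, because its start offsets never decrease. The second stream is rejected at index 3: the synonym's start offset of 38 is lower than the start offset of 51 carried by "telescope", the token emitted just before it.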


  -roman

On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
<luc...@mikemccandless.com> wrote:
>
> Hi Roman,
>
> Hmm, this is all very tricky!
>
> First off, why do you call this "zero offsets"?  Isn't it "backwards offsets" 
> that your analysis chain is trying to produce?
>
> Second, in your first example, if you output the tokens in the right order, 
> they would not violate the "offsets do not go backwards" check in 
> IndexWriter?  I thought IndexWriter is just checking that the startOffset for 
> a token is not lower than the previous token's startOffset?  (And that the 
> token's endOffset is not lower than its startOffset).
>
> So I am confused why your first example is tripping up on IW's offset checks. 
>  Could you maybe redo the example, listing single token per line with the 
> start/end offsets they are producing?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> wrote:
>>
>> Hello devs,
>>
>> I wanted to create an issue but the helpful message in red letters
>> reminded me to ask first.
>>
>> While porting from Lucene 6.x to 7.x I'm struggling with a change that
>> was introduced in LUCENE-7626
>> (https://issues.apache.org/jira/browse/LUCENE-7626)
>>
>> It is believed that zero offset tokens are bad bad - Mike McCandless
>> made the change which made me automatically doubt myself. I must be
>> wrong, hell, I was living in sin the past 5 years!
>>
>> Sadly, we have been indexing and searching large volumes of data
>> without any corruption in the index whatsoever, but also without this new
>> change:
>>
>> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>>
>> With that change, our multi-token synonyms house of cards is falling.
>>
>> Mike has this wonderful blogpost explaining troubles with multi-token 
>> synonyms:
>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>
>> Recommended way to index multi-token synonyms appears to be this:
>> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>>
>> BUT, but! We don't want to place multi-token synonym into the same
>> position as the other words. We want to preserve their positions! We
>> want to preserve information about offsets!
>>
>> Here is an example:
>>
>> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>>
>> This is how it gets indexed
>>
>> [(0, []),
>> (1, ['acr::hubble']),
>> (2, ['constant']),
>> (3, ['summary']),
>> (4, []),
>> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
>> (6, ['acr::space', 'space']),
>> (7, ['acr::telescope', 'telescope']),
>> (8, ['program'])]
>>
>> Notice the position 5 - multi-token synonym `syn::hubble space
>> telescope` token is on the first token which started the group
>> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> index the 'hubble' word there.
>>
>> If you were to search for the phrase "HST program", it would be found
>> because our search parser will search for ("HST ? ? program" | "Hubble
>> Space Telescope program")
>>
>> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>>
>> And because of those funny 'syn::' prefixes, we don't suffer from the
>> other problem that Mike described -- "hst space" phrase search will
>> NOT find this paper (and that is a correct behaviour)
>>
>> But all of this is possible only because lucene was indexing tokens
>> with offsets that can be lower than the last emitted token; for
>> example 'hubble space telescope' will have offset 21-45; and the next
>> emitted token "space" will have offset 28-33
>>
>> And it just works (lucene 6.x)
>>
>> Here is another proof with the appropriate verbiage ("crazy"):
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>>
>> Zero offsets have been working wonderfully for us so far. And I
>> actually cannot imagine how it can work without them - i.e. without
>> the ability to emit a token stream with offsets that are lower than
>> the last seen token.
>>
>> I haven't tried SynonymFlatten filter, but because of this line in the
>> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>> to do what we need (as seen in the example above)
>>
>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>
>> What would you say? Is it a bug, is it not a bug but just some special
>> usecase? If it is a special usecase, what do we need to do? Plug in
>> our own indexing chain?
>>
>> Thanks!
>>
>>   -roman
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
