Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Roman Chyla Mon, 10 Aug 2020 15:28:22 -0700

Oh, thanks! That saves everybody some time. I have commented in there,
pleading to be allowed to do something - if that proposal sounds even
a little bit reasonable, please consider amplifying the signal.


On Mon, Aug 10, 2020 at 4:22 PM David Smiley <dsmi...@apache.org> wrote:
>
> There already is one: https://issues.apache.org/jira/browse/LUCENE-8776
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Aug 10, 2020 at 1:30 PM Roman Chyla <roman.ch...@gmail.com> wrote:
>>
>> I'll have to find a solution for this situation somehow; giving up
>> offsets seems like too big a price to pay. I see that overriding
>> DefaultIndexingChain is not exactly easy -- the only thing I can think
>> of is to trick the classloader into giving it a different version
>> of the chain (praying this can be done without compromising security,
>> I have not followed JDK evolutions for some time...) - aside from
>> forking Lucene and editing that, which I decidedly don't want to do
>> (monkey-patching it, ok, I can live with that... :-))
>>
>> It *seems* to me that the original reason for negative offset checks
>> stemmed from the fact that vint could have been written (and possibly
>> vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738
>>
>> The underlying issue and some of the patches seem to have been
>> addressing those problems, but a much shorter version of the patch was
>> committed -- despite the perf results not being indicative (i.e. it
>> could have been good with the longer patch). To really
>> understand it, one would have to spend more than 10 minutes reading the
>> comments.
>>
>> Further to the point, I think negative offsets can be produced only on
>> the very first token, unless there is a bug in a filter (there was/is
>> a separate check for that in 6.x and perhaps it is still there in 7.x).
>> Checking only for that would be much less restrictive than the current
>> condition, which disallows all backward offsets. We never ran into index
>> corruption in Lucene 4.x-6.x, so I really wonder if the "forbid all
>> backward offsets" approach might be too restrictive.
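The less restrictive condition suggested above could look roughly like the sketch below. It is written in Python for brevity and is not Lucene code: `passes_relaxed_check` is a hypothetical name, and keeping the "end must not precede start" rule is an assumption, since the message only talks about negative offsets.

```python
from typing import List, Tuple

# One (startOffset, endOffset) pair per emitted token.
Offsets = Tuple[int, int]


def passes_relaxed_check(stream: List[Offsets]) -> bool:
    """Accept backward offsets; reject only negative or inverted ones.

    Assumption: this mirrors the proposal in the message, not any check
    that exists in Lucene.
    """
    return all(start >= 0 and end >= start for start, end in stream)


# A multi-token synonym (38-60) emitted after the words it spans:
# the start offset steps back from 51 to 38, which this check allows.
backward = [(38, 44), (45, 50), (51, 60), (38, 60)]

# A genuinely broken stream: the first token starts before the text does.
negative = [(-1, 5), (6, 10)]

print(passes_relaxed_check(backward))
print(passes_relaxed_check(negative))
```

Under this rule the first stream is accepted and the second is rejected, so a filter bug that produces a negative offset would still be caught.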
>>
>> Looks like I should create an issue...
>>
>> On Thu, Aug 6, 2020 at 11:28 AM Gus Heck <gus.h...@gmail.com> wrote:
>> >
>> > I've had a nearly identical experience to what Dave describes; I also
>> > chafe under this restriction.
>> >
>> > On Thu, Aug 6, 2020 at 11:07 AM David Smiley <dsmi...@apache.org> wrote:
>> >>
>> >> I sympathize with your pain, Roman.
>> >>
>> >> It appears we can't really do index-time multi-word synonyms because of 
>> >> the offset ordering rule.  But it's not just synonyms, it's other forms 
>> >> of multi-token expansion.  Where I work, I've seen an interesting 
>> >> approach to mixed language text analysis in which a sophisticated 
>> >> Tokenizer effectively re-tokenizes an input multiple ways by producing a 
>> >> token stream that is a concatenation of different interpretations of the 
>> >> input.  On a Lucene upgrade, we had to "coarsen" the offsets to the point 
>> >> of having highlights that point to a whole sentence instead of the words 
>> >> in that sentence :-(.  I need to do something to fix this; I'm trying 
>> >> hard to resist modifying our Lucene fork for this constraint.  Maybe 
>> >> instead of concatenating, it might be interleaved / overlapped but the 
>> >> interpretations aren't necessarily aligned to make this possible without 
>> >> risking breaking position-sensitive queries.
>> >>
>> >> So... I'm not a fan of this constraint on offsets.
>> >>
>> >> ~ David Smiley
>> >> Apache Lucene/Solr Search Developer
>> >> http://www.linkedin.com/in/davidwsmiley
>> >>
>> >>
>> >> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <roman.ch...@gmail.com> wrote:
>> >>>
>> >>> Hi Mike,
>> >>>
>> >>> Yes, they are not zero offsets - I was instinctively avoiding
>> >>> "negative offsets"; but they are indeed backward offsets.
>> >>>
>> >>> Here is the token stream as produced by the analyzer chain indexing
>> >>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>> >>>
>> >>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> >>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> >>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> >>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> >>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> >>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>> >>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> >>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> >>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
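For readers less familiar with position increments, the dump above can be folded into absolute positions with a few lines of Python. This is only an illustrative sketch: `stack_by_position` is a hypothetical helper, and starting the counter at -1 is an assumption about how the first increment is applied.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (term, position increment) pairs copied from the dump above.
stream: List[Tuple[str, int]] = [
    ("hubble", 2),
    ("acr::hubble", 0),
    ("constant", 1),
    ("summary", 1),
    ("hubble", 1),
    ("syn::hubble space telescope", 0),
    ("syn::hst", 0),
    ("acr::hst", 0),
    ("space", 1),
    ("telescope", 1),
    ("program", 1),
]


def stack_by_position(tokens: List[Tuple[str, int]]) -> Dict[int, List[str]]:
    """Group terms by absolute position; increment 0 stacks on the previous token."""
    positions: Dict[int, List[str]] = defaultdict(list)
    position = -1  # assumption: counting starts just before the first token
    for term, increment in tokens:
        position += increment
        positions[position].append(term)
    return dict(positions)


for position, terms in sorted(stack_by_position(stream).items()):
    print(position, terms)
```

Tokens with `pos=0` land on the same position as the token before them, which is how the synonyms and acronyms end up stacked on the first word of their group.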
>> >>>
>> >>> Sometimes, we'll even have a situation when synonyms overlap: for
>> >>> example "anti de sitter space time"
>> >>>
>> >>> "anti de sitter space time" -> "antidesitter space" (one token
>> >>> spanning offsets 0-26; it gets emitted with the first token "anti"
>> >>> right now)
>> >>> "space time" -> "spacetime" (synonym 16-26)
>> >>> "space" -> "universe" (25-26)
>> >>>
>> >>> Yes, weird, but useful if people want to search for `universe NEAR
>> >>> anti` -- and yet another use case that is prohibited by the "new"
>> >>> rule.
>> >>>
>> >>> DefaultIndexingChain checks each new token's offset against the last
>> >>> emitted token, so I don't see a way to emit the multi-token synonym with
>> >>> offsets spanning multiple tokens if even one of these tokens was
>> >>> already emitted. And the complement is equally true: if the multi-token
>> >>> synonym is emitted as the last of the group, it trips over
>> >>> `startOffset < invertState.lastStartOffset`
>> >>>
>> >>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
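The ordering rule discussed here can be modelled in a few lines. The sketch below is Python, not the Lucene source: `Token` and `first_offset_violation` are hypothetical names, the condition is the one quoted in this thread (start offset lower than the previous start offset, or end offset lower than start offset), and the offsets are borrowed from the dump earlier in this message.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    term: str
    start: int
    end: int


def first_offset_violation(tokens: List[Token]) -> Optional[Token]:
    """Return the first token that breaks the ordering rule, or None."""
    last_start = 0
    for tok in tokens:
        # Rule as quoted in the thread: a start offset may not be lower than
        # the previous token's start offset, and end may not precede start.
        if tok.start < last_start or tok.end < tok.start:
            return tok
        last_start = tok.start
    return None


# Synonym emitted together with the first token of the group.
synonym_first = [
    Token("hubble", 38, 44),
    Token("syn::hubble space telescope", 38, 60),
    Token("space", 45, 50),
    Token("telescope", 51, 60),
]

# Same tokens, but the synonym is emitted after the last token of the group.
synonym_last = [
    Token("hubble", 38, 44),
    Token("space", 45, 50),
    Token("telescope", 51, 60),
    Token("syn::hubble space telescope", 38, 60),
]

print(first_offset_violation(synonym_first))
print(first_offset_violation(synonym_last))
```

In this model the first ordering passes, while in the second the synonym token is reported because its start offset (38) is lower than the start offset of "telescope" (51).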
>> >>>
>> >>>
>> >>>   -roman
>> >>>
>> >>> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>> >>> <luc...@mikemccandless.com> wrote:
>> >>> >
>> >>> > Hi Roman,
>> >>> >
>> >>> > Hmm, this is all very tricky!
>> >>> >
>> >>> > First off, why do you call this "zero offsets"?  Isn't it "backwards 
>> >>> > offsets" that your analysis chain is trying to produce?
>> >>> >
>> >>> > Second, in your first example, if you output the tokens in the right 
>> >>> > order, they would not violate the "offsets do not go backwards" check 
>> >>> > in IndexWriter?  I thought IndexWriter is just checking that the 
>> >>> > startOffset for a token is not lower than the previous token's 
>> >>> > startOffset?  (And that the token's endOffset is not lower than its 
>> >>> > startOffset).
>> >>> >
>> >>> > So I am confused why your first example is tripping up on IW's offset
>> >>> > checks.  Could you maybe redo the example, listing a single token per
>> >>> > line with the start/end offsets they are producing?
>> >>> >
>> >>> > Mike McCandless
>> >>> >
>> >>> > http://blog.mikemccandless.com
>> >>> >
>> >>> >
>> >>> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> 
>> >>> > wrote:
>> >>> >>
>> >>> >> Hello devs,
>> >>> >>
>> >>> >> I wanted to create an issue but the helpful message in red letters
>> >>> >> reminded me to ask first.
>> >>> >>
>> >>> >> While porting from Lucene 6.x to 7.x I'm struggling with a change that
>> >>> >> was introduced in LUCENE-7626
>> >>> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>> >>> >>
>> >>> >> It is believed that zero offset tokens are bad, bad - Mike McCandless
>> >>> >> made the change, which made me automatically doubt myself. I must be
>> >>> >> wrong, hell, I was living in sin the past 5 years!
>> >>> >>
>> >>> >> Sadly, we have been indexing and searching large volumes of data
>> >>> >> without any index corruption whatsoever, but also without this new
>> >>> >> change:
>> >>> >>
>> >>> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>> >>> >>
>> >>> >> With that change, our multi-token synonyms house of cards is falling.
>> >>> >>
>> >>> >> Mike has this wonderful blogpost explaining troubles with multi-token 
>> >>> >> synonyms:
>> >>> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>> >>> >>
>> >>> >> Recommended way to index multi-token synonyms appears to be this:
>> >>> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>> >>> >>
>> >>> >> BUT, but! We don't want to place the multi-token synonym into the same
>> >>> >> position as the other words. We want to preserve their positions! We
>> >>> >> want to preserve information about offsets!
>> >>> >>
>> >>> >> Here is an example:
>> >>> >>
>> >>> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>> >>> >>
>> >>> >> This is how it gets indexed
>> >>> >>
>> >>> >> [(0, []),
>> >>> >> (1, ['acr::hubble']),
>> >>> >> (2, ['constant']),
>> >>> >> (3, ['summary']),
>> >>> >> (4, []),
>> >>> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
>> >>> >> (6, ['acr::space', 'space']),
>> >>> >> (7, ['acr::telescope', 'telescope']),
>> >>> >> (8, ['program'])]
>> >>> >>
>> >>> >> Notice position 5 - the multi-token synonym `syn::hubble space
>> >>> >> telescope` is placed on the first token of the group it covers
>> >>> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> >>> >> index the 'hubble' word there.
>> >>> >>
>> >>> >> If you were to search for the phrase "HST program", it will be found
>> >>> >> because our search parser will search for ("HST ? ? program" | "Hubble
>> >>> >> Space Telescope program")
>> >>> >>
>> >>> >> It simply found that by looking at synonyms: HST -> Hubble Space 
>> >>> >> Telescope
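The expansion described above can be sketched as follows. This is a simplified Python illustration, not the actual search parser: `SYNONYMS` and `expand_phrase` are hypothetical, and `?` stands for "any token at this position".

```python
from typing import Dict, List

# Hypothetical synonym table: short form -> the multi-token phrase it stands for.
SYNONYMS: Dict[str, List[str]] = {"hst": ["hubble", "space", "telescope"]}


def expand_phrase(phrase: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the padded short-form phrase and the spelled-out phrase."""
    padded: List[str] = []
    spelled_out: List[str] = []
    changed = False
    for word in phrase.lower().split():
        long_form = synonyms.get(word)
        if long_form is None:
            padded.append(word)
            spelled_out.append(word)
            continue
        changed = True
        # The synonym is indexed on the first position of the group, so the
        # remaining positions of the group are matched with "?" placeholders.
        padded.extend([word] + ["?"] * (len(long_form) - 1))
        spelled_out.extend(long_form)
    if not changed:
        return [" ".join(padded)]
    return [" ".join(padded), " ".join(spelled_out)]


print(expand_phrase("HST program", SYNONYMS))
# prints ['hst ? ? program', 'hubble space telescope program']
```

The two placeholders cover the positions of "space" and "telescope", so the padded phrase lines up with "program" three positions after the synonym token.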
>> >>> >>
>> >>> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>> >>> >> other problem that Mike described -- "hst space" phrase search will
>> >>> >> NOT find this paper (and that is a correct behaviour)
>> >>> >>
>> >>> >> But all of this is possible only because Lucene was indexing tokens
>> >>> >> with offsets that can be lower than the last emitted token; for
>> >>> >> example 'hubble space telescope' will have offset 21-45; and the next
>> >>> >> emitted token "space" will have offset 28-33
>> >>> >>
>> >>> >> And it just works (lucene 6.x)
>> >>> >>
>> >>> >> Here is another proof with the appropriate verbiage ("crazy"):
>> >>> >>
>> >>> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>> >>> >>
>> >>> >> Zero offsets have been working wonderfully for us so far. And I
>> >>> >> actually cannot imagine how it can work without them - i.e. without
>> >>> >> the ability to emit a token stream with offsets that are lower than
>> >>> >> the last seen token.
>> >>> >>
>> >>> >> I haven't tried the SynonymFlatten filter, but because of this line in
>> >>> >> DefaultIndexingChain I'm convinced the flatten filter is not going
>> >>> >> to do what we need (as seen in the example above)
>> >>> >>
>> >>> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >>> >>
>> >>> >> What would you say? Is it a bug, or is it not a bug but just some special
>> >>> >> use case? If it is a special use case, what do we need to do? Plug in
>> >>> >> our own indexing chain?
>> >>> >>
>> >>> >> Thanks!
>> >>> >>
>> >>> >>   -roman
>> >>> >>
>> >>> >> ---------------------------------------------------------------------
>> >>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>> >>
>> >
>> >
>> > --
>> > http://www.needhamsoftware.com (work)
>> > http://www.the111shift.com (play)
