Hi Mike,

I'm sorry - the problem all along has been related to a word-delimiter filter factory. This is embarrassing, but I have to admit it publicly and self-flagellate.
A word-delimiter filter is used to split tokens; those tokens are then used to find multi-token synonyms (hence the connection). In my desire to simplify, I omitted that detail in my first email. I went and generated the stack trace:

```
assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
    "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program"));
```

stage:indexer term=xxxxxxxxxx603 pos=1 type=word offsetStart=0 offsetEnd=13
stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3
stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10
stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20
stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30
stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50
stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68

That one worked; only the next one failed:

```
assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
    "title", "MIT and anti de sitter space-time"));
```

stage:indexer term=xxxxxxxxxx604 pos=1 type=word offsetStart=0 offsetEnd=13
stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3
stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3
stage:indexer term=syn::massachusetts institute of technology pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12
stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15
stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22
stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28
stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33
stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33

```
325677 ERROR (TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40]) [ ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id 605 to the index; possible analysis error: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for field 'title'
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:242)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$2(DistributedUpdateProcessor.java:1082)
    at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
    at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
    at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:125)
    at org.apache.solr.util.TestHarness.update(TestHarness.java:285)
    at org.apache.solr.util.BaseTestHarness.checkUpdateStatus(BaseTestHarness.java:274)
    at org.apache.solr.util.BaseTestHarness.validateUpdate(BaseTestHarness.java:244)
    at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:874)
    at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:853)
    at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:847)
    at org.apache.solr.analysis.TestAdsabsTypeFulltextParsing.setUp(TestAdsabsTypeFulltextParsing.java:223)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
    at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for field 'title'
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:823)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1616)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1608)
    at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:969)
    at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
    ... 61 more
```

Embarrassingly Yours,

Roman

On Mon, Aug 17, 2020 at 10:39 AM Michael McCandless <luc...@mikemccandless.com> wrote:
>
> Hi Roman,
>
> Can you share the full exception / stack trace that IndexWriter throws on
> that one *'d token in your first example? I thought IndexWriter checks 1)
> startOffset >= last token's startOffset, and 2) endOffset >= startOffset for
> the current token.
>
> But you seem to be hitting an exception due to an endOffset check across tokens,
> which I didn't remember/realize IW was enforcing.
>
> Could you share a small standalone test case showing the first example?
> Maybe attach it to the issue
> (http://issues.apache.org/jira/browse/LUCENE-8776)?
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla <roman.ch...@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> Thanks for the question! And sorry for the delay, I haven't managed to
>> get to it yesterday. I have generated better output, marked with (*)
>> where it currently fails the first time, and also included one extra
>> case to illustrate the PositionLength attribute.
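For reference, the rule being tripped can be sketched standalone in plain Java; the class and method names below are made up for illustration, and only the condition mirrors the check applied in DefaultIndexingChain.invert():

```java
// Illustrative, minimal re-creation of the offset rule that the trace above
// reports; OffsetValidator is a made-up name, not a Lucene class.
final class OffsetValidator {
    private int lastStartOffset = 0;

    void accept(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset || startOffset < lastStartOffset) {
            throw new IllegalArgumentException(
                "startOffset must be non-negative, and endOffset must be >= startOffset, "
                    + "and offsets must not go backwards startOffset=" + startOffset
                    + ",endOffset=" + endOffset + ",lastStartOffset=" + lastStartOffset);
        }
        lastStartOffset = startOffset;
    }
}

public class OffsetCheckDemo {
    public static void main(String[] args) {
        OffsetValidator v = new OffsetValidator();
        v.accept(23, 28); // space
        v.accept(29, 33); // time
        try {
            v.accept(23, 33); // spacetime (posInc=0): startOffset jumps back to 23
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Feeding it the tail of the id=605 stream (space at 23-28, time at 29-33, then spacetime at 23-33) reproduces the exact numbers from the trace, startOffset=23 against lastStartOffset=29.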
>>
>> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
>>     "title", "THE HUBBLE constant: a summary of the hubble space telescope program"));
>>
>> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
>> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
>> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
>> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
>> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
>> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
>> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
>> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
>> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
>> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
>> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68
>>
>> * - fails because of offsetEnd < lastToken.offsetEnd; if reordered
>> (the multi-token synonym emitted as the last token) it would fail as
>> well, because of the check for lastToken.beginOffset <
>> currentToken.beginOffset. Basically, any reordering would result in a
>> failure (unless offsets are trimmed).
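The "any reordering fails" point can be checked mechanically. A plain-Java sketch (no Lucene; firstViolation is a made-up helper that applies only the "offsets must not go backwards" startOffset rule) shows that emitting the multi-token synonym after its component tokens trips the check:

```java
public class ReorderDemo {
    // Returns the index of the first (start, end) pair whose startOffset goes
    // backwards relative to the previous token, or -1 if none does.
    // Made-up helper: it mirrors only the startOffset monotonicity rule.
    static int firstViolation(int[][] offsets) {
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i++) {
            if (offsets[i][0] < lastStart) {
                return i;
            }
            lastStart = offsets[i][0];
        }
        return -1;
    }

    public static void main(String[] args) {
        // The "hubble space telescope" group in its emitted order (synonyms first):
        int[][] emitted = {{38, 44}, {38, 60}, {38, 60}, {38, 60}, {45, 50}, {51, 60}, {61, 68}};
        // Same group with the multi-token synonym moved after its components:
        int[][] reordered = {{38, 44}, {45, 50}, {51, 60}, {38, 60}, {61, 68}};
        System.out.println(firstViolation(emitted));   // -> -1, startOffsets never decrease
        System.out.println(firstViolation(reordered)); // -> 3, the synonym jumps back to 38
    }
}
```

So with this rule alone, synonym-first ordering passes and synonym-last ordering fails at the synonym token.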
>>
>>
>> The following example has an additional twist because of `space-time`;
>> the tokenizer first splits the word and generates two new tokens --
>> those alternative tokens are then used to find synonyms (space ==
>> universe)
>>
>> assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
>>     "title", "MIT and anti de sitter space-time"));
>>
>> term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
>> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
>> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> term=syn::massachusetts institute of technology posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
>> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
>> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
>> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
>> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
>> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
>> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
>> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
>> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
>> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
>> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
>> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
>> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
>> term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33
>>
>> So far, all of these cases could be handled with the new position
>> length attribute. But let us look at a case where that would fail too.
>>
>> assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
>>     "title", "Massachusetts Institute of Technology and antidesitter space-time"));
>>
>> term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
>> term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
>> term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
>> term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
>> term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
>> term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
>> term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
>> term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
>> term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
>> term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
>> term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
>> term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
>> term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
>> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
>> term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
>> term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
>>
>> Notice the posLen=4 of MIT; it would cover the tokens `massachusetts
>> institute technology antidesitter` while the offsets are still correct.
>>
>> This would, I think, affect not only highlighting but also search
>> (which is, at least for us, more important). But I can imagine that in
>> more NLP-related domains, the ability to identify the source of a
>> transformation could be more than a highlighting problem.
>>
>> Admittedly, most users would not care to notice, but it might be
>> important to some.
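The position arithmetic behind that claim can be verified standalone. A plain-Java sketch (no Lucene; the token list is hand-copied from the dump above, and coveredBy is a made-up helper) recomputes absolute positions from posInc and lists which word tokens fall inside syn::mit's posLen=4 span:

```java
import java.util.ArrayList;
import java.util.List;

public class PosLenDemo {
    // Returns the non-synonym tokens whose position spans fall entirely inside
    // the span of the named synonym token. Each entry is {term, posInc, posLen};
    // a token at absolute position p with posLen n spans [p, p + n).
    static List<String> coveredBy(String synTerm, Object[][] tokens) {
        int pos = 0;
        int synStart = -1;
        int synEnd = -1;
        for (Object[] t : tokens) {           // first pass: locate the synonym's span
            pos += (int) t[1];
            if (synTerm.equals(t[0])) {
                synStart = pos;
                synEnd = pos + (int) t[2];
            }
        }
        List<String> covered = new ArrayList<>();
        pos = 0;
        for (Object[] t : tokens) {           // second pass: collect covered words
            pos += (int) t[1];
            String term = (String) t[0];
            if (!term.startsWith("syn::") && pos >= synStart && pos + (int) t[2] <= synEnd) {
                covered.add(term);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Word tokens of example 606 plus the syn::mit token, from the dump above:
        Object[][] tokens = {
            {"massachusetts", 1, 1},
            {"syn::mit", 0, 4},
            {"institute", 1, 1},
            {"technology", 1, 1},
            {"antidesitter", 1, 1},
            {"space", 1, 1},
            {"time", 1, 1},
        };
        System.out.println(coveredBy("syn::mit", tokens));
        // -> [massachusetts, institute, technology, antidesitter]
    }
}
```

The span of syn::mit indeed reaches past `technology` and swallows `antidesitter`, even though every offset on its own is valid.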
>> Fundamentally, I think, the problem translates to an
>> inability to reconstruct the DAG (under certain circumstances)
>> because of the lost pieces of information.
>>
>> ~roman
>>
>> On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>> >
>> > Hi Roman,
>> >
>> > Sorry for the late reply!
>> >
>> > I think there remains substantial confusion about multi-token synonyms and
>> > IW's enforcement of offsets. It really is worth thoroughly
>> > iterating/understanding your examples so we can get to the bottom of this.
>> > It looks to me it is possible to emit tokens whose offsets do not go
>> > backwards and that properly model your example synonyms, so I do not yet
>> > see what the problem is. Maybe I am being blind/tired ...
>> >
>> > What do you mean by pos=2, pos=0, etc.? I think that is really the
>> > position increment? Can you re-do the examples with posInc instead?
>> > (Alternatively, you could keep "pos" but make it the absolute position,
>> > not the increment?)
>> >
>> > Could you also add posLength to each token? This helps (me?) visualize
>> > the resulting graph, even though IW does not enforce it today.
>> >
>> > Looking at your first example, "THE HUBBLE constant: a summary of the
>> > hubble space telescope program", it looks to me like those tokens would
>> > all be accepted by IW's checks as they are? startOffset never goes
>> > backwards, and for every token, endOffset >= startOffset. Where in that
>> > first example does IW throw an exception? Maybe insert a "** IW fails
>> > here" under the problematic token? Or, maybe write a simple test case
>> > using e.g. CannedTokenStream?
>> >
>> > Your second example should also be fine, and not at all weird, but could
>> > you enumerate it into the specific tokens with posInc, posLength,
>> > start/end offset, "** IW fails here", etc., so we have a concrete example
>> > to discuss?
>> >
>> > Lucene's TokenStreams are really serializing a directed acyclic graph
>> > (DAG), in a specific order, one transition at a time.
>> > Ironically/strangely, it is similar to the graph that git history
>> > maintains, and how "git log" then serializes that graph into an ordered
>> > series of transitions. The simple int position in Lucene's TokenStream
>> > corresponds to git's githashes, to uniquely identify each "node", though
>> > I do not think there is an analog in git to Lucene's offsets. Hmm, maybe
>> > a timestamp?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <roman.ch...@gmail.com> wrote:
>> >>
>> >> Hi Mike,
>> >>
>> >> Yes, they are not zero offsets - I was instinctively avoiding
>> >> "negative offsets"; but they are indeed backward offsets.
>> >>
>> >> Here is the token stream as produced by the analyzer chain indexing
>> >> "THE HUBBLE constant: a summary of the hubble space telescope program"
>> >>
>> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>> >> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> >> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>> >>
>> >> Sometimes, we'll even have a situation when synonyms overlap: for
>> >> example "anti de sitter space time"
>> >>
>> >> "anti de sitter space time" -> "antidesitter space" (one token
>> >> spanning offsets 0-26; it gets emitted with the
>> >> first token "anti" right now)
>> >> "space time" -> "spacetime" (synonym 16-26)
>> >> "space" -> "universe" (25-26)
>> >>
>> >> Yes, weird, but useful if people want to search for `universe NEAR
>> >> anti` -- but another use case which would be prohibited by the "new"
>> >> rule.
>> >>
>> >> DefaultIndexingChain checks the new token's offset against the last emitted
>> >> token, so I don't see a way to emit the multi-token synonym with
>> >> offsets spanning multiple tokens if even one of those tokens was
>> >> already emitted. And the complement is equally true: if the multi-token is
>> >> emitted as the last of the group, it trips over `startOffset <
>> >> invertState.lastStartOffset`
>> >>
>> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >>
>> >> -roman
>> >>
>> >> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>> >> <luc...@mikemccandless.com> wrote:
>> >> >
>> >> > Hi Roman,
>> >> >
>> >> > Hmm, this is all very tricky!
>> >> >
>> >> > First off, why do you call this "zero offsets"? Isn't it "backwards
>> >> > offsets" that your analysis chain is trying to produce?
>> >> >
>> >> > Second, in your first example, if you output the tokens in the right
>> >> > order, they would not violate the "offsets do not go backwards" check
>> >> > in IndexWriter? I thought IndexWriter is just checking that the
>> >> > startOffset for a token is not lower than the previous token's
>> >> > startOffset? (And that the token's endOffset is not lower than its
>> >> > startOffset.)
>> >> >
>> >> > So I am confused why your first example is tripping up on IW's offset
>> >> > checks. Could you maybe redo the example, listing a single token per
>> >> > line with the start/end offsets they are producing?
>> >> >
>> >> > Mike McCandless
>> >> >
>> >> > http://blog.mikemccandless.com
>> >> >
>> >> >
>> >> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> wrote:
>> >> >>
>> >> >> Hello devs,
>> >> >>
>> >> >> I wanted to create an issue, but the helpful message in red letters
>> >> >> reminded me to ask first.
>> >> >>
>> >> >> While porting from Lucene 6.x to 7.x I'm struggling with a change that
>> >> >> was introduced in LUCENE-7626
>> >> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>> >> >>
>> >> >> It is believed that zero offset tokens are bad bad - Mike McCandless
>> >> >> made the change, which made me automatically doubt myself. I must be
>> >> >> wrong; hell, I was living in sin the past 5 years!
>> >> >>
>> >> >> Sadly, we have been indexing and searching large volumes of data
>> >> >> without any corruption in the index whatsoever, but also without this new
>> >> >> change:
>> >> >>
>> >> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>> >> >>
>> >> >> With that change, our multi-token synonym house of cards is falling.
>> >> >>
>> >> >> Mike has this wonderful blog post explaining the troubles with multi-token
>> >> >> synonyms:
>> >> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>> >> >>
>> >> >> The recommended way to index multi-token synonyms appears to be this:
>> >> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>> >> >>
>> >> >> BUT, but! We don't want to place the multi-token synonym into the same
>> >> >> position as the other words. We want to preserve their positions! We
>> >> >> want to preserve information about offsets!
>> >> >>
>> >> >> Here is an example:
>> >> >>
>> >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>> >> >>
>> >> >> This is how it gets indexed:
>> >> >>
>> >> >> [(0, []),
>> >> >> (1, ['acr::hubble']),
>> >> >> (2, ['constant']),
>> >> >> (3, ['summary']),
>> >> >> (4, []),
>> >> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
>> >> >> (6, ['acr::space', 'space']),
>> >> >> (7, ['acr::telescope', 'telescope']),
>> >> >> (8, ['program'])]
>> >> >>
>> >> >> Notice position 5 - the multi-token synonym `syn::hubble space
>> >> >> telescope` token is on the first token which started the group
>> >> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> >> >> index the word 'hubble' there.
>> >> >>
>> >> >> If you were to search for the phrase "HST program", it will be found
>> >> >> because our search parser will search for ("HST ? ? program" | "Hubble
>> >> >> Space Telescope program")
>> >> >>
>> >> >> It simply found that by looking at the synonyms: HST -> Hubble Space
>> >> >> Telescope
>> >> >>
>> >> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>> >> >> other problem that Mike described -- an "hst space" phrase search will
>> >> >> NOT find this paper (and that is correct behaviour)
>> >> >>
>> >> >> But all of this is possible only because Lucene was indexing tokens
>> >> >> with offsets that can be lower than the last emitted token; for
>> >> >> example, 'hubble space telescope' will have offset 21-45, and the next
>> >> >> emitted token "space" will have offset 28-33
>> >> >>
>> >> >> And it just works (Lucene 6.x)
>> >> >>
>> >> >> Here is another proof with the appropriate verbiage ("crazy"):
>> >> >>
>> >> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>> >> >>
>> >> >> Zero offsets have been working wonderfully for us so far.
>> >> >> And I
>> >> >> actually cannot imagine how it can work without them - i.e. without
>> >> >> the ability to emit a token stream with offsets that are lower than
>> >> >> the last seen token.
>> >> >>
>> >> >> I haven't tried the SynonymFlatten filter, but because of this line in
>> >> >> the DefaultIndexingChain, I'm convinced the flatten filter is not going
>> >> >> to do what we need (as seen in the example above):
>> >> >>
>> >> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >> >>
>> >> >> What would you say? Is it a bug? Is it not a bug but just some special
>> >> >> use case? If it is a special use case, what do we need to do? Plug in
>> >> >> our own indexing chain?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> -roman
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org