Hi Mike, Sorry for the delay, I was away last week. Now that I'm back at it, my plan is to write a test for the WordDelimiterFilter and pinpoint the problem.
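Roughly along these lines - a minimal standalone sketch, not our real chain (the WhitespaceTokenizer, the WordDelimiterGraphFilter stand-in and its flags are only placeholders for our actual word-delimiter factory config), just to dump what the delimiter stage emits for "space-time" and flag the spot where startOffset goes backwards:

```
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class WordDelimiterOffsetCheck {

  public static void main(String[] args) throws Exception {
    // the hyphenated token that triggers the "offsets must not go backwards" exception
    String text = "anti de sitter space-time";

    WhitespaceTokenizer source = new WhitespaceTokenizer();
    source.setReader(new StringReader(text));

    // flags are only a guess at our production config: split on the hyphen,
    // keep the parts, and also emit the catenated "spacetime" token
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
        | WordDelimiterGraphFilter.CATENATE_WORDS;
    TokenStream ts = new WordDelimiterGraphFilter(source, flags, null);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);

    ts.reset();
    int lastStart = -1;
    while (ts.incrementToken()) {
      System.out.printf("term=%s posInc=%d offsetStart=%d offsetEnd=%d%n",
          term, posInc.getPositionIncrement(), offsets.startOffset(), offsets.endOffset());
      if (offsets.startOffset() < lastStart) {
        // this is the condition DefaultIndexingChain rejects
        System.out.println("   ^^^ startOffset went backwards here");
      }
      lastStart = offsets.startOffset();
    }
    ts.end();
    ts.close();
  }
}
```

I'll then swap in our actual factory (and the synonym stage that runs behind it) and also feed the stream through TokenStreamToDot as you suggested, to see the token graph we are really producing.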
Cheers, Roman On Thu, Aug 20, 2020 at 11:21 AM Michael McCandless <luc...@mikemccandless.com> wrote: > > Hi Roman, > > No need for anyone to be falling on swords here! This is really complicated > stuff, no worries. And I think we have a compelling plan to move forwards so > that we can index multi-token synonyms AND have 100% correct positional > queries at search time, thanks to Michael Gibney's cool approach on > https://issues.apache.org/jira/browse/LUCENE-4312. > > So it looks like WordDelimiterGraphFilter is producing buggy (out of order > offsets) tokens here? > > Or are you running SynonymGraphFilter after WordDelimiterFilter? > > Looking at that failing example, it should have output that spacetime token > immediately after the space token, not after the time token. > > Maybe use TokenStreamToDot to visualize what the heck token graph you are > getting ... > > Mike McCandless > > http://blog.mikemccandless.com > > > On Tue, Aug 18, 2020 at 9:41 PM Roman Chyla <roman.ch...@gmail.com> wrote: >> >> Hi Mike, >> >> I'm sorry, the problem all along was related to the >> word-delimiter filter factory. This is embarrassing but I have to >> admit publicly and self-flagellate. >> >> A word-delimiter filter is used to split tokens; these are then used >> to find multi-token synonyms (hence the connection). In my desire to >> simplify, I omitted that detail while writing my first email. >> >> I went to generate the stack trace: >> >> ``` >> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603", >> "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE >> TELESCOPE program"));``` >> >> stage:indexer term=xxxxxxxxxx603 pos=1 type=word offsetStart=0 offsetEnd=13 >> stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3 >> stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10 >> stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10 >> stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20 >> stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30 >> stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44 >> stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM >> offsetStart=38 offsetEnd=60 >> stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60 >> stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50 >> stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60 >> stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68 >> >> that worked, only the next one failed: >> >> ```assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604", >> "title", "MIT and anti de sitter space-time"));``` >> >> >> stage:indexer term=xxxxxxxxxx604 pos=1 type=word offsetStart=0 offsetEnd=13 >> stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3 >> stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3 >> stage:indexer term=syn::massachusetts institute of technology pos=0 >> type=SYNONYM offsetStart=0 offsetEnd=3 >> stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3 >> stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12 >> stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28 >> stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM >> offsetStart=8 offsetEnd=28 >> stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM >> offsetStart=8 offsetEnd=28 >> stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15 >>
stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22 >> stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28 >> stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33 >> stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33 >> >> ```325677 ERROR >> (TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40]) >> [ ] o.a.s.h.RequestHandlerBase >> org.apache.solr.common.SolrException: Exception writing document id >> 605 to the index; possible analysis error: startOffset must be >> non-negative, and endOffset must be >= startOffset, and offsets must >> not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for >> field 'title' >> at >> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:242) >> at >> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67) >> at >> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55) >> at >> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002) >> at >> org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233) >> at >> org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$2(DistributedUpdateProcessor.java:1082) >> at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50) >> at >> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082) >> at >> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694) >> at >> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103) >> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261) >> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188) >> at >> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97) >> at >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) >> at >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199) >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551) >> at >> org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:125) >> at org.apache.solr.util.TestHarness.update(TestHarness.java:285) >> at >> org.apache.solr.util.BaseTestHarness.checkUpdateStatus(BaseTestHarness.java:274) >> at >> org.apache.solr.util.BaseTestHarness.validateUpdate(BaseTestHarness.java:244) >> at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:874) >> at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:853) >> at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:847) >> at >> org.apache.solr.analysis.TestAdsabsTypeFulltextParsing.setUp(TestAdsabsTypeFulltextParsing.java:223) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:498) >> at >> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750) >> at >> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972) >> at >> 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988) >> at >> com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57) >> at >> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49) >> at >> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) >> at >> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) >> at >> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) >> at >> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) >> at >> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) >> at >> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368) >> at >> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817) >> at >> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468) >> at >> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947) >> at >> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832) >> at >> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883) >> at >> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894) >> at >> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) >> at >> com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57) >> at >> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) >> at >> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) >> at >> org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41) >> at >> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) >> at >> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) >> at >> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) >> at >> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) >> at >> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) >> at >> org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53) >> at >> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) >> at >> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) >> at >> org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54) >> at >> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) >> at >> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: java.lang.IllegalArgumentException: startOffset must be >> non-negative, and endOffset must be >= startOffset, and offsets must >> not go backwards startOffset=23,endOffset=33,lastStartOffset=29 
for >> field 'title' >> at >> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:823) >> at >> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430) >> at >> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394) >> at >> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251) >> at >> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494) >> at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1616) >> at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1608) >> at >> org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:969) >> at >> org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341) >> at >> org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288) >> at >> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235) >> ... 61 more >> ``` >> >> Embarrassingly Yours, >> >> Roman >> >> >> >> On Mon, Aug 17, 2020 at 10:39 AM Michael McCandless >> <luc...@mikemccandless.com> wrote: >> > >> > Hi Roman, >> > >> > Can you share the full exception / stack trace that IndexWriter throws on >> > that one *'d token in your first example? I thought IndexWriter checks 1) >> > startOffset >= last token's startOffset, and 2) endOffset >= startOffset >> > for the current token. >> > >> > But you seem to be hitting an exception due to endOffset check across >> > tokens, which I didn't remember/realize IW was enforcing. >> > >> > Could you share a small standalone test case showing the first example? >> > Maybe attach it to the issue >> > (http://issues.apache.org/jira/browse/LUCENE-8776)? >> > >> > Thanks, >> > >> > Mike McCandless >> > >> > http://blog.mikemccandless.com >> > >> > >> > On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla <roman.ch...@gmail.com> wrote: >> >> >> >> Hi Mike, >> >> >> >> Thanks for the question! And sorry for the delay, I haven't managed to >> >> get to it yesterday. I have generated better output, marked with (*) >> >> where it currently fails the first time and also included one extra >> >> case to illustrate the PositionLength attribute. 
>> >> >> >> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603", >> >> "title", "THE HUBBLE constant: a summary of the hubble space >> >> telescope program")); >> >> >> >> >> >> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10 >> >> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10 >> >> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20 >> >> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30 >> >> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44 >> >> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM >> >> offsetStart=38 offsetEnd=60 >> >> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60 >> >> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60 >> >> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50 >> >> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60 >> >> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68 >> >> >> >> * - fails because of offsetEnd < lastToken.offsetEnd; If reordered >> >> (the multi-token synonym emitted as a last token) it would fail as >> >> well, because of the check for lastToken.beginOffset < >> >> currentToken.beginOffset. Basically, any reordering would result in a >> >> failure (unless offsets are trimmed). >> >> >> >> >> >> >> >> The following example has additional twist because of `space-time`; >> >> the tokenizer first splits the word and generate two new tokens -- >> >> those alternative tokens are then used to find synonyms (space == >> >> universe) >> >> >> >> assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604", >> >> "title", "MIT and anti de sitter space-time")); >> >> >> >> >> >> term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13 >> >> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3 >> >> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3 >> >> term=syn::massachusetts institute of technology posInc=0 posLen=1 >> >> type=SYNONYM offsetStart=0 offsetEnd=3 >> >> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3 >> >> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3 >> >> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12 >> >> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28 >> >> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28 >> >> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM >> >> offsetStart=8 offsetEnd=28 >> >> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM >> >> offsetStart=8 offsetEnd=28 >> >> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM >> >> offsetStart=8 offsetEnd=28 >> >> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15 >> >> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22 >> >> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28 >> >> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 >> >> offsetEnd=28 >> >> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33 >> >> term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33 >> >> >> >> So far, all of these cases could be handled with the new position >> >> length attribute. But let us look at a case where that would fail too. 
>> >> >> >> assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604", >> >> "title", "Massachusetts Institute of Technology and >> >> antidesitter space-time")); >> >> >> >> >> >> term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12 >> >> term=syn::massachusetts institute of technology posInc=0 posLen=4 >> >> type=SYNONYM offsetStart=0 offsetEnd=36 >> >> term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36 >> >> term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36 >> >> term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22 >> >> term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36 >> >> term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53 >> >> term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59 >> >> term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59 >> >> term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM >> >> offsetStart=41 offsetEnd=59 >> >> term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM >> >> offsetStart=41 offsetEnd=59 >> >> term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM >> >> offsetStart=41 offsetEnd=59 >> >> term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59 >> >> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 >> >> offsetEnd=59 >> >> term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64 >> >> term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64 >> >> >> >> Notice the posLen=4 of MIT; it would cover tokens `massachusetts >> >> institute technology antidesitter` while offsets are still correct. >> >> >> >> This would, I think, affect not only highlighting, but also search >> >> (which is, at least for us, more important). But I can imagine that in >> >> more NLP-related domains, ability to identify the source of a >> >> transformation could be more than a highlighting problem. >> >> >> >> Admittedly, most users would not care to notice, but it might be >> >> important to some. Fundamentally, I think, the problem translates to >> >> inability to reconstruct the DAG graph (under certain circumstances) >> >> because of the lost pieces of information. >> >> >> >> ~roman >> >> >> >> On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless >> >> <luc...@mikemccandless.com> wrote: >> >> > >> >> > Hi Roman, >> >> > >> >> > Sorry for the late reply! >> >> > >> >> > I think there remains substantial confusion about multi-token synonyms >> >> > and IW's enforcement of offsets. It really is worth thoroughly >> >> > iterating/understanding your examples so we can get to the bottom of >> >> > this. It looks to me it is possible to emit tokens whose offsets do >> >> > not go backwards and that properly model your example synonyms, so I do >> >> > not yet see what the problem is. Maybe I am being blind/tired ... >> >> > >> >> > What do you mean by pos=2, pos=0, etc.? I think that is really the >> >> > position increment? Can you re-do the examples with posInc instead? >> >> > (Alternatively, you could keep "pos" but make it the absolute position, >> >> > not the increment?). >> >> > >> >> > Could you also add posLength to each token? This helps (me?) visualize >> >> > the resulting graph, even though IW does not enforce it today. >> >> > >> >> > Looking at your first example, "THE HUBBLE constant: a summary of the >> >> > hubble space telescope program", it looks to me like those tokens would >> >> > all be accepted by IW's checks as they are? 
startOffset never goes >> >> > backwards, and for every token, endOffset >= startOffset. Where in >> >> > that first example does IW throw an exception? Maybe insert a "** IW >> >> > fails here" under the problematic token? Or, maybe write a simple test >> >> > case using e.g. CannedTokenStream? >> >> > >> >> > Your second example should also be fine, and not at all weird, but >> >> > could you enumerate it into the specific tokens with posInc, posLength, >> >> > start/end offset, "** IW fails here", etc., so we have a concrete >> >> > example to discuss? >> >> > >> >> > Lucene's TokenStreams are really serializing a directed acyclic graph >> >> > (DAG), in a specific order, one transition at a time. >> >> > Ironically/strangely, it is similar to the graph that git history >> >> > maintains, and how "git log" then serializes that graph into an ordered >> >> > series of transitions. The simple int position in Lucene's TokenStream >> >> > corresponds to git's githashes, to uniquely identify each "node", >> >> > though, I do not think there is an analog in git to Lucene's offsets. >> >> > Hmm, maybe a timestamp? >> >> > >> >> > Mike McCandless >> >> > >> >> > http://blog.mikemccandless.com >> >> > >> >> > >> >> > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <roman.ch...@gmail.com> >> >> > wrote: >> >> >> >> >> >> Hi Mike, >> >> >> >> >> >> Yes, they are not zero offsets - I was instinctively avoiding >> >> >> "negative offsets"; but they are indeed backward offsets. >> >> >> >> >> >> Here is the token stream as produced by the analyzer chain indexing >> >> >> "THE HUBBLE constant: a summary of the hubble space telescope program" >> >> >> >> >> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10 >> >> >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10 >> >> >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20 >> >> >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30 >> >> >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44 >> >> >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 >> >> >> offsetEnd=60 >> >> >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60 >> >> >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60 >> >> >> term=space pos=1 type=word offsetStart=45 offsetEnd=50 >> >> >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60 >> >> >> term=program pos=1 type=word offsetStart=61 offsetEnd=68 >> >> >> >> >> >> Sometimes, we'll even have a situation where synonyms overlap: for >> >> >> example "anti de sitter space time" >> >> >> >> >> >> "anti de sitter space time" -> "antidesitter space" (one token >> >> >> spanning offsets 0-26; it gets emitted with the first token "anti" >> >> >> right now) >> >> >> "space time" -> "spacetime" (synonym 16-26) >> >> >> "space" -> "universe" (25-26) >> >> >> >> >> >> Yes, weird, but useful if people want to search for `universe NEAR >> >> >> anti` -- but another use case which would be prohibited by the "new" >> >> >> rule. >> >> >> >> >> >> DefaultIndexingChain checks the new token's offset against the last emitted >> >> >> token, so I don't see a way to emit the multi-token synonym with >> >> >> offsets spanning multiple tokens if even one of these tokens was >> >> >> already emitted. And the complement is equally true: if the multi-token synonym is >> >> >> emitted as the last of the group - it trips over `startOffset < >> >> >> invertState.lastStartOffset` >> >> >> >> >> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915 >> >> >> >> >> >> >> >> >> -roman >> >> >> >> >> >> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless >> >> >> <luc...@mikemccandless.com> wrote: >> >> >> > >> >> >> > Hi Roman, >> >> >> > >> >> >> > Hmm, this is all very tricky! >> >> >> > >> >> >> > First off, why do you call this "zero offsets"? Isn't it "backwards >> >> >> > offsets" that your analysis chain is trying to produce? >> >> >> > >> >> >> > Second, in your first example, if you output the tokens in the right >> >> >> > order, they would not violate the "offsets do not go backwards" >> >> >> > check in IndexWriter? I thought IndexWriter is just checking that >> >> >> > the startOffset for a token is not lower than the previous token's >> >> >> > startOffset? (And that the token's endOffset is not lower than its >> >> >> > startOffset). >> >> >> > >> >> >> > So I am confused why your first example is tripping up on IW's >> >> >> > offset checks. Could you maybe redo the example, listing a single >> >> >> > token per line with the start/end offsets they are producing? >> >> >> > >> >> >> > Mike McCandless >> >> >> > >> >> >> > http://blog.mikemccandless.com >> >> >> > >> >> >> > >> >> >> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> >> >> >> > wrote: >> >> >> >> >> >> >> >> Hello devs, >> >> >> >> >> >> >> >> I wanted to create an issue but the helpful message in red letters >> >> >> >> reminded me to ask first. >> >> >> >> >> >> >> >> While porting from lucene 6.x to 7.x I'm struggling with a change >> >> >> >> that >> >> >> >> was introduced in LUCENE-7626 >> >> >> >> (https://issues.apache.org/jira/browse/LUCENE-7626) >> >> >> >> >> >> >> >> It is believed that zero offset tokens are bad bad - Mike McCandless >> >> >> >> made the change which made me automatically doubt myself. I must be >> >> >> >> wrong, hell, I was living in sin the past 5 years! >> >> >> >> >> >> >> >> Sadly, we have been indexing and searching large volumes of data >> >> >> >> without any corruption in the index whatsoever, but also without this new >> >> >> >> change: >> >> >> >> >> >> >> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774 >> >> >> >> >> >> >> >> With that change, our multi-token synonym house of cards is >> >> >> >> falling. >> >> >> >> >> >> >> >> Mike has this wonderful blogpost explaining the troubles with >> >> >> >> multi-token synonyms: >> >> >> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html >> >> >> >> >> >> >> >> The recommended way to index multi-token synonyms appears to be this: >> >> >> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr >> >> >> >> >> >> >> >> BUT, but! We don't want to place a multi-token synonym into the same >> >> >> >> position as the other words. We want to preserve their positions! We >> >> >> >> want to preserve information about offsets!
>> >> >> >> >> >> >> >> Here is an example: >> >> >> >> >> >> >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE >> >> >> >> TELESCOPE program >> >> >> >> >> >> >> >> This is how it gets indexed >> >> >> >> >> >> >> >> [(0, []), >> >> >> >> (1, ['acr::hubble']), >> >> >> >> (2, ['constant']), >> >> >> >> (3, ['summary']), >> >> >> >> (4, []), >> >> >> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', >> >> >> >> 'hubble']), >> >> >> >> (6, ['acr::space', 'space']), >> >> >> >> (7, ['acr::telescope', 'telescope']), >> >> >> >> (8, ['program']), >> >> >> >> >> >> >> >> Notice position 5 - the multi-token synonym token `syn::hubble space >> >> >> >> telescope` is on the first token which started the group >> >> >> >> (emitted by Lucene's synonym filter). hst is another synonym; we >> >> >> >> also >> >> >> >> index the 'hubble' word there. >> >> >> >> >> >> >> >> If you were to search for the phrase "HST program" it will be found >> >> >> >> because our search parser will search for ("HST ? ? program" | >> >> >> >> "Hubble >> >> >> >> Space Telescope program") >> >> >> >> >> >> >> >> It simply found that by looking at synonyms: HST -> Hubble Space >> >> >> >> Telescope >> >> >> >> >> >> >> >> And because of those funny 'syn::' prefixes, we don't suffer from >> >> >> >> the >> >> >> >> other problem that Mike described -- "hst space" phrase search will >> >> >> >> NOT find this paper (and that is the correct behaviour) >> >> >> >> >> >> >> >> But all of this is possible only because lucene was indexing tokens >> >> >> >> with offsets that can be lower than the last emitted token; for >> >> >> >> example 'hubble space telescope' will have offset 21-45; and the next >> >> >> >> emitted token "space" will have offset 28-33 >> >> >> >> >> >> >> >> And it just works (lucene 6.x) >> >> >> >> >> >> >> >> Here is another proof with the appropriate verbiage ("crazy"): >> >> >> >> >> >> >> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618 >> >> >> >> >> >> >> >> Zero offsets have been working wonderfully for us so far. And I >> >> >> >> actually cannot imagine how it can work without them - i.e. without >> >> >> >> the ability to emit a token stream with offsets that are lower than >> >> >> >> the last seen token. >> >> >> >> >> >> >> >> I haven't tried the SynonymFlatten filter, but because of this line in >> >> >> >> the >> >> >> >> DefaultIndexingChain - I'm convinced the flatten filter is not going >> >> >> >> to do what we need (as seen in the example above) >> >> >> >> >> >> >> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915 >> >> >> >> >> >> >> >> What would you say? Is it a bug, is it not a bug but just some >> >> >> >> special >> >> >> >> use case? If it is a special use case, what do we need to do? Plug in >> >> >> >> our own indexing chain? >> >> >> >> >> >> >> >> Thanks! >> >> >> >> >> >> >> >> -roman >> >> >> >> >> >> >> >> --------------------------------------------------------------------- >> >> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org