[ https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041177#comment-16041177 ]
Steve Rowe commented on LUCENE-7854:
------------------------------------

My Jenkins found a reproducing master seed for a {{TestIndexWriterExceptions.testTooManyTokens()}} failure, which {{git bisect}} blames on commit {{d276acfb}} on this issue:

{noformat}
Checking out Revision 1921b61ba8f3c7579bc04975b7ce90167a74e51e (refs/remotes/origin/master)
[...]
   [junit4] Suite: org.apache.lucene.index.TestIndexWriterExceptions
   [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterExceptions -Dtests.method=testTooManyTokens -Dtests.seed=244E0F9076AF909A -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=sl -Dtests.timezone=Pacific/Rarotonga -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE  344s J0 | TestIndexWriterExceptions.testTooManyTokens <<<
   [junit4]    > Throwable #1: junit.framework.AssertionFailedError: Unexpected exception type, expected IllegalArgumentException but got java.lang.ArithmeticException: integer overflow
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([244E0F9076AF909A:C9CA5B1B9805BDB9]:0)
   [junit4]    >    at org.apache.lucene.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2679)
   [junit4]    >    at org.apache.lucene.index.TestIndexWriterExceptions.testTooManyTokens(TestIndexWriterExceptions.java:2047)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]    > Caused by: java.lang.ArithmeticException: integer overflow
   [junit4]    >    at java.lang.Math.addExact(Math.java:790)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:773)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:431)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:393)
   [junit4]    >    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:236)
   [junit4]    >    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1570)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315)
   [junit4]    >    at org.apache.lucene.index.TestIndexWriterExceptions.lambda$testTooManyTokens$22(TestIndexWriterExceptions.java:2048)
   [junit4]    >    at org.apache.lucene.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2674)
[...]
   [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=false): {content6=LM Jelinek-Mercer(0.700000), field=IB LL-D2, content4=DFR I(n)2, contents=DFR I(F)1, content2=DFR I(ne)B3(800.0), content1=LM Jelinek-Mercer(0.100000), id=DFR I(F)L2, content=DFR I(ne)Z(0.3)}, locale=sl, timezone=Pacific/Rarotonga
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=77388832,total=333447168
{noformat}

> Indexing custom term frequencies
> --------------------------------
>
>                 Key: LUCENE-7854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7854
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch,
> LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch
>
>
> When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will
> store just the docID and term frequency (how many times that term occurred in
> that document) for all documents that have a given term.
> We compute that term frequency by counting how many times a given token
> appeared in the field during analysis.
> But it can be useful, in expert use cases, to customize what Lucene stores as
> the term frequency, e.g. to hold custom scoring signals that are a function
> of term and document (this is my use case). Users have also asked for this
> before, e.g. see
> https://stackoverflow.com/questions/26605090/lucene-overwrite-term-frequency-at-index-time.
> One way to do this today is to stuff your custom data into a {{byte[]}}
> payload. But that's quite inefficient, forcing you to index positions, and
> pay the overhead of retrieving payloads at search time.
> Another approach is "token stuffing": just enumerate the same token N times
> where N is the custom number you want to store, but that's also inefficient
> when N gets high.
> I think we can make this simple to do in Lucene. I have a working version,
> using my own custom indexing chain, but the required changes are quite simple
> so I think we can add it to Lucene's default indexing chain?
> I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked
> the indexing chain to use that attribute's value as the term frequency if
> it's present, and if the index options are {{DOCS_AND_FREQS}} for that field.
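
For readers following the failure above: the trace shows {{Math.addExact}} throwing {{ArithmeticException}} inside {{DefaultIndexingChain$PerField.invert}}, while the test expects {{IllegalArgumentException}}. The standalone sketch below is not Lucene code and is not necessarily how the issue was resolved. It only illustrates, under assumed names ({{FieldLengthSketch}} and {{fieldLength}} are hypothetical), how summing per-token term frequencies with an overflow check can surface the exception type the test expects.

```java
// Hypothetical illustration only; none of these names come from Lucene.
final class FieldLengthSketch {

    /**
     * Sums per-token term frequencies into a single field length.
     * Overflow of the int accumulator is reported as IllegalArgumentException
     * rather than letting ArithmeticException escape to the caller.
     */
    static int fieldLength(int[] termFreqs) {
        int length = 0;
        for (int freq : termFreqs) {
            if (freq < 1) {
                throw new IllegalArgumentException(
                    "term frequency must be >= 1, got " + freq);
            }
            try {
                // Math.addExact throws ArithmeticException on int overflow.
                length = Math.addExact(length, freq);
            } catch (ArithmeticException e) {
                throw new IllegalArgumentException(
                    "sum of term frequencies exceeds Integer.MAX_VALUE", e);
            }
        }
        return length;
    }

    public static void main(String[] args) {
        // Three tokens carrying custom frequencies 3, 1 and 5.
        System.out.println(fieldLength(new int[] {3, 1, 5}));

        // A single very large custom frequency plus one more token overflows.
        try {
            fieldLength(new int[] {Integer.MAX_VALUE, 1});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With ordinary token counting the sum grows by one per token, but with custom term frequencies a single token can contribute a large value, so an overflow check on the accumulator becomes much easier to hit.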