[jira] [Commented] (LUCENE-7854) Indexing custom term frequencies

Steve Rowe (JIRA) Wed, 07 Jun 2017 09:34:36 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041177#comment-16041177 ]


Steve Rowe commented on LUCENE-7854:
------------------------------------

My Jenkins found a reproducing seed on master for a 
{{TestIndexWriterExceptions.testTooManyTokens()}} failure, which {{git bisect}} 
blames on commit {{d276acfb}} for this issue:

{noformat}
Checking out Revision 1921b61ba8f3c7579bc04975b7ce90167a74e51e (refs/remotes/origin/master)
[...]
   [junit4] Suite: org.apache.lucene.index.TestIndexWriterExceptions
   [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterExceptions -Dtests.method=testTooManyTokens -Dtests.seed=244E0F9076AF909A -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=sl -Dtests.timezone=Pacific/Rarotonga -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE  344s J0  | TestIndexWriterExceptions.testTooManyTokens <<<
   [junit4]    > Throwable #1: junit.framework.AssertionFailedError: Unexpected exception type, expected IllegalArgumentException but got java.lang.ArithmeticException: integer overflow
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([244E0F9076AF909A:C9CA5B1B9805BDB9]:0)
   [junit4]    >        at org.apache.lucene.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2679)
   [junit4]    >        at org.apache.lucene.index.TestIndexWriterExceptions.testTooManyTokens(TestIndexWriterExceptions.java:2047)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4]    > Caused by: java.lang.ArithmeticException: integer overflow
   [junit4]    >        at java.lang.Math.addExact(Math.java:790)
   [junit4]    >        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:773)
   [junit4]    >        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:431)
   [junit4]    >        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:393)
   [junit4]    >        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:236)
   [junit4]    >        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1570)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315)
   [junit4]    >        at org.apache.lucene.index.TestIndexWriterExceptions.lambda$testTooManyTokens$22(TestIndexWriterExceptions.java:2048)
   [junit4]    >        at org.apache.lucene.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2674)
[...]
   [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=false): {content6=LM Jelinek-Mercer(0.700000), field=IB LL-D2, content4=DFR I(n)2, contents=DFR I(F)1, content2=DFR I(ne)B3(800.0), content1=LM Jelinek-Mercer(0.100000), id=DFR I(F)L2, content=DFR I(ne)Z(0.3)}, locale=sl, timezone=Pacific/Rarotonga
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=77388832,total=333447168
{noformat}
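
To spell out the mismatch in the trace: the test expects an {{IllegalArgumentException}}, but {{Math.addExact}} reports {{int}} overflow with an {{ArithmeticException}}, which reaches {{expectThrows}} unchanged. Below is a minimal, standalone sketch of that behaviour and of one possible way to translate the exception. The class and method names ({{OverflowSketch}}, {{addUnchecked}}, {{addTranslated}}) and the error message are hypothetical; this is not Lucene code and not necessarily how the issue should be fixed.

```java
// Standalone illustration only; the names here are hypothetical, not Lucene's.
class OverflowSketch {

    // Mirrors the failing path in the trace: Math.addExact throws
    // ArithmeticException when the int sum overflows.
    static int addUnchecked(int length, int termFreq) {
        return Math.addExact(length, termFreq);
    }

    // One possible translation (an assumption, not the committed fix):
    // rethrow the overflow as the IllegalArgumentException the test expects.
    static int addTranslated(int length, int termFreq) {
        try {
            return Math.addExact(length, termFreq);
        } catch (ArithmeticException e) {
            throw new IllegalArgumentException("too many tokens for field", e);
        }
    }

    public static void main(String[] args) {
        try {
            addUnchecked(Integer.MAX_VALUE, 1);
        } catch (ArithmeticException e) {
            System.out.println("unchecked: " + e);
        }
        try {
            addTranslated(Integer.MAX_VALUE, 1);
        } catch (IllegalArgumentException e) {
            System.out.println("translated: " + e);
        }
    }
}
```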

> Indexing custom term frequencies
> --------------------------------
>
>                 Key: LUCENE-7854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7854
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, 
> LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch
>
>
> When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will 
> store just the docID and term frequency (how many times that term occurred in 
> that document) for all documents that have a given term.
> We compute that term frequency by counting how many times a given token 
> appeared in the field during analysis.
> But it can be useful, in expert use cases, to customize what Lucene stores as 
> the term frequency, e.g. to hold custom scoring signals that are a function 
> of term and document (this is my use case).  Users have also asked for this 
> before, e.g. see 
> https://stackoverflow.com/questions/26605090/lucene-overwrite-term-frequency-at-index-time.
> One way to do this today is to stuff your custom data into a {{byte[]}} 
> payload.  But that's quite inefficient, forcing you to index positions, and 
> pay the overhead of retrieving payloads at search time.
> Another approach is "token stuffing": just enumerate the same token N times 
> where N is the custom number you want to store, but that's also inefficient 
> when N gets high.
> I think we can make this simple to do in Lucene.  I have a working version, 
> using my own custom indexing chain, but the required changes are quite simple 
> so I think we can add it to Lucene's default indexing chain?
> I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked 
> the indexing chain to use that attribute's value as the term frequency if 
> it's present, and if the index options are {{DOCS_AND_FREQS}} for that field.
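
The quoted description contrasts three ways of arriving at a term frequency: counting tokens, "token stuffing", and supplying the value directly. A rough plain-Java sketch of that contrast follows. It uses no Lucene classes; {{TermFreqSketch}} and its methods are hypothetical stand-ins, not the {{TermDocFrequencyAttribute}} API from the patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; these names are hypothetical, not Lucene's.
class TermFreqSketch {

    // Default behaviour: frequency is the number of times each token occurs.
    static Map<String, Integer> countedFreqs(List<String> tokens) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : tokens) {
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    // "Token stuffing": repeat the token n times so that counting yields n.
    // The work grows linearly with n, which is the inefficiency described.
    static List<String> stuff(String token, int n) {
        return Collections.nCopies(n, token);
    }

    // Custom frequencies: each token carries its own value, so a large
    // frequency costs a single token instead of n of them.
    static Map<String, Integer> customFreqs(String[] tokens, int[] values) {
        Map<String, Integer> freqs = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            freqs.merge(tokens[i], values[i], Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(countedFreqs(Arrays.asList("a", "b", "a")));
        System.out.println(countedFreqs(stuff("signal", 1000)));
        System.out.println(customFreqs(new String[] {"signal"}, new int[] {1000}));
    }
}
```

Both the stuffed and the custom variants end up with the same frequency for {{signal}}; the difference is that stuffing processes 1000 tokens to get there.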



