[ https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-7854:
---------------------------------------
    Attachment: LUCENE-7854.patch

Another iteration, doing the rename [~thetaphi] suggested, and also cleaning up {{PackedTokenAttributeImpl#end}} a bit.

> Indexing custom term frequencies
> --------------------------------
>
>                 Key: LUCENE-7854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7854
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch,
> LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch
>
>
> When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will
> store just the docID and term frequency (how many times that term occurred in
> that document) for all documents that have a given term.
>
> We compute that term frequency by counting how many times a given token
> appeared in the field during analysis.
>
> But it can be useful, in expert use cases, to customize what Lucene stores as
> the term frequency, e.g. to hold custom scoring signals that are a function
> of term and document (this is my use case). Users have also asked for this
> before, e.g. see
> https://stackoverflow.com/questions/26605090/lucene-overwrite-term-frequency-at-index-time.
>
> One way to do this today is to stuff your custom data into a {{byte[]}}
> payload. But that's quite inefficient, forcing you to index positions, and
> pay the overhead of retrieving payloads at search time.
>
> Another approach is "token stuffing": just enumerate the same token N times
> where N is the custom number you want to store, but that's also inefficient
> when N gets high.
>
> I think we can make this simple to do in Lucene. I have a working version,
> using my own custom indexing chain, but the required changes are quite simple
> so I think we can add it to Lucene's default indexing chain?
>
> I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked
> the indexing chain to use that attribute's value as the term frequency if
> it's present, and if the index options are {{DOCS_AND_FREQS}} for that field.
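
To make the trade-off concrete, here is a minimal sketch in plain Java that contrasts "token stuffing" with carrying the frequency on the token itself. It is a toy model, not Lucene code and not the patch: {{CustomTermFreqSketch}}, {{Token}}, {{termFreqs}} and {{stuff}} are hypothetical names invented for illustration. It also assumes that custom frequencies for a repeated term are summed, which the description above does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names are Lucene APIs.
class CustomTermFreqSketch {

    /** Hypothetical token: term text plus an optional custom frequency (null when unset). */
    static final class Token {
        final String term;
        final Integer customFreq;

        Token(String term, Integer customFreq) {
            this.term = term;
            this.customFreq = customFreq;
        }
    }

    /** Computes term -> frequency for a single document's token stream. */
    static Map<String, Integer> termFreqs(List<Token> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (Token t : tokens) {
            // Default behavior: each occurrence counts once.
            // Assumed custom behavior: the value carried by the token is used instead.
            int increment = (t.customFreq == null) ? 1 : t.customFreq;
            if (increment < 1) {
                throw new IllegalArgumentException("frequency must be >= 1, got " + increment);
            }
            // Assumption of this sketch: repeated terms accumulate by summing.
            freqs.merge(t.term, increment, Integer::sum);
        }
        return freqs;
    }

    /** "Token stuffing": emit the same token n times so that plain counting yields n. */
    static List<Token> stuff(String term, int n) {
        List<Token> tokens = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            tokens.add(new Token(term, null));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Token> stuffed = stuff("lucene", 1000);
        List<Token> direct = List.of(new Token("lucene", 1000));
        // Both streams produce the same frequency, but the stuffed one has to
        // push 1000 tokens through analysis while the direct one pushes a single token.
        System.out.println(stuffed.size() + " tokens -> " + termFreqs(stuffed));
        System.out.println(direct.size() + " token  -> " + termFreqs(direct));
    }
}
```

In this model the work done by stuffing grows linearly with N, whereas the direct variant handles one token regardless of N, which is the inefficiency the description points at.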