[jira] [Commented] (LUCENE-7854) Indexing custom term frequencies

Mike Sokolov (JIRA) Fri, 26 May 2017 11:22:18 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026630#comment-16026630
 ]


Mike Sokolov commented on LUCENE-7854:
--------------------------------------

I have wanted this in the past for indexing collaborative-filtering similarity
scores as a sparse matrix. Say you want to index document-document similarity,
i.e. some function SIM(doc1,doc2). The full matrix is far too large to store,
so you record only the top N most similar docs for each document. One way to
store this is to index the LHS document id in one field, and the ids of all
its related documents as terms in another field. You can then use Lucene's
queries to compute weighted averages when you have several documents, and so
on, and to mix in other term constraints. But to represent the SIM function
you really need to be able to set the term frequencies yourself.
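
To make the idea concrete, here is a minimal sketch in plain Java with no
Lucene classes. It keeps the top N related docs for one LHS document and maps
each similarity score to an integer that could serve as a term frequency. The
class and method names, the assumption that SIM lies in [0, 1], and the linear
scaling are all hypothetical choices for illustration, not anything taken from
the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hypothetical names, no Lucene API involved. */
final class SparseSimilaritySketch {

    /** Map a similarity in [0, 1] to an integer frequency of at least 1 (assumed linear scaling). */
    static int toTermFrequency(double sim, int scale) {
        if (sim < 0.0 || sim > 1.0) {
            throw new IllegalArgumentException("sim must be in [0, 1]: " + sim);
        }
        return (int) Math.max(1L, Math.round(sim * scale));
    }

    /** Keep the n most similar docs; returns relatedDocId -> frequency, most similar first. */
    static Map<String, Integer> toFrequencies(Map<String, Double> sims, int n, int scale) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(sims.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue()); // descending by score
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey()); // stable tie-break
        });
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            Map.Entry<String, Double> e = entries.get(i);
            freqs.put(e.getKey(), toTermFrequency(e.getValue(), scale));
        }
        return freqs;
    }

    public static void main(String[] args) {
        // One row of the sparse matrix: SIM(doc1, x) for a few candidate docs.
        Map<String, Double> row = new HashMap<>();
        row.put("doc2", 0.9);
        row.put("doc3", 0.25);
        row.put("doc4", 0.5);
        // Each surviving key would be a term in the "related docs" field of doc1.
        System.out.println(toFrequencies(row, 2, 100));
    }
}
```

Each key in the returned map would become a term in the related-docs field,
and the integer is the value you would want stored as that term's frequency.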

> Indexing custom term frequencies
> --------------------------------
>
>                 Key: LUCENE-7854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7854
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7854.patch
>
>
> When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will
> store just the docID and term frequency (how many times that term occurred in
> that document) for all documents that have a given term.
>
> We compute that term frequency by counting how many times a given token
> appeared in the field during analysis.
>
> But it can be useful, in expert use cases, to customize what Lucene stores as
> the term frequency, e.g. to hold custom scoring signals that are a function
> of term and document (this is my use case). Users have also asked for this
> before, e.g. see
> https://stackoverflow.com/questions/26605090/lucene-overwrite-term-frequency-at-index-time.
>
> One way to do this today is to stuff your custom data into a {{byte[]}}
> payload. But that's quite inefficient, forcing you to index positions, and
> pay the overhead of retrieving payloads at search time.
>
> Another approach is "token stuffing": just enumerate the same token N times
> where N is the custom number you want to store, but that's also inefficient
> when N gets high.
>
> I think we can make this simple to do in Lucene. I have a working version,
> using my own custom indexing chain, but the required changes are quite simple
> so I think we can add it to Lucene's default indexing chain?
>
> I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked
> the indexing chain to use that attribute's value as the term frequency if
> it's present, and if the index options are {{DOCS_AND_FREQS}} for that field.
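
The difference between counting tokens, token stuffing, and a per-token custom
frequency can be illustrated with a small toy model in plain Java. This is a
sketch only: it does not use Lucene's indexing chain or the attribute from the
patch. The Token type, the method names, and the choice to sum custom values
when a term repeats are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of per-document term-frequency accumulation; not Lucene code. */
final class CustomTermFreqSketch {

    /** A token with an optional custom frequency; null means "count this occurrence once". */
    static final class Token {
        final String term;
        final Integer customFreq;

        Token(String term, Integer customFreq) {
            this.term = term;
            this.customFreq = customFreq;
        }
    }

    /** Accumulate term -> frequency for one field of one document. */
    static Map<String, Integer> termFreqs(List<Token> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (Token t : tokens) {
            // Default behaviour: each occurrence adds 1. With a custom value,
            // add that value instead (summing on repeats is an assumption here).
            int increment = (t.customFreq == null) ? 1 : t.customFreq;
            if (increment < 1) {
                throw new IllegalArgumentException("frequency must be >= 1: " + increment);
            }
            freqs.merge(t.term, increment, Integer::sum);
        }
        return freqs;
    }

    /** Token stuffing: emit the same token n times so that plain counting yields n. */
    static List<Token> stuff(String term, int n) {
        return Collections.nCopies(n, new Token(term, null));
    }

    public static void main(String[] args) {
        List<Token> counted = Arrays.asList(
            new Token("a", null), new Token("b", null), new Token("a", null));
        List<Token> stuffed = stuff("doc42", 1000);
        List<Token> custom = Collections.singletonList(new Token("doc42", 1000));

        System.out.println(termFreqs(counted));
        System.out.println(stuffed.size() + " tokens -> " + termFreqs(stuffed));
        System.out.println(custom.size() + " token -> " + termFreqs(custom));
    }
}
```

In this model, stuffing and the custom value produce the same stored frequency,
but stuffing has to process N tokens to get there, which is the inefficiency
the description points out for large N.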



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

