[jira] [Updated] (LUCENE-7854) Indexing custom term frequencies

2017-06-01 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7854:
---
Attachment: LUCENE-7854.patch

Another iteration, doing the rename [~thetaphi] suggested, and also cleaning up 
{{PackedTokenAttributeImpl#end}} a bit.

> Indexing custom term frequencies
> 
>
> Key: LUCENE-7854
> URL: https://issues.apache.org/jira/browse/LUCENE-7854
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0)
>
> Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, 
> LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch
>
>
> When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will 
> store just the docID and term frequency (how many times that term occurred in 
> that document) for all documents that have a given term.
> We compute that term frequency by counting how many times a given token 
> appeared in the field during analysis.
> But it can be useful, in expert use cases, to customize what Lucene stores as 
> the term frequency, e.g. to hold custom scoring signals that are a function 
> of term and document (this is my use case).  Users have also asked for this 
> before, e.g. see 
> https://stackoverflow.com/questions/26605090/lucene-overwrite-term-frequency-at-index-time.
> One way to do this today is to stuff your custom data into a {{byte[]}} 
> payload.  But that's quite inefficient, forcing you to index positions, and 
> pay the overhead of retrieving payloads at search time.
> Another approach is "token stuffing": just enumerate the same token N times 
> where N is the custom number you want to store, but that's also inefficient 
> when N gets high.
> I think we can make this simple to do in Lucene.  I have a working version, 
> using my own custom indexing chain, but the required changes are quite simple 
> so I think we can add it to Lucene's default indexing chain?
> I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked 
> the indexing chain to use that attribute's value as the term frequency if 
> it's present, and if the index options are {{DOCS_AND_FREQS}} for that field.
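To make the comparison in the description concrete, here is a minimal sketch of "token stuffing" versus a per-token custom frequency. It is a toy in-memory model, not Lucene code: {{CustomTermFreqSketch}}, {{Token}}, {{stuffed}}, {{custom}} and {{invert}} are hypothetical names invented for illustration, and the real attribute and indexing chain live in the attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Lucene classes. */
final class CustomTermFreqSketch {

  /** A token plus the frequency it contributes (hypothetical stand-in for a token attribute). */
  static final class Token {
    final String term;
    final int termFreq;

    Token(String term, int termFreq) {
      if (termFreq < 1) {
        throw new IllegalArgumentException("termFreq must be >= 1, got " + termFreq);
      }
      this.term = term;
      this.termFreq = termFreq;
    }
  }

  /** "Token stuffing": repeat the same term n times, each occurrence counting once. */
  static List<Token> stuffed(String term, int n) {
    List<Token> tokens = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      tokens.add(new Token(term, 1));
    }
    return tokens;
  }

  /** Custom frequency: a single token that carries n directly. */
  static List<Token> custom(String term, int n) {
    List<Token> tokens = new ArrayList<>(1);
    tokens.add(new Token(term, n));
    return tokens;
  }

  /** Builds one document's term -> frequency map by summing each token's frequency. */
  static Map<String, Integer> invert(List<Token> tokens) {
    Map<String, Integer> freqs = new LinkedHashMap<>();
    for (Token t : tokens) {
      freqs.merge(t.term, t.termFreq, Integer::sum);
    }
    return freqs;
  }
}
```

In this model both routes store the same frequency, but stuffing has to push N tokens through analysis where the custom frequency needs only one, which is the inefficiency the description points at when N gets high.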



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7854) Indexing custom term frequencies

2017-05-31 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7854:
---
Attachment: LUCENE-7854.patch

New patch, folding in [~jpountz]'s last feedback (thank you!).  I think it's 
ready.

> Indexing custom term frequencies
> 
>
> Key: LUCENE-7854
> URL: https://issues.apache.org/jira/browse/LUCENE-7854
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0)
>
> Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, 
> LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch
>
>






[jira] [Updated] (LUCENE-7854) Indexing custom term frequencies

2017-05-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7854:
---
Attachment: LUCENE-7854.patch

Good catch [~thetaphi], I normalized the output of {{reflectWith}} between the 
two.

> Indexing custom term frequencies
> 
>
> Key: LUCENE-7854
> URL: https://issues.apache.org/jira/browse/LUCENE-7854
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0)
>
> Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, 
> LUCENE-7854.patch, LUCENE-7854.patch
>
>






[jira] [Updated] (LUCENE-7854) Indexing custom term frequencies

2017-05-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7854:
---
Attachment: LUCENE-7854.patch

Whoops, another iteration ;)  Thanks [~thetaphi].

> Indexing custom term frequencies
> 
>
> Key: LUCENE-7854
> URL: https://issues.apache.org/jira/browse/LUCENE-7854
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0)
>
> Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch, 
> LUCENE-7854.patch
>
>






[jira] [Updated] (LUCENE-7854) Indexing custom term frequencies

2017-05-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7854:
---
Attachment: LUCENE-7854.patch

Another iteration:

  * FIS.length is now computed properly

  * The indexing chain uses {{addAttribute}} to get the term freq attribute, 
adding it if it's missing.  The value defaults to 1, and I also implement it in 
{{PackedTokenAttributeImpl}}.
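The get-or-create behavior described above can be sketched roughly as follows. This is a toy stand-in, not Lucene's attribute machinery: {{AddAttributeSketch}}, {{Attributes}} and {{TermFreq}} are hypothetical names, and the {{Supplier}} parameter is an assumption made only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model only: not Lucene's attribute source. */
final class AddAttributeSketch {

  /** Hypothetical term frequency holder; defaults to 1 as described in the comment above. */
  static final class TermFreq {
    int value = 1;
  }

  /** Minimal attribute registry with get-or-create semantics. */
  static final class Attributes {
    private final Map<Class<?>, Object> byType = new HashMap<>();

    /** Returns the existing instance for this type, creating and registering it if missing. */
    <T> T addAttribute(Class<T> type, Supplier<T> factory) {
      return type.cast(byType.computeIfAbsent(type, k -> factory.get()));
    }
  }
}
```

With this shape the consumer never needs a null check: a token stream that never sets the attribute still yields a frequency of 1, and one that does set it shares the same instance with the consumer.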


> Indexing custom term frequencies
> 
>
> Key: LUCENE-7854
> URL: https://issues.apache.org/jira/browse/LUCENE-7854
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0)
>
> Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch
>
>






[jira] [Updated] (LUCENE-7854) Indexing custom term frequencies

2017-05-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7854:
---
Attachment: LUCENE-7854.patch

OK I folded in the feedback (thank you!):

  * Also catch attempts to index with {{IndexOptions.DOCS}} while using a
custom term freq, and throw an exception similar to the one thrown if
you try to index positions

  * For norms, I still increment FIS.length just once for each
custom-term-freq term, and I added a test case that checks the
FieldInvertState.

  * I also added a separate test case for FieldInvertState ... Rob
noticed, and I agree, we don't seem to do a good job directly
testing this important indexing class.

  * Test totalTermFreq (postings and term vectors) too

I think the patch is ready.
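As a rough illustration of the checks listed in this comment, here is a toy sketch. It is not the patch and not Lucene's {{FieldInvertState}}: {{InvertStateSketch}}, its {{Options}} enum and the {{add}} method are hypothetical, and the length accounting follows only what this comment states (one increment per custom-term-freq term); a later iteration on this issue reports that the FIS.length computation was revised, so treat this as a snapshot.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: this is not Lucene's FieldInvertState or indexing chain. */
final class InvertStateSketch {

  /** Hypothetical stand-in for the field's index options. */
  enum Options { DOCS, DOCS_AND_FREQS, DOCS_AND_FREQS_AND_POSITIONS }

  final Options options;
  final Map<String, Integer> freqs = new HashMap<>();
  int length; // number of tokens seen in this field

  InvertStateSketch(Options options) {
    this.options = options;
  }

  /** Records one token, rejecting a custom frequency unless the field indexes docs and freqs only. */
  void add(String term, int termFreq) {
    if (termFreq < 1) {
      throw new IllegalArgumentException("termFreq must be >= 1, got " + termFreq);
    }
    if (termFreq != 1 && options != Options.DOCS_AND_FREQS) {
      throw new IllegalArgumentException(
          "custom term frequency requires DOCS_AND_FREQS, got " + options);
    }
    length++; // once per token, regardless of its custom frequency
    freqs.merge(term, termFreq, Integer::sum);
  }
}
```

A test in the spirit of the ones mentioned above would add a token with a custom frequency of 42, then assert that the stored frequency is 42 while the length grew by one, and that the same call fails for a DOCS-only field.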


> Indexing custom term frequencies
> 
>
> Key: LUCENE-7854
> URL: https://issues.apache.org/jira/browse/LUCENE-7854
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0)
>
> Attachments: LUCENE-7854.patch, LUCENE-7854.patch
>
>






[jira] [Updated] (LUCENE-7854) Indexing custom term frequencies

2017-05-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7854:
---
Attachment: LUCENE-7854.patch

Initial patch; I think it's close.

> Indexing custom term frequencies
> 
>
> Key: LUCENE-7854
> URL: https://issues.apache.org/jira/browse/LUCENE-7854
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0)
>
> Attachments: LUCENE-7854.patch
>
>


