[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918711#comment-16918711
 ] 

David Smiley commented on LUCENE-8403:
--------------------------------------

{quote}I understand the approaches – your approach seems to be a longer term 
solution (I am not sure of the complexity implications though).
{quote}
I don't think it's long term; I expect it's a simple flag to inform CheckIndex 
that it shouldn't run this particular check in this case.  If you want to 
explore this, you might see whether it really is that simple.  The biggest 
part would be a test with a custom format that exercises this flag, to ensure 
CheckIndex doesn't report the index as broken.
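
To make the idea concrete, here is a rough, self-contained sketch of what such
a flag could look like. It is a toy model and does not use Lucene at all: the
class TermVectorCrossCheck, the method check and the parameter
vectorsMayBeFiltered are all hypothetical names, and the "postings term must
be in the vector" rule is modeled only as the issue description states it,
not as a copy of what CheckIndex actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these names exist in Lucene.
final class TermVectorCrossCheck {

    /**
     * Compares the terms indexed for one document field against the terms
     * stored in that document's term vector, and returns the problems found.
     *
     * @param postingsTerms        terms the postings hold for this doc/field
     * @param vectorTerms          terms stored in the term vector
     * @param vectorsMayBeFiltered hypothetical flag: the format declares that
     *                             it may omit terms from the term vector
     */
    static List<String> check(Set<String> postingsTerms,
                              Set<String> vectorTerms,
                              boolean vectorsMayBeFiltered) {
        List<String> problems = new ArrayList<>();

        // Always enforced: a term vector must not contain unknown terms.
        for (String term : new TreeSet<>(vectorTerms)) {
            if (!postingsTerms.contains(term)) {
                problems.add("vector term not in postings: " + term);
            }
        }

        // Skipped when the format says its vectors are filtered.
        if (!vectorsMayBeFiltered) {
            for (String term : new TreeSet<>(postingsTerms)) {
                if (!vectorTerms.contains(term)) {
                    problems.add("postings term missing from vector: " + term);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Set<String> postings = Set.of("quick", "brown", "fox");
        Set<String> filteredVector = Set.of("fox");

        // Without the flag, the two omitted terms are reported.
        System.out.println(check(postings, filteredVector, false));
        // With the flag, the filtered vector passes.
        System.out.println(check(postings, filteredVector, true));
    }
}
```

In this sketch the flag relaxes only the completeness check; a term vector
containing a term that the postings do not have is still reported, so a
filtered format could not hide real corruption behind the flag.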

> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8403
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8403
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Braun
>            Priority: Minor
>         Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to ignore storing certain Terms within a field. This 
> worked; however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs are enabled. So the resulting 
> index is not considered 'valid' by CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?



