[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916687#comment-16916687
 ] 

David Smiley commented on LUCENE-8403:
--------------------------------------

bq. Does it make sense for me to adapt the patch to support pattern based 
filtering?

I think you should discuss the idea here first since there's a blocker.

What I'd most prefer, if [~rcmuir] might approve, is a way for the 
TermVectorsFormat to somehow advertise that its contents do not align with a 
PostingsFormat.  Straw-man: perhaps a {{TermVectorsFormat.isFiltered()}} 
method.  In such a case, CheckIndex could still check that the TVF API works 
(it would call {{CheckIndex.checkFields(tvFields, ...)}}), but it would not 
compare the result to the {{terms()}}; that comparison is the logic gated by 
the {{doSlowChecks}} param in {{testTermVectors()}}.  This would be very 
general and allow all manner of variations a term vector might have from the 
analyzed text.
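
To make the straw-man concrete, here is a rough sketch in plain Java.  Nothing 
in it is real Lucene API: {{isFiltered()}} is the hypothetical method proposed 
above, and {{FilteredCheckSketch}}, {{VectorsFormat}} and {{check}} are 
stand-in names so the snippet compiles on its own.  The cross-check follows 
the wording of the issue description and is not a copy of CheckIndex's logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these types are Lucene classes.
public class FilteredCheckSketch {

    /** Stand-in for a term vectors format that can advertise filtering. */
    public interface VectorsFormat {
        boolean isFiltered(); // hypothetical method from the straw-man
    }

    /**
     * Returns the list of problems found. The vector terms are always
     * sanity-checked on their own (here: no empty terms). The comparison
     * against the postings terms runs only when slow checks are requested
     * and the format does not declare itself filtered.
     */
    public static List<String> check(VectorsFormat format,
                                     Set<String> postingsTerms,
                                     Set<String> vectorTerms,
                                     boolean doSlowChecks) {
        List<String> problems = new ArrayList<>();
        for (String term : vectorTerms) {
            if (term.isEmpty()) {
                problems.add("empty term in vectors");
            }
        }
        if (doSlowChecks && !format.isFiltered()) {
            // Sorted copy so the problem list has a stable order.
            for (String term : new TreeSet<>(postingsTerms)) {
                if (!vectorTerms.contains(term)) {
                    problems.add("postings term missing from vectors: " + term);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Set<String> postings = Set.of("apache", "lucene", "the");
        Set<String> vectors = Set.of("apache", "lucene"); // "the" was dropped
        System.out.println(check(() -> false, postings, vectors, true));
        System.out.println(check(() -> true, postings, vectors, true));
    }
}
```

With an unfiltered format the dropped term is reported; with a filtered one 
only the standalone checks run, so the same data passes.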

A less general approach is one akin to Hoss's suggestion that the TVF 
advertises *which* terms are consistent.  Not as a list, though, which is way 
too inflexible; more like a callback method such as 
{{TermVectorsFormat.acceptsTerm(BytesRef)}}.
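
A rough sketch of how such a callback could drive the comparison, again in 
plain Java and not real Lucene API: {{acceptsTerm}} is the hypothetical 
callback named above, modelled here as a {{Predicate<String>}} (a String 
stands in for BytesRef), and {{AcceptsTermSketch}} and {{missingTerms}} are 
made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch; none of these types are Lucene classes.
public class AcceptsTermSketch {

    /**
     * Returns the postings terms that the format claims to store
     * (acceptsTerm is true) but that are absent from the term vectors.
     * Terms the format rejects are allowed to be missing.
     */
    public static List<String> missingTerms(Predicate<String> acceptsTerm,
                                            List<String> postingsTerms,
                                            Set<String> vectorTerms) {
        List<String> missing = new ArrayList<>();
        for (String term : postingsTerms) {
            if (acceptsTerm.test(term) && !vectorTerms.contains(term)) {
                missing.add(term);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Example policy (an assumption for illustration): keep only terms
        // longer than three characters.
        Predicate<String> acceptsTerm = term -> term.length() > 3;
        List<String> postings = List.of("apache", "lucene", "the");

        System.out.println(missingTerms(acceptsTerm, postings, Set.of("apache", "lucene")));
        System.out.println(missingTerms(acceptsTerm, postings, Set.of("apache")));
    }
}
```

Compared with a single filtered flag, this keeps the check strict for every 
term the format says it stores, at the cost of one callback per term.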

I don't think IndexWriterConfig should be modified; this feature is too 
expert to warrant that.

Atri, curious, what exactly was the error message string thrown by CheckIndex 
for a filtered term?

Side note: TermVectorsWriter's API is dated; it ought to look more like 
Postings writing.  I have some old notes on a plan to tackle that.

> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8403
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8403
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Braun
>            Priority: Minor
>         Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to skip storing certain terms within a field. This 
> worked; however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs are enabled. So the result is 
> not 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
