[
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916316#comment-16916316
]
David Smiley commented on LUCENE-8403:
--------------------------------------
Atri, I appreciate the effort you put into this, but your patch wouldn't work
for the use case that inspired this feature request. The terms to be omitted
from the term vector are identified by a pattern, not by a fixed,
predetermined list. For example, imagine filtering out all terms that start or
end with a special character.
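As a rough illustration of the pattern-based case, here is a minimal Java sketch of a predicate that decides which terms to leave out of the term vector. It is plain Java and deliberately uses no Lucene API. The class name, the SPECIAL_EDGE pattern, and the definition of "special character" (anything that is not a Unicode letter or digit) are assumptions made for this example, not something taken from the patch or from any codec.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Lucene or the attached patch.
final class TermVectorFilterSketch {

    // Assumption: a "special character" is anything that is not a Unicode letter or digit.
    // The pattern matches a whole term whose first or last character is special.
    static final Pattern SPECIAL_EDGE =
        Pattern.compile("^[^\\p{L}\\p{N}].*|.*[^\\p{L}\\p{N}]$", Pattern.DOTALL);

    // True for terms that should be omitted from the term vector.
    static final Predicate<String> OMIT = term -> SPECIAL_EDGE.matcher(term).matches();

    // Returns the terms that would still be stored, preserving input order.
    static List<String> termsToStore(List<String> terms) {
        return terms.stream().filter(OMIT.negate()).collect(Collectors.toList());
    }
}
```

For example, termsToStore(List.of("lucene", "#tag", "end$", "mid-dle")) keeps "lucene" and "mid-dle". The set of omitted terms is open-ended here, which is why a fixed list of terms cannot express it.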

But this issue is stuck until the concern Robert raises about CheckIndex is
addressed. I don't recall exactly where in CheckIndex.java it complains, but
try it out on your patch to see. Since the tests automatically run CheckIndex
on a randomized basis, I suspect your patch will eventually fail given enough
iterations.
> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>
> Key: LUCENE-8403
> URL: https://issues.apache.org/jira/browse/LUCENE-8403
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael Braun
> Priority: Minor
> Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years
> ago.
> In order to optimize term vector storage, we may not actually need all tokens
> to be present in the term vectors - and if so, ideally our codec could just
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and
> TermVectorsWriter to skip storing certain terms within a field. This
> worked; however, CheckIndex checks that the terms present in the standard
> postings are also present in the TVs if TVs are enabled, so the resulting
> index is not 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of
> tokens that should not be stored (benefits: less storage, more optimal
> retrieval per doc)? Is this valuable to the wider community? Is there a way
> we can design this to not break CheckIndex's contract while at the same time
> lessening storage for unneeded tokens?
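To make the conflict described above concrete, the following is a simplified Java model of the consistency rule as the issue description states it: every term in a document's postings must also appear in its term vector. This is not CheckIndex's actual code. The class and method names are hypothetical, and terms are modelled as plain strings rather than Lucene data structures.

```java
import java.util.Set;
import java.util.TreeSet;

// Simplified, hypothetical model of the cross-check; not taken from CheckIndex.java.
final class CrossCheckSketch {

    // Returns the postings terms that are absent from the term vector.
    // An empty result means the document is consistent under this model.
    static Set<String> missingFromTermVector(Set<String> postingsTerms,
                                             Set<String> termVectorTerms) {
        Set<String> missing = new TreeSet<>(postingsTerms);
        missing.removeAll(termVectorTerms);
        return missing;
    }
}
```

Under this model, a document indexed with the terms "lucene" and "#tag" whose term vector stores only "lucene" reports "#tag" as missing, which is the kind of mismatch a filtered term vector would produce.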