[ https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916316#comment-16916316 ]
David Smiley commented on LUCENE-8403:
--------------------------------------

Atri, I appreciate you put some effort into this but your patch wouldn't work for the use case that inspired the creation of this feature-request. The terms to be omitted by the term vector are matchable by a pattern; it's not a fixed pre-determined list. For example imagine filtering all terms that start or end with a special character.

But this issue is stuck without addressing the concern Robert raises -- CheckIndex. I don't recall the particulars of where in CheckIndex.java it complains but try it out on your patch to see. Given randomized checkIndex usage automatically within tests, I suspect your patch will ultimately fail given enough iterations.

> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8403
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8403
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Braun
>            Priority: Minor
>         Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years
> ago.
> In order to optimize term vector storage, we may not actually need all tokens
> to be present in the term vectors - and if so, ideally our codec could just
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and
> TermVectorsWriter to ignore storing certain Terms within a field. This
> worked, however, CheckIndex checks that the terms present in the standard
> postings are also present in the TVs, if TVs enabled. So this then doesn't
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of
> tokens that should not be stored (benefits: less storage, more optimal
> retrieval per doc)? Is this valuable to the wider community?
> Is there a way we can design this to not break CheckIndex's contract while at
> the same time lessening storage for unneeded tokens?

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
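
To make the "matchable by a pattern" point in the comment concrete, here is a minimal sketch of a pattern-based term filter, as opposed to a fixed list of terms. It is plain Java and does not use any Lucene API. The class name, the SPECIAL_EDGE pattern, and the rule "a special character is anything that is not a letter or digit" are all assumptions made for illustration. It only shows how the set of omitted terms could be decided by a predicate; it does not address the CheckIndex concern.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only: not part of Lucene and not the patch under discussion.
public class TermVectorFilterSketch {

    // Assumed rule: a term is omitted if its first or last character
    // is neither a letter nor a digit.
    static final Pattern SPECIAL_EDGE =
        Pattern.compile("^[^\\p{L}\\p{N}].*|.*[^\\p{L}\\p{N}]$", Pattern.DOTALL);

    // True for terms that would still be stored in the term vector.
    static final Predicate<String> KEEP_IN_TERM_VECTOR =
        term -> !SPECIAL_EDGE.matcher(term).matches();

    // Returns the terms that survive the filter, in their original order.
    static List<String> filterTerms(List<String> terms) {
        return terms.stream()
            .filter(KEEP_IN_TERM_VECTOR)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("lucene", "#tag", "price$", "term-vector", "_hidden");
        // "#tag", "price$" and "_hidden" start or end with a special character.
        System.out.println(filterTerms(terms)); // prints [lucene, term-vector]
    }
}
```

Because the decision is a predicate over the term text, the omitted terms cannot be enumerated ahead of time, which is why a patch built around a fixed list would not cover this use case.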