[ https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918913#comment-16918913 ]
David Smiley commented on LUCENE-8403:
--------------------------------------

RE a separate field: That's a valid approach, yes. I/Michael should have acknowledged that up front. However, the trade-off is that it would mean analyzing the text all over again[1], and mucking with the higher-level features to use a separate field for the term vector (e.g. in a highlighter).

[1]: It'd be neat if somehow one IndexableField could _listen_ for analysis events processed from another field. It's probably possible to hack something up that works today, assuming you know the order of fields. This might be used not only for populating term vectors but also for populating SortedSetDocValues sourced from analyzed terms.

RE "There's no good technical reason to introduce a layering violation". It's debatable whether term vectors need to be seen as "layered". I understand that you do, and hence your strong opposition to the proposal here.

> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8403
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8403
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Braun
>            Priority: Minor
>         Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years
> ago.
> In order to optimize term vector storage, we may not actually need all tokens
> to be present in the term vectors - and if so, ideally our codec could just
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and
> TermVectorsWriter to ignore storing certain Terms within a field. This
> worked, however, CheckIndex checks that the terms present in the standard
> postings are also present in the TVs, if TVs enabled. So this then doesn't
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of
> tokens that should not be stored (benefits: less storage, more optimal
> retrieval per doc)? Is this valuable to the wider community? Is there a way
> we can design this to not break CheckIndex's contract while at the same time
> lessening storage for unneeded tokens?
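
For illustration only, the footnote's idea of letting a second consumer "listen" to a single analysis pass could look roughly like the plain-Java sketch below. Nothing here is a Lucene API: `TokenListener`, `analyzeOnce`, `FilteredTermVector` and `DistinctTerms` are hypothetical names, the tokenizer is a toy stand-in for a real analyzer, and the length-based filter is an arbitrary example. It only shows the shape of the trade-off: the text is analyzed once, and both a filtered term-vector-like structure and a sorted set of distinct terms (loosely analogous to SortedSetDocValues) are populated from the same token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

final class AnalysisListenerSketch {

    /** Hypothetical callback for one analyzed token; not a Lucene interface. */
    interface TokenListener {
        void onToken(String term, int position);
    }

    /**
     * Toy stand-in for an analyzer: lowercases, splits on runs of
     * non-letter/non-digit characters, and notifies every listener once per token.
     */
    static void analyzeOnce(String text, List<TokenListener> listeners) {
        int position = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue; // split() yields a leading empty string if the text starts with a delimiter
            }
            for (TokenListener listener : listeners) {
                listener.onToken(term, position);
            }
            position++;
        }
    }

    /** Keeps term -> positions only for terms accepted by the filter (a "filtered term vector"). */
    static final class FilteredTermVector implements TokenListener {
        private final Predicate<String> keep;
        final SortedMap<String, List<Integer>> positions = new TreeMap<>();

        FilteredTermVector(Predicate<String> keep) {
            this.keep = keep;
        }

        @Override
        public void onToken(String term, int position) {
            if (keep.test(term)) {
                positions.computeIfAbsent(term, t -> new ArrayList<>()).add(position);
            }
        }
    }

    /** Collects the distinct analyzed terms in sorted order. */
    static final class DistinctTerms implements TokenListener {
        final SortedSet<String> terms = new TreeSet<>();

        @Override
        public void onToken(String term, int position) {
            terms.add(term);
        }
    }

    public static void main(String[] args) {
        // Assumed filter for the example: only terms longer than three characters are kept.
        FilteredTermVector vector = new FilteredTermVector(term -> term.length() > 3);
        DistinctTerms distinct = new DistinctTerms();
        analyzeOnce("The quick fox; the QUICK dog", List.of(vector, distinct));
        System.out.println(vector.positions);
        System.out.println(distinct.terms);
    }
}
```

In this sketch the filter lives in the listener, not in the analysis chain, so the unfiltered consumer still sees every token. That mirrors the situation described in the issue, where the postings keep all terms while the term vector stores only a subset.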