[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

David Smiley (Jira) Thu, 29 Aug 2019 12:54:08 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918913#comment-16918913
 ]


David Smiley commented on LUCENE-8403:
--------------------------------------

RE a separate field:  That's a valid approach, yes.  Michael and I should 
have acknowledged that up front.  However, the trade-off is that it would 
mean analyzing the text all over again[1], and mucking with the higher-level 
features to use a separate field for the term vector (e.g. in a highlighter).

[1]: It'd be neat if somehow one IndexableField could _listen_ for analysis 
events processed from another field.  It's probably possible to hack something 
up that works today assuming you know the order of fields.  This might be used 
not only for populating term vectors but also for populating SortedSetDocValues 
sourced from analyzed terms.
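
To make the "listen for analysis events" idea concrete, here is a minimal 
sketch in plain Java.  It is not Lucene API: TokenListener, 
FilteredTermVector, DistinctTerms and SinglePassAnalysis are all hypothetical 
names, and the tokenizer is a toy stand-in for a real Analyzer.  The point is 
only that the text is analyzed once and each token is handed to several 
consumers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical callback: receives each token produced by a single analysis pass.
interface TokenListener {
    void onToken(String term, int position);
}

// Hypothetical stand-in for a "filtered" term vector: records positions
// only for the terms accepted by the filter.
final class FilteredTermVector implements TokenListener {
    private final Predicate<String> keep;
    final Map<String, List<Integer>> positions = new TreeMap<>();

    FilteredTermVector(Predicate<String> keep) {
        this.keep = keep;
    }

    @Override
    public void onToken(String term, int position) {
        if (keep.test(term)) {
            positions.computeIfAbsent(term, t -> new ArrayList<>()).add(position);
        }
    }
}

// Hypothetical stand-in for doc values sourced from analyzed terms:
// the sorted set of distinct terms.
final class DistinctTerms implements TokenListener {
    final Set<String> terms = new TreeSet<>();

    @Override
    public void onToken(String term, int position) {
        terms.add(term);
    }
}

final class SinglePassAnalysis {
    // Toy analysis: lowercase, split on anything that is not a letter or digit,
    // and dispatch every token to every listener so the text is analyzed once.
    static void analyze(String text, List<TokenListener> listeners) {
        int position = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty first element on a leading delimiter
            }
            for (TokenListener listener : listeners) {
                listener.onToken(term, position);
            }
            position++;
        }
    }
}
```

A caller would register both listeners and run 
SinglePassAnalysis.analyze(text, List.of(termVector, distinctTerms)).  In 
Lucene itself the hard part is the one mentioned above: knowing the order in 
which fields are processed.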

RE "There's no good technical reason to introduce a layering violation":  
It's debatable whether term vectors need to be seen as "layered".  I 
understand that you do, hence your strong opposition to the proposal here.

> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8403
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8403
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Braun
>            Priority: Minor
>         Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to ignore storing certain Terms within a field. This 
> worked; however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs are enabled. So the resulting 
> index is not 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?


