[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Michael Gibney (Jira) Thu, 08 Jul 2021 14:20:08 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377594#comment-17377594
 ]


Michael Gibney commented on LUCENE-10023:
-----------------------------------------

{quote}pushing the trappiness onto them is a good thing
{quote}

I think that's a reasonable perspective. That said, I was motivated to add the 
functionality here because I see the same questions being raised in 
Elasticsearch and Solr forums (and presumably it'd be useful in other contexts 
as well). It's also a discrete enough change that it made sense to factor the 
common functionality out to the appropriate place in the software stack.

{quote}I'm especially worried about search servers that would expose this 
blindly without such checks, and using configured analyzers that wouldn't have 
limit-filters and so on.
{quote}

I could see this concern cutting either way. A default limit (perhaps 
configurable via FieldType?) could be enforced with negligible overhead 
directly in IndexingChain, which would protect users against even a buggy 
TokenStream. I definitely see how careless use of this feature could be 
problematic. But even _without_ built-in limits, as long as the feature is not 
enabled by default, a user would have to enable it explicitly, which they would 
presumably do only in response to a specific need for the functionality it 
supports.
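
To make the "negligible overhead" point concrete, here is a minimal sketch of 
a fail-fast token cap. It is a toy model in plain Java, not Lucene code and not 
what the PR does: the class name, the method name, and the default limit of 3 
are all hypothetical, and a plain Iterator stands in for a TokenStream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of a per-field token cap; hypothetical names, not Lucene API. */
final class TokenLimitSketch {

    /** Hypothetical default cap, kept tiny so the demo below can exceed it. */
    static final int DEFAULT_MAX_TOKENS = 3;

    /**
     * Consumes tokens one at a time, as an indexing chain would, and fails
     * fast as soon as the cap would be exceeded. The added cost is one
     * integer comparison per token.
     */
    static List<String> collectDocValuesTerms(Iterator<String> tokens, int maxTokens) {
        List<String> terms = new ArrayList<>();
        while (tokens.hasNext()) {
            String term = tokens.next();
            if (terms.size() >= maxTokens) {
                throw new IllegalArgumentException(
                    "too many tokens for docValues: limit is " + maxTokens);
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Within the cap: all terms are collected.
        System.out.println(collectDocValuesTerms(
            List.of("call", "me", "ishmael").iterator(), DEFAULT_MAX_TOKENS));

        // Over the cap: rejected at the first excess token.
        try {
            collectDocValuesTerms(
                List.of("a", "b", "c", "d").iterator(), DEFAULT_MAX_TOKENS);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the check runs while tokens are being consumed, it also stops a stream 
that never terminates, which is the "buggy TokenStream" case.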

{quote}The current patch really just shoves sugar into indexwriter [....] It 
isn't any more efficient than the user consuming the TS themselves and adding 
the field
{quote}

Wouldn't it be more efficient though, in at least some cases? With an 
"external" approach such as you're suggesting, a tokenized and indexed field 
would have to run analysis twice: once outside of IndexingChain in order to 
generate (and buffer) explicit SortedSetDocValuesFields, and again within 
IndexingChain to generate indexed tokens (with associated token attributes). In 
contrast, the current PR runs analysis only once and sends BytesRefs directly 
to the docValuesWriter. That avoids buffering BytesRefs (i.e., copying each one 
via BytesRef.deepCopyOf()), and it also avoids creating disposable one-off 
SortedSetDocValuesField objects.
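
The difference between the two approaches can be sketched with a toy model. 
This is plain Java rather than Lucene code: the whitespace/lowercase "analyzer", 
the class and method names, and the use of plain lists for postings and 
docValues are all assumptions made for illustration (real SortedSet docValues 
would also sort and deduplicate terms, which is ignored here).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

/** Toy model contrasting two-pass and single-pass analysis; not Lucene API. */
final class AnalysisPassSketch {

    /** Counts how many times the stand-in analyzer has been run. */
    static int analysisRuns = 0;

    /** Stand-in analyzer: lowercases, splits on whitespace, pushes tokens to a sink. */
    static void analyze(String text, Consumer<String> sink) {
        analysisRuns++;
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                sink.accept(token);
            }
        }
    }

    /** "External" approach: buffer docValues terms first, then analyze again to index. */
    static void indexExternally(String text, List<String> postings, List<String> docValues) {
        List<String> buffered = new ArrayList<>();
        analyze(text, buffered::add);   // pass 1: outside the indexing chain, terms buffered
        docValues.addAll(buffered);
        analyze(text, postings::add);   // pass 2: inside the indexing chain
    }

    /** "Internal" approach: one pass, each token goes straight to both consumers. */
    static void indexInternally(String text, List<String> postings, List<String> docValues) {
        analyze(text, token -> {
            postings.add(token);
            docValues.add(token);
        });
    }

    public static void main(String[] args) {
        List<String> postings = new ArrayList<>();
        List<String> docValues = new ArrayList<>();

        indexExternally("Call me Ishmael", postings, docValues);
        System.out.println("external analysis runs: " + analysisRuns);

        analysisRuns = 0;
        postings.clear();
        docValues.clear();
        indexInternally("Call me Ishmael", postings, docValues);
        System.out.println("internal analysis runs: " + analysisRuns);
    }
}
```

In the single-pass version both consumers see the same tokens by construction. 
In the two-pass version that only holds if both passes happen to use the same 
analysis configuration.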

Aside from performance, the approach taken by this PR is more than sugar in the 
sense that an IndexingChain-internal approach can enforce consistency between 
indexed terms and docValues terms, which could be useful in a number of ways 
(e.g., reliable behavior in applications that assume 1:1 correspondence between 
indexed terms and docValues, and potential future optimizations such as a 
shared terms dictionary).

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by 
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but 
> there are cases where it would be desirable to have post-analysis DocValues 
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms 
> aggregation. I understand that this could be viewed as "trappy" for the naive 
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people 
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the 
> trappiness onto Lucene-external workarounds for systems/users that want to 
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency 
> guarantees that present opportunities for future optimizations (e.g., shared 
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues 
> directly to {{IndexingChain}}. The initial proposal involves extending the 
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to 
> existing {{IndexableFieldType.docValuesType()}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
