[ https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378269#comment-17378269 ]
Michael Gibney commented on LUCENE-10023:
-----------------------------------------

That would make sense. Really, the main difference in a sandbox-based impl would be performance (double-consuming the TokenStream and extra buffering of token BytesRefs); a rough sketch of that kind of approach appears after the quoted issue description below. Having a concrete sandbox impl available would give a baseline for evaluating any performance difference, and would also address the desire to add this functionality in a place where it would be factored out and accessible to Elasticsearch, Solr, etc.

My only hesitation about the sandbox approach is that if there's not even a remote thought of ever considering/evaluating the performance gain that would come from integrating this in IndexingChain, and of entertaining the legitimacy of the "text corpus analytics"/many-token use case (with the trappiness somehow mitigated), then the sandbox change would be _exclusively_ sugar. This is neither here nor there, and not an argument against the sandbox approach, but tbh it likely wouldn't have occurred to me to file a Lucene issue for this if the change were _strictly_ about sugar, with no performance aspect.

That said, I think it still might be worth pursuing a sandbox-based approach, particularly if:
# there's _any_ potential of revisiting closer integration in IndexingChain, for performance reasons, or
# there's interest in leveraging this from "other-than-Solr" (e.g., Elasticsearch, etc. ... I'm approaching this from the Solr side, so could just as well implement it there, in a custom Solr FieldType, I think).

I probably won't move immediately to implementing a sandbox-based approach; I'd be interested to hear from anyone inclined to weigh in on whether they'd find such an approach useful.

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but
> there are cases where it would be desirable to have post-analysis DocValues
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms
> aggregation. I understand that this could be viewed as "trappy" for the naive
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the
> trappiness onto Lucene-external workarounds for systems/users that want to
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency
> guarantees that present opportunities for future optimizations (e.g., shared
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues
> directly to {{IndexingChain}}. The initial proposal involves extending the
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to
> existing {{IndexableFieldType.docValuesType()}}).
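For illustration only, here is a minimal sketch of the kind of Lucene-external workaround (and, roughly, what a sandbox "sugar" helper would wrap) that the comment above refers to: the field value is analyzed once by IndexingChain for the indexed field, and a second time explicitly to buffer per-token BytesRefs into SortedSetDocValues. The class and method names ({{MultiTokenDocValuesWorkaround}}, {{addField}}) are hypothetical and not part of Lucene; only existing Lucene APIs are used.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

/** Hypothetical sketch: multi-token post-analysis DocValues via a second analysis pass. */
public class MultiTokenDocValuesWorkaround {

  /**
   * Adds an indexed TextField plus one SortedSet DocValues entry per post-analysis token.
   * Note the cost this sketch is meant to illustrate: the analyzer runs twice for the same
   * value (once here, once in IndexingChain), and every token's bytes are copied/buffered.
   */
  static void addField(Document doc, Analyzer analyzer, String fieldName, String value)
      throws IOException {
    // Normal indexed field; IndexingChain will run the analysis chain again for this.
    doc.add(new TextField(fieldName, value, Field.Store.NO));

    // Second, explicit pass over the analysis chain just to capture the token bytes.
    List<BytesRef> tokens = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream(fieldName, value)) {
      TermToBytesRefAttribute termAtt = ts.addAttribute(TermToBytesRefAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // Extra buffering: the attribute's BytesRef is reused, so a deep copy is required.
        tokens.add(BytesRef.deepCopyOf(termAtt.getBytesRef()));
      }
      ts.end();
    }

    // One DocValues entry per token; SortedSetDocValues de-duplicates repeated tokens per doc.
    for (BytesRef token : tokens) {
      doc.add(new SortedSetDocValuesField(fieldName, token));
    }
  }
}
{code}

An IndexingChain-integrated approach (per the proposed {{IndexableFieldType.tokenDocValuesType()}}) would avoid the second analysis pass and the intermediate token copies, which is the performance aspect discussed above.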