[ https://issues.apache.org/jira/browse/LUCENE-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506402#comment-17506402 ]
Adrien Grand commented on LUCENE-10427: --------------------------------------- Thanks I understand better now. With the sidecar approach, could you compute rollups at index time by performing updates instead of hooking into the merging process? For instance if a user is adding a new sample, you could retrieve data for the current <your-data-granularity-goes-here> bucket for the given dimensions and update the min/max/sum values? > OLAP likewise rollup during segment merge process > ------------------------------------------------- > > Key: LUCENE-10427 > URL: https://issues.apache.org/jira/browse/LUCENE-10427 > Project: Lucene - Core > Issue Type: New Feature > Reporter: Suhan Mao > Priority: Major > > Currently, many OLAP engines support rollup feature like > clickhouse(AggregateMergeTree)/druid. > Rollup definition: [https://athena.ecs.csus.edu/~mei/olap/OLAPoperations.php] > One of the way to do rollup is to merge the same dimension buckets into one > and do sum()/min()/max() operation on metric fields during segment > compact/merge process. This can significantly reduce the size of the data and > speed up the query a lot. > > *Abstraction of how to do* > # Define rollup logic: which is dimensions and metrics. > # Rollup definition for each metric field: max/min/sum ... > # index sorting should the the same as dimension fields. > # We will do rollup calculation during segment merge just like other OLAP > engine do. > > *Assume the scenario* > We use ES to ingest realtime raw temperature data every minutes of each > sensor device along with many dimension information. User may want to query > the data like "what is the max temperature of some device within some/latest > hour" or "what is the max temperature of some city within some/latest hour" > In that way, we can define such fields and rollup definition: > # event_hour(round to hour granularity) > # device_id(dimension) > # city_id(dimension) > # temperature(metrics, max/min rollup logic) > The raw data will periodically be rolled up to the hour granularity during > segment merge process, which should save 60x storage ideally in the end. > > *How we do rollup in segment merge* > bucket: docs should belong to the same bucket if the dimension values are all > the same. > # For docvalues merge, we send the normal mappedDocId if we encounter a new > bucket in DocIDMerger. > # Since the index sorting fields are the same with dimension fields. if we > encounter more docs in the same bucket, We emit special mappedDocId from > DocIDMerger . > # In DocValuesConsumer.mergeNumericField, if we meet special mappedDocId, we > do a rollup calculation on metric fields and fold the result value to the > first doc in the bucket. The calculation just like a streaming merge sort > rollup. > # We discard all the special mappedDocId docs because the metrics is already > folded to the first doc of in the bucket. > # In BKD/posting structure, we discard all the special mappedDocId docs and > only place the first doc id within a bucket in the BKD/posting data. It > should be simple. > > *How to define the logic* > > {code:java} > public class RollupMergeConfig { > private List<String> dimensionNames; > private List<RollupMergeAggregateField> aggregateFields; > } > public class RollupMergeAggregateField { > private String name; > private RollupMergeAggregateType aggregateType; > } > public enum RollupMergeAggregateType { > COUNT, > SUM, > MIN, > MAX, > CARDINALITY // if data sketch is stored in binary doc values, we can do a > union logic > }{code} > > > I have written the initial code in a basic level. I can submit the complete > PR if you think this feature is good to try. > > > > -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org