Hi Carbon Team,

Recently I have been considering implementing a secondary index on top of the DataMap API. After a careful look at the design, I have some questions and concerns I would like to raise here:
- DataMap only supports partition-level (more precisely, blocklet-level) pruning. Specifically, the `prune()` function in the DataMap interface consumes a filter and produces a list of Blocklets. It therefore seems that building a sophisticated data structure may not be useful. For example, in the case of a spatial range query, the only thing I may want to know is the boundary of a blocklet; no other insight will be exposed to the pruning procedure.
- It is common in most commercial databases that only one index is used for the filter process, even though other secondary indexes could also be used to prune. Most likely, the query optimizer will choose the index that provides the highest selectivity.
- I am confused about the semantics of the `toDistribute()` function in the DataMap API. One problem I found in the MinMax DataMap example is that a single thread consumes all these indexes and then performs the pruning. As a result, we may lose the advantage of massive parallelism. Is the `distributed datamap` supposed to solve this problem?
- Finally, could you give me an example of iterating through all rows in a blocklet, block, and segment, so that I can get my input for index bulk loading? In the DataMap example that is still in a pull request (https://github.com/apache/carbondata/pull/1359), I cannot find this part, since it pulls the statistics directly from the Blocklet built in the MinMax index. (Please refer to the `loadBlockDetails` and `constructMinMaxIndex` functions in `MinMaxDataWriter.java` under that PR.)

Thanks,
Dong
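To make the first concern concrete, here is a minimal sketch of what blocklet-level pruning with only min/max boundaries amounts to for a 1-D range query. All names here (`BlockletStat`, `pruneByRange`) are hypothetical and for illustration only, not the actual CarbonData API; the point is that the only information the pruning step can exploit is the per-blocklet boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class RangePruneSketch {
  // Hypothetical per-blocklet statistics: just an id and a [min, max] boundary.
  static class BlockletStat {
    final String id;
    final double min, max;
    BlockletStat(String id, double min, double max) {
      this.id = id;
      this.min = min;
      this.max = max;
    }
  }

  // Keep a blocklet only if its [min, max] interval overlaps the query range [lo, hi].
  // Any richer index structure would collapse to this interval test at prune time.
  static List<String> pruneByRange(List<BlockletStat> stats, double lo, double hi) {
    List<String> hits = new ArrayList<>();
    for (BlockletStat s : stats) {
      if (s.max >= lo && s.min <= hi) {
        hits.add(s.id);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<BlockletStat> stats = new ArrayList<>();
    stats.add(new BlockletStat("b0", 0, 10));
    stats.add(new BlockletStat("b1", 20, 30));
    stats.add(new BlockletStat("b2", 5, 25));
    // Query range [8, 15] overlaps b0 and b2 but not b1.
    System.out.println(pruneByRange(stats, 8, 15)); // prints [b0, b2]
  }
}
```

For a spatial index this would be the same test per dimension on the blocklet's bounding box, which is why I wonder whether anything more sophisticated than boundaries can pay off at this granularity.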