Hi Carbon Team,

I have recently been considering implementing a secondary index over the 
DataMap API. After a careful look at the design, I have some questions and 
concerns I would like to raise here:

- DataMap only supports partition-level (more precisely, blocklet-level) 
pruning. Specifically, the `prune()` function in the DataMap interface 
consumes a filter and produces a list of Blocklets. It therefore seems that 
building a sophisticated data structure may not be useful. For example, in 
the case of a spatial range query, the only thing I need to know is the 
boundary of each blocklet; any finer-grained insight will not be exposed to 
the pruning procedure (see the first sketch after this list).
- In most commercial databases, it is common that only one index is used for 
the filtering process, even though other secondary indexes could also be used 
to prune. Most likely, the query optimizer will choose the index that 
provides the highest selectivity.
- I am confused about the semantics of the `toDistribute()` function in the 
DataMap API. One problem I found in the MinMax DataMap example is that a 
single thread consumes all of the indexes and then performs the pruning. As a 
result, we may lose any advantage of massive parallelism. Is the `distributed 
datamap` supposed to solve this problem?
- Finally, could you give me an example of iterating through all rows in a 
blocklet, block, and segment, so that I can obtain my input for index bulk 
loading? In one of the DataMap examples, which is still in a pull request 
(https://github.com/apache/carbondata/pull/1359), I cannot find this part, 
since it pulls the statistics directly from the Blocklet built in the MinMax 
index. (Please refer to the `loadBlockDetails` and `constructMinMaxIndex` 
functions in `MinMaxDataWriter.java` under that PR.) The second sketch after 
this list shows roughly the kind of iteration I have in mind.
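
To make the first point concrete, here is a minimal sketch of the kind of 
spatial DataMap I have in mind. The names (`SpatialRangeDataMap`, 
`BlockletBounds`) are my own illustrations, not CarbonData types; the point 
is only that, since `prune()` can report nothing finer than a blocklet, 
keeping per-blocklet bounding boxes seems to be about as sophisticated as the 
index can usefully get.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a spatial secondary index over the
// DataMap API. Not actual CarbonData classes.
public class SpatialRangeDataMap {

  // Minimal placeholder for a blocklet's spatial boundary (min/max x and y).
  static class BlockletBounds {
    final String blockletId;
    final double minX, minY, maxX, maxY;
    BlockletBounds(String id, double minX, double minY, double maxX, double maxY) {
      this.blockletId = id;
      this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }
    boolean intersects(double qMinX, double qMinY, double qMaxX, double qMaxY) {
      return qMinX <= maxX && qMaxX >= minX && qMinY <= maxY && qMaxY >= minY;
    }
  }

  private final List<BlockletBounds> index = new ArrayList<>();

  void add(BlockletBounds bounds) {
    index.add(bounds);
  }

  // Analogous to DataMap.prune(): takes a (spatial) range filter and returns
  // the ids of blocklets that may contain matching rows. Any finer-grained
  // structure (e.g. an R-tree over rows inside a blocklet) cannot be expressed
  // in the result, because the unit of pruning is the whole blocklet.
  List<String> prune(double qMinX, double qMinY, double qMaxX, double qMaxY) {
    List<String> hit = new ArrayList<>();
    for (BlockletBounds b : index) {
      if (b.intersects(qMinX, qMinY, qMaxX, qMaxY)) {
        hit.add(b.blockletId);
      }
    }
    return hit;
  }
}
```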
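
And to make the last question concrete, this is roughly the kind of loading 
loop I would like to write. All of the types here (`SegmentRows`, 
`BlockletRows`, `Row`, `IndexBulkLoader`) are hypothetical placeholders, not 
actual CarbonData interfaces; I simply cannot find the existing API that 
would let me drive such an iteration.

```java
// Purely hypothetical sketch of the bulk-loading hook I am looking for.
interface Row {
  Object getValue(int columnOrdinal);
}

interface BlockletRows extends Iterable<Row> {
  String blockletId();
}

interface SegmentRows extends Iterable<BlockletRows> {
  String segmentId();
}

class IndexBulkLoader {
  // Walk every row of every blocklet in a segment and feed it to the index.
  void load(SegmentRows segment, int indexedColumn) {
    for (BlockletRows blocklet : segment) {
      for (Row row : blocklet) {
        Object key = row.getValue(indexedColumn);
        addToIndex(segment.segmentId(), blocklet.blockletId(), key);
      }
    }
  }

  private void addToIndex(String segmentId, String blockletId, Object key) {
    // Insert (key -> segment/blocklet) into the secondary index structure.
  }
}
```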


Thanks,
Dong  
