Re: Questions and Concerns on DataMap API

Sounak Chakraborty Wed, 04 Oct 2017 04:03:08 -0700

Hi Dong, 

My answers are inline below.


> - Datamap only support partition level (more precisely blocklet level 
> pruning), Specifically, the `prune()` function in DataMap interface will 
> consume a filter and produce a list of Blocklets. Then, it seems that 
> building sophisticated data structure may be not useful. For example, in the 
> case of spatial range query, the only thing I may want to know is the 
> boundary of a blocklet, anything other insight will not be exposed to the 
> pruning procedure.
      If you are developing your own DataMap for spatial data, you can 
override the pruning logic with your own. During the write phase of your 
DataMap, the secondary index can store the spatial data together with its 
corresponding blocklet. Later, while pruning, it filters the spatial data and 
returns the matching blocklet IDs. These blocklet IDs are then fed to 
BlockletDataMap (the default DataMap) to retrieve the detailed blocklet 
information.
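
      As a rough sketch of that idea (plain Java only; `SpatialPruneSketch`, 
`BlockletBounds` and this `prune` signature are hypothetical names for 
illustration, not CarbonData classes or the actual DataMap interface), the 
write phase could record one bounding box per blocklet and pruning could 
return the IDs of the boxes that overlap the query rectangle:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types belong to the CarbonData API.
public class SpatialPruneSketch {

    /** Bounding box of the points in one blocklet, recorded at write time. */
    static final class BlockletBounds {
        final int blockletId;
        final double minX, minY, maxX, maxY;

        BlockletBounds(int blockletId, double minX, double minY,
                       double maxX, double maxY) {
            this.blockletId = blockletId;
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        /** True when this box overlaps the query rectangle (edges inclusive). */
        boolean intersects(double qMinX, double qMinY, double qMaxX, double qMaxY) {
            return minX <= qMaxX && maxX >= qMinX
                && minY <= qMaxY && maxY >= qMinY;
        }
    }

    /** Returns the IDs of blocklets whose bounds overlap the query rectangle. */
    static List<Integer> prune(List<BlockletBounds> index,
                               double qMinX, double qMinY,
                               double qMaxX, double qMaxY) {
        List<Integer> hits = new ArrayList<>();
        for (BlockletBounds b : index) {
            if (b.intersects(qMinX, qMinY, qMaxX, qMaxY)) {
                hits.add(b.blockletId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<BlockletBounds> index = new ArrayList<>();
        index.add(new BlockletBounds(0, 0, 0, 10, 10));
        index.add(new BlockletBounds(1, 20, 20, 30, 30));
        index.add(new BlockletBounds(2, 5, 5, 25, 25));
        // Range query over the rectangle (8, 8) to (12, 12).
        System.out.println(prune(index, 8, 8, 12, 12)); // prints [0, 2]
    }
}
```

      The linear scan is only for brevity. A real index could keep the boxes 
in an R-tree or similar structure, and the surviving blocklet IDs would then 
be handed to the default DataMap as described above.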
> - It is common in most commercial databases that only one index will be used 
> for the filter process even though we can use other secondary index to prune. 
> Most likely, the query optimizer will choose the index which provides the 
> highest selectivity to use.
      Yes, most commercial database optimizers choose the best access path 
(when multiple indexes are present) based on index coverage and statistics. 
Indexes that completely cover the projection and predicates are preferred. 
This feature is not yet available in CarbonData, but it would be good to have.
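
      To illustrate what such a choice could look like (a hypothetical 
sketch, not existing CarbonData behaviour; `IndexChoiceSketch`, `IndexInfo` 
and `matchFraction` are invented names, and the statistics are assumed to be 
available), an optimizer could keep only the indexes that cover the needed 
columns and pick the one expected to match the fewest rows:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: CarbonData does not currently choose among indexes this way.
public class IndexChoiceSketch {

    static final class IndexInfo {
        final String name;
        final Set<String> columns;   // columns the index covers
        final double matchFraction;  // estimated fraction of rows that pass, 0..1

        IndexInfo(String name, Set<String> columns, double matchFraction) {
            this.name = name;
            this.columns = columns;
            this.matchFraction = matchFraction;
        }
    }

    /** Picks the covering index with the smallest estimated match fraction. */
    static Optional<IndexInfo> choose(List<IndexInfo> candidates,
                                      Set<String> neededColumns) {
        IndexInfo best = null;
        for (IndexInfo c : candidates) {
            if (!c.columns.containsAll(neededColumns)) {
                continue; // does not cover the projection and predicates
            }
            if (best == null || c.matchFraction < best.matchFraction) {
                best = c;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<IndexInfo> candidates = Arrays.asList(
            new IndexInfo("idx_city", new HashSet<>(Arrays.asList("city")), 0.05),
            new IndexInfo("idx_city_age",
                new HashSet<>(Arrays.asList("city", "age")), 0.20),
            new IndexInfo("idx_age_city_name",
                new HashSet<>(Arrays.asList("age", "city", "name")), 0.10));
        Set<String> needed = new HashSet<>(Arrays.asList("city", "age"));
        // idx_city is the most selective but does not cover "age", so it is skipped.
        System.out.println(choose(candidates, needed).get().name);
        // prints idx_age_city_name
    }
}
```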
> - I feel confused on the semantics of  `toDistribute()` function in DataMap 
> API. One problem I found in the MinMax DataMap example is that there will be 
> a single thread consume all these indexes then construct the pruning. As a 
> result, we may loose any advantage of massive parallelism. Is `distributed 
> datamap` supposed to solve this problem?
     The MinMax example applies the DataMap only on the driver side, not in 
the executors. It is provided just as an example. If you want your DataMap to 
be distributed and executed in the executors, it can be distributed. 
> - Finally, could you give me an example on iterating through all rows in a 
> blocklet, block and segment so that I can get my input for index bulk 
> loading. In one of the DataMap example which is still in a pull request 
> (https://github.com/apache/carbondata/pull/1359), I cannot find this part 
> since it pull the statistics directly from Blocklet built in MinMax Index. 
> (Please refer to `loadBlockDetails` and `constructMinMaxIndex` function in 
> `MinMaxDataWriter.java` under that PR).
    I will shortly be updating another pull request that scans data from the 
fact file (i.e. the carbondata file) and updates the DataMap secondary index. 
I will share the PR with you once it is ready.


Thanks 
Sounak


> On 03-Oct-2017, at 10:42 PM, Dong Xie <xiedong1...@gmail.com> wrote:
> 
> Hi Carbon Team,
> 
> Recently, I am considering working on implementing a secondary index over the 
> DataMap API. After a careful look on the design, there are some questions and 
> concerns I want to raise here:
> 
> - Datamap only support partition level (more precisely blocklet level 
> pruning), Specifically, the `prune()` function in DataMap interface will 
> consume a filter and produce a list of Blocklets. Then, it seems that 
> building sophisticated data structure may be not useful. For example, in the 
> case of spatial range query, the only thing I may want to know is the 
> boundary of a blocklet, anything other insight will not be exposed to the 
> pruning procedure.
> - It is common in most commercial databases that only one index will be used 
> for the filter process even though we can use other secondary index to prune. 
> Most likely, the query optimizer will choose the index which provides the 
> highest selectivity to use.
> - I feel confused on the semantics of  `toDistribute()` function in DataMap 
> API. One problem I found in the MinMax DataMap example is that there will be 
> a single thread consume all these indexes then construct the pruning. As a 
> result, we may loose any advantage of massive parallelism. Is `distributed 
> datamap` supposed to solve this problem?
> - Finally, could you give me an example on iterating through all rows in a 
> blocklet, block and segment so that I can get my input for index bulk 
> loading. In one of the DataMap example which is still in a pull request 
> (https://github.com/apache/carbondata/pull/1359), I cannot find this part 
> since it pull the statistics directly from Blocklet built in MinMax Index. 
> (Please refer to `loadBlockDetails` and `constructMinMaxIndex` function in 
> `MinMaxDataWriter.java` under that PR).
> 
> 
> Thanks,
> Dong
