Hi,

Problem:
 The bloom filter is currently built at the blocklet level, so when a
column has high cardinality and many blocklets are loaded, the bloom
index grows very large. In some of our use cases it has grown to 60 GB,
and it will only grow further as data grows or as more bloom datamap
columns are added. It is not practical to keep this much bloom data in
driver memory. So currently we have the option of launching a
distributed job to prune the data using the bloom datamap, but it takes
more time because the bloom data has to be loaded into every executor's
memory before pruning. Also, there is no guarantee that subsequent
queries will reuse the bloom data already loaded in an executor, since
the Spark scheduler does not guarantee task placement.
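
For a rough sense of scale, the standard bloom sizing formula
m = -n * ln(p) / (ln 2)^2 shows how a blocklet-level index reaches this
range. The numbers below are purely illustrative assumptions, not
measurements from our workload:

  // Back-of-envelope bloom sizing (illustrative numbers only).
  public class BloomSizeEstimate {
    public static void main(String[] args) {
      long keysPerBlocklet = 1_000_000L; // assumed distinct keys per blocklet
      double fpp = 0.01;                 // assumed false-positive probability
      long blocklets = 50_000L;          // assumed blocklets across all loads
      // optimal bloom size in bits: m = -n * ln(p) / (ln 2)^2
      double bits = -keysPerBlocklet * Math.log(fpp)
          / (Math.log(2) * Math.log(2));
      double totalGb = bits * blocklets / 8 / 1024 / 1024 / 1024;
      // prints roughly: per-blocklet ~1.1 MB, total ~56 GB
      System.out.printf("per-blocklet ~%.1f MB, total ~%.0f GB%n",
          bits / 8 / 1024 / 1024, totalGb);
    }
  }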

Solution: Create a hierarchical bloom index and pruning.

   - We can create the bloom index hierarchically, i.e. maintain blooms
   at both the task (carbonindex) level and the blocklet level.
   - While loading the data, we create a task-level bloom in addition to
   the blocklet-level blooms. The task-level bloom is very small
   compared to the blocklet-level blooms, so it can be kept in driver
   memory.
   - Maintain a footer in the current blocklet-level bloom file that
   records the offset of each block's blocklet blooms (see the layout
   sketch after this list). The executor uses this information during
   the query to locate the blocklet blooms of the corresponding block.
   This footer information is also loaded into driver memory.
   - During pruning, the first level happens at the task level using the
   task-level bloom, which gives the set of surviving blocks. We then
   launch the main job along with each block's bloom offset information,
   which is already available from the footer.
   - In AbstractQueryExecutor, first read the blooms of the respective
   blocks using the footer information sent from the driver, and then
   prune the blocklets.
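
A rough sketch of what the bloom file footer could look like and how the
executor would use it. All class and method names here are invented for
illustration, not existing CarbonData classes:

  import java.io.*;
  import java.util.*;

  // Hypothetical layout of the proposed bloomindex file:
  //   [blocklet bloom 0][blocklet bloom 1]...[blocklet bloom N]
  //   [footer: blockId -> (offset, length) of that block's blooms]
  // The driver keeps only the footer; the executor seeks straight to
  // the blooms of the blocks that survived task-level pruning.
  public class BloomIndexFooter {
    // offset/length of one block's contiguous blocklet blooms
    public static class BlockBloomRange implements Serializable {
      public final long offset;
      public final int length;
      public BlockBloomRange(long offset, int length) {
        this.offset = offset;
        this.length = length;
      }
    }

    private final Map<String, BlockBloomRange> blockRanges = new HashMap<>();

    public void add(String blockId, long offset, int length) {
      blockRanges.put(blockId, new BlockBloomRange(offset, length));
    }

    // Executor side: read only the blocklet blooms of one pruned block.
    public byte[] readBlockBlooms(RandomAccessFile bloomFile, String blockId)
        throws IOException {
      BlockBloomRange range = blockRanges.get(blockId);
      byte[] blooms = new byte[range.length];
      bloomFile.seek(range.offset);
      bloomFile.readFully(blooms);
      return blooms;
    }
  }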

 In this way, we keep only a small amount of information in driver
memory and avoid launching a separate job for pruning. We also read
only the necessary blocklet blooms during pruning instead of reading
all of them.
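
Putting the two levels together, the pruning flow could look roughly
like this. Again a sketch under the assumptions above; BloomFilter
stands in for whatever implementation we settle on, and the names are
hypothetical:

  import java.util.*;

  // Sketch of the proposed two-level pruning (names hypothetical).
  public class HierarchicalBloomPruner {
    // Minimal stand-in for an actual bloom filter implementation.
    interface BloomFilter {
      boolean mightContain(byte[] key);
    }

    // Level 1, in the driver: task-level blooms are small enough to
    // keep in driver memory, so prune blocks without launching a job.
    static List<String> pruneBlocks(Map<String, BloomFilter> taskBlooms,
                                    byte[] filterKey) {
      List<String> survivingBlocks = new ArrayList<>();
      for (Map.Entry<String, BloomFilter> e : taskBlooms.entrySet()) {
        if (e.getValue().mightContain(filterKey)) {
          survivingBlocks.add(e.getKey());
        }
      }
      return survivingBlocks;
    }

    // Level 2, in the executor (e.g. inside AbstractQueryExecutor):
    // after reading only the surviving blocks' blooms via the footer
    // ranges shipped from the driver, prune the blocklets.
    static List<Integer> pruneBlocklets(List<BloomFilter> blockletBlooms,
                                        byte[] filterKey) {
      List<Integer> survivingBlocklets = new ArrayList<>();
      for (int i = 0; i < blockletBlooms.size(); i++) {
        if (blockletBlooms.get(i).mightContain(filterKey)) {
          survivingBlocklets.add(i);
        }
      }
      return survivingBlocklets;
    }
  }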

This is a draft proposal for discussion, and implementing it requires
changes to the datamap pruning flow design. But I feel that for this
type of coarse-grained datamap we should avoid launching a job for
pruning. I can write up the design document after the initial
discussion.

-- 
Thanks & Regards,
Ravindra.
