Sergey Shelukhin created HIVE-20380:
---------------------------------------

             Summary: explore storing multiple CBs in a single cache buffer in 
LLAP cache
                 Key: HIVE-20380
                 URL: https://issues.apache.org/jira/browse/HIVE-20380
             Project: Hive
          Issue Type: Bug
            Reporter: Sergey Shelukhin


Lately ORC CBs are becoming very small. First there's the 4KB minimum 
(instead of 256KB); then, after we moved the metadata cache off-heap, the index 
streams, which are all tiny, take up a lot of CBs and waste space. 
The wasted space can require a larger cache and lead to cache OOMs on some 
workloads.
Reducing min.alloc solves this problem, but then there's a lot of heap (and 
probably compute) overhead to track all these buffers. Arguably even the 4KB 
min.alloc is too small.
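
To make the trade-off concrete, here is a rough sketch of the waste
arithmetic. It assumes each CB gets its own allocation, rounded up to a power
of two and never smaller than min.alloc (buddy-style rounding). AllocWaste and
its methods are hypothetical names for illustration, not existing Hive code.

```java
// Hypothetical helper, not part of Hive: estimates bytes lost to rounding when
// every CB is stored in its own allocation.
final class AllocWaste {
    // Assumption: allocations are powers of two, with minAlloc as the floor.
    static long allocatedSize(long size, long minAlloc) {
        long alloc = minAlloc;
        while (alloc < size) {
            alloc <<= 1;
        }
        return alloc;
    }

    // Sum of (allocated - used) over all CBs.
    static long wastedBytes(long[] cbSizes, long minAlloc) {
        long wasted = 0;
        for (long size : cbSizes) {
            wasted += allocatedSize(size, minAlloc) - size;
        }
        return wasted;
    }
}
```

With three tiny index CBs of 300, 500 and 200 bytes, a 4096-byte min.alloc
wastes 11288 bytes, while a 256-byte min.alloc wastes 280 bytes. The buffer
count to track is the same in both cases, and a smaller min.alloc in general
means more, smaller buffers for the same cache size.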

We should store contiguous CBs in the same buffer; to start, we can do it for 
ROW_INDEX streams. That probably means reading all ROW_INDEX streams instead of 
doing projection when we see that they are too small.
We need to investigate what the pattern is for ORC data blocks. One option is 
to increase min.alloc and then consolidate multiple 4-8KB CBs, but only within 
the same stream. However, a larger min.alloc will result in wastage for really 
small streams, so we can also consolidate multiple streams (potentially across 
columns) if needed. This will result in some priority anomalies, but they are 
probably OK.
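
The consolidation idea can be sketched as follows. This is only an
illustration under assumptions: CBs are given as file ranges sorted by offset,
"contiguous" means one range ends exactly where the next begins, and a group
is capped at a maximum buffer size. CbPacker, Chunk and Group are hypothetical
names, not existing LLAP classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hive code: groups file-contiguous CBs so that each
// group can be stored in a single cache buffer.
final class CbPacker {
    // One CB as a file range [offset, offset + length).
    static final class Chunk {
        final long offset;
        final int length;
        Chunk(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // A run of contiguous CBs sharing one buffer.
    static final class Group {
        final long fileOffset;                       // file offset of the first CB
        final List<Integer> startsInBuffer = new ArrayList<>();
        int totalLength;                             // bytes used in the buffer
        Group(long fileOffset) {
            this.fileOffset = fileOffset;
        }
    }

    // Input must be sorted by offset. A CB larger than maxBufferSize still
    // gets a group of its own.
    static List<Group> pack(List<Chunk> chunks, int maxBufferSize) {
        List<Group> groups = new ArrayList<>();
        Group cur = null;
        for (Chunk c : chunks) {
            boolean contiguous = cur != null
                && cur.fileOffset + cur.totalLength == c.offset;
            boolean fits = cur != null
                && cur.totalLength + c.length <= maxBufferSize;
            if (!contiguous || !fits) {
                cur = new Group(c.offset);
                groups.add(cur);
            }
            cur.startsInBuffer.add(cur.totalLength);
            cur.totalLength += c.length;
        }
        return groups;
    }
}
```

A whole group would then be cached or evicted as one unit, which is where the
priority anomalies mentioned above would come from.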

Another consideration is making tracking less object oriented, in particular 
passing around integer indexes instead of objects and storing state in giant 
arrays somewhere (potentially with some optimizations for less common things), 
instead of every buffer getting its own object.
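
As a rough sketch of what index-based tracking could look like: per-buffer
state lives in parallel primitive arrays, and callers pass around an int index
instead of an object reference. BufferTable and its fields are hypothetical;
the state LLAP actually needs to track is not specified here, and the sketch
ignores thread safety and slot reuse.

```java
import java.util.Arrays;

// Hypothetical sketch, not Hive code: buffer state kept in parallel arrays,
// addressed by integer index rather than by per-buffer objects.
final class BufferTable {
    private long[] fileOffsets;
    private int[] lengths;
    private int[] refCounts;
    private int size;

    BufferTable(int initialCapacity) {
        int cap = Math.max(1, initialCapacity);
        fileOffsets = new long[cap];
        lengths = new int[cap];
        refCounts = new int[cap];
    }

    // Registers a buffer and returns its index.
    int add(long fileOffset, int length) {
        if (size == fileOffsets.length) {
            int newCap = size * 2;
            fileOffsets = Arrays.copyOf(fileOffsets, newCap);
            lengths = Arrays.copyOf(lengths, newCap);
            refCounts = Arrays.copyOf(refCounts, newCap);
        }
        fileOffsets[size] = fileOffset;
        lengths[size] = length;
        refCounts[size] = 0;
        return size++;
    }

    long fileOffset(int idx) { return fileOffsets[idx]; }
    int length(int idx) { return lengths[idx]; }
    int refCount(int idx) { return refCounts[idx]; }

    void incRef(int idx) { refCounts[idx]++; }

    // Returns true when the buffer is no longer referenced.
    boolean decRef(int idx) { return --refCounts[idx] == 0; }
}
```

Three arrays replace one object header plus fields per buffer, so the heap
overhead per tracked buffer drops to the raw field sizes.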

cc [~gopalv] [~prasanth_j]





