[ https://issues.apache.org/jira/browse/HIVE-20380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin reassigned HIVE-20380:
---------------------------------------

    Assignee: Sergey Shelukhin

> explore storing multiple CBs in a single cache buffer in LLAP cache
> -------------------------------------------------------------------
>
>                 Key: HIVE-20380
>                 URL: https://issues.apache.org/jira/browse/HIVE-20380
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>
> Lately ORC CBs are becoming ridiculously small. First there's the 4Kb minimum
> (instead of 256Kb); then, after we moved the metadata cache off-heap, the index
> streams, which are all tiny, take up a lot of CBs and waste space.
> Wasted space can require a larger cache and lead to cache OOMs on some
> workloads.
> Reducing min.alloc solves this problem, but then there's a lot of heap (and
> probably compute) overhead to track all these buffers. Arguably even the 4Kb
> min.alloc is too small.
> We should store contiguous CBs in the same buffer; to start, we can do it for
> ROW_INDEX streams. That probably means reading all ROW_INDEX streams instead
> of doing projection when we see that they are too small.
> We need to investigate what the pattern is for ORC data blocks. One option is
> to increase min.alloc and then consolidate multiple 4-8Kb CBs, but only for
> the same stream. However, a larger min.alloc will result in wastage for really
> small streams, so we can also consolidate multiple streams (potentially
> across columns) if needed. This will result in some priority anomalies, but
> they are probably OK.
> Another consideration is making tracking less object-oriented, in particular
> passing around integer indexes instead of objects and storing state in giant
> arrays somewhere (potentially with some optimizations for less common
> things), instead of every buffer getting its own object.
> cc [~gopalv] [~prasanth_j]
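
To make the last two ideas in the description concrete, here is a minimal Java sketch of packing contiguous small CBs into shared buffers while tracking them with integer handles and parallel arrays rather than one object per buffer. This is not LLAP code: the class name PackedSliceTracker, its methods, and the simple "start a new buffer when the current one is full" policy are all assumptions made for illustration, and the sketch ignores eviction priority, locking, and the actual allocator.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch (not Hive/LLAP code): tracks small CBs packed
 * back-to-back into fixed-size shared buffers. Each CB is identified by an
 * int handle; its state lives in parallel int arrays, not in its own object.
 */
public class PackedSliceTracker {
    private final int bufferSize; // size of each shared buffer, in bytes
    private int[] bufferIdx;      // handle -> index of the shared buffer
    private int[] offsets;        // handle -> byte offset inside that buffer
    private int[] lengths;        // handle -> length in bytes
    private int count = 0;        // number of CBs registered so far
    private int currentBuffer = 0;
    private int currentOffset = 0;

    public PackedSliceTracker(int bufferSize, int initialCapacity) {
        if (bufferSize <= 0 || initialCapacity <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.bufferSize = bufferSize;
        this.bufferIdx = new int[initialCapacity];
        this.offsets = new int[initialCapacity];
        this.lengths = new int[initialCapacity];
    }

    /** Registers the next contiguous CB and returns its integer handle. */
    public int add(int length) {
        if (length <= 0 || length > bufferSize) {
            throw new IllegalArgumentException("bad length: " + length);
        }
        // Start a new shared buffer if this CB does not fit in the current one.
        if (length > bufferSize - currentOffset) {
            currentBuffer++;
            currentOffset = 0;
        }
        if (count == bufferIdx.length) {
            // Grow all three arrays together so they stay parallel.
            int newCap = bufferIdx.length * 2;
            bufferIdx = Arrays.copyOf(bufferIdx, newCap);
            offsets = Arrays.copyOf(offsets, newCap);
            lengths = Arrays.copyOf(lengths, newCap);
        }
        int handle = count++;
        bufferIdx[handle] = currentBuffer;
        offsets[handle] = currentOffset;
        lengths[handle] = length;
        currentOffset += length;
        return handle;
    }

    public int buffer(int handle) { return bufferIdx[check(handle)]; }
    public int offset(int handle) { return offsets[check(handle)]; }
    public int length(int handle) { return lengths[check(handle)]; }

    /** Number of shared buffers that hold at least one CB. */
    public int buffersUsed() { return count == 0 ? 0 : currentBuffer + 1; }

    private int check(int handle) {
        if (handle < 0 || handle >= count) {
            throw new IndexOutOfBoundsException("unknown handle: " + handle);
        }
        return handle;
    }
}
```

With a 4096-byte shared buffer, three 1500-byte CBs added in order end up as two slices in buffer 0 (offsets 0 and 1500) and one slice at offset 0 of buffer 1, since the third no longer fits. The per-CB heap cost in this sketch is three ints instead of a separate object.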