[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prasanth J updated HIVE-6455:
-----------------------------
    Attachment: HIVE-6455.19.patch

Rebased the patch to trunk.

> Scalable dynamic partitioning and bucketing optimization
> --------------------------------------------------------
>
>                 Key: HIVE-6455
>                 URL: https://issues.apache.org/jira/browse/HIVE-6455
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: optimization
>         Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch,
> HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch,
> HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch,
> HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch,
> HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch,
> HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch,
> HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch,
> HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch
>
>
> The current implementation of dynamic partitioning works by keeping at least
> one record writer open per dynamic partition directory. With bucketing there
> can also be multi-spray file writers, which further add to the number of
> open record writers. Record writers for column-oriented file formats (such
> as ORC and RCFile) keep in-memory buffers (value buffers or compression
> buffers) open all the time to buffer rows and compress them before flushing
> them to disk. Because these buffers are maintained per column, the amount of
> memory required at runtime increases as the number of partitions and the
> number of columns per partition increase. This often leads to OutOfMemory
> (OOM) exceptions in mappers or reducers, depending on the number of open
> record writers. Users often tune the JVM heap size (runtime memory) to get
> around such OOM issues.
> With this optimization, rows are sorted on the dynamic partition columns and
> the bucketing columns (in the case of bucketed tables) before being fed to
> the reducers. Because the input is sorted on these columns, each reducer
> needs to keep only one record writer open at any time, which reduces the
> memory pressure on the reducers. The optimization scales well as the number
> of partitions and the number of columns per partition increase, at the cost
> of sorting on the partitioning and bucketing columns.
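
The difference between the two strategies can be illustrated with a small
simulation. This is only a sketch, not Hive code: the function names, the
(partition, value) row layout and the use of Python's sorted() as a stand-in
for the sort that happens before the reducers are all assumptions made for
illustration. Each "writer" is modelled as an in-memory list, and the sketch
tracks the peak number of writers open at once.

```python
def write_unsorted(rows):
    """Hypothetical model of the current behaviour: one writer per
    partition is opened on first sight and stays open until input ends."""
    open_writers = {}
    peak = 0
    for partition, value in rows:
        open_writers.setdefault(partition, []).append(value)
        peak = max(peak, len(open_writers))
    return open_writers, peak


def write_sorted(rows):
    """Hypothetical model of the optimization: rows arrive sorted by
    partition, so the previous writer is closed whenever the key changes."""
    closed = {}          # partition -> rows written by an already closed writer
    open_writers = {}    # holds at most one entry at any time
    current = None
    peak = 0
    # sorted() stands in for the sort before the reducers; it is stable,
    # so rows keep their original order within a partition.
    for partition, value in sorted(rows, key=lambda row: row[0]):
        if partition != current:
            if current is not None:
                closed[current] = open_writers.pop(current)  # close old writer
            open_writers[partition] = []                     # open new writer
            current = partition
        open_writers[partition].append(value)
        peak = max(peak, len(open_writers))
    if current is not None:
        closed[current] = open_writers.pop(current)
    return closed, peak


rows = [("2014-01", 1), ("2014-02", 2), ("2014-01", 3), ("2014-03", 4)]
unsorted_out, unsorted_peak = write_unsorted(rows)
sorted_out, sorted_peak = write_sorted(rows)
print(unsorted_peak, sorted_peak)
```

Both functions produce the same per-partition output; only the peak number of
simultaneously open writers differs, growing with the number of distinct
partitions in the unsorted case and staying at one in the sorted case.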