[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168550#comment-14168550 ] Lefty Leverenz commented on HIVE-6455: -- Doc note: Prasanth added *hive.optimize.sort.dynamic.partition* to the wiki here: * [Configuration Properties -- hive.optimize.sort.dynamic.partition | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.sort.dynamic.partition] So I'm removing the TODOC13 and TODOC14 labels. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Fix For: 0.13.0, 0.14.0 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133075#comment-14133075 ] Lefty Leverenz commented on HIVE-6455: -- Doc note: *hive.optimize.sort.dynamic.partition* has its description in HiveConf.java (post HIVE-6037), but it isn't documented in the wiki yet. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: TODOC13, TODOC14, optimization Fix For: 0.13.0, 0.14.0 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047293#comment-14047293 ] Lefty Leverenz commented on HIVE-6455: -- Added comment to HIVE-6586 so the description of *hive.optimize.sort.dynamic.partition* won't get lost in the shuffle when HIVE-6037 changes HiveConf.java. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: TODOC13, TODOC14, optimization Fix For: 0.13.0, 0.14.0 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944943#comment-13944943 ] Hive QA commented on HIVE-6455: --- {color:green}Overall{color}: +1 all checks pass Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12636297/HIVE-6455.21.patch {color:green}SUCCESS:{color} +1 5444 tests passed Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1946/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1946/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase {noformat} This message is automatically generated. ATTACHMENT ID: 12636297 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945520#comment-13945520 ] Vikram Dixit K commented on HIVE-6455: -- Committed to trunk. Thanks Prasanth! Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1394#comment-1394 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12636170/HIVE-6455.20.patch {color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 5441 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part8 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_dyn_part {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1920/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1920/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 5 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12636170 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943861#comment-13943861 ] Prasanth J commented on HIVE-6455: -- [~leftylev] Following configs are no longer required.. hive.cache.table.desc.max.entries hive.cache.properties.max.entries hive.cache.io.format.max.entries Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941452#comment-13941452 ] Prasanth J commented on HIVE-6455: -- [~vikram.dixit] refreshed the patch. Can you apply now? Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941453#comment-13941453 ] Vikram Dixit K commented on HIVE-6455: -- Applied cleanly. Thanks for the prompt response. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941454#comment-13941454 ] Vikram Dixit K commented on HIVE-6455: -- Changes look good. +1 pending tests. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941463#comment-13941463 ] Lefty Leverenz commented on HIVE-6455: -- For the record, this adds 4 configuration parameters in HiveConf.java and hive-default.xml.template: * hive.cache.table.desc.max.entries * hive.cache.properties.max.entries * hive.cache.io.format.max.entries * hive.optimize.sort.dynamic.partition So I'll add them to the Configuration Properties wikidoc after this gets committed. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942698#comment-13942698 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12635729/HIVE-6455.19.patch {color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 5432 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_decimal org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_analyze org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketmapjoin6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_union org.apache.hive.jdbc.TestJdbcDriver2.testNewConnectionConfiguration {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1875/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1875/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 18 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12635729 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13936136#comment-13936136 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 no tests executed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12634884/HIVE-6455.17.patch Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1828/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1828/console Messages: {noformat} This message was trimmed, see log for full details [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] skip non existing resourceDirectory /data/hive-ptest/working/apache-svn-trunk-source/hwi/src/test/resources [INFO] Copying 3 resources [INFO] [INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-hwi --- [INFO] Executing tasks main: [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/warehouse [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp/conf [copy] Copying 5 files to /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp/conf [INFO] Executed tasks [INFO] [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ hive-hwi --- [INFO] Compiling 2 source files to /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/test-classes [INFO] [INFO] --- maven-surefire-plugin:2.16:test (default-test) @ hive-hwi --- [INFO] Tests are skipped. [INFO] [INFO] --- maven-jar-plugin:2.2:jar (default-jar) @ hive-hwi --- [INFO] Building jar: /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/hive-hwi-0.14.0-SNAPSHOT.jar [INFO] [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ hive-hwi --- [INFO] [INFO] --- maven-install-plugin:2.4:install (default-install) @ hive-hwi --- [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/hive-hwi-0.14.0-SNAPSHOT.jar to /data/hive-ptest/working/maven/org/apache/hive/hive-hwi/0.14.0-SNAPSHOT/hive-hwi-0.14.0-SNAPSHOT.jar [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/hwi/pom.xml to /data/hive-ptest/working/maven/org/apache/hive/hive-hwi/0.14.0-SNAPSHOT/hive-hwi-0.14.0-SNAPSHOT.pom [INFO] [INFO] [INFO] Building Hive ODBC 0.14.0-SNAPSHOT [INFO] [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hive-odbc --- [INFO] Deleting /data/hive-ptest/working/apache-svn-trunk-source/odbc (includes = [datanucleus.log, derby.log], excludes = []) [INFO] [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ hive-odbc --- [INFO] [INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-odbc --- [INFO] Executing tasks main: [INFO] Executed tasks [INFO] [INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-odbc --- [INFO] Executing tasks main: [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/odbc/target/warehouse [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp/conf [copy] Copying 5 files to /data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp/conf [INFO] Executed tasks [INFO] [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ hive-odbc --- [INFO] [INFO] --- maven-install-plugin:2.4:install (default-install) @ hive-odbc --- [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/odbc/pom.xml to /data/hive-ptest/working/maven/org/apache/hive/hive-odbc/0.14.0-SNAPSHOT/hive-odbc-0.14.0-SNAPSHOT.pom [INFO] [INFO] [INFO] Building Hive Shims Aggregator 0.14.0-SNAPSHOT [INFO] [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hive-shims-aggregator --- [INFO] Deleting /data/hive-ptest/working/apache-svn-trunk-source/shims (includes = [datanucleus.log, derby.log], excludes = []) [INFO] [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ hive-shims-aggregator --- [INFO] [INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-shims-aggregator --- [INFO] Executing tasks main: [INFO] Executed tasks [INFO] [INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-shims-aggregator --- [INFO] Executing tasks main: [mkdir] Created dir:
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932915#comment-13932915 ] Gopal V commented on HIVE-6455: --- {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.partialCopyToStandardObject(ObjectInspectorUtils.java:213) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:816) at org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:497) at org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:520) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.processKeyValues(ReduceRecordProcessor.java:276) ... 8 more {code} for a partitioned insert in Tez. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932916#comment-13932916 ] Gopal V commented on HIVE-6455: --- {code} create table session ( ... ) partitioned by (start_dt date); insert overwrite table session partition(start_dt) from session_raw; {code} with the patch .15 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13933995#comment-13933995 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12634395/HIVE-6455.16.patch {color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 5388 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_union {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1765/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1765/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 14 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12634395 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13934493#comment-13934493 ] Vikram Dixit K commented on HIVE-6455: -- Left some comments on review board. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931518#comment-13931518 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12633838/HIVE-6455.14.patch {color:red}ERROR:{color} -1 due to 16 failed/errored test(s), 5381 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_reduce_deduplicate_extended org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_dyn_part org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_union {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1707/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1707/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 16 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12633838 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921377#comment-13921377 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12632762/HIVE-6455.11.patch {color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 5356 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part14 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge4 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_dyn_part org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_map_operators org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_dyn_part_max_per_node org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_union org.apache.hive.beeline.TestSchemaTool.testSchemaInit org.apache.hive.beeline.TestSchemaTool.testSchemaUpgrade {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1632/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1632/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 18 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12632762 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918047#comment-13918047 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12632210/HIVE-6455.10.patch {color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 5209 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_mapreduce_stack_trace_hadoop20 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_union {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1603/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1603/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 4 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12632210 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917227#comment-13917227 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12631856/HIVE-6455.9.patch {color:red}ERROR:{color} -1 due to 10 failed/errored test(s), 5187 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_combine2_hadoop20 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part15 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_union org.apache.hive.service.cli.TestEmbeddedThriftBinaryCLIService.testExecuteStatementAsync {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1574/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1574/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 10 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12631856 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13912237#comment-13912237 ] Gunther Hagleitner commented on HIVE-6455: -- This is cool. Still reviewing but some ideas: - Instead of adding a column to the record to be used in the file sink, it'd be cleaner (and faster) to use the key to determine new files. I believe that could be achieved through startGroup/endGroup - Looks like we'd end up duplicating partition column, bucket column, sort columns in both key and value on the reduce sink. It might be possible to avoid that, making the intermediate output smaller. Although I'm not sure this would require additional changes to rebuild the row in the reduce task. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909858#comment-13909858 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12630450/HIVE-6455.7.patch {color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 5177 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_union {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1458/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1458/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 3 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12630450 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908801#comment-13908801 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12630239/HIVE-6455.6.patch {color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 5170 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10 org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_insert_into5 org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_insert_into6 org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_mapreduce_stack_trace_hadoop20 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_union {noformat} Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1436/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1436/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 18 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12630239 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907394#comment-13907394 ] Prasanth J commented on HIVE-6455: -- Thanks [~leftylev] for the update. I will refresh this patch once HIVE-6037 is committed. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906608#comment-13906608 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12629854/HIVE-6455.4.patch {color:red}ERROR:{color} -1 due to 70 failed/errored test(s), 5161 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part10 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part5 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part8 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part9 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition5 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_create org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats14 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats15 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_noscan_2 org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_stats org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_stats2 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_stats_counter_partitioned org.apache.hadoop.hive.ql.parse.TestParse.testParse_case_sensitivity org.apache.hadoop.hive.ql.parse.TestParse.testParse_cast1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby5 org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input20 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input5 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input8 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input_part1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_input_testsequencefile org.apache.hadoop.hive.ql.parse.TestParse.testParse_input_testxpath org.apache.hadoop.hive.ql.parse.TestParse.testParse_input_testxpath2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_join1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_join2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_join3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_join4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_join5 org.apache.hadoop.hive.ql.parse.TestParse.testParse_join6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_join7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_join8 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7 org.apache.hadoop.hive.ql.parse.TestParse.testParse_subq org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf1 org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf4 org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf6 org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf_case org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf_when
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906693#comment-13906693 ] Lefty Leverenz commented on HIVE-6455: -- HIVE-6037 was reopened, so HiveConf.java hasn't changed yet. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Lefty Leverenz Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906694#comment-13906694 ] Lefty Leverenz commented on HIVE-6455: -- Yikes, didn't mean to reassign this -- sorry. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Lefty Leverenz Labels: optimization Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904168#comment-13904168 ] Hive QA commented on HIVE-6455: --- {color:red}Overall{color}: -1 no tests executed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12629461/HIVE-6455.1.patch Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1382/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1382/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Tests exited with: NonZeroExitCodeException Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n '' ]] + export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m ' + ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m ' + export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128' + M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128' + cd /data/hive-ptest/working/ + tee /data/hive-ptest/logs/PreCommit-HIVE-Build-1382/source-prep.txt + [[ false == \t\r\u\e ]] + mkdir -p maven ivy + [[ svn = \s\v\n ]] + [[ -n '' ]] + [[ -d apache-svn-trunk-source ]] + [[ ! -d apache-svn-trunk-source/.svn ]] + [[ ! -d apache-svn-trunk-source ]] + cd apache-svn-trunk-source + svn revert -R . Reverted 'conf/hive-default.xml.template' Reverted 'ql/src/test/results/clientnegative/udf_format_number_wrong2.q.out' Reverted 'ql/src/test/results/clientnegative/udf_format_number_wrong4.q.out' Reverted 'ql/src/test/results/clientnegative/udf_format_number_wrong6.q.out' Reverted 'ql/src/test/results/clientnegative/udf_format_number_wrong1.q.out' Reverted 'ql/src/test/results/clientpositive/udf_format_number.q.out' Reverted 'ql/src/test/queries/clientnegative/udf_format_number_wrong6.q' Reverted 'ql/src/test/queries/clientpositive/udf_format_number.q' Reverted 'ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFFormatNumber.java' ++ awk '{print $2}' ++ egrep -v '^X|^Performing status on external' ++ svn status --no-ignore + rm -rf target datanucleus.log ant/target shims/target shims/0.20/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/common-secure/target packaging/target hbase-handler/target testutils/target jdbc/target metastore/target itests/target itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit/target itests/custom-serde/target itests/util/target hcatalog/target hcatalog/storage-handlers/hbase/target hcatalog/server-extensions/target hcatalog/core/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target hcatalog/hcatalog-pig-adapter/target hwi/target common/target common/src/gen contrib/target service/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target + svn update Fetching external item into 'hcatalog/src/test/e2e/harness' External at revision 1569394. At revision 1569394. + patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh + patchFilePath=/data/hive-ptest/working/scratch/build.patch + [[ -f /data/hive-ptest/working/scratch/build.patch ]] + chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh + /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch The patch does not appear to apply with p0, p1, or p2 + exit 1 ' {noformat} This message is automatically generated. ATTACHMENT ID: 12629461 Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to
[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization
[ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903744#comment-13903744 ] Lefty Leverenz commented on HIVE-6455: -- Patch 1 adds *hive.optimize.sort.dynamic.partition* to HiveConf.java and hive-default.xml.template, so when this commits the parameter needs to be added to the wiki with version information. But HIVE-6037 was committed today (Synchronize HiveConf with hive-default.xml.template and support show conf) so from now on HiveConf.java includes a description as part of the parameter definition instead of a comment. For example, {code} HIVEENFORCEBUCKETING(hive.enforce.bucketing, false, Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.), HIVEENFORCESORTING(hive.enforce.sorting, false, Whether sorting is enforced. If true, while inserting into the table, sorting is enforced.), HIVEOPTIMIZEBUCKETINGSORTING(hive.optimize.bucketingsorting, true, If hive.enforce.bucketing or hive.enforce.sorting is true, don't create a reducer for enforcing \n + bucketing/sorting for queries of the form: \n + insert overwrite table T2 select * from T1;\n + where T1 and T2 are bucketed/sorted by the same keys into the same number of buckets.), {code} Since hive-default.xml.template will be generated from HiveConf.java, it might not be necessary to add the new parameter to the template file -- but I'm not sure of that, maybe we need to keep editing the template file until 0.13 is released, so I'll raise the question in the comments on HIVE-6037. Anyway it won't do any harm. Scalable dynamic partitioning and bucketing optimization Key: HIVE-6455 URL: https://issues.apache.org/jira/browse/HIVE-6455 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.13.0 Reporter: Prasanth J Assignee: Prasanth J Labels: optimization Attachments: HIVE-6455.1.patch The current implementation of dynamic partition works by keeping at least one record writer open per dynamic partition directory. In case of bucketing there can be multispray file writers which further adds up to the number of open record writers. The record writers of column oriented file format (like ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or compression buffers) open all the time to buffer up the rows and compress them before flushing it to disk. Since these buffers are maintained per column basis the amount of constant memory that will required at runtime increases as the number of partitions and number of columns per partition increases. This often leads to OutOfMemory (OOM) exception in mappers or reducers depending on the number of open record writers. Users often tune the JVM heapsize (runtime memory) to get over such OOM issues. With this optimization, the dynamic partition columns and bucketing columns (in case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers. This optimization is highly scalable as the number of partition and number of columns per partition increases at the cost of sorting the columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)