[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-10-12 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168550#comment-14168550
 ] 

Lefty Leverenz commented on HIVE-6455:
--

Doc note:  Prasanth added *hive.optimize.sort.dynamic.partition* to the wiki 
here:

* [Configuration Properties -- hive.optimize.sort.dynamic.partition | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.sort.dynamic.partition]

So I'm removing the TODOC13 and TODOC14 labels.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Fix For: 0.13.0, 0.14.0

 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, 
 HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, 
 HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, 
 HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-09-13 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133075#comment-14133075
 ] 

Lefty Leverenz commented on HIVE-6455:
--

Doc note:  *hive.optimize.sort.dynamic.partition* has its description in 
HiveConf.java (post HIVE-6037), but it isn't documented in the wiki yet.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: TODOC13, TODOC14, optimization
 Fix For: 0.13.0, 0.14.0

 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, 
 HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, 
 HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, 
 HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-06-29 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047293#comment-14047293
 ] 

Lefty Leverenz commented on HIVE-6455:
--

Added comment to HIVE-6586 so the description of 
*hive.optimize.sort.dynamic.partition* won't get lost in the shuffle when 
HIVE-6037 changes HiveConf.java.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: TODOC13, TODOC14, optimization
 Fix For: 0.13.0, 0.14.0

 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, 
 HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, 
 HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, 
 HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-24 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944943#comment-13944943
 ] 

Hive QA commented on HIVE-6455:
---



{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12636297/HIVE-6455.21.patch

{color:green}SUCCESS:{color} +1 5444 tests passed

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1946/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1946/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12636297

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, 
 HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, 
 HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, 
 HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-24 Thread Vikram Dixit K (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945520#comment-13945520
 ] 

Vikram Dixit K commented on HIVE-6455:
--

Committed to trunk. Thanks Prasanth!

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, 
 HIVE-6455.21.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, 
 HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, 
 HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1394#comment-1394
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12636170/HIVE-6455.20.patch

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 5441 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_dyn_part
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1920/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1920/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12636170

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.3.patch, 
 HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, 
 HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-21 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943861#comment-13943861
 ] 

Prasanth J commented on HIVE-6455:
--

[~leftylev] Following configs are no longer required..
hive.cache.table.desc.max.entries
hive.cache.properties.max.entries
hive.cache.io.format.max.entries

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.20.patch, HIVE-6455.3.patch, 
 HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, 
 HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-20 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941452#comment-13941452
 ] 

Prasanth J commented on HIVE-6455:
--

[~vikram.dixit] refreshed the patch. Can you apply now?

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, 
 HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, 
 HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-20 Thread Vikram Dixit K (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941453#comment-13941453
 ] 

Vikram Dixit K commented on HIVE-6455:
--

Applied cleanly. Thanks for the prompt response.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, 
 HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, 
 HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-20 Thread Vikram Dixit K (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941454#comment-13941454
 ] 

Vikram Dixit K commented on HIVE-6455:
--

Changes look good. +1 pending tests.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, 
 HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, 
 HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-20 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941463#comment-13941463
 ] 

Lefty Leverenz commented on HIVE-6455:
--

For the record, this adds 4 configuration parameters in HiveConf.java and 
hive-default.xml.template:

* hive.cache.table.desc.max.entries
* hive.cache.properties.max.entries
* hive.cache.io.format.max.entries
* hive.optimize.sort.dynamic.partition

So I'll add them to the Configuration Properties wikidoc after this gets 
committed.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, 
 HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, 
 HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-20 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942698#comment-13942698
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12635729/HIVE-6455.19.patch

{color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 5432 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_decimal
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_analyze
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketmapjoin6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_union
org.apache.hive.jdbc.TestJdbcDriver2.testNewConnectionConfiguration
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1875/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1875/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 18 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12635729

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, 
 HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, 
 HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, 
 HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-15 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13936136#comment-13936136
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12634884/HIVE-6455.17.patch

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1828/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1828/console

Messages:
{noformat}
 This message was trimmed, see log for full details 
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 
/data/hive-ptest/working/apache-svn-trunk-source/hwi/src/test/resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-hwi ---
[INFO] Executing tasks

main:
[mkdir] Created dir: 
/data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp
[mkdir] Created dir: 
/data/hive-ptest/working/apache-svn-trunk-source/hwi/target/warehouse
[mkdir] Created dir: 
/data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp/conf
 [copy] Copying 5 files to 
/data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp/conf
[INFO] Executed tasks
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
hive-hwi ---
[INFO] Compiling 2 source files to 
/data/hive-ptest/working/apache-svn-trunk-source/hwi/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.16:test (default-test) @ hive-hwi ---
[INFO] Tests are skipped.
[INFO] 
[INFO] --- maven-jar-plugin:2.2:jar (default-jar) @ hive-hwi ---
[INFO] Building jar: 
/data/hive-ptest/working/apache-svn-trunk-source/hwi/target/hive-hwi-0.14.0-SNAPSHOT.jar
[INFO] 
[INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ 
hive-hwi ---
[INFO] 
[INFO] --- maven-install-plugin:2.4:install (default-install) @ hive-hwi ---
[INFO] Installing 
/data/hive-ptest/working/apache-svn-trunk-source/hwi/target/hive-hwi-0.14.0-SNAPSHOT.jar
 to 
/data/hive-ptest/working/maven/org/apache/hive/hive-hwi/0.14.0-SNAPSHOT/hive-hwi-0.14.0-SNAPSHOT.jar
[INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/hwi/pom.xml 
to 
/data/hive-ptest/working/maven/org/apache/hive/hive-hwi/0.14.0-SNAPSHOT/hive-hwi-0.14.0-SNAPSHOT.pom
[INFO] 
[INFO] 
[INFO] Building Hive ODBC 0.14.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hive-odbc ---
[INFO] Deleting /data/hive-ptest/working/apache-svn-trunk-source/odbc (includes 
= [datanucleus.log, derby.log], excludes = [])
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ hive-odbc ---
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-odbc ---
[INFO] Executing tasks

main:
[INFO] Executed tasks
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-odbc ---
[INFO] Executing tasks

main:
[mkdir] Created dir: 
/data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp
[mkdir] Created dir: 
/data/hive-ptest/working/apache-svn-trunk-source/odbc/target/warehouse
[mkdir] Created dir: 
/data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp/conf
 [copy] Copying 5 files to 
/data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp/conf
[INFO] Executed tasks
[INFO] 
[INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ 
hive-odbc ---
[INFO] 
[INFO] --- maven-install-plugin:2.4:install (default-install) @ hive-odbc ---
[INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/odbc/pom.xml 
to 
/data/hive-ptest/working/maven/org/apache/hive/hive-odbc/0.14.0-SNAPSHOT/hive-odbc-0.14.0-SNAPSHOT.pom
[INFO] 
[INFO] 
[INFO] Building Hive Shims Aggregator 0.14.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hive-shims-aggregator 
---
[INFO] Deleting /data/hive-ptest/working/apache-svn-trunk-source/shims 
(includes = [datanucleus.log, derby.log], excludes = [])
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
hive-shims-aggregator ---
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ 
hive-shims-aggregator ---
[INFO] Executing tasks

main:
[INFO] Executed tasks
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ 
hive-shims-aggregator ---
[INFO] Executing tasks

main:
[mkdir] Created dir: 

[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-13 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932915#comment-13932915
 ] 

Gopal V commented on HIVE-6455:
---

{code}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.partialCopyToStandardObject(ObjectInspectorUtils.java:213)
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:816)
at 
org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:497)
at org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:520)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.processKeyValues(ReduceRecordProcessor.java:276)
... 8 more
{code}

for a partitioned insert in Tez.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, 
 HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, 
 HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-13 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932916#comment-13932916
 ] 

Gopal V commented on HIVE-6455:
---

{code}
create table session ( ... )
partitioned by (start_dt date);

insert overwrite table session partition(start_dt)  from session_raw;
{code}

with the patch .15

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, 
 HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, 
 HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-13 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13933995#comment-13933995
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12634395/HIVE-6455.16.patch

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 5388 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_union
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1765/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1765/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12634395

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, 
 HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, 
 HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-13 Thread Vikram Dixit K (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13934493#comment-13934493
 ] 

Vikram Dixit K commented on HIVE-6455:
--

Left some comments on review board.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, 
 HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, 
 HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, 
 HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931518#comment-13931518
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12633838/HIVE-6455.14.patch

{color:red}ERROR:{color} -1 due to 16 failed/errored test(s), 5381 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_reduce_deduplicate_extended
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_dyn_part
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_union
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1707/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1707/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 16 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12633838

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, 
 HIVE-6455.14.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, 
 HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, 
 HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-05 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921377#comment-13921377
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12632762/HIVE-6455.11.patch

{color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 5356 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge4
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_dyn_part
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_map_operators
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_dyn_part_max_per_node
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_union
org.apache.hive.beeline.TestSchemaTool.testSchemaInit
org.apache.hive.beeline.TestSchemaTool.testSchemaUpgrade
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1632/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1632/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 18 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12632762

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, 
 HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, 
 HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, 
 HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-03 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918047#comment-13918047
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12632210/HIVE-6455.10.patch

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 5209 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_mapreduce_stack_trace_hadoop20
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_union
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1603/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1603/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12632210

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, 
 HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, 
 HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-03-01 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917227#comment-13917227
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12631856/HIVE-6455.9.patch

{color:red}ERROR:{color} -1 due to 10 failed/errored test(s), 5187 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_combine2_hadoop20
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part15
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_union
org.apache.hive.service.cli.TestEmbeddedThriftBinaryCLIService.testExecuteStatementAsync
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1574/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1574/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 10 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12631856

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, 
 HIVE-6455.10.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, 
 HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, 
 HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-25 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13912237#comment-13912237
 ] 

Gunther Hagleitner commented on HIVE-6455:
--

This is cool. Still reviewing but some ideas:

- Instead of adding a column to the record to be used in the file sink, it'd be 
cleaner (and faster) to use the key to determine new files. I believe that 
could be achieved through startGroup/endGroup
- Looks like we'd end up duplicating partition column, bucket column, sort 
columns in both key and value on the reduce sink. It might be possible to avoid 
that, making the intermediate output smaller. Although I'm not sure this would 
require additional changes to rebuild the row in the reduce task.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, 
 HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, 
 HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909858#comment-13909858
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12630450/HIVE-6455.7.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 5177 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_union
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1458/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1458/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12630450

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, 
 HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, 
 HIVE-6455.6.patch, HIVE-6455.7.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908801#comment-13908801
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12630239/HIVE-6455.6.patch

{color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 5170 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_insert_into5
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_insert_into6
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_mapreduce_stack_trace_hadoop20
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_union
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1436/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1436/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 18 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12630239

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, 
 HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, 
 HIVE-6455.6.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-20 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907394#comment-13907394
 ] 

Prasanth J commented on HIVE-6455:
--

Thanks [~leftylev] for the update. I will refresh this patch once HIVE-6037 is 
committed.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, 
 HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906608#comment-13906608
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12629854/HIVE-6455.4.patch

{color:red}ERROR:{color} -1 due to 70 failed/errored test(s), 5161 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_dyn_part9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge_dynamic_partition5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_create
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats15
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_noscan_2
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_stats
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_stats2
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_stats_counter_partitioned
org.apache.hadoop.hive.ql.parse.TestParse.testParse_case_sensitivity
org.apache.hadoop.hive.ql.parse.TestParse.testParse_cast1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby5
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input20
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input5
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input8
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input9
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input_part1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input_testsequencefile
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input_testxpath
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input_testxpath2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_join1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_join2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_join3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_join4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_join5
org.apache.hadoop.hive.ql.parse.TestParse.testParse_join6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_join7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_join8
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample3
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample5
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_sample7
org.apache.hadoop.hive.ql.parse.TestParse.testParse_subq
org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf1
org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf6
org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf_case
org.apache.hadoop.hive.ql.parse.TestParse.testParse_udf_when

[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-19 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906693#comment-13906693
 ] 

Lefty Leverenz commented on HIVE-6455:
--

HIVE-6037 was reopened, so HiveConf.java hasn't changed yet.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Lefty Leverenz
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, 
 HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-19 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906694#comment-13906694
 ] 

Lefty Leverenz commented on HIVE-6455:
--

Yikes, didn't mean to reassign this -- sorry.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Lefty Leverenz
  Labels: optimization
 Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, 
 HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904168#comment-13904168
 ] 

Hive QA commented on HIVE-6455:
---



{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12629461/HIVE-6455.1.patch

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1382/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1382/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ [[ -n '' ]]
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost 
-Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost 
-Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-Build-1382/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted 'conf/hive-default.xml.template'
Reverted 'ql/src/test/results/clientnegative/udf_format_number_wrong2.q.out'
Reverted 'ql/src/test/results/clientnegative/udf_format_number_wrong4.q.out'
Reverted 'ql/src/test/results/clientnegative/udf_format_number_wrong6.q.out'
Reverted 'ql/src/test/results/clientnegative/udf_format_number_wrong1.q.out'
Reverted 'ql/src/test/results/clientpositive/udf_format_number.q.out'
Reverted 'ql/src/test/queries/clientnegative/udf_format_number_wrong6.q'
Reverted 'ql/src/test/queries/clientpositive/udf_format_number.q'
Reverted 
'ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFFormatNumber.java'
++ awk '{print $2}'
++ egrep -v '^X|^Performing status on external'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20/target 
shims/0.20S/target shims/0.23/target shims/aggregator/target 
shims/common/target shims/common-secure/target packaging/target 
hbase-handler/target testutils/target jdbc/target metastore/target 
itests/target itests/hcatalog-unit/target itests/test-serde/target 
itests/qtest/target itests/hive-unit/target itests/custom-serde/target 
itests/util/target hcatalog/target hcatalog/storage-handlers/hbase/target 
hcatalog/server-extensions/target hcatalog/core/target 
hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target 
hcatalog/hcatalog-pig-adapter/target hwi/target common/target common/src/gen 
contrib/target service/target serde/target beeline/target odbc/target 
cli/target ql/dependency-reduced-pom.xml ql/target
+ svn update

Fetching external item into 'hcatalog/src/test/e2e/harness'
External at revision 1569394.

At revision 1569394.
+ patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hive-ptest/working/scratch/build.patch
+ [[ -f /data/hive-ptest/working/scratch/build.patch ]]
+ chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
+ /data/hive-ptest/working/scratch/smart-apply-patch.sh 
/data/hive-ptest/working/scratch/build.patch
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12629461

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to 

[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

2014-02-17 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903744#comment-13903744
 ] 

Lefty Leverenz commented on HIVE-6455:
--

Patch 1 adds *hive.optimize.sort.dynamic.partition* to HiveConf.java and 
hive-default.xml.template, so when this commits the parameter needs to be added 
to the wiki with version information.

But HIVE-6037 was committed today (Synchronize HiveConf with 
hive-default.xml.template and support show conf) so from now on HiveConf.java 
includes a description as part of the parameter definition instead of a 
comment.  For example,

{code}
HIVEENFORCEBUCKETING(hive.enforce.bucketing, false,
Whether bucketing is enforced. If true, while inserting into the 
table, bucketing is enforced.),
HIVEENFORCESORTING(hive.enforce.sorting, false,
Whether sorting is enforced. If true, while inserting into the table, 
sorting is enforced.),
HIVEOPTIMIZEBUCKETINGSORTING(hive.optimize.bucketingsorting, true,
If hive.enforce.bucketing or hive.enforce.sorting is true, don't 
create a reducer for enforcing \n +
bucketing/sorting for queries of the form: \n +
insert overwrite table T2 select * from T1;\n +
where T1 and T2 are bucketed/sorted by the same keys into the same 
number of buckets.),
{code}

Since hive-default.xml.template will be generated from HiveConf.java, it might 
not be necessary to add the new parameter to the template file -- but I'm not 
sure of that, maybe we need to keep editing the template file until 0.13 is 
released, so I'll raise the question in the comments on HIVE-6037.  Anyway it 
won't do any harm.

 Scalable dynamic partitioning and bucketing optimization
 

 Key: HIVE-6455
 URL: https://issues.apache.org/jira/browse/HIVE-6455
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: optimization
 Attachments: HIVE-6455.1.patch


 The current implementation of dynamic partition works by keeping at least one 
 record writer open per dynamic partition directory. In case of bucketing 
 there can be multispray file writers which further adds up to the number of 
 open record writers. The record writers of column oriented file format (like 
 ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
 compression buffers) open all the time to buffer up the rows and compress 
 them before flushing it to disk. Since these buffers are maintained per 
 column basis the amount of constant memory that will required at runtime 
 increases as the number of partitions and number of columns per partition 
 increases. This often leads to OutOfMemory (OOM) exception in mappers or 
 reducers depending on the number of open record writers. Users often tune the 
 JVM heapsize (runtime memory) to get over such OOM issues. 
 With this optimization, the dynamic partition columns and bucketing columns 
 (in case of bucketed tables) are sorted before being fed to the reducers. 
 Since the partitioning and bucketing columns are sorted, each reducers can 
 keep only one record writer open at any time thereby reducing the memory 
 pressure on the reducers. This optimization is highly scalable as the number 
 of partition and number of columns per partition increases at the cost of 
 sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)