[jira] [Commented] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

Lefty Leverenz (JIRA) Mon, 17 Feb 2014 19:28:23 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903744#comment-13903744
 ]


Lefty Leverenz commented on HIVE-6455:
--------------------------------------

Patch 1 adds *hive.optimize.sort.dynamic.partition* to HiveConf.java and 
hive-default.xml.template, so when this commits the parameter needs to be added 
to the wiki with version information.

But HIVE-6037 was committed today ("Synchronize HiveConf with 
hive-default.xml.template and support show conf") so from now on HiveConf.java 
includes a description as part of the parameter definition instead of a 
comment.  For example,

{code}
    HIVEENFORCEBUCKETING("hive.enforce.bucketing", false,
        "Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced."),
    HIVEENFORCESORTING("hive.enforce.sorting", false,
        "Whether sorting is enforced. If true, while inserting into the table, sorting is enforced."),
    HIVEOPTIMIZEBUCKETINGSORTING("hive.optimize.bucketingsorting", true,
        "If hive.enforce.bucketing or hive.enforce.sorting is true, don't create a reducer for enforcing \n" +
        "bucketing/sorting for queries of the form: \n" +
        "insert overwrite table T2 select * from T1;\n" +
        "where T1 and T2 are bucketed/sorted by the same keys into the same number of buckets."),
{code}

Since hive-default.xml.template will be generated from HiveConf.java, it might 
not be necessary to add the new parameter to the template file. I'm not sure of 
that, though: maybe we need to keep editing the template file until 0.13 is 
released, so I'll raise the question in the comments on HIVE-6037.  Either way, 
adding it to the template won't do any harm.
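
To make the pattern concrete, here is a minimal sketch (not the actual HiveConf 
code) of a parameter enum that carries its description, plus a method that 
renders template entries from it. The names ConfTemplateSketch, ConfVarsSketch, 
HIVEOPTSORTDYNAMICPARTITION and toTemplate() are hypothetical, and the default 
value and description shown for hive.optimize.sort.dynamic.partition are 
placeholders, not the values in the patch.

```java
public class ConfTemplateSketch {

    // Hypothetical stand-in for the HiveConf parameter enum: each entry carries
    // its description as a constructor argument instead of a source comment.
    enum ConfVarsSketch {
        HIVEENFORCEBUCKETING("hive.enforce.bucketing", false,
            "Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced."),
        // Assumption: constant name, default and description are placeholders,
        // not taken from HIVE-6455.1.patch.
        HIVEOPTSORTDYNAMICPARTITION("hive.optimize.sort.dynamic.partition", false,
            "Placeholder description for the new parameter.");

        final String varname;
        final boolean defaultVal;
        final String description;

        ConfVarsSketch(String varname, boolean defaultVal, String description) {
            this.varname = varname;
            this.defaultVal = defaultVal;
            this.description = description;
        }
    }

    // Renders one <property> element per enum entry. No XML escaping is done,
    // which is only safe here because the sample descriptions contain no
    // markup characters.
    static String toTemplate() {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (ConfVarsSketch v : ConfVarsSketch.values()) {
            sb.append("  <property>\n")
              .append("    <name>").append(v.varname).append("</name>\n")
              .append("    <value>").append(v.defaultVal).append("</value>\n")
              .append("    <description>").append(v.description).append("</description>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toTemplate());
    }
}
```

With the description living in the enum, a new parameter only has to be 
defined once for it to appear in the generated template.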

> Scalable dynamic partitioning and bucketing optimization
> --------------------------------------------------------
>
>                 Key: HIVE-6455
>                 URL: https://issues.apache.org/jira/browse/HIVE-6455
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: optimization
>         Attachments: HIVE-6455.1.patch
>
>
> The current implementation of dynamic partitioning works by keeping at least 
> one record writer open per dynamic partition directory. With bucketing there 
> can also be multispray file writers, which further add to the number of open 
> record writers. The record writers of column-oriented file formats (like 
> ORC, RCFile etc.) keep in-memory buffers (value buffers or compression 
> buffers) open all the time to buffer up the rows and compress them before 
> flushing them to disk. Since these buffers are maintained on a per-column 
> basis, the amount of constant memory required at runtime increases as the 
> number of partitions and the number of columns per partition increase. This 
> often leads to OutOfMemory (OOM) exceptions in mappers or reducers, depending 
> on the number of open record writers. Users often tune the JVM heap size 
> (runtime memory) to get past such OOM issues. 
> With this optimization, rows are sorted on the dynamic partition columns and 
> bucketing columns (in case of bucketed tables) before being fed to the 
> reducers. Since the rows arrive sorted on the partitioning and bucketing 
> columns, each reducer needs to keep only one record writer open at any time, 
> thereby reducing the memory pressure on the reducers. This optimization 
> scales well as the number of partitions and the number of columns per 
> partition increase, at the cost of sorting on those columns.
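
The effect described above can be illustrated with a small simulation. This is 
a sketch under simplifying assumptions, not Hive's implementation: a "writer" 
is just a counter, partition keys are plain strings, and the names 
WriterCountSketch, peakWritersUnsorted and peakWritersSorted are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WriterCountSketch {

    // Unsorted input: a writer is opened the first time a partition key is
    // seen and must stay open until the end, because the key may reappear.
    // The peak number of open writers equals the number of distinct keys.
    static int peakWritersUnsorted(List<String> partitionKeys) {
        Set<String> open = new HashSet<>(partitionKeys);
        return open.size();
    }

    // Sorted input: all rows for a key are contiguous, so the previous writer
    // can be closed as soon as the key changes.
    static int peakWritersSorted(List<String> partitionKeys) {
        List<String> sorted = new ArrayList<>(partitionKeys);
        Collections.sort(sorted);
        String current = null;
        int open = 0;
        int peak = 0;
        for (String key : sorted) {
            if (!key.equals(current)) {
                if (current != null) {
                    open--; // close the writer for the previous partition
                }
                open++;     // open a writer for the new partition
                current = key;
            }
            peak = Math.max(peak, open);
        }
        return peak;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "2014-02-17", "2014-02-15", "2014-02-16", "2014-02-17", "2014-02-15");
        System.out.println("unsorted peak: " + peakWritersUnsorted(keys));
        System.out.println("sorted peak:   " + peakWritersSorted(keys));
    }
}
```

For the five sample rows the unsorted path holds three writers open at once 
and the sorted path holds one; the difference grows with the number of 
distinct partitions, while the sort itself is the added cost.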



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
