[
https://issues.apache.org/jira/browse/HIVE-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan reassigned HIVE-6872:
--------------------------------------
Assignee: Rajesh Balamohan
> Explore options of optimizing FileSinkOperator-->getDynOutPaths()
> -----------------------------------------------------------------
>
> Key: HIVE-6872
> URL: https://issues.apache.org/jira/browse/HIVE-6872
> Project: Hive
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Critical
>
> 1. Download hive-testbench from
> https://github.com/cartershanklin/hive-testbench
> 2. Generate data using "./tpcds-setup.sh 10 /user/hive/external partitioned"
> 3. Data population for most tables with "partition + bucket + sorted data"
> runs very slowly, even at a scale factor of 10 on a 20-node cluster.
> The bottleneck appears to be in FileSinkOperator-->getDynOutPaths(), where it
> closes FSPath writers. Each call takes roughly 150-200 ms.
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions.pernode=4096;
> With the above settings, one of the data loads (for the web_sales table) spent
> roughly 4096 * 150 ms = 614,400 ms (about 600 seconds) just closing the writers
> sequentially.
> The purpose of this JIRA is to identify options for optimizing this code path.
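
To make the arithmetic above concrete, here is a minimal Java sketch of the cost of closing writers one at a time versus closing them on a small thread pool. It is not Hive code: SlowWriter, WriterCloseSketch, and the pool size of 16 are hypothetical names and values chosen for illustration, and parallel closing is only one possible option, not necessarily the one that fits FileSinkOperator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WriterCloseSketch {

    /** Hypothetical stand-in for a per-partition writer; not a Hive class. */
    interface SlowWriter {
        void close() throws Exception;
    }

    /** Estimated wall-clock time in ms when closes run one after another. */
    static long sequentialEstimateMillis(int writers, long closeMillis) {
        return writers * closeMillis;
    }

    /** Estimated wall-clock time in ms when closes are spread over a fixed number of threads. */
    static long parallelEstimateMillis(int writers, long closeMillis, int threads) {
        long rounds = (writers + threads - 1) / threads; // ceiling division
        return rounds * closeMillis;
    }

    /** Closes all writers on a fixed-size pool and rethrows the first failure seen. */
    static void closeAll(List<SlowWriter> writers, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (SlowWriter w : writers) {
                // Returning null makes this lambda a Callable, so close() may throw.
                pending.add(pool.submit(() -> { w.close(); return null; }));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for completion; wraps a close() failure in ExecutionException
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Figures from the description: 4096 dynamic partitions, about 150 ms per close.
        System.out.println("sequential ms: " + sequentialEstimateMillis(4096, 150));
        // Assumed pool size of 16, purely illustrative.
        System.out.println("parallel ms:   " + parallelEstimateMillis(4096, 150, 16));
    }
}
```

The estimate assumes each close costs the same and that the underlying filesystem can absorb concurrent closes without slowing down, which may not hold on a real cluster.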
--
This message was sent by Atlassian JIRA
(v6.2#6252)