Hi there,

We've been encountering the exception

Error: java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal 
error occurred when node tried to create too many dynamic partitions. The 
maximum number of dynamic partitions is controlled by 
hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. 
Maximum was set to: 100

on a very small dataset (180 lines), using the following setup:

CREATE TABLE enriched_data (
enriched_json_data string
)
PARTITIONED BY (yyyy string, mm string, dd string, identifier string, 
sub_identifier string, unique_run_id string)
CLUSTERED BY (enriched_json_data) INTO 128 BUCKETS
LOCATION "${OUTDIR}";

INSERT OVERWRITE TABLE enriched_data PARTITION (yyyy, mm, dd, identifier, 
sub_identifier, unique_run_id)
SELECT …

We've not seen this issue before (normally our dataset is billions of lines), 
but in this case a very tiny amount of data is triggering it.
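
(For completeness: we realise we could presumably work around this by bumping 
the limits the error message mentions, along the lines of the snippet below, 
where the values are purely illustrative, but we'd much rather understand why 
the limit is being hit at all on 180 lines.)

SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=500;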

After looking at the code, it appears that this condition is the one failing:
https://github.com/apache/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L745
I downloaded and rebuilt the branch with a bit of debugging/stdout printing of 
the contents of the valToPaths map, and it fails because there are 101 entries in it.

All the entries look like this:

yyyy=2015/mm=04/dd=09/identifier=1/sub-identifier=3/unique_run_id=df-345345/000047_0
yyyy=2015/mm=04/dd=09/identifier=1/sub-identifier=3/unique_run_id=df-345345/000048_0
yyyy=2015/mm=04/dd=09/identifier=1/sub-identifier=3/unique_run_id=df-345345/000049_0
yyyy=2015/mm=04/dd=09/identifier=1/sub-identifier=3/unique_run_id=df-345345/000051_0
….
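
(The 101 entries line up against the per-node limit of 100 quoted in the error; 
running

SET hive.exec.max.dynamic.partitions.pernode;

in our session also prints 100, which we believe is simply the default.)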

We're just confused as to why Hive considers the final part of the output path 
(e.g. 000047_0) to be a "dynamic partition", as it is not in our PARTITIONED 
BY clause.

The only thing I can think of is that the CLUSTERED BY ... INTO 128 BUCKETS 
clause, combined with the dataset being so small (180 lines), is sending 
everything to a single reducer task, while the hashing of each row still 
distributes the rows fairly uniformly across the buckets, so that one reducer 
ends up with more than 100 bucket files to write.
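
(If that theory is right, a quick hypothetical test, with a made-up table name 
and location purely for illustration, would be to recreate the table with far 
fewer buckets, e.g.

CREATE TABLE enriched_data_test (
enriched_json_data string
)
PARTITIONED BY (yyyy string, mm string, dd string, identifier string, 
sub_identifier string, unique_run_id string)
CLUSTERED BY (enriched_json_data) INTO 32 BUCKETS
LOCATION "${OUTDIR}_test";

and see whether the same INSERT then stays under the per-node limit, but that 
still wouldn't explain why bucket files are counted as dynamic partitions in 
the first place.)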

Any help will be greatly appreciated.

With thanks,

Daniel Harper
Software Engineer, OTG ANT
BC5 A5
