> I'm using Hive 1.2.1 with LLAP on HDP 2.6.5. Tez AM is 3GB, there are 3 
> daemons for a total of 34816 MB.

Assuming you're using Hive2 here (with LLAP) - LLAP kinda sucks for ETL 
workloads, but that's a different problem.

> PARTITIONED BY (DATAPASSAGGIO string, ORAPASSAGGIO string)
> CLUSTERED BY (ID_TICKETTYPE, ID_PERSONTYPE, NPOOLNR, NKASSANR) INTO 8 BUCKETS 
> STORED AS ORC
...
> Total number of partitions is 137k.

20GB spread across 137k partitions makes for very poorly written ORC files - 
I'd guess each file ends up with far too few rows (much smaller than 1 HDFS 
block). Partitioning this finely is also a performance problem at compile time.
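
Rough math, assuming that 20GB is the total table size and every partition 
really ends up with all 8 bucket files:

    20 GB / 137,000 partitions             ~= 150 KB per partition
    150 KB / 8 bucket files per partition  ~= 19 KB per ORC file
    137,000 partitions x 8 buckets          = ~1.1M files total

Each file is thousands of times smaller than a 128 MB HDFS block, which is far 
too small for ORC stripes or reader throughput to work well.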

You can make this insert work by changing the insert shuffle mechanism (run an 
explain with/without to see the difference).

set hive.optimize.sort.dynamic.partition=true; -- https://issues.apache.org/jira/browse/HIVE-6455
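
For example (the table names below are made up and the select list is 
abridged - the point is just to compare the two plans):

    explain
    insert overwrite table passaggi partition (DATAPASSAGGIO, ORAPASSAGGIO)
    select
      NPOOLNR, NZUTRNR, NKASSANR,
      ID_TICKETTYPE, ID_PERSONTYPE, ID_TICKETPERSONTYPEDEF,
      DATAPASSAGGIO, ORAPASSAGGIO   -- dynamic partition columns go last
    from passaggi_staging;

With the flag on, the plan picks up an extra reducer stage that sorts rows by 
the dynamic partition (and bucket) keys, so each writer keeps only one 
partition's ORC file open at a time instead of holding thousands of open 
writers in memory.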

But I suspect you will be very disappointed by the performance of the read 
queries after this insert.

>      ,NPOOLNR decimal(4,0)
>     ,NZUTRNR decimal(3,0)
>     ,NKASSANR decimal(3,0)
>     ,ID_TICKETTYPE decimal(5,0)
>     ,ID_PERSONTYPE decimal(6,0)
>     ,ID_TICKETPERSONTYPEDEF decimal(6,0)

That's also going to hurt - your schema raises a lot of the red flags I see 
from people who have just migrated to Hive.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/

In general, you need to fix the partition count, the bucketing structure 
(CLUSTERED BY does not actually "cluster" the rows on its own - you need an 
additional "sorted by"), and the zero-scale decimals.
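
Roughly what I mean, as a sketch (the table name is made up and the column 
list is only the part you quoted - the important bits are the coarser 
partitioning, the SORTED BY, and int instead of decimal(x,0)):

    CREATE TABLE passaggi_v2 (
      ORAPASSAGGIO string,        -- hour moves out of the partition spec
      NPOOLNR int,
      NZUTRNR int,
      NKASSANR int,
      ID_TICKETTYPE int,
      ID_PERSONTYPE int,
      ID_TICKETPERSONTYPEDEF int
    )
    PARTITIONED BY (DATAPASSAGGIO string)   -- one partition per day, not per hour
    CLUSTERED BY (ID_TICKETTYPE, ID_PERSONTYPE, NPOOLNR, NKASSANR)
    SORTED BY (ID_TICKETTYPE, ID_PERSONTYPE, NPOOLNR, NKASSANR)
    INTO 8 BUCKETS
    STORED AS ORC;

The SORTED BY is what actually lays rows out in order inside each bucket file, 
which is what lets ORC row-group indexes and predicate push-down do something 
useful.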

Can you try running with the settings in this gist & see what your query 
read-perf looks like?

https://gist.github.com/t3rmin4t0r/087b61f79514673c307bb9a88327a4db

Cheers,
Gopal

