> I'm using Hive 1.2.1 with LLAP on HDP 2.6.5. Tez AM is 3GB, there are 3
> daemons for a total of 34816 MB.
Assuming you're using Hive2 here (with LLAP) - LLAP kinda sucks for ETL workloads, but that's a different problem.

> PARTITIONED BY (DATAPASSAGGIO string, ORAPASSAGGIO string)
> CLUSTERED BY (ID_TICKETTYPE, ID_PERSONTYPE, NPOOLNR, NKASSANR) INTO 8 BUCKETS
> STORED AS ORC
...
> Total number of partitions is 137k.

20 GB divided across 137k partitions makes for very poorly written ORC files - I'd guess each file ends up with too few rows (much smaller than 1 HDFS block). Partitioning this finely is also a performance problem at compile time.

You can make this insert work by changing the insert's shuffle mechanism (run an explain with and without it to see the difference):

set hive.optimize.sort.dynamic.partition=true;
-- https://issues.apache.org/jira/browse/HIVE-6455

But I suspect you will be very disappointed by the read-query performance after this insert.

> ,NPOOLNR decimal(4,0)
> ,NZUTRNR decimal(3,0)
> ,NKASSANR decimal(3,0)
> ,ID_TICKETTYPE decimal(5,0)
> ,ID_PERSONTYPE decimal(6,0)
> ,ID_TICKETPERSONTYPEDEF decimal(6,0)

That's also going to hurt - your schema raises a lot of the red flags I see when people first migrate to Hive.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/

In general, you need to fix the partition count, the bucketing structure (note that CLUSTERED BY does not actually "cluster" the rows - you need an additional "SORTED BY"), and the zero-scale decimals.

Can you try running with the settings in this gist (& see what your query read-perf looks like)?

https://gist.github.com/t3rmin4t0r/087b61f79514673c307bb9a88327a4db

Cheers,
Gopal
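P.S. A minimal end-to-end sketch of the shuffle change above (the table names here are made up for illustration - only the column names and hive.optimize.sort.dynamic.partition come from this thread, and the max-dynamic-partitions limits are rough guesses you'd need to size for 137k partitions):

    set hive.exec.dynamic.partition.mode=nonstrict;
    -- HIVE-6455: shuffle + sort on the partition keys, so each reducer
    -- writes out one partition's ORC file at a time
    set hive.optimize.sort.dynamic.partition=true;
    -- 137k partitions will exceed the default limits (1000 total / 100 per node)
    set hive.exec.max.dynamic.partitions=200000;
    set hive.exec.max.dynamic.partitions.pernode=200000;

    insert overwrite table passaggi partition (DATAPASSAGGIO, ORAPASSAGGIO)
    select ID_TICKETTYPE, ID_PERSONTYPE, ID_TICKETPERSONTYPEDEF,
           NPOOLNR, NZUTRNR, NKASSANR,
           DATAPASSAGGIO, ORAPASSAGGIO   -- dynamic partition columns go last
    from passaggi_staging;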
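And a sketch of a coarser layout along those lines - partition on the date only, bucket and sort within each partition, and use plain int types instead of zero-scale decimals (the exact types, sort keys and bucket count are illustrative guesses, not tuned for this data):

    create table passaggi (
      ORAPASSAGGIO            string,    -- hour drops out of the partition spec
      ID_TICKETTYPE           int,       -- decimal(5,0) -> int
      ID_PERSONTYPE           int,       -- decimal(6,0) -> int
      ID_TICKETPERSONTYPEDEF  int,
      NPOOLNR                 smallint,  -- decimal(4,0) -> smallint
      NZUTRNR                 smallint,
      NKASSANR                smallint
    )
    partitioned by (DATAPASSAGGIO string)
    clustered by (ID_TICKETTYPE, ID_PERSONTYPE, NPOOLNR, NKASSANR)
      sorted by (ID_TICKETTYPE, ID_PERSONTYPE) into 8 buckets
    stored as orc;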
