Optimal approach for changing file format of a partitioned table

2018-08-04 Thread Elliot West
Hi, I’m trying to simply change the format of a very large partitioned table from Json to ORC. I’m finding that it is unexpectedly resource intensive, primarily due to a shuffle phase with the partition key. I end up running out of disk space in what looks like a spill to disk in the reducers. How

Re: Optimal approach for changing file format of a partitioned table

2018-08-06 Thread Furcy Pin
Hi Elliot, From your description of the problem, I'm assuming that you are doing an INSERT OVERWRITE table PARTITION(p1, p2) SELECT * FROM table or something close, like a CREATE TABLE AS ... maybe. If this is the case, I suspect that your shuffle phase comes from dynamic partitioning, and in pa
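
For readers unfamiliar with the statement shape being described, here is a minimal HiveQL sketch of a dynamic-partition rewrite from a JSON-backed table into an ORC table. The table names (events_json, events_orc) and the non-partition columns (id, payload) are hypothetical and chosen only for illustration; p1 and p2 stand in for the partition keys mentioned above. It shows only the kind of query under discussion, not a recommended fix.

```sql
-- Hypothetical names throughout: events_json, events_orc, id, payload.
-- Allow partitions to be derived from the query output.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Target table with the same partition keys, stored as ORC.
CREATE TABLE events_orc (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (p1 STRING, p2 STRING)
STORED AS ORC;

-- Dynamic-partition insert: the partition columns must come last
-- in the SELECT list, in the same order as in the PARTITION clause.
INSERT OVERWRITE TABLE events_orc PARTITION (p1, p2)
SELECT id, payload, p1, p2
FROM events_json;
```

Because the partition values are only known per row at run time, a statement of this shape is where a shuffle on the partition key can be introduced, which is the behaviour the thread is asking about.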

Re: Optimal approach for changing file format of a partitioned table

2018-08-06 Thread Gopal Vijayaraghavan
A Hive version would help to preface this, because the version matters here (for example, TEZ-3709 doesn't apply to hive-1.2).

> I’m trying to simply change the format of a very large partitioned table from
> Json to ORC. I’m finding that it is unexpectedly resource intensive,
> primarily due to a shu