Re: Merging small files

Gopal Vijayaraghavan Fri, 16 Oct 2015 09:20:18 -0700

> Is there a more efficient way to have Hive merge small files on the
> files without running with two passes?


Not an entirely efficient way, but adding a shuffle stage usually works
much better, as it lets you lay out the files for better
vectorization.

Like for TPC-H, doing ETL with

create table lineitem as select * from lineitem sort by l_shipdate, l_suppkey;

will produce fewer files (exactly as many as the number of reducers), and the
output compresses better due to the natural order of transactions (saves
~20 GB or so at scale 1000).

Caveat: that is not more efficient in MRv2, only in Tez/Spark, which can
run MRR (map-reduce-reduce) pipelines as-is.
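
As a rough sketch of the above, the reducer count (and so the file count)
can be pinned before running the CTAS. This assumes Tez is available on the
cluster; the value 64, the ORC storage format, and the names lineitem_sorted
and staging.lineitem are illustrative placeholders, not anything from the
original setup.

```sql
-- Assumption: Tez is installed; per the caveat, the extra shuffle
-- stage only pays off on Tez/Spark, not on MRv2.
set hive.execution.engine=tez;

-- Assumption: 64 is an arbitrary example. A fixed reducer count
-- means one output file per reducer.
set mapred.reduce.tasks=64;

-- Hypothetical table names; ORC is an assumed storage format.
-- "sort by" orders rows within each reducer, so each output file
-- is sorted by l_shipdate, then l_suppkey.
create table lineitem_sorted stored as orc as
select * from staging.lineitem
sort by l_shipdate, l_suppkey;
```

Note that "sort by" only sorts within each reducer; it does not give a total
order across files the way "order by" would, which is what keeps the job
parallel.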

Cheers,
Gopal
