> Is there a more efficient way to have Hive merge small files on the files without running with two passes?
Not an entirely efficient way, but adding a shuffle stage usually works much better, as it gives you the ability to lay out the files for better vectorization.

For example, for TPC-H, doing ETL with

    create table lineitem as select * from lineitem sort by l_shipdate, l_suppkey;

will produce fewer files (exactly as many as your number of reducers) and compresses harder due to the natural order of transactions (saves ~20 GB or so at scale 1000).

Caveat: that is not more efficient in MRv2, only in Tez/Spark, which can run MRR (map-reduce-reduce) pipelines as-is.

Cheers,
Gopal
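
For reference, a minimal sketch of how the sorted CTAS above might be run in a Hive session. The target table name lineitem_sorted, the ORC storage format, and the reducer count of 64 are assumptions for illustration, not values from this thread; the settings shown are standard Hive properties, but how the reducer count is honored may depend on your Hive version and engine configuration.

```sql
-- Sketch only: the table name, storage format, and reducer count are assumptions.

-- Per the caveat above, the benefit is expected on Tez (or Spark), not MRv2.
set hive.execution.engine=tez;

-- Hypothetical value: "sort by" sorts within each reducer, and each reducer
-- writes one output file, so this should bound the number of files produced.
set mapred.reduce.tasks=64;

-- Rewrite the table through a shuffle so rows are ordered within each file.
create table lineitem_sorted stored as orc as
select * from lineitem
sort by l_shipdate, l_suppkey;
```

Note that "sort by" only orders rows within each reducer's output, unlike "order by", which forces a single total order; that is what keeps the job parallel while still clustering similar values together for compression.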
