Changed it to sort by.
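
For reference, a sketch of what the revised statement presumably looks like, assuming the only change was swapping `order by` for `sort by` in the query quoted below (table and column names are taken from that query; nothing else is altered):

```sql
-- Assumption: identical to the quoted query except ORDER BY -> SORT BY.
-- ORDER BY requires a total order, which Hive enforces with a single reducer;
-- SORT BY only orders rows within each reducer, so the stage can run in parallel.
insert into upstreamparam_org partition(day_ts, cmtsid)
select * from upstreamparam_20151013
sort by datats, macaddress;
```

With `sort by`, each output file is sorted internally but there is no global order across files.
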

On Sat, Oct 17, 2015 at 6:05 PM, Daniel Haviv <[email protected]> wrote:

> Thanks for the tip Gopal.
> I tried what you suggested (on Tez) but I'm getting a middle stage with 1
> reducer (which is awful for performance).
>
> This is my query:
> insert into upstreamparam_org partition(day_ts, cmtsid) select * from
> upstreamparam_20151013 order by datats,macaddress;
>
> I've attached the query plan in case it might help understand why.
>
> Thank you.
> Daniel.
>
> On Fri, Oct 16, 2015 at 7:19 PM, Gopal Vijayaraghavan <[email protected]>
> wrote:
>
>> > Is there a more efficient way to have Hive merge small files on the
>> > files without running with two passes?
>>
>> Not entirely an efficient way, but adding a shuffle stage usually works
>> much better as it gives you the ability to layout the files for better
>> vectorization.
>>
>> Like for TPC-H, doing ETL with
>>
>> create table lineitem as select * from lineitem sort by l_shipdate,
>> l_suppkey;
>>
>> will produce fewer files (exactly as many as your reducer #) & compresses
>> harder due to the natural order of transactions (saves ~20Gb or so at 1000
>> scale).
>>
>> Caveat: that is not more efficient in MRv2, only in Tez/Spark which can
>> run MRR pipelines as-is.
>>
>> Cheers,
>> Gopal
