Using a single bucket per partition seems to create a single reducer, which is too slow. I also tried enforcing the small-files merge, but that didn't work: I still got multiple output files.
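For reference, "small files merge" here means Hive's post-job merge step. A sketch of the settings usually involved follows; the exact values used in this attempt are not recorded in this message, so the numbers below are only Hive's commonly documented defaults, shown as an assumption.

```sql
-- Ask Hive to merge small output files at the end of the job.
SET hive.merge.mapfiles=true;      -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;   -- merge outputs of map-reduce jobs

-- Assumed values (documented defaults), not the ones from the original run:
SET hive.merge.size.per.task=256000000;      -- target size of merged files, bytes
SET hive.merge.smallfiles.avgsize=16000000;  -- merge only if avg output file is smaller than this
```

Note that the merge step aims at a target file size, not at a file count, so it does not by itself promise exactly one file per partition.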
Creating a temp table and then "combining" the multiple files into one with a simple select * is the only option that seems to work. It's odd that I have to create the temp table, but I don't see a workaround.

On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> hi igor,
> lots of ideas there! I can't speak for them all but let me confirm first
> that "cluster by X into 1 bucket" didn't work? I would have thought that
> would have done it.
>
> On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <i...@decide.com> wrote:
>
>> What's the best way to enforce a single output file per partition?
>>
>> INSERT OVERWRITE TABLE <table>
>> PARTITION (x,y,z)
>> SELECT ...
>> FROM ...
>> WHERE ...
>>
>> I tried adding CLUSTER BY x,y,z at the end thinking that sorting will
>> force a single reducer per partition but that didn't work. I still got
>> multiple files per partition.
>>
>> Do I have to use a single reduce task? With a few TB of data that's
>> probably not a good idea.
>>
>> My current idea is to create a temp table with the same partitioning
>> structure. Insert into that table first and then select * from that table
>> into the output table. With combineinputformat=true that should work right?
>>
>> Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
>> Will that work with a partitioned table?
>>
>> Thanks!
>> igor
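The temp-table workaround described at the top of this message can be sketched roughly as follows. The table and column names (final_table, staging_table, source_table, a, b) are hypothetical, and the split-size value is an assumption. Whether the second step really yields one file per partition depends on the combined split being large enough to hold all of the staged data.

```sql
-- Hypothetical names throughout; x, y, z are the partition columns.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Staging table with the same schema and partitioning as the target.
CREATE TABLE staging_table LIKE final_table;

-- Step 1: run the real query into staging; this may leave many files per partition.
-- Dynamic partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE staging_table PARTITION (x, y, z)
SELECT a, b, x, y, z
FROM source_table;

-- Step 2: map-only copy that reads the small files through combined splits.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=1073741824;  -- assumed value; must exceed the data read per mapper

INSERT OVERWRITE TABLE final_table PARTITION (x, y, z)
SELECT * FROM staging_table;
```

Each mapper in step 2 writes one file for every partition it sees, so fewer and larger splits mean fewer files per partition. With a few TB of data a single split is unlikely, so the file count would drop rather than reach exactly one everywhere.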