Actually, using a temp table doesn't work either. Apparently, a single mapper can read from multiple partitions (and output multiple files). There is no way to force a single mapper per partition.
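[Editor's note: for anyone landing on this thread, the small-files merge knobs discussed below are stock Hive configuration options. A sketch of the usual incantation, assuming Hive 0.10/0.11-era settings (per the thread, these did not take effect for the original poster's partitioned insert):

```sql
-- Standard Hive small-file merge settings (a sketch; these are the stock
-- option names, not anything specific to this thread's cluster):
SET hive.merge.mapfiles = true;        -- merge small files from map-only jobs
SET hive.merge.mapredfiles = true;     -- merge small files from map-reduce jobs
SET hive.merge.size.per.task = 256000000;      -- target size of a merged file
SET hive.merge.smallfiles.avgsize = 16000000;  -- merge when avg file size is below this
SET hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```
]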
On Wed, Aug 21, 2013 at 11:12 AM, Igor Tatarinov <i...@decide.com> wrote:

> Using a single bucket per partition seems to create a single reducer, which
> is too slow.
> I've tried enforcing the small-files merge, but that didn't work. I still got
> multiple output files.
>
> Creating a temp table and then "combining" the multiple files into one
> using a simple SELECT * is the only option that seems to work. It's odd
> that I have to create the temp table, but I don't see a workaround.
>
>
> On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> hi igor,
>> lots of ideas there! I can't speak for them all, but let me confirm first
>> that "cluster by X into 1 bucket" didn't work? I would have thought that
>> would have done it.
>>
>>
>> On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <i...@decide.com> wrote:
>>
>>> What's the best way to enforce a single output file per partition?
>>>
>>> INSERT OVERWRITE TABLE <table>
>>> PARTITION (x, y, z)
>>> SELECT ...
>>> FROM ...
>>> WHERE ...
>>>
>>> I tried adding CLUSTER BY x, y, z at the end, thinking that sorting would
>>> force a single reducer per partition, but that didn't work. I still got
>>> multiple files per partition.
>>>
>>> Do I have to use a single reduce task? With a few TB of data, that's
>>> probably not a good idea.
>>>
>>> My current idea is to create a temp table with the same partitioning
>>> structure, insert into that table first, and then SELECT * from that table
>>> into the output table. With combineinputformat=true, that should work, right?
>>>
>>> Or should I make Hive merge output files instead (using
>>> hive.merge.mapfiles)?
>>> Will that work with a partitioned table?
>>>
>>> Thanks!
>>> igor
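[Editor's note: one variant the thread doesn't quite land on is DISTRIBUTE BY on the partition columns. Unlike CLUSTER BY ... INTO 1 BUCKET, it does not collapse the job to a single reducer; it hashes each partition's rows to one reducer, so each partition is written by exactly one reducer. A sketch, with `mytable`, `src`, and the column names as placeholders (not from the thread):

```sql
-- Sketch: aim for one file per partition via DISTRIBUTE BY.
-- All rows with the same (x, y, z) hash to the same reducer, so each
-- partition is written by exactly one reducer. Note a reducer may still
-- handle several partitions, but each partition still gets one file.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE mytable
PARTITION (x, y, z)
SELECT col1, col2, x, y, z
FROM src
DISTRIBUTE BY x, y, z;
```
]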