Unleash ze file crusha! https://github.com/edwardcapriolo/filecrush
On Fri, Jul 18, 2014 at 10:51 AM, diogo <di...@uken.com> wrote:
> Sweet, great answers, thanks.
>
> Indeed, I have a small number of partitions, but lots of small files,
> ~20MB each. I'll make sure to combine them. Also, increasing the heap size
> of the CLI process already helped speed it up.
>
> Thanks again.
>
> On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> The planning phase needs to do work for every Hive partition and every
>> Hadoop file. If you have a lot of 'small' files or many partitions, this
>> can take a long time.
>> The planning phase that happens on the JobTracker is also single-threaded,
>> and the new YARN machinery adds back-and-forth to allocate containers.
>>
>> Sometimes raising the heap for the hive-cli/launching process helps,
>> because the default heap of 1 GB may not be enough space to deal with all
>> of the partition information; the extra memory headroom makes this go faster.
>> Sometimes setting the min split size higher launches fewer map tasks, which
>> speeds up everything.
>>
>> So the answer: try to tune everything. Start Hive like this:
>>
>> bin/hive -hiveconf hive.root.logger=DEBUG,console
>>
>> Then record where the longest stretches with no output are; that is what you
>> should try to tune first.
>>
>> On Fri, Jul 18, 2014 at 9:36 AM, diogo <di...@uken.com> wrote:
>>> This is probably a simple question, but I'm noticing that for queries
>>> that run on 1+ TB of data, it can take Hive up to 30 minutes to actually
>>> start the first map-reduce stage. What is it doing? I imagine it's
>>> gathering information about the data somehow; this 'startup' time is
>>> clearly a function of the amount of data I'm trying to process.
>>>
>>> Cheers,
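For anyone finding this thread later, the knobs Edward mentions (client heap, minimum split size, combining small files, debug logging) can be set roughly like this. Treat it as a sketch: exact property names vary by Hadoop/Hive version (e.g. on Hadoop 2.x the split-size property is mapreduce.input.fileinputformat.split.minsize), and the 4g/256MB values are illustrative, not recommendations.

```shell
# Give the Hive CLI / launcher JVM more heap than the ~1 GB default;
# the client JVM picks up options from HADOOP_CLIENT_OPTS.
export HADOOP_CLIENT_OPTS="-Xmx4g"

# Raise the minimum split size (here 256 MB) so fewer map tasks launch,
# let CombineHiveInputFormat pack many small files into one split,
# and turn on console debug logging to see where the planning time goes.
bin/hive \
  -hiveconf mapred.min.split.size=268435456 \
  -hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat \
  -hiveconf hive.root.logger=DEBUG,console
```

With debug logging on, the longest silent gap in the console output points at the phase worth tuning first, per Edward's advice above.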