Hive huge 'startup time'

2014-07-18 Thread diogo
This is probably a simple question, but I'm noticing that for queries that
run on 1+ TB of data, it can take Hive up to 30 minutes to actually start
the first map-reduce stage. What is it doing? I imagine it's gathering
information about the data somehow; this 'startup' time is clearly a
function of the amount of data I'm trying to process.

Cheers,


Re: Hive huge 'startup time'

2014-07-18 Thread Prem Yadav
Maybe you can post your partition structure and the query. Over-partitioning
the data is one of the reasons this happens.
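
For example, the output of something like the following would help (the table
name 'events' is just a placeholder for your own table):

SHOW PARTITIONS events;        -- lists every partition Hive has to enumerate at plan time
DESCRIBE FORMATTED events;     -- shows partition columns, storage format, and table location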


Re: Hive huge 'startup time'

2014-07-18 Thread Edward Capriolo
The planning phase needs to do work for every Hive partition and every
Hadoop file, so if you have a lot of 'small' files or many partitions this
can take a long time.
Also, the planning that happens on the job tracker is single-threaded.
Also, the new YARN machinery requires back-and-forth to allocate containers.

Sometimes raising the heap for the hive-cli/launching process helps,
because the default heap of 1 GB may not be enough space to deal with all
of the partition information, and the extra memory headroom makes this go faster.
Sometimes setting the min split size higher launches fewer map tasks, which
speeds up everything.
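
For example, something along these lines (the property names are standard
Hadoop/Hive settings, but the values are only illustrative and depend on your
cluster):

export HADOOP_HEAPSIZE=2048      # heap for the Hive CLI / launching JVM, in MB
bin/hive

-- then, inside the Hive session:
set mapred.min.split.size=268435456;                          -- 256 MB minimum split (MR1 name)
set mapreduce.input.fileinputformat.split.minsize=268435456;  -- same setting under the MR2/YARN name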

So the answer... try to tune everything. Start Hive like this:

bin/hive -hiveconf hive.root.logger=DEBUG,console

And record where the longest gaps with no output are; that is what you
should try to tune first.




Re: Hive huge 'startup time'

2014-07-18 Thread diogo
Sweet, great answers, thanks.

Indeed, I have a small number of partitions, but lots of small files, ~20 MB
each. I'll make sure to combine them. Also, increasing the heap size of the
CLI process has already helped speed things up.
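
For example, one way to combine the files of a partition from inside Hive is to
rewrite it with the merge settings enabled; this is only a sketch, and the table
name 'events', the compacted copy 'events_compact', and the partition column
'dt' are placeholders:

set hive.merge.mapfiles=true;                   -- merge small files from map-only jobs
set hive.merge.mapredfiles=true;                -- merge small files from map-reduce jobs
set hive.merge.smallfiles.avgsize=134217728;    -- trigger a merge pass when average output file size < ~128 MB
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE events_compact LIKE events;
INSERT OVERWRITE TABLE events_compact PARTITION (dt)
SELECT * FROM events WHERE dt = '2014-07-18';   -- SELECT * returns the partition column last, which feeds the dynamic partition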

Thanks, again.


Re: Hive huge 'startup time'

2014-07-18 Thread Edward Capriolo
Unleash ze file crusha!

https://github.com/edwardcapriolo/filecrush


Re: Hive huge 'startup time'

2014-07-18 Thread Db-Blog
Hello everyone, 

Thanks for sharing these valuable inputs. I am working on a similar kind of
task; it would be really helpful if you could share the command for increasing
the heap size of the hive-cli/launching process.

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.
