Optimizing UDF

2015-07-14 Thread Bennie Leo
Hi,
 
I'm trying to optimize a UDF that runs very slowly on Hive. The UDF takes in a 
5GB table and builds a large data structure out of it to facilitate lookups. 
The 5GB input is loaded into the distributed cache with an 'add file ' 
command, and the UDF builds the data structure a single time per instance (or 
so it should). 
 
My problem is that the Hive UDF takes several hours to complete, while running 
the exact same code on my local machine takes 5 minutes! What could be causing 
Hive to be so impractically slow? According to the Hive logs, the data transfer 
takes 5-10 minutes, which is reasonable. What else is taking so long?
 
Thanks,
B
  

Re: Optimizing UDF

2015-07-14 Thread Gopal Vijayaraghavan

 
> I'm trying to optimize a UDF that runs very slowly on Hive. The UDF
> takes in a 5GB table and builds a large data structure out of it to
> facilitate lookups. The 5GB input is loaded into the distributed cache
> with an 'add file ' command, and the UDF builds the data structure a
> single time per instance (or so it should).

No, this builds it once per map attempt in MRv2, because each JVM is
killed after executing a single map attempt.

In Tez, however, you can build this once per container (usually a ~10x
perf improvement).

This has a fix in Tez, since the UDFs can only load it over the network
once per JVM init and you can hang onto that in the loaded GenericUDF
object (*not* a static, but a private final), which is held in the
TezCache as long as the task keeps running the same vertex.

That will be thrown away whenever the container switches over to running a
reducer, so the cache is transient.
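
As a rough illustration of the pattern described above (keep the loaded
structure in a field of the UDF object rather than in a static), here is a
minimal plain-Java sketch. It deliberately does not use the Hive API: the
class LookupUdfSketch, its initialize() and evaluate() methods, and
loadLookup() are hypothetical stand-ins for a class extending GenericUDF and
for reading the file shipped with 'add file'. The sketch fills the field
lazily instead of declaring it final, which is a choice made here only so
that loading can be deferred to the initialization call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a class that would extend Hive's GenericUDF.
class LookupUdfSketch {
    // Diagnostic only: counts how often the structure is built in this JVM.
    static int loadCount = 0;

    // Instance field, not static: it lives as long as this UDF object is kept.
    private Map<String, String> lookup;

    // Stand-in for the UDF's initialization hook; may be called more than once.
    void initialize() {
        if (lookup == null) {
            lookup = loadLookup();
        }
    }

    // Stand-in for parsing the distributed-cache file into a lookup structure.
    private static Map<String, String> loadLookup() {
        loadCount++;
        Map<String, String> m = new HashMap<>();
        m.put("a", "1");
        m.put("b", "2");
        return m;
    }

    // Stand-in for the per-row evaluation: a plain lookup, no rebuilding.
    String evaluate(String key) {
        return lookup.get(key);
    }

    public static void main(String[] args) {
        LookupUdfSketch udf = new LookupUdfSketch();
        udf.initialize();
        udf.initialize(); // second call reuses the already-built structure
        System.out.println(udf.evaluate("a") + " loads=" + LookupUdfSketch.loadCount);
    }
}
```

Whether the engine actually reuses the same UDF object across tasks in a
container is up to the execution engine, as described above; the sketch only
shows that repeated initialization of one object need not rebuild the data.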

Cheers,
Gopal




RE: Optimizing UDF

2015-07-14 Thread Bennie Leo
Thanks for your reply.
 
I am already using Tez (sorry, forgot to mention this), and my goal is indeed 
to build the instance once per container.
 
I'm sorry I don't understand what the solution would be with Tez. Are you 
saying that the object should be a private final? The only element I would be 
missing in this case is the final keyword. I fail to see how this will make a 
difference...
 
Thanks,
B

> Date: Tue, 14 Jul 2015 15:19:16 -0700
> Subject: Re: Optimizing UDF
> From: gop...@apache.org
> To: user@hive.apache.org
> CC: tben...@hotmail.com
  

Re: Optimizing UDF

2015-07-14 Thread Gopal Vijayaraghavan


 
> I am already using Tez (sorry, forgot to mention this), and my goal is
> indeed to build the instance once per container.

Put a log line in your UDF init() and check if it is being called multiple
times per container. If you're loading the data every time, then that might
be something to fix.

The other aspect is that there are GC pauses that can happen due to that, and
other such extraneous reasons for the slow-down.
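
One possible way to get a rough number for the GC side is to read the JVM's
own collector counters before and after the expensive step. The sketch below
uses the standard java.lang.management beans; the class name GcTimeSketch and
the throwaway allocation loop are made up for illustration, and the reported
times are approximate by the API's own definition.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: sums accumulated GC time across all collectors.
class GcTimeSketch {
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // Throwaway allocations standing in for building the large structure.
        byte[][] junk = new byte[64][];
        for (int i = 0; i < 10_000; i++) {
            junk[i % 64] = new byte[1 << 16];
        }
        long after = totalGcMillis();
        System.err.println("Approximate GC time during work: " + (after - before) + " ms");
    }
}
```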

But first, look at how many times you are loading the distributed cache
data per container.
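
A minimal sketch of such a log line, assuming a per-JVM counter is enough to
tell the calls apart: the class InitCounterSketch and the method recordInit()
are hypothetical names, and in a real UDF the call would go inside the
initialization method. The counter is static on purpose here, since it is
meant to be shared by every UDF instance created in the same JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for counting UDF initializations per JVM.
class InitCounterSketch {
    // Shared by all instances in this JVM, so it counts every init call.
    static final AtomicInteger INIT_CALLS = new AtomicInteger();

    // Builds the message a UDF's init method could write to its task log.
    static String recordInit() {
        int n = INIT_CALLS.incrementAndGet();
        // Often looks like "pid@host", but the format is JVM-specific.
        String jvm = ManagementFactory.getRuntimeMXBean().getName();
        return "UDF init call #" + n + " in JVM " + jvm;
    }

    public static void main(String[] args) {
        System.err.println(recordInit());
        System.err.println(recordInit());
    }
}
```

If the same JVM name shows the counter climbing with each task, the data is
probably being reloaded more often than once per container.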

Cheers,
Gopal