Thanks for your reply.
 
I am already using Tez (sorry, forgot to mention this), and my goal is indeed 
to build the instance once per container.
 
I'm sorry, but I don't understand what the solution would be with Tez. Are you 
saying that the object should be a private final field? The only thing I would 
be missing in that case is the final keyword, and I fail to see how that would 
make a difference...
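
To make sure we are talking about the same thing, here is a minimal sketch of how I read the suggestion. It is plain Java with no Hive classes: LookupUdf, load() and the builds counter are hypothetical stand-ins (for a GenericUDF subclass, for parsing the distributed-cache file, and for counting builds), not my actual code or any Hive API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a GenericUDF subclass; no Hive classes are used here.
class LookupUdf {
    // Counts how many times the expensive build runs (illustration only).
    static int builds = 0;

    // Instance field, not static: its lifetime is tied to this UDF object.
    private final Map<String, String> lookup = new HashMap<>();
    private boolean loaded = false;

    // Stand-in for parsing the large file from the distributed cache.
    private void load() {
        builds++;
        lookup.put("key", "value");
        loaded = true;
    }

    // Stand-in for evaluate(): builds the structure on the first call only.
    String evaluate(String key) {
        if (!loaded) {
            load();
        }
        return lookup.get(key);
    }
}
```

In this sketch the structure is built once per LookupUdf object, so how often it is rebuilt depends entirely on how long the framework keeps that object alive, with or without the final keyword.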
 
Thanks,
B

> Date: Tue, 14 Jul 2015 15:19:16 -0700
> Subject: Re: Optimizing UDF
> From: gop...@apache.org
> To: user@hive.apache.org
> CC: tben...@hotmail.com
> 
> 
>  
> > I'm trying to optimize a UDF that runs very slowly on Hive. The UDF
> >takes in a 5GB table and builds a large data structure out of it to
> >facilitate lookups. The 5GB input is loaded into the distributed cache
> >with an 'add file <path>' command, and the UDF builds
> > the data structure a single time per instance (or so it should).
> 
> No, this builds it once per map attempt in MRv2, because each JVM is
> killed after executing a single map attempt.
> 
> In Tez, however, you can build this once per container (usually a ~10x
> perf improvement).
> 
> This has a fix in Tez, since the UDFs can only load it over the network
> once per JVM init and you can hang onto that in the loaded GenericUDF
> object (*not* a static, but a private final), which is held in the
> TezCache as long as the task keeps running the same vertex.
> 
> That will be thrown away whenever the container switches over to running a
> reducer, so the cache is transient.
> 
> Cheers,
> Gopal
> 
> 
                                          
