Thanks for your reply. I am already using Tez (sorry, I forgot to mention this), and my goal is indeed to build the data structure once per container.

I'm sorry, but I don't understand what the solution would be with Tez. Are you saying that the object should be a private final field? If so, the only thing my code is missing is the final keyword, and I fail to see how that would make a difference...

Thanks,
B
> Date: Tue, 14 Jul 2015 15:19:16 -0700
> Subject: Re: Optimizing UDF
> From: gop...@apache.org
> To: user@hive.apache.org
> CC: tben...@hotmail.com
>
> > I'm trying to optimize a UDF that runs very slowly on Hive. The UDF
> > takes in a 5GB table and builds a large data structure out of it to
> > facilitate lookups. The 5GB input is loaded into the distributed cache
> > with an 'add file <path>' command, and the UDF builds
> > the data structure a single time per instance (or so it should).
>
> No, this builds it once per map attempt in MRv2, because each JVM is
> killed after executing a single map attempt.
>
> In Tez, however you can build this once per container (usually, a ~10x
> perf improvement).
>
> This has a fix in Tez, since the UDFs can only load it over the network
> once per JVM init and you can hang onto that in the loaded GenericUDF
> object (*not* a static, but a private final), which is held in the
> TezCache as long as the task keeps running the same vertex.
>
> That will be thrown away whenever the container switches over to running a
> reducer, so the cache is transient.
>
> Cheers,
> Gopal
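
For what it's worth, here is a minimal sketch of the "private final, not a static" idea from the quoted reply, written in plain Java with no Hive or Tez classes. The names LookupUdfSketch, buildLookup and builds are hypothetical, and the two-entry map only stands in for the structure built from the 5GB file. In Java the final keyword by itself only prevents reassignment of the field. What the sketch shows is that the data lives in an instance field, so it is built once for as long as the same object is reused and rebuilt when a new object is created. Whether the same GenericUDF object really is reused across tasks is up to the Tez caching described in the quoted reply; the sketch assumes it and does not demonstrate it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a GenericUDF subclass; no Hive classes are used here.
class LookupUdfSketch {
    // Demo-only counter of how many times any instance has built its table.
    static int builds = 0;

    // Instance field, not static: the data lives exactly as long as this object.
    private final Map<String, String> lookup = new HashMap<>();
    private boolean loaded = false;

    // Placeholder for the expensive step of reading the distributed-cache file.
    private void buildLookup() {
        lookup.put("k1", "v1");
        lookup.put("k2", "v2");
        builds++;
        loaded = true;
    }

    // Called once per row; only the first call on a given instance pays the build cost.
    String evaluate(String key) {
        if (!loaded) {
            buildLookup();
        }
        return lookup.get(key);
    }

    public static void main(String[] args) {
        LookupUdfSketch udf = new LookupUdfSketch();
        for (int i = 0; i < 1000; i++) {
            udf.evaluate("k1");
        }
        System.out.println("builds after reusing one instance: " + builds);

        // A fresh instance has an empty map, so it has to build again.
        new LookupUdfSketch().evaluate("k2");
        System.out.println("builds after a fresh instance: " + builds);
    }
}
```

If the object is thrown away (for example when the container switches to a reducer, as the quoted reply notes), the next instance rebuilds the map, which matches the "transient cache" behaviour described there.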