> I'm trying to optimize a UDF that runs very slowly on Hive. The UDF
>takes in a 5GB table and builds a large data structure out of it to
>facilitate lookups. The 5GB input is loaded into the distributed cache
>with an 'add file <path>' command, and the UDF builds
> the data structure a single time per instance (or so it should).

No, this builds it once per map attempt in MRv2, because each JVM is
killed after executing a single map attempt.

In Tez, however you can build this once per container (usually, a ~10x
perf improvement).

This has a fix in Tez, since the UDFs can only load it over the network
once per JVM init and you can hang onto that in the loaded GenericUDF
object (*not* a static, but a private final), which is held in the
TezCache as long as the task keeps running the same vertex.

That will be thrown away whenever the container switches over to running a
reducer, so the cache is transient.

Cheers,
Gopal


Reply via email to