> I'm trying to optimize a UDF that runs very slowly on Hive. The UDF >takes in a 5GB table and builds a large data structure out of it to >facilitate lookups. The 5GB input is loaded into the distributed cache >with an 'add file <path>' command, and the UDF builds > the data structure a single time per instance (or so it should).
No, this builds it once per map attempt in MRv2, because each JVM is killed after executing a single map attempt. In Tez, however you can build this once per container (usually, a ~10x perf improvement). This has a fix in Tez, since the UDFs can only load it over the network once per JVM init and you can hang onto that in the loaded GenericUDF object (*not* a static, but a private final), which is held in the TezCache as long as the task keeps running the same vertex. That will be thrown away whenever the container switches over to running a reducer, so the cache is transient. Cheers, Gopal