Hi everyone, I have a question for you guys.
I've started running some experiments with the UDFs I've written, and I'm now interested in assessing their performance. I have something like:

    A = LOAD ... using JsonLoader();
    B = FOREACH A GENERATE MyUDF();

This script, which is translated into a single map task (no reduce), takes 1:20 to execute. If I comment out the projection and just load the data, it takes 27 seconds. So the first assumption is that the rest of the time (about 53 seconds) was spent in MyUDF, right?

Not quite. I timed every call to exec() using System.nanoTime(), and together they add up to no more than 5 seconds. So where have the other 48 seconds gone?

The output of my UDF is a bag: for each input tuple I "create" several output tuples and put them in a bag.

Thanks,
Rodrigo Ferreira
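
For reference, here is a minimal sketch of the kind of measurement described above: wrapping each call in System.nanoTime() and summing the elapsed time. It is plain Java with no Pig dependency, so it is not the actual UDF. The class ExecTimingSketch, the stand-in exec() body, and the timedExec() wrapper are all hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT the real UDF, only an illustration of the timing approach.
class ExecTimingSketch {
    // Running total of nanoseconds spent inside exec(), across all calls.
    static long totalExecNanos = 0;
    // Number of calls measured so far.
    static long calls = 0;

    // Stand-in for the UDF body (assumption): one input produces
    // several output "tuples" that are collected into a "bag".
    static List<String> exec(String input) {
        List<String> bag = new ArrayList<>();
        for (String token : input.split(" ")) {
            bag.add(token);
        }
        return bag;
    }

    // Wraps exec() and adds the elapsed time of this call to the total.
    static List<String> timedExec(String input) {
        long start = System.nanoTime();
        try {
            return exec(input);
        } finally {
            totalExecNanos += System.nanoTime() - start;
            calls++;
        }
    }

    public static void main(String[] args) {
        String[] inputs = {"a b c", "d e", "f"};
        int outputTuples = 0;
        for (String in : inputs) {
            outputTuples += timedExec(in).size();
        }
        System.out.println("calls=" + calls
                + " outputTuples=" + outputTuples
                + " totalExecMillis=" + (totalExecNanos / 1_000_000.0));
    }
}
```

Note that a total collected this way only covers the time between entering and leaving exec(). Anything that happens to the returned bag after the call returns is not counted in it.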
