I don't think anybody uses Pig because it is an efficient use of a computer cluster. Instead, people use it because it is an efficient use of their time.
If you're getting to the point where CPU performance matters you can generally write a plain Hadoop job that is faster, particularly if you think a lot about the algorithms and data structures.

On Thu, Jul 24, 2014 at 9:11 AM, Rodrigo Ferreira <[email protected]> wrote:

> Hi everyone,
>
> I have a question for you guys.
>
> Well, I've started doing some experiments with the UDFs that I've created.
> And at this point I'm interested in assessing their performance.
>
> I have something like:
>
> A = LOAD ... using JsonLoader();
>
> B = FOREACH A GENERATE MyUDF();
>
> This code, that is translated into a single Map task (no reduce) takes 1:20
> to execute. If I comment the projection and just load the data it takes 27
> seconds. So the first assumption is that the rest of the time was spent in
> MyUDF right? Not quite.
>
> I printed (using System.nanoTime()) all the calls to exec() and they don't
> sum up more than 5 seconds. So where have the other 48 seconds gone?
>
> The output of my UDF is a bag. Basically for each input tuple I "create"
> several output tuples and put them in a bag.
>
> Thanks,
>
> Rodrigo Ferreira.

--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254
paul.houle on Skype
[email protected]
