Re: Some proposals for Pig performance optimization

Thejas Nair Thu, 21 Jun 2012 13:28:26 -0700

bcc'ing the user list.

1. Order-by

The comparison against hive order-by is misleading. Hive does not do total ordering, unless you use a single reducer. But yes, in case of pig, the sampling phase is unnecessary, if you use a single reducer. A single reducer can make sense if the data you are sorting is small. I agree that it makes sense to remove the sampling phase in pig in such cases.
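
To illustrate why the sampling phase only matters with more than one reducer, here is a minimal sketch of range partitioning driven by a sample. It is a simplified illustration, not pig's actual partitioner; the class and method names (RangePartitionSketch, splitPoints, partition) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RangePartitionSketch {
    // Pick (numReducers - 1) split points from a sample of the sort keys.
    // With a single reducer the result is empty, so the sample is never needed.
    static List<Integer> splitPoints(List<Integer> sample, int numReducers) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<Integer> points = new ArrayList<>();
        for (int i = 1; i < numReducers; i++) {
            points.add(sorted.get(i * sorted.size() / numReducers));
        }
        return points;
    }

    // Route a key to the reducer whose key range contains it, so that
    // concatenating the reducer outputs in order gives a total order.
    static int partition(int key, List<Integer> points) {
        int p = 0;
        while (p < points.size() && key >= points.get(p)) {
            p++;
        }
        return p;
    }
}
```

With numReducers set to 1 there are no split points and every key goes to reducer 0, which is the case where the sampling job could be skipped.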


2. Lazy type conversion
Can you add a note about how many records there are in the input vs the output?

In this example, we can improve things by using the logical optimizer, so that only the necessary parts are typecast before the filter.

One problem in pig is that it uses java objects like Integer, String, etc., which are final types. This means that we can't create a subclass that delays the conversion until the value actually gets used. The types are part of the udf interface. We should consider whether we want to do something like this when we add new udf interfaces.
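
As a rough sketch of what delayed conversion could look like if the interface did not require java.lang.Integer itself: a wrapper type keeps the raw bytes and parses them only on first use. The LazyInt type below is hypothetical and is not part of pig or of any existing udf interface.

```java
import java.nio.charset.StandardCharsets;

public class LazyIntSketch {
    // Hypothetical wrapper: since Integer is final, laziness has to live
    // in a separate type rather than in a subclass of Integer.
    static final class LazyInt {
        private final byte[] raw;   // unparsed field bytes from the input
        private Integer value;      // null until the value is first needed
        int conversions = 0;        // counts how often parsing actually ran

        LazyInt(byte[] raw) {
            this.raw = raw;
        }

        int get() {
            if (value == null) {
                value = Integer.parseInt(new String(raw, StandardCharsets.UTF_8));
                conversions++;
            }
            return value;
        }
    }
}
```

A record that is dropped by a filter on some other field would never call get(), so its bytes would never be parsed.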

Some thoughts on serialization/deserialization improvements that I had written earlier - http://wiki.apache.org/pig/AvoidingSedes


Thanks,
Thejas

On 6/21/12 11:14 AM, Jie Li wrote:

Hello everyone,

I compiled a list of possible optimizations for Pig's performance.

https://cwiki.apache.org/confluence/display/PIG/Pig+Performance+Optimization

As I'm not yet very familiar with the codebase, I'm likely to
underestimate the complexity involved, so any input will be
appreciated.

Thanks,
Jie
