Hi,
I was looking at the PIG on Spark effort and noticed that there is scope
for performance optimization. For example, we don't try to work out from
the plan which Spark operation is the best fit for a group-by; it could be
mapped to groupByKey, aggregateByKey, or reduceByKey.
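
A minimal sketch of the kind of plan inspection I mean, in plain Python
rather than actual Pig or Spark code. The function name
`choose_group_strategy`, the two sets, and the idea of passing in a list of
function names are all hypothetical; the classification of the built-ins is
a simplifying assumption, not how Pig represents them.

```python
# Hypothetical sketch: pick a Spark RDD grouping primitive from what the
# Pig plan does with the grouped bag. Not actual Pig or Spark code.

# Assumption: functions that can be merged pairwise with one binary operator.
PAIRWISE_MERGEABLE = {"SUM", "MIN", "MAX"}
# Assumption: functions that combine partial results but need a separate
# accumulator (a counter, or a (sum, count) pair).
ACCUMULATOR_COMBINABLE = {"COUNT", "AVG"}


def choose_group_strategy(functions_on_bag):
    """Return the name of the Spark primitive that fits a GROUP BY.

    functions_on_bag: names of the functions applied to the grouped bag in
    the FOREACH that follows the GROUP; empty if the bag is used as is.
    """
    funcs = {name.upper() for name in functions_on_bag}
    if not funcs:
        # The raw bag is needed, so every value has to be shuffled.
        return "groupByKey"
    if funcs <= PAIRWISE_MERGEABLE:
        return "reduceByKey"
    if funcs <= PAIRWISE_MERGEABLE | ACCUMULATOR_COMBINABLE:
        return "aggregateByKey"
    # Unknown or non-combinable function: fall back to a full group.
    return "groupByKey"
```

With this, a GROUP followed only by SUM would map to reduceByKey, while a
GROUP whose bag is passed to an arbitrary UDF would stay on groupByKey.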

With DataFrames in Spark, the Catalyst optimizer kicks in and the resulting
execution is much better optimized. I was wondering if we could leverage
that by converting the PIG operators to DataFrames rather than RDDs. Has
anyone thought along these lines?

I did play around a little bit (got basic load, filter, store to work) and
noticed the following challenges:
1. *Schema* - DataFrames are tied to a schema. On the other hand, PIG allows
users to consume a Tuple of a particular schema and generate a Tuple of a
totally different schema. Changing the schema on the fly is not easy to
implement in DataFrames (and not encouraged either).
2. *Addition/Modification of Columns (very common in PIG) -* Since DFs are
immutable, any change will result in a new DataFrame (not sure if there
will be performance deterioration because of that).
3. *PIG UDFs* - Any user-defined or PIG-supplied UDFs will require us to
fall back to RDDs.
4. *Load/Store* - Since PigInputFormat is used, I ended up creating
DataFrames from an RDD. I could still overcome this if the loader is
PigStorage, but custom loaders will be a problem (no way to load using
just DataFrames).
5. *Java API* - The Java API is more tedious to use than the Scala one.
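
Points 3 and 4 suggest a hybrid approach: keep an operator on DataFrames
where possible and fall back to RDDs otherwise. A rough sketch of that
per-operator decision, in plain Python with a made-up plan representation
(the `Op` class, its fields, `backend_for`, and `MyCustomStorer` are
hypothetical, not Pig or Spark classes):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for one operator of a Pig plan."""
    kind: str               # e.g. "LOAD", "FILTER", "FOREACH", "STORE"
    func: str = ""          # loader/storer name for LOAD/STORE, if any
    uses_udf: bool = False  # True if the operator calls any UDF


def backend_for(op):
    """Return "dataframe" or "rdd" for a single operator."""
    if op.uses_udf:
        # Point 3: any UDF forces a fall back to RDDs.
        return "rdd"
    if op.kind in ("LOAD", "STORE") and op.func != "PigStorage":
        # Point 4: only PigStorage is assumed to have a DataFrame path.
        return "rdd"
    return "dataframe"


plan = [
    Op("LOAD", func="PigStorage"),
    Op("FILTER"),
    Op("FOREACH", uses_udf=True),
    Op("STORE", func="MyCustomStorer"),
]
backends = [backend_for(op) for op in plan]
```

Every switch between the two backends would need a DataFrame/RDD
conversion, so the number of switches in a plan probably matters as much as
the per-operator choice.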

Any thoughts?

Thanks,
Pallavi
