Hi,

I was looking at the Pig on Spark effort and noticed that there is scope for performance optimization. For example, we don't inspect the plan to work out which Spark operation is the best fit for a group-by; it could be mapped to groupByKey, aggregateByKey, or reduceByKey.
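To illustrate why the choice matters, here is a rough sketch in plain Python (no Spark involved; the function names and sample data are made up for illustration). It mimics a groupByKey-style aggregation, where every record crosses the shuffle, against a reduceByKey-style aggregation, where each partition pre-aggregates before the shuffle. It is a toy model of the idea, not how Spark or Pig on Spark is implemented.

```python
from collections import defaultdict


def partition(records, num_partitions):
    """Split (key, value) records round-robin, standing in for map-side splits."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts


def group_then_sum(parts):
    """groupByKey-style: every record crosses the shuffle, then gets summed."""
    shuffled = [rec for part in parts for rec in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return {k: sum(v) for k, v in groups.items()}, len(shuffled)


def combine_then_sum(parts):
    """reduceByKey-style: sum inside each partition, shuffle only partial sums."""
    shuffled = []
    for part in parts:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Hypothetical sample data: (key, value) pairs spread over two partitions.
records = [("a", 1), ("b", 2), ("a", 3), ("b", 4),
           ("a", 5), ("c", 6), ("a", 7), ("b", 8)]
parts = partition(records, 2)

grouped, grouped_shuffled = group_then_sum(parts)
combined, combined_shuffled = combine_then_sum(parts)

print(grouped == combined)                  # same totals either way
print(grouped_shuffled, combined_shuffled)  # records sent across the shuffle
```

Both strategies give the same totals, but the pre-aggregating one sends far fewer records across the shuffle. That is only possible when the aggregation can be split into partial steps, which is the kind of information the plan would have to provide.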
With DataFrames in Spark, the Catalyst optimizer kicks in and the resulting code is much better optimized. I was wondering whether we could leverage that by converting the Pig operators to DataFrames rather than RDDs. Has anyone thought along these lines?

I played around a little (got basic load, filter, and store to work) and noticed the following challenges:

1. *Schema* - DataFrames are tied to a schema. Pig, on the other hand, allows users to consume a Tuple of one schema and generate a Tuple of a totally different schema. Changing the schema on the fly is not easy to implement with DataFrames (and not encouraged either).
2. *Addition/Modification of Columns* (very common in Pig) - Since DataFrames are immutable, any change results in a new DataFrame (I am not sure whether that will degrade performance).
3. *Pig UDFs* - Any user-defined or Pig-supplied UDF will require us to fall back on RDDs.
4. *Load/Store* - Since PigInputFormat is used, I ended up creating DataFrames from an RDD. I could still work around this when the loader is PigStorage, but custom loaders will be a problem (there is no way to load them using just DataFrames).
5. The Java API is more tedious to use than the Scala API.

Any thoughts?

Thanks,
Pallavi