Yes. That was the first question I asked when I started work on Pig on Spark. After investigating a little more, I realized that the current design does not allow for easy use of the DataFrame API. We do an operator-by-operator substitution and use Tuple as the datatype, so we would end up converting RDDs to DataFrames and vice versa, which is not really optimal.
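To make the round trip concrete, here is a toy sketch in plain Python with no Spark involved. Lists of tuples stand in for an RDD of Pig Tuples, and lists of dicts stand in for a DataFrame with named columns. Every function name is made up for illustration, and this is not how the Pig on Spark code is actually structured; it only shows that dropping one column-oriented operator into a tuple-based chain costs a conversion on each side.

```python
# Toy model only: no Spark here. A list of tuples stands in for an RDD of
# Pig Tuples; a list of dicts stands in for a DataFrame with named columns.
# Every name below is hypothetical.

def to_rows(tuples, schema):
    """Tuple-based data -> column-named rows (the 'RDD to DataFrame' step)."""
    return [dict(zip(schema, t)) for t in tuples]

def to_tuples(rows, schema):
    """Column-named rows -> tuple-based data (the 'DataFrame to RDD' step)."""
    return [tuple(row[col] for col in schema) for row in rows]

def filter_op(tuples):
    """Tuple-based operator: keep records whose second field exceeds 10."""
    return [t for t in tuples if t[1] > 10]

def project_rows(rows):
    """DataFrame-style operator: select the 'name' column by name."""
    return [{"name": row["name"]} for row in rows]

def upper_op(tuples):
    """Tuple-based operator: upper-case the first field."""
    return [(t[0].upper(),) for t in tuples]

def run_pipeline(tuples):
    """Chain the operators one by one and count representation changes."""
    conversions = 0
    data = filter_op(tuples)                  # tuples in, tuples out
    rows = to_rows(data, ("name", "score"))   # conversion 1: tuples -> rows
    conversions += 1
    rows = project_rows(rows)                 # works on named columns
    data = to_tuples(rows, ("name",))         # conversion 2: rows -> tuples
    conversions += 1
    return upper_op(data), conversions        # back on tuples

result, conversions = run_pipeline([("a", 5), ("b", 20), ("c", 30)])
print(result, conversions)
```

In this model a single column-oriented operator in the middle of the chain already forces two conversions. With substitution done per operator, that cost would repeat at every boundary between the two representations.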
So, as Kelly said, we should take up that optimization after the first release. We could even move to the Dataset API then.

On Mon, Jan 9, 2017 at 7:53 AM, Zhang, Liyun <liyun.zh...@intel.com> wrote:

> Hi Jeff:
> Thanks for your interest, when this project is started (Aug in 2014)
> DataFrame API is not available and this is why we don't use this in the
> project. Engineer in InMobi raised similar idea before. In my view, if
> DataFrame API is more suitable than RDD API, we can consider this in late
> optimization work after first release. Now you can file a subtask on
> PIG-4856(an umbrella jira for optimization work) and work on it if have
> interest.
>
> Best Regards
> Kelly Zhang/Zhang,Liyun
>
> -----Original Message-----
> From: Jeff Zhang [mailto:zjf...@gmail.com]
> Sent: Sunday, January 8, 2017 10:13 AM
> To: dev@pig.apache.org
> Subject: Why pig on spark use RDD API rather than DataFrame API ?
>
> Hi Folks,
>
> I am very interested on the project of pig on spark. When I read the code,
> I find that the current implementation is based on spark RDD API. I don't
> know the original background (maybe when this project is started, DataFrame
> API is not available) , but for now I feel DataFrame API might be more
> suitable than RDD API. Here's 2 advantages of DataFrame API I can think of:
> 1. DataFrame API is easier to use than RDD API, although it is not
> flexible than RDD, but I think Pig's tuple data structure is very similar
> with that of DataFrame. I think it should be able to map each pig operation
> to data frame operation. If not, we can give feedback to spark community.
> 2. Spark's catalyst provide lots of optimization on DataFrame. If we use
> DataFrame API, we can leverage lots of optimization in catalyst rather than
> reinvent the wheel in pig.
>
> What do you think ?
> Thanks