Hi Folks,

I am very interested in the Pig on Spark project. Reading the code, I
found that the current implementation is based on the Spark RDD API. I
don't know the original background (maybe the DataFrame API was not yet
available when the project started), but I now feel the DataFrame API might
be more suitable than the RDD API. Here are two advantages of the DataFrame
API that I can think of:
1.  The DataFrame API is easier to use than the RDD API. It is less
flexible than RDD, but I think Pig's tuple data structure is very similar
to that of DataFrame, so it should be possible to map each Pig operation to
a DataFrame operation. If not, we can give feedback to the Spark community.
2.  Spark's Catalyst optimizer provides many optimizations for DataFrames.
If we use the DataFrame API, we can leverage those optimizations rather
than reinvent the wheel in Pig.

What do you think? Thanks
