Hi folks,

I am very interested in the Pig on Spark project. Reading the code, I see that the current implementation is based on the Spark RDD API. I don't know the original background (maybe the DataFrame API was not available when this project started), but now I feel the DataFrame API might be more suitable than the RDD API. Here are two advantages of the DataFrame API that I can think of:

1. The DataFrame API is easier to use than the RDD API. It is less flexible than the RDD API, but Pig's tuple data structure is very similar to that of DataFrame, so I think it should be possible to map each Pig operation to a DataFrame operation. If not, we can give feedback to the Spark community.
2. Spark's Catalyst optimizer provides many optimizations for DataFrames. If we use the DataFrame API, we can leverage those optimizations rather than reinvent the wheel in Pig.

What do you think? Thanks.
