There is a port of Pig that runs over Spark:
http://blog.cloudera.com/blog/2014/09/pig-is-flying-apache-pig-on-apache-spark/
The way I understand it, Pig analyzes the pipeline you give it and compiles it into a job that runs on some execution fabric.

Apache Tez is a computational fabric that sits somewhere between Spark and the old Map/Reduce. Tez eliminates many of the extreme inefficiencies of Map/Reduce by allowing sequences other than [storage] -> [map] -> [reduce] -> [storage], but it is otherwise a lot like Map/Reduce. Spark offers an in-memory execution model (as well as on-disk) and is different in deeper ways.

It could be that Pig-over-Spark is less compelling than Pig-over-something-else, because a Spark program is a lot more like a Pig program than an M/R program is.

On Sun, Jul 19, 2015 at 5:02 PM, Yang <teddyyyy...@gmail.com> wrote:

> Spark is very hot now, but after reading the paper, I found it surprisingly
> similar to PIG's concept: the RDD is just Relation/set in PIG's
> terminology.
>
> I think a great strength of Spark is that it tries to merge multiple
> "narrow dependency" stages together to avoid too much IO. does PIG do that
> too? otherwise, I can't figure out what other major design differences
> would lead to huge performance difference, if Spark also uses on-disk
> storage. The overhead to start a MR task should not be that big.
>

--
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems, Classification and Text Mining and Data Lakes*

(607) 539 6254
paul.houle on Skype
ontolo...@gmail.com
https://legalentityidentifier.info/lei/lookup/ <http://legalentityidentifier.info/lei/lookup/>
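
P.S. To make the "merge narrow-dependency stages" question a bit more concrete, here is a toy sketch in plain Python. It is not Pig, Tez, or Spark code, and the names (run_materialized, run_fused, STEPS) are made up for illustration. It assumes every step is a per-record transformation (a "narrow" dependency) and only counts how often the full data set gets materialized; it says nothing about real-world timings.

```python
from typing import Callable, Iterable, List, Tuple

# A step takes a stream of records and yields a new stream of records.
Step = Callable[[Iterable[int]], Iterable[int]]

# Hypothetical three-step pipeline: map, filter, map.
STEPS: List[Step] = [
    lambda rows: (r * 2 for r in rows),           # map: double each record
    lambda rows: (r for r in rows if r % 3 == 0), # filter: keep multiples of 3
    lambda rows: (r + 1 for r in rows),           # map: add one
]


def run_materialized(records: Iterable[int], steps: List[Step]) -> Tuple[List[int], int]:
    """Materialize the full output after every step.

    This loosely mimics chaining separate jobs that each write their
    result to storage before the next one starts.
    """
    data = list(records)
    materializations = 0
    for step in steps:
        data = list(step(data))  # stand-in for a write to storage
        materializations += 1
    return data, materializations


def run_fused(records: Iterable[int], steps: List[Step]) -> Tuple[List[int], int]:
    """Chain all steps lazily so each record flows through the whole
    pipeline, and materialize only the final result.
    """
    stream: Iterable[int] = iter(records)
    for step in steps:
        stream = step(stream)  # nothing is computed yet
    return list(stream), 1     # single materialization at the end


if __name__ == "__main__":
    print(run_materialized(range(10), STEPS))
    print(run_fused(range(10), STEPS))
```

Both functions return the same records; the only difference is the number of full intermediate copies, which is the IO the quoted message is asking about.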