There is a port of Pig that runs over Spark.

The way I understand it, Pig will analyze the pipeline you give it and
then compile the job so it runs on some fabric. Apache Tez is a
computational fabric which is somewhere in between Spark and the old
Map/Reduce in the sense that Tez eliminates many of the extreme
inefficiencies of Map/Reduce by allowing sequences other than

[storage] -> [map] -> [reduce] -> [storage]

but Tez is otherwise a lot like Map/Reduce, whereas Spark offers an
in-memory execution model (as well as on-disk) and is different in deeper
ways.

It could be that Pig-over-Spark is less compelling than Pig-over-something
else because a Spark program is a lot more like a Pig program than an M/R
program is.
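
As a rough illustration of why escaping the fixed storage-to-storage chain
matters, here is a toy Python sketch. It is not Pig, Tez, or Spark code: the
Storage class, both run functions, and the three steps are hypothetical
names made up for the example. It only counts how many times a full dataset
gets written out when each step runs as its own job versus when the steps
are chained into a single pass.

```python
# Toy model only: "storage" is a plain list, and a counter tracks how many
# times a full dataset is written out. None of this is Pig, Tez, or Spark API.

class Storage:
    def __init__(self):
        self.writes = 0

    def write(self, records):
        self.writes += 1
        return list(records)  # materialize, as a write to disk would


def run_as_separate_jobs(data, steps, storage):
    """One job per step: every step writes its full output to storage."""
    for step in steps:
        data = storage.write(step(data))
    return data


def run_fused(data, steps, storage):
    """Chain the steps lazily and write only the final result."""
    stream = iter(data)
    for step in steps:
        stream = step(stream)
    return storage.write(stream)


steps = [
    lambda rs: (r * 2 for r in rs),             # map
    lambda rs: (r for r in rs if r % 3 == 0),   # filter
    lambda rs: (r + 1 for r in rs),             # map
]

separate, fused = Storage(), Storage()
out_separate = run_as_separate_jobs(range(10), steps, separate)
out_fused = run_fused(range(10), steps, fused)

print(out_separate == out_fused)      # same answer either way
print(separate.writes, fused.writes)  # intermediate writes are what differ
```

Both runs produce the same records; the separate-jobs version writes three
datasets and the fused version writes one. Real engines also pay for job
startup, shuffles, and serialization, none of which this sketch models.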

On Sun, Jul 19, 2015 at 5:02 PM, Yang <> wrote:

> Spark is very hot now, but after reading the paper, I found it surprisingly
> similar to PIG's concept: the RDD is just Relation/set in PIG's
> terminology.
> I think a great strength of Spark is that it tries to merge multiple
> "narrow dependency" stages together to avoid too much IO. does PIG do that
> too? otherwise, I can't figure out what other major design differences
> would lead to huge performance difference, if Spark also uses on-disk
> storage. The overhead to start a MR task should not be that big.

Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype
