This is a good start at adding a cost-based optimizer to Pig. I have a number of comments:

1) Your argument for putting it in the physical layer rather than the logical is that the logical layer does not know physical statistics. This need not be true. You suggest adding a getStatistics call to the loader to give statistics. The logical layer can make this call and make decisions based on the results without understanding the underlying physical layer. It seems that the real reason you want to put the optimizer in the physical layer is that, rather than trying to do predictive statistics (for example, guessing that this join will result in a 2x data explosion), you want to see the results of actual MR jobs and then make decisions. This seems like a reasonable choice for a couple of reasons: a) statistical guesses are hard to get right, and Pig has limited statistics to begin with; b) since Pig Latin scripts can be arbitrarily long, bad guesses at the beginning will have a worse ripple effect than bad guesses in a SQL optimizer.
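To make the getStatistics idea concrete, here is a minimal sketch of what such a hook on the loader might look like. All names here (LoadStatistics, StatisticsAwareLoader, the fields) are illustrative assumptions, not Pig's actual API; the point is only that the logical layer can consume these numbers without knowing anything about the physical layer.

```java
// Hypothetical sketch of a statistics hook on the loader interface.
// Class and method names are made up for illustration; they are not
// part of Pig's current LoadFunc contract.
public class LoadStatistics {
    public final long numRecords;   // estimated record count for the input
    public final long sizeInBytes;  // estimated raw size of the input

    public LoadStatistics(long numRecords, long sizeInBytes) {
        this.numRecords = numRecords;
        this.sizeInBytes = sizeInBytes;
    }
}

interface StatisticsAwareLoader {
    // Returns estimated statistics for the given input location,
    // or null if this loader cannot provide any.
    LoadStatistics getStatistics(String location);
}
```

The logical optimizer would call getStatistics on each load and feed the results into its cost model; a loader with no statistics simply returns null and the optimizer falls back to defaults.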

2) The changes you propose in PigServer are quite complex. Would it be possible instead to put the changes in MapReduceLauncher? It could run the first MR job in a Pig Latin script, look at the results, and then rerun your CBO on the remaining physical plan, re-translate this to a new MR plan, and resubmit. This would require annotations on the MR plan to indicate where in the physical plan the MR boundaries fall, so that the correct portions of the original physical plan could be used for reoptimization and recompilation. But it would confine the complexity of your changes to MapReduceLauncher instead of scattering them through the entire system.
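The run-then-reoptimize loop described above can be sketched as follows. This is a toy model under stated assumptions: jobs are represented as strings, and run/reoptimize are stubs standing in for actual MR job submission and a CBO pass; none of these names correspond to real Pig classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the iterative launcher loop: run the first MR
// job, observe its actual results, then re-optimize the remaining plan
// before launching the next job. All names here are hypothetical.
class IterativeLauncherSketch {

    // Launch jobs one at a time, re-optimizing the remainder after each.
    static List<String> launch(List<String> jobs) {
        List<String> completed = new ArrayList<>();
        while (!jobs.isEmpty()) {
            String first = jobs.remove(0);   // take the first MR job
            completed.add(run(first));       // run it and observe results
            jobs = reoptimize(jobs);         // rerun the CBO on the rest
        }
        return completed;
    }

    // Stub for submitting an MR job and collecting its actual statistics.
    static String run(String job) {
        return job + ":done";
    }

    // Stub for the cost-based optimizer pass over the remaining plan;
    // a real implementation could reorder or rewrite the remaining jobs.
    static List<String> reoptimize(List<String> remaining) {
        return remaining;
    }
}
```

The annotations Alan mentions would live on each job here, mapping it back to its slice of the physical plan so reoptimize can recompile only the portion that has not yet run.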

3) On adding getStatistics, I am currently working on a proposal to make a number of changes to the load interface, including getStatistics. I hope to publish that proposal by next week. Similarly, I am working on a proposal for how Pig will interact with metadata systems (such as Owl), which I also hope to publish next week. We will be actively working in these areas because we need them for our SQL implementation. So, one, you'll get a lot of this for free; two, we should stay connected on these things so what we implement works for what you need.

Alan.

On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote:

Whoops :-)
Here's the Google doc:
http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:
Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL?

Thanks,
Santhosh

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, proved useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal
