This is a good start at adding a cost-based optimizer to Pig. I have a number of comments:

1) Your argument for putting it in the physical layer rather than the logical is that the logical layer does not know physical statistics. This need not be true. You suggest adding a getStatistics call to the loader to give statistics. The logical layer can make this call and make decisions based on the results without understanding the underlying physical layer. It seems that the real reason you want to put the optimizer in the physical layer is that, rather than trying to do predictive statistics (for example, guessing that this join will result in a 2x data explosion), you want to see the results of actual MR jobs and then make decisions. This seems like a reasonable choice for a couple of reasons: a) statistical guesses are hard to get right, and Pig has limited statistics to begin with; b) since Pig Latin scripts can be arbitrarily long, bad guesses at the beginning will have a worse ripple effect than bad guesses in a SQL optimizer.
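To make the getStatistics idea concrete, here is a minimal sketch of what such a hook on the loader might look like. All names here (LoadStatistics, StatisticsAwareLoader, the fields) are illustrative assumptions, not Pig's actual API; the point is only that the logical layer can consume these numbers without knowing anything about the physical layer.

```java
// Hypothetical sketch of a statistics hook on the loader interface.
// Class and method names are made up for illustration; they are not
// part of Pig's current LoadFunc contract.
public class LoadStatistics {
    public final long numRecords;   // estimated record count for the input
    public final long sizeInBytes;  // estimated raw size of the input

    public LoadStatistics(long numRecords, long sizeInBytes) {
        this.numRecords = numRecords;
        this.sizeInBytes = sizeInBytes;
    }
}

interface StatisticsAwareLoader {
    // Returns estimated statistics for the given input location,
    // or null if this loader cannot provide any.
    LoadStatistics getStatistics(String location);
}
```

The logical optimizer would call getStatistics on each load and feed the results into its cost model; a loader with no statistics simply returns null and the optimizer falls back to defaults.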

2) The changes you propose in PigServer are quite complex. Would it be possible instead to put the changes in MapReduceLauncher? It could run the first MR job in a Pig Latin script, look at the results, and then rerun your CBO on the remaining physical plan, re-translate this to a new MR plan, and resubmit. This would require annotations on the MR plan to indicate where in the physical plan the MR boundaries fall, so that the correct portions of the original physical plan could be used for reoptimization and recompilation. But it would confine the complexity of your changes to MapReduceLauncher instead of scattering them through the entire system.
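The run-then-reoptimize loop described above can be sketched as follows. This is a toy model under stated assumptions: jobs are represented as strings, and run/reoptimize are stubs standing in for actual MR job submission and a CBO pass; none of these names correspond to real Pig classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the iterative launcher loop: run the first MR
// job, observe its actual results, then re-optimize the remaining plan
// before launching the next job. All names here are hypothetical.
class IterativeLauncherSketch {

    // Launch jobs one at a time, re-optimizing the remainder after each.
    static List<String> launch(List<String> jobs) {
        List<String> completed = new ArrayList<>();
        while (!jobs.isEmpty()) {
            String first = jobs.remove(0);   // take the first MR job
            completed.add(run(first));       // run it and observe results
            jobs = reoptimize(jobs);         // rerun the CBO on the rest
        }
        return completed;
    }

    // Stub for submitting an MR job and collecting its actual statistics.
    static String run(String job) {
        return job + ":done";
    }

    // Stub for the cost-based optimizer pass over the remaining plan;
    // a real implementation could reorder or rewrite the remaining jobs.
    static List<String> reoptimize(List<String> remaining) {
        return remaining;
    }
}
```

The annotations Alan mentions would live on each job here, mapping it back to its slice of the physical plan so reoptimize can recompile only the portion that has not yet run.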

3) On adding getStatistics, I am currently working on a proposal to make a number of changes to the load interface, including getStatistics. I hope to publish that proposal by next week. Similarly, I am working on a proposal for how Pig will interact with metadata systems (such as Owl), which I also hope to publish next week. We will be actively working in these areas because we need them for our SQL implementation. So, one, you'll get a lot of this for free; two, we should stay connected on these things so what we implement works for what you need.

Alan.

On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote:

Whoops :-)
Here's the Google doc:
http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:
Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL?

Thanks,
Santhosh

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, proved useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal
