Re: Request for feedback: cost-based optimizer

2009-09-11 Thread Dmitriy Ryaboy
Hi Alan,
Thanks for the detailed review.

After getting Daniel's feedback (and grokking the relationship between
Pig's logical and physical operators, which is a little different from
that described in the literature), we agree that the proper place to
put the optimizer is at the logical layer, although we will need to
compile to the physical layer to get cost estimates (for example, the
number of generated MR jobs, which have associated
network/queueing/startup costs). In order to adaptively adjust
estimates, we will need to be able to trace back from an executed MR
job ("job set", really, as some operations like order and join may
require several jobs that are considered a single unit) to the logical
operators this job covered. Adding that ability will have the
additional benefit of enabling more helpful debugging output to end
users by associating a failed MR job with what it was supposed to be
doing.
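
To make the trace-back concrete, here is a rough sketch of the kind of
bookkeeping we have in mind; the class and method names below are invented
for illustration and are not existing Pig APIs.

// Hypothetical sketch, not an existing Pig API: map each submitted MR job
// (or job set) to the logical operators it implements, so that failures and
// cost estimates can be traced back to the logical plan.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JobSetTrace {
    // jobId -> names of the logical operators covered by that job
    private final Map<String, List<String>> jobToLogicalOps =
        new HashMap<String, List<String>>();

    public void record(String jobId, String logicalOpName) {
        List<String> ops = jobToLogicalOps.get(jobId);
        if (ops == null) {
            ops = new ArrayList<String>();
            jobToLogicalOps.put(jobId, ops);
        }
        ops.add(logicalOpName);
    }

    // On failure, report what the job was supposed to be doing.
    public String explainFailure(String failedJobId) {
        List<String> ops = jobToLogicalOps.get(failedJobId);
        return "Job " + failedJobId + " implemented logical operators: "
                + (ops == null ? "(unknown)" : ops.toString());
    }
}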

Totally agree with respect to PigServer and MapReduceLauncher.  Making
PigServer an actual "server" would be good, but is somewhat orthogonal
to this work.

Great to know you are working on statistics; I look forward to reading
the proposal.  Are you working on just data stats, or also
execution stats (time per operator per record, that sort of thing)?

Thanks
-Dmitriy

On Fri, Sep 11, 2009 at 1:56 PM, Alan Gates  wrote:
> This is a good start at adding a cost-based optimizer to Pig.  I have a
> number of comments:
>
> 1) Your argument for putting it in the physical layer rather than the
> logical is that the logical layer does not know physical statistics.  This
> need not be true.  You suggest adding a getStatistics call to the loader to
> give statistics.  The logical layer can make this call and make decisions
> based on the results without understanding the underlying physical layer.
>  It seems that the real reason you want to put the optimizer in the physical
> layer is, rather than trying to do predictive statistics (such as we guess
> this join will result in a 2x data explosion) you want to see the results of
> actual MR jobs and then make decisions.  This seems like a reasonable choice
> for a couple of reasons:  a) statistical guesses are hard to get right, and
> Pig has limited statistics to begin with; b) since Pig Latin scripts can be
> arbitrarily long, bad guesses at the beginning will have a worse ripple
> effect than bad guesses in a SQL optimizer.
>
> 2) The changes you propose in Pig Server are quite complex.  Would it be
> possible instead to put the changes in MapReduceLauncher?  It could run the
> first MR job in a Pig Latin script, look at the results, and then rerun your
> CBO on the remaining physical plan and re-translate this to a new MR plan
> and resubmit.  This would require annotations to the MR plan to indicate
> where in a physical plan the MR boundaries fall, so that correct portions of
> the original physical plan could be used for reoptimization and
> recompilation.  But it would confine the complexity of your changes to
> MapReduceLauncher instead of scattering them through the entire system.
>
> 3) On adding getStatistics, I am currently working on a proposal to make a
> number of changes to the load interface, including getStatistics.  I hope to
> publish that proposal by next week.  Similarly I am working on a proposal of
> how Pig will interact with metadata systems (such as Owl) which I also hope
> to propose next week.  We will be actively working in these areas because we
> need them for our SQL implementation.  So, one, you'll get a lot of this for
> free; two, we should stay connected on these things so what we implement
> works for what you need.
>
> Alan.
>
> On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote:
>
>> Whoops :-)
>> Here's the Google doc:
>>
>> http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en
>>
>> -Dmitriy
>>
>> On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan
>> wrote:
>>>
>>> Dmitriy and Gang,
>>>
>>> The mailing list does not allow attachments. Can you post it on a
>>> website and just send the URL ?
>>>
>>> Thanks,
>>> Santhosh
>>>
>>> -Original Message-
>>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>>> Sent: Tuesday, September 01, 2009 9:48 AM
>>> To: pig-dev@hadoop.apache.org
>>> Subject: Request for feedback: cost-based optimizer
>>>
>>> Hi everyone,
>>> Attached is a (very) preliminary document outlining a rough design we
>>> are proposing for a cost-based optimizer for Pig.
>>> This is being done as a capstone project by three CMU Master's students
>>> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
>>> necessarily meant for immediate incorporation into the Pig codebase,
>>> although it would be nice if it, or parts of it, are found to be useful
>>> in the mainline.
>>>
>>> We would love to get some feedback from the developer community
>>> regarding the ideas expressed in the document, any concerns about the
>>> design, suggestions for improvement, etc.
>>>
>>> Thanks

Re: Request for feedback: cost-based optimizer

2009-09-11 Thread Alan Gates
This is a good start at adding a cost-based optimizer to Pig.  I have
a number of comments:


1) Your argument for putting it in the physical layer rather than the  
logical is that the logical layer does not know physical statistics.   
This need not be true.  You suggest adding a getStatistics call to the  
loader to give statistics.  The logical layer can make this call and  
make decisions based on the results without understanding the  
underlying physical layer.  It seems that the real reason you want to  
put the optimizer in the physical layer is, rather than trying to do  
predictive statistics (such as we guess this join will result in a 2x  
data explosion) you want to see the results of actual MR jobs and then  
make decisions.  This seems like a reasonable choice for a couple of  
reasons:  a) statistical guesses are hard to get right, and Pig has  
limited statistics to begin with; b) since Pig Latin scripts can be  
arbitrarily long, bad guesses at the beginning will have a worse  
ripple effect than bad guesses in a SQL optimizer.
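
To be concrete, here is a minimal sketch of the kind of call I mean; the
names below are illustrative only, not the actual load interface.

// Illustrative only -- not the actual Pig load interface.  The point is that
// the logical layer can ask the loader for statistics like these without
// knowing anything about the physical plan.
public interface StatisticsProvidingLoader {
    /** Summary statistics for the input at the given location. */
    InputStatistics getStatistics(String location);
}

class InputStatistics {
    long sizeInBytes;   // total bytes of input data
    long numRecords;    // estimated record count, or -1 if unknown
    // per-field cardinality or distribution estimates could be added later
}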


2) The changes you propose in Pig Server are quite complex.  Would it  
be possible instead to put the changes in MapReduceLauncher?  It could  
run the first MR job in a Pig Latin script, look at the results, and  
then rerun your CBO on the remaining physical plan and re-translate  
this to a new MR plan and resubmit.  This would require annotations to  
the MR plan to indicate where in a physical plan the MR boundaries  
fall, so that correct portions of the original physical plan could be  
used for reoptimization and recompilation.  But it would confine the
complexity of your changes to MapReduceLauncher instead of scattering  
them through the entire system.
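
Roughly, the loop I have in mind would look like the sketch below; all of
the types and method names are invented stand-ins, not the real Pig classes.

// Hypothetical sketch (invented names) of the loop described above: run the
// first MR job, observe its results, re-run the CBO on the remaining physical
// plan, recompile, and submit the next job.
import java.util.List;

interface PhysPlan {
    boolean isEmpty();
    PhysPlan withoutOperatorsCoveredBy(MRJob finished);  // needs MR-boundary annotations
}

interface MRJob { }

interface JobResult { long outputBytes(); long outputRecords(); }

interface CostBasedOptimizer {
    PhysPlan reoptimize(PhysPlan remaining, JobResult observed);
}

abstract class AdaptiveLauncher {
    private final CostBasedOptimizer cbo;

    AdaptiveLauncher(CostBasedOptimizer cbo) { this.cbo = cbo; }

    // Compile the (remaining) physical plan into an ordered list of MR jobs.
    abstract List<MRJob> compile(PhysPlan plan);

    // Submit one job and block until it finishes, returning its output stats.
    abstract JobResult submitAndWait(MRJob job) throws Exception;

    void launch(PhysPlan plan) throws Exception {
        while (!plan.isEmpty()) {
            MRJob first = compile(plan).get(0);     // run only the first job
            JobResult observed = submitAndWait(first);

            // Drop what the finished job covered, then re-cost the rest before
            // compiling and submitting the next job.
            plan = plan.withoutOperatorsCoveredBy(first);
            plan = cbo.reoptimize(plan, observed);
        }
    }
}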


3) On adding getStatistics, I am currently working on a proposal to  
make a number of changes to the load interface, including  
getStatistics.  I hope to publish that proposal by next week.   
Similarly I am working on a proposal of how Pig will interact with  
metadata systems (such as Owl) which I also hope to propose next  
week.  We will be actively working in these areas because we need them  
for our SQL implementation.  So, one, you'll get a lot of this for  
free; two, we should stay connected on these things so what we  
implement works for what you need.


Alan.

On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote:


Whoops :-)
Here's the Google doc:
http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan wrote:

Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL ?

Thanks,
Santhosh

-Original Message-
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, are found to be useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal





Re: Request for feedback: cost-based optimizer

2009-09-03 Thread Dmitriy Ryaboy
Daniel, thanks for the information, this is useful.


On Wed, Sep 2, 2009 at 2:06 PM, Jianyong Dai wrote:
> Yes, physical properties are important for an optimizer. To optimize Pig
> well, we need to know the underlying Hadoop execution environment, such as the
> number of map-reduce jobs, how many maps/reducers, how the job is configured,
> etc. This is true even for a rule-based optimizer. Unfortunately, the physical
> layer does not provide much physical information, despite its name. Basically,
> the physical layer is a rephrasing of the logical layer using physical
> operators. Compared to logical operators, physical operators include the
> implementation of pipeline processing but strip away many logical details such
> as "schema". Also, in the logical layer, we have infrastructure to restructure
> logical operators, such as moving nodes around, swapping nodes, etc., which
> does not exist in the physical layer. From the optimizer's point of view, the
> physical layer does not provide the necessary information and is harder to
> deal with. If you would like to work with physical details, I think the
> map-reduce layer is the right place to look. However, restructuring the
> map-reduce layer is hard because we do not have all the infrastructure to move
> things around. Another approach is to use a combined logical layer and
> map-reduce layer for the optimization. In this approach, you restructure the
> logical layer by observing the physical details from the map-reduce layer. The
> downside is that we would have to tightly couple Pig to Hadoop. But now that
> Pig is a subproject of Hadoop and almost all Pig users use Hadoop, I think it
> is fine to optimize toward Hadoop.
>
>
> Dmitriy Ryaboy wrote:
>>
>> Our initial survey of related literature showed that the usual place
>> for a CBO tends to be between the physical and logical layer (in fact,
>> the famous Cascades paper advocates removing the distinction between
>> physical and logical operators altogether, and using an "is_logical"
>> and "is_physical" flag instead -- meaning an operator can be one,
>> both, or neither).
>>
>> The reasoning is that you cannot properly determine a cost of a plan
>> if you don't know the physical "properties" of the operators that
>> implement it. An optimizer that works at a logical layer would by
>> definition create the same plan whether in local or mapreduce mode
>> (since such differences are abstracted from it). This is clearly
>> incorrect, as the properties of the environment in which these plans
>> are executed are drastically different.  Working at the physical layer
>> lets us stay close to the iron and adjust based on the specifics of
>> the execution environment.
>>
>> Certainly one can posit a framework for a CBO that would set up the
>> necessary interfaces and plumbing for optimizing in any execution
>> mode, and invoke the proper implementations at run time; we are not
>> discounting that possibility (haven't gotten quite that far in the
>> design, to be honest).  But we feel that the implementations have to
>> be execution mode specific.
>>
>> -Dmitriy
>>
>> On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai
>> wrote:
>>
>>>
>>> I am still reading, but one interesting question: why did you decide to
>>> put the CBO in the physical layer?
>>>
>>> Dmitriy Ryaboy wrote:
>>>

 Whoops :-)
 Here's the Google doc:


 http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

 -Dmitriy

 On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan
 wrote:


>
> Dmitriy and Gang,
>
> The mailing list does not allow attachments. Can you post it on a
> website and just send the URL ?
>
> Thanks,
> Santhosh
>
> -Original Message-
> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> Sent: Tuesday, September 01, 2009 9:48 AM
> To: pig-dev@hadoop.apache.org
> Subject: Request for feedback: cost-based optimizer
>
> Hi everyone,
> Attached is a (very) preliminary document outlining a rough design we
> are proposing for a cost-based optimizer for Pig.
> This is being done as a capstone project by three CMU Master's students
> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
> necessarily meant for immediate incorporation into the Pig codebase,
> although it would be nice if it, or parts of it, are found to be useful
> in the mainline.
>
> We would love to get some feedback from the developer community
> regarding the ideas expressed in the document, any concerns about the
> design, suggestions for improvement, etc.
>
> Thanks,
> Dmitriy, Ashutosh, Tejal
>
>
>
>>>
>>>
>
>


Re: Request for feedback: cost-based optimizer

2009-09-02 Thread Jianyong Dai
Yes, physical properties are important for an optimizer. To optimize Pig
well, we need to know the underlying Hadoop execution environment, such
as the number of map-reduce jobs, how many maps/reducers, how the job is
configured, etc. This is true even for a rule-based optimizer.
Unfortunately, the physical layer does not provide much physical
information, despite its name. Basically, the physical layer is a
rephrasing of the logical layer using physical operators. Compared to
logical operators, physical operators include the implementation of
pipeline processing but strip away many logical details such as "schema".
Also, in the logical layer, we have infrastructure to restructure logical
operators, such as moving nodes around, swapping nodes, etc., which does
not exist in the physical layer. From the optimizer's point of view, the
physical layer does not provide the necessary information and is harder
to deal with. If you would like to work with physical details, I think
the map-reduce layer is the right place to look. However, restructuring
the map-reduce layer is hard because we do not have all the
infrastructure to move things around. Another approach is to use a
combined logical layer and map-reduce layer for the optimization. In this
approach, you restructure the logical layer by observing the physical
details from the map-reduce layer. The downside is that we would have to
tightly couple Pig to Hadoop. But now that Pig is a subproject of Hadoop
and almost all Pig users use Hadoop, I think it is fine to optimize
toward Hadoop.
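
A rough sketch of that combined approach, with every name below invented
for illustration: the rewrites are applied at the logical layer (which
already knows how to move and swap nodes), while the cost of each candidate
comes from the map-reduce plan it compiles to.

// Rough sketch of the combined logical + map-reduce approach.  None of these
// types are the real Pig classes; they only stand in for the idea.
import java.util.List;

interface LogicalPlan {
    LogicalPlan apply(Rule r);          // restructure: move/swap nodes, etc.
}

interface Rule { }

interface MRCompiler {
    // logical -> physical -> map-reduce; the MR plan is where physical details
    // such as the number of jobs become visible.
    int estimatedCost(LogicalPlan plan);   // e.g. derived from the number of MR jobs
}

class CombinedOptimizer {
    private final MRCompiler compiler;
    private final List<Rule> candidateRules;

    CombinedOptimizer(MRCompiler compiler, List<Rule> candidateRules) {
        this.compiler = compiler;
        this.candidateRules = candidateRules;
    }

    LogicalPlan optimize(LogicalPlan plan) {
        for (Rule rule : candidateRules) {
            LogicalPlan candidate = plan.apply(rule);
            // Keep a rewrite only if the compiled MR plan looks cheaper.
            if (compiler.estimatedCost(candidate) < compiler.estimatedCost(plan)) {
                plan = candidate;
            }
        }
        return plan;
    }
}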



Dmitriy Ryaboy wrote:

Our initial survey of related literature showed that the usual place
for a CBO tends to be between the physical and logical layer (in fact,
the famous Cascades paper advocates removing the distinction between
physical and logical operators altogether, and using an "is_logical"
and "is_physical" flag instead -- meaning an operator can be one,
both, or neither).

The reasoning is that you cannot properly determine a cost of a plan
if you don't know the physical "properties" of the operators that
implement it. An optimizer that works at a logical layer would by
definition create the same plan whether in local or mapreduce mode
(since such differences are abstracted from it). This is clearly
incorrect, as the properties of the environment in which these plans
are executed are drastically different.  Working at the physical layer
lets us stay close to the iron and adjust based on the specifics of
the execution environment.

Certainly one can posit a framework for a CBO that would set up the
necessary interfaces and plumbing for optimizing in any execution
mode, and invoke the proper implementations at run time; we are not
discounting that possibility (haven't gotten quite that far in the
design, to be honest).  But we feel that the implementations have to
be execution mode specific.

-Dmitriy

On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai wrote:
  

I am still reading, but one interesting question: why did you decide to put the
CBO in the physical layer?

Dmitriy Ryaboy wrote:


Whoops :-)
Here's the Google doc:

http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan
wrote:

  

Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL ?

Thanks,
Santhosh

-Original Message-
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, are found to be useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal








Re: Request for feedback: cost-based optimizer

2009-09-01 Thread Dmitriy Ryaboy
Our initial survey of related literature showed that the usual place
for a CBO tends to be between the physical and logical layer (in fact,
the famous Cascades paper advocates removing the distinction between
physical and logical operators altogether, and using an "is_logical"
and "is_physical" flag instead -- meaning an operator can be one,
both, or neither).
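
As a tiny illustration of that idea (this is not code from the paper, just
the shape of it):

// A single operator hierarchy carrying is_logical / is_physical flags instead
// of separate logical and physical operator classes.  Purely illustrative.
abstract class Operator {
    private final boolean isLogical;
    private final boolean isPhysical;

    protected Operator(boolean isLogical, boolean isPhysical) {
        this.isLogical = isLogical;
        this.isPhysical = isPhysical;
    }

    boolean isLogical()  { return isLogical; }
    boolean isPhysical() { return isPhysical; }  // an operator can be one, both, or neither
}

// Example: a generic "join" is purely logical, while a particular hash-join
// implementation is purely physical; some operators could set both flags.
class Join     extends Operator { Join()     { super(true,  false); } }
class HashJoin extends Operator { HashJoin() { super(false, true);  } }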

The reasoning is that you cannot properly determine a cost of a plan
if you don't know the physical "properties" of the operators that
implement it. An optimizer that works at a logical layer would by
definition create the same plan whether in local or mapreduce mode
(since such differences are abstracted from it). This is clearly
incorrect, as the properties of the environment in which these plans
are executed are drastically different.  Working at the physical layer
lets us stay close to the iron and adjust based on the specifics of
the execution environment.

Certainly one can posit a framework for a CBO that would set up the
necessary interfaces and plumbing for optimizing in any execution
mode, and invoke the proper implementations at run time; we are not
discounting that possibility (haven't gotten quite that far in the
design, to be honest).  But we feel that the implementations have to
be execution mode specific.

-Dmitriy

On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai wrote:
> I am still reading, but one interesting question: why did you decide to put
> the CBO in the physical layer?
>
> Dmitriy Ryaboy wrote:
>>
>> Whoops :-)
>> Here's the Google doc:
>>
>> http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en
>>
>> -Dmitriy
>>
>> On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan
>> wrote:
>>
>>>
>>> Dmitriy and Gang,
>>>
>>> The mailing list does not allow attachments. Can you post it on a
>>> website and just send the URL ?
>>>
>>> Thanks,
>>> Santhosh
>>>
>>> -Original Message-
>>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>>> Sent: Tuesday, September 01, 2009 9:48 AM
>>> To: pig-dev@hadoop.apache.org
>>> Subject: Request for feedback: cost-based optimizer
>>>
>>> Hi everyone,
>>> Attached is a (very) preliminary document outlining a rough design we
>>> are proposing for a cost-based optimizer for Pig.
>>> This is being done as a capstone project by three CMU Master's students
>>> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
>>> necessarily meant for immediate incorporation into the Pig codebase,
>>> although it would be nice if it, or parts of it, are found to be useful
>>> in the mainline.
>>>
>>> We would love to get some feedback from the developer community
>>> regarding the ideas expressed in the document, any concerns about the
>>> design, suggestions for improvement, etc.
>>>
>>> Thanks,
>>> Dmitriy, Ashutosh, Tejal
>>>
>>>
>
>


Re: Request for feedback: cost-based optimizer

2009-09-01 Thread Jianyong Dai
I am still reading, but one interesting question: why did you decide to put
the CBO in the physical layer?


Dmitriy Ryaboy wrote:

Whoops :-)
Here's the Google doc:
http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan wrote:
  

Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL ?

Thanks,
Santhosh

-Original Message-
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, are found to be useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal






Re: Request for feedback: cost-based optimizer

2009-09-01 Thread Dmitriy Ryaboy
Whoops :-)
Here's the Google doc:
http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan wrote:
> Dmitriy and Gang,
>
> The mailing list does not allow attachments. Can you post it on a
> website and just send the URL ?
>
> Thanks,
> Santhosh
>
> -Original Message-
> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> Sent: Tuesday, September 01, 2009 9:48 AM
> To: pig-dev@hadoop.apache.org
> Subject: Request for feedback: cost-based optimizer
>
> Hi everyone,
> Attached is a (very) preliminary document outlining a rough design we
> are proposing for a cost-based optimizer for Pig.
> This is being done as a capstone project by three CMU Master's students
> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
> necessarily meant for immediate incorporation into the Pig codebase,
> although it would be nice if it, or parts of it, are found to be useful
> in the mainline.
>
> We would love to get some feedback from the developer community
> regarding the ideas expressed in the document, any concerns about the
> design, suggestions for improvement, etc.
>
> Thanks,
> Dmitriy, Ashutosh, Tejal
>


RE: Request for feedback: cost-based optimizer

2009-09-01 Thread Santhosh Srinivasan
Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL ?

Thanks,
Santhosh 

-Original Message-
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] 
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, are found to be useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal