Daniel, thanks for the information. This is useful.

On Wed, Sep 2, 2009 at 2:06 PM, Jianyong Dai<jiany...@yahoo-inc.com> wrote:
> Yes, physical properties are important for an optimizer. To optimize Pig
> well, we need to know the underlying hadoop execution environment, such as
> the number of map-reduce jobs, how many maps/reducers, how the job is
> configured, etc. This is true even for a rule-based optimizer.
> Unfortunately, the physical layer does not provide as much physical
> information as the name suggests. Basically, the physical layer is a
> rephrasing of the logical layer using physical operators. Compared to
> logical operators, physical operators include the implementation of
> pipeline processing but strip away many logical details such as "schema".
> Also, in the logical layer, we have infrastructure to restructure logical
> operators, such as moving nodes around, swapping nodes, etc., which does
> not exist in the physical layer. From the optimizer's point of view, the
> physical layer does not give the necessary information and is harder to
> deal with. If you would like to work with physical details, I think the
> map-reduce layer is the right place to look. However, restructuring the
> map-reduce layer is hard because we do not have all the infrastructure to
> move things around. Another approach is to use a combined logical layer
> and map-reduce layer for the optimization. In this approach, you
> restructure the logical layer by observing the physical details from the
> map-reduce layer. The downside is that we have to tightly couple Pig to
> hadoop. But now that Pig is a subproject of hadoop and almost all Pig
> users are using hadoop, I think it is fine to optimize things towards
> hadoop.
>
> Dmitriy Ryaboy wrote:
>>
>> Our initial survey of related literature showed that the usual place
>> for a CBO tends to be between the physical and logical layers (in fact,
>> the famous Cascades paper advocates removing the distinction between
>> physical and logical operators altogether, and using an "is_logical"
>> and "is_physical" flag instead -- meaning an operator can be one,
>> both, or neither).
>>
>> The reasoning is that you cannot properly determine the cost of a plan
>> if you don't know the physical "properties" of the operators that
>> implement it. An optimizer that works at a logical layer would by
>> definition create the same plan whether in local or mapreduce mode
>> (since such differences are abstracted from it). This is clearly
>> incorrect, as the properties of the environment in which these plans
>> are executed are drastically different. Working at the physical layer
>> lets us stay close to the iron and adjust based on the specifics of
>> the execution environment.
>>
>> Certainly one can posit a framework for a CBO that would set up the
>> necessary interfaces and plumbing for optimizing in any execution
>> mode, and invoke the proper implementations at run time; we are not
>> discounting that possibility (haven't gotten quite that far in the
>> design, to be honest). But we feel that the implementations have to
>> be execution mode specific.
>>
>> -Dmitriy
>>
>> On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai<jiany...@yahoo-inc.com>
>> wrote:
>>
>>>
>>> I am still reading, but one interesting question is why you decided to
>>> put the CBO in the physical layer?
>>>
>>> Dmitriy Ryaboy wrote:
>>>
>>>>
>>>> Whoops :-)
>>>> Here's the Google doc:
>>>>
>>>> http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en
>>>>
>>>> -Dmitriy
>>>>
>>>> On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan<s...@yahoo-inc.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Dmitriy and Gang,
>>>>>
>>>>> The mailing list does not allow attachments. Can you post it on a
>>>>> website and just send the URL?
>>>>>
>>>>> Thanks,
>>>>> Santhosh
>>>>>
>>>>> -----Original Message-----
>>>>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>>>>> Sent: Tuesday, September 01, 2009 9:48 AM
>>>>> To: pig-dev@hadoop.apache.org
>>>>> Subject: Request for feedback: cost-based optimizer
>>>>>
>>>>> Hi everyone,
>>>>> Attached is a (very) preliminary document outlining a rough design we
>>>>> are proposing for a cost-based optimizer for Pig.
>>>>> This is being done as a capstone project by three CMU Master's students
>>>>> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
>>>>> necessarily meant for immediate incorporation into the Pig codebase,
>>>>> although it would be nice if it, or parts of it, are found to be useful
>>>>> in the mainline.
>>>>>
>>>>> We would love to get some feedback from the developer community
>>>>> regarding the ideas expressed in the document, any concerns about the
>>>>> design, suggestions for improvement, etc.
>>>>>
>>>>> Thanks,
>>>>> Dmitriy, Ashutosh, Tejal
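
For readers following the thread, here is a rough illustration of the argument above: a plan cannot be costed without physical properties, and the cheapest implementation may differ between local and mapreduce mode. This is only a toy sketch in Python, not Pig code. Every name in it (Operator, ExecMode, operator_cost, choose_cheapest, the two candidate implementations) and every cost constant is a made-up assumption. The is_logical / is_physical flags follow the description of the Cascades idea given in the thread, nothing more.

```python
from dataclasses import dataclass
from enum import Enum


class ExecMode(Enum):
    LOCAL = "local"
    MAPREDUCE = "mapreduce"


@dataclass(frozen=True)
class Operator:
    """Hypothetical operator carrying both flags instead of two class trees."""
    name: str
    is_logical: bool
    is_physical: bool
    jobs: int = 0   # assumed number of map-reduce jobs this implementation needs
    rows: int = 0   # assumed number of rows it has to process


# Invented cost units: job startup dominates in mapreduce mode,
# per-row work dominates in local mode. Real numbers would need measuring.
JOB_STARTUP_COST = {ExecMode.LOCAL: 0, ExecMode.MAPREDUCE: 10_000}
PER_ROW_COST = {ExecMode.LOCAL: 10, ExecMode.MAPREDUCE: 1}


def operator_cost(op: Operator, mode: ExecMode) -> int:
    """Cost one operator; purely logical operators have no cost to report."""
    if not op.is_physical:
        raise ValueError(f"{op.name} has no physical properties to cost")
    return op.jobs * JOB_STARTUP_COST[mode] + op.rows * PER_ROW_COST[mode]


def choose_cheapest(candidates: list[Operator], mode: ExecMode) -> Operator:
    """Pick the cheapest physical implementation for the given mode."""
    return min(candidates, key=lambda op: operator_cost(op, mode))


# A logical join and two hypothetical physical implementations of it.
LOGICAL_JOIN = Operator("join", is_logical=True, is_physical=False)
CANDIDATES = [
    Operator("impl_one_job", False, True, jobs=1, rows=5_000),
    Operator("impl_two_jobs", False, True, jobs=2, rows=2_000),
]

for mode in ExecMode:
    best = choose_cheapest(CANDIDATES, mode)
    print(mode.value, "->", best.name, operator_cost(best, mode))
```

With these invented constants the choice flips between the two modes, which is the point made in the thread: an optimizer that sees only the logical join would produce the same plan in both.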