Re: Consider cleaning up backend code
+1 for removing. This interface does not bring us any value now that we have decided to move closer to Hadoop. Writing a backend is almost writing half of Pig, and I don't think this interface is attractive to most developers. Instead, I +1 Milind's idea to make intermediate artifacts available, or to provide hooks for users to peek at or morph the plan at different stages. This opens the door for developers to visualize, debug, and improve Pig without knowing every detail of Pig.

Daniel

Alan Gates wrote:
A couple of years ago we had this concept that Pig, as is, should be able to run on other backends (like, say, Dryad if it were open source). So we built this whole backend interface and (mostly) kept Hadoop-specific objects out of the front end. Recently we have modified that stance and said that this implementation of Pig is Hadoop specific. Pig Latin itself will still stay Hadoop independent, so the ability to have multiple backends is fine. But the ability to have non-Hadoop backends is not really interesting now. So I at least see the proposal here as getting rid of generic code that tries to hide the fact that we are working on top of Hadoop (things like DataStorage and ExecutionEngine).

Alan.

On Apr 22, 2010, at 4:14 PM, Arun C Murthy wrote:
I read it as getting rid of concepts parallel to Hadoop in src/org/apache/pig/backend/hadoop/datastorage. Is that true?

thanks,
Arun

On Apr 22, 2010, at 1:34 PM, Dmitriy Ryaboy wrote:
I kind of dig the concept of being able to plug in a different backend, though I definitely think we should get rid of the dead local-mode code. Can you give an example of how this will simplify the codebase? Is it more than just GenericClass foo = new SpecificClass(), and the associated extra files?

-D

On Thu, Apr 22, 2010 at 1:25 PM, Arun C Murthy wrote:
+1
Arun

On Apr 22, 2010, at 11:35 AM, Richard Ding wrote:
Pig has an abstraction layer (interfaces and abstract classes) to support multiple execution engines. After PIG-1053, Hadoop is the only execution engine supported by Pig. I wonder if we should remove this layer of code and make Hadoop THE execution engine for Pig. This would simplify the backend code a lot.

Thanks,
-Richard
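The "GenericClass foo = new SpecificClass()" pattern Dmitriy asks about, and the cleanup Richard proposes, can be sketched as follows. This is a minimal illustration only: the interface shape and method names here are invented stand-ins, not Pig's actual ExecutionEngine or DataStorage APIs.

```java
// Illustrative sketch (names and signatures are hypothetical, not Pig's real API):
// the pluggable-backend indirection under discussion, and its direct replacement.

// Before: a generic interface hides the single concrete Hadoop backend.
interface ExecutionEngine {
    String submit(String plan);
}

class HadoopExecutionEngine implements ExecutionEngine {
    public String submit(String plan) {
        return "hadoop:" + plan; // stand-in for real job submission
    }
}

public class BackendSketch {
    public static void main(String[] args) {
        // The indirection being removed: generic type, specific instance.
        ExecutionEngine engine = new HadoopExecutionEngine();
        System.out.println(engine.submit("plan-A"));

        // After removal: the Hadoop-specific class is used directly,
        // and the interface (plus its associated extra files) can be deleted.
        HadoopExecutionEngine hadoop = new HadoopExecutionEngine();
        System.out.println(hadoop.submit("plan-A"));
    }
}
```

The behavior is identical either way; the simplification is purely in the number of types and files the codebase has to carry for a second backend that no longer exists.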
Re: Broken build
Hi Dmitriy, I just did a fresh build and ran test-commit, and didn't see the problem. Besides, org.apache.pig.experimental.logical.optimizer.PlanPrinter is in trunk. Can you double-check?

Daniel

Dmitriy Ryaboy wrote:
Hi guys, trunk has been broken for a while. A bunch of tests in the test-commit target fail, mostly due to "The import org.apache.pig.experimental.logical.optimizer.PlanPrinter cannot be resolved." Could someone check in the missing file?

-D
Re: [VOTE] Branch for Pig 0.6.0 release
+1. I think Jeff's patch for the file system commands (PIG-891) also deserves some advertisement; those commands are really handy for end users.

Daniel

Alan Gates wrote:
+1. In addition to the new features we've added, our change to use Hadoop's LineRecordReader brought Pig to parity with Hadoop in the PigMix tests, about a 30% average performance improvement. This should be huge for our users.

Alan.

On Nov 9, 2009, at 12:26 PM, Olga Natkovich wrote:
Hi, I would like to propose branching for the Pig 0.6.0 release, with the intent to have a release before the end of the year. We have done a lot of work since branching for Pig 0.5.0 that we would like to share with users. This includes changing how bags are spilled to disk (PIG-975, PIG-1037), skewed and fragment-replicated outer join, plus many other performance improvements and bug fixes. Please vote by Thursday.

Thanks,
Olga
Re: [VOTE] Release Pig 0.4.0 (candidate 2)
I removed ~/pigtest/conf/hadoop-site.xml and built piggybank again; all tests pass. For some reason MiniCluster does not regenerate hadoop-site.xml and reuses the old one, which happens to be wrong.

Olga Natkovich wrote:
Hi, the new version is available at http://people.apache.org/~olga/pig-0.4.0-candidate-2/. I see one failure in a unit test in piggybank (contrib), but it is not related to the functions themselves; it seems to be an issue with MiniCluster, and I don't feel we need to chase it down. I made sure that the same test runs OK with Hadoop 20. Please vote by end of day on Thursday, 9/24.

Olga

-Original Message-
From: Olga Natkovich [mailto:ol...@yahoo-inc.com]
Sent: Thursday, September 17, 2009 12:09 PM
To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org
Subject: [VOTE] Release Pig 0.4.0 (candidate 1)

Hi, I have fixed the issue causing the failure that Alan reported. Please test the new release: http://people.apache.org/~olga/pig-0.4.0-candidate-1/. The vote closes on Tuesday, 9/22.

Olga

-Original Message-
From: Olga Natkovich [mailto:ol...@yahoo-inc.com]
Sent: Monday, September 14, 2009 2:06 PM
To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org
Subject: [VOTE] Release Pig 0.4.0 (candidate 0)

Hi, I created a candidate build for the Pig 0.4.0 release. The highlights of this release are:
- Performance improvements, especially in the area of JOIN support, where we introduced two new join types: skew join to deal with data skew, and sort-merge join to take advantage of sorted data sets.
- Support for outer join.
- Works with Hadoop 18.

I ran the release audit and the rat report looked fine; the relevant part is attached below. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup. Please download the release and try it out: http://people.apache.org/~olga/pig-0.4.0-candidate-0. Should we release this? The vote closes on Thursday, 9/17.

Olga

[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/CHANGES.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/zebra/CHANGES.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/broken-links.xml
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/cookbook.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/index.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/linkmap.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_reference.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_users.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/setup.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/tutorial.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/udf.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/api/package-list
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/missingSinces.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/user_comments_for_pig_0.3.1_to_pig_0.5.0-dev.xml
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_additions.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_all.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_removals.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/changes-summary.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_additions.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_all.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_removals.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_additions.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_all.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/j
Re: Request for feedback: cost-based optimizer
Yes, physical properties are important for an optimizer. To optimize Pig well, we need to know the underlying Hadoop execution environment: the number of map-reduce jobs, how many maps and reducers, how the job is configured, and so on. This is true even for a rule-based optimizer. Unfortunately, the physical layer does not provide much physical information, despite its name. The physical layer is basically a rephrasing of the logical layer using physical operators. Compared to logical operators, physical operators include the implementation of pipeline processing but strip away many logical details such as "schema". Also, in the logical layer we have infrastructure to restructure logical operators (move nodes around, swap nodes, etc.), which does not exist in the physical layer. From the optimizer's point of view, the physical layer does not give the necessary information, yet is harder to deal with. If you want to work with physical details, I think the map-reduce layer is the right place to look. However, restructuring the map-reduce layer is hard because we do not have the infrastructure to move things around. Another approach is to use the logical layer and the map-reduce layer together for optimization: you restructure the logical layer by observing the physical details from the map-reduce layer. The downside is that we would tightly couple Pig to Hadoop. But now that Pig is a subproject of Hadoop and almost all Pig users are on Hadoop, I think it is fine to optimize toward Hadoop.

Dmitriy Ryaboy wrote:
Our initial survey of the related literature showed that the usual place for a CBO is between the physical and logical layers (in fact, the famous Cascades paper advocates removing the distinction between physical and logical operators altogether and using "is_logical" and "is_physical" flags instead, meaning an operator can be one, both, or neither). The reasoning is that you cannot properly determine the cost of a plan if you don't know the physical "properties" of the operators that implement it. An optimizer that works at the logical layer would by definition create the same plan in local mode as in mapreduce mode (since such differences are abstracted away from it). This is clearly incorrect, as the properties of the environments in which these plans are executed are drastically different. Working at the physical layer lets us stay close to the iron and adjust based on the specifics of the execution environment. Certainly one could posit a CBO framework that sets up the necessary interfaces and plumbing for optimizing in any execution mode and invokes the proper implementations at run time; we are not discounting that possibility (we haven't gotten quite that far in the design, to be honest). But we feel that the implementations have to be execution-mode specific.

-Dmitriy

On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai wrote:
I am still reading, but one interesting question is why you decided to put the CBO in the physical layer?

Dmitriy Ryaboy wrote:
Whoops :-) Here's the Google doc: http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan wrote:
Dmitriy and gang, the mailing list does not allow attachments. Can you post it on a website and just send the URL?

Thanks,
Santhosh

-Original Message-
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone, attached is a (very) preliminary document outlining a rough design we are proposing for a cost-based optimizer for Pig. This is being done as a capstone project by three CMU Master's students (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not necessarily meant for immediate incorporation into the Pig codebase, although it would be nice if it, or parts of it, are found useful in the mainline. We would love to get feedback from the developer community regarding the ideas expressed in the document, any concerns about the design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal
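The Cascades-style flags Dmitriy mentions (an operator can be logical, physical, both, or neither) can be sketched as a tiny data model. All names below are illustrative: this is not Pig's operator hierarchy, just the idea from the discussion.

```java
// Hypothetical sketch of the Cascades idea discussed above: instead of
// separate logical and physical operator class hierarchies, one operator
// type carries is_logical / is_physical flags.

class Operator {
    final String name;
    final boolean isLogical;   // has semantics the optimizer can reason about
    final boolean isPhysical;  // has an executable implementation with a cost

    Operator(String name, boolean isLogical, boolean isPhysical) {
        this.name = name;
        this.isLogical = isLogical;
        this.isPhysical = isPhysical;
    }
}

public class CascadesSketch {
    public static void main(String[] args) {
        Operator[] ops = {
            new Operator("Join", true, false),        // purely logical
            new Operator("HashJoin", false, true),    // purely physical
            new Operator("Filter", true, true),       // both
            new Operator("PlanMarker", false, false), // neither (a hypothetical plan annotation)
        };
        for (Operator op : ops) {
            System.out.println(op.name + " logical=" + op.isLogical
                    + " physical=" + op.isPhysical);
        }
    }
}
```

In this model the costing step only considers operators with isPhysical set, which is the crux of Dmitriy's argument: a plan's cost depends on the physical implementations chosen, so an optimizer confined to purely logical operators cannot see it.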