On June 30th Yahoo hosted a Pig contributor workshop. Pig contributors from Yahoo, Twitter, LinkedIn, and Cloudera were present. The slides used for the presentations that day have been uploaded to http://wiki.apache.org/pig/PigTalksPapers. Here's a digest of what was discussed there. For those who were there, if I forgot anything please feel free to add it in.

Thejas Nair discussed his work on performance. In particular, he has been looking into how to de/serialize complex data types more efficiently and when Pig can make use of lazy deserialization. Dmitriy Ryaboy brought up the question of whether Pig would be open to using Avro for de/serialization between Map and Reduce and between MR jobs. We concluded that we are open to using whatever is fast.

Richard Ding discussed the work he has been doing to make Pig run statistics available to users via the logs, to applications running Pig (such as workflow systems) via a new PigRunner API, and to developers via Hadoop job history files. Russell Jurney brought up that it would be nice if this API also included record input and output counts for each individual MR job, so that users diagnosing issues with their Pig Latin scripts would have a better idea of which MR job things went wrong in.

Ashutosh Chauhan gave an overview of the work that has been going on to add UDFs in scripting languages to Pig (PIG-928).
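
To make this concrete for those who weren't there, here is a minimal sketch of what a scripting UDF might look like once PIG-928 lands. The exact syntax is still being worked out, and the file and function names here (my_udfs.py, upper) are only illustrative:

    -- assumes my_udfs.py defines a Python function upper(s) that
    -- returns its argument upper-cased
    register 'my_udfs.py' using jython as myfuncs;
    A = load 'students' as (name:chararray, gpa:double);
    B = foreach A generate myfuncs.upper(name), gpa;
    dump B;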

Daniel Dai talked about the rewrite of the logical optimizer that he has been doing, including an overview of the major rules being implemented in the new optimizer framework. Dmitriy indicated that he would really like to see pushing limits down into the RecordReader (so that we can terminate reading early) added to the list of rules. This would involve making use of the new optimizer framework in the MR optimizer. Alan Gates indicated that while he does not believe we should translate the entire set of MR optimizer visitors into the new framework until we've tested the framework further, this might be a good first use of it in the MR optimizer.
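
For reference, the kind of script Dmitriy had in mind is the common "peek at the data" pattern below (names are mine); the limit is currently applied in the pipeline after the input has been read, whereas the proposed rule would push it into the RecordReader so reading can terminate early:

    A = load 'access_logs' as (ip:chararray, url:chararray);
    B = limit A 100;
    dump B;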

Aniket Mokashi showed the work he's been doing to add support for custom partitioners to Pig. He also covered his work to add the ability to use a relation that contains a single record with a single field as a scalar. Dmitriy pointed out that we need to make sure this uses the distributed cache to minimize strain on the namenode.
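
A rough sketch of how these two features might look from Pig Latin, assuming the syntax that was being discussed at the time (org.example.MyPartitioner and the field names are made up, and the details may change before the work ships):

    A = load 'scores' as (user:chararray, score:double);
    -- supplying a custom partitioner for a grouping operation
    B = group A by user PARTITION BY org.example.MyPartitioner parallel 10;
    -- using a one-record, one-field relation as a scalar
    grp = group A all;
    C = foreach grp generate COUNT(A) as total;
    D = foreach A generate user, score / (double)C.total;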

Pradeep Kamath gave a short presentation on Howl, the work he is leading to create a shared metadata system between Pig, Hive, and MapReduce. Dmitriy noted that we need to get this work more in the open so others can participate and contribute.

Russell Jurney talked about his work on adding datetime types to Pig. He indicated he was interested in using Joda-Time as the basis for this. There were some questions about how these types would be serialized in text files, where the type information might be lost.

Olga Natkovich talked about areas the Yahoo Pig team would like to work on in the future, mostly focused on usability. These included changing our parser to one that will allow us to give better error messages. Dmitriy indicated he strongly preferred ANTLR. They also included resurrecting support for the illustrate command, which we have let lapse. Richard and Ashutosh noted that how illustrate works internally needs some redesign, because currently it requires special code inside each physical operator. This makes it hard to maintain illustrate in the face of new operators and pollutes the main code path during execution. Instead it should be done via callbacks or some other solution.
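
For anyone who hasn't used it, illustrate takes an alias and runs a small sample of the input (fabricating records where needed) through each operator in the script, showing what the intermediate data looks like at every step, e.g.:

    A = load 'students' as (name:chararray, gpa:double);
    B = filter A by gpa > 3.5;
    illustrate B;  -- prints example records at each stage of the plan for B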

After these presentations the group took on a couple of topics for discussion. The first was how Pig should grow to become Turing complete. For this, Dmitriy and Ning Liang presented Piglet, a Ruby library they use at Twitter to wrap Pig and provide branching, looping, functions, and modules. Several people in the group expressed concern that growing Pig Latin itself into a Turing-complete language would result in a poorly thought out language with insufficient tools and too much maintenance in the future. One suggestion was to create a Java interface that would allow users to construct Pig data flows directly in Java; roughly, this interface would have a method for each Pig operator. Users who wished to use scripting languages could still access this with no additional work via Jython, JRuby, Groovy, etc.

The second discussion centered on Pig's support for workflow systems such as Oozie and Azkaban. There have been proposals in the past that Pig switch to generating Oozie workflows instead of MR jobs. Alan indicated that he does not see the value of this. There have also been proposals that Pig Latin be extended to include workflow controls. Dmitriy and Russell both indicated they thought extending Pig Latin in this way was a bad idea and seemed like a layer violation. Alejandro Abdelnur (architect for Oozie at Yahoo) indicated he was happy with the interface changes being made by Richard as part of 0.8. Alan indicated we need to talk with the Azkaban guys to see what would make integration better for them.

We ended with a few last discussion points. Dmitriy suggested that Piggybank should move out of contrib into a more CPAN-like environment that is version independent. This would free Pig contributors from needing to keep older UDFs up to date, allow users to download the versions of UDFs appropriate to the version of Pig they are using, and allow UDF contributors to contribute their code more easily without going through the whole patch acceptance process. The group indicated they were open to this approach, though no one volunteered to undertake setting it up.

Ashutosh asked whether there would be a 0.7.1 release since several important issues had been found and resolved since 0.7.0. The Yahoo team (which has driven all previous releases) indicated it had no immediate plans to do so, but it was open to helping anyone who wanted to drive it. No one volunteered.

At the end we agreed that this had been useful and we should do it on a more regular basis. We also agreed that we need to find a way to open this up to others who do not live in the Bay Area. Alan agreed to work on facilitating this.

Alan.

