On June 30th Yahoo hosted a Pig contributor workshop. Pig contributors from Yahoo, Twitter, LinkedIn, and Cloudera were present. The slides used for the presentations that day have been uploaded to http://wiki.apache.org/pig/PigTalksPapers. Here's a digest of what was discussed there. For those who were there, if I forgot anything please feel free to add it in.

Thejas Nair discussed his work on performance. In particular, he has been looking into how to de/serialize complex data types more efficiently and when Pig can make use of lazy deserialization. Dmitriy Ryaboy brought up the question of whether Pig would be open to using Avro for de/serialization between Map and Reduce and between MR jobs. We concluded that we are open to using whatever is fast.

Richard Ding discussed the work he has been doing to make Pig run statistics available to users via the logs, to applications running Pig (such as workflow systems) via a new PigRunner API, and to developers via Hadoop job history files. Russell Jurney brought up that it would be nice if this API also included record input and output counts for each individual MR job, so that users diagnosing issues with their Pig Latin scripts would have a better idea of which MR job things went wrong in.

Ashutosh Chauhan gave an overview of the work that has been going on to add UDFs in scripting languages to Pig (PIG-928).
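
To make this concrete for those who weren't there, here is a minimal sketch of what a scripting UDF might look like once PIG-928 lands. The exact syntax is still being worked out, and the file and function names here (my_udfs.py, upper) are only illustrative:

    -- assumes my_udfs.py defines a Python function upper(s) that
    -- returns its argument upper-cased
    register 'my_udfs.py' using jython as myfuncs;
    A = load 'students' as (name:chararray, gpa:double);
    B = foreach A generate myfuncs.upper(name), gpa;
    dump B;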

Daniel Dai talked about the rewrite of the logical optimizer that he has been doing, including an overview of the major rules being implemented in the new optimizer framework. Dmitriy indicated that he would really like to see pushing limits down into the RecordReader (so that we can terminate reading early) added to the list of rules. This would involve making use of the new optimizer framework in the MR optimizer. Alan Gates indicated that while he does not believe we should translate the entire set of MR optimizer visitors into the new framework until we've tested the framework further, this might be a good first use of it in the MR optimizer.
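
For reference, the kind of script Dmitriy had in mind is the common "peek at the data" pattern below (names are mine); the limit is currently applied in the pipeline after the input has been read, whereas the proposed rule would push it into the RecordReader so reading can terminate early:

    A = load 'access_logs' as (ip:chararray, url:chararray);
    B = limit A 100;
    dump B;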

Aniket Mokashi showed the work he's been doing to add support for custom partitioners to Pig. He also covered his work to add the ability to use a relation that contains a single record with a single field as a scalar. Dmitriy pointed out that we need to make sure this uses the distributed cache to minimize strain on the namenode.
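
A rough sketch of how these two features might look from Pig Latin, assuming the syntax that was being discussed at the time (org.example.MyPartitioner and the field names are made up, and the details may change before the work ships):

    A = load 'scores' as (user:chararray, score:double);
    -- supplying a custom partitioner for a grouping operation
    B = group A by user PARTITION BY org.example.MyPartitioner parallel 10;
    -- using a one-record, one-field relation as a scalar
    grp = group A all;
    C = foreach grp generate COUNT(A) as total;
    D = foreach A generate user, score / (double)C.total;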

Pradeep Kamath gave a short presentation on Howl, the work he is leading to create a shared metadata system between Pig, Hive, and MapReduce. Dmitriy noted that we need to get this work more in the open so others can participate and contribute.

Russell Jurney talked about his work on adding datetime types to Pig. He indicated he was interested in using Joda-Time as the basis for this. There were some questions about how these types would be serialized in text files, where the type information might be lost.

Olga Natkovich talked about areas the Yahoo Pig team would like to work on in the future, mostly focused on usability. These included changing our parser to one that will allow us to give better error messages. Dmitriy indicated he strongly preferred ANTLR. They also included resurrecting support for the illustrate command, which we have let lapse. Richard and Ashutosh noted that how illustrate works internally needs some redesign, because currently it requires special code inside each physical operator. This makes it hard to maintain illustrate in the face of new operators and pollutes the main code path during execution. Instead it should be done via callbacks or some other solution.
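
For anyone who hasn't used it, illustrate takes an alias and runs a small sample of the input (fabricating records where needed) through each operator in the script, showing what the intermediate data looks like at every step, e.g.:

    A = load 'students' as (name:chararray, gpa:double);
    B = filter A by gpa > 3.5;
    illustrate B;  -- prints example records at each stage of the plan for B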

After these presentations the group took on a couple of topics for discussion. The first was how Pig should grow to become Turing complete. For this, Dmitriy and Ning Liang presented Piglet, a Ruby library they use at Twitter to wrap Pig and provide branching, looping, functions, and modules. Several people in the group expressed concern that growing Pig Latin itself into a Turing-complete language would result in a poorly thought out language with insufficient tools and too much maintenance in the future. One suggestion was to create a Java interface that would allow users to construct Pig data flows directly in Java; roughly, this interface would have a method for each Pig operator. Users who wished to use scripting languages could still access this with no additional work via Jython, JRuby, Groovy, etc.

The second discussion centered on Pig's support for workflow systems such as Oozie and Azkaban. There have been proposals in the past that Pig switch to generating Oozie workflows instead of MR jobs. Alan indicated that he does not see the value of this. There have also been proposals that Pig Latin be extended to include workflow controls. Dmitriy and Russell both indicated they thought extending Pig Latin in this way was a bad idea and seemed like a layer violation. Alejandro Abdelnur (architect for Oozie at Yahoo) indicated he was happy with the interface changes being made by Richard as part of 0.8. Alan indicated we need to talk with the Azkaban guys to see what would make integration better for them.

We ended with a few last discussion points. Dmitriy suggested that Piggybank should move out of contrib into a more CPAN-like environment that is version independent. This would free Pig contributors from needing to keep older UDFs up to date, allow users to download the versions of UDFs appropriate to the version of Pig they are using, and allow UDF contributors to contribute their code more easily without going through the whole patch acceptance process. The group indicated they were open to this approach, though no one volunteered to undertake setting it up.

Ashutosh asked whether there would be a 0.7.1 release since several important issues had been found and resolved since 0.7.0. The Yahoo team (which has driven all previous releases) indicated it had no immediate plans to do so, but it was open to helping anyone who wanted to drive it. No one volunteered.

At the end we agreed that this had been useful and we should do it on a more regular basis. We also agreed that we need to find a way to open this up to others who do not live in the Bay Area. Alan agreed to work on facilitating this.

Alan.

