Thanks Alan - we use Azkaban (http://sna-projects.com/azkaban/) at LinkedIn
to do the same thing, but the Pig code itself is what becomes problematic.

To give an example - on my primary project, I have about 20 Pig scripts, a
couple of Java UDFs, and a dozen or so Python streaming UDFs.  There are
several thousand lines of Pig.  Without a good way to write external
functions (anyone got one?) that are parameterizable enough to be reused in
multiple places, much of that is duplicate code with slight differences.
There is cutting and pasting.  When data formats change, making a change in
one place often requires a find/replace across multiple files.
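
A minimal sketch of one possible workaround for the duplication described
above, assuming the Pig text is generated from a shared template in Python
before it is submitted.  The fragment, the aliases, the paths and the
count_by_key helper are all hypothetical, made up for illustration; this is
not an existing Pig feature.

```python
from string import Template

# Hypothetical shared fragment: load an input and count records per key.
# Field references are by name, so no literal "$" appears outside placeholders.
COUNT_BY_KEY = Template("""\
${alias}_raw = LOAD '${path}' AS (${schema});
${alias}_grp = GROUP ${alias}_raw BY ${key};
${alias}_cnt = FOREACH ${alias}_grp GENERATE group AS ${key}, COUNT(${alias}_raw) AS n;
""")


def count_by_key(alias, path, schema, key):
    """Render the shared fragment for one input (all arguments are illustrative)."""
    return COUNT_BY_KEY.substitute(alias=alias, path=path, schema=schema, key=key)


# Two near-identical steps rendered from a single definition.
script = (
    count_by_key("views", "/data/views", "user:chararray, url:chararray", "user")
    + count_by_key("clicks", "/data/clicks", "user:chararray, ad:chararray", "user")
)
print(script)
```

A change to the load or grouping logic is then made once in the template
rather than by find/replace across files, at the cost of an extra
generation step in front of every run.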

Given Pig's limitations, and that dataflow programming is still relatively
new to me - and that I've not read books on cleanly building big dataflow
pipelines (are there any?) - I regularly do things in my Pig that would be
completely unacceptable in a procedural, functional or object-oriented
language.  Things seem to get spindly no matter what I try.  Refactoring to
remove common code from a big pipeline can be scary, and it requires
frequent full runs.

I'll check out http://wiki.apache.org/pig/TuringCompletePig - thanks!

Russ

On Tue, Jun 22, 2010 at 2:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Here at Yahoo we use Oozie for managing large workflows (latest open source
> edition at http://github.com/tucu00/oozie1 though they expect to make
> another drop before the Hadoop summit).  There are plans to make Oozie a
> full open source project (instead of just making drops to github).
>
> We've started thinking a lot about how to extend Pig Latin itself to
> provide functions, modules, loops, and branches.  The recorded thoughts so
> far are at http://wiki.apache.org/pig/TuringCompletePig  Your feedback on
> this would be helpful.
>
> Alan.
>
>
> On Jun 22, 2010, at 10:40 AM, Russell Jurney wrote:
>
>> I'm curious to hear how other people are scaling the code on big Pig
>> projects.
>>
>> Thousands of lines of dataflow code can get pretty hairy for a team of
>> developers - and practices to ensure code sanity don't seem as well
>> developed (or at least I don't know them) for dataflow programming as for
>> other forms?  How do you efficiently avoid pasted code?  Anyone got tips
>> for refactoring your Pig as a project progresses to reduce complexity?
>>
>> Russ
>>
>
>
