A lot of our Pig scripts are generated dynamically, on the fly, by Ruby
code.

So the specific Pig commands (the LOAD statements, the input and output
columns, and so on) are handled by Ruby and a back-end database, which build
the concatenated strings that become Pig code, letting us reuse specific
logic across the different aggregations and queries we need.

We also do this to automate the processing of many of our jobs and to keep
the Pig code reusable as part of an event-driven process that is similar
across data sets and business logic.

We create the Pig script file and call Pig from Ruby, handling all of the
processing with an object-oriented, duck-typed approach.
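Roughly, that step could look like the following sketch (again, the class
and method names are made up for illustration): any job object that
responds to #to_pig gets its script written to a temp file and handed to
the pig executable.

```ruby
require "tempfile"

# Hypothetical sketch: write generated Pig source to disk and shell out
# to the pig launcher. Duck typing: run_pig accepts any object that
# responds to #to_pig, regardless of its class.

def write_pig_script(source)
  file = Tempfile.new(["job", ".pig"])
  file.write(source)
  file.close
  file.path
end

def run_pig(job, pig_bin: "pig")
  path = write_pig_script(job.to_pig)
  system(pig_bin, "-f", path)  # invoke the Pig CLI on the generated file
end

# An example job object; anything with a #to_pig method would work.
class UserCountJob
  def to_pig
    "events = LOAD '/data/events' AS (user_id:chararray);\n" \
    "grp = GROUP events BY user_id;\n" \
    "counts = FOREACH grp GENERATE group, COUNT(events);"
  end
end
```

Calling run_pig(UserCountJob.new) would then render and launch the job,
assuming a pig binary is on the PATH.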

I have recently been toying with moving this to Scala, but that is another
story entirely (I like Lift more than Rails), as we use Ruby for our
Hadoop M/R jobs too.

On Tue, Jun 22, 2010 at 1:40 PM, Russell Jurney <russell.jur...@gmail.com>wrote:

> I'm curious to hear how other people are scaling the code on big Pig
> projects.
>
> Thousands of lines of dataflow code can get pretty hairy for a team of
> developers - and practices to ensure code sanity don't seem as well
> developed (or at least I don't know them) for dataflow programming as for
> other forms?  How do you efficiently avoid pasted code?  Anyone got tips
> for
> refactoring your Pig as a project progresses to reduce complexity?
>
> Russ
>



-- 
/*
Joe Stein
http://allthingshadoop.com
*/
