A lot of our Pig scripts are generated dynamically, on the fly, by Ruby code.
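A minimal sketch of that generation approach: a Ruby class builds a Pig script by string concatenation and then shells out to `pig`. The names here (`PigScriptBuilder`, the sample relation names, the `COUNT` aggregation) are illustrative assumptions, not our actual internal API.

```ruby
# Hypothetical sketch of Pig-script generation from Ruby.
# In practice the columns and logic would come from a back-end database.
class PigScriptBuilder
  def initialize(input_path, columns)
    @input_path = input_path
    @columns    = columns
  end

  # Concatenate strings into a Pig script for a simple group-and-count job.
  def to_pig
    schema = @columns.map { |c| "#{c}:chararray" }.join(", ")
    <<~PIG
      events  = LOAD '#{@input_path}' USING PigStorage('\\t') AS (#{schema});
      grouped = GROUP events BY #{@columns.first};
      counts  = FOREACH grouped GENERATE group, COUNT(events);
      STORE counts INTO '#{@input_path}_counts';
    PIG
  end

  # Write the script to disk and invoke Pig on it.
  def run!(script_path = "generated.pig")
    File.write(script_path, to_pig)
    # system("pig", "-f", script_path)  # uncomment where Pig is installed
  end
end
```

The reuse comes from swapping in different input paths, column lists, and aggregation templates while keeping the surrounding event-driven job handling identical.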
The specific Pig commands (the LOAD statements, the input and output columns, and so on) are handled by Ruby and a back-end database, which build the concatenated strings that become Pig code, so we can reuse specific logic across the different aggregations and queries we need. We also do this to automate the processing for a lot of our jobs and to keep Pig code reuse part of an event-driven process that is similar across data sets and business logic. We create the Pig script file and call Pig from Ruby, handling all of the processing with an object-oriented, duck-typed approach. I have recently been toying with moving this to Scala, but that is another story entirely (I like Lift more than Rails), as we use Ruby for our Hadoop M/R jobs too.

On Tue, Jun 22, 2010 at 1:40 PM, Russell Jurney <russell.jur...@gmail.com> wrote:

> I'm curious to hear how other people are scaling the code on big Pig
> projects.
>
> Thousands of lines of dataflow code can get pretty hairy for a team of
> developers - and practices to ensure code sanity don't seem as well
> developed (or at least I don't know them) for dataflow programming as for
> other forms. How do you efficiently avoid pasted code? Anyone got tips for
> refactoring your Pig as a project progresses to reduce complexity?
>
> Russ

--
/*
Joe Stein
http://allthingshadoop.com
*/