RE: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Katukuri, Jay
Hi, The issues raised by Russ are really important. I recently worked on a project using Pig at eBay Search, and I could not avoid some copy-pasted code. It would be useful to learn good-practice tips from experienced folks for scaling to big projects. Jay -Original Message- From: Ru

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Joe Stein
A lot of our Pig scripts are generated dynamically, on the fly, by Ruby code/scripts. So the specific Pig commands, the LOAD data, the columns that are input and output, etc. are handled by Ruby and a back-end database to create the concatenated strings that turn into Pig code so that we can reuse spe
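
A minimal sketch of the kind of generation described above, written in Python rather than Ruby, with the back-end database replaced by a hard-coded column list. The function names, path, aliases, and columns are all hypothetical and are not taken from the original message.

```python
def build_load(alias, path, columns, delimiter="\\t"):
    """Return a Pig Latin LOAD statement for (name, type) column pairs."""
    schema = ", ".join(f"{name}:{ptype}" for name, ptype in columns)
    return f"{alias} = LOAD '{path}' USING PigStorage('{delimiter}') AS ({schema});"


def build_projection(out_alias, in_alias, fields):
    """Return a FOREACH ... GENERATE statement projecting the given fields."""
    return f"{out_alias} = FOREACH {in_alias} GENERATE {', '.join(fields)};"


# Hypothetical metadata; in the setup described above it would come from a database.
columns = [("user_id", "long"), ("query", "chararray")]

script = "\n".join([
    build_load("searches", "/data/searches", columns),
    build_projection("queries", "searches", ["query"]),
])
print(script)
```

Keeping the column list in one place means every generated script that touches the same data agrees on the schema, which is presumably the reuse the message is after.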

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Alan Gates
Here at Yahoo we use Oozie for managing large workflows (latest open source edition at http://github.com/tucu00/oozie1 though they expect to make another drop before the Hadoop summit). There are plans to make Oozie a full open source project (instead of just making drops to github). We'

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Russell Jurney
Thanks Alan - we use Azkaban http://sna-projects.com/azkaban/ at LinkedIn to do the same thing, but the code itself gets to be problematic. To give an example - on my primary project, I have about 20 Pig scripts, a couple of Java UDFs, and a dozen or so Python streaming UDFs. There are several thousa

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Scott Carey
Even without loops and functions, templating would be very useful. Often, the exact same sort of join happens repeatedly with slightly different aliases or columns, which is basically copy-paste with substitution. I have seen several subtle bugs in Pig scripts because the find/replace was done
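
A minimal sketch of the templating idea, assuming Python's standard `string.Template` as the substitution mechanism. The template text, placeholder names, and aliases are hypothetical. The point of interest is that `substitute()` raises `KeyError` when a placeholder is left unfilled, so a half-finished substitution fails loudly instead of producing a subtly wrong script.

```python
from string import Template

# One join written once; $out, $left, $right and $key are hypothetical placeholders.
JOIN_TEMPLATE = Template("$out = JOIN $left BY $key, $right BY $key;")


def render_join(**params):
    """Fill in the join template; a missing placeholder raises KeyError."""
    return JOIN_TEMPLATE.substitute(**params)


print(render_join(out="clicks_by_user", left="clicks", right="users", key="user_id"))
# Leaving out a parameter, e.g. render_join(out="x", left="a", right="b"),
# raises KeyError('key') rather than emitting a broken statement.
```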

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Dmitriy Ryaboy
I think everyone has some sort of an ad-hoc system for building and managing these types of things. Seems like a prime candidate for some community development -- we would all benefit from sharing a framework like that, and it should be possible to generalize. Something to discuss at the contributo

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Alan Gates
On Jun 22, 2010, at 1:06 PM, Dmitriy Ryaboy wrote: I think everyone has some sort of an ad-hoc system for building and managing these types of things. Seems like a prime candidate for some community development -- we would all benefit from sharing a framework like that, and it should be pos

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread hc busy
Russ, That is a great wiki page with a lot of insightful discussions!! As a non-Ph.D. I'd like to say that I feel that the theoretical adherence to Turing machines is rather artificial (I mean who the heck uses a Turing machine (directly) anyways?? What's the point of simulating it? And at what level?

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread hc busy
Hey, Scott, yeah, that's brilliant! Macro expansion means the script that Pig receives is an expanded script with all aliases defined, so that Pig can perform its optimization. And the technology is an old wheel, I'll bet you can take cpp and get it to work on Pig Latin. ;-) On Tue, Jun 22, 2010

Re: Scaling Pig Projects - The Hairy Pig

2010-06-23 Thread Scott Carey
There is one other thing that would be immensely useful, and does not require that much from pig other than the parser: Script inclusion and alias export. Think bash or other shell languages. You want to define a set of aliases for export for other users. This can be stored in a file separat

Re: Scaling Pig Projects - The Hairy Pig

2010-06-24 Thread hc busy
More great ideas, Scott! The one thing about idempotency of IMPORT is that you may not necessarily want it. The scripts that I wrote will indeed take an alias from a previously imported Pig script and overwrite it with an improved version with additional columns. This satisfies the need to be able to
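
A minimal sketch of what idempotent versus non-idempotent import could look like if done as a textual preprocessing step. The `IMPORT 'file';` line syntax, the script names, and the `expand_imports` function are assumptions made for illustration; they are not taken from either message or from Pig itself. Scripts are held in a dictionary so the example runs without touching the file system.

```python
def expand_imports(name, scripts, idempotent=True, _seen=None):
    """Inline hypothetical IMPORT lines; return the expanded script as a list of lines.

    With idempotent=True each script is inlined at most once.
    With idempotent=False every IMPORT is inlined again (cyclic imports would
    recurse forever in that mode; this sketch does not guard against them).
    """
    seen = set() if _seen is None else _seen
    if idempotent and name in seen:
        return []
    seen.add(name)
    lines = []
    for line in scripts[name].splitlines():
        stripped = line.strip()
        if stripped.startswith("IMPORT "):
            target = stripped[len("IMPORT "):].rstrip(";").strip().strip("'")
            lines.extend(expand_imports(target, scripts, idempotent, seen))
        else:
            lines.append(line)
    return lines


# Hypothetical scripts: improved.pig redefines the alias exported by base.pig.
scripts = {
    "base.pig": "users = LOAD 'users' AS (id:long);",
    "improved.pig": "IMPORT 'base.pig';\nusers = FOREACH users GENERATE id, 1 AS active;",
    "main.pig": "IMPORT 'base.pig';\nIMPORT 'improved.pig';\nDUMP users;",
}

print("\n".join(expand_imports("main.pig", scripts)))
```

In the idempotent mode `base.pig` is inlined once and the later redefinition of `users` wins; with `idempotent=False` the base definition is emitted a second time before being overwritten again.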