Hi,
The issues raised by Russ are really important. I have recently worked on a
project using Pig at eBay Search.
I could not avoid some of the pasted code.
It would be useful to learn good-practice tips from experienced folks for
scaling to big projects.
Jay
-Original Message-
From: Ru
A lot of our Pig scripts are generated dynamically, on the fly, by Ruby
code/scripts.
So the specific Pig commands, LOAD data, the columns that are input,
output, etc. are handled by Ruby and a back-end database to create the
concatenated strings that turn into pig code so that we can reuse spe
Here at Yahoo we use Oozie for managing large workflows (latest open
source edition at http://github.com/tucu00/oozie1 though they expect
to make another drop before the Hadoop summit). There are plans to
make Oozie a full open source project (instead of just making drops to
github).
We'
Thanks Alan - we use Azkaban http://sna-projects.com/azkaban/ at LinkedIn to
do the same thing, but the code itself gets to be problematic.
To give an example - on my primary project, I have about 20 pig scripts, a
couple Java UDFs, and a dozen or so Python streaming UDFs. There are several
thousa
Even without loops and functions, templating would be very useful.
Often, the exact same sort of join happens repeatedly with slightly different
aliases or columns --- which is basically copy-paste with substitution. I have
seen several subtle bugs in Pig scripts because the find/replace was done
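
To illustrate the templating idea in this message, here is a minimal sketch in Python, assuming the scripts are generated before being handed to Pig. The names `JOIN_TEMPLATE` and `render_join`, and the aliases and columns in the usage lines, are hypothetical and not from the thread. It only shows how named placeholders can replace manual find/replace.

```python
from string import Template

# Hypothetical template for one Pig Latin JOIN statement.
# Placeholder names are illustrative assumptions, not from the thread.
JOIN_TEMPLATE = Template(
    "$out = JOIN $left BY $left_key, $right BY $right_key;"
)


def render_join(out, left, left_key, right, right_key):
    """Render a single JOIN statement from the template.

    Template.substitute raises KeyError when a placeholder is left
    unfilled, so a missed substitution fails loudly instead of
    silently producing a wrong script.
    """
    return JOIN_TEMPLATE.substitute(
        out=out,
        left=left,
        left_key=left_key,
        right=right,
        right_key=right_key,
    )


if __name__ == "__main__":
    # Same join shape, different aliases and columns.
    print(render_join("clicks_by_user", "clicks", "user_id", "users", "id"))
    print(render_join("views_by_user", "views", "user_id", "users", "id"))
```

Because every alias and column is passed by name, the same join shape can be reused without editing the statement text by hand.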
I think everyone has some sort of an ad-hoc system for building and managing
these types of things. Seems like a prime candidate for some community
development -- we would all benefit from sharing a framework like that, and
it should be possible to generalize. Something to discuss at the contributo
On Jun 22, 2010, at 1:06 PM, Dmitriy Ryaboy wrote:
I think everyone has some sort of an ad-hoc system for building and
managing
these types of things. Seems like a prime candidate for some community
development -- we would all benefit from sharing a framework like
that, and
it should be pos
Russ, That is a great wiki page with a lot of insightful discussions!!
As a non-Ph.D. I'd like to say that I feel that the theoretical adherence to
Turing machines is rather artificial (I mean, who the heck uses a Turing machine
(directly) anyways?? What's the point of simulating it? And at what level?
Hey, Scott, yeah, that's brilliant!
Macro expansion means the script that Pig receives is an expanded script with
all aliases defined, so that Pig can perform its optimizations.
And the technology is an old wheel; I'll bet you can take cpp and get it to
work on Pig Latin.
;-)
On Tue, Jun 22, 2010
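
As a rough sketch of the macro-expansion idea above, the following Python expands macro calls into plain statements before Pig ever sees the script. The macro table, the `TOP_BY` macro, and the one-call-per-line syntax are all assumptions made for illustration; none of this is an existing Pig feature or the approach proposed in the thread.

```python
import re

# Hypothetical macro table: name -> (parameter names, body text).
MACROS = {
    "TOP_BY": (
        ("out", "rel", "col"),
        "{out} = ORDER {rel} BY {col} DESC;",
    ),
}

# A macro call is assumed to sit alone on a line: NAME(arg1, arg2, ...);
CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*;?\s*$")


def expand(script):
    """Replace each known macro call with its body; keep other lines as-is."""
    out_lines = []
    for line in script.splitlines():
        match = CALL.match(line)
        if match and match.group(1) in MACROS:
            params, body = MACROS[match.group(1)]
            args = [arg.strip() for arg in match.group(2).split(",")]
            if len(args) != len(params):
                raise ValueError(
                    f"{match.group(1)} expects {len(params)} arguments"
                )
            out_lines.append(body.format(**dict(zip(params, args))))
        else:
            out_lines.append(line)
    return "\n".join(out_lines)


if __name__ == "__main__":
    print(expand("a = LOAD 'x';\nTOP_BY(b, a, score);"))
```

The expanded text contains only ordinary statements with every alias spelled out, which is the property the message relies on for optimization.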
There is one other thing that would be immensely useful, and does not require
that much from pig other than the parser:
Script inclusion and alias export.
Think bash or other shell languages. You want to define a set of aliases for
export for other users. This can be stored in a file separat
More great ideas, Scott!
The one thing about idempotency of IMPORT is that you may not necessarily
want it. The scripts that I wrote will indeed take an alias from a previously
imported Pig script and overwrite it with an improved version with
additional columns. This satisfies the need to be able to