In your example below, how would the results of these load functions be
accessed in your main script?
I certainly see the value of #include plus functions (or #define if
you prefer). Without functions, though, you'll have namespace clashes
(any relation names used in the imported files will be visible to
other imported files and to the main script), and the user will have to
know the names of the input and output relations of the imported files
so he can use them subsequently in his script. For example, if you had a
Pig script that implemented a certain type of join:
RETURN = join INPUT1 by $0, INPUT2 by $0;
Now the user has to know that INPUT1 and INPUT2 must be the names of
his input relations and that the output relation will be named
RETURN. This is also limited because we can't specify which key(s) to
join on. To make this useful we're going to want a macro or
function facility so we can pass in names of inputs and other
parameters (like which keys to join on), control the names of results,
and have variable scoping.
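As a rough sketch of what such a macro might look like: the DEFINE ...
RETURNS form, the macro name keyed_join, the $-prefixed parameters, and
the relation names tweets and users are all hypothetical, since Pig has
no such construct today.

```
-- Hypothetical macro syntax; Pig does not currently support this.
-- in1 and in2 are the input relations, key is the join key,
-- and joined is the name the caller gives to the result.
DEFINE keyed_join(in1, in2, key) RETURNS joined {
    $joined = join $in1 by $key, $in2 by $key;
};

-- Hypothetical call site: the caller chooses the relation names and the key.
tweets_with_users = keyed_join(tweets, users, user_id);
```

With something along these lines, any relation names used inside the
braces could be scoped to the macro, so they would not clash with names
in other imported files or in the main script.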
That said, I'm all for it. I think it would make Pig much more usable.
Alan.
On Mar 15, 2010, at 2:58 PM, Dmitriy Ryaboy wrote:
Alan, this would be quite useful, as essentially this would allow
developers to create functions by writing them into separate pig scripts
and combining them as necessary.
For example we have code that auto-generates load statements with fairly
complex schemas based on protocol buffers (see
http://www.slideshare.net/hadoopusergroup/twitter-protobufs-and-hadoop-hug-021709).
It would be very handy to be able to say something like
#include common_jars.pig
#include load_tweets.pig
#include load_users.pig
#include filter_nonenglish_tweets.pig
#include geomap_users.pig
.. etc ..
-D
On Mon, Mar 15, 2010 at 2:23 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
On Mar 12, 2010, at 10:36 AM, hc busy wrote:
Is there any work towards something like the C language's '#include' in
Pig? My large pig script is actually developed separately in several
smaller pig files. Individually the pig files do not run because they
depend on previous scripts, but logically they are separate because each
step does something different.
Currently the only thing existing along these lines is the exec command
in grunt. I don't think we're opposed to a #include functionality; we
just haven't done it. However, given that Pig doesn't have function
calls, and presumably each Pig Latin script is self-contained, it isn't
clear to me how useful it will be.
Alan.