[
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872007#action_12872007
]
Arnab Nandi commented on PIG-928:
---------------------------------
Thanks for looking into the patch, Ashutosh! Very good question. Short answer: I
couldn't come up with an elegant solution using {{define}} :)
I spent a bunch of time thinking about the "right thing to do" before going
this way. As Woody mentioned, my initial instinct was to do this in
{{define}}, but I kept hitting roadblocks when working with {{define}}:
# I came up with the analogy that "register" is like "import" in java, and
"define" is like "alias" in bash. In this interpretation, whenever you want to
introduce new code, you {{register}} it with Pig. Whenever you want to alias
anything for convenience or to add meta-information, you {{define}} it.
# Define is not amenable to multiple functions in the same script.
#* For example, to follow the {{stream}} convention:
{quote} {{define X 'x.py' [inputoutputspec][schemaspec];}} {quote}
Which function is the input/output spec for? A solution like
{quote} {{[func1():schemaspec1,func2:schemaspec2]}} {quote}
is... ugly.
#* Further, how do we access these functions? One solution is to have the
namespace as a codeblock, e.g. X.func1(), which is doable by registering
functions as "X.func1", but we're (mis)leading the user to believe there is
some sort of real namespacing going on. I foresee multi-function files as a
very common use case; people could have a "util.py" with their commonly used
suite of functions instead of forcing 1 file per 2-3 line function.
#* Note that Julien's @decorator idea cleanly solves this problem and I think
it'll work for all languages.
# With inline {{define}}: most languages already have the convention of stating
the function name, input references & return schema spec together in the
function definition, so it seems redundant to force the user to break this
convention and write something like
{quote} {{define x as script('def X(a,b): return a + b;');}} {quote}
and then call x.X(). Lambdas solve this problem only halfway: you then still
need to worry about the schema spec, and we're back at a kludgy solution!
# My plan for inline functions is to write all to a temp file (1 per script
engine) and then deal with them as registering a file.
# Jython code runs in its own interpreter because I couldn't figure out how to
load Jython bytecode into Java; afaik this has something to do with the lack of
a jythonc (I may be wrong). There will be one interpreter per non-compilable
scriptengine; for the others (Janino, Groovy), we load the class directly into
the runtime.
# From a code-writing perspective, overloading {{define}} to tack on a third
use-case would involve an overhaul of the POStream physical operator and felt
very inelegant; {{register}}, on the other hand, is well contained to a
single purpose: including files for UDFs.
# Consider the use of Janino as a ScriptEngine. Unlike the Jython scriptengine,
this loads java UDFs into the native runtime and doesn't translate objects; so
we're looking at potentially _zero_ loss of performance for inline UDFs (or
register 'UDF.java'; ). The difference between native and script code gets
blurry here...
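
To make the multi-function case from the list above concrete, here is a minimal sketch of a "util.py" that carries several small UDFs, each tagged with its schema through a decorator in the spirit of Julien's @decorator idea. The names {{output_schema}} and {{collect_udfs}} and the schema strings are hypothetical and exist only for illustration; this is not the API in the attached patch.

```python
# util.py: several small UDFs in one file (illustrative sketch only).
# output_schema and collect_udfs are hypothetical helpers, not part of Pig.

def output_schema(schema):
    """Attach a schema string to the decorated function."""
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap


@output_schema("sum:long")
def add(a, b):
    return a + b


@output_schema("word:chararray")
def upper(s):
    return s.upper()


def collect_udfs(namespace):
    """Map each decorated function name in a namespace to its schema string."""
    return {
        name: obj.output_schema
        for name, obj in namespace.items()
        if callable(obj) and hasattr(obj, "output_schema")
    }
```

With something like this, a single registered file can expose every decorated function along with its schema, so no per-function spec has to be squeezed into one {{define}} statement.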
[tl;dr] ...and then I thought fair enough, let's just go with {{register}}! :D
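
The plan for inline functions mentioned in the list above (write them all to one temp file per script engine, then treat each file like a registered one) could look roughly like the sketch below. The function name {{write_inline_udfs}}, the (engine, source) pair format and the file naming are assumptions made for illustration; the actual patch may do this differently.

```python
import os
import tempfile


def write_inline_udfs(snippets):
    """Group inline snippets by engine and write one temp file per engine.

    snippets is a list of (engine_name, source_text) pairs (a hypothetical
    format). Returns a dict mapping engine_name to the temp file path.
    """
    grouped = {}
    for engine, source in snippets:
        grouped.setdefault(engine, []).append(source)

    paths = {}
    for engine, sources in grouped.items():
        fd, path = tempfile.mkstemp(prefix="pig_inline_" + engine + "_")
        with os.fdopen(fd, "w") as handle:
            handle.write("\n\n".join(sources) + "\n")
        # From here on the file would be handled like any registered file.
        paths[engine] = path
    return paths
```

The caller is responsible for deleting the temp files once the script has finished.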
> UDFs in scripting languages
> ---------------------------
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Fix For: 0.8.0
>
> Attachments: calltrace.png, package.zip, pig-greek.tgz,
> pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip
>
>
> It should be possible to write UDFs in scripting languages such as python,
> ruby, etc. This frees users from needing to compile Java, generate a jar,
> etc. It also opens Pig to programmers who prefer scripting languages over
> Java.