[
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Le Dem updated PIG-928:
------------------------------
Attachment: pyg.tgz
Hi,
I'm attaching something I implemented last year. I cleaned it up and updated
the dependency to Pig 0.6.0 for the occasion.
There's probably some overlap with previous posts, sorry about the late
submission.
Here is my approach.
I wanted to make a few things easier:
- writing programs that require multiple calls to pig
- UDFs
- parameter passing to Pig
So I integrated Pig with Jython so that the whole program (main program, UDFs,
Pig scripts) could be in one python script.
example: python/tc.py in the attachment
The script defines Python functions that are automatically available to Pig
as UDFs. The decorator @outputSchema is an easy way to specify the
output schema of a UDF.
example (see script): @outputSchema("relationships:{t:(target:chararray,
candidate:chararray)}")
Also notice that the UDFs use the standard Python constructs: tuple, dictionary
and list. They are converted to Pig constructs on the fly. This makes defining
UDFs in Python very easy. Notice that the UDF takes a list of
arguments, not a tuple. The input tuple is automatically mapped to the
arguments.
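To make the two points above concrete, here is a minimal sketch in plain Python. The outputSchema decorator below is a stand-in written for this illustration (the one shipped in the attachment may work differently), and pairs and call_udf are hypothetical names. Only the schema string comes from the example above.

```python
def outputSchema(schema):
    """Stand-in decorator (assumption): records the Pig schema string on the function."""
    def decorate(func):
        func.outputSchema = schema
        return func
    return decorate


@outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
def pairs(target, candidates):
    # Hypothetical UDF: returns a plain list of tuples, which would be
    # converted to a Pig bag of tuples by the runtime.
    return [(target, c) for c in candidates]


def call_udf(func, input_tuple):
    # Hypothetical dispatcher: the input tuple is unpacked into
    # positional arguments, so the UDF never sees the tuple itself.
    return func(*input_tuple)


result = call_udf(pairs, ("a", ["b", "c"]))
```

The UDF body only touches ordinary Python values, which is the property that keeps such functions short.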
Then the script defines a main() function that will be the main program
executed on the client.
In main(), the Python program has access to a global pig variable that
provides two methods (for now) and is designed to be an equivalent of PigServer:
  List<ExecJob> executeScript(String script)
    executes a Pig script inlined in Python
  deleteFile(String filename)
    deletes a file
This looks a little bit like the JDBC approach where you "query" Pig and then
can process the data.
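A main() using these two methods might look roughly like the sketch below. FakePig is a hypothetical stand-in that only records calls so the snippet runs on its own; in the real setup the pig variable is supplied by the runtime. The file names 'in' and 'out' and the Pig statements are also assumptions.

```python
class FakePig(object):
    """Hypothetical stand-in for the global pig object; it only records calls."""

    def __init__(self):
        self.calls = []

    def executeScript(self, script):
        self.calls.append(("executeScript", script))
        return []  # the real method is described as returning List<ExecJob>

    def deleteFile(self, filename):
        self.calls.append(("deleteFile", filename))


pig = FakePig()


def main():
    # Remove stale output first, then run an inlined Pig script.
    pig.deleteFile("out")
    jobs = pig.executeScript("""
        A = LOAD 'in' AS (a:chararray, b:chararray);
        STORE A INTO 'out';
    """)
    return jobs


jobs = main()
```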
You can also embed Python expressions in the Pig statements using ${ ... }
example: ${n - 1}
They get executed in the current scope and replaced in the script.
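One way such a substitution could work is sketched below. This is not the code from the attachment: expand is a hypothetical helper, the scope is passed in explicitly as a dictionary, and the simple regular expression does not handle a closing brace inside the expression.

```python
import re


def expand(script, scope):
    """Replace each ${expr} with the string value of expr evaluated in scope."""
    def repl(match):
        # Evaluate the embedded Python expression against the given names.
        return str(eval(match.group(1), {}, scope))
    return re.sub(r"\$\{(.*?)\}", repl, script)


n = 3
expanded = expand("A = LOAD 'step_${n - 1}';", {"n": n})
```

With n set to 3, the embedded expression n - 1 is evaluated before the statement reaches Pig, so the script loads 'step_2'.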
To run the example (assuming javac, jar and java are in your PATH):
- tar xzvf pyg.tgz
- add pig-0.6.0-core.jar to the lib folder
- ./makejar.sh
- ./runme.sh
It runs the following:
org.apache.pig.pyg.Pyg local tc.py
tc.py is a Python script that performs a transitive closure on a list of
relations using an iterative algorithm. It defines Python functions.
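For readers unfamiliar with the technique, the iterative idea can be sketched in pure Python as follows. This is not the content of tc.py, which presumably does the joining in Pig; transitive_closure is a hypothetical function that only illustrates the loop-until-nothing-changes structure.

```python
def transitive_closure(relations):
    """Iteratively add (a, d) whenever (a, b) and (b, d) are already present."""
    closure = set(relations)
    while True:
        # Join the current relation with itself on the middle element.
        new_pairs = set((a, d)
                        for (a, b) in closure
                        for (c, d) in closure
                        if b == c)
        # Stop once an iteration produces nothing new.
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


closure = transitive_closure([("a", "b"), ("b", "c"), ("c", "d")])
```

Each pass corresponds to one round of joins, which is why a driver program that can call Pig repeatedly is useful here.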
Limitations:
- You cannot include other Python scripts, but this should be doable.
- I haven't spent much time testing performance. I suspect the Pig<->Python
type conversion is a little slow, as it creates many new objects. It could
possibly be improved by making the Pig objects implement the Python interfaces.
(the attachment contains jython.jar 2.5.0 for simplicity)
Best regards, Julien
> UDFs in scripting languages
> ---------------------------
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Attachments: package.zip, pyg.tgz, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python,
> ruby, etc. This frees users from needing to compile Java, generate a jar,
> etc. It also opens Pig to programmers who prefer scripting languages over
> Java.