[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175415#comment-13175415 ]

Jeremy Karn commented on PIG-2417:
----------------------------------

This latest patch (streaming2.patch) contains all of the functionality 
necessary for writing streaming UDFs.

Registering Python files still works as outlined above.

The output schema of your Python UDF is declared with an outputSchema decorator 
(the same syntax used for Jython UDFs).  When the user registers the file and 
Pig scans it for functions, it also looks for the outputSchema decorator and 
registers only the functions that have it.  The schema string from the decorator 
is passed to the StreamingUDF instance(s) so that each instance knows what output 
schema to expect from the streaming process. 
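
For example, a registered file might look like the following minimal sketch.  
The schema-string syntax follows the existing Jython UDF convention; the 
'pig_util' import path is an assumption for illustration, not necessarily what 
the patch ships:

{code}
# Hypothetical import path for the decorator; the patch may expose it differently.
from pig_util import outputSchema

@outputSchema('word:chararray')
def to_upper(word):
    # The returned value must match the declared schema ('word:chararray').
    return word.upper()

def strip_whitespace(word):
    # No outputSchema decorator, so Pig would skip this function when it
    # scans the file during registration.
    return word.strip()
{code}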

Performance:

I haven't done exhaustive testing, profiling, or tuning, but right now small 
data sets run on standalone Hadoop look to be about 2-3 times slower with Python 
streaming UDFs than with Jython UDFs.

Running similar scripts on a small data set, but on a cluster, improves things a 
bit: there the Python streaming UDFs are about twice as slow.

When you move up to much larger data sets on the cluster, I'm seeing Python 
streaming UDFs around 50% slower than the equivalent Jython UDFs.

The code still has a few bugs and I need to add unit tests for the Pig changes 
I've made, but I'd definitely appreciate any feedback on what's already done.

                
> Streaming UDFs -  allow users to easily write UDFs in scripting languages 
> with no JVM implementation.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2417
>                 URL: https://issues.apache.org/jira/browse/PIG-2417
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.11
>            Reporter: Jeremy Karn
>            Assignee: Jeremy Karn
>         Attachments: streaming.patch, streaming2.patch
>
>
> The goal of Streaming UDFs is to allow users to easily write UDFs in 
> scripting languages with no JVM implementation or a limited JVM 
> implementation.  The initial proposal is outlined here: 
> https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
> In order to implement this we need new syntax to distinguish a streaming UDF 
> from an embedded JVM UDF.  I'd propose something like the following (although 
> I'm not sure 'language' is the best term to be using):
> {code}define my_streaming_udfs language('python') 
> ship('my_streaming_udfs.py'){code}
> We'll also need a language-specific controller script that gets shipped to 
> the cluster which is responsible for reading the input stream, deserializing 
> the input data, passing it to the user written script, serializing that 
> script output, and writing that to the output stream.
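> A rough controller-loop sketch, just to illustrate the idea (the tab delimiter, 
> str() serialization, and argv-based function lookup below are placeholders, not 
> necessarily what the patch implements):
> {code}
> import importlib
> import sys
> 
> def main():
>     # Pig would tell the controller which user module/function to run;
>     # taking them from argv here is an assumption for illustration.
>     module_name, func_name = sys.argv[1], sys.argv[2]
>     udf = getattr(importlib.import_module(module_name), func_name)
>     for line in sys.stdin:
>         # Deserialize the input tuple (tab-delimited placeholder), call the
>         # user-written function, then serialize the result back out.
>         fields = line.rstrip('\n').split('\t')
>         sys.stdout.write(str(udf(*fields)) + '\n')
>         sys.stdout.flush()
> 
> if __name__ == '__main__':
>     main()
> {code}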
> Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This 
> class will likely share some of the existing code in POStream and 
> ExecutableManager (where it makes sense to pull out shared code) to stream 
> data to/from the controller script.
> One alternative approach to creating the StreamingUDF EvalFunc is to use the 
> POStream operator directly.  This would involve inserting the POStream 
> operator instead of the POUserFunc operator whenever we encountered a 
> streaming UDF while building the physical plan.  This approach seemed 
> problematic because there would need to be a lot of changes in order to 
> support POStream in all of the places we want to be able to use UDFs (for 
> example, to operate on a single field inside a FOREACH statement).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
