[ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166889#comment-13166889
 ] 

Alan Gates commented on PIG-2417:
---------------------------------

This looks interesting.  Some thoughts regarding the open questions on your 
wiki.


bq. We'd want to either update DEFINE or add a new command to support streaming 
UDFs.

We should definitely use DEFINE for this as well.  It seems it should stay as 
close to the JVM based UDF defines as possible.  I'm wondering if it could 
follow the same:

DEFINE 'filename' using <language> as <namespace>

that the JVM based UDFs use, and just add new language tokens that indicate the 
streaming nature, such as 'streaming_python', 'streaming_perl', etc.

bq. How can we return the output type information back to pig? Perhaps we could 
support something like the @outputSchema decorator in python at least, and have 
the controller script gather that information and provide it back to pig in a 
separate file?

There are two sides to this: first, how you communicate the information through 
the channel you're creating, and second, how the UDF writer communicates it in 
his UDF.  The design will need to propose a way for implementations of streaming 
UDF for 
various languages to communicate schema information back to Pig.  But how the 
UDF writer communicates it should be language specific.  Wherever possible it 
should mimic the choices made in the JVM based implementations.  So a streaming 
Python implementation should use the same @outputSchema as Jython does.
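
As a rough illustration of the UDF-writer side, a streaming Python implementation could keep the decorator name Jython uses and simply record the schema string on the function, leaving the controller script to gather it. This is only a sketch under assumptions: the `outputSchema` attribute name and the `collect_schemas` helper are hypothetical, not existing Pig code.

```python
def outputSchema(schema):
    """Attach a Pig schema string to a UDF (sketch; reuses the Jython decorator name)."""
    def wrap(func):
        # Assumption: the schema is stored as a plain attribute on the function.
        func.outputSchema = schema
        return func
    return wrap


def collect_schemas(namespace):
    """Hypothetical controller helper: map UDF name -> declared schema string."""
    return {
        name: obj.outputSchema
        for name, obj in namespace.items()
        if callable(obj) and hasattr(obj, "outputSchema")
    }


@outputSchema("word:chararray")
def upper(word):
    return word.upper()


@outputSchema("total:long")
def add(a, b):
    return a + b


# The controller could write this mapping wherever Pig expects to read it.
schemas = collect_schemas({"upper": upper, "add": add})
```

How the gathered mapping travels back to Pig (a separate file, or the stream itself) is the other half of the question and is left open here.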

bq. How can we return the output type information back to pig? Perhaps we could 
support something like the @outputSchema decorator in python at least, and have 
the controller script gather that information and provide it back to pig in a 
separate file?
This should be a Java property for each language implementation.  Something 
like pig.streaming_udf.executable.python.

Have you done any prototyping?  I'm curious how the performance of this will 
compare against the JVM based implementations.  I realize you are doing this to 
extend functionality, not to improve performance.
                
> Streaming UDFs -  allow users to easily write UDFs in scripting languages 
> with no JVM implementation.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2417
>                 URL: https://issues.apache.org/jira/browse/PIG-2417
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.11
>            Reporter: Jeremy Karn
>
> The goal of Streaming UDFs is to allow users to easily write UDFs in 
> scripting languages with no JVM implementation or a limited JVM 
> implementation.  The initial proposal is outlined here: 
> https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
> In order to implement this we need new syntax to distinguish a streaming UDF 
> from an embedded JVM UDF.  I'd propose something like the following (although 
> I'm not sure 'language' is the best term to be using):
> {code}define my_streaming_udfs language('python') 
> ship('my_streaming_udfs.py'){code}
> We'll also need a language-specific controller script that gets shipped to 
> the cluster which is responsible for reading the input stream, deserializing 
> the input data, passing it to the user written script, serializing that 
> script output, and writing that to the output stream.
> Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This 
> class will likely share some of the existing code in POStream and 
> ExecutableManager (where it makes sense to pull out shared code) to stream 
> data to/from the controller script.
> One alternative approach to creating the StreamingUDF EvalFunc is to use the 
> POStream operator directly.  This would involve inserting the POStream 
> operator instead of the POUserFunc operator whenever we encountered a 
> streaming UDF while building the physical plan.  This approach seemed 
> problematic because there would need to be a lot of changes in order to 
> support POStream in all of the places we want to be able to use UDFs (for 
> example, to operate on a single field inside of a FOREACH statement).
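
The controller script described in the issue (read the input stream, deserialize, call the user script, serialize, write the output stream) might look roughly like the following in Python. This is a minimal sketch, not the proposed implementation: it assumes one tab-delimited record per line and string-only fields, and `run_controller` and `concat_upper` are hypothetical names. A real controller would need type-aware serialization.

```python
import io


def run_controller(udf, in_stream, out_stream, delimiter="\t"):
    """Read one record per line, call the UDF, and write one result per line."""
    for line in in_stream:
        # Deserialize: assumed format is delimiter-separated string fields.
        fields = line.rstrip("\n").split(delimiter)
        # Pass the fields to the user-written function.
        result = udf(*fields)
        if not isinstance(result, (tuple, list)):
            result = (result,)
        # Serialize the result and write it to the output stream.
        out_stream.write(delimiter.join(str(v) for v in result) + "\n")
        # Flush per record so the parent process can read each result promptly.
        out_stream.flush()


def concat_upper(first, second):
    """Stand-in for a user-written UDF."""
    return (first + second).upper()


# In-memory streams stand in for the pipes Pig would provide.
source = io.StringIO("ab\tcd\nx\ty\n")
sink = io.StringIO()
run_controller(concat_upper, source, sink)
output = sink.getvalue()
```

When launched as a real subprocess, the same loop would be called with `sys.stdin` and `sys.stdout` in place of the in-memory streams.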

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
