[ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588969#comment-13588969
 ] 

Jeremy Karn commented on PIG-2417:
----------------------------------

Jonathan Coveney had expressed interest in looking at the streaming Python work 
we've done, so this patch is just to make our code available for people to look 
at.  I updated the code to work on trunk, but I had to do it in a quick and 
dirty way.  Here's a list of specific things I know need work (there are 
probably a few more I haven't thought of that would come out in review):

1. The serialization/deserialization code doesn't support any of the new 
data types added to Pig since 0.9.

2. When I first wrote this code, I pulled some common logic out of 
ExecutableManager into a class called StreamingUtil.  ExecutableManager has 
changed enough since 0.9 that it wasn't straightforward to figure out how it 
should work now, so there's some duplicated logic in StreamingUtil and 
ExecutableManager.

3. There's some Mortar-specific wording in a couple of places, and a couple of 
places in StreamingUDF handle cases that come up with how we run Pig/Hadoop.  
Those might need to be more generic/robust to work for everyone out of the box.

4. There are some exception-handling decisions, and some code for capturing 
standard output from the UDF for illustrate, that might not make much sense 
without the rest of our illustrate changes.

5. It might make sense to use a more efficient serialization/deserialization 
method.  I tried to use the existing code (just adding code to handle cases 
that didn't work before), but it's probably not the most efficient approach.  
I'm not sure if this is something that would need to be tackled now or if it 
could be a future enhancement.
                
> Streaming UDFs -  allow users to easily write UDFs in scripting languages 
> with no JVM implementation.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2417
>                 URL: https://issues.apache.org/jira/browse/PIG-2417
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.11
>            Reporter: Jeremy Karn
>            Assignee: Jeremy Karn
>         Attachments: PIG-2417-4.patch, streaming2.patch, streaming3.patch, 
> streaming.patch
>
>
> The goal of Streaming UDFs is to allow users to easily write UDFs in 
> scripting languages with no JVM implementation or a limited JVM 
> implementation.  The initial proposal is outlined here: 
> https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
> In order to implement this we need new syntax to distinguish a streaming UDF 
> from an embedded JVM UDF.  I'd propose something like the following (although 
> I'm not sure 'language' is the best term to be using):
> {code}define my_streaming_udfs language('python') 
> ship('my_streaming_udfs.py'){code}
> We'll also need a language-specific controller script that gets shipped to 
> the cluster and is responsible for reading the input stream, deserializing 
> the input data, passing it to the user-written script, serializing that 
> script's output, and writing it to the output stream.
> Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This 
> class will likely share some of the existing code in POStream and 
> ExecutableManager (where it makes sense to pull out shared code) to stream 
> data to/from the controller script.
> One alternative approach to creating the StreamingUDF EvalFunc is to use the 
> POStream operator directly.  This would involve inserting the POStream 
> operator instead of the POUserFunc operator whenever we encountered a 
> streaming UDF while building the physical plan.  This approach seemed 
> problematic because there would need to be a lot of changes in order to 
> support POStream in all of the places we want to be able to use UDFs (for 
> example, to operate on a single field inside of a foreach statement).
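
For readers skimming the description above, here is a rough sketch of the 
controller loop it describes: read the input stream, deserialize each record, 
call the user's function, serialize the result, and write it to the output 
stream.  This is not the code in the patch.  The tab-delimited wire format and 
the names run_controller, deserialize, serialize and concat are all assumptions 
made up for illustration.

```python
import io
import sys


def deserialize(line):
    """Split one input line into a list of string fields.

    Assumption: fields are tab-delimited, one record per line.
    """
    return line.rstrip("\n").split("\t")


def serialize(value):
    """Render a UDF result as one output line (assumed format)."""
    return str(value) + "\n"


def run_controller(udf, in_stream=sys.stdin, out_stream=sys.stdout):
    """Apply udf to every record on in_stream and write results to out_stream."""
    for line in in_stream:
        fields = deserialize(line)
        result = udf(*fields)
        out_stream.write(serialize(result))
    out_stream.flush()


def concat(a, b):
    # Hypothetical stand-in for a function from the user's shipped script.
    return a + b


# Demo with in-memory streams standing in for the pipes to and from Pig.
demo_out = io.StringIO()
run_controller(concat, io.StringIO("a\tb\nfoo\tbar\n"), demo_out)
print(demo_out.getvalue(), end="")
```

On the cluster the same loop would run against sys.stdin and sys.stdout (the 
defaults here), and a real controller would also need typed fields, nested 
data and error reporting, none of which this sketch attempts.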

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
