[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

Daniel Dai (JIRA) Wed, 18 Sep 2013 16:41:32 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771397#comment-13771397
 ]


Daniel Dai commented on PIG-2417:
---------------------------------

Running e2e tests on Hadoop 2, hit the following stack on map side:
{code}
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:338)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:282)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.NullPointerException
        at 
org.apache.pig.impl.builtin.StreamingUDF.getControllerPath(StreamingUDF.java:252)
        at 
org.apache.pig.impl.builtin.StreamingUDF.constructCommand(StreamingUDF.java:183)
        at 
org.apache.pig.impl.builtin.StreamingUDF.startUdfController(StreamingUDF.java:147)
        at 
org.apache.pig.impl.builtin.StreamingUDF.initialize(StreamingUDF.java:140)
        at org.apache.pig.impl.builtin.StreamingUDF.exec(StreamingUDF.java:130)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextInteger(POUserFunc.java:379)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:321)
{code}
I will try to take a look tonight.
                
> Streaming UDFs -  allow users to easily write UDFs in scripting languages 
> with no JVM implementation.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2417
>                 URL: https://issues.apache.org/jira/browse/PIG-2417
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.12
>            Reporter: Jeremy Karn
>             Fix For: 0.12
>
>         Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, 
> PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9.patch, PIG-2417-e2e.patch, 
> streaming2.patch, streaming3.patch, streaming.patch
>
>
> The goal of Streaming UDFs is to allow users to easily write UDFs in 
> scripting languages with no JVM implementation or a limited JVM 
> implementation.  The initial proposal is outlined here: 
> https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
> In order to implement this we need new syntax to distinguish a streaming UDF 
> from an embedded JVM UDF.  I'd propose something like the following (although 
> I'm not sure 'language' is the best term to be using):
> {code}define my_streaming_udfs language('python') 
> ship('my_streaming_udfs.py'){code}
> We'll also need a language-specific controller script that gets shipped to 
> the cluster which is responsible for reading the input stream, deserializing 
> the input data, passing it to the user written script, serializing that 
> script output, and writing that to the output stream.
> Finally, we'll need to add a StreamingUDF class that extends evalFunc.  This 
> class will likely share some of the existing code in POStream and 
> ExecutableManager (where it make sense to pull out shared code) to stream 
> data to/from the controller script.
> One alternative approach to creating the StreamingUDF EvalFunc is to use the 
> POStream operator directly.  This would involve inserting the POStream 
> operator instead of the POUserFunc operator whenever we encountered a 
> streaming UDF while building the physical plan.  This approach seemed 
> problematic because there would need to be a lot of changes in order to 
> support POStream in all of the places we want to be able use UDFs (For 
> example - to operate on a single field inside of a for each statement).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

Reply via email to