[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771397#comment-13771397 ]
Daniel Dai commented on PIG-2417:
---------------------------------

Running the e2e tests on Hadoop 2, I hit the following stack trace on the map side:
{code}
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:338)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:282)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.NullPointerException
	at org.apache.pig.impl.builtin.StreamingUDF.getControllerPath(StreamingUDF.java:252)
	at org.apache.pig.impl.builtin.StreamingUDF.constructCommand(StreamingUDF.java:183)
	at org.apache.pig.impl.builtin.StreamingUDF.startUdfController(StreamingUDF.java:147)
	at org.apache.pig.impl.builtin.StreamingUDF.initialize(StreamingUDF.java:140)
	at org.apache.pig.impl.builtin.StreamingUDF.exec(StreamingUDF.java:130)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextInteger(POUserFunc.java:379)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:321)
{code}
I will try to take a look tonight.

> Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2417
>                 URL: https://issues.apache.org/jira/browse/PIG-2417
>             Project: Pig
>          Issue Type: Improvement
>   Affects Versions: 0.12
>            Reporter: Jeremy Karn
>             Fix For: 0.12
>
>         Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9.patch, PIG-2417-e2e.patch, streaming2.patch, streaming3.patch, streaming.patch
>
>
> The goal of Streaming UDFs is to allow users to easily write UDFs in
> scripting languages with no JVM implementation or a limited JVM
> implementation. The initial proposal is outlined here:
> https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
>
> In order to implement this we need new syntax to distinguish a streaming UDF
> from an embedded JVM UDF. I'd propose something like the following (although
> I'm not sure 'language' is the best term to be using):
> {code}define my_streaming_udfs language('python')
> ship('my_streaming_udfs.py'){code}
>
> We'll also need a language-specific controller script that gets shipped to
> the cluster. It is responsible for reading the input stream, deserializing
> the input data, passing it to the user-written script, serializing that
> script's output, and writing that to the output stream.
>
> Finally, we'll need to add a StreamingUDF class that extends EvalFunc. This
> class will likely share some of the existing code in POStream and
> ExecutableManager (where it makes sense to pull out shared code) to stream
> data to/from the controller script.
>
> One alternative approach to creating the StreamingUDF EvalFunc is to use the
> POStream operator directly. This would involve inserting the POStream
> operator instead of the POUserFunc operator whenever we encountered a
> streaming UDF while building the physical plan. This approach seemed
> problematic because there would need to be a lot of changes in order to
> support POStream in all of the places we want to be able to use UDFs (for
> example, to operate on a single field inside of a foreach statement).
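
To make the controller script described in the issue more concrete (read the input stream, deserialize, call the user-written function, serialize the result, write to the output stream), here is a minimal Python sketch. It is only an illustration: the one-record-per-line, tab-delimited format and the names run_controller and str_length are assumptions made for this example and are not taken from the PIG-2417 patches.

```python
import io


def run_controller(udf, in_stream, out_stream, delimiter="\t"):
    """Read one record per line, apply udf to its fields, write the result.

    Assumption: records arrive one per line with tab-delimited string
    fields. This is NOT the wire format used by the PIG-2417 patches.
    """
    for line in in_stream:
        # Deserialize: strip the trailing newline and split into fields.
        fields = line.rstrip("\n").split(delimiter)
        # Pass the deserialized fields to the user-written function.
        result = udf(*fields)
        # Serialize the result as text and write it to the output stream.
        out_stream.write(str(result) + "\n")
    out_stream.flush()


def str_length(s):
    # Hypothetical user-written UDF: returns the length of its argument.
    return len(s)


# Drive the loop with in-memory streams instead of a real process pipe.
demo_out = io.StringIO()
run_controller(str_length, io.StringIO("pig\nhadoop\n"), demo_out)
```

In an actual controller the streams would presumably be the process's standard input and output (sys.stdin and sys.stdout), with the StreamingUDF class on the JVM side writing to and reading from the other ends.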