Hi Pig community,
I am running a Pig script that uses a Python (Jython) UDF, and I am getting a failure that is
hard to debug. The relevant parts of the script are:
REGISTER [...]clustercentroid_udfs.py using jython as UDFS ;
[... definition of cluster_vals ...]
grouped = group cluster_vals by (clusters::cluster_id, tfidf::att, clusters::block_size);
cluster_tfidf = foreach grouped {
    generate
        group.clusters::cluster_id as cluster_id,
        group.clusters::block_size as block_size,
        group.tfidf::att as att,
        UDFS.normalize_avg_words(cluster_vals.tfidf::pairs) as centroid;
}
store cluster_tfidf into [...]
I can remove essentially all of the logic from UDFS.normalize_avg_words and still get the
failure; for example, I get it with this definition of normalize_avg_words():
@outputSchema('words: {wvpairs: (word: chararray, normvalue: double)}')
def normalize_avg_words(line):
    return []
The log for the failing task shows:
2015-12-09 16:18:47,510 INFO [main] org.apache.pig.data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
2015-12-09 16:18:47,534 INFO [main] org.apache.pig.scripting.jython.JythonScriptEngine: created tmp python.cachedir=/data/3/yarn/nm/usercache/sesadmin/appcache/application_1444666458457_553099/container_e17_1444666458457_553099_01_685857/tmp/pig_jython_6256288828533965407
2015-12-09 16:18:49,443 INFO [main] org.apache.pig.scripting.jython.JythonFunction: Schema 'words: {wvpairs: (word: chararray, normvalue: double)}' defined for func normalize_avg_words
2015-12-09 16:18:49,498 INFO [main] org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce: Aliases being processed per job phase (AliasName[line,offset]): M: grouped[87,10] C: R: cluster_tfidf[99,16]
2015-12-09 16:18:49,511 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:638)
    at java.util.ArrayList.get(ArrayList.java:414)
    at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:118)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:348)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNextTuple(POPackage.java:269)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:421)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
I do not get the failure if I use just a 5GB segment of my full 300GB data set.
Also, I do not get the failure if I comment out the call to the UDF:
cluster_tfidf = foreach grouped {
    generate
        group.clusters::cluster_id as cluster_id,
        group.clusters::block_size as block_size,
        group.tfidf::att as att;
        -- UDFS.normalize_avg_words(cluster_vals.tfidf::pairs) as centroid;
}
I wonder if the failure is ultimately caused by an out-of-memory condition somewhere, but
I haven't seen anything in the log that indicates that directly. (I have tried using a
large number of reducers in the definition of grouped, but the result is the same.) What
should I look for in the logs as a telltale sign of running out of memory, and how would
I address it?
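In case it is relevant, the kind of tuning I had in mind is sketched below. I am assuming
these are the right property names for giving the reducers more memory and making Pig
spill bags to disk sooner on Hadoop 2 / CDH5; please correct me if they are not:

SET mapreduce.reduce.memory.mb 4096;        -- larger YARN container for each reducer
SET mapreduce.reduce.java.opts '-Xmx3277m'; -- reducer JVM heap, kept below the container size
SET pig.cachedbag.memusage 0.1;             -- spill bags earlier than the 0.2 default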
Since I don't get the failure when the UDF call is commented out, I wonder if the problem
is in the call itself, but I don't know how to diagnose or debug that.
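One thing I was planning to try is instrumenting the stub UDF so I can at least see
whether it is invoked and what it receives. This is just a sketch, and it assumes that
stderr from a Jython UDF ends up in the task's stderr log:

import sys

@outputSchema('words: {wvpairs: (word: chararray, normvalue: double)}')
def normalize_avg_words(line):
    # Log the size of the incoming bag so the task logs show whether the
    # UDF is reached at all; guard with try/except in case the input is
    # not sized (or is None).
    try:
        sys.stderr.write('normalize_avg_words: got %d items\n' % len(line))
    except Exception:
        sys.stderr.write('normalize_avg_words: input was %s\n' % repr(line)[:200])
    return []

If that is a reasonable way to trace a Jython UDF, I am happy to run it over the full
data set and report what shows up.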
Any help would be much appreciated!
Apache Pig version 0.12.0-cdh5.3.3 (rexported)
Hadoop 2.5.0-cdh5.3.3
Thanks,
Will
William F Dowling
Senior Technologist
Thomson Reuters