Hi Pig community,
I am running a Pig script that uses a Python (Jython) UDF, and I am getting a failure that is
hard to debug. The relevant parts of the script are:
REGISTER [...]clustercentroid_udfs.py using jython as UDFS ;
[... definition of cluster_vals ...]
grouped = group cluster_vals by (clusters::cluster_id, tfidf::att, clusters::block_size);
cluster_tfidf = foreach grouped {
    generate
        group.clusters::cluster_id as cluster_id,
        group.clusters::block_size as block_size,
        group.tfidf::att as att,
        UDFS.normalize_avg_words(cluster_vals.tfidf::pairs) as centroid;
}
store cluster_tfidf into [...]
I can remove essentially all of the logic from UDFS.normalize_avg_words and still get the
failure; for example, I get it with this definition of normalize_avg_words():
@outputSchema('words: {wvpairs: (word: chararray, normvalue: double)}')
def normalize_avg_words(line):
    return []
The log for the failing task shows:
2015-12-09 16:18:47,510 INFO [main] org.apache.pig.data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
2015-12-09 16:18:47,534 INFO [main] org.apache.pig.scripting.jython.JythonScriptEngine: created tmp python.cachedir=/data/3/yarn/nm/usercache/sesadmin/appcache/application_1444666458457_553099/container_e17_1444666458457_553099_01_685857/tmp/pig_jython_6256288828533965407
2015-12-09 16:18:49,443 INFO [main] org.apache.pig.scripting.jython.JythonFunction: Schema 'words: {wvpairs: (word: chararray, normvalue: double)}' defined for func normalize_avg_words
2015-12-09 16:18:49,498 INFO [main] org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce: Aliases being processed per job phase (AliasName[line,offset]): M: grouped[87,10] C: R: cluster_tfidf[99,16]
2015-12-09 16:18:49,511 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:638)
    at java.util.ArrayList.get(ArrayList.java:414)
    at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:118)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:348)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNextTuple(POPackage.java:269)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:421)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
I do not get the failure if I use just a 5GB segment of my full 300GB data set.
Also, I do not get the failure if I comment out the call to the UDF:
cluster_tfidf = foreach grouped {
    generate
        group.clusters::cluster_id as cluster_id,
        group.clusters::block_size as block_size,
        group.tfidf::att as att;
        -- UDFS.normalize_avg_words(cluster_vals.tfidf::pairs) as centroid;
}
I wonder if the failure is ultimately caused by an out-of-memory condition somewhere, but
I haven't seen anything in the log that indicates that directly. (I have tried using a
large number of reducers in the definition of grouped, but the result is the same.) What
should I look for in the logs as a telltale sign of running out of memory, and how would
I address it?
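In case it is relevant, the kind of tuning I had in mind is sketched below. I am assuming
these are the right property names for giving the reducers more memory and making Pig
spill bags to disk sooner on Hadoop 2 / CDH5; please correct me if they are not:

SET mapreduce.reduce.memory.mb 4096;        -- larger YARN container for each reducer
SET mapreduce.reduce.java.opts '-Xmx3277m'; -- reducer JVM heap, kept below the container size
SET pig.cachedbag.memusage 0.1;             -- spill bags earlier than the 0.2 default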
Since I don't get the failure when the UDF call is commented out, I wonder if the problem
is in the call itself, but I don't know how to diagnose or debug that.
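One thing I was planning to try is instrumenting the stub UDF so I can at least see
whether it is invoked and what it receives. This is just a sketch, and it assumes that
stderr from a Jython UDF ends up in the task's stderr log:

import sys

@outputSchema('words: {wvpairs: (word: chararray, normvalue: double)}')
def normalize_avg_words(line):
    # Log the size of the incoming bag so the task logs show whether the
    # UDF is reached at all; guard with try/except in case the input is
    # not sized (or is None).
    try:
        sys.stderr.write('normalize_avg_words: got %d items\n' % len(line))
    except Exception:
        sys.stderr.write('normalize_avg_words: input was %s\n' % repr(line)[:200])
    return []

If that is a reasonable way to trace a Jython UDF, I am happy to run it over the full
data set and report what shows up.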
Any help would be much appreciated!
Apache Pig version 0.12.0-cdh5.3.3 (rexported)
Hadoop 2.5.0-cdh5.3.3
Thanks,
Will
William F Dowling
Senior Technologist
Thomson Reuters