It seems it has to do with the UDF... Could you share a snippet of the code you are running? Kind regards
On 14 Jan 2017 1:40 am, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:

> I'm looking for tips on how to debug a PythonException that's very sparse
> on details. The full exception is below, but the only interesting bits
> appear to be the following lines:
>
>     org.apache.spark.api.python.PythonException:
>     ...
>     py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext
>
> Otherwise, the only other clue from the traceback I can see is that the
> problem may involve a UDF somehow.
>
> I've tested this code against many datasets (stored as ORC) and it works
> fine. The same code only seems to throw this error on a few datasets that
> happen to be sourced via JDBC. I can't seem to get a lead on what might be
> going wrong here.
>
> Does anyone have tips on how to debug a problem like this? How do I find
> more specifically what is going wrong?
>
> Nick
>
> Here's the full exception:
>
> 17/01/13 17:12:14 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID 15, devlx023.private.massmutual.com, executor 4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 161, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 97, in read_udfs
>     arg_offsets, udf = read_single_udf(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 78, in read_single_udf
>     f, return_type = read_command(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
>     command = serializer._read_with_length(file)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
>     return self.loads(obj)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 431, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/hadoop/yarn/nm/usercache/jenkins/appcache/application_1483203887152_1207/container_1483203887152_1207_01_000005/splinkr/person.py", line 111, in <module>
>     py_normalize_udf = udf(py_normalize, StringType())
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1868, in udf
>     return UserDefinedFunction(f, returnType)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1826, in __init__
>     self._judf = self._create_judf(name)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1830, in _create_judf
>     sc = SparkContext.getOrCreate()
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 307, in getOrCreate
>     SparkContext(conf=conf or SparkConf())
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
>     conf, jsc, profiler_cls)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 179, in _do_init
>     self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 246, in _initialize_context
>     return self._jvm.JavaSparkContext(jconf)
>   File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
>     answer, self._gateway_client, None, self._fqn)
>   File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
>     format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext
>
>     at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
>     at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
>     at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
>     at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
>     at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>     at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
>     at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
>     at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
>     at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>     at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>     at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>     at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>     at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
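One reading of the worker-side traceback: the Py4JError is raised while the pickled UDF is being deserialized on an executor. Unpickling re-imports splinkr/person.py, the module-level udf(py_normalize, StringType()) call at line 111 runs again on the worker, and UserDefinedFunction.__init__ ends up in SparkContext.getOrCreate(), which cannot construct a JavaSparkContext outside the driver. The sketch below is a hypothetical reconstruction of that pattern plus one common workaround, not the actual splinkr code; the function names and bodies are assumptions.

    # Hypothetical reconstruction of the pattern suggested by the traceback
    # above; not the real splinkr/person.py.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType


    def py_normalize(value):
        # Placeholder body; the real normalization logic is not shown in the thread.
        return value.strip().lower() if value is not None else None


    # Problematic pattern (roughly what person.py line 111 appears to do): the
    # udf() wrapper is created at module import time. When the pickled UDF is
    # deserialized on an executor and this module is re-imported there, udf()
    # calls SparkContext.getOrCreate() on the worker, which fails with
    # "An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext".
    #
    # py_normalize_udf = udf(py_normalize, StringType())


    # One common workaround: keep only the plain Python function at module level
    # and create the UDF wrapper on the driver, after the SparkContext exists.
    def make_py_normalize_udf():
        return udf(py_normalize, StringType())

On the driver this would be used as, for example, df.withColumn("name_norm", make_py_normalize_udf()(df["name"])). This is only one interpretation of the traceback; comparing how the ORC and JDBC code paths import and register the UDF would confirm or rule it out.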