I get this in the driver log:

java.lang.NullPointerException
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)

and this in the stderr of one of the executors:

15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
    split_index = read_int(infile)
  File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
    raise EOFError
EOFError

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
Caused by: java.lang.NullPointerException
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
        ... 4 more
[the identical "Python worker exited unexpectedly (crashed)" traceback is logged a second time at 15/02/10 23:10:35]


What I find odd is that when I create the broadcast object, the logs don't show 
any significant amount of memory being allocated in any of the block managers 
-- yet the dictionary is large: about 8 GB pickled on disk.
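
For reference, this is roughly the pattern I'm using -- a minimal sketch with 
illustrative names and a small stand-in dict (the real one is the ~8 GB 
dictionary mentioned above), run from an IPython session against the 
standalone cluster:

    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-dict")  # illustrative app name

    # Stand-in for the real ~8 GB dictionary
    lookup = {i: str(i) for i in range(1000)}

    # Broadcast the dict once to all executors
    b = sc.broadcast(lookup)

    rdd = sc.parallelize(range(1000))

    # Access the broadcast value inside the map function via b.value
    mapped = rdd.map(lambda k: b.value.get(k))
    print(mapped.take(5))

With the small stand-in dict this works fine; with the real dictionary the 
NPE above appears as soon as the map stage starts.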


On Feb 10, 2015, at 10:01 PM, Davies Liu <dav...@databricks.com> wrote:

> Could you paste the NPE stack trace here? It would be better to create a
> JIRA for it, thanks!
> 
> On Tue, Feb 10, 2015 at 10:42 AM, rok <rokros...@gmail.com> wrote:
>> I'm trying to use a broadcasted dictionary inside a map function and am
>> consistently getting Java null pointer exceptions. This is inside an IPython
>> session connected to a standalone spark cluster. I seem to recall being able
>> to do this before but at the moment I am at a loss as to what to try next.
>> Is there a limit to the size of broadcast variables? This one is rather
>> large (a few GB dict). Thanks!
>> 
>> Rok
>> 
>> 
>> 

