This is fixed in 1.2.1 -- could you upgrade to 1.2.1?

On Thu, Feb 12, 2015 at 4:55 AM, Rok Roskar <rokros...@gmail.com> wrote:
> Hi again,
>
> I narrowed down the issue a bit more -- it seems to have to do with the Kryo
> serializer. When I use it, this results in a NullPointerException:
>
> rdd = sc.parallelize(range(10))
> d = {}
> from random import random
> for i in range(100000):
>     d[i] = random()
>
> rdd.map(lambda x: d[x]).collect()
>
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-97-7cd5df24206c> in <module>()
> ----> 1 rdd.map(lambda x: d[x]).collect()
>
> /cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.pyc in collect(self)
>     674         """
>     675         with SCCallSiteSync(self.context) as css:
> --> 676             bytesInJava = self._jrdd.collect().iterator()
>     677         return list(self._collect_iterator_through_file(bytesInJava))
>     678
>
> /cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538             self.target_id, self.name)
>     539
>     540         for temp_arg in temp_args:
>
> /cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     298             raise Py4JJavaError(
>     299                 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                 format(target_id, '.', name), value)
>     301         else:
>     302             raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o768.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0
> (TID 87, e1305.hpc-lca.ethz.ch): java.lang.NullPointerException
>         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> If I use a dictionary with fewer items, then it works fine:
>
> In [98]:
> rdd = sc.parallelize(range(10))
> d = {}
>
> from random import random
> for i in range(10000):
>     d[i] = random()
>
> In [99]:
> rdd.map(lambda x: d[x]).collect()
>
> Out[99]:
> [0.39210713836346933,
>  0.8636333432012482,
>  0.28744831569153617,
>  0.663815926356163,
>  0.38274814840717364,
>  0.6606453820150496,
>  0.8610156719813942,
>  0.6971353266345091,
>  0.9896836700210551,
>  0.05789392881996358]
>
> Is there a size limit for objects serialized with Kryo? Or an option that
> controls it? The Java serializer works fine.
>
> On Wed, Feb 11, 2015 at 8:04 PM, Rok Roskar <rokros...@gmail.com> wrote:
>>
>> I think the problem was related to the broadcasts being too large -- I've
>> now split it up into many smaller operations, but it's still not quite there
>> -- see
>> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html
>>
>> Thanks,
>>
>> Rok
>>
>>
>> On Wed, Feb 11, 2015, 19:59 Davies Liu <dav...@databricks.com> wrote:
>>>
>>> Could you share a short script to reproduce this problem?
>>>
>>> On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar <rokros...@gmail.com> wrote:
>>> > I didn't notice other errors -- I also thought such a large broadcast was a
>>> > bad idea, but I tried something similar with a much smaller dictionary and
>>> > encountered the same problem. I'm not familiar enough with Spark internals
>>> > to know whether the trace indicates an issue with the broadcast variables or
>>> > perhaps something different?
>>> >
>>> > The driver and executors have 50 GB of RAM, so memory should be fine.
>>> >
>>> > Thanks,
>>> >
>>> > Rok
>>> >
>>> > On Feb 11, 2015 12:19 AM, "Davies Liu" <dav...@databricks.com> wrote:
>>> >>
>>> >> It's brave to broadcast 8G of pickled data; it will take more than 15G of
>>> >> memory in each Python worker. How much memory do you have in the executor
>>> >> and driver? Do you see any other exceptions in the driver and executors?
>>> >> Perhaps something related to serialization in the JVM.
>>> >>
>>> >> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar <rokros...@gmail.com> wrote:
>>> >> > I get this in the driver log:
>>> >>
>>> >> I think this should happen on the executor -- or did you call first() or
>>> >> take() on the RDD?
>>> >>
>>> >> > java.lang.NullPointerException
>>> >> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>>> >> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> >> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> >> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>> >> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>>> >> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>>> >> >
>>> >> > and on one of the executor's stderr:
>>> >> >
>>> >> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
>>> >> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>>> >> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
>>> >> >     split_index = read_int(infile)
>>> >> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
>>> >> >     raise EOFError
>>> >> > EOFError
>>> >> >
>>> >> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>>> >> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
>>> >> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>>> >> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>> >> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>>> >> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>>> >> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>>> >> > Caused by: java.lang.NullPointerException
>>> >> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>>> >> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> >> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> >> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>> >> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>>> >> >         ... 4 more
>>> >> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
>>> >> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>>> >> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
>>> >> >     split_index = read_int(infile)
>>> >> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
>>> >> >     raise EOFError
>>> >> > EOFError
>>> >> >
>>> >> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>>> >> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
>>> >> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>>> >> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>> >> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>>> >> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>>> >> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>>> >> > Caused by: java.lang.NullPointerException
>>> >> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>>> >> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> >> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> >> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>> >> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>>> >> >         ... 4 more
>>> >> >
>>> >> >
>>> >> > What I find odd is that when I make the broadcast object, the logs don't
>>> >> > show any significant amount of memory being allocated in any of the block
>>> >> > managers -- but the dictionary is large, it's 8 GB pickled on disk.
>>> >> >
>>> >> >
>>> >> > On Feb 10, 2015, at 10:01 PM, Davies Liu <dav...@databricks.com> wrote:
>>> >> >
>>> >> >> Could you paste the NPE stack trace here? It would be better to create a
>>> >> >> JIRA for it, thanks!
>>> >> >>
>>> >> >> On Tue, Feb 10, 2015 at 10:42 AM, rok <rokros...@gmail.com> wrote:
>>> >> >>> I'm trying to use a broadcasted dictionary inside a map function and am
>>> >> >>> consistently getting Java null pointer exceptions. This is inside an IPython
>>> >> >>> session connected to a standalone spark cluster. I seem to recall being able
>>> >> >>> to do this before, but at the moment I am at a loss as to what to try next.
>>> >> >>> Is there a limit to the size of broadcast variables? This one is rather
>>> >> >>> large (a few GB dict). Thanks!
>>> >> >>>
>>> >> >>> Rok
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> --
>>> >> >>> View this message in context:
>>> >> >>> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html
>>> >> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> >> >>>
>>> >> >>> ---------------------------------------------------------------------
>>> >> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> >>> For additional commands, e-mail: user-h...@spark.apache.org
>>> >> >>>
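
For anyone who lands on this thread later: the failing snippets above either capture the Python dict directly in the task closure or broadcast it explicitly, and both paths hit the same NullPointerException in PythonRDD.writeUTF on 1.2.0 with Kryo, which is reported fixed in 1.2.1. Below is a minimal sketch of the explicit-broadcast pattern being discussed, written against the PySpark API of that era. The app name and buffer value are illustrative, and spark.kryoserializer.buffer.max.mb (the 1.2-era name for Kryo's output-buffer cap) is my assumption about the kind of option Rok is asking after, not something confirmed in this thread.

    from random import random
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("broadcast-dict-sketch")                 # illustrative name
            .set("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")   # same serializer as the failing repro
            .set("spark.kryoserializer.buffer.max.mb", "512"))   # assumed 1.2-era name for Kryo's buffer cap
    sc = SparkContext(conf=conf)

    # Build the lookup table on the driver, then broadcast it once instead of
    # capturing it in every task closure; workers read it through .value.
    d = dict((i, random()) for i in range(100000))
    bd = sc.broadcast(d)

    rdd = sc.parallelize(range(10))
    print(rdd.map(lambda x: bd.value[x]).collect())

Note that, per Davies' numbers above, bd.value still materializes the whole dict in every Python worker, so a multi-GB dict is expensive to broadcast even when it serializes cleanly.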
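If the dictionary really is in the multi-gigabyte range, one way to do the kind of splitting Rok mentions is to keep the lookup data as an RDD and join against it rather than broadcasting it. This is my own suggestion rather than something spelled out in the thread; the sketch continues from the one above (it reuses sc and d), and the partition count is illustrative.

    # Key the lookup table as an RDD of (key, value) pairs and join, so no
    # single Python worker has to hold the whole dict in memory.
    lookup = sc.parallelize(d.items(), 100)
    keys = sc.parallelize(range(10)).map(lambda x: (x, None))
    values = keys.join(lookup).map(lambda kv: kv[1][1]).collect()
    print(values)

The trade-off is a shuffle for the join instead of a one-time broadcast, which usually wins once the table no longer fits comfortably in each worker.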