>
> But after changing limit(500) to limit(1000), the code reports a
> NullPointerException.
>
I had a similar situation, and the problem turned out to be a specific
record. Try to find which records are returned when you limit to 1000 but
not when you limit to 500.
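
One way to do that, as a rough sketch: pull the first 1000 beans through the
plain RDD path (which works for you) and check them by hand. The getDomain()
accessor below is just my assumption about PixelObject.

// Fetch beans without the DataFrame conversion, then look for anything
// the reflection-based row conversion could trip on.
java.util.List<PixelObject> first1000 = pixels.take(1000);
for (int i = 0; i < first1000.size(); i++) {
    PixelObject p = first1000.get(i);
    if (p == null || p.getDomain() == null) {
        System.out.println("suspicious record at index " + i + ": " + p);
    }
}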

Could it be an NPE thrown from PixelObject?
Are you running Spark with master=local, so that it runs inside your IDE and
you can see the errors from both the driver and the worker?
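
For reference, a local-mode setup would look something like this (the app
name is hypothetical):

SparkConf conf = new SparkConf()
    .setAppName("pixel-debug")  // hypothetical name
    .setMaster("local[*]");     // driver and executors share one JVM
JavaSparkContext sc = new JavaSparkContext(conf);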


*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Thu, Oct 29, 2015 at 10:04 AM, Zhang, Jingyu <jingyu.zh...@news.com.au>
wrote:

> Thanks Romi,
>
> I resized the dataset to 7 MB; however, the code still throws a
> NullPointerException.
>
> Did you try to cache a DataFrame with just a single row?
>
> Yes, I tried, but the same problem occurs.
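> Roughly, what I tried was a sketch like:
>
> DataFrame one = schemaPixel.limit(1).cache();
> System.out.println(one.count()); // same NullPointerException
>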
> Do your rows have any columns with null values?
>
> No, I had filtered out null values before caching the DataFrame.
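>
> For reference, that filter runs on the DataFrame, after the bean
> conversion. A sketch of guarding the RDD itself first (assuming the NPE
> comes from null beans rather than null column values) would be:
>
> // Drop null beans before the reflection-based conversion runs.
> JavaRDD<PixelObject> nonNullPixels = pixels.filter(p -> p != null);
> DataFrame schemaPixel = sqlContext.createDataFrame(nonNullPixels, PixelObject.class);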
>
> Can you post a code snippet here on how you load/generate the dataframe?
>
> Sure. Here is working code 1:
>
> JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator()).cache();
>
> System.out.println(pixels.count()); // 3000-4000 rows
>
> Working code 2:
>
> JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator());
>
> DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.class);
>
> DataFrame totalDF1 = schemaPixel.select(schemaPixel.col("domain"))
>     .filter("'domain' is not null").limit(500);
>
> System.out.println(totalDF1.count());
>
>
> But after changing limit(500) to limit(1000), the code reports a
> NullPointerException:
>
>
> JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator());
>
> DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.class);
>
> DataFrame totalDF = schemaPixel.select(schemaPixel.col("domain"))
>     .filter("'domain' is not null").limit(1000);
>
> System.out.println(totalDF.count()); // problem at this line
>
> 15/10/29 18:56:28 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 15/10/29 18:56:28 INFO TaskSchedulerImpl: Cancelling stage 0
> 15/10/29 18:56:28 INFO DAGScheduler: ShuffleMapStage 0 (count at XXXXX.java:113) failed in 3.764 s
> 15/10/29 18:56:28 INFO DAGScheduler: Job 0 failed: count at XXX.java:113, took 3.862207 s
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
> at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>
> Does dataframe.rdd.cache work?
>
> No, I tried that, but I get the same exception.
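>
> In the Java API that amounted to a sketch like:
>
> // Cache the underlying RDD of Rows instead of the DataFrame itself.
> schemaPixel.javaRDD().cache();
> System.out.println(schemaPixel.javaRDD().count()); // same exception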
>
> Thanks,
>
> Jingyu
>
> On 29 October 2015 at 17:38, Romi Kuntsman <r...@totango.com> wrote:
>
>> Did you try to cache a DataFrame with just a single row?
>> Do your rows have any columns with null values?
>> Can you post a code snippet here on how you load/generate the dataframe?
>> Does dataframe.rdd.cache work?
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Thu, Oct 29, 2015 at 4:33 AM, Zhang, Jingyu <jingyu.zh...@news.com.au>
>> wrote:
>>
>>> It is not a problem to use JavaRDD.cache() for 200 MB of data (all
>>> objects read from JSON format). But when I try to use DataFrame.cache(),
>>> it throws the exception below.
>>>
>>> My machine can cache 1 GB of data in Avro format without any problem.
>>>
>>> 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms
>>> 15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in 27.832369 ms
>>> 15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>>> java.lang.NullPointerException
>>> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(SQLContext.scala:500)
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(SQLContext.scala:500)
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(SQLContext.scala:500)
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(SQLContext.scala:498)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:127)
>>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>>> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> 15/10/29 13:26:23 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
>>> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Jingyu
>>>
>>
>>
>>
>
