> > BUT, after changing limit(500) to limit(1000), the code reports a
> > NullPointerException.
>
> I had a similar situation, and the problem was with a specific record. Try
> to find which records are returned when you limit to 1000 but not when you
> limit to 500.
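[Editor's note: the record-hunting advice above could be done with the Spark 1.5-era DataFrame API, e.g. totalDF.except(totalDF1), with the caveat that limit() without an ordering gives no guarantee that the first 500 of the 1000 rows are the same rows as a separate limit(500). The core of the diff is plain set subtraction; the sketch below illustrates it without Spark, using made-up record values:]

```java
import java.util.*;

public class DiffRecords {
    // Given the rows seen at limit(1000) and at limit(500), the suspect
    // records are those present in the larger result but not the smaller one.
    public static Set<String> suspects(List<String> first1000, List<String> first500) {
        Set<String> diff = new LinkedHashSet<>(first1000);
        diff.removeAll(first500);
        return diff;
    }

    public static void main(String[] args) {
        List<String> small = Arrays.asList("a.com", "b.com");
        // In this toy example a null sneaks in only beyond the smaller limit.
        List<String> large = Arrays.asList("a.com", "b.com", null, "c.com");
        System.out.println(suspects(large, small)); // prints [null, c.com]
    }
}
```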
Could it be an NPE thrown from PixelObject? Are you running Spark with
master=local, so that it runs inside your IDE and you can see the errors
from the driver and the worker?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Thu, Oct 29, 2015 at 10:04 AM, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:

> Thanks Romi,
>
> I resized the dataset to 7 MB, but the code still throws a
> NullPointerException.
>
> > Did you try to cache a DataFrame with just a single row?
>
> Yes, I tried. Same problem.
>
> > Do your rows have any columns with null values?
>
> No, I filtered out null values before caching the DataFrame.
>
> > Can you post a code snippet here on how you load/generate the dataframe?
>
> Sure. Here is working code 1:
>
> JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator()).cache();
>
> System.out.println(pixels.count()); // 3000-4000 rows
>
> Working code 2:
>
> JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator());
>
> DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.class);
>
> DataFrame totalDF1 = schemaPixel.select(schemaPixel.col("domain"))
>     .filter("'domain' is not null").limit(500);
>
> System.out.println(totalDF1.count());
>
> BUT, after changing limit(500) to limit(1000), the code reports a
> NullPointerException:
>
> JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator());
>
> DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.class);
>
> DataFrame totalDF = schemaPixel.select(schemaPixel.col("domain"))
>     .filter("'domain' is not null").limit(1000);
>
> System.out.println(totalDF.count()); // problem at this line
>
> 15/10/29 18:56:28 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
>
> 15/10/29 18:56:28 INFO TaskSchedulerImpl: Cancelling stage 0
>
> 15/10/29 18:56:28 INFO DAGScheduler: ShuffleMapStage 0 (count at XXXXX.java:113) failed in 3.764 s
>
> 15/10/29 18:56:28 INFO DAGScheduler: Job 0 failed: count at XXX.java:113, took 3.862207 s
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
>
> at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>
> > Does dataframe.rdd.cache work?
>
> No, I tried, but same exception.
>
> Thanks,
>
> Jingyu
>
> On 29 October 2015 at 17:38, Romi Kuntsman <r...@totango.com> wrote:
>
>> Did you try to cache a DataFrame with just a single row?
>> Do your rows have any columns with null values?
>> Can you post a code snippet here on how you load/generate the dataframe?
>> Does dataframe.rdd.cache work?
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Thu, Oct 29, 2015 at 4:33 AM, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:
>>
>>> It is not a problem to use JavaRDD.cache() for 200 MB of data (all
>>> objects read from JSON format), but when I try to use DataFrame.cache(),
>>> it throws the exception below.
>>>
>>> My machine can cache 1 GB of data in Avro format without any problem.
>>>
>>> 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms
>>>
>>> 15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in 27.832369 ms
>>>
>>> 15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>>>
>>> java.lang.NullPointerException
>>>
>>> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(SQLContext.scala:500)
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(SQLContext.scala:500)
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(SQLContext.scala:500)
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(SQLContext.scala:498)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:127)
>>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>>> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> 15/10/29 13:26:23 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
>>>
>>> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>>>
>>> Thanks,
>>>
>>> Jingyu
>>>
>>> This message and its attachments may contain legally privileged or
>>> confidential information. It is intended solely for the named addressee.
>>> If you are not the addressee indicated in this message or responsible for
>>> delivery of the message to the addressee, you may not copy or deliver
>>> this message or its attachments to anyone. Rather, you should permanently
>>> delete this message and its attachments and kindly notify the sender by
>>> reply e-mail. Any content of this message and its attachments which does
>>> not relate to the official business of the sending company must be taken
>>> not to have been sent or endorsed by that company or any of its related
>>> entities. No warranty is made that the e-mail or attachments are free
>>> from computer virus or other defect.
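[Editor's note: both stack traces in this thread fail inside a reflective getter call (GeneratedMethodAccessor*.invoke) under SQLContext.createDataFrame, which is where Spark reads the PixelObject bean's fields. One way to hunt for an offending record outside Spark is to run a similar reflective scan over each bean and report the getters that return null. The sketch below is self-contained; the Pixel class is a hypothetical stand-in, since PixelObject's real fields aren't shown in the thread:]

```java
import java.lang.reflect.Method;
import java.util.*;

public class NullFieldScan {
    // Invoke every public zero-argument getter and report those returning
    // null -- roughly what Spark's bean reflection does when building rows.
    public static List<String> nullGetters(Object bean) throws Exception {
        List<String> nulls = new ArrayList<>();
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().startsWith("get")
                    && m.getParameterTypes().length == 0
                    && !m.getName().equals("getClass")
                    && m.invoke(bean) == null) {
                nulls.add(m.getName());
            }
        }
        return nulls;
    }

    // Hypothetical stand-in for PixelObject; the real class isn't shown.
    public static class Pixel {
        private final String domain;
        private final String url;
        public Pixel(String domain, String url) { this.domain = domain; this.url = url; }
        public String getDomain() { return domain; }
        public String getUrl() { return url; }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(nullGetters(new Pixel("news.com.au", null))); // [getUrl]
    }
}
```

Applied over the RDD (e.g. inside a map before createDataFrame), this would flag which records carry null fields, which is one candidate explanation for why limit(500) succeeds while limit(1000) pulls in a bad record.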