Re: NullPointerException When Reading Avro Sequence Files
To me this looks like an error internal to the REPL; I am not sure what is causing it. Personally I never use the REPL. Can you try typing up your program and running it from an IDE or with spark-submit, and see if you still get the same error?

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Mon, Dec 15, 2014 at 4:54 PM, Cristovao Jose Domingues Cordeiro <cristovao.corde...@cern.ch> wrote:

Sure, thanks:

    warning: there were 1 deprecation warning(s); re-run with -deprecation for details
    java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
        at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
        at org.apache.hadoop.mapreduce.Job.toString(Job.java:462)
        at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
        at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
        at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
        at .<init>(<console>:10)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627)
        at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Could something you omitted in your snippet be causing this exception?

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro
IT Department - 28/R-018
CERN

----------------------------------------
From: Simone Franzini [captainfr...@gmail.com]
Sent: 15 December 2014 16:52
To: Cristovao Jose Domingues Cordeiro
Subject: Re: NullPointerException When Reading Avro Sequence Files

Ok, I have no idea what that is. It appears to be an internal Spark exception. If you can post the entire stack trace, it would give some more details for understanding what is going on.

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Mon, Dec 15, 2014 at 4:50 PM, Cristovao Jose Domingues Cordeiro <cristovao.corde...@cern.ch> wrote:

Hi, thanks for that. But yes, the 2nd line is an exception: jobread is not created.

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro
IT Department - 28/R-018
CERN

----------------------------------------
From: Simone Franzini [captainfr...@gmail.com]
Sent: 15 December 2014 16:39
To: Cristovao Jose Domingues Cordeiro
Subject: Re: NullPointerException When Reading Avro Sequence Files

I did not mention the imports needed by my code. I think these are all of them:

    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import
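For context, a minimal sketch of the kind of read these messages are discussing (Spark 1.x Scala API; the path, method name, and record type are placeholders of my choosing, and this assumes the hadoop2 build of avro-mapred is on the classpath):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: read an Avro *container* file as an RDD of GenericRecord.
// "path" would be something like "hdfs://namenode/data/file.avro".
def readAvro(sc: SparkContext, path: String): RDD[GenericRecord] = {
  val rdd = sc.newAPIHadoopFile(
    path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  rdd.map { case (key, _) => key.datum() }  // unwrap each AvroKey
}
```

Note this reads Avro container files, not Avro-in-SequenceFile data; the distinction is the subject of the rest of this thread.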
Re: NullPointerException When Reading Avro Sequence Files
Hi Cristovao,

I have seen a very similar issue, which I have posted about in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html

I think your main issue here is somewhat similar, in that the MapWrapper Scala class is not registered. That class gets registered by the Twitter chill-scala AllScalaRegistrar class, which you are currently not using. As far as I understand, in order to use Avro with Spark you also have to use Kryo, which means using the Spark KryoSerializer; this in turn uses Twitter chill. I posted the basic code that I am using here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491

Maybe there is a simpler solution to your problem, but I am not that much of an expert yet. I hope this helps.

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro <cristovao.corde...@cern.ch> wrote:

Hi Simone,

thanks, but I don't think that's it. I've tried several libraries with the --jars argument. Some do give what you said. But other times (when I put in the right version, I guess) I get the following:

    14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)

Which is odd, since I am reading an Avro file I wrote... with the same piece of code:
https://gist.github.com/MLnick/5864741781b9340cb211

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro
IT Department - 28/R-018
CERN

----------------------------------------
From: Simone Franzini [captainfr...@gmail.com]
Sent: 06 December 2014 15:48
To: Cristovao Jose Domingues Cordeiro
Subject: Re: NullPointerException When Reading Avro Sequence Files

    java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

That is a sign that you are mixing up versions of Hadoop. This is particularly an issue when dealing with Avro. If you are using Hadoop 2, you will need the hadoop2 version of avro-mapred. In Maven you would get it with the <classifier>hadoop2</classifier> tag.

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Fri, Dec 5, 2014 at 3:52 AM, cjdc <cristovao.corde...@cern.ch> wrote:

Hi all, I've tried the above example on Gist, but it doesn't work (at least for me). Did anyone get this:

    14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
    java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229
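A minimal sketch of the Kryo setup described above (the registrator class name here is hypothetical; AllScalaRegistrar is from chill-scala, and the commented chill-avro call is one way to register Avro-generated records):

```scala
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.AllScalaRegistrar
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator: AllScalaRegistrar registers the Scala collection
// wrappers (such as scala.collection.convert.Wrappers$MapWrapper) that the
// NotSerializableException above complains about.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    new AllScalaRegistrar().apply(kryo)
    // Avro specific-record classes could be registered via chill-avro, e.g.:
    // kryo.register(classOf[MyRecord],
    //   com.twitter.chill.avro.AvroSerializer.SpecificRecordSerializer[MyRecord])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")
```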
Re: NullPointerException When Reading Avro Sequence Files
You can use this Maven dependency:

    <dependency>
        <groupId>com.twitter</groupId>
        <artifactId>chill-avro</artifactId>
        <version>0.4.0</version>
    </dependency>

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Tue, Dec 9, 2014 at 9:53 AM, Cristovao Jose Domingues Cordeiro <cristovao.corde...@cern.ch> wrote:

Thanks for the reply!

I have in fact tried your code, but I lack the Twitter chill package and I cannot find it online. So I am now trying this instead: http://spark.apache.org/docs/latest/tuning.html#data-serialization. But in case I can't make that work, could you tell me where to get that Twitter package you used?

Thanks

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro
IT Department - 28/R-018
CERN
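For projects built with sbt rather than Maven, the equivalent dependency line would presumably be the following (assuming chill-avro is cross-published with a Scala-version suffix, as the other chill modules are):

```scala
libraryDependencies += "com.twitter" %% "chill-avro" % "0.4.0"
```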
Re: NullPointerException When Reading Avro Sequence Files
Hi all,

I've tried the above example on Gist, but it doesn't work (at least for me). Did anyone get this:

    14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
    java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    14/12/05 10:44:40 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p20456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: NullPointerException When Reading Avro Sequence Files
For those curious, I used the JavaSparkContext and got access to an AvroSequenceFile (a wrapper around SequenceFile) using the following:

    file = sc.newAPIHadoopFile("hdfs path to my file",
        AvroSequenceFileInputFormat.class,
        AvroKey.class,
        AvroValue.class,
        new Configuration());

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10305.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NullPointerException When Reading Avro Sequence Files
I see Spark is using AvroRecordReaderBase, which reads Avro container files; that is a different format from sequence files. If anyone is using Avro sequence files with success and has an example, please let me know.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NullPointerException When Reading Avro Sequence Files
To be more specific, I'm working with a system that stores data in org.apache.avro.hadoop.io.AvroSequenceFile format. An AvroSequenceFile is "a wrapper around a Hadoop SequenceFile that also supports reading and writing Avro data." It seems that Spark does not support this format out of the box.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NullPointerException When Reading Avro Sequence Files
I got this working locally a little while ago when playing around with AvroKeyInputFormat:
https://gist.github.com/MLnick/5864741781b9340cb211

But I'm not sure about AvroSequenceFile. Any chance you have an example data file or records?

On Sat, Jul 19, 2014 at 11:00 AM, Sparky <gullo_tho...@bah.com> wrote:

To be more specific, I'm working with a system that stores data in org.apache.avro.hadoop.io.AvroSequenceFile format. An AvroSequenceFile is "a wrapper around a Hadoop SequenceFile that also supports reading and writing Avro data." It seems that Spark does not support this out of the box.
Re: NullPointerException When Reading Avro Sequence Files
Thanks for the gist. I'm just now learning about Avro. I think when you use a DataFileWriter you are writing to an Avro container file, which is different from an Avro sequence file. I have a system where data was written to an HDFS sequence file using AvroSequenceFile.Writer (a wrapper around SequenceFile.Writer). I'll put together an example of the problem so others can better understand what I'm talking about.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
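To illustrate the distinction being made here, this is a sketch of what writing an Avro *container* file with DataFileWriter looks like (the schema and file name are made up for the example; this is the format the gist reads, not a SequenceFile):

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Hypothetical one-field schema for the example.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Example","fields":[{"name":"id","type":"int"}]}""")

// DataFileWriter produces an Avro container file ("example.avro"),
// which AvroKeyInputFormat can read; an AvroSequenceFile cannot be
// produced or consumed this way.
val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("example.avro"))
val rec = new GenericData.Record(schema)
rec.put("id", 1)
writer.append(rec)
writer.close()
```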
Re: NullPointerException When Reading Avro Sequence Files
I think you probably want to use `AvroSequenceFileInputFormat` with `newAPIHadoopFile`. I'm not even sure that in Hadoop you would use SequenceFileInputFormat to read an Avro sequence file.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10203.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NullPointerException When Reading Avro Sequence Files
Thanks for responding. I tried using the newAPIHadoopFile method and got an IOException with the message "Not a data file." If anyone has an example of this working, I'd appreciate your input. What I entered at the REPL and what I got back are below:

    scala> val myAvroSequenceFile = sc.newAPIHadoopFile("hdfs://my url",
             classOf[AvroKeyInputFormat[GenericRecord]],
             classOf[AvroKey[GenericRecord]],
             classOf[NullWritable])

    scala> myAvroSequenceFile.first()
    14/07/18 17:02:38 INFO FileInputFormat: Total input paths to process : 1
    14/07/18 17:02:38 INFO SparkContext: Starting job: first at <console>:19
    14/07/18 17:02:38 INFO DAGScheduler: Got job 0 (first at <console>:19) with 1 output partitions (allowLocal=true)
    14/07/18 17:02:38 INFO DAGScheduler: Final stage: Stage 0 (first at <console>:19)
    14/07/18 17:02:38 INFO DAGScheduler: Parents of final stage: List()
    14/07/18 17:02:38 INFO DAGScheduler: Missing parents: List()
    14/07/18 17:02:38 INFO DAGScheduler: Computing the requested partition locally
    14/07/18 17:02:38 INFO NewHadoopRDD: Input split: hdfs://my url
    14/07/18 17:02:38 WARN AvroKeyInputFormat: Reader schema was not set. Use AvroJob.setInputKeySchema() if desired.
    14/07/18 17:02:38 INFO AvroKeyInputFormat: Using a reader schema equal to the writer schema.
    14/07/18 17:02:38 INFO DAGScheduler: Failed to run first at <console>:19
    org.apache.spark.SparkDriverExecutionException: Execution error
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:585)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:563)
    Caused by: java.io.IOException: Not a data file.
        at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
        at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:180)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:90)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:114)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:578)
        ... 1 more

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
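The "Not a data file." error is consistent with pointing a container-file reader (AvroKeyInputFormat) at a SequenceFile. A hedged sketch of the alternative read path discussed in this thread, using AvroSequenceFileInputFormat from avro-mapred (the key/value types depend on how the file was written, and are left generic here as an assumption):

```scala
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.avro.mapreduce.AvroSequenceFileInputFormat
import org.apache.spark.SparkContext

// Hypothetical sketch: read an AvroSequenceFile (Avro-in-SequenceFile)
// rather than an Avro container file. "hdfs://my url" is the same
// placeholder path used above.
def readAvroSequenceFile(sc: SparkContext) =
  sc.newAPIHadoopFile(
    "hdfs://my url",
    classOf[AvroSequenceFileInputFormat[AvroKey[_], AvroValue[_]]],
    classOf[AvroKey[_]],
    classOf[AvroValue[_]])
```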