Thanks for responding. I tried using the newAPIHadoopFile method and got an
IOException with the message "Not a data file."

If anyone has a working example of this, I'd appreciate your input.

What I entered at the REPL and what I got back are below:

scala> val myAvroSequenceFile = sc.newAPIHadoopFile("hdfs://<my url>",
  classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

scala> myAvroSequenceFile.first()
14/07/18 17:02:38 INFO FileInputFormat: Total input paths to process : 1
14/07/18 17:02:38 INFO SparkContext: Starting job: first at <console>:19
14/07/18 17:02:38 INFO DAGScheduler: Got job 0 (first at <console>:19) with 1 output partitions (allowLocal=true)
14/07/18 17:02:38 INFO DAGScheduler: Final stage: Stage 0(first at <console>:19)
14/07/18 17:02:38 INFO DAGScheduler: Parents of final stage: List()
14/07/18 17:02:38 INFO DAGScheduler: Missing parents: List()
14/07/18 17:02:38 INFO DAGScheduler: Computing the requested partition locally
14/07/18 17:02:38 INFO NewHadoopRDD: Input split: hdfs:<my url>
14/07/18 17:02:38 WARN AvroKeyInputFormat: Reader schema was not set. Use AvroJob.setInputKeySchema() if desired.
14/07/18 17:02:38 INFO AvroKeyInputFormat: Using a reader schema equal to the writer schema.
14/07/18 17:02:38 INFO DAGScheduler: Failed to run first at <console>:19
org.apache.spark.SparkDriverExecutionException: Execution error
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:585)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:563)
Caused by: java.io.IOException: Not a data file.
        at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
        at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:180)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:90)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:114)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:578)
        ... 1 more
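For what it's worth, looking at the Avro source, DataFileStream.initialize throws "Not a data file." when the input does not begin with the Avro container-file magic bytes ("Obj" followed by a 0x01 version byte). Hadoop SequenceFiles begin with "SEQ" plus a version byte instead, so if the file on HDFS is really a sequence file rather than an Avro container file, AvroKeyInputFormat would fail exactly like this. Here is a small header check I put together to tell the two apart — the FileMagic object and its method names are just my own sketch, not anything from Spark or Avro:

```scala
import java.io.{DataInputStream, FileInputStream}

// Sketch (my own helper, not part of Spark or Avro): classify a file by its
// leading magic bytes. Avro container files start with "Obj" + 0x01;
// Hadoop SequenceFiles start with "SEQ" + a version byte.
object FileMagic {
  private val AvroMagic: Array[Byte] = "Obj".getBytes("US-ASCII") :+ 1.toByte
  private val SeqMagic: Array[Byte] = "SEQ".getBytes("US-ASCII")

  // Decide what kind of file a header belongs to from its first bytes.
  def classify(header: Array[Byte]): String =
    if (header.length >= 4 && header.take(4).sameElements(AvroMagic)) "avro container file"
    else if (header.length >= 3 && header.take(3).sameElements(SeqMagic)) "sequence file"
    else "unknown"

  // Read the first four bytes of a local copy of the file and classify it.
  def classifyFile(path: String): String = {
    val in = new DataInputStream(new FileInputStream(path))
    try {
      val buf = new Array[Byte](4)
      in.readFully(buf)
      classify(buf)
    } finally in.close()
  }
}
```

Running classifyFile against a local copy of the file (after a `hadoop fs -get`) should show whether the writer and the reader disagree about the format.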



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.