I'm trying to read an Avro sequence file using the sequenceFile method on the
SparkContext object, and I get a NullPointerException. If I read the file
outside of Spark using AvroSequenceFile.Reader, I don't have any problems.
Has anyone had success doing this?
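For reference, the standalone read that works for me looks roughly like the
sketch below. This is a minimal, hedged example, assuming avro-mapred is on
the classpath; the HDFS URL placeholder is the same hypothetical one as in
the shell transcript, and it needs a live cluster (or local file) to actually
run:

```scala
import org.apache.avro.hadoop.io.AvroSequenceFile
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumption: the path is a placeholder, as in the shell session above.
val conf = new Configuration()
val path = new Path("hdfs://<my url is here>")
val fs   = FileSystem.get(path.toUri, conf)

// AvroSequenceFile.Reader extends Hadoop's SequenceFile.Reader and picks up
// the Avro key/value schemas from the file's metadata.
val reader = new AvroSequenceFile.Reader(
  new AvroSequenceFile.Reader.Options()
    .withFileSystem(fs)
    .withInputPath(path)
    .withConfiguration(conf))

try {
  // SequenceFile.Reader's serialization-based iteration: next() returns the
  // key object, or null once the file is exhausted.
  var key: AnyRef = reader.next(null)
  while (key != null) {
    val value = reader.getCurrentValue(null)
    println(s"key=$key value=$value")
    key = reader.next(key)
  }
} finally {
  reader.close()
}
```

Reading this way works without error, which is why I suspect the problem is
in how I'm invoking sc.sequenceFile rather than in the file itself.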
Below is what I typed at the Spark shell, and the output I saw:

scala> var myAvroSequenceFile = sc.sequenceFile("hdfs://<my url is here>",
  classOf[AvroKey[GenericRecord]], classOf[AvroValue[GenericRecord]])
scala> myAvroSequenceFile.first
14/07/18 16:31:31 INFO FileInputFormat: Total input paths to process : 1
14/07/18 16:31:31 INFO SparkContext: Starting job: first at <console>:18
14/07/18 16:31:31 INFO DAGScheduler: Got job 2 (first at <console>:18) with 1 output partitions (allowLocal=true)
14/07/18 16:31:31 INFO DAGScheduler: Final stage: Stage 2(first at <console>:18)
14/07/18 16:31:31 INFO DAGScheduler: Parents of final stage: List()
14/07/18 16:31:31 INFO DAGScheduler: Missing parents: List()
14/07/18 16:31:31 INFO DAGScheduler: Computing the requested partition locally
14/07/18 16:31:31 INFO HadoopRDD: Input split: hdfs://<my url>
14/07/18 16:31:31 INFO DAGScheduler: Failed to run first at <console>:18
java.lang.NullPointerException
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1902)
        at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:190)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:181)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-Avro-Sequence-Files-tp10201.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.