Re: NullPointerException When Reading Avro Sequence Files

2014-12-15 Thread Simone Franzini
To me this looks like an error internal to the REPL: the trace below shows it
failing in Job.toString while trying to print the result, because the job is
still in the DEFINE state. I am not sure what is causing that.
Personally I never use the REPL. Can you try typing up your program and
running it from an IDE or with spark-submit, and see if you still get the same
error?

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, Dec 15, 2014 at 4:54 PM, Cristovao Jose Domingues Cordeiro 
cristovao.corde...@cern.ch wrote:

  Sure, thanks:
 warning: there were 1 deprecation warning(s); re-run with -deprecation for
 details
 java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
 at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
 at org.apache.hadoop.mapreduce.Job.toString(Job.java:462)
 at
 scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
 at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
 at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
 at .init(console:10)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
 at
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776)
 at
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619)
 at
 org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
 at
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




 Could something you omitted in your snippet be causing this exception?

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 15 December 2014 16:52

 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

   OK, I have no idea what that is; it appears to be an internal Spark
 exception. If you can post the entire stack trace, it may give some more
 details about what is going on.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Mon, Dec 15, 2014 at 4:50 PM, Cristovao Jose Domingues Cordeiro 
 cristovao.corde...@cern.ch wrote:

  Hi,

 thanks for that.
 But yes, the second line throws an exception: jobread is not created.

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 15 December 2014 16:39

 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

I did not mention the imports needed in my code. I think these are
 all of them:

  import org.apache.hadoop.mapreduce.Job
 import org.apache.hadoop.io.NullWritable
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
 import

Re: NullPointerException When Reading Avro Sequence Files

2014-12-09 Thread Simone Franzini
Hi Cristovao,

I have seen a very similar issue that I have posted about in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
I think your main issue here is somewhat similar, in that the MapWrapper
Scala class is not registered. This gets registered by the Twitter
chill-scala AllScalaRegistrar class that you are currently not using.

As far as I understand, in order to use Avro with Spark, you also have to
use Kryo. This means you have to use the Spark KryoSerializer. This in turn
uses Twitter chill. I posted the basic code that I am using here:

http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491
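
 For reference, the core of that setup is roughly the following sketch
 (untested here; MyAvroRecord is a hypothetical Avro SpecificRecord class,
 and the registrator class must be on the executor classpath):

```scala
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.AllScalaRegistrar
import com.twitter.chill.avro.AvroSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Registers the Scala collection wrappers (MapWrapper and friends)
    new AllScalaRegistrar()(kryo)
    // Serialize the (hypothetical) Avro record class via chill-avro
    kryo.register(classOf[MyAvroRecord],
      AvroSerializer.SpecificRecordBinarySerializer[MyAvroRecord])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
```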

Maybe there is a simpler solution to your problem but I am not that much of
an expert yet. I hope this helps.

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro 
cristovao.corde...@cern.ch wrote:

  Hi Simone,

 Thanks, but I don't think that's it.
 I've tried several libraries via the --jars argument. Some do give the error
 you mentioned, but other times (when I put in the right version, I guess) I
 get the following:
 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.io.NotSerializableException:
 scala.collection.convert.Wrappers$MapWrapper
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)


 Which is odd, since I am reading an Avro file I wrote... with the same piece
 of code:
 https://gist.github.com/MLnick/5864741781b9340cb211

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 06 December 2014 15:48
 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

   java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

  That is a sign that you are mixing up versions of Hadoop. This is
 particularly an issue when dealing with Avro. If you are using Hadoop 2,
 you will need the hadoop2 build of avro-mapred. In Maven you would select
 it with the <classifier>hadoop2</classifier> tag.
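
 For illustration, the dependency might look like this in the POM (the
 version number here is only an example):

```xml
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-mapred</artifactId>
    <version>1.7.7</version>
    <!-- selects the artifact built against the Hadoop 2 mapreduce API -->
    <classifier>hadoop2</classifier>
</dependency>
```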

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Fri, Dec 5, 2014 at 3:52 AM, cjdc cristovao.corde...@cern.ch wrote:

 Hi all,

 I've tried the above example on Gist, but it doesn't work (at least for
 me).
 Did anyone get this:
 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught
 exception
 in thread Thread[Executor task launch worker-0,5,main]
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229

Re: NullPointerException When Reading Avro Sequence Files

2014-12-09 Thread Simone Franzini
You can use this Maven dependency:

<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>chill-avro</artifactId>
    <version>0.4.0</version>
</dependency>

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Tue, Dec 9, 2014 at 9:53 AM, Cristovao Jose Domingues Cordeiro 
cristovao.corde...@cern.ch wrote:

  Thanks for the reply!

  I've tried your code, in fact, but I lack the Twitter chill package and I
 cannot find it online. So I am now trying this:
 http://spark.apache.org/docs/latest/tuning.html#data-serialization . In
 case I can't make that work, could you tell me where to get that Twitter
 package you used?

 Thanks

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 09 December 2014 16:42
 *To:* Cristovao Jose Domingues Cordeiro; user

 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

   Hi Cristovao,

 I have seen a very similar issue that I have posted about in this thread:

 http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
  I think your main issue here is somewhat similar, in that the MapWrapper
 Scala class is not registered. This gets registered by the Twitter
 chill-scala AllScalaRegistrar class that you are currently not using.

  As far as I understand, in order to use Avro with Spark, you also have
 to use Kryo. This means you have to use the Spark KryoSerializer. This in
 turn uses Twitter chill. I posted the basic code that I am using here:


 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491

  Maybe there is a simpler solution to your problem but I am not that much
 of an expert yet. I hope this helps.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro 
 cristovao.corde...@cern.ch wrote:

  Hi Simone,

 Thanks, but I don't think that's it.
 I've tried several libraries via the --jars argument. Some do give the error
 you mentioned, but other times (when I put in the right version, I guess) I
 get the following:
 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.io.NotSerializableException:
 scala.collection.convert.Wrappers$MapWrapper
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)


 Which is odd, since I am reading an Avro file I wrote... with the same piece
 of code:
 https://gist.github.com/MLnick/5864741781b9340cb211

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 06 December 2014 15:48
 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

  That is a sign that you are mixing up versions of Hadoop. This is
 particularly an issue when dealing with Avro. If you are using Hadoop 2,
 you will need the hadoop2 build of avro-mapred. In Maven you would select
 it with the <classifier>hadoop2</classifier> tag.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Fri, Dec 5, 2014 at 3:52 AM, cjdc cristovao.corde...@cern.ch wrote:

 Hi all,

 I've tried the above example on Gist, but it doesn't work (at least for
 me).
 Did anyone get this:
 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0
 (TID 0)
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615

Re: NullPointerException When Reading Avro Sequence Files

2014-12-05 Thread cjdc
Hi all,

I've tried the above example on Gist, but it doesn't work (at least for me).
Did anyone get this:
14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception
in thread Thread[Executor task launch worker-0,5,main]
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/12/05 10:44:40 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times;
aborting job


Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p20456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException When Reading Avro Sequence Files

2014-07-21 Thread Sparky
For those curious I used the JavaSparkContext and got access to an
AvroSequenceFile (wrapper around Sequence File) using the following:

file = sc.newAPIHadoopFile("<hdfs path to my file>",
    AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class,
    new Configuration());



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10305.html


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
I see Spark is using AvroRecordReaderBase, which reads Avro container
files; those are different from sequence files. If anyone is using Avro
sequence files with success and has an example, please let me know.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10233.html


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
To be more specific, I'm working with a system that stores data in
org.apache.avro.hadoop.io.AvroSequenceFile format. An AvroSequenceFile is
"A wrapper around a Hadoop SequenceFile that also supports reading and
writing Avro data."

It seems that Spark does not support this out of the box.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Nick Pentreath
I got this working locally a little while ago when playing around with
AvroKeyInputFormat: https://gist.github.com/MLnick/5864741781b9340cb211

But not sure about AvroSequenceFile. Any chance you have an example
datafile or records?



On Sat, Jul 19, 2014 at 11:00 AM, Sparky gullo_tho...@bah.com wrote:

 To be more specific, I'm working with a system that stores data in
 org.apache.avro.hadoop.io.AvroSequenceFile format. An AvroSequenceFile is
 "A wrapper around a Hadoop SequenceFile that also supports reading and
 writing Avro data."

 It seems that Spark does not support this out of the box.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html



Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
Thanks for the gist. I'm just now learning about Avro. I think when you use
a DataFileWriter you are writing to an Avro container file (which is
different from an Avro sequence file). I have a system where data was
written to an HDFS sequence file using AvroSequenceFile.Writer (which is a
wrapper around SequenceFile.Writer).

I'll put together an example of the problem so others can better understand
what I'm talking about.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10237.html


Re: NullPointerException When Reading Avro Sequence Files

2014-07-18 Thread aaronjosephs
I think you probably want to use `AvroSequenceFileInputFormat` with
`newAPIHadoopFile`. I'm not even sure that in Hadoop you would use
SequenceFileInputFormat to read an Avro sequence file.
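
For what it's worth, a rough Scala sketch of reading one via the new Hadoop
API (untested; the path is a placeholder, and the key/value types must match
the schemas the file was written with):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.avro.mapreduce.AvroSequenceFileInputFormat
import org.apache.hadoop.conf.Configuration

// sc is an existing SparkContext; the path is a placeholder
val rdd = sc.newAPIHadoopFile(
  "hdfs://...",
  classOf[AvroSequenceFileInputFormat[GenericRecord, GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[AvroValue[GenericRecord]],
  new Configuration())
```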



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10203.html


Re: NullPointerException When Reading Avro Sequence Files

2014-07-18 Thread Sparky
Thanks for responding. I tried using the newAPIHadoopFile method and got an
IOException with the message "Not a data file."

If anyone has an example of this working I'd appreciate your input or
examples.  

What I entered at the repl and what I got back are below:

val myAvroSequenceFile = sc.newAPIHadoopFile("hdfs://my url",
  classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

scala> myAvroSequenceFile.first()
14/07/18 17:02:38 INFO FileInputFormat: Total input paths to process : 1
14/07/18 17:02:38 INFO SparkContext: Starting job: first at console:19
14/07/18 17:02:38 INFO DAGScheduler: Got job 0 (first at console:19) with
1 output partitions (allowLocal=true)
14/07/18 17:02:38 INFO DAGScheduler: Final stage: Stage 0(first at
console:19)
14/07/18 17:02:38 INFO DAGScheduler: Parents of final stage: List()
14/07/18 17:02:38 INFO DAGScheduler: Missing parents: List()
14/07/18 17:02:38 INFO DAGScheduler: Computing the requested partition
locally
14/07/18 17:02:38 INFO NewHadoopRDD: Input split: hdfs:my url
14/07/18 17:02:38 WARN AvroKeyInputFormat: Reader schema was not set. Use
AvroJob.setInputKeySchema() if desired.
14/07/18 17:02:38 INFO AvroKeyInputFormat: Using a reader schema equal to
the writer schema.
14/07/18 17:02:38 INFO DAGScheduler: Failed to run first at console:19
org.apache.spark.SparkDriverExecutionException: Execution error
at
org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:585)
at
org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:563)
Caused by: java.io.IOException: Not a data file.
at 
org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
at org.apache.avro.file.DataFileReader.init(DataFileReader.java:97)
at
org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:180)
at
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:90)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:114)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at
org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:578)
... 1 more



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10204.html