Ok, that turned out to be a dependency issue between the Hadoop1 and Hadoop2 artifacts, which I have not fully solved yet. I am able to run with Hadoop1 and AVRO in standalone mode, but not with Hadoop2 (even after trying to fix the dependencies).
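In case it is useful to anyone looking at this, here is the kind of build change I have been experimenting with to line up the Hadoop2 dependencies. It is only a sketch and the versions are assumptions, not something I have verified end to end; the idea is that the default avro-mapred artifact is compiled against the old (Hadoop1) mapreduce API, while the "hadoop2" classifier targets the new one.

// build.sbt sketch (versions are assumptions, not verified)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.0" % "provided",
  "org.apache.avro"  %  "avro"        % "1.7.7",
  // The "hadoop2" classifier of avro-mapred is built against the new
  // org.apache.hadoop.mapreduce API; the default artifact expects the Hadoop1 classes.
  "org.apache.avro"  %  "avro-mapred" % "1.7.7" classifier "hadoop2"
)

My understanding is that with the hadoop2 classifier in place the AvroKeyInputFormat stack trace further down in this thread should go away, but again, I have not confirmed this yet.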
Anyway, I am now trying to write to AVRO, using a snippet very similar to the one I use to read from AVRO:

val withValues : RDD[(AvroKey[Subscriber], NullWritable)] = records.map{s => (new AvroKey(s), NullWritable.get)}
val outPath = "myOutputPath"
val writeJob = new Job()
FileOutputFormat.setOutputPath(writeJob, new Path(outPath))
AvroJob.setOutputKeySchema(writeJob, Subscriber.getClassSchema())
writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Any]])
records.saveAsNewAPIHadoopFile(outPath,
  classOf[AvroKey[Subscriber]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[Subscriber]],
  writeJob.getConfiguration)

Now, my problem is that this writes a plain text file. I need it to write binary AVRO. What am I missing?

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Thu, Nov 6, 2014 at 3:15 PM, Simone Franzini <captainfr...@gmail.com> wrote:

> Benjamin,
>
> Thanks for the snippet. I have tried using it, but unfortunately I get the
> following exception. I am clueless as to what might be wrong. Any ideas?
>
> java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>   at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
>
> Simone Franzini, PhD
> http://www.linkedin.com/in/simonefranzini
>
> On Wed, Nov 5, 2014 at 4:24 PM, Laird, Benjamin <benjamin.la...@capitalone.com> wrote:
>
>> Something like this works and is how I create an RDD of specific records:
>>
>> val avroRdd = sc.newAPIHadoopFile("twitter.avro",
>>   classOf[AvroKeyInputFormat[twitter_schema]],
>>   classOf[AvroKey[twitter_schema]],
>>   classOf[NullWritable], conf)
>>
>> (From https://github.com/julianpeeters/avro-scala-macro-annotation-examples/blob/master/spark/src/main/scala/AvroSparkScala.scala)
>>
>> Keep in mind you'll need to use the kryo serializer as well.
>>
>> From: Frank Austin Nothaft <fnoth...@berkeley.edu>
>> Date: Wednesday, November 5, 2014 at 5:06 PM
>> To: Simone Franzini <captainfr...@gmail.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: AVRO specific records
>>
>> Hi Simone,
>>
>> Matt Massie put together a good tutorial on his blog
>> <http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/>. If you’re
>> looking for more code using Avro, we use it pretty extensively in our
>> genomics project.
>> Our Avro schemas are here
>> <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>,
>> and we have serialization code here
>> <https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization>.
>> We use Parquet for storing the Avro records, but there is also an Avro
>> HadoopInputFormat.
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnoth...@berkeley.edu
>> fnoth...@eecs.berkeley.edu
>> 202-340-0466
>>
>> On Nov 5, 2014, at 1:25 PM, Simone Franzini <captainfr...@gmail.com> wrote:
>>
>> How can I read/write AVRO specific records?
>> I found several snippets using generic records, but nothing with specific
>> records so far.
>>
>> Thanks,
>> Simone Franzini, PhD
>> http://www.linkedin.com/in/simonefranzini
>>
>> ------------------------------
>>
>> The information contained in this e-mail is confidential and/or
>> proprietary to Capital One and/or its affiliates. The information
>> transmitted herewith is intended only for use by the individual or entity
>> to which it is addressed. If the reader of this message is not the
>> intended recipient, you are hereby notified that any review,
>> retransmission, dissemination, distribution, copying or other use of, or
>> taking of any action in reliance upon this information is strictly
>> prohibited. If you have received this communication in error, please
>> contact the sender and delete the material from your computer.
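P.S. Regarding Benjamin's point about needing the Kryo serializer as well, this is roughly the setup I have in mind. It is only a sketch: the package name is made up, and it is an assumption on my part that Kryo's default field serialization is enough for the generated Subscriber class (chill-avro provides dedicated Avro serializers if it is not).

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator; Subscriber is my Avro-generated specific record class.
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Register the generated Avro specific record class with Kryo.
    kryo.register(classOf[Subscriber])
  }
}

// Point Spark at Kryo and at the registrator above ("mypackage" is a placeholder).
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "mypackage.AvroKryoRegistrator")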