Benjamin,

Thanks for the snippet. I tried using it, but unfortunately I get the
following exception. I am at a loss as to what might be wrong. Any ideas?

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
    at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
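
Could this be the Hadoop 1 vs. Hadoop 2 mismatch in avro-mapred? As far
as I understand, the default avro-mapred artifact is compiled against
Hadoop 1, where TaskAttemptContext is a class, while on Hadoop 2 it is
an interface. Would something like this in build.sbt be the fix (the
version number below is a guess on my part)?

// Pull the Hadoop 2 build of avro-mapred; the default (hadoop1)
// artifact expects TaskAttemptContext to be a class.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"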


Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Wed, Nov 5, 2014 at 4:24 PM, Laird, Benjamin <benjamin.la...@capitalone.com> wrote:

> Something like this works and is how I create an RDD of specific records.
>
> val avroRdd = sc.newAPIHadoopFile("twitter.avro",
>   classOf[AvroKeyInputFormat[twitter_schema]],
>   classOf[AvroKey[twitter_schema]],
>   classOf[NullWritable],
>   conf)
>
> (From
> https://github.com/julianpeeters/avro-scala-macro-annotation-examples/blob/master/spark/src/main/scala/AvroSparkScala.scala)
>
> Keep in mind you'll need to use the kryo serializer as well.
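>
> A minimal sketch of that configuration (untested; the app name is a
> placeholder):
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> // Avro specific records are not Java-serializable, so Spark tasks
> // that shuffle or collect them need Kryo instead.
> val sparkConf = new SparkConf()
>   .setAppName("AvroSpecificRecords") // placeholder
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> val sc = new SparkContext(sparkConf)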
>
> From: Frank Austin Nothaft <fnoth...@berkeley.edu>
> Date: Wednesday, November 5, 2014 at 5:06 PM
> To: Simone Franzini <captainfr...@gmail.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: AVRO specific records
>
> Hi Simone,
>
> Matt Massie put together a good tutorial on his blog
> <http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/>. If you’re
> looking for more code using Avro, we use it pretty extensively in our
> genomics project. Our Avro schemas are here
> <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>,
> and we have serialization code here
> <https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization>.
> We use Parquet for storing the Avro records, but there is also an Avro
> HadoopInputFormat.
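>
> If you also need the write side of the question, here is a rough,
> untested sketch using the new-API AvroKeyOutputFormat; MyRecord stands
> in for your generated specific record class, and "output.avro" is a
> placeholder path:
>
> import org.apache.avro.mapred.AvroKey
> import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
> import org.apache.hadoop.io.NullWritable
> import org.apache.hadoop.mapreduce.Job
> import org.apache.spark.SparkContext._ // implicit pair-RDD functions
>
> val job = Job.getInstance(sc.hadoopConfiguration)
> // Generated specific record classes carry their schema in SCHEMA$.
> AvroJob.setOutputKeySchema(job, MyRecord.SCHEMA$)
>
> // records: RDD[(AvroKey[MyRecord], NullWritable)]
> records.saveAsNewAPIHadoopFile(
>   "output.avro",
>   classOf[AvroKey[MyRecord]],
>   classOf[NullWritable],
>   classOf[AvroKeyOutputFormat[MyRecord]],
>   job.getConfiguration)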
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
> On Nov 5, 2014, at 1:25 PM, Simone Franzini <captainfr...@gmail.com> wrote:
>
> How can I read/write AVRO specific records?
> I found several snippets using generic records, but nothing with specific
> records so far.
>
> Thanks,
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
