Re: How can I read this avro file using spark scala?
I am confused as to whether Avro support was merged into Spark 1.2 or whether it is still an independent library. I see some people writing sqlContext.avroFile similarly to jsonFile, but this does not work for me, nor do I see it in the Scala docs.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-scala-tp19400p21601.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Re: How can I read this avro file using spark scala?
Check this link: https://github.com/databricks/spark-avro (the home page for the spark-avro project).

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:19 PM, Todd bit1...@163.com wrote:

Databricks provides sample code on its website, but I can't find it right now.

At 2015-02-12 00:43:07, captainfranz captainfr...@gmail.com wrote: [...]
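For readers landing on this thread: with the spark-avro package on the classpath (e.g. via --jars or an SBT dependency), loading a file looks roughly like the sketch below. This follows the sqlContext.avroFile entry point discussed in this thread; the exact API depends on which spark-avro version you build, and the path is a placeholder.

```scala
// Sketch: reading an Avro file with the early spark-avro API discussed here.
// Assumes spark-avro is on the classpath; the file path is a placeholder.
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // adds avroFile to SQLContext

object ReadAvroExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ReadAvroExample").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Load the Avro file, inspect its schema, and query it through Spark SQL
    val episodes = sqlContext.avroFile("/tmp/episodes.avro")
    episodes.printSchema()
    episodes.registerTempTable("episodes")
    sqlContext.sql("SELECT * FROM episodes LIMIT 5").collect().foreach(println)
  }
}
```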
Re: How can I read this avro file using spark scala?
Thanks for the feedback, I filed a couple of issues: https://github.com/databricks/spark-avro/issues

On Fri, Nov 21, 2014 at 5:04 AM, thomas j beanb...@googlemail.com wrote: [...]
Re: How can I read this avro file using spark scala?
Thanks for the pointer Michael.

I've downloaded Spark 1.2.0 from https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and built the spark-avro repo you linked to. When I run it against the example avro file linked to in the documentation, it works.

However, when I try to load my avro file (linked to in my original question) I receive the following error:

    java.lang.RuntimeException: Unsupported type LONG
        at scala.sys.package$.error(package.scala:27)
        at com.databricks.spark.avro.AvroRelation.com$databricks$spark$avro$AvroRelation$$toSqlType(AvroRelation.scala:116)
        at com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:97)
        at com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:96)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        ...

If this is useful, I'm happy to try loading the various different avro files I have to help battle-test spark-avro.

Thanks

On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust mich...@databricks.com wrote: [...]
Re: How can I read this avro file using spark scala?
I've been able to load a different avro file based on GenericRecord with:

    val person = sqlContext.avroFile("/tmp/person.avro")

When I try to call `first()` on it, I get NotSerializableException exceptions again:

    person.first()
    ...
    14/11/21 12:59:17 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 20)
    java.io.NotSerializableException: org.apache.avro.generic.GenericData$Record
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        ...

Apart from this, I want to transform the records into pairs of (user_id, record). I can do this by specifying the offset of the user_id column with something like this:

    person.map(r => (r.getInt(2), r)).take(4)

Is there any way to specify the column name (user_id) instead of needing to know/calculate the offset?

Thanks again

On Fri, Nov 21, 2014 at 11:48 AM, thomas j beanb...@googlemail.com wrote: [...]
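On the column-name question above: since avroFile returns a SchemaRDD, one way to avoid hard-coding the offset is to look the field up in the schema once and close over the resulting index. This is a sketch assuming the `sqlContext` and `person` file from the message above, and that `schema.fieldNames` is available on the loaded SchemaRDD (per the Spark 1.2-era API).

```scala
// Sketch: deriving the user_id offset from the schema instead of hard-coding it.
// Assumes spark-avro on the classpath; /tmp/person.avro and user_id are from the thread.
val person = sqlContext.avroFile("/tmp/person.avro")

// Look the column up by name once, on the driver
val idx = person.schema.fieldNames.indexOf("user_id")

// Close over the plain Int index (serializable), not the schema object
val pairs = person.map(r => (r.getInt(idx), r))
pairs.take(4).foreach(println)
```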
Re: How can I read this avro file using spark scala?
I have also been struggling with reading avro. Very glad to hear that there is a new avro library coming in Spark 1.2 (which, by the way, seems to have a lot of other very useful improvements). In the meanwhile, I have been able to piece together several snippets/tips that I found from various sources and I am now able to read/write avro correctly. From my understanding, you basically need 3 pieces:

1. Use the Kryo serializer.
2. Register your avro classes. I have done this using twitter chill 0.4.0.
3. Read/write avro with a snippet of code like the one you posted.

Here is the relevant code (hopefully all of it). Note that schema, inputPath, outputPath and MyAvroClass are placeholders for your own schema, paths and generated Avro class.

    // All of the following are needed in order to read/write AVRO files
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import org.apache.hadoop.fs.{ FileSystem, Path }
    // Uncomment the following line if you want to use generic AVRO, I am using specific
    //import org.apache.avro.generic.GenericData
    import org.apache.avro.Schema
    import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat }
    import org.apache.avro.mapred.AvroKey
    // Kryo/avro serialization stuff
    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.avro.AvroSerializer
    import org.apache.spark.serializer.{ KryoSerializer, KryoRegistrator }
    import org.apache.spark.{ SparkConf, SparkContext }

    object MyApp {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.kryo.registrator", "MyRegistrator")
        val sc = new SparkContext(conf)

        // Read
        val readJob = new Job()
        AvroJob.setInputKeySchema(readJob, schema)
        val rdd = sc.newAPIHadoopFile(inputPath,
            classOf[AvroKeyInputFormat[MyAvroClass]],
            classOf[AvroKey[MyAvroClass]],
            classOf[NullWritable],
            readJob.getConfiguration)
          .map { t => t._1.datum }

        // Write
        val rddAvroWritable = rdd.map { s => (new AvroKey(s), NullWritable.get) }
        val writeJob = new Job()
        AvroJob.setOutputKeySchema(writeJob, schema)
        writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[MyAvroClass]])
        rddAvroWritable.saveAsNewAPIHadoopFile(outputPath,
          classOf[AvroKey[MyAvroClass]],
          classOf[NullWritable],
          classOf[AvroKeyOutputFormat[MyAvroClass]],
          writeJob.getConfiguration)
      }
    }

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Put a line like the following for each of your Avro classes if you use specific Avro
        // If you use generic Avro, chill also has a function for that: GenericRecordSerializer
        kryo.register(classOf[MyAvroClass],
          AvroSerializer.SpecificRecordBinarySerializer[MyAvroClass])
      }
    }

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Fri, Nov 21, 2014 at 7:04 AM, thomas j beanb...@googlemail.com wrote: [...]
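As the comment in the registrator above hints, chill-avro also covers the generic-record case. Below is a hedged sketch of what that registration might look like; the GenericRecordSerializer name comes from the comment above, but its exact signature is an assumption, so check it against your chill-avro version, and the inline schema is a made-up placeholder.

```scala
// Sketch: Kryo registration for *generic* Avro records via chill-avro's
// GenericRecordSerializer (signature is an assumption; verify against your version).
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.avro.AvroSerializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.spark.serializer.KryoRegistrator

class GenericAvroRegistrator extends KryoRegistrator {
  // Placeholder schema; in practice parse the schema your files were written with
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Person","fields":[{"name":"user_id","type":"int"}]}""")

  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[GenericData.Record],
      AvroSerializer.GenericRecordSerializer[GenericData.Record](schema))
  }
}
```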
Re: How can I read this avro file using spark scala?
One option (starting with Spark 1.2, which is currently in preview) is to use the Avro library for Spark SQL. This is very new, but we would love to get feedback: https://github.com/databricks/spark-avro

On Thu, Nov 20, 2014 at 10:19 AM, al b beanb...@googlemail.com wrote:

I've read several posts from people struggling to read avro in Spark. The examples I've tried don't work. When I try this solution (https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files) I get errors:

    java.io.NotSerializableException: org.apache.avro.mapred.AvroWrapper

How can I read the following sample file in Spark using Scala?

http://www.4shared.com/file/SxnYcdgJce/sample.html

Thomas