Re: spark sql writing in avro
I probably did not do a good enough job explaining the problem. If you used Maven with the default Maven repository, you have an old version of spark-avro that does not contain AvroSaver and does not have the saveAsAvro method implemented. Assuming you use the default Maven repo location:

cd ~/.m2/repository/com/databricks/spark-avro_2.10/0.1
jar tvf spark-avro_2.10-0.1.jar | grep AvroSaver

This comes up empty. The jar file does not contain this class because AvroSaver.scala wasn't added until Jan 21, and the jar file is from 14 November. So:

git clone g...@github.com:databricks/spark-avro.git
cd spark-avro
sbt publish-m2

This publishes the latest master code (which includes AvroSaver etc.) to your local Maven repo, and Maven will pick up the latest version of spark-avro (for this machine). Now you should be able to compile and run.

HTH,
Markus

On 03/12/2015 11:55 PM, Kevin Peng wrote:
Dale,

I basically have the same Maven dependency as above, but my code will not compile because it cannot reference AvroSaver, although the saveAsAvro reference compiles fine, which is weird. Even though saveAsAvro compiles for me, it errors out when running the Spark job because it is not implemented (the job quits and reports a "not implemented" method or something along those lines). I will try the spark-shell route and pass in the jar built from GitHub, since I haven't tried that yet.

On Thu, Mar 12, 2015 at 6:44 PM, M. Dale <medal...@yahoo.com> wrote:
Short answer: if you downloaded spark-avro from the repo.maven.apache.org repo you might be using an old version (pre-November 14, 2014) - see the timestamps at http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/. Lots of changes at https://github.com/databricks/spark-avro since then.

Databricks, thank you for sharing the Avro code! Could you please push out the latest version or update the version number and republish to repo.maven.apache.org (I have no idea how jars get there)? Or is there a different repository that users should point to for this artifact?

Workaround: download from https://github.com/databricks/spark-avro, build with the latest functionality (still version 0.1), and add it to your local Maven or Ivy repo.

Long version: I used a default Maven build and declared my dependency on:

<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>0.1</version>
</dependency>

Maven downloaded the 0.1 version from http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/ and included it in my app code jar. From spark-shell:

import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// This schema includes LONG for time in millis
// (https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl)
val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
java.lang.RuntimeException: Unsupported type LONG

However, after checking out the spark-avro code from its GitHub repo and adding a test case against the MailRecord avro, everything ran fine. So I built the Databricks spark-avro locally on my box, put it in my local Maven repo, and everything worked from spark-shell when adding that jar as a dependency. Hope this helps for the save case as well.

On the pre-14NOV version, avro.scala says:

// TODO: Implement me.
implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
  def saveAsAvroFile(path: String): Unit = ???
}

Markus

On 03/12/2015 07:05 PM, kpeng1 wrote:
Hi All,

I am currently trying to write out a schema RDD to Avro. I noticed that there is a Databricks spark-avro library and I have included it in my dependencies, but it looks like I am not able to access the AvroSaver object. On compilation of the job I get this:

error: not found: value AvroSaver
[ERROR] AvroSaver.save(resultRDD, args(4))

I also tried calling saveAsAvro on the resultRDD (the actual RDD with the results) and that passes compilation, but when I run the code I get an error that says saveAsAvro is not implemented. I am using version 0.1 of spark-avro_2.10.
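For what it's worth, once the rebuilt jar is on the classpath, the save call should look roughly like this untested sketch (resultRDD and args(4) are the placeholders from the original question):

import com.databricks.spark.avro._

// resultRDD is the SchemaRDD from the original question and args(4) the
// output path - both are placeholders here.
AvroSaver.save(resultRDD, args(4))

// or, using the implicit wrapper on SchemaRDD from master:
resultRDD.saveAsAvroFile(args(4))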
Re: spark sql writing in avro
Short answer: if you downloaded spark-avro from the repo.maven.apache.org repo you might be using an old version (pre-November 14, 2014) - see the timestamps at http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/. Lots of changes at https://github.com/databricks/spark-avro since then.

Databricks, thank you for sharing the Avro code! Could you please push out the latest version or update the version number and republish to repo.maven.apache.org (I have no idea how jars get there)? Or is there a different repository that users should point to for this artifact?

Workaround: download from https://github.com/databricks/spark-avro, build with the latest functionality (still version 0.1), and add it to your local Maven or Ivy repo.

Long version: I used a default Maven build and declared my dependency on:

<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>0.1</version>
</dependency>

Maven downloaded the 0.1 version from http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/ and included it in my app code jar. From spark-shell:

import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// This schema includes LONG for time in millis
// (https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl)
val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
java.lang.RuntimeException: Unsupported type LONG

However, after checking out the spark-avro code from its GitHub repo and adding a test case against the MailRecord avro, everything ran fine. So I built the Databricks spark-avro locally on my box, put it in my local Maven repo, and everything worked from spark-shell when adding that jar as a dependency. Hope this helps for the save case as well.

On the pre-14NOV version, avro.scala says:

// TODO: Implement me.
implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
  def saveAsAvroFile(path: String): Unit = ???
}

Markus

On 03/12/2015 07:05 PM, kpeng1 wrote:
Hi All,

I am currently trying to write out a schema RDD to Avro. I noticed that there is a Databricks spark-avro library and I have included it in my dependencies, but it looks like I am not able to access the AvroSaver object. On compilation of the job I get this:

error: not found: value AvroSaver
[ERROR] AvroSaver.save(resultRDD, args(4))

I also tried calling saveAsAvro on the resultRDD (the actual RDD with the results) and that passes compilation, but when I run the code I get an error that says saveAsAvro is not implemented. I am using version 0.1 of spark-avro_2.10.
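If it helps, here is a minimal sketch of pointing an sbt build at the locally published 0.1 jar (assuming sbt publish-m2 or publishLocal has been run as above; the Spark SQL version shown is illustrative, not taken from the original post):

// build.sbt - resolve the locally rebuilt spark-avro from ~/.m2
resolvers += Resolver.mavenLocal

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "1.2.1" % "provided", // illustrative version
  "com.databricks"   %% "spark-avro" % "0.1"                 // the locally built jar
)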
Re: IncompatibleClassChangeError
In Hadoop 1.x, TaskAttemptContext is a class (for example, https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/mapred/TaskAttemptContext.html). In Hadoop 2.x, TaskAttemptContext is an interface (https://hadoop.apache.org/docs/r2.4.0/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html).

From http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_vd_cdh_package_tarball.html it looks like CDH 5.3.2 uses Hadoop 2.5.

Are you using any third-party libraries that come in hadoop1 (default) vs. hadoop2 versions, like avro-mapred (see https://issues.apache.org/jira/browse/SPARK-3039)? If so, make sure you include:

<dependency>
  ...
  <classifier>hadoop2</classifier>
</dependency>

What version of Spark are you using? Are you using Avro? If so, note that SPARK-3039 is fixed in Spark 1.3.

Markus

On 03/05/2015 01:31 PM, ey-chih chow wrote:
Hi,

I am using CDH5.3.2 now for a Spark project. I got the following exception:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

I used all the CDH5.3.2 jar files in my pom file to generate the application jar file. What else should I do to fix the problem?

Ey-Chih Chow
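For example, if the job pulls in avro-mapred, the hadoop2 flavor has to be requested explicitly. A minimal sketch of how that might look in sbt (the Avro version here is illustrative, not taken from the original post):

// build.sbt - use the hadoop2 build of avro-mapred so it expects
// TaskAttemptContext to be an interface (Hadoop 2.x / CDH 5.x).
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"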
Re: “mapreduce.job.user.classpath.first” for Spark
Try spark.yarn.user.classpath.first (see https://issues.apache.org/jira/browse/SPARK-2996 - it only works for YARN). There is also a related thread at http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html.

HTH,
Markus

On 02/03/2015 11:20 PM, Corey Nolet wrote:
I'm having a really bad dependency conflict right now with Guava versions between my Spark application in YARN and (I believe) Hadoop's version. The problem is, my driver has the version of Guava which my application is expecting (15.0), while it appears the Spark executors that are working on my RDDs have a much older version (I assume it's the old version on the Hadoop classpath). Is there a property like 'mapreduce.job.user.classpath.first' that I can set to make sure my own classpath is established first on the executors?
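A minimal sketch of setting that flag when building the SparkContext (it can also be passed as --conf to spark-submit; the application name below is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// YARN-only (SPARK-2996): ask executors to put the user's jars ahead of the
// Spark/Hadoop classpath, so the application's Guava 15.0 wins.
val conf = new SparkConf()
  .setAppName("guava-15-app") // placeholder name
  .set("spark.yarn.user.classpath.first", "true")
val sc = new SparkContext(conf)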
Re: java.io.NotSerializableException: org.apache.avro.mapred.AvroKey using spark with avro
I did not encounter this with my Avro records using Spark 1.1.0 (see https://github.com/medale/spark-mail/blob/master/analytics/src/main/scala/com/uebercomputing/analytics/basic/UniqueSenderCounter.scala). I do use the default Java serialization, but all the fields in my Avro object are Serializable (no bytes/ByteBuffer). Does your Avro schema use bytes? If so, it seems that gets wrapped in a ByteBuffer, which is not Serializable. A quick search turned up a fix here: https://groups.google.com/forum/#!topic/spark-users/6HQPuxsCe0c

Hope this helps,
Markus

On 12/17/2014 08:14 PM, touchdown wrote:
Yeah, I have the same problem with 1.1.0, but not 1.0.0.
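One commonly suggested mitigation for NotSerializableException on Avro types (not necessarily the fix discussed in the linked thread) is to switch Spark's data serialization to Kryo, which does not require java.io.Serializable. A minimal sketch, with a placeholder application name:

import org.apache.spark.{SparkConf, SparkContext}

// Use Kryo for shuffle/cache serialization instead of Java serialization,
// so records do not need to implement java.io.Serializable.
val conf = new SparkConf()
  .setAppName("avro-kryo-example") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)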
Re: netty on classpath when using spark-submit
Tobias,

From http://spark.apache.org/docs/latest/configuration.html it seems that there is an experimental property:

spark.files.userClassPathFirst - Whether to give user-added jars precedence over Spark's own jars when loading classes in Executors. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature.

HTH,
Markus

On 11/04/2014 01:50 AM, Tobias Pfeiffer wrote:
Hi,

I tried hard to get a version of netty into my jar file created with sbt assembly that works with all my libraries. Now I managed that and was really happy, but it seems like spark-submit puts an older version of netty on the classpath when submitting to a cluster, such that my code ends up with a NoSuchMethodError:

Code:

val a = new DefaultHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.POST, "http://localhost")
val f = new File(a.getClass.getProtectionDomain().getCodeSource().getLocation().getPath())
println(f.getAbsolutePath)
println("headers: " + a.headers())

When executed with sbt run:

~/.ivy2/cache/io.netty/netty/bundles/netty-3.9.4.Final.jar
headers: org.jboss.netty.handler.codec.http.DefaultHttpHeaders@64934069

When executed with spark-submit:

~/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
Exception in thread "main" java.lang.NoSuchMethodError: org.jboss.netty.handler.codec.http.DefaultHttpRequest.headers()Lorg/jboss/netty/handler/codec/http/HttpHeaders;
...

How can I get the old netty version off my classpath?

Thanks
Tobias
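A minimal sketch of turning that property on when constructing the SparkContext (it can also be passed via --conf to spark-submit; the application name is a placeholder, and since the setting was experimental it may not resolve every conflict):

import org.apache.spark.{SparkConf, SparkContext}

// Experimental: prefer user-added jars (e.g., the newer netty bundled via
// sbt assembly) over Spark's own jars when executors load classes.
val conf = new SparkConf()
  .setAppName("netty-classpath-test") // placeholder name
  .set("spark.files.userClassPathFirst", "true")
val sc = new SparkContext(conf)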