[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301780#comment-14301780 ]
Markus Dale commented on SPARK-3039:
------------------------------------

For me, Spark 1.2.0, whether downloaded as spark-1.2.0-bin-hadoop2.4.tgz or compiled from source with

{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package
{code}

still had the same problem:

{noformat}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
{noformat}

After starting the build with a clean .m2/repository, the repository contained:
* avro-mapred/1.7.5 (the default jar, i.e. the hadoop1 flavor)
* avro-mapred/1.7.6 with avro-mapred-1.7.6-hadoop2.jar (the one we want)

It seems that shading both of these dependencies into the spark-assembly jar produced the error above, at least in the downloaded hadoop2.4 Spark binary and in my own build.

Running the following (after a mvn install and a by-hand copy of all the Spark artifacts into my local repository for spark-repl/yarn):

{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree -Dincludes=org.apache.avro:avro-mapred
{code}

showed that the culprit was in the Spark Project Hive module, namely org.spark-project.hive:hive-exec's dependency on avro-mapred 1.7.5:

{noformat}
Building Spark Project Hive 1.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
{noformat}

Editing spark-1.2.0/sql/hive/pom.xml to exclude avro-mapred from hive-exec and recompiling fixed the problem; the resulting distribution works well against Avro/Hadoop2 code:

{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}

Only the last exclusion is new. Will try to do a pull request if that's not already addressed in the latest code.
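For reference, a quick way to confirm which Hadoop API flavor actually ended up on the classpath is a probe along these lines in spark-shell (a minimal diagnostic sketch of my own, not from the original report; it uses only standard Java reflection and the real Hadoop/Avro class names):

{code}
// In Hadoop 2, org.apache.hadoop.mapreduce.TaskAttemptContext is an interface;
// in Hadoop 1 it was a class. avro-mapred's hadoop1 flavor was compiled against
// the class, which is exactly what the IncompatibleClassChangeError complains about.
println(classOf[org.apache.hadoop.mapreduce.TaskAttemptContext].isInterface)
// expect: true on a Hadoop 2 classpath

// Show which jar the Avro input format was actually loaded from
// (points at the assembly jar when running against a Spark assembly).
println(classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[_]]
  .getProtectionDomain.getCodeSource.getLocation)
{code}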
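And after rebuilding with the exclusion, a smoke test of the new Hadoop API read path, mirroring the snippet from the issue description (a sketch to run in spark-shell; GenericRecord and the hdfs path are placeholders for whatever schema/file you have):

{code}
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Reads an Avro file through the new Hadoop API; with the hadoop1 flavor of
// avro-mapred shaded into the assembly, this is the call that threw
// IncompatibleClassChangeError.
val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("hdfs://path/to/file.avro")
println(records.count())
{code}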
> Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-3039
>                 URL: https://issues.apache.org/jira/browse/SPARK-3039
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, Input/Output, Spark Core
>    Affects Versions: 0.9.1, 1.0.0, 1.1.0
>         Environment: hadoop2, hadoop-2.4.0, HDP-2.1
>            Reporter: Bertrand Bossy
>            Assignee: Bertrand Bossy
>             Fix For: 1.2.0
>
> The spark assembly contains the artifact "org.apache.avro:avro-mapred" as a dependency of "org.spark-project.hive:hive-serde".
> The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier: avro-mapred for the new Hadoop API uses the classifier "hadoop2", while avro-mapred for the old Hadoop API uses no classifier.
> E.g. when reading avro files using
> {code}
> sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
> {code}
> the following error occurs:
> {code}
> java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>         at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
>         at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
>         at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> This error is usually a hint that the old and the new Hadoop APIs were mixed up. As a workaround, if avro-mapred for hadoop2 is "forced" to appear on the classpath before the version that is bundled with Spark, reading avro files works fine.
> Spark built against avro-mapred for hadoop2 also works fine.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org