[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301780#comment-14301780 ]

Markus Dale edited comment on SPARK-3039 at 2/2/15 8:40 PM:
------------------------------------------------------------

For me, Spark 1.2.0, whether downloaded as spark-1.2.0-bin-hadoop2.4.tgz or
compiled from source with

{code}
mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests clean package
{code}

still had the same problem:

{noformat}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)

{noformat}

After running the build against a clean .m2/repository, the repository
contained:

* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want). 
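
For reference, a quick way to see what Maven resolved (assuming the default local repository location):

{code}
# List the avro-mapred artifacts Maven pulled down (default ~/.m2 location assumed)
ls ~/.m2/repository/org/apache/avro/avro-mapred/1.7.5/
# expect: avro-mapred-1.7.5.jar (the hadoop1 default jar)
ls ~/.m2/repository/org/apache/avro/avro-mapred/1.7.6/
# expect: avro-mapred-1.7.6-hadoop2.jar (the hadoop2 classifier we want)
{code}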

It seemed that shading both of these dependencies into the spark-assembly jar
resulted in the error above, at least in the downloaded hadoop2.4 Spark binary
and in my own build.
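
One way to check which copy actually got shaded into the assembly is to disassemble the bundled class (the assembly jar path below is illustrative and may differ):

{code}
# Hypothetical check: avro-mapred compiled against hadoop1 calls TaskAttemptContext
# methods via invokevirtual (class); the hadoop2 build uses invokeinterface.
javap -c -classpath assembly/target/scala-2.10/spark-assembly-1.2.0-hadoop2.4.0.jar \
  org.apache.avro.mapreduce.AvroRecordReaderBase | grep -i TaskAttemptContext
{code}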

Running the following (after a mvn install and copying all the Spark
artifacts by hand into my local repository for spark-repl/yarn)

{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree -Dincludes=org.apache.avro:avro-mapred
{code}

showed that the culprit was in the Hive project, namely
org.spark-project.hive:hive-exec's dependency on avro-mapred 1.7.5:

{noformat}
Building Spark Project Hive 1.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}

Editing spark-1.2.0/sql/hive/pom.xml to exclude avro-mapred from hive-exec
and then recompiling fixed the problem; the resulting distribution works well
against Avro/Hadoop2 code:

{code:xml}
    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>${hive.version}</version>
      <exclusions>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.esotericsoftware.kryo</groupId>
          <artifactId>kryo</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-mapred</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
{code}
   
Only the last exclusion was added; a quick way to verify the fix is sketched
below. I will try to submit a pull request if this is not already addressed
in the latest code.
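
The verification sketch, reusing the same profiles and the dependency:tree invocation from above:

{code}
# Rebuild with the exclusion in place, then confirm only the hadoop2
# classifier of avro-mapred remains in spark-hive's dependency tree.
mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests clean package
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree \
  -Dincludes=org.apache.avro:avro-mapred
# expect only: org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
{code}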



> Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-3039
>                 URL: https://issues.apache.org/jira/browse/SPARK-3039
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, Input/Output, Spark Core
>    Affects Versions: 0.9.1, 1.0.0, 1.1.0
>         Environment: hadoop2, hadoop-2.4.0, HDP-2.1
>            Reporter: Bertrand Bossy
>            Assignee: Bertrand Bossy
>             Fix For: 1.2.0
>
>
> The spark assembly contains the artifact "org.apache.avro:avro-mapred" as a 
> dependency of "org.spark-project.hive:hive-serde".
> The avro-mapred package provides a hadoop FileInputFormat to read and write 
> avro files. There are two versions of this package, distinguished by a 
> classifier. avro-mapred for the new Hadoop API uses the classifier "hadoop2". 
> avro-mapred for the old Hadoop API uses no classifier.
> E.g. when reading avro files using 
> {code}
> sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
> {code}
> The following error occurs:
> {code}
> java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>         at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
>         at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
>         at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> This error is usually a hint that the old and the new Hadoop API were mixed
> up. As a workaround, if avro-mapred for hadoop2 is "forced" to appear before
> the version that is bundled with Spark, reading avro files works fine.
> Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.
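
A minimal sketch of the classpath workaround described above (the jar path, application class, and application jar name are hypothetical, not from the report):

{code}
# Prepend the hadoop2 avro-mapred jar on the driver and executor classpaths
# so it wins over the hadoop1 copy shaded into the Spark assembly.
# All paths and the application class below are illustrative.
spark-submit \
  --driver-class-path /path/to/avro-mapred-1.7.6-hadoop2.jar \
  --conf spark.executor.extraClassPath=/path/to/avro-mapred-1.7.6-hadoop2.jar \
  --class com.example.AvroReadJob my-app.jar
{code}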


