[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304993#comment-14304993 ]

Sean Owen commented on SPARK-3039:
----------------------------------

I think Spark's dependency tree is already far too complex to fully avoid conflicts. It doesn't converge, but manages to work. This is indeed a big source of complexity and problems. I'd love to whittle down the extent and number of permutations to support, but that is a story for a little later.

> Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-3039
>                 URL: https://issues.apache.org/jira/browse/SPARK-3039
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, Input/Output, Spark Core
>    Affects Versions: 0.9.1, 1.0.0, 1.1.0, 1.2.0
>         Environment: hadoop2, hadoop-2.4.0, HDP-2.1
>            Reporter: Bertrand Bossy
>            Assignee: Bertrand Bossy
>            Priority: Critical
>
> The spark assembly contains the artifact "org.apache.avro:avro-mapred" as a
> dependency of "org.spark-project.hive:hive-serde".
> The avro-mapred package provides a hadoop FileInputFormat to read and write
> avro files. There are two versions of this package, distinguished by a
> classifier. avro-mapred for the new Hadoop API uses the classifier "hadoop2".
> avro-mapred for the old Hadoop API uses no classifier.
> E.g. when reading avro files using
> {code}
> sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
> {code}
> the following error occurs:
> {code}
> java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>         at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
>         at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
>         at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> This error is usually a hint that the old and the new Hadoop API were mixed
> up. As a work-around, if avro-mapred for hadoop2 is "forced" to appear before
> the version that is bundled with Spark, reading avro files works fine.
> Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.
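For context, here is a minimal, self-contained sketch of the failing read path. It uses GenericRecord so no generated Avro class is needed (SomeClass in the report stands for any such class); the app name and HDFS path are hypothetical, and this is illustrative rather than code from the Spark sources:

{code}
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object AvroReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-read"))
    // Read Avro files through the new (mapreduce) Hadoop API.
    val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]]("hdfs://path/to/file.avro")
    // With the hadoop1 avro-mapred on the classpath of a hadoop2 cluster,
    // the action below is where the IncompatibleClassChangeError surfaces.
    println(rdd.count())
    sc.stop()
  }
}
{code}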
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304991#comment-14304991 ]

Markus Dale commented on SPARK-3039:
-------------------------------------

The Apache Maven Enforcer plugin has a dependencyConvergence rule that could be added to ensure that dependencies/transitive dependencies don't clash. Maybe this could be added to a "deploy" profile to be checked during release builds (initially with fail set to false until all dependency convergence errors are fixed). See https://issues.apache.org/jira/browse/SPARK-5584.

For the spark-1.3.0-SNAPSHOT, it looks like the fix for "[SPARK-4048] Enhance and extend hadoop-provided profile" introduced a new scope property in the Maven build:

{code}
grep -R hive.deps.scope *
assembly/pom.xml: provided
examples/pom.xml: provided
pom.xml: compile
pom.xml: ${hive.deps.scope}
pom.xml: ${hive.deps.scope}
pom.xml: ${hive.deps.scope}
pom.xml: ${hive.deps.scope}
pom.xml: ${hive.deps.scope}
pom.xml: ${hive.deps.scope}
pom.xml: ${hive.deps.scope}
{code}

avro-mapred and hive-exec are marked with that scope, so neither library nor their dependencies will be included in the spark assembly jar. This means that Spark jobs that want to use an Avro-based InputFormat from avro-mapred have to include their desired avro-mapred version in their own jars. So this particular problem will be gone, but we still need to prevent this class of problems by ensuring dependency convergence.
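As a sketch of that enforcer suggestion (profile name and placement are hypothetical, not from the Spark build; {{<fail>false</fail>}} keeps the rule advisory until convergence is actually achieved):

{code:xml}
<profile>
  <id>deploy</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-enforcer-plugin</artifactId>
        <executions>
          <execution>
            <id>enforce-convergence</id>
            <goals>
              <goal>enforce</goal>
            </goals>
            <configuration>
              <rules>
                <dependencyConvergence/>
              </rules>
              <!-- Report violations without failing the build, initially. -->
              <fail>false</fail>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
{code}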
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304986#comment-14304986 ]

Sean Owen commented on SPARK-3039:
----------------------------------

[~bbossy] See the good analysis and additional fix in the PR above: https://github.com/apache/spark/pull/4315
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304971#comment-14304971 ]

Bertrand Bossy commented on SPARK-3039:
---------------------------------------

Hmm... is there a way to test the contents of the assembly jar, or what jars get packaged? I fear that this will come up again.
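One possibility, as a quick sketch rather than a proper build-time test: list the assembly's entries directly and look for avro classes. The jar path below is hypothetical and depends on the build profile:

{code}
import java.util.jar.JarFile
import scala.collection.JavaConverters._

object AssemblyContents {
  def main(args: Array[String]): Unit = {
    // Hypothetical path; substitute the assembly your build actually produces.
    val jar = new JarFile("assembly/target/scala-2.10/spark-assembly-1.2.0-hadoop2.4.0.jar")
    // Print every entry whose path mentions avro, e.g. org/apache/avro/mapreduce/...
    jar.entries().asScala
      .map(_.getName)
      .filter(_.toLowerCase.contains("avro"))
      .foreach(println)
    jar.close()
  }
}
{code}

A real guard would assert on the presence or absence of specific entries as part of the build; note that the hadoop1 and hadoop2 avro-mapred jars contain the same class names, so telling them apart needs a class-shape check rather than a name check.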
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302027#comment-14302027 ]

Apache Spark commented on SPARK-3039:
--------------------------------------

User 'medale' has created a pull request for this issue: https://github.com/apache/spark/pull/4315
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301780#comment-14301780 ]

Markus Dale commented on SPARK-3039:
-------------------------------------

For me, Spark 1.2.0, whether downloaded as spark-1.2.0-bin-hadoop2.4.tgz or compiled from source with

{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package
{code}

still had the same problem:

{noformat}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
{noformat}

Starting the build with a clean .m2/repository, the repository afterwards contained:

* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want)

It seems that shading both of these dependencies into the spark-assembly jar resulted in the error above, at least in the downloaded hadoop2.4 Spark bin and my own build. Running the following (after doing a mvn install and a by-hand copy of all the spark artifacts into my local repo for spark-repl/yarn):

{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree -Dincludes=org.apache.avro:avro-mapred
{code}

showed that the culprit was in the Hive project, namely org.spark-project.hive:hive-exec's dependency on 1.7.5:

{noformat}
Building Spark Project Hive 1.2.0
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}

Editing spark-1.2.0/sql/hive/pom.xml and excluding avro-mapred from hive-exec, then recompiling, fixed the problem, and the resulting dist works well against Avro/Hadoop2 code:

{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}

Just the last exclusion was added. Will try to do a pull request if that's not already addressed in the latest code.
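For projects that consume these artifacts through sbt and hit the same clash, a roughly equivalent exclusion can be expressed in the sbt DSL (a sketch under the same coordinates as above, not part of any PR):

{code}
// build.sbt sketch: keep hive-exec but drop its hadoop1 avro-mapred,
// then pull in the hadoop2-classified avro-mapred explicitly.
libraryDependencies ++= Seq(
  ("org.spark-project.hive" % "hive-exec" % "0.13.1a")
    .exclude("org.apache.avro", "avro-mapred"),
  "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
)
{code}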
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240143#comment-14240143 ]

Clay Kim commented on SPARK-3039:
----------------------------------

I finally managed to get this to work, using the prebuilt spark-1.1.1-bin-hadoop1 and avro-mapred 1.7.6 with the "hadoop2" classifier:

{code}
"org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2",
{code}
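In context, that line sits inside an sbt libraryDependencies block; a minimal sketch of the surrounding build.sbt, with the spark-core coordinates assumed rather than taken from Clay's setup:

{code}
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.1" % "provided",
  // Declare the hadoop2-classified avro-mapred explicitly so it takes
  // precedence over the avro-mapred bundled in the Spark assembly.
  "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
)
{code}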
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238197#comment-14238197 ]

Derrick Burns commented on SPARK-3039:
---------------------------------------

Spark 1.1.1/Hadoop 1.0.4:

{quote}
java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
        at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

14/12/08 10:21:06 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
        at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
        (identical stack trace as above)
{quote}
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237880#comment-14237880 ]

Bertrand Bossy commented on SPARK-3039:
---------------------------------------

[~derrickburns]: Can you post some more info, such as the Spark version/distribution used, etc.? Spark's build system has received some updates. I still have to verify this, but AFAIK these kinds of issues should not be present in future releases.

Also: have a careful look at the stack trace. If you have

{code}
Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
{code}

then avro-mapred for hadoop1 was found, but avro-mapred for hadoop2 was expected. However, if you have

{code}
Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
{code}

or something similar, it's the other way round.
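A quick runtime diagnostic along those lines, as a sketch that can be pasted into a spark-shell or any JVM sharing the job's classpath: on Hadoop 2, TaskAttemptContext is an interface, while on Hadoop 1 it is a class, and the code-source location shows which avro-mapred jar actually won:

{code}
// Which Hadoop API shape is on the classpath? true means the hadoop2 API.
val tac = Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext")
println(s"TaskAttemptContext is an interface: ${tac.isInterface}")

// Which jar provides AvroKeyInputFormat? The location reveals the
// avro-mapred artifact (hadoop1-default vs. hadoop2-classified).
val akif = Class.forName("org.apache.avro.mapreduce.AvroKeyInputFormat")
println(akif.getProtectionDomain.getCodeSource.getLocation)
{code}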
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237648#comment-14237648 ]

Derrick Burns commented on SPARK-3039:
---------------------------------------

I get the same bug when attempting to save an RDD as a Parquet file when using Hadoop 1.0.4.
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209486#comment-14209486 ]

Bertrand Bossy commented on SPARK-3039:
---------------------------------------

Needs more fixes: although I got the build to work for hadoop-2.4, yarn and hive with the sbt build, it doesn't work with the maven build, which AFAIK is required for PySpark. Dependency management seems to be quite different in sbt and maven.
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098398#comment-14098398 ]

Bertrand Bossy commented on SPARK-3039:
---------------------------------------

Also need to update the README: see SPARK-3069 and https://github.com/apache/spark/pull/1945
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096973#comment-14096973 ]

Apache Spark commented on SPARK-3039:
--------------------------------------

User 'bbossy' has created a pull request for this issue: https://github.com/apache/spark/pull/1945