[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304993#comment-14304993
 ] 

Sean Owen commented on SPARK-3039:
--

I think Spark's dependency tree is already far too complex to actually avoid 
conflicts. It doesn't converge, but it manages to work. This is indeed a big 
source of complexity and problems. I'd love to whittle down the extent and 
number of permutations to support, but that is a story for a little later.

> Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 
> 1 API
> --
>
> Key: SPARK-3039
> URL: https://issues.apache.org/jira/browse/SPARK-3039
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Input/Output, Spark Core
>Affects Versions: 0.9.1, 1.0.0, 1.1.0, 1.2.0
> Environment: hadoop2, hadoop-2.4.0, HDP-2.1
>Reporter: Bertrand Bossy
>Assignee: Bertrand Bossy
>Priority: Critical
>
> The spark assembly contains the artifact "org.apache.avro:avro-mapred" as a 
> dependency of "org.spark-project.hive:hive-serde".
> The avro-mapred package provides a hadoop FileInputFormat to read and write 
> avro files. There are two versions of this package, distinguished by a 
> classifier. avro-mapred for the new Hadoop API uses the classifier "hadoop2". 
> avro-mapred for the old Hadoop API uses no classifier.
> E.g. when reading avro files using 
> {code}
> sc.newAPIHadoopFile[AvroKey[SomeClass],NullWritable,AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
> {code}
> The following error occurs:
> {code}
> java.lang.IncompatibleClassChangeError: Found interface 
> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
> at 
> org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> {code}
> This error is usually a hint that the old and the new Hadoop APIs were mixed 
> up. As a work-around, if avro-mapred for hadoop2 is "forced" to appear before 
> the version that is bundled with Spark, reading avro files works fine. 
> Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.






[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-04 Thread Markus Dale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304991#comment-14304991
 ] 

Markus Dale commented on SPARK-3039:


The Apache Maven Enforcer plugin has a dependency convergence rule that could 
be added to ensure that dependencies/transitive dependencies don't clash. Maybe 
this could be added to a "deploy" profile to be checked during release builds, 
initially with fail set to false until all dependency convergence errors are 
fixed. See https://issues.apache.org/jira/browse/SPARK-5584.
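
A minimal sketch of how such a profile could look (the profile id, plugin 
version, and placement are illustrative assumptions, not taken from the Spark 
build; dependencyConvergence is the Enforcer rule in question, and fail=false 
only reports violations instead of failing the build):

{code:xml}
<profile>
  <id>deploy</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-enforcer-plugin</artifactId>
        <version>1.3.1</version>
        <executions>
          <execution>
            <id>enforce-dependency-convergence</id>
            <goals>
              <goal>enforce</goal>
            </goals>
            <configuration>
              <rules>
                <dependencyConvergence/>
              </rules>
              <!-- initially only report convergence errors; flip to true once the tree converges -->
              <fail>false</fail>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
{code}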

For spark-1.3.0-SNAPSHOT, it looks like the fix for "[SPARK-4048] Enhance and 
extend hadoop-provided profile" introduced a new scope property in the Maven 
build:

{code}
grep -R hive.deps.scope *
assembly/pom.xml:        <hive.deps.scope>provided</hive.deps.scope>
examples/pom.xml:        <hive.deps.scope>provided</hive.deps.scope>
pom.xml:    <hive.deps.scope>compile</hive.deps.scope>
pom.xml:        <scope>${hive.deps.scope}</scope>
pom.xml:        <scope>${hive.deps.scope}</scope>
pom.xml:        <scope>${hive.deps.scope}</scope>
pom.xml:        <scope>${hive.deps.scope}</scope>
pom.xml:        <scope>${hive.deps.scope}</scope>
pom.xml:        <scope>${hive.deps.scope}</scope>
pom.xml:        <scope>${hive.deps.scope}</scope>
{code}

avro-mapred and hive-exec are marked with that scope, so neither library nor 
their dependencies will be included in the Spark assembly jar. This means that 
Spark jobs that want to use the Avro-based InputFormats from avro-mapred have to 
include their desired avro-mapred version in their own jars. So this particular 
problem will be gone, but we still need to prevent this class of problems by 
ensuring dependency convergence.
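
For illustration, such a job could declare the hadoop2 flavour of avro-mapred 
explicitly in its own pom, roughly along these lines (the 1.7.6 version is just 
the one mentioned elsewhere in this ticket):

{code:xml}
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.6</version>
  <!-- the hadoop2 classifier selects the new-Hadoop-API build of avro-mapred -->
  <classifier>hadoop2</classifier>
</dependency>
{code}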


[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304986#comment-14304986
 ] 

Sean Owen commented on SPARK-3039:
--

[~bbossy] See the good analysis and additional fix in the PR above: 
https://github.com/apache/spark/pull/4315  




[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-04 Thread Bertrand Bossy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304971#comment-14304971
 ] 

Bertrand Bossy commented on SPARK-3039:
---

Hmm... is there a way to test the contents of the assembly jar, or which jars get 
packaged? I fear that this will come up again.




[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302027#comment-14302027
 ] 

Apache Spark commented on SPARK-3039:
-

User 'medale' has created a pull request for this issue:
https://github.com/apache/spark/pull/4315




[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-02 Thread Markus Dale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301780#comment-14301780
 ] 

Markus Dale commented on SPARK-3039:


For me, Spark 1.2.0, whether downloaded as spark-1.2.0-bin-hadoop2.4.tgz or
compiled from source with

{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package
{code}

still had the same problem:

{noformat}
java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at 
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)

{noformat}

Starting the build with a clean .m2/repository, the repository afterwards 
contained:

* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want). 

It seemed that shading these two dependencies into the spark-assembly jar 
resulted in the error above, at least in the downloaded hadoop2.4 Spark binary 
and in my own build.

Running the following (after doing a mvn install and by-hand copy of all the 
spark artifacts into my local repo for spark-repl/yarn):

{code}
 mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree 
-Dincludes=org.apache.avro:avro-mapred
{code}

This showed that the culprit was in the Hive project, namely 
org.spark-project.hive:hive-exec's dependency on avro-mapred 1.7.5.

{noformat}
Building Spark Project Hive 1.2.0
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}

Editing spark-1.2.0/sql/hive/pom.xml to exclude avro-mapred from hive-exec and 
then recompiling fixed the problem, and the resulting dist works well against
Avro/Hadoop2 code:

{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}
   
Only the last exclusion was added. I will try to do a pull request if that's not 
already addressed in the latest code.


[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-12-09 Thread Clay Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240143#comment-14240143
 ] 

Clay Kim commented on SPARK-3039:
-

I finally managed to get this to work, using the prebuilt spark-1.1.1-bin-hadoop1 
and avro-mapred 1.7.6 with the "hadoop2" classifier:

{code}
"org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2",
{code}




[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-12-08 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238197#comment-14238197
 ] 

Derrick Burns commented on SPARK-3039:
--

Spark 1.1.1/Hadoop 1.0.4

{quote}
java.lang.IncompatibleClassChangeError: Found class 
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at 
org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/12/08 10:21:06 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught 
exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.IncompatibleClassChangeError: Found class 
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at 
org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

{quote}



[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-12-08 Thread Bertrand Bossy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237880#comment-14237880
 ] 

Bertrand Bossy commented on SPARK-3039:
---

@[~derrickburns]: Can you post some more info, such as the Spark 
version/distribution used, etc.? Spark's build system has received some 
updates. I still have to verify this, but AFAIK these kinds of issues should 
not be present in future releases.

Also, have a careful look at the stack trace:
if you see {code}Found interface 
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected{code} 
then avro-mapred for hadoop1 was found, but avro-mapred for hadoop2 was 
expected. However, if you see
{code}Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface 
was expected{code} or something similar, it's the other way round.





[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-12-08 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237648#comment-14237648
 ] 

Derrick Burns commented on SPARK-3039:
--

I get the same bug when attempting to save an RDD as a Parquet file when using 
Hadoop 1.0.4.




[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-11-13 Thread Bertrand Bossy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209486#comment-14209486
 ] 

Bertrand Bossy commented on SPARK-3039:
---

Needs more fixes: although I got the build to work for hadoop-2.4, yarn, and 
hive with the sbt build, it doesn't work with the maven build, which AFAIK is 
required for PySpark. Dependency management seems to be quite different in 
sbt and maven.




[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-08-15 Thread Bertrand Bossy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098398#comment-14098398
 ] 

Bertrand Bossy commented on SPARK-3039:
---

We also need to update the README: see SPARK-3069 and 
https://github.com/apache/spark/pull/1945




[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096973#comment-14096973
 ] 

Apache Spark commented on SPARK-3039:
-

User 'bbossy' has created a pull request for this issue:
https://github.com/apache/spark/pull/1945
