Re: The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Sean Owen
True, although a number of other little issues make me, personally,
not want to continue down this road:

- There are already a lot of build profiles to try to cover Hadoop versions
- I don't think it's quite right to have vendor-specific builds in
Spark to begin with
- We should be moving to only support Hadoop 2 soon IMHO anyway
- CDH4 is EOL in a few months I think

On Fri, Feb 20, 2015 at 8:30 AM, Mingyu Kim m...@palantir.com wrote:
 Hi all,

 Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4
 build, which is built with “mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0
 -DskipTests clean package”, pulls in avro-mapred hadoop1, as opposed to
 avro-mapred hadoop2. This results in the same error as described in the
 linked bug (pasted below).

 The right solution would be to create a hadoop-2.0 profile that sets
 avro.mapred.classifier to hadoop2, and to build the CDH4 package with the
 “-Phadoop-2.0” option.

 What do people think?

 Mingyu

 ——

 java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
     at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
     at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
     at org.apache.spark.scheduler.Task.run(Task.scala:56)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)





The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Mingyu Kim
Hi all,

Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4
build, which is built with “mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests
clean package”, pulls in avro-mapred hadoop1, as opposed to avro-mapred
hadoop2. This results in the same error as described in the linked bug (pasted
below).
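
For anyone who wants to confirm this locally, one way (just a suggested check
using the standard Maven dependency plugin, not part of any official build
step) is to look at which avro-mapred artifact the build resolves:

    # Inspect the resolved avro-mapred dependency; with the default flags
    # above, the classifier in the output shows up as hadoop1.
    mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 \
        dependency:tree -Dincludes=org.apache.avro:avro-mapred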

The right solution would be to create a hadoop-2.0 profile that sets
avro.mapred.classifier to hadoop2, and to build the CDH4 package with the
“-Phadoop-2.0” option.
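
To make that concrete, here is a rough sketch of what such a profile could
look like in the parent pom. This is only an illustration of the idea; the
exact placement and any other properties the profile might need are open:

    <!-- Sketch only: a profile for Hadoop 2.0.x builds (e.g. 2.0.0-mr1-cdh4.2.0) -->
    <!-- that switches avro-mapred to its hadoop2 classifier.                     -->
    <profile>
      <id>hadoop-2.0</id>
      <properties>
        <avro.mapred.classifier>hadoop2</avro.mapred.classifier>
      </properties>
    </profile>

The CDH4 package would then be built with something like “mvn -Phadoop-2.0
-Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package”.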

What do people think?

Mingyu

——

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
    at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)



Re: The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Mingyu Kim
Thanks for the explanation.

To be clear, I meant to speak for any Hadoop 2 release before 2.2 that has a
profile in Spark. I referred to CDH4 because that's the only Hadoop 2.0/2.1
version Spark ships a prebuilt package for.

I understand the hesitation to make a code change if Spark doesn't plan to
support Hadoop 2.0/2.1 in general. (Please note, this is not specific to
CDH4.) If so, may I propose alternative options until Spark moves to
supporting only Hadoop 2?

- Build the CDH4 package with “-Davro.mapred.classifier=hadoop2” (see the
example command after this list), and update
http://spark.apache.org/docs/latest/building-spark.html for all “2.0.*”
examples.
- Build the CDH4 package as is, but note the known issue clearly on the
“download” page.
- Simply do not ship a CDH4 prebuilt package, and let people figure it out
themselves. Preferably, note in the documentation that
“-Davro.mapred.classifier=hadoop2” should be used for all Hadoop “2.0.*”
builds.
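
For reference, the first option would amount to adding one property to the
current release build command, along these lines (a sketch, not a tested
release command):

    # Sketch: CDH4 package build forcing the hadoop2 avro-mapred artifact.
    mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Davro.mapred.classifier=hadoop2 \
        -DskipTests clean package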

Please let me know what you think!

Mingyu





On 2/20/15, 2:34 AM, Sean Owen so...@cloudera.com wrote:

True, although a number of other little issues make me, personally,
not want to continue down this road:

- There are already a lot of build profiles to try to cover Hadoop
versions
- I don't think it's quite right to have vendor-specific builds in
Spark to begin with
- We should be moving to only support Hadoop 2 soon IMHO anyway
- CDH4 is EOL in a few months I think

On Fri, Feb 20, 2015 at 8:30 AM, Mingyu Kim m...@palantir.com wrote:
 Hi all,

 Related to https://issues.apache.org/jira/browse/SPARK-3039, the default
 CDH4 build, which is built with “mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0
 -DskipTests clean package”, pulls in avro-mapred hadoop1, as opposed to
 avro-mapred hadoop2. This results in the same error as described in the
 linked bug (pasted below).

 The right solution would be to create a hadoop-2.0 profile that sets
 avro.mapred.classifier to hadoop2, and to build the CDH4 package with the
 “-Phadoop-2.0” option.

 What do people think?

 Mingyu

 ——

 java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
     at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
     at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
     at org.apache.spark.scheduler.Task.run(Task.scala:56)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)


