Re: The default CDH4 build uses avro-mapred hadoop1
True, although a number of other little issues make me, personally, not want to continue down this road:

- There are already a lot of build profiles to try to cover Hadoop versions
- I don't think it's quite right to have vendor-specific builds in Spark to begin with
- We should be moving to only support Hadoop 2 soon, IMHO, anyway
- CDH4 is EOL in a few months, I think

On Fri, Feb 20, 2015 at 8:30 AM, Mingyu Kim <m...@palantir.com> wrote:

Hi all,

Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4 build, which is built with "mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package", pulls in avro-mapred hadoop1, as opposed to avro-mapred hadoop2. This ends up in the same error as mentioned in the linked bug (pasted below).

The right solution would be to create a hadoop-2.0 profile that sets avro.mapred.classifier to hadoop2, and to build the CDH4 package with the "-Phadoop-2.0" option. What do people think?

Mingyu

----

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
    at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
The default CDH4 build uses avro-mapred hadoop1
Hi all,

Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4 build, which is built with "mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package", pulls in avro-mapred hadoop1, as opposed to avro-mapred hadoop2. This ends up in the same error as mentioned in the linked bug (pasted below).

The right solution would be to create a hadoop-2.0 profile that sets avro.mapred.classifier to hadoop2, and to build the CDH4 package with the "-Phadoop-2.0" option. What do people think?

Mingyu

----

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
    at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
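For concreteness, the profile proposed above might look like the sketch below. This is a hypothetical fragment for Spark's root pom.xml: the profile id and the avro.mapred.classifier property name come from the thread, but the exact structure is an assumption, not the committed change.

```xml
<!-- Hypothetical hadoop-2.0 profile for Spark's root pom.xml (sketch only).
     Selecting it with -Phadoop-2.0 would pin the CDH4 Hadoop version and
     switch the avro-mapred dependency to its hadoop2 classifier. -->
<profile>
  <id>hadoop-2.0</id>
  <properties>
    <hadoop.version>2.0.0-mr1-cdh4.2.0</hadoop.version>
    <avro.mapred.classifier>hadoop2</avro.mapred.classifier>
  </properties>
</profile>
```

For this to take effect, the avro-mapred dependency declaration would need to reference the property via `<classifier>${avro.mapred.classifier}</classifier>`, so that `mvn -Phadoop-2.0 -DskipTests clean package` resolves the hadoop2 artifact.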
Re: The default CDH4 build uses avro-mapred hadoop1
Thanks for the explanation. To be clear, I meant to speak for any Hadoop 2 releases before 2.2 that have profiles in Spark. I referred to CDH4 since that's the only Hadoop 2.0/2.1 version Spark ships a prebuilt package for.

I understand the hesitation to make a code change if Spark doesn't plan to support Hadoop 2.0/2.1 in general. (Please note, this is not specific to CDH4.) If so, can I propose alternative options until Spark moves to supporting only Hadoop 2?

- Build the CDH4 package with "-Davro.mapred.classifier=hadoop2", and update http://spark.apache.org/docs/latest/building-spark.html for all "2.0.*" examples.
- Build the CDH4 package as is, but note the known issues clearly on the "download" page.
- Simply do not ship a CDH4 prebuilt package, and let people figure it out themselves. Preferably, note in the documentation that "-Davro.mapred.classifier=hadoop2" should be used for all Hadoop "2.0.*" builds.

Please let me know what you think!

Mingyu

On 2/20/15, 2:34 AM, Sean Owen <so...@cloudera.com> wrote:

True, although a number of other little issues make me, personally, not want to continue down this road:

- There are already a lot of build profiles to try to cover Hadoop versions
- I don't think it's quite right to have vendor-specific builds in Spark to begin with
- We should be moving to only support Hadoop 2 soon, IMHO, anyway
- CDH4 is EOL in a few months, I think

On Fri, Feb 20, 2015 at 8:30 AM, Mingyu Kim <m...@palantir.com> wrote:

Hi all, Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4 build pulls in avro-mapred hadoop1, as opposed to avro-mapred hadoop2. [snip]
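As a sketch of the first option above, the build and a quick sanity check might look like the commands below. This assumes a Spark source checkout with Maven installed, and that the pom wires the avro.mapred.classifier property to the avro-mapred `<classifier>`; if it does, the `dependency:tree` output should show the `hadoop2` classifier rather than `hadoop1`.

```
# Build the CDH4 package, forcing the hadoop2 variant of avro-mapred
# (property name taken from the thread).
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Davro.mapred.classifier=hadoop2 \
    -DskipTests clean package

# Confirm which avro-mapred artifact the build actually resolved.
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Davro.mapred.classifier=hadoop2 \
    dependency:tree -Dincludes=org.apache.avro:avro-mapred
```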