Hi,
This email is sent to both the dev and user lists. I would like to see if
someone familiar with the Spark/Maven build procedure can help.
I am building Spark 1.2.2 with the following command:
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0
The resulting spark-assembly-1.2.2-hadoop2.2.0.jar contains avro and avro-ipc
version 1.7.6, but avro-mapred version 1.7.1, which caused a weird runtime
exception when I tried to read an Avro file in Spark 1.2.2, as follows:
NullPointerException
    at java.io.StringReader.<init>(StringReader.java:50)
    at org.apache.avro.Schema$Parser.parse(Schema.java:943)
    at org.apache.avro.Schema.parse(Schema.java:992)
    at org.apache.avro.mapred.AvroJob.getInputSchema(AvroJob.java:65)
    at org.apache.avro.mapred.AvroRecordReader.<init>(AvroRecordReader.java:43)
    at org.apache.avro.mapred.AvroInputFormat.getRecordReader(AvroInputFormat.java:52)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
So I ran the following command, which shows that avro-mapred 1.7.1 is brought
in by the Hive 0.12 profile:
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0 dependency:tree
-Dverbose -Dincludes=org.apache.avro
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Hive 1.2.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.2
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.2.2:compile
[INFO] |  \- org.apache.hadoop:hadoop-client:jar:2.2.0:compile (version managed from 1.0.4)
[INFO] |     \- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO] |        \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] +- org.spark-project.hive:hive-serde:jar:0.12.0-protobuf-2.5:compile
[INFO] |  +- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.1:compile
[INFO] |     \- (org.apache.avro:avro-ipc:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] +- org.apache.avro:avro:jar:1.7.6:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]    +- org.apache.avro:avro-ipc:jar:1.7.6:compile
[INFO]    |  \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO]    \- org.apache.avro:avro-ipc:jar:tests:1.7.6:compile
[INFO]       \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO]
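
For anyone who wants to spot this kind of mismatch without reading the tree by
eye, here is a minimal sketch (not part of Spark or Maven) that scans
dependency:tree text and reports artifacts that show up with more than one
version. The names COORD, versions_by_artifact, conflicts and SAMPLE are
hypothetical, SAMPLE is a shortened excerpt of the output above, and grouping
by groupId:artifactId while ignoring the classifier is an assumption about
what should count as "the same artifact".

```python
import re
from collections import defaultdict

# Assumed coordinate layout in dependency:tree output:
#   groupId:artifactId:type[:classifier]:version:scope
# The root project line has no scope, so it is intentionally not matched.
COORD = re.compile(
    r"(?P<group>[\w.\-]+):(?P<artifact>[\w.\-]+):(?P<type>[\w\-]+)"
    r"(?::(?P<classifier>[\w.\-]+))?:(?P<version>[\w.\-]+)"
    r":(?P<scope>compile|runtime|provided|test|system)\b"
)

# Shortened excerpt of the tree output (raw string keeps the "\-" markers).
SAMPLE = r"""[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.2
[INFO] +- org.spark-project.hive:hive-serde:jar:0.12.0-protobuf-2.5:compile
[INFO] |  +- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.1:compile
[INFO] +- org.apache.avro:avro:jar:1.7.6:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
"""


def versions_by_artifact(tree_output):
    """Map 'groupId:artifactId' to the set of versions seen in the tree."""
    seen = defaultdict(set)
    for line in tree_output.splitlines():
        match = COORD.search(line)
        if match:
            key = f"{match['group']}:{match['artifact']}"
            seen[key].add(match["version"])
    return seen


def conflicts(tree_output):
    """Return only the artifacts that appear with more than one version."""
    return {
        key: sorted(versions)
        for key, versions in versions_by_artifact(tree_output).items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    for artifact, versions in conflicts(SAMPLE).items():
        print(artifact, versions)
```

To run it on a real build, the output of the mvn dependency:tree command above
could be saved to a file and its contents passed to conflicts() in place of
SAMPLE.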
In this case, I could manually replace the avro-mapred 1.7.1 classes in the
final jar with the 1.7.6 ones, but that approach is very error-prone, so I
wonder if there is a better solution.
Also, from the output above, I can see that the
org.apache.avro:avro-mapred:jar:hadoop2:1.7.6 dependency is there, but it looks
like it is being omitted from the assembly. I am not sure why Maven chose the
lower version, as I am not a Maven guru.
My question: in the above situation, is there an easy way to build Spark with
avro-mapred 1.7.6 instead of 1.7.1?
Thanks
Yong