I am trying to exclude the Hadoop jar dependencies from Spark's assembly files.
The reason is that, in order to work on our cluster, we need to use our own
version of those files instead of the published ones. I tried defining the
Hadoop dependencies as "provided", but surprisingly this causes compilation
errors in the build. Just to be clear, I modified the sbt build file
as follows:
def yarnEnabledSettings = Seq(
  libraryDependencies ++= Seq(
    // Exclude rule required for all?
    "org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided"
      excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
    "org.apache.hadoop" % "hadoop-yarn-api" % hadoopVersion % "provided"
      excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
    "org.apache.hadoop" % "hadoop-yarn-common" % hadoopVersion % "provided"
      excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
    "org.apache.hadoop" % "hadoop-yarn-client" % hadoopVersion % "provided"
      excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib)
  )
)
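As an aside, my understanding is that "provided" dependencies stay on the
compile classpath in sbt, so the compilation errors may come from sibling
sub-projects rather than from the scope itself. When using "provided" with
sbt-assembly, a companion setting I have seen suggested (a sketch assuming
sbt 0.13, untested against the Spark build) re-adds those jars to the
classpath of the run task, which otherwise fails at runtime:

// Sketch (sbt 0.13): point `run` at the Compile classpath, which still
// contains "provided" dependencies, so `sbt run` sees the hadoop jars
// even though `assembly` omits them. Untested here.
run in Compile <<= Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
)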
I then compile as:

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_IS_NEW_HADOOP=true sbt assembly

But the assembly still includes the Hadoop libraries, contrary to what the
sbt-assembly docs say. I managed to exclude them instead by using the
non-recommended way:
def extraAssemblySettings() = Seq(
  test in assembly := {},
  mergeStrategy in assembly := {
    case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
    case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
    case "log4j.properties" => MergeStrategy.discard
    case m if m.toLowerCase.startsWith("meta-inf/services/") =>
      MergeStrategy.filterDistinctLines
    case "reference.conf" => MergeStrategy.concat
    case _ => MergeStrategy.first
  },
  // Non-recommended workaround: drop any classpath entry whose jar file
  // name contains "hadoop" from the assembly.
  excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
    cp filter { _.data.getName.contains("hadoop") }
  }
)
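Since matching on the jar file name is brittle (it would also drop any
unrelated jar that happens to contain "hadoop" in its name), a variant that
filters by the resolved module's organization might be cleaner. This is only
a sketch against the same sbt 0.13 / sbt-assembly API, untested:

excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  // Select the entries to exclude: those resolved from org.apache.hadoop.
  // Untested sketch; relies on sbt attaching moduleID.key metadata to
  // managed classpath entries.
  cp filter { entry =>
    entry.get(moduleID.key).exists(_.organization == "org.apache.hadoop")
  }
}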
But I would like to hear whether there is interest in excluding the Hadoop
jars from the assembly by default in the build, perhaps behind an opt-in
flag in the style of the existing SPARK_YARN env var.
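For illustration only, a hypothetical sketch of such a flag (the
SPARK_HADOOP_PROVIDED name is made up, and only hadoop-client is shown):

// Hypothetical opt-in flag, modeled on the existing SPARK_YARN env var.
val hadoopProvided = scala.util.Properties.envOrNone("SPARK_HADOOP_PROVIDED").isDefined
// Scope the hadoop dependencies as "provided" only when the flag is set,
// leaving the default assembly contents unchanged.
val hadoopScope = if (hadoopProvided) "provided" else "compile"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion % hadoopScope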
Alex Cozzi