I am trying to exclude the Hadoop jar dependencies from Spark's assembly files.
The reason is that, in order to work on our cluster, we need to use our own
version of those files instead of the published ones. I tried defining the
Hadoop dependencies as "provided", but surprisingly this causes compilation
errors in the build. Just to be clear, I modified the sbt build file
as follows:
def yarnEnabledSettings = Seq(
  libraryDependencies ++= Seq(
    // Exclude rule required for all?
    "org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided"
      excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
    "org.apache.hadoop" % "hadoop-yarn-api" % hadoopVersion % "provided"
      excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
    "org.apache.hadoop" % "hadoop-yarn-common" % hadoopVersion % "provided"
      excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
    "org.apache.hadoop" % "hadoop-yarn-client" % hadoopVersion % "provided"
      excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib)
  )
)
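As an aside, my understanding is that "provided" dependencies stay on the
compile classpath in sbt, so the compilation errors may come from sibling
sub-projects rather than from the scope itself. When using "provided" with
sbt-assembly, a companion setting I have seen suggested (a sketch assuming
sbt 0.13, untested against the Spark build) re-adds those jars to the
classpath of the run task, which otherwise fails at runtime:

// Sketch (sbt 0.13): point `run` at the Compile classpath, which still
// contains "provided" dependencies, so `sbt run` sees the hadoop jars
// even though `assembly` omits them. Untested here.
run in Compile <<= Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
)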
I then compile as:

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_IS_NEW_HADOOP=true sbt assembly

But the assembly still includes the Hadoop libraries, contrary to what the
sbt-assembly docs say. I managed to exclude them instead by using the
non-recommended way:
def extraAssemblySettings() = Seq(
  test in assembly := {},
  mergeStrategy in assembly := {
    case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
    case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
    case "log4j.properties" => MergeStrategy.discard
    case m if m.toLowerCase.startsWith("meta-inf/services/") =>
      MergeStrategy.filterDistinctLines
    case "reference.conf" => MergeStrategy.concat
    case _ => MergeStrategy.first
  },
  // Non-recommended workaround: drop any classpath entry whose jar file
  // name contains "hadoop" from the assembly.
  excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
    cp filter { _.data.getName.contains("hadoop") }
  }
)
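Since matching on the jar file name is brittle (it would also drop any
unrelated jar that happens to contain "hadoop" in its name), a variant that
filters by the resolved module's organization might be cleaner. This is only
a sketch against the same sbt 0.13 / sbt-assembly API, untested:

excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  // Select the entries to exclude: those resolved from org.apache.hadoop.
  // Untested sketch; relies on sbt attaching moduleID.key metadata to
  // managed classpath entries.
  cp filter { entry =>
    entry.get(moduleID.key).exists(_.organization == "org.apache.hadoop")
  }
}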
But I would like to hear whether there is interest in excluding the Hadoop
jars from the assembly by default in the build, perhaps behind an opt-in
flag in the style of the existing SPARK_YARN env var.
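For illustration only, a hypothetical sketch of such a flag (the
SPARK_HADOOP_PROVIDED name is made up, and only hadoop-client is shown):

// Hypothetical opt-in flag, modeled on the existing SPARK_YARN env var.
val hadoopProvided = scala.util.Properties.envOrNone("SPARK_HADOOP_PROVIDED").isDefined
// Scope the hadoop dependencies as "provided" only when the flag is set,
// leaving the default assembly contents unchanged.
val hadoopScope = if (hadoopProvided) "provided" else "compile"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion % hadoopScope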
Alex Cozzi