Repository: spark
Updated Branches:
  refs/heads/branch-1.5 c39d5d144 -> 11c28a568


[SPARK-9593] [SQL] Fixes Hadoop shims loading

This PR is used to workaround CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1.

Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims 
by checking version information gathered from Hadoop jar files.  If the major 
version number is 1, `Hadoop20SShims` is loaded; if it is 2, `Hadoop23Shims` 
is chosen.  However, CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1 have 2 as the 
major version number but contain Hadoop 1 code.  This confuses Hive 
`ShimLoader` into loading the wrong version of the shims.
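
The dispatch described above can be sketched as follows.  This is an 
illustrative reconstruction, not Hive's actual `ShimLoader` source:

```scala
// Illustrative sketch (not Hive's actual ShimLoader code) of dispatching on
// the Hadoop major version number alone -- exactly the logic that misfires
// for 2.0.0-mr1-cdh4.1.1, whose major version is 2 despite its Hadoop 1 code.
object ShimDispatchSketch {
  def shimsClassFor(hadoopVersion: String): String =
    hadoopVersion.split('.').head.toInt match {
      case 1 => "org.apache.hadoop.hive.shims.Hadoop20SShims"
      case 2 => "org.apache.hadoop.hive.shims.Hadoop23Shims"
      case v => sys.error(s"Unrecognized Hadoop major version: $v")
    }
}
```

Under this sketch, `shimsClassFor("2.0.0-mr1-cdh4.1.1")` picks `Hadoop23Shims` 
-- the wrong choice that this PR works around.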

In this PR, we check for the existence of the 
`Path.getPathWithoutSchemeAndAuthority` method, which Hadoop 1 lacks (it's 
also the method that revealed this shims-loading issue), and load 
`Hadoop20SShims` when it is missing.
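
A method-existence probe of this kind can be sketched with plain Java 
reflection.  The helper below is illustrative (generic class/method 
parameters), not the committed code:

```scala
// Illustrative helper (not the committed code): returns true when the named
// class exposes a public method with the given name and parameter types.
// Probing for Path.getPathWithoutSchemeAndAuthority this way distinguishes a
// Hadoop 2 classpath from a Hadoop-1-style one.
def hasMethod(className: String, methodName: String, paramTypes: Class[_]*): Boolean =
  try {
    Class.forName(className).getMethod(methodName, paramTypes: _*)
    true
  } catch {
    case _: ClassNotFoundException | _: NoSuchMethodException => false
  }
```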

Author: Cheng Lian <l...@databricks.com>

Closes #7929 from liancheng/spark-9593/fix-hadoop-shims-loading and squashes 
the following commits:

c99b497 [Cheng Lian] Narrows down the fix to handle "2.0.0-*cdh4*" Hadoop 
versions only
b17e955 [Cheng Lian] Updates comments
490d8f2 [Cheng Lian] Fixes Scala style issue
9c6c12d [Cheng Lian] Fixes Hadoop shims loading

(cherry picked from commit 70112ff22bd1aee7689c5d3af9b66c9b8ceb3ec3)
Signed-off-by: Yin Huai <yh...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/11c28a56
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/11c28a56
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/11c28a56

Branch: refs/heads/branch-1.5
Commit: 11c28a568ea55fcde54048d357e0709e08be3072
Parents: c39d5d1
Author: Cheng Lian <l...@databricks.com>
Authored: Wed Aug 5 20:03:54 2015 +0800
Committer: Yin Huai <yh...@databricks.com>
Committed: Thu Aug 6 09:59:42 2015 -0700

----------------------------------------------------------------------
 .../spark/sql/hive/client/ClientWrapper.scala   | 48 ++++++++++++++++++++
 1 file changed, 48 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/11c28a56/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
----------------------------------------------------------------------
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
index dc372be..211a3b8 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
@@ -32,6 +32,8 @@ import org.apache.hadoop.hive.ql.metadata.Hive
 import org.apache.hadoop.hive.ql.processors._
 import org.apache.hadoop.hive.ql.session.SessionState
 import org.apache.hadoop.hive.ql.{Driver, metadata}
+import org.apache.hadoop.hive.shims.{HadoopShims, ShimLoader}
+import org.apache.hadoop.util.VersionInfo
 
 import org.apache.spark.Logging
 import org.apache.spark.sql.catalyst.expressions.Expression
@@ -62,6 +64,52 @@ private[hive] class ClientWrapper(
   extends ClientInterface
   with Logging {
 
+  overrideHadoopShims()
+
+  // !! HACK ALERT !!
+  //
+  // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark
+  // EC2 scripts.  We should remove this after upgrading the Spark EC2 scripts to some more recent
+  // Hadoop version in the future.
+  //
+  // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking
+  // version information gathered from Hadoop jar files.  If the major version number is 1,
+  // `Hadoop20SShims` will be loaded.  Otherwise, if the major version number is 2, `Hadoop23Shims`
+  // will be chosen.
+  //
+  // However, some APIs in Hadoop 2.0.x and 2.1.x were in flux due to historical reasons.  So
+  // 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with
+  // `Hadoop20SShims`, but `Hadoop23Shims` is chosen because the major version number here is 2.
+  //
+  // Here we check for this specific version and load `Hadoop20SShims` via reflection.  Note that
+  // we can't check for the string literal "2.0.0-mr1-cdh4.1.1" because the obtained version string
+  // comes from the Maven artifact org.apache.hadoop:hadoop-common:2.0.0-cdh4.1.1, which doesn't
+  // have the "mr1" tag in its version string.
+  private def overrideHadoopShims(): Unit = {
+    val VersionPattern = """2\.0\.0.*cdh4.*""".r
+
+    VersionInfo.getVersion match {
+      case VersionPattern() =>
+        val shimClassName = "org.apache.hadoop.hive.shims.Hadoop20SShims"
+        logInfo(s"Loading Hadoop shims $shimClassName")
+
+        try {
+          val shimsField = classOf[ShimLoader].getDeclaredField("hadoopShims")
+          // scalastyle:off classforname
+          val shimsClass = Class.forName(shimClassName)
+          // scalastyle:on classforname
+          val shims = classOf[HadoopShims].cast(shimsClass.newInstance())
+          shimsField.setAccessible(true)
+          shimsField.set(null, shims)
+        } catch { case cause: Throwable =>
+          logError(s"Failed to load $shimClassName", cause)
+          // Falls back to normal Hive `ShimLoader` logic
+        }
+
+      case _ =>
+    }
+  }
+
   // Circular buffer to hold what hive prints to STDOUT and ERR.  Only printed when failures occur.
   private val outputBuffer = new CircularBuffer()
 

