[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r164551253 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- Adding a comment about exactly why any of the proposed changes are needed is really the only thing that can make this code more understandable. I had that in the bug and in my commit message, but in hindsight, it really should be in the code. Without the comment, all versions are equally complex, because the complexity has nothing to do with how many arguments you have or their type. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r164050130 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- @HyukjinKwon 's proposal looks good with this type alias and comments, so people can know what happened here. But due to personal taste, I prefer the wrapper solution as it looks cleaner to me and don't need to build a map or a O(n) lookup :-/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r164043733 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- Or simply just iterate it with O(n) as said in https://github.com/apache/spark/pull/20377/files#r163920410. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r164043569 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- How about something like this then if it really matters? ```diff --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala @@ -83,9 +83,8 @@ import org.apache.spark.util.{CircularBuffer, Utils} */ private[hive] class HiveClientImpl( override val version: HiveVersion, -warehouseDir: Option[String], sparkConf: SparkConf, -hadoopConf: JIterable[JMap.Entry[String, String]], +hadoopConf: HadoopConfiguration, extraConfig: Map[String, String], initClassLoader: ClassLoader, val clientLoader: IsolatedClientLoader) @@ -106,6 +105,8 @@ private[hive] class HiveClientImpl( case hive.v2_1 => new Shim_v2_1() } + private val hadoopConfMap = hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap + // Create an internal session state for this HiveClientImpl. val state: SessionState = { val original = Thread.currentThread().getContextClassLoader @@ -132,7 +133,7 @@ private[hive] class HiveClientImpl( if (ret != null) { // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState // instance constructed, we need to follow that change here. -warehouseDir.foreach { dir => +hadoopConfMap.get(ConfVars.METASTOREWAREHOUSE.varname).foreach { dir => ret.getConf.setVar(ConfVars.METASTOREWAREHOUSE, dir) } ret @@ -166,8 +167,7 @@ private[hive] class HiveClientImpl( // has hive-site.xml. So, HiveConf will use that to override its default values. // 2: we set all spark confs to this hiveConf. // 3: we set all entries in config to this hiveConf. -(hadoopConf.iterator().asScala.map(kv => kv.getKey -> kv.getValue) - ++ sparkConf.getAll.toMap ++ extraConfig).foreach { case (k, v) => +(hadoopConfMap ++ sparkConf.getAll.toMap ++ extraConfig).foreach { case (k, v) => logDebug( s""" |Applying Hadoop/Hive/Spark and extra properties to Hive Conf: @@ -847,6 +847,11 @@ private[hive] class HiveClientImpl( } private[hive] object HiveClientImpl { + + // wider signature for hadoop conf for different versions blabla .. + // See SPARK-17088. + private type HadoopConfiguration = JIterable[JMap.Entry[String, String]] + /** Converts the native StructField to Hive's FieldSchema. */ def toHiveColumn(c: StructField): FieldSchema = { val typeString = if (c.metadata.contains(HIVE_TYPE_STRING)) { ``` I think this could roughly address all concerns listed here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r164035915 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- I'm sure that when people think about non-trivial code in Spark, this is exactly the line in the whole code base they think of... :-/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r164032685 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- I think many people may get confused when they see this `JIterable[JMap.Entry[String, String]]` and the extra `warehouse` parameter. I don't agree that we can sacrifices code quality because it's only called twice. So the argument is, is this wrapper solution cleaner than the current one? If it is, and can fix the issue, let's go for it. I think refactoring should be encouraged, it happens a lot in the spark code base, people see hacky code and people fix it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163938641 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- This is just a general rule we follow when we developing the software, especially when this does not affect the performance. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163936822 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- That's such a subjective argument. The extra argument does not make the code more complicated, especially compared to anything else going on here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163936367 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- This is not the hot path. Passing extra parameters for this looks unnecessary. We need to keep the interface clean for better code maintenance. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163920410 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- Let me ask the question: what exactly is the problem with the argument I added? It solves the issue without having to write all this code. If you really dislike the argument for some odd reason, you can get the config by iterating over the `Iterable` in `HiveClientImpl`, making an operation that is currently O(1) to be O(n). But I really don't understand why you guys care about this argument so much. There are *2* call sites to that constructor in the whole code base, both in the same method in `IsolatedClientLoader.scala`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163774967 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- cc @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163774842 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader( /** The isolated client interface to Hive. */ private[hive] def createClient(): HiveClient = synchronized { -val warehouseDir = Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname)) if (!isolationOn) { - return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config, -baseClassLoader, this) + return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this) --- End diff -- so the major concern is to hide the `Configuration` class through the code path. How about we create a wrapper? ``` trait HadoopConfWrapper { def get(key: String): String def toIterator: Iterator[(String, String)] } ``` and here ``` val wrapper = new HadoopConfWrapper { def get(key: String) = hadoopConf.get(key) def toIterator = hadoopConf.iterator().asScala } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163655221 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -132,7 +131,8 @@ private[hive] class HiveClientImpl( if (ret != null) { // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState // instance constructed, we need to follow that change here. -warehouseDir.foreach { dir => +val conf = hadoopConf.asInstanceOf[Configuration] --- End diff -- Over heeerre ... you could say this is a good example of why getting reviews is important! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163631446 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -132,7 +131,8 @@ private[hive] class HiveClientImpl( if (ret != null) { // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState // instance constructed, we need to follow that change here. -warehouseDir.foreach { dir => +val conf = hadoopConf.asInstanceOf[Configuration] --- End diff -- Actually I'm not sure just what I suggested will work. You need to reproduce the condition to trigger the `if` above: ``` val ret = SessionState.get if (ret != null) { if (ret != null) { ``` And I'm not sure how to do that. I'm just sure that if you do, your code will not work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163630721 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -132,7 +131,8 @@ private[hive] class HiveClientImpl( if (ret != null) { // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState // instance constructed, we need to follow that change here. -warehouseDir.foreach { dir => +val conf = hadoopConf.asInstanceOf[Configuration] --- End diff -- Thanks! Will try to reproduce it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163628671 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -132,7 +131,8 @@ private[hive] class HiveClientImpl( if (ret != null) { // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState // instance constructed, we need to follow that change here. -warehouseDir.foreach { dir => +val conf = hadoopConf.asInstanceOf[Configuration] --- End diff -- See other comment. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163627927 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -132,7 +131,8 @@ private[hive] class HiveClientImpl( if (ret != null) { // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState // instance constructed, we need to follow that change here. -warehouseDir.foreach { dir => +val conf = hadoopConf.asInstanceOf[Configuration] --- End diff -- How to reproduce it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20377#discussion_r163627455 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -132,7 +131,8 @@ private[hive] class HiveClientImpl( if (ret != null) { // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState // instance constructed, we need to follow that change here. -warehouseDir.foreach { dir => +val conf = hadoopConf.asInstanceOf[Configuration] --- End diff -- You're reintroducing the original bug here. You cannot cast this to `Configuration`, because in the `sharesHadoopClasses = false` case, there are two different instances of that class involved, and this will fail with a `ClassCastException`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/20377 [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' option when creating client ## What changes were proposed in this pull request? This PR is to remove useless `warehouseDir`, which is already contained in `hadoopConf ` ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark followUp17088 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20377.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20377 commit 0574ec71d830d8c4b5acdbfc910ddc0f5622b0ed Author: gatorsmile Date: 2018-01-24T05:05:47Z remove warehousePath --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org