[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-09-29 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r496661990



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")

Review comment:
   > Why do we change this from `File.pathSeparator` to `;`?
   
   Since on Unix systems `File.pathSeparator` is `:`, it would split an HDFS path like `hdfs://xxx/xxx`.
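
   A quick standalone sketch of the problem (illustration only, not the PR's code; the namenode address is a placeholder):

   ```
   import java.io.File

   // On Unix, File.pathSeparator is ":", which also appears in URI schemes and
   // host:port authorities, so splitting a remote jar path on it tears it apart.
   object PathSeparatorDemo {
     def main(args: Array[String]): Unit = {
       val configured = "hdfs://nn:8020/hive/jars/*"
       // On Unix prints: hdfs | //nn | 8020/hive/jars/*
       println(configured.split(File.pathSeparator).mkString(" | "))
       // Prints the path intact: hdfs://nn:8020/hive/jars/*
       println(configured.split(";").mkString(" | "))
     }
   }
   ```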








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-09-29 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r496662585



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")
+  .flatMap {
+case path if path.contains("\\") =>

Review comment:
   > `\\` can be in file names in other OSs. I don't think it makes sense to assume that this is a local path.

   All right, I will raise a PR to change this in `SparkContext`'s `addJar`/`addFile`.








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-09-29 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r496668116



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")
+  .flatMap {
+case path if path.contains("\\") =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case null | "file" =>
+  addLocalHiveJars(new File(uri.getPath))

Review comment:
   > Assuming `null` as a local path is kind of ambiguous (local paths vs `fs.default.name`). Maybe you'll have to document that it should always be a fully qualified URL to indicate other file systems.

   Added a comment in the doc of the conf `spark.sql.hive.metastore.jars`.
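
   For illustration, a standalone sketch of why the ambiguity exists (paths are placeholders): a scheme-less path has a null URI scheme, so whether it means a local file or a path on `fs.defaultFS` depends entirely on configuration, while a fully qualified URL leaves no guesswork.

   ```
   import org.apache.hadoop.fs.Path

   object SchemeAmbiguityDemo {
     def main(args: Array[String]): Unit = {
       println(new Path("/opt/hive/lib/a.jar").toUri.getScheme)        // null
       println(new Path("file:///opt/hive/lib/a.jar").toUri.getScheme) // file
       println(new Path("hdfs://nn/jars/a.jar").toUri.getScheme)       // hdfs
     }
   }
   ```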








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-09-29 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r496668959



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")
+  .flatMap {
+case path if path.contains("\\") =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case null | "file" =>
+  addLocalHiveJars(new File(uri.getPath))
+case "local" =>
+  new File("file:" + uri.getPath).toURL :: Nil

Review comment:
   > Here I think you can treat `local` as same as `file` since we're not downloading or fetching it (or distributing it)

   Yeah.
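
   A minimal sketch of the idea (`isDriverLocal` is a hypothetical helper, not the PR's code): both `file` and `local` mean the jar already sits on the driver's file system, so neither needs fetching or distribution.

   ```
   import java.net.URI

   object SchemeDemo {
     // A jar counts as driver-local when its scheme is absent, "file", or "local".
     def isDriverLocal(path: String): Boolean =
       Option(new URI(path).getScheme) match {
         case None | Some("file") | Some("local") => true
         case _ => false
       }

     def main(args: Array[String]): Unit = {
       Seq("file:///opt/hive/lib/a.jar", "local:/opt/hive/lib/a.jar", "hdfs://nn/jars/a.jar")
         .foreach(p => println(s"$p -> ${isDriverLocal(p)}"))
     }
   }
   ```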








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-09-29 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r496775069



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")

Review comment:
   > Path separator is OS dependent.. let's not change the behaviour

   But in a lot of places we use a comma separator to split file paths, right? Also, I am confused about why we must use the OS path separator here, since what we configure is a jar file list or `*`.

   Or do you mean users will pass a config like `hadoop classpath`? There seems to be no `hive classpath`, though.
   
   








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-09-30 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r497336247



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")
+  .flatMap {
+case path if path.contains("\\") =>

Review comment:
   > I could not follow this. Do you want to check if this is a Windows path? If that's the case, you can use `Utils.isWindows`.

   Here we need to use the condition `Utils.isWindows && path.contains("\\")`, since the path may be a remote path.
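
   A standalone sketch of the combined guard (the `isWindows` val is a stand-in for Spark's private `Utils.isWindows`): a backslash alone is weak evidence of a local Windows path, because `\` is a legal file-name character on other OSs.

   ```
   object WindowsPathGuard {
     // Stand-in for Spark's private Utils.isWindows.
     val isWindows: Boolean =
       sys.props.getOrElse("os.name", "").toLowerCase.startsWith("windows")

     // Only treat a backslash as a local Windows path when actually on Windows.
     def looksLikeWindowsLocalPath(path: String): Boolean =
       isWindows && path.contains("\\")

     def main(args: Array[String]): Unit = {
       println(looksLikeWindowsLocalPath("""C:\hive\lib\a.jar"""))  // true on Windows only
       println(looksLikeWindowsLocalPath("hdfs://nn/jars/a.jar"))   // false
     }
   }
   ```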








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-09-30 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r497952649



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")

Review comment:
   ping @dongjoon-hyun @wangyum @cloud-fan WDYT about this problem








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-11 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r503028007



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +417,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
+}
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil

Review comment:
   > How are these jars downloaded eventually? We'll need to call the `FileSystem` API for that later, right?

   `URLClassLoader` will handle this. You can see in my previous commit that I used `Utils.fetchFile` to download, but I have now removed that.

##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -88,12 +90,23 @@ private[spark] object HiveUtils extends Logging {
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
|   Use Hive jars of specified version downloaded from Maven repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in comma separated format
+  |   support both local or remote paths.
+  | 4. A classpath in the standard format for both Hive and Hadoop, we should always

Review comment:
   > nit: `we should always` -> `it should always`?
   
   Yeah.

##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -88,12 +90,23 @@ private[spark] object HiveUtils extends Logging {
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
|   Use Hive jars of specified version downloaded from Maven repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in comma separated format
+  |   support both local or remote paths.
+  | 4. A classpath in the standard format for both Hive and Hadoop, we should always
+  |   be fully qualified URL to indicate other file systems.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated path of Hive jars, both support local and remote paths." +

Review comment:
   > nit: `both support` -> `support both`.
   
   Done.
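
   For illustration, a minimal sketch of how the two configs combine (host names and paths are placeholders; both are static confs, so they must be set before the session starts):

   ```
   import org.apache.spark.sql.SparkSession

   object MetastoreJarsPathDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[*]")
         .enableHiveSupport()
         .config("spark.sql.hive.metastore.version", "2.3.7")
         .config("spark.sql.hive.metastore.jars", "path")
         // Comma separated, mixing a remote and a local location.
         .config("spark.sql.hive.metastore.jars.path",
           "hdfs://nn:8020/hive/lib/*,file:///opt/hive/lib/*")
         .getOrCreate()
       spark.sql("SHOW DATABASES").show()
     }
   }
   ```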








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-11 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r503028754



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")
+  .flatMap {
+case path if path.contains("\\") =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case null | "file" =>
+  addLocalHiveJars(new File(uri.getPath))

Review comment:
   > Should this comment on the option "path" instead of "classpath"?
   
   I can't get the point of this comment; can you make it clearer?








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-11 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r503029447



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +417,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)

Review comment:
   > Do we handle nested `*`, e.g., `path = /foo/*/*`?

   You mean `path=/xxx/xxx/*/*`?








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-11 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r503030308



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
-  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")
+  .flatMap {
+case path if path.contains("\\") =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case null | "file" =>
+  addLocalHiveJars(new File(uri.getPath))

Review comment:
   > Should this comment on the option "path" instead of "classpath"?
   
   I can't get your point; can you make it clearer?








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-11 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r503031298



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +417,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)

Review comment:
   > yes (sorry for the syntax).
   
   Not yet.








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-11 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r503035033



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +417,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
+}
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil

Review comment:
   > Saw the `fetchFile` call in the previous commit. However I don't see how `URLClassLoader` can handle HDFS paths. I don't see any `FileSystem` related stuff in the class.

   I had the same confusion as you before, until I saw this error: https://github.com/apache/spark/pull/29881#issuecomment-701211241








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-11 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r503035295



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +417,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)

Review comment:
   > Okay. Maybe it is worth pointing out in the description that we only support one single wildcard in the last path component.

   Yeah, good suggestion; we should make it clearer in the description.








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-11 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r503041748



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +417,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
+}
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil

Review comment:
   > I see. So Hadoop side has a `URLStreamHandlerFactory` impl which provides handlers for the Hadoop-specific URL schemes. Not sure about the S3 error though :-/

   Yeah. Digging into the code, it seems the S3 config lookup goes through the `URLClassLoader`, and the mismatch makes it reload elements again and again.
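
   For reference, a standalone sketch of the mechanism discussed above (the namenode address is a placeholder): Hadoop ships `FsUrlStreamHandlerFactory`, which teaches `java.net.URL` the hdfs:// (and other Hadoop) schemes, and that is what lets a plain `URLClassLoader` read remote jars.

   ```
   import java.net.URL
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

   object HdfsUrlDemo {
     def main(args: Array[String]): Unit = {
       // May be installed at most once per JVM.
       URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(new Configuration()))
       val url = new URL("hdfs://nn:8020/hive/lib/hive-exec.jar")
       println(url.getProtocol) // "hdfs"
     }
   }
   ```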








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-13 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r504443964



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -88,12 +90,25 @@ private[spark] object HiveUtils extends Logging {
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
|   Use Hive jars of specified version downloaded from Maven repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in comma separated format
+  |   support both local or remote paths, it should always be fully qualified URL to indicate
+  |   other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.

Review comment:
   Done








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r50397



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -88,12 +90,25 @@ private[spark] object HiveUtils extends Logging {
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
|   Use Hive jars of specified version downloaded from Maven repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in comma separated format

Review comment:
   > `${HIVE_METASTORE_JARS_PATH.key}`
   
   We can't refer to each other between `HIVE_METASTORE_JARS_PATH` and `HIVE_METASTORE_JARS`.
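
   The ordering constraint can be shown with a plain Scala sketch (hypothetical vals, not the real conf objects): two eagerly initialized vals in one object cannot both embed each other, because the forward reference is still null when the earlier val initializes.

   ```
   object ConfOrderDemo {
     val jarsDoc: String = s"see also $jarsPathKey" // jarsPathKey not yet set -> "see also null"
     val jarsPathKey: String = "spark.sql.hive.metastore.jars.path"

     def main(args: Array[String]): Unit = {
       println(jarsDoc)     // see also null
       println(jarsPathKey) // spark.sql.hive.metastore.jars.path
     }
   }
   ```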








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r50764



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +419,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
+}
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
+}
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+HiveUtils.hiveMetastoreJarsPath(sqlConf)
+  .flatMap {
+case path if Utils.isWindows =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case "file" | "local" =>
+  addLocalHiveJars(new File(uri.getPath))
+case "http" | "https" | "ftp" =>
+  try {
+// validate and fetch URI file
+uri.toURL :: Nil
+  } catch {
+case _: Throwable =>
+  logWarning(s"Hive Jars URI (${uri.toString}) is not a valid URL.")
+  Nil
+  }
+case _ =>
+  checkRemoteHiveJars(path)
+  }
+  }
+
+  logInfo(
+s"Initializing HiveMetastoreConnection version $hiveMetastoreVersion " +
+  s"using path: ${jars.mkString(";")}")
+  new IsolatedClientLoader(
+version = metaVersion,
+sparkConf = conf,
+hadoopConf = hadoopConf,
+execJars = jars.toSeq,

Review comment:
   > So `IsolatedClientLoader` already supports remote JAR paths?
   
   https://github.com/apache/spark/pull/29881#discussion_r496661990








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r50764



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +419,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
+}
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
+}
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+HiveUtils.hiveMetastoreJarsPath(sqlConf)
+  .flatMap {
+case path if Utils.isWindows =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case "file" | "local" =>
+  addLocalHiveJars(new File(uri.getPath))
+case "http" | "https" | "ftp" =>
+  try {
+// validate and fetch URI file
+uri.toURL :: Nil
+  } catch {
+case _: Throwable =>
+  logWarning(s"Hive Jars URI (${uri.toString}) is not a valid URL.")
+  Nil
+  }
+case _ =>
+  checkRemoteHiveJars(path)
+  }
+  }
+
+  logInfo(
+s"Initializing HiveMetastoreConnection version $hiveMetastoreVersion " +
+  s"using path: ${jars.mkString(";")}")
+  new IsolatedClientLoader(
+version = metaVersion,
+sparkConf = conf,
+hadoopConf = hadoopConf,
+execJars = jars.toSeq,

Review comment:
   > So `IsolatedClientLoader` already supports remote JAR paths?
   
   The reason for adding the new config is https://github.com/apache/spark/pull/29881#discussion_r496661990








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r504451835



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +419,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {

Review comment:
   > can we use `DataSource.checkAndGlobPathIfNecessary`?
   
   Yeah, I will test this.
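
   For context, a minimal sketch of what a glob-aware API buys (not `DataSource.checkAndGlobPathIfNecessary` itself, but the Hadoop `FileSystem.globStatus` call it ultimately relies on; paths are placeholders): wildcards may then appear in any path component, not only the last one.

   ```
   import java.net.URL
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}

   object GlobDemo {
     def main(args: Array[String]): Unit = {
       // Nested wildcard: one lib directory per module, for example.
       val pattern = new Path("hdfs://nn:8020/hive/*/lib/*.jar")
       val fs: FileSystem = pattern.getFileSystem(new Configuration())
       // globStatus returns null when nothing matches, hence the Option wrapper.
       val urls: Seq[URL] =
         Option(fs.globStatus(pattern)).toSeq.flatten.map(_.getPath.toUri.toURL)
       urls.foreach(println)
     }
   }
   ```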








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZhuuuu commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r504668188



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +419,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+  files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
+}
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since we can't know a tar/tar.gz file's final path yet, it is not supported
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
+}
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+HiveUtils.hiveMetastoreJarsPath(sqlConf)
+  .flatMap {
+case path if Utils.isWindows =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case "file" | "local" =>

Review comment:
   > IIUC the Hadoop DFS API also supports local file systems.
   
   It does not support the local file system, but it can support `http/https/ftp`:
   ```
   ~/spark-3.1.0$ bin/spark-sql --master local --conf spark.sql.hive.metastore.jars=path --conf spark.sql.hive.metastore.jars.path=/home/hadoop/.ivy2/jars/* --conf spark.sql.hive.metastore.version=2.3.7 --conf spark.driver.extraClassPath=$HIVE_CONF_DIR
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://redpoll/home/hadoop/.ivy2/jars/*;
   at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:790)
   at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
   at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
   at scala.util.Success.$anonfun$map$1(Try.scala:255)
   at scala.util.Success.map(Try.scala:213)
   at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
   at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
   at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
   at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
   at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
   at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
   at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
   ```
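   
   The resolution above comes from Hadoop's `Path.getFileSystem`, which falls back to `fs.defaultFS` when a path has no scheme; a minimal sketch (assuming `fs.defaultFS` is `hdfs://redpoll` as in the log above):
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   val hadoopConf = new Configuration()
   // A scheme-less path is resolved against fs.defaultFS, so on this cluster
   // it becomes an HDFS path rather than a local one.
   val fs: FileSystem = new Path("/home/hadoop/.ivy2/jars/*").getFileSystem(hadoopConf)
   println(fs.getUri) // hdfs://redpoll here; file:/// with default settings
   ```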





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r504714045



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +419,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+ files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {

Review comment:
   > > can we use `DataSource.checkAndGlobPathIfNecessary`?
   > 
   > Yea, I will test this.
   
   @sunchao Following this suggestion, we can now support nested HDFS path wildcards such as hdfs://xx/xx/*/*.
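   
   For illustration, a configuration along these lines (paths hypothetical) would then be accepted:
   
   ```
   bin/spark-sql --conf spark.sql.hive.metastore.jars=path \
     --conf spark.sql.hive.metastore.jars.path=hdfs://nameservice/path/to/jars/*/*
   ```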





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r504715612



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,6 +419,93 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+ files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
+  Nil
+} else {
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
+}
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since tar/tar.gz file we can't know it's final path yet, not support it
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
+}
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+HiveUtils.hiveMetastoreJarsPath(sqlConf)
+  .flatMap {
+case path if Utils.isWindows =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case "file" | "local" =>

Review comment:
   I have updated this and verified it with a detailed check in our environment. Any more suggestions about the comment or doc?
   Any other cases that need to be checked?
   
   also cc @sunchao @dongjoon-hyun 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r504717621



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -397,27 +399,86 @@ private[spark] object HiveUtils extends Logging {
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
-  // Convert to files and expand any directories.
-  val jars =
-hiveMetastoreJars
-  .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
+
+  def addLocalHiveJars(file: File): Seq[URL] = {
+if (file.getName == "*") {
+  val files = file.getParentFile.listFiles()
+  if (files == null) {
+logWarning(s"Hive jar path '${file.getPath}' does not exist.")
+Nil
+  } else {
+ files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).map(_.toURL).toSeq
+  }
+} else {
+  file.toURL :: Nil
+}
+  }
+
+  def checkRemoteHiveJars(path: String): Seq[URL] = {
+try {
+  val hadoopPath = new Path(path)
+  val fs = hadoopPath.getFileSystem(hadoopConf)
+  if (hadoopPath.getName == "*") {
+val parent = hadoopPath.getParent
+if (!fs.exists(parent)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (!fs.getFileStatus(parent).isDirectory) {
+  logWarning(s"Hive Jar ${parent} is not a directory.")
   Nil
 } else {
- files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
+  fs.listStatus(parent).map(_.getPath.toUri.toURL)
 }
-  case path =>
-new File(path) :: Nil
+  } else {
+if (!fs.exists(hadoopPath)) {
+  logWarning(s"Hive Jar ${path} does not exist.")
+  Nil
+} else if (fs.getFileStatus(hadoopPath).isDirectory) {
+  logWarning(s"Hive Jar ${path} not allow directory without `*`")
+  Nil
+} else {
+  // Since tar/tar.gz file we can't know it's final path yet, not support it
+  hadoopPath.toUri.toURL :: Nil
+}
+  }
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to find $path to Hive Jars", e)
+Nil
 }
-  .map(_.toURI.toURL)
+  }
+
+  // Convert to files and expand any directories.
+  val jars =
+hiveMetastoreJars
+  .split(";")
+  .flatMap {
+case path if path.contains("\\") =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {
+case null | "file" =>
+  addLocalHiveJars(new File(uri.getPath))

Review comment:
   @HyukjinKwon 
   Added `null` back since, for a local path, users won't instinctively add `file://` or `local://`.
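   
   A self-contained sketch of the dispatch being discussed (the two helpers are stubs standing in for `addLocalHiveJars`/`checkRemoteHiveJars` from the quoted diff):
   
   ```scala
   import java.io.File
   import java.net.URL
   import org.apache.hadoop.fs.Path
   
   // Stubs for the helpers defined in the quoted diff.
   def addLocalHiveJars(file: File): Seq[URL] = Seq(file.toURI.toURL)
   def checkRemoteHiveJars(path: String): Seq[URL] = Nil // placeholder
   
   // A path with no URI scheme (getScheme == null) is treated as local,
   // since users typically won't prefix plain local paths with file://.
   def resolveHiveJar(path: String): Seq[URL] = {
     val uri = new Path(path).toUri
     uri.getScheme match {
       case null | "file" => addLocalHiveJars(new File(uri.getPath))
       case _             => checkRemoteHiveJars(path)
     }
   }
   ```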





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r504746109



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -102,9 +103,16 @@ private[spark] object HiveUtils extends Logging {
 
   val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
 .doc(s"Comma separated path of Hive jars, support both local and remote 
paths," +
-  s"we support path wildcards such as `hdfs://path/to/jars/*`, but not 
support" +
-  s"nested path wildcards such as `hdfs://path/to/jars/*/*`. When 
${HIVE_METASTORE_JARS}" +
-  s"is set to `path`, we will use Hive jars configured by this")
+  s"Such as:" +
+  s" 1. /path/to/jar/xxx.jar" +
+  s" 2. file:///path/to/jar/xxx.jar" +
+  s" 3. local:///path/to/jar/xxx.jar" +
+  s" 4. hdfs://path/to/jar/xxx.jar" +

Review comment:
   > how about `s3://`?
   
   As https://github.com/apache/spark/pull/29881#issuecomment-701211241 mentioned, it seems S3's method needs to call the ClassLoader and causes a StackOverflowError. I'm not familiar with S3's API, so maybe we can open a new JIRA and hope an expert solves it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r505120663



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,23 +440,50 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+  // Convert to files and expand any directories.
+  val jars =
+HiveUtils.hiveMetastoreJarsPath(sqlConf)
+  .flatMap {
+case path if Utils.isWindows =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {

Review comment:
   > I don't understand why we need to check the scheme and do things differently. `spark.read.parquet(...)` can support any scheme and internally just uses the Hadoop DFS API. Can't we do that as well here?
   > 
   > Can you point to other places in Spark that do similar things to support this PR?
   
   In the datasource file index, a path is a fully qualified URL to indicate other file systems.
   So it supports `file://` and `local://`, but a path without a scheme will be handled as an HDFS path.
   One way to avoid checking the scheme, as https://github.com/apache/spark/pull/29881#discussion_r496649487 mentioned, is to require users to use fully qualified URLs.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r505138988



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,23 +440,50 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+  // Convert to files and expand any directories.
+  val jars =
+HiveUtils.hiveMetastoreJarsPath(sqlConf)
+  .flatMap {
+case path if Utils.isWindows =>
+  addLocalHiveJars(new File(path))
+case path =>
+  val uri = new Path(path).toUri
+  uri.getScheme match {

Review comment:
   How about the current change? It makes the logic simpler.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r505224964



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in 
comma separated format
+  |   support both local or remote paths, it should always be fully 
qualified URL to indicate
+  |   other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated fully qualified URL of Hive jars, support both 
local and remote paths," +
+  s"Such as: " +
+  s" 1. file://path/to/jar/xxx.jar" +
+  s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
+  s" 3. /path/to/jar (path without schema will be treated as HDFS path)" +

Review comment:
   > hmm, this depends on the hadoop conf IIUC. e.g. when I run spark locally with default settings, a path without a scheme is the local file system.
   
   Emmm, I need to change the doc.
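   
   A quick way to check which filesystem a scheme-less path falls back to (sketch, not from the PR):
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.FileSystem
   
   // Prints file:/// with default local settings; on a cluster whose
   // fs.defaultFS points at HDFS it prints the hdfs:// URI instead.
   val conf = new Configuration()
   println(FileSystem.getDefaultUri(conf))
   ```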





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-14 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r505228738



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,23 +440,43 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+  // Convert to files and expand any directories.
+  val jars =
+HiveUtils.hiveMetastoreJarsPath(sqlConf)
+  .flatMap {
+case path if path.contains("\\") && Utils.isWindows =>

Review comment:
   > I don't know windows very well. cc @HyukjinKwon
   
   See https://github.com/apache/spark/blob/77a8efbc05cb4ecc40dd050c363429e71a9f23c1/core/src/main/scala/org/apache/spark/SparkContext.scala#L1902





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-15 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r505252862



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in 
comma separated format
+  |   support both local or remote paths, it should always be fully 
qualified URL to indicate
+  |   other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated fully qualified URL of Hive jars, support both 
local and remote paths," +
+  s"Such as: " +
+  s" 1. file://path/to/jar/xxx.jar" +
+  s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
+  s" 3. /path/to/jar (path without schema will be treated as HDFS path)" +
+  s" 3. [http/https/ftp]://path/to/jar/xxx.jar" +

Review comment:
   > This is supported because `IsolatedClientLoader.execJars` supports it?
   
   `URLClassLoader` supports URI paths, and `DataSource.checkAndGlobPathIfNecessary` can pass URIs through too.
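   
   A minimal sketch of that behavior (the Maven URL is the same one used as an example later in this thread):
   
   ```scala
   import java.net.{URL, URLClassLoader}
   
   // URLClassLoader accepts remote http(s) URLs directly, which is why a
   // remote hive-exec jar can sit on the isolated classloader's classpath.
   val jarUrl = new URL(
     "https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.7/hive-exec-2.3.7.jar")
   val loader = new URLClassLoader(Array(jarUrl), Thread.currentThread().getContextClassLoader)
   ```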





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-15 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r505254274



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in 
comma separated format
+  |   support both local or remote paths, it should always be fully 
qualified URL to indicate
+  |   other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated fully qualified URL of Hive jars, support both 
local and remote paths," +
+  s"Such as: " +
+  s" 1. file://path/to/jar/xxx.jar" +
+  s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
+  s" 3. /path/to/jar (path without schema will be treated as HDFS path)" +
+  s" 3. [http/https/ftp]://path/to/jar/xxx.jar" +

Review comment:
   Removing the hive-exec jar from the HDFS path and supplying it via a URI path can build the HiveClient too:
   ```
   --conf spark.sql.hive.metastore.jars=path --conf spark.sql.hive.metastore.jars.path=hdfs://nameservice/spark/hive-2.3.7/*,https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.7/hive-exec-2.3.7.jar
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-15 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r505446116



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in 
comma separated format
+  |   support both local or remote paths, it should always be fully 
qualified URL to indicate
+  |   other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated fully qualified URL of Hive jars, support both 
local and remote paths," +
+  s"Such as: " +
+  s" 1. file://path/to/jar/xxx.jar" +
+  s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
+  s" 3. /path/to/jar/ (path without schema will follow hadoop conf 
`fs.defaultFS`'s schema)" +

Review comment:
   > `schema` -> `URI scheme`
   
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-16 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r506787887



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -396,23 +440,43 @@ private[spark] object HiveUtils extends Logging {
 config = configurations,
 barrierPrefixes = hiveMetastoreBarrierPrefixes,
 sharedPrefixes = hiveMetastoreSharedPrefixes)
+} else if (hiveMetastoreJars == "path") {
+  // Convert to files and expand any directories.
+  val jars =
+HiveUtils.hiveMetastoreJarsPath(sqlConf)
+  .flatMap {
+case path if path.contains("\\") && Utils.isWindows =>
+  addLocalHiveJars(new File(path))
+case path =>
+  DataSource.checkAndGlobPathIfNecessary(
+pathStrings = Seq(path),
+hadoopConf = hadoopConf,
+checkEmptyGlobPath = true,
+checkFilesExist = false,
+enableGlobbing = true
+  ).map(_.toUri.toURL)
+  }
+
+  logInfo(
+s"Initializing HiveMetastoreConnection version $hiveMetastoreVersion " 
+
+  s"using path: ${jars.mkString(";")}")
+  new IsolatedClientLoader(
+version = metaVersion,
+sparkConf = conf,
+hadoopConf = hadoopConf,
+execJars = jars.toSeq,
+config = configurations,
+isolationOn = true,
+barrierPrefixes = hiveMetastoreBarrierPrefixes,
+sharedPrefixes = hiveMetastoreSharedPrefixes)
 } else {
   // Convert to files and expand any directories.
   val jars =
 hiveMetastoreJars
   .split(File.pathSeparator)
-  .flatMap {
-  case path if new File(path).getName == "*" =>
-val files = new File(path).getParentFile.listFiles()
-if (files == null) {
-  logWarning(s"Hive jar path '$path' does not exist.")
-  Nil
-} else {
- files.filter(_.getName.toLowerCase(Locale.ROOT).endsWith(".jar")).toSeq
-}
-  case path =>
-new File(path) :: Nil
-}
+  .flatMap { path =>
+addLocalHiveJars(new File(path))
+  }
   .map(_.toURI.toURL)

Review comment:
   > This line can be removed, as `addLocalHiveJars` already returns URL
   
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-16 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r506787897



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in 
comma separated format

Review comment:
   > Add a period at the end?
   
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-16 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r506788022



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in 
comma separated format
+  |   support both local or remote paths, it should always be fully 
qualified URL to indicate
+  |   other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated fully qualified URL of Hive jars, support both 
local and remote paths," +
+  s"Such as: " +
+  s" 1. file://path/to/jar/xxx.jar" +
+  s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
+  s" 3. /path/to/jar/ (path without URI scheme follow conf 
`fs.defaultFS`'s URI schema)" +
+  s" 4. [http/https/ftp]://path/to/jar/xxx.jar" +
+  s"For URI, we can't support path wildcard, but for other URL support 
nested path wildcard," +

Review comment:
   > can we be more specific? e.g. `http/https/ftp` doesn't support 
wildcard.
   
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-22 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r509938417



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path`
+  |   in comma separated format. Support both local or remote paths, it 
should always
+  |   be fully qualified URL to indicate other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated fully qualified URL of Hive jars, support both 
local and remote paths," +
+  s"Such as: " +
+  s" 1. file://path/to/jar/xxx.jar" +

Review comment:
   > shall we add `\n` at the end for each item?
   
   Done

##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path`
+  |   in comma separated format. Support both local or remote paths, it 
should always
+  |   be fully qualified URL to indicate other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated fully qualified URL of Hive jars, support both 
local and remote paths," +
+  s"Such as: " +
+  s" 1. file://path/to/jar/xxx.jar" +
+  s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
+  s" 3. /path/to/jar/ (path without URI scheme follow conf 
`fs.defaultFS`'s URI schema)" +
+  s" 4. [http/https/ftp]://path/to/jar/xxx.jar" +
+  s"Notice: `http/https/ftp` doesn't support wildcard, but for other URLs 
support" +

Review comment:
   > `but for other ...` -> `but other ...`
   
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29881: [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location

2020-10-22 Thread GitBox


AngersZh commented on a change in pull request #29881:
URL: https://github.com/apache/spark/pull/29881#discussion_r509938592



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path`
+  |   in comma separated format. Support both local or remote paths, it 
should always
+  |   be fully qualified URL to indicate other file systems.

Review comment:
   > `it should always be fully qualified URL ...` it's not true now.
   
   Removed this now.

##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
##
@@ -80,20 +81,41 @@ private[spark] object HiveUtils extends Logging {
   val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
 .doc(s"""
   | Location of the jars that should be used to instantiate the 
HiveMetastoreClient.
-  | This property can be one of three options: "
+  | This property can be one of four options: "
   | 1. "builtin"
   |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark 
assembly when
   |   -Phive is enabled. When this option is chosen,
   |   spark.sql.hive.metastore.version must be either
   |   ${builtinHiveVersion} or not defined.
   | 2. "maven"
   |   Use Hive jars of specified version downloaded from Maven 
repositories.
-  | 3. A classpath in the standard format for both Hive and Hadoop.
+  | 3. "path"
+  |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path`
+  |   in comma separated format. Support both local or remote paths, it 
should always
+  |   be fully qualified URL to indicate other file systems.
+  | 4. A classpath in the standard format for both Hive and Hadoop.
   """.stripMargin)
 .version("1.4.0")
 .stringConf
 .createWithDefault("builtin")
 
+  val HIVE_METASTORE_JARS_PATH = 
buildStaticConf("spark.sql.hive.metastore.jars.path")
+.doc(s"Comma separated fully qualified URL of Hive jars, support both 
local and remote paths," +
+  s"Such as: " +
+  s" 1. file://path/to/jar/xxx.jar" +
+  s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
+  s" 3. /path/to/jar/ (path without URI scheme follow conf 
`fs.defaultFS`'s URI schema)" +
+  s" 4. [http/https/ftp]://path/to/jar/xxx.jar" +
+  s"Notice: `http/https/ftp` doesn't support wildcard, but for other URLs 
support" +
+  s" nested path wildcard, Such as: " +
+  s" 1. file://path/to/jar/*, file://path/to/jar/*/*" +
+  s" 2. hdfs://nameservice/path/to/jar/*, 
hdfs://nameservice/path/to/jar/*/*" +

Review comment:
   > ditto, `\n`
   
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org