[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456232227

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
@@ -97,12 +99,46 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       options = Map.empty)
   }

-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is ruled by 'hive.exec.scratchdir' including file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+    // Here it uses session_path unless it's empty, otherwise uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
+    logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+    val path = new Path(mrScratchDir, "-mr-1")
+    val scheme = Option(path.toUri.getScheme).getOrElse("")
+    if (scheme.equals("file")) {
+      logWarning(s"Temporary data will be written into a local file system " +
+        s"(scheme: '$scheme', path: '$mrScratchDir'). If your Spark is not in local mode, " +
+        s"you might need to configure 'hive.exec.scratchdir' " +
+        s"to use accessible file system (e.g. HDFS path) from any executors in the cluster.")
```

Review comment: Removed the unnecessary `s` at the head. BTW, there is a lot of existing code that includes it, but I left that as it is.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
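The scheme handling in the hunk above reduces to reading the URI scheme of the scratch path and treating an unqualified path as having an empty scheme. A minimal sketch of that logic in plain Java (`java.net.URI` only, no Spark or Hadoop dependencies; the class and method names here are illustrative, not from the PR):

```java
import java.net.URI;

public class SchemeCheck {
    // Mirrors Scala's Option(path.toUri.getScheme).getOrElse(""):
    // an unqualified path such as "/tmp/hive" has a null scheme,
    // which is normalized to the empty string.
    static String schemeOf(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null ? "" : scheme;
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("hdfs://nn:8020/tmp/hive")); // hdfs
        System.out.println(schemeOf("file:/tmp/hive"));          // file
        System.out.println(schemeOf("/tmp/hive").isEmpty());     // true
        // The PR warns only when the resolved scheme is exactly "file",
        // i.e. temporary data would land on a local disk.
    }
}
```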
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456294954

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
+    val scheme = Option(path.toUri.getScheme).getOrElse("")
+    if (scheme.equals("file")) {
+      logWarning("Temporary data will be written into a local file system " +
+        "(scheme: '$scheme', path: '$mrScratchDir'). If your Spark is not in local mode, " +
```

Review comment: Oh, that was my bad. Reverted only the necessary part.
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456231877

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
+  private def supportSchemeToUseNonBlobStore(path: Path): Boolean = {
+    path != null && {
+      val supportedBlobSchemes = SQLConf.get.supportedSchemesToUseNonBlobstore
+      val scheme = Option(path.toUri.getScheme).getOrElse("")
+      Utils.stringToSeq(supportedBlobSchemes).contains(scheme.toLowerCase(Locale.ROOT))
+    }
+  }
+
+  def getExternalTmpPath(
       sparkSession: SparkSession,
       hadoopConf: Configuration,
       path: Path): Path = {
     import org.apache.spark.sql.hive.client.hive._
-
```

Review comment: Thanks for pointing it out. Reverted.
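The quoted `supportSchemeToUseNonBlobStore` helper splits a comma-separated scheme list and matches the path's scheme case-insensitively. A rough stand-alone Java equivalent of just that matching step (the config value and names below are made up for illustration; Spark's real code reads the list via `SQLConf` and splits it with `Utils.stringToSeq`):

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class BlobSchemeFilter {
    // Splits a comma-separated allow-list, loosely imitating Utils.stringToSeq.
    static List<String> toSchemes(String conf) {
        return Arrays.stream(conf.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // True when the path is qualified with one of the listed schemes,
    // compared in lower case as in the quoted Scala.
    static boolean supportedScheme(String conf, String path) {
        String scheme = URI.create(path).getScheme();
        String normalized = (scheme == null ? "" : scheme).toLowerCase(Locale.ROOT);
        return toSchemes(conf).contains(normalized);
    }

    public static void main(String[] args) {
        String conf = "s3, s3a, s3n, wasb"; // hypothetical allow-list value
        System.out.println(supportedScheme(conf, "s3a://bucket/warehouse/t"));   // true
        System.out.println(supportedScheme(conf, "hdfs://nn:8020/warehouse/t")); // false
        System.out.println(supportedScheme(conf, "/warehouse/t"));               // false
    }
}
```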
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446739501

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
+    val scheme = Option(path.toUri.getScheme).getOrElse("")
+    if (schema == "file") {
+      logWarning(s"Temporary data will be written into a local disk " +
```

Review comment: If Spark is running in local mode or Spark uses NFS, it works even if `file` is used. If we throw an exception here, we cannot cover those use cases. IMO, it is similar to the fact that Spark can still use RawLocalFileSystem (`file://`).
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446738298

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
@@ -124,11 +162,25 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+    logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+      s"scratch dir '$scratchDir' are used")
     if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
       oldVersionExternalTempPath(path, hadoopConf, scratchDir)
     } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
-      newVersionExternalTempPath(path, hadoopConf, stagingDir)
+      // HIVE-14270: Write temporary data to HDFS when doing inserts on tables located on S3
```

Review comment: Makes sense. Modified to use `SPARK-21514` in this comment.
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446614786

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
```

Review comment: It makes sense to me. Thank you for the suggestion. I added an error message to suggest that end-users configure `hive.exec.scratchdir` if the scheme of the temporary directory is `file`. Can you check whether it is clear enough for end-users?

```
val path = new Path(mrScratchDir, "-mr-1")
val scheme = Option(path.toUri.getScheme).getOrElse("")
if (schema == "file") {
  logWarning(s"Temporary data will be written into a local disk " +
    s"(scheme: '$scheme', path: '$mrScratchDir'). " +
    s"You need to configure 'hive.exec.scratchdir' to use accessible location " +
    s"(e.g. HDFS path) from any executors in the cluster.")
}
```

Note: I also added one test case for HDFS.
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446516892

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
```

Review comment: Hive has two kinds of scratch dirs accordingly: one local, the other on HDFS. https://mingyue.me/2018/11/17/hive-scratch-working-directory/ In this pull request, the latter, `hive.exec.scratchdir`, is used. In my recognition, we can assume an HDFS scheme in most cases.
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446604034

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
```

Review comment: Currently we just rely on `hive.exec.scratchdir` (not directly on `fs.default.name`), and it works in most use cases even if `hive.exec.scratchdir` is not configured explicitly. I do not want to restrict this feature to HDFS only, because I have seen some clusters that do not have HDFS. I want to let end-users choose any scheme where they want to store temporary data.
moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446588070

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

```diff
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
```

Review comment: Here, no specific scheme is configured in `hive.exec.scratchdir` as the default value; the default is just `/tmp/hive` without a scheme. That means the default scheme determined by `fs.default.name` is used. If `fs.default.name` is a local scheme like `file://`, then `file://` will be used; if `fs.default.name` is an HDFS scheme like `hdfs://host:port/`, then `hdfs://host:port/` will be used. I confirmed this behavior on my end to make sure. In most cases, `fs.default.name` is an HDFS scheme. Even if the local scheme is used here, it should work, because we can assume the local scheme is usable in such an environment (e.g. unit testing). Of course, users have full control over `hive.exec.scratchdir` and `fs.default.name`, so I believe it won't be a problem.
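The resolution behavior described above — a scheme-less `/tmp/hive` adopting whatever file system `fs.default.name` points at — can be illustrated without Hadoop by resolving the bare path against a base URI. A hedged sketch using only `java.net.URI` (in Hadoop the actual qualification happens through the configured default file system; the `qualify` helper below is hypothetical):

```java
import java.net.URI;

public class DefaultSchemeDemo {
    // Qualifies a possibly scheme-less path against a default filesystem URI,
    // loosely imitating how Hadoop resolves paths against fs.default.name.
    static URI qualify(String defaultFs, String path) {
        URI u = URI.create(path);
        return u.getScheme() != null ? u : URI.create(defaultFs).resolve(u);
    }

    public static void main(String[] args) {
        // With an HDFS default, the bare scratch dir becomes an HDFS path.
        System.out.println(qualify("hdfs://nn:8020/", "/tmp/hive")); // hdfs://nn:8020/tmp/hive
        // With a local default, the same value lands on the local file system.
        System.out.println(qualify("file:///", "/tmp/hive"));        // file:///tmp/hive
        // An already-qualified path keeps its own scheme either way.
        System.out.println(qualify("hdfs://nn:8020/", "s3a://bucket/tmp/hive")); // s3a://bucket/tmp/hive
    }
}
```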
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r446516892 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -97,12 +99,38 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { options = Map.empty) } - protected def getExternalTmpPath( + // Mostly copied from Context.java#getMRTmpPath of Hive 2.3. + // Visible for testing. + private[execution] def getNonBlobTmpPath( + hadoopConf: Configuration, + sessionScratchDir: String, + scratchDir: String): Path = { + +// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1', +// which is ruled by 'hive.exec.scratchdir' including file system. +// This is the same as Spark's #oldVersionExternalTempPath. +// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090. +// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir' +// Here it uses session_path unless it's emtpy, otherwise uses scratchDir. +val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir +val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath) Review comment: Hive has two kinds of scratch directories: one local, the other on HDFS. https://mingyue.me/2018/11/17/hive-scratch-working-directory/ In this pull request, the latter, `hive.exec.scratchdir`, is used. As I understand it, we can assume an HDFS scheme here.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r446482526 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive") +logDebug(s"path '${path.toString}', staging dir '$stagingDir', " + + s"scratch dir '$scratchDir' are used") if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) { oldVersionExternalTempPath(path, hadoopConf, scratchDir) } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) { - newVersionExternalTempPath(path, hadoopConf, stagingDir) + // HIVE-14270: Write temporary data to HDFS when doing inserts on tables located on S3 + // Copied from Context.java#getTempDirForPath of Hive 2.3. + if (supportSchemeToUseNonBlobStore(path)) { +// Hive sets session_path as HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config +val HDFS_SESSION_PATH_KEY = "_hive.hdfs.session.path" +val sessionScratchDir = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog] + .client.getConf(HDFS_SESSION_PATH_KEY, "") +logDebug(s"session scratch dir '$sessionScratchDir' is used") +getMRTmpPath(hadoopConf, sessionScratchDir, scratchDir) Review comment: Yes, Hive's `scratchDir` is still used in Hive 2.0 or later.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r446482475 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive") +logDebug(s"path '${path.toString}', staging dir '$stagingDir', " + + s"scratch dir '$scratchDir' are used") if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) { oldVersionExternalTempPath(path, hadoopConf, scratchDir) } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) { - newVersionExternalTempPath(path, hadoopConf, stagingDir) + // HIVE-14270: Write temporary data to HDFS when doing inserts on tables located on S3 Review comment: Added version check using `hiveVersionsSupportingNonBlobstoreTempPath`. ``` val hiveVersionsSupportingNonBlobstoreTempPath: Set[HiveVersion] = Set(v2_0, v2_1, v2_2, v2_3, v3_0, v3_1) ```
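The version check mentioned here amounts to a plain set-membership test. A minimal sketch, using version strings as hypothetical stand-ins for Spark's `HiveVersion` values (the real set holds `v2_0` .. `v3_1`):

```scala
// Stand-in for hiveVersionsSupportingNonBlobstoreTempPath; the actual Spark
// code compares HiveVersion objects, not strings.
val hiveVersionsSupportingNonBlobstoreTempPath: Set[String] =
  Set("2.0", "2.1", "2.2", "2.3", "3.0", "3.1")

def supportsNonBlobstoreTempPath(hiveVersion: String): Boolean =
  hiveVersionsSupportingNonBlobstoreTempPath.contains(hiveVersion)
```

Versions before 2.0 fall outside the set and keep the existing temp-path behavior.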
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r446482392 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -97,7 +99,34 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { options = Map.empty) } - protected def getExternalTmpPath( + // Mostly copied from Context.java#getMRTmpPath of Hive 2.3. + // Visible for testing. + private[execution] def getMRTmpPath( Review comment: Renamed this method from `getMRTmpPath` to `getNonBlobTmpPath`.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r441212073 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -839,6 +839,17 @@ object SQLConf { .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString)) .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString) + val HIVE_SUPPORTED_SCHEMES_TO_USE_NONBLOBSTORE = +buildConf("spark.sql.hive.supportedSchemesToUseNonBlobstore") + .doc("Comma-separated list of supported blobstore schemes (e.g. 's3,s3a'). " + +"If any blobstore schemes are specified, this feature is enabled. " + +"When writing data out to a Hive table, " + +"Spark writes the data first into non blobstore storage, and then moves it to blobstore. " + +"By default, this option is set to empty. It means this feature is disabled.") + .version("3.1.0") + .stringConf + .createWithDefault("") Review comment: If all the blob storage services had the same characteristics, that would be possible. However, we cannot actually guarantee it. I believe it is better to allow users to enable/disable this feature per scheme.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439917190 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive") +logDebug(s"path '${path.toString}', staging dir '$stagingDir', " + + s"scratch dir '$scratchDir' are used") if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) { oldVersionExternalTempPath(path, hadoopConf, scratchDir) } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) { Review comment: Got it. I added the description "This option is supported in Hive 2.0 or later." in SQLConf.scala. https://github.com/apache/spark/pull/27690/files#diff-9a6b543db706f1a90f790783d6930a13R849
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439913882 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -839,6 +839,17 @@ object SQLConf { .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString)) .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString) + val HIVE_SUPPORTED_SCHEMES_TO_USE_NONBLOBSTORE = +buildConf("spark.sql.hive.supportedSchemesToUseNonBlobstore") + .doc("Comma-separated list of supported blobstore schemes (e.g. 's3,s3a'). " + +"If any blobstore schemes are specified, this feature is enabled. " + +"When writing data out to a Hive table, " + +"Spark writes the data first into non blobstore storage, and then moves it to blobstore. " + +"By default, this option is set to empty. It means this feature is disabled.") + .version("3.1.0") + .stringConf + .createWithDefault("") Review comment: Note: I am not 100% sure whether all these blob storage systems have similar characteristics, or whether this option is effective for all of them. At least, this option is effective for Amazon S3.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439913383 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -839,6 +839,17 @@ object SQLConf { .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString)) .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString) + val HIVE_SUPPORTED_SCHEMES_TO_USE_NONBLOBSTORE = +buildConf("spark.sql.hive.supportedSchemesToUseNonBlobstore") + .doc("Comma-separated list of supported blobstore schemes (e.g. 's3,s3a'). " + +"If any blobstore schemes are specified, this feature is enabled. " + +"When writing data out to a Hive table, " + +"Spark writes the data first into non blobstore storage, and then moves it to blobstore. " + +"By default, this option is set to empty. It means this feature is disabled.") + .version("3.1.0") + .stringConf + .createWithDefault("") Review comment: Users can specify any blob storage scheme like the following. If the copy operation is expensive in the storage system, this option will be effective.
- Amazon S3: `s3`, `s3a`, `s3n`
- Azure Blob Storage: `wasb`, `wasbs`
- Google Cloud Storage: `gs`
- Databricks: `dbfs`
- OpenStack: `swift`
Since any scheme can be used, I believe we cannot define a fixed set of supported schemes here. That's why I just listed samples in SQLConf.scala.
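The per-scheme opt-in discussed in this comment boils down to parsing the comma-separated config value into a set and testing the destination path's scheme against it. A minimal sketch, with a hypothetical `isBlobstorePath` helper standing in for the PR's actual check:

```scala
import java.net.URI

// Parse a value like "s3,s3a,wasb" (mimicking the proposed
// spark.sql.hive.supportedSchemesToUseNonBlobstore) into a set, then test a
// path's scheme against it. An empty value yields an empty set, so the
// feature stays disabled, matching the documented default.
def isBlobstorePath(path: String, supportedSchemes: String): Boolean = {
  val schemes = supportedSchemes.split(",")
    .map(_.trim.toLowerCase)
    .filter(_.nonEmpty)
    .toSet
  val scheme = Option(new URI(path).getScheme).getOrElse("").toLowerCase
  schemes.contains(scheme)
}
```

A schemeless path (e.g. a plain HDFS default-filesystem path) resolves to `""` and is never treated as blobstore storage.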
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439706256 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -97,7 +99,34 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { options = Map.empty) } - protected def getExternalTmpPath( + // Mostly copied from Context.java#getMRTmpPath of Hive 2.3. + // Visible for testing. + private[execution] def getMRTmpPath( + hadoopConf: Configuration, + sessionScratchDir: String, + scratchDir: String): Path = { + +// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1', +// which is ruled by 'hive.exec.scratchdir' including file system. +// This is the same as Spark's #oldVersionExternalTempPath. +// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090. +// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir' +// Here it uses session_path unless it's emtpy, otherwise uses scratchDir. +val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir +val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath) Review comment: Several HDFS scratch directories are created when SessionState starts. If the session scratch directory is created at the path specified in `_hive.hdfs.session.path`, that directory should be used. If it is not specified, then we just use `scratchDir`.
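The fallback order described in this comment (use the session scratch dir from `_hive.hdfs.session.path` when non-empty, else `hive.exec.scratchdir`, then append Hive's `-mr-1` suffix) can be sketched with plain strings. The real code routes through `oldVersionExternalTempPath`, which inserts further path components, so this is only an approximation of the selection logic:

```scala
// Approximate sketch using strings instead of org.apache.hadoop.fs.Path.
// "-mr-1" mirrors the suffix used by Hive's Context#getMRTmpPath.
def mrTmpPath(sessionScratchDir: String, scratchDir: String): String = {
  val sessionPath =
    if (sessionScratchDir.nonEmpty) sessionScratchDir else scratchDir
  s"${sessionPath.stripSuffix("/")}/-mr-1"
}
```

The empty-string check matches how `_hive.hdfs.session.path` is read with a `""` default when the session path was never set.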
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439705159 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive") +logDebug(s"path '${path.toString}', staging dir '$stagingDir', " + + s"scratch dir '$scratchDir' are used") if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) { oldVersionExternalTempPath(path, hadoopConf, scratchDir) } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) { Review comment: It is because old versions do not support data copy between different types of DFSs.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r411040014 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -751,6 +751,22 @@ object SQLConf { .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString)) .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString) + val HIVE_BLOBSTORE_SUPPORTED_SCHEMES = +buildConf("spark.sql.hive.blobstore.supported.schemes") + .doc("Comma-separated list of supported blobstore schemes.") Review comment: Added the lines into the new configurations.
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage URL: https://github.com/apache/spark/pull/27690#discussion_r396830941 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -751,6 +751,20 @@ object SQLConf { .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString)) .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString) + val HIVE_BLOBSTORE_SUPPORTED_SCHEMES = +buildConf("spark.sql.hive.blobstore.supported.schemes") + .doc("Comma-separated list of supported blobstore schemes.") Review comment: Modified.
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage URL: https://github.com/apache/spark/pull/27690#discussion_r396830905 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { options = Map.empty) } - protected def getExternalTmpPath( + /* + * Mostly copied from Context.java#getMRTmpPath of Hive 2.3 + * + */ + def getMRTmpPath( + hadoopConf: Configuration, + sessionScratchDir: String, + scratchDir: String): Path = { + +// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1', +// which is ruled by 'hive.exec.scratchdir' including file system. +// This is the same as Spark's #oldVersionExternalTempPath +// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090 +// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir' +// Here it uses session_path unless it's emtpy, otherwise uses scratchDir +val sessionPath = Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir) Review comment: Modified.
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage URL: https://github.com/apache/spark/pull/27690#discussion_r396829010 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { options = Map.empty) } - protected def getExternalTmpPath( + /* + * Mostly copied from Context.java#getMRTmpPath of Hive 2.3 + * + */ + def getMRTmpPath( + hadoopConf: Configuration, + sessionScratchDir: String, + scratchDir: String): Path = { + +// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1', +// which is ruled by 'hive.exec.scratchdir' including file system. +// This is the same as Spark's #oldVersionExternalTempPath +// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090 +// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir' +// Here it uses session_path unless it's emtpy, otherwise uses scratchDir +val sessionPath = Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir) +logDebug(s"session path '${sessionPath.toString}' is used") + +val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath) +logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used") +new Path(mrScratchDir, "-mr-1") + } + + private def isBlobStoragePath(path: Path): Boolean = { +path != null && isBlobStorageScheme(Option(path.toUri.getScheme).getOrElse("")) Review comment: I believe that there are only two cases where S3 is used as scratch dir even for S3 path with `spark.sql.hive.blobstore.use.blobstore.as.scratchdir=false`. - When `_hive.hdfs.session.path` is set to `s3` - When `_hive.hdfs.session.path` is not set and `hive.exec.scratchdir` is set to `s3` In both cases, these params can be configured by users, so it won't be an issue. 
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage URL: https://github.com/apache/spark/pull/27690#discussion_r396799099 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { options = Map.empty) } - protected def getExternalTmpPath( + /* + * Mostly copied from Context.java#getMRTmpPath of Hive 2.3 + * + */ + def getMRTmpPath( + hadoopConf: Configuration, + sessionScratchDir: String, + scratchDir: String): Path = { + +// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1', +// which is ruled by 'hive.exec.scratchdir' including file system. +// This is the same as Spark's #oldVersionExternalTempPath +// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090 +// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir' +// Here it uses session_path unless it's emtpy, otherwise uses scratchDir +val sessionPath = Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir) +logDebug(s"session path '${sessionPath.toString}' is used") + +val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath) +logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used") +new Path(mrScratchDir, "-mr-1") + } + + private def isBlobStoragePath(path: Path): Boolean = { +path != null && isBlobStorageScheme(Option(path.toUri.getScheme).getOrElse("")) Review comment: As far as I tried, it worked even when `fs.default.name` is set to `s3`.
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage URL: https://github.com/apache/spark/pull/27690#discussion_r396289647 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { options = Map.empty) } - protected def getExternalTmpPath( + /* + * Mostly copied from Context.java#getMRTmpPath of Hive 2.3 + * + */ + def getMRTmpPath( + hadoopConf: Configuration, + sessionScratchDir: String, + scratchDir: String): Path = { + +// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1', +// which is ruled by 'hive.exec.scratchdir' including file system. +// This is the same as Spark's #oldVersionExternalTempPath +// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090 +// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir' +// Here it uses session_path unless it's emtpy, otherwise uses scratchDir +val sessionPath = Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir) Review comment: Thank you for the comment. As you said, `Option` is not needed here. I will rewrite this as `val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir`.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r396285657

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -125,10 +163,22 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")

+    // Hive sets session_path as HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+    val sessionScratchDir = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+      .client.getConf("_hive.hdfs.session.path", "")

Review comment: As you mentioned in another comment, an empty string is retrieved when `_hive.hdfs.session.path` is not set.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r396284464

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

## @@ -751,6 +751,20 @@ object SQLConf {
       .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
       .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)

+  val HIVE_BLOBSTORE_SUPPORTED_SCHEMES =
+    buildConf("spark.sql.hive.blobstore.supported.schemes")
+      .doc("Comma-separated list of supported blobstore schemes.")

Review comment: Thank you for the comment. The current explanation is the same sentence as Hive's. Based on your comment, I will add an explanation like: `If you disable this parameter, Spark writes the data into the scratch dir first and then moves it to the blobstore, because moving data on the blobstore is expensive.`
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r396179637

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -125,10 +163,22 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")

+    // Hive sets session_path as HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+    val sessionScratchDir = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+      .client.getConf("_hive.hdfs.session.path", "")

Review comment: It is tested in this unit test: https://github.com/apache/spark/pull/27690/files#diff-ee422d26750ba346c81b7f85b4b14577R46
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r396178891

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -125,10 +163,22 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")

+    // Hive sets session_path as HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+    val sessionScratchDir = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+      .client.getConf("_hive.hdfs.session.path", "")

Review comment: If `_hive.hdfs.session.path` is empty, `getMRTmpPath` uses `scratchDir` instead of `sessionScratchDir` in this line: `val sessionPath = Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir)`.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r387429262

## File path: sql/hive/pom.xml

## @@ -189,6 +189,11 @@
       <artifactId>scalacheck_${scala.binary.version}</artifactId>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>org.mockito</groupId>
+      <artifactId>mockito-core</artifactId>
+      <scope>test</scope>
+    </dependency>

(Note: the XML element tags above were stripped by the archive's HTML extraction; they are restored here following standard Maven dependency structure.)

Review comment: Thanks, I removed it.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r386197390

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -125,10 +162,23 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")

+    // Hive sets session_path as HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+    val sessionScratchDir = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+      .client.getConf("_hive.hdfs.session.path", "")
+    logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+      s"scratch dir '$scratchDir', session scratch dir '$sessionScratchDir' are used")
+
     if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
       oldVersionExternalTempPath(path, hadoopConf, scratchDir)
     } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
-      newVersionExternalTempPath(path, hadoopConf, stagingDir)
+      // HIVE-14270: Write temporary data to HDFS when doing inserts on tables located on S3
+      // Copied from Context.java#getTempDirForPath of Hive 2.3
+      if (isBlobStoragePath(hadoopConf, path)
+        && !useBlobStorageAsScratchDir(hadoopConf)) {
+        getMRTmpPath(hadoopConf, sessionScratchDir, scratchDir)

Review comment: Although I used `spark.hadoop.hive.blobstore.*` in the initial commit, that was only for consistency. Since I do not have any technical reason to keep it, I changed it to use SQLConf to follow the naming rule here.
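The branch above decides between the MR scratch dir and the usual staging dir. A dependency-free sketch of that decision (the helper names and return strings here are hypothetical stand-ins; the real helpers consult `hadoopConf` and the write path's URI):

```scala
object TmpPathDecisionSketch {
  // Hypothetical stand-in for the PR's isBlobStoragePath, keyed on scheme only.
  def isBlobStoragePath(scheme: String): Boolean =
    Set("s3", "s3a", "s3n").contains(scheme)

  // Mirrors the diff's branch: write temporary data to the non-blobstore
  // MR scratch dir only when the target is a blobstore AND the blobstore
  // must not be used as scratch dir; otherwise keep the staging-dir path.
  def chooseTmpDir(scheme: String, useBlobStorageAsScratchDir: Boolean): String =
    if (isBlobStoragePath(scheme) && !useBlobStorageAsScratchDir) "mr-scratch-dir"
    else "staging-dir"
}
```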
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r386185658

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFileSuite.scala

## @@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.scalatestplus.mockito.MockitoSugar
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+
+class SaveAsHiveFileSuite extends QueryTest with TestHiveSingleton with MockitoSugar {
+  test("sessionScratchDir = '/tmp/hive/user_a/session_b' & scratchDir = '/tmp/hive_scratch'") {
+    val insertIntoHiveTable = new InsertIntoHiveTable(
+      mock[CatalogTable], Map.empty[String, Option[String]],
+      mock[LogicalPlan], true, false, Seq.empty[String])

Review comment: It makes sense to me. Thanks, I did not notice it. I removed mockito and just used null per the link.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r386026628

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFileSuite.scala

## @@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.scalatest.GivenWhenThen
+import org.scalatestplus.mockito.MockitoSugar
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+
+class SaveAsHiveFileSuite extends QueryTest with TestHiveSingleton
+    with MockitoSugar with GivenWhenThen {

Review comment: In order to test this patch's exact behavior, we would need to check that the intermediate files are located in HDFS rather than in the blobstore. However, that is not easy to test, which is why I am checking each path here instead.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r386026502

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFileSuite.scala

## @@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.scalatest.GivenWhenThen
+import org.scalatestplus.mockito.MockitoSugar
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+
+class SaveAsHiveFileSuite extends QueryTest with TestHiveSingleton
+    with MockitoSugar with GivenWhenThen {
+  test("getMRTmpPath method") {
+    val insertIntoHiveTable = new InsertIntoHiveTable(
+      mock[CatalogTable], Map.empty[String, Option[String]],
+      mock[LogicalPlan], true, false, Seq.empty[String])
+    val hadoopConf = new Configuration()
+    val scratchDir = "/tmp/hive_scratch"
+    val sessionScratchDir = "/tmp/hive/user_a/session_b"
+
+    Given(s"sessionScratchDir = '$sessionScratchDir' & scratchDir = '$scratchDir'")
+    When("get the path from getMRTmpPath(hadoopConf, sessionScratchDir, scratchDir)")

Review comment: I see, I changed to use a simple `assert` instead of `GivenWhenThen`.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r386024959

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala

## @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {

Review comment: I moved the functions to private functions in `SaveAsHiveFile`.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r386023688

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -97,7 +97,30 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       options = Map.empty)
   }

-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3

Review comment: I checked `hive/Context.java` in the master branch, and I did not see much difference.
- master: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/Context.java
- release 2.3.0: https://github.com/apache/hive/blob/rel/release-2.3.0/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L588
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r385497093

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala

## @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
+  def isBlobStoragePath(
+      hadoopConf: Configuration,
+      path: Path): Boolean = {
+    path != null && isBlobStorageScheme(hadoopConf, Option(path.toUri.getScheme).getOrElse(""))
+  }
+
+  def isBlobStorageScheme(
+      hadoopConf: Configuration,
+      scheme: String): Boolean = {

Review comment: Removed unneeded line breaks.
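The null-safe scheme extraction in `isBlobStoragePath` above can be illustrated without a Hadoop dependency (a sketch using `java.net.URI` in place of Hadoop's `Path`, and a hardcoded scheme set standing in for the conf lookup):

```scala
import java.net.URI

object BlobPathSketch {
  // Stand-in for hadoopConf.get("hive.blobstore.supported.schemes", "s3,s3a,s3n").
  private val schemes = Set("s3", "s3a", "s3n")

  // Mirrors isBlobStoragePath: null-safe, and a path with no scheme
  // (e.g. a bare "/tmp/hive") defaults to "" and is not a blobstore path.
  def isBlobStoragePath(uri: URI): Boolean =
    uri != null && schemes.contains(Option(uri.getScheme).getOrElse(""))
}
```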
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r385497038

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala

## @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
+  def isBlobStoragePath(
+      hadoopConf: Configuration,
+      path: Path): Boolean = {

Review comment: Removed unneeded line breaks.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r385496980

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala

## @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
+  def isBlobStoragePath(
+      hadoopConf: Configuration,
+      path: Path): Boolean = {
+    path != null && isBlobStorageScheme(hadoopConf, Option(path.toUri.getScheme).getOrElse(""))
+  }
+
+  def isBlobStorageScheme(
+      hadoopConf: Configuration,
+      scheme: String): Boolean = {
+    val supportedBlobSchemes = hadoopConf.get("hive.blobstore.supported.schemes", "s3,s3a,s3n")
+    supportedBlobSchemes.toLowerCase(Locale.ROOT)
+      .split(",")
+      .map(_.trim)
+      .contains(scheme.toLowerCase(Locale.ROOT))

Review comment: I replaced this with `Utils.stringToSeq`.
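The split/trim/contains check being replaced can be sketched standalone (hardcoding the `s3,s3a,s3n` default in place of the Hadoop conf lookup; `Utils.stringToSeq` is Spark-internal, and the trimmed comma-split below mirrors the behavior being swapped in):

```scala
import java.util.Locale

object BlobSchemeSketch {
  // Stand-in for hadoopConf.get("hive.blobstore.supported.schemes", "s3,s3a,s3n").
  val supportedBlobSchemes = "s3,s3a,s3n"

  // Case-insensitive membership test over the comma-separated scheme list,
  // matching the diff above.
  def isBlobStorageScheme(scheme: String): Boolean =
    supportedBlobSchemes.toLowerCase(Locale.ROOT)
      .split(",")
      .map(_.trim)
      .contains(scheme.toLowerCase(Locale.ROOT))
}
```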
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r385496854

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -124,11 +147,26 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+    logDebug(s"path '${path.toString}' is used")
+    logDebug(s"staging dir '$stagingDir' is used")
+    logDebug(s"scratch dir '$scratchDir' is used")

Review comment: Yes, I packed these four (3+1) `logDebug` calls into one.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r385160626

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -97,7 +97,30 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       options = Map.empty)
   }

-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3

Review comment: I believe that currently Spark's default Hive version is 2.3, so I just used the same version of Hive here for simplicity.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r385160313

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -97,7 +97,30 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       options = Map.empty)
   }

-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3

Review comment: I believe that currently Spark's default Hive version is 2.3, so I just used the same version of Hive here for simplicity.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r384856247

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala

## @@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
+  def isBlobStoragePath(
+      hadoopConf: Configuration,
+      path: Path): Boolean = {
+    path != null && isBlobStorageScheme(hadoopConf, Option(path.toUri.getScheme).getOrElse(""))
+  }
+
+  def isBlobStorageScheme(
+      hadoopConf: Configuration,
+      scheme: String): Boolean = {
+    val supportedBlobSchemes = hadoopConf.get("hive.blobstore.supported.schemes", "s3,s3a,s3n")
+    supportedBlobSchemes.toLowerCase(Locale.ROOT)
+      .split(",")
+      .map(_.trim)
+      .toList

Review comment: Thanks for the comment. Right, we do not need `toList`. I removed it and confirmed that the unit test succeeds.
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r384853033

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala

## @@ -97,7 +97,30 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       options = Map.empty)
   }

-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3

Review comment: Yes, it is coming from this: https://github.com/apache/hive/blob/rel/release-2.3.0/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L588