[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r90782492

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    +    return executionId
    +  }
    +
    +  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
    +    val inputPathUri: URI = inputPath.toUri
    +    val inputPathName: String = inputPathUri.getPath
    +    val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
    +    val stagingPathName: String =
    +      if (inputPathName.indexOf(stagingDir) == -1) {
    +        new Path(inputPathName, stagingDir).toString
    +      } else {
    +        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
    +      }
    +    val dir: Path =
    +      fs.makeQualified(
    +        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
    +    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
    +    try {
    +      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
    +        throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
    +      }
    +      fs.deleteOnExit(dir)
    +    }
    +    catch {
    +      case e: IOException =>
    +        throw new RuntimeException(
    --- End diff --

    Almost all of the code in this PR is copied from the existing master. This PR is just for branch 1.6.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user fidato13 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r88778913

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    +    return executionId
    --- End diff --

    Cheers.
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user fidato13 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r88778929

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    +    return executionId
    +  }
    +
    +  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
    +    val inputPathUri: URI = inputPath.toUri
    +    val inputPathName: String = inputPathUri.getPath
    +    val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
    +    val stagingPathName: String =
    +      if (inputPathName.indexOf(stagingDir) == -1) {
    +        new Path(inputPathName, stagingDir).toString
    +      } else {
    +        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
    +      }
    +    val dir: Path =
    +      fs.makeQualified(
    +        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
    +    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
    +    try {
    +      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
    +        throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
    +      }
    +      fs.deleteOnExit(dir)
    +    }
    +    catch {
    +      case e: IOException =>
    +        throw new RuntimeException(
    +          "Cannot create staging directory '" + dir.toString + "': " + e.getMessage, e)
    +
    +    }
    +    return dir
    --- End diff --

    Thanks.
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user merlintang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r88778830

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    +    return executionId
    +  }
    +
    +  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
    +    val inputPathUri: URI = inputPath.toUri
    +    val inputPathName: String = inputPathUri.getPath
    +    val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
    +    val stagingPathName: String =
    +      if (inputPathName.indexOf(stagingDir) == -1) {
    +        new Path(inputPathName, stagingDir).toString
    +      } else {
    +        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
    +      }
    +    val dir: Path =
    +      fs.makeQualified(
    +        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
    +    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
    +    try {
    +      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
    +        throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
    +      }
    +      fs.deleteOnExit(dir)
    +    }
    +    catch {
    +      case e: IOException =>
    +        throw new RuntimeException(
    --- End diff --

    We use this code for the following reasons:

    (1) The old version relies on the Hive package to create the staging directory. In the Hive code, the staging directory is stored in a hash map, and these staging directories are removed when the session is closed. However, our Spark code does not trigger the Hive session close, so these directories are never removed.
    (2) The pushed code simulates the Hive way of creating the staging directory inside Spark rather than relying on Hive, so the staging directory will be removed.
    (3) I will fix the return type issue. Thanks for your comments, @srowen.
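The failure mode described in point (1) can be modelled in a few lines. This is only a sketch under stated assumptions: `StagingRegistry`, `stagingDirFor`, and `close` are hypothetical names, and `java.nio.file` stands in for Hadoop's `FileSystem`; it is not Hive's actual implementation.

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable

// Hypothetical model of the behaviour described in the comment above;
// these are not Hive's real classes.
final class StagingRegistry {
  // Staging directories are remembered in a map keyed by the target path.
  private val dirs = mutable.Map.empty[String, Path]

  // Create the staging directory for a key on first use, then reuse it.
  def stagingDirFor(key: String): Path =
    dirs.getOrElseUpdate(key, Files.createTempDirectory(".hive-staging_"))

  // Cleanup happens only here. If close() is never called, every
  // registered directory is left behind on disk.
  def close(): Unit = {
    dirs.values.foreach(p => Files.deleteIfExists(p))
    dirs.clear()
  }
}
```

In this model, a caller that never invokes `close()` leaks every directory it asked for, which appears to be the situation the PR works around by creating the directory in Spark and registering it for deletion directly.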
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user merlintang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r88778781

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    --- End diff --

    Yes, it is. I am doing it this way because I want the code to be exactly the same as in the Spark 2.0.x version.
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r88429928

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    --- End diff --

    Why all this? Just use a UUID. You also have a redundant return and redundant types here.
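For illustration, the suggestion might look roughly like the sketch below. `ExecutionIds` is a hypothetical wrapper object, and this is not the code that was merged; the author replies elsewhere in the thread that the original form was kept to match Spark 2.0.x.

```scala
import java.util.UUID

// Hypothetical helper illustrating the review suggestion; not the merged code.
object ExecutionIds {
  // The last expression of a Scala method is its result, so no `return`
  // is needed, and the local type annotations can be inferred.
  def executionId(): String = "hive_" + UUID.randomUUID().toString
}
```

A side effect of this form is that it avoids `Math.abs(rand.nextLong)`, which yields a negative number in the rare case that `nextLong` returns `Long.MinValue`.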
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r88430053

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    +    return executionId
    +  }
    +
    +  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
    +    val inputPathUri: URI = inputPath.toUri
    +    val inputPathName: String = inputPathUri.getPath
    +    val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
    +    val stagingPathName: String =
    +      if (inputPathName.indexOf(stagingDir) == -1) {
    +        new Path(inputPathName, stagingDir).toString
    +      } else {
    +        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
    +      }
    +    val dir: Path =
    +      fs.makeQualified(
    +        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
    +    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
    +    try {
    +      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
    +        throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
    +      }
    +      fs.deleteOnExit(dir)
    +    }
    +    catch {
    +      case e: IOException =>
    +        throw new RuntimeException(
    --- End diff --

    Don't use RuntimeException; why even handle this?
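One reading of this comment is that the `IOException` can simply propagate to the caller instead of being wrapped. The sketch below shows that shape under stated assumptions: `StagingDirs` and `createStagingDir` are hypothetical names, and `java.nio.file.Files` stands in for the Hadoop `FileSystem` and Hive `FileUtils` calls in the diff.

```scala
import java.io.IOException
import java.nio.file.{Files, Path}

// Hypothetical stand-in for getStagingDir; uses java.nio instead of Hadoop.
object StagingDirs {
  // Files.createDirectories throws IOException on failure. There is no
  // try/catch here: the original exception reaches the caller unchanged.
  @throws[IOException]
  def createStagingDir(parent: Path, name: String): Path = {
    val dir = parent.resolve(name)
    Files.createDirectories(dir)
    dir
  }
}
```

Scala has no checked exceptions, so the `@throws` annotation is documentation for Java callers rather than a requirement.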
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user merlintang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r88345264

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    +    return executionId
    --- End diff --

    Hi @fidato13, this is OK, since this part of the code is reused from Spark 2.0.2.
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user fidato13 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r87676473

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    +    return executionId
    --- End diff --

    Can the extra noise (the return statement) in the Scala code be removed, please?
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user fidato13 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15819#discussion_r87676512

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
       @transient private lazy val hiveContext = new Context(sc.hiveconf)
       @transient private lazy val catalog = sc.catalog

    +  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
    +
    +  private def executionId: String = {
    +    val rand: Random = new Random
    +    val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
    +    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
    +    return executionId
    +  }
    +
    +  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
    +    val inputPathUri: URI = inputPath.toUri
    +    val inputPathName: String = inputPathUri.getPath
    +    val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
    +    val stagingPathName: String =
    +      if (inputPathName.indexOf(stagingDir) == -1) {
    +        new Path(inputPathName, stagingDir).toString
    +      } else {
    +        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
    +      }
    +    val dir: Path =
    +      fs.makeQualified(
    +        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
    +    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
    +    try {
    +      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
    +        throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
    +      }
    +      fs.deleteOnExit(dir)
    +    }
    +    catch {
    +      case e: IOException =>
    +        throw new RuntimeException(
    +          "Cannot create staging directory '" + dir.toString + "': " + e.getMessage, e)
    +
    +    }
    +    return dir
    --- End diff --

    Can the extra noise (the return statement) in the Scala code be removed, please?
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
GitHub user merlintang opened a pull request:

    https://github.com/apache/spark/pull/15819

    [SPARK-18372][SQL].Staging directory fail to be removed

    ## What changes were proposed in this pull request?

    This fix is related to the bug https://issues.apache.org/jira/browse/SPARK-18372.
    InsertIntoHiveTable generates a .staging directory, but this directory fails to be removed in the end.

    ## How was this patch tested?

    Manual tests.

    Author: Mingjie Tang

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/merlintang/spark branch-1.6

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15819

----
commit ac65375a64c2a8a2fe019dc0e2c031f413df74b8
Author: Mingjie Tang
Date:   2016-11-09T00:41:32Z

    SPARK-18372

----