[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r90782492
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
+return executionId
+  }
+
+  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): 
Path = {
+val inputPathUri: URI = inputPath.toUri
+val inputPathName: String = inputPathUri.getPath
+val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
+val stagingPathName: String =
+  if (inputPathName.indexOf(stagingDir) == -1) {
+new Path(inputPathName, stagingDir).toString
+  } else {
+inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
stagingDir.length)
+  }
+val dir: Path =
+  fs.makeQualified(
+new Path(stagingPathName + "_" + executionId + "-" + 
TaskRunner.getTaskRunnerID))
+logDebug("Created staging dir = " + dir + " for path = " + inputPath)
+try {
+  if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
+throw new IllegalStateException("Cannot create staging directory  
'" + dir.toString + "'")
+  }
+  fs.deleteOnExit(dir)
+}
+catch {
+  case e: IOException =>
+throw new RuntimeException(
--- End diff --

Almost all the codes in this PR are copied from the existing master. This 
PR is just for branch 1.6


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-19 Thread fidato13
Github user fidato13 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r88778913
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
+return executionId
--- End diff --

Cheers.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-19 Thread fidato13
Github user fidato13 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r88778929
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
+return executionId
+  }
+
+  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): 
Path = {
+val inputPathUri: URI = inputPath.toUri
+val inputPathName: String = inputPathUri.getPath
+val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
+val stagingPathName: String =
+  if (inputPathName.indexOf(stagingDir) == -1) {
+new Path(inputPathName, stagingDir).toString
+  } else {
+inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
stagingDir.length)
+  }
+val dir: Path =
+  fs.makeQualified(
+new Path(stagingPathName + "_" + executionId + "-" + 
TaskRunner.getTaskRunnerID))
+logDebug("Created staging dir = " + dir + " for path = " + inputPath)
+try {
+  if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
+throw new IllegalStateException("Cannot create staging directory  
'" + dir.toString + "'")
+  }
+  fs.deleteOnExit(dir)
+}
+catch {
+  case e: IOException =>
+throw new RuntimeException(
+  "Cannot create staging directory '" + dir.toString + "': " + 
e.getMessage, e)
+
+}
+return dir
--- End diff --

Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-19 Thread merlintang
Github user merlintang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r88778830
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
+return executionId
+  }
+
+  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): 
Path = {
+val inputPathUri: URI = inputPath.toUri
+val inputPathName: String = inputPathUri.getPath
+val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
+val stagingPathName: String =
+  if (inputPathName.indexOf(stagingDir) == -1) {
+new Path(inputPathName, stagingDir).toString
+  } else {
+inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
stagingDir.length)
+  }
+val dir: Path =
+  fs.makeQualified(
+new Path(stagingPathName + "_" + executionId + "-" + 
TaskRunner.getTaskRunnerID))
+logDebug("Created staging dir = " + dir + " for path = " + inputPath)
+try {
+  if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
+throw new IllegalStateException("Cannot create staging directory  
'" + dir.toString + "'")
+  }
+  fs.deleteOnExit(dir)
+}
+catch {
+  case e: IOException =>
+throw new RuntimeException(
--- End diff --

You can find the reason that we use this code is because (1) the old 
version need to use the hive package to create the staging directory, in the 
hive code, this staging directory is storied in a hash map, and then these 
staging directories would be removed when the session is closed. however, our 
spark code do not trigger the hive session close, then, these directories will 
not be removed. (2) you can find the pushed code just simulate the hive way to 
create the staging directory inside the spark rather than based on the hive. 
Then, the staging directory will be removed. (3) I will fix the return type 
issue, thanks for your comments @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-19 Thread merlintang
Github user merlintang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r88778781
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
--- End diff --

yes, it is. I am working on this way because I want to code is exactly the 
same as the spark 2.0.x version. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r88429928
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
--- End diff --

Why all this -- just us a UUID? you also have a redundant return and types 
here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r88430053
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
+return executionId
+  }
+
+  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): 
Path = {
+val inputPathUri: URI = inputPath.toUri
+val inputPathName: String = inputPathUri.getPath
+val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
+val stagingPathName: String =
+  if (inputPathName.indexOf(stagingDir) == -1) {
+new Path(inputPathName, stagingDir).toString
+  } else {
+inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
stagingDir.length)
+  }
+val dir: Path =
+  fs.makeQualified(
+new Path(stagingPathName + "_" + executionId + "-" + 
TaskRunner.getTaskRunnerID))
+logDebug("Created staging dir = " + dir + " for path = " + inputPath)
+try {
+  if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
+throw new IllegalStateException("Cannot create staging directory  
'" + dir.toString + "'")
+  }
+  fs.deleteOnExit(dir)
+}
+catch {
+  case e: IOException =>
+throw new RuntimeException(
--- End diff --

Don't use RuntimeException; why even handle this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-16 Thread merlintang
Github user merlintang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r88345264
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
+return executionId
--- End diff --

hi @fidato13 this is ok, since the part of this code is reused from spark 
2.0.2. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-11 Thread fidato13
Github user fidato13 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r87676473
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
+return executionId
--- End diff --

Can the extra noise (return statement) in scala code be removed please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-11 Thread fidato13
Github user fidato13 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15819#discussion_r87676512
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog
 
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+val rand: Random = new Random
+val format: SimpleDateFormat = new 
SimpleDateFormat("-MM-dd_HH-mm-ss_SSS")
+val executionId: String = "hive_" + format.format(new Date) + "_" + 
Math.abs(rand.nextLong)
+return executionId
+  }
+
+  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): 
Path = {
+val inputPathUri: URI = inputPath.toUri
+val inputPathName: String = inputPathUri.getPath
+val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
+val stagingPathName: String =
+  if (inputPathName.indexOf(stagingDir) == -1) {
+new Path(inputPathName, stagingDir).toString
+  } else {
+inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
stagingDir.length)
+  }
+val dir: Path =
+  fs.makeQualified(
+new Path(stagingPathName + "_" + executionId + "-" + 
TaskRunner.getTaskRunnerID))
+logDebug("Created staging dir = " + dir + " for path = " + inputPath)
+try {
+  if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
+throw new IllegalStateException("Cannot create staging directory  
'" + dir.toString + "'")
+  }
+  fs.deleteOnExit(dir)
+}
+catch {
+  case e: IOException =>
+throw new RuntimeException(
+  "Cannot create staging directory '" + dir.toString + "': " + 
e.getMessage, e)
+
+}
+return dir
--- End diff --

Can the extra noise (return statement) in scala code be removed please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...

2016-11-08 Thread merlintang
GitHub user merlintang opened a pull request:

https://github.com/apache/spark/pull/15819

[SPARK-18372][SQL].Staging directory fail to be removed 

## What changes were proposed in this pull request?

This fix is related to be bug: 
https://issues.apache.org/jira/browse/SPARK-18372 . 
The insertIntoHiveTable would generate a .staging directory, but this 
directory  fail to be removed in the end. 

## How was this patch tested?
manual tests

Author: Mingjie Tang 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/merlintang/spark branch-1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15819


commit ac65375a64c2a8a2fe019dc0e2c031f413df74b8
Author: Mingjie Tang 
Date:   2016-11-09T00:41:32Z

SPARK-18372




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org