[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16104


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16104#discussion_r91886939
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -121,21 +121,61 @@ case class InsertIntoHiveTable(
     return dir
   }

-  private def getExternalScratchDir(extURI: URI, hadoopConf: Configuration): Path = {
-    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath), hadoopConf)
+  private def getExternalScratchDir(extURI: URI): Path = {
+    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath))
   }

-  def getExternalTmpPath(path: Path, hadoopConf: Configuration): Path = {
+  def getExternalTmpPath(path: Path): Path = {
+    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version.fullVersion
+    if (hiveVersion.startsWith("0.12") ||
+        hiveVersion.startsWith("0.13") ||
+        hiveVersion.startsWith("0.14") ||
+        hiveVersion.startsWith("1.0")) {
+      oldStyleExternalTempPath(path)
+    } else if (hiveVersion.startsWith("1.1") || hiveVersion.startsWith("1.2")) {
+      newStyleExternalTempPath(path)
+    } else {
+      throw new IllegalStateException("Unsupported hive version: " + hiveVersion)
--- End diff --

uh, I see. Thanks!





[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16104#discussion_r91886458
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -121,21 +121,61 @@ case class InsertIntoHiveTable(
     return dir
   }

-  private def getExternalScratchDir(extURI: URI, hadoopConf: Configuration): Path = {
-    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath), hadoopConf)
+  private def getExternalScratchDir(extURI: URI): Path = {
+    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath))
   }

-  def getExternalTmpPath(path: Path, hadoopConf: Configuration): Path = {
+  def getExternalTmpPath(path: Path): Path = {
+    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version.fullVersion
+    if (hiveVersion.startsWith("0.12") ||
+        hiveVersion.startsWith("0.13") ||
+        hiveVersion.startsWith("0.14") ||
+        hiveVersion.startsWith("1.0")) {
+      oldStyleExternalTempPath(path)
+    } else if (hiveVersion.startsWith("1.1") || hiveVersion.startsWith("1.2")) {
+      newStyleExternalTempPath(path)
+    } else {
+      throw new IllegalStateException("Unsupported hive version: " + hiveVersion)
--- End diff --

We will fail in other places anyway, e.g. `IsolatedClientLoader.hiveVersion`.





[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16104#discussion_r91861824
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -121,21 +121,61 @@ case class InsertIntoHiveTable(
     return dir
   }

-  private def getExternalScratchDir(extURI: URI, hadoopConf: Configuration): Path = {
-    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath), hadoopConf)
+  private def getExternalScratchDir(extURI: URI): Path = {
+    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath))
   }

-  def getExternalTmpPath(path: Path, hadoopConf: Configuration): Path = {
+  def getExternalTmpPath(path: Path): Path = {
+    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version.fullVersion
+    if (hiveVersion.startsWith("0.12") ||
+        hiveVersion.startsWith("0.13") ||
+        hiveVersion.startsWith("0.14") ||
+        hiveVersion.startsWith("1.0")) {
+      oldStyleExternalTempPath(path)
+    } else if (hiveVersion.startsWith("1.1") || hiveVersion.startsWith("1.2")) {
+      newStyleExternalTempPath(path)
+    } else {
+      throw new IllegalStateException("Unsupported hive version: " + hiveVersion)
--- End diff --

Do we have to throw an exception here? It sounds like this is the only place where we block it. Do you think users might still be able to use a higher version, even though it might have a few issues?





[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16104#discussion_r91853527
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -86,14 +85,15 @@ case class InsertIntoHiveTable(

   val hadoopConf = sessionState.newHadoopConf()
   val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
+  val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
--- End diff --

Yeah, that's true, but I don't think the default value here is a big deal or something we need to worry about.





[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16104#discussion_r90771606
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -121,21 +121,61 @@ case class InsertIntoHiveTable(
     return dir
   }

-  private def getExternalScratchDir(extURI: URI, hadoopConf: Configuration): Path = {
-    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath), hadoopConf)
+  private def getExternalScratchDir(extURI: URI): Path = {
+    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath))
   }

-  def getExternalTmpPath(path: Path, hadoopConf: Configuration): Path = {
+  def getExternalTmpPath(path: Path): Path = {
+    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version.fullVersion
+    if (hiveVersion.startsWith("0.12") ||
+        hiveVersion.startsWith("0.13") ||
+        hiveVersion.startsWith("0.14") ||
+        hiveVersion.startsWith("1.0")) {
+      oldStyleExternalTempPath(path)
+    } else if (hiveVersion.startsWith("1.1") || hiveVersion.startsWith("1.2")) {
+      newStyleExternalTempPath(path)
+    } else {
+      throw new IllegalStateException("Unsupported hive version: " + hiveVersion)
+    }
+  }
+
+  // Mostly copied from Context.java#getExternalTmpPath of Hive 0.13
+  def oldStyleExternalTempPath(path: Path): Path = {
+    val extURI: URI = path.toUri
+    val scratchPath = new Path(scratchDir, executionId)
+    var dirPath = new Path(
+      extURI.getScheme,
+      extURI.getAuthority,
+      scratchPath.toUri.getPath + "-" + TaskRunner.getTaskRunnerID())
+
+    try {
+      val fs: FileSystem = dirPath.getFileSystem(hadoopConf)
+      dirPath = new Path(fs.makeQualified(dirPath).toString())
+
+      if (!FileUtils.mkdir(fs, dirPath, true, hadoopConf)) {
--- End diff --

My concern above is not directly related to this PR. I just submitted https://github.com/apache/spark/pull/16134 to resolve the existing issue; I think it is a very serious bug.





[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16104#discussion_r90740642
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -121,21 +121,61 @@ case class InsertIntoHiveTable(
     return dir
   }

-  private def getExternalScratchDir(extURI: URI, hadoopConf: Configuration): Path = {
-    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath), hadoopConf)
+  private def getExternalScratchDir(extURI: URI): Path = {
+    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath))
   }

-  def getExternalTmpPath(path: Path, hadoopConf: Configuration): Path = {
+  def getExternalTmpPath(path: Path): Path = {
+    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version.fullVersion
+    if (hiveVersion.startsWith("0.12") ||
+        hiveVersion.startsWith("0.13") ||
+        hiveVersion.startsWith("0.14") ||
+        hiveVersion.startsWith("1.0")) {
+      oldStyleExternalTempPath(path)
+    } else if (hiveVersion.startsWith("1.1") || hiveVersion.startsWith("1.2")) {
+      newStyleExternalTempPath(path)
+    } else {
+      throw new IllegalStateException("Unsupported hive version: " + hiveVersion)
+    }
+  }
+
+  // Mostly copied from Context.java#getExternalTmpPath of Hive 0.13
+  def oldStyleExternalTempPath(path: Path): Path = {
+    val extURI: URI = path.toUri
+    val scratchPath = new Path(scratchDir, executionId)
+    var dirPath = new Path(
+      extURI.getScheme,
+      extURI.getAuthority,
+      scratchPath.toUri.getPath + "-" + TaskRunner.getTaskRunnerID())
+
+    try {
+      val fs: FileSystem = dirPath.getFileSystem(hadoopConf)
+      dirPath = new Path(fs.makeQualified(dirPath).toString())
+
+      if (!FileUtils.mkdir(fs, dirPath, true, hadoopConf)) {
--- End diff --

Keeping these useless directories and files doesn't make sense, right? Many files could accumulate.





[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16104#discussion_r90721359
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -121,21 +121,61 @@ case class InsertIntoHiveTable(
     return dir
   }

-  private def getExternalScratchDir(extURI: URI, hadoopConf: Configuration): Path = {
-    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath), hadoopConf)
+  private def getExternalScratchDir(extURI: URI): Path = {
+    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath))
   }

-  def getExternalTmpPath(path: Path, hadoopConf: Configuration): Path = {
+  def getExternalTmpPath(path: Path): Path = {
+    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version.fullVersion
+    if (hiveVersion.startsWith("0.12") ||
+        hiveVersion.startsWith("0.13") ||
+        hiveVersion.startsWith("0.14") ||
+        hiveVersion.startsWith("1.0")) {
+      oldStyleExternalTempPath(path)
+    } else if (hiveVersion.startsWith("1.1") || hiveVersion.startsWith("1.2")) {
+      newStyleExternalTempPath(path)
+    } else {
+      throw new IllegalStateException("Unsupported hive version: " + hiveVersion)
+    }
+  }
+
+  // Mostly copied from Context.java#getExternalTmpPath of Hive 0.13
+  def oldStyleExternalTempPath(path: Path): Path = {
+    val extURI: URI = path.toUri
+    val scratchPath = new Path(scratchDir, executionId)
+    var dirPath = new Path(
+      extURI.getScheme,
+      extURI.getAuthority,
+      scratchPath.toUri.getPath + "-" + TaskRunner.getTaskRunnerID())
+
+    try {
+      val fs: FileSystem = dirPath.getFileSystem(hadoopConf)
+      dirPath = new Path(fs.makeQualified(dirPath).toString())
+
+      if (!FileUtils.mkdir(fs, dirPath, true, hadoopConf)) {
--- End diff --

Ah, we drop it after normal termination of the VM by calling `fs.deleteOnExit`.





[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16104#discussion_r90682972
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -86,14 +85,15 @@ case class InsertIntoHiveTable(

   val hadoopConf = sessionState.newHadoopConf()
   val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
+  val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
--- End diff --
--- End diff --

The default value of `hive.exec.scratchdir` also depends on the version:
> Default Value: /tmp/${user.name} in Hive 0.2.0 through 0.8.0; /tmp/hive-${user.name} in Hive 0.8.1 through 0.14.0; or /tmp/hive in Hive 0.14.0 and later






[GitHub] spark pull request #16104: [SPARK-18675][SQL] CTAS for hive serde table shou...

2016-12-01 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/16104

[SPARK-18675][SQL] CTAS for hive serde table should work for all hive versions

## What changes were proposed in this pull request?

Before Hive 1.1, when inserting into a table, Hive creates the staging directory under a common scratch directory. After the write finishes, Hive simply empties the table directory and moves the staging directory into it.

Since Hive 1.1, Hive creates the staging directory under the table directory, and when moving the staging directory into the table directory, Hive still empties the table directory but excludes the staging directory.

In `InsertIntoHiveTable`, we simply copied the code from Hive 1.2, which means we always create the staging directory under the table directory, no matter what the Hive version is. This causes problems if the Hive version is prior to 1.1, because the staging directory will be removed by Hive while it is emptying the table directory.

This PR copies the code from Hive 0.13, so that we have two branches for creating the staging directory. If the Hive version is prior to 1.1, we take the old-style branch (i.e. create the staging directory under a common scratch directory); otherwise, we take the new-style branch (i.e. create the staging directory under the table directory).
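The version dispatch described above can be sketched as a small standalone predicate. This is only an illustration of the branching in the diff, not the actual Spark source; `StagingStyle` and `stagingStyle` are hypothetical names introduced here.

```scala
// Sketch of the Hive-version dispatch described above (illustrative names).
// Hive < 1.1 puts the staging dir under a common scratch dir ("old style");
// Hive 1.1/1.2 puts it under the table directory ("new style").
object StagingStyle extends Enumeration {
  val OldStyle, NewStyle = Value
}

def stagingStyle(hiveVersion: String): StagingStyle.Value = {
  // Matching on version prefixes mirrors the startsWith chain in the PR.
  if (Seq("0.12", "0.13", "0.14", "1.0").exists(hiveVersion.startsWith)) {
    StagingStyle.OldStyle
  } else if (Seq("1.1", "1.2").exists(hiveVersion.startsWith)) {
    StagingStyle.NewStyle
  } else {
    throw new IllegalStateException("Unsupported hive version: " + hiveVersion)
  }
}
```

For example, `stagingStyle("0.13.1")` selects the old-style path and `stagingStyle("1.2.1")` the new-style path, while an unmatched version fails fast.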

## How was this patch tested?

Added a new test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark hive-0.13

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16104.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16104


commit 410b6f06ed8e0063b987acffeaafb8e1c9cf2a4b
Author: Wenchen Fan 
Date:   2016-12-01T17:20:22Z

CTAS for hive serde table should work for all hive versions



