[GitHub] spark pull request #15667: [SPARK-18107][SQL] Insert overwrite statement run...

2016-11-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15667


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15667: [SPARK-18107][SQL] Insert overwrite statement run...

2016-10-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15667#discussion_r85861794
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -257,7 +258,31 @@ case class InsertIntoHiveTable(
             table.catalogTable.identifier.table,
             partitionSpec)
 
+        var doOverwrite = overwrite
--- End diff --

ok. updated.





[GitHub] spark pull request #15667: [SPARK-18107][SQL] Insert overwrite statement run...

2016-10-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15667#discussion_r85861722
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -257,7 +258,31 @@ case class InsertIntoHiveTable(
             table.catalogTable.identifier.table,
             partitionSpec)
 
+        var doOverwrite = overwrite
+
         if (oldPart.isEmpty || !ifNotExists) {
+          // SPARK-18107: Insert overwrite runs much slower than hive-client.
+          // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
+          // version and we may not want to catch up new Hive version every time. We delete the
+          // Hive partition first and then load data file into the Hive partition.
+          if (oldPart.nonEmpty && overwrite) {
+            oldPart.get.storage.locationUri.map { uri =>
+              val partitionPath = new Path(uri)
+              val fs = partitionPath.getFileSystem(hadoopConf)
+              if (fs.exists(partitionPath)) {
+                val pathPermission = fs.getFileStatus(partitionPath).getPermission()
+                if (!fs.delete(partitionPath, true)) {
+                  throw new RuntimeException(
+                    "Cannot remove partition directory '" + partitionPath.toString)
+                } else {
+                  fs.mkdirs(partitionPath, pathPermission)
--- End diff --

I thought Hive would complain if the directory did not exist, but it looks 
like it won't. Let me remove this and see whether the tests pass.





[GitHub] spark pull request #15667: [SPARK-18107][SQL] Insert overwrite statement run...

2016-10-31 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/15667#discussion_r85808139
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -257,7 +258,31 @@ case class InsertIntoHiveTable(
             table.catalogTable.identifier.table,
             partitionSpec)
 
+        var doOverwrite = overwrite
--- End diff --

nit: `doHiveOverwrite`?





[GitHub] spark pull request #15667: [SPARK-18107][SQL] Insert overwrite statement run...

2016-10-31 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/15667#discussion_r85807729
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -257,7 +258,31 @@ case class InsertIntoHiveTable(
             table.catalogTable.identifier.table,
             partitionSpec)
 
+        var doOverwrite = overwrite
+
         if (oldPart.isEmpty || !ifNotExists) {
+          // SPARK-18107: Insert overwrite runs much slower than hive-client.
+          // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
+          // version and we may not want to catch up new Hive version every time. We delete the
+          // Hive partition first and then load data file into the Hive partition.
+          if (oldPart.nonEmpty && overwrite) {
+            oldPart.get.storage.locationUri.map { uri =>
+              val partitionPath = new Path(uri)
+              val fs = partitionPath.getFileSystem(hadoopConf)
+              if (fs.exists(partitionPath)) {
+                val pathPermission = fs.getFileStatus(partitionPath).getPermission()
+                if (!fs.delete(partitionPath, true)) {
+                  throw new RuntimeException(
+                    "Cannot remove partition directory '" + partitionPath.toString)
+                } else {
+                  fs.mkdirs(partitionPath, pathPermission)
--- End diff --

Is the mkdir necessary?





[GitHub] spark pull request #15667: [SPARK-18107][SQL] Insert overwrite statement run...

2016-10-27 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/15667

[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql 
than it does in hive-client

## What changes were proposed in this pull request?

As reported in the JIRA, the insert overwrite statement runs much slower in 
Spark than in hive-client.

A Hive patch, 
[HIVE-11940](https://github.com/apache/hive/commit/ba21806b77287e237e1aa68fa169d2a81e07346d), 
appears to improve insert overwrite performance considerably. HIVE-11940 was 
applied after Hive 2.0.0.

Because Spark SQL uses an older Hive library, we cannot benefit from this 
improvement.

The reporter verified that there is also a big performance gap between Hive 
1.2.1 and Hive 2.0.1 on insert overwrite execution.

Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial 
task, this patch provides an approach to delete the partition before asking 
Hive to load data files into the partition.
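
The delete-then-load idea can be sketched roughly as follows. This is only an illustration, not the patch itself: it uses `java.io.File` on the local filesystem in place of the Hadoop `FileSystem` API, the names `OverwritePartitionSketch`, `deleteRecursively`, and `prepareOverwrite` are hypothetical, and it assumes that the flag handed to Hive is cleared once the old partition directory has been removed (the quoted diff hunk does not show how the flag is used).

```scala
import java.io.File

object OverwritePartitionSketch {

  // Deletes a file or a whole directory tree.
  // Returns false if any entry could not be removed.
  def deleteRecursively(f: File): Boolean = {
    val childrenDeleted =
      if (f.isDirectory) {
        Option(f.listFiles()).getOrElse(Array.empty[File]).forall(deleteRecursively)
      } else {
        true
      }
    childrenDeleted && f.delete()
  }

  // Removes the old partition directory up front and returns whether the
  // downstream loader (Hive, in the PR) still has to do the overwrite itself.
  // Assumption: once the directory is gone, a plain load is sufficient.
  def prepareOverwrite(partitionDir: Option[File], overwrite: Boolean): Boolean = {
    var doHiveOverwrite = overwrite
    if (overwrite) {
      partitionDir.foreach { dir =>
        if (dir.exists()) {
          if (!deleteRecursively(dir)) {
            throw new RuntimeException(s"Cannot remove partition directory '$dir'")
          }
          // The old files are already gone, so skip the slower overwrite path.
          doHiveOverwrite = false
        }
      }
    }
    doHiveOverwrite
  }
}
```

If the partition has no known location, or its directory does not exist, the sketch leaves the flag unchanged, so the loader still performs the overwrite.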

Note: since `Hive.loadTable` also uses the function to replace files, it 
should have the same issue. We can take the same approach and delete the table 
first. I will update this PR to include that.

## How was this patch tested?

Jenkins tests.

There are existing tests that use the insert overwrite statement; those tests 
should pass. I added a new test that specifically covers insert overwrite into 
a partition.

As for performance, I don't have a Hive 2.0 environment, so the reporter needs 
to verify this patch. Please refer to the JIRA.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 improve-hive-insertoverwrite

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15667.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15667


commit 81dbeb19e61a67a287a5762e391517eb55a20721
Author: Liang-Chi Hsieh 
Date:   2016-10-27T09:29:16Z

Drop partition before insert overwrite to Hive table.



