[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r143624181
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,26 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+fileSinkConf.tableInfo.getOutputFileFormatClassName match {
+  case formatName if formatName.endsWith("ParquetOutputFormat") =>
--- End diff --

Sounds like a good idea.


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r143624196
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,26 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+fileSinkConf.tableInfo.getOutputFileFormatClassName match {
+  case formatName if formatName.endsWith("ParquetOutputFormat") =>
+val compressionConf = "parquet.compression"
+val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+  sparkSession.sessionState.conf.parquetCompressionCodec) match {
--- End diff --

`compressionConf` will be used below; I've adjusted the format, thanks.


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r143624210
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,26 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+fileSinkConf.tableInfo.getOutputFileFormatClassName match {
+  case formatName if formatName.endsWith("ParquetOutputFormat") =>
+val compressionConf = "parquet.compression"
+val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+  sparkSession.sessionState.conf.parquetCompressionCodec) match {
+  case "NONE" => "UNCOMPRESSED"
+  case _@x => x
+}
+hadoopConf.set(compressionConf, compressionCodec)
+  case formatName if formatName.endsWith("OrcOutputFormat") =>
+val compressionConf = "orc.compress"
+val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+  sparkSession.sessionState.conf.orcCompressionCodec) match {
+  case "UNCOMPRESSED" => "NONE"
+  case _@x => x
--- End diff --

In fact, the subsequent processing will check the correctness of this value, and because `OrcOptions` is not accessible here, I have to add the "UNCOMPRESSED" => "NONE" conversion.
Do you have any good advice?
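
To be concrete, the conversion I mean is roughly this (an illustrative sketch only, not the PR code): Parquet spells the no-compression codec `UNCOMPRESSED`, while ORC's `orc.compress` property expects `NONE`, so a value coming from the session conf has to be normalized per format.

```scala
object CodecNameNormalization {
  // Parquet's CompressionCodecName uses UNCOMPRESSED, so "none" must be mapped to it.
  def forParquet(codec: String): String = codec.toUpperCase match {
    case "NONE" => "UNCOMPRESSED"
    case other  => other
  }

  // ORC's orc.compress property uses NONE for no compression.
  def forOrc(codec: String): String = codec.toUpperCase match {
    case "UNCOMPRESSED" => "NONE"
    case other          => other
  }
}
```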


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r143624224
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,26 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+fileSinkConf.tableInfo.getOutputFileFormatClassName match {
+  case formatName if formatName.endsWith("ParquetOutputFormat") =>
+val compressionConf = "parquet.compression"
+val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+  sparkSession.sessionState.conf.parquetCompressionCodec) match {
+  case "NONE" => "UNCOMPRESSED"
+  case _@x => x
+}
+hadoopConf.set(compressionConf, compressionCodec)
+  case formatName if formatName.endsWith("OrcOutputFormat") =>
+val compressionConf = "orc.compress"
+val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+  sparkSession.sessionState.conf.orcCompressionCodec) match {
+  case "UNCOMPRESSED" => "NONE"
--- End diff --

Yes, they are different: both the parameter names and the parameter values follow different conventions, which appears to be a Parquet vs. ORC inconsistency.


---




[GitHub] spark issue #6751: [SPARK-8300] DataFrame hint for broadcast join.

2017-10-10 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/6751
  
@rxin  @marmbrus 
Is there another way to broadcast a table with spark-sql now, other than `spark.sql.autoBroadcastJoinThreshold`?
And if not, would it be a good idea to broadcast a table via a user conf, such as 
`spark.sql.autoBroadcastJoin.mytable=true` to broadcast a table named `mytable`?


---




[GitHub] spark issue #6751: [SPARK-8300] DataFrame hint for broadcast join.

2017-10-10 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/6751
  
Is there an example?
I use broadcast like the following, but it raises an error. Would you be so kind as to show me an example?

>spark-sql> select a.* from tableA a left outer join broadcast(tableB) b on a.a=b.a;
>17/10/11 09:06:40 INFO HiveMetaStore: 0: get_table : db=default tbl=tablea
>17/10/11 09:06:40 INFO audit: ugi=root ip=unknown-ip-addr  cmd=get_table : db=default tbl=tablea
>  **Error in query: cannot resolve '`tableB`' given input columns: []; line 1 pos 51;**
>'Project [ArrayBuffer(a).*]
>+- 'Join LeftOuter, ('a.a = 'b.a)
>   :- SubqueryAlias a
>   :  +- SubqueryAlias tablea
>   : +- HiveTableRelation `default`.`tablea`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#94, b#95, c#96]
>   +- 'SubqueryAlias b
>  +- 'UnresolvedTableValuedFunction broadcast, ['tableB]


---




[GitHub] spark issue #6751: [SPARK-8300] DataFrame hint for broadcast join.

2017-10-10 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/6751
  
With `/*+ broadcast(table) */`, it works well, thank you very much.
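
For reference, a minimal sketch of the hint form that worked here and the equivalent DataFrame API call (assuming the same `tableA`/`tableB` tables as in the query above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-hint-sketch").enableHiveSupport().getOrCreate()
import spark.implicits._

// SQL hint form: ask the planner to broadcast the right-hand table b.
spark.sql(
  """SELECT /*+ BROADCAST(b) */ a.*
    |FROM tableA a LEFT OUTER JOIN tableB b ON a.a = b.a""".stripMargin).show()

// Equivalent DataFrame API form using functions.broadcast.
spark.table("tableA").as("a")
  .join(broadcast(spark.table("tableB")).as("b"), $"a.a" === $"b.a", "left_outer")
  .select($"a.*")
  .show()
```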


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-10-11 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
cc @gatorsmile @dongjoon-hyun 
Is it ok now?


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-11 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r144202674
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -728,4 +732,254 @@ class InsertSuite extends QueryTest with 
TestHiveSingleton with BeforeAndAfter
   assert(e.contains("mismatched input 'ROW'"))
 }
   }
+
+  private def getConvertMetastoreConfName(format: String): String = format match {
+case "parquet" => "spark.sql.hive.convertMetastoreParquet"
+case "orc" => "spark.sql.hive.convertMetastoreOrc"
+  }
+
+  private def getSparkCompressionConfName(format: String): String = format match {
+case "parquet" => "spark.sql.parquet.compression.codec"
+case "orc" => "spark.sql.orc.compression.codec"
+  }
+
+  private def getTableCompressPropName(format: String): String = {
+format.toLowerCase match {
+  case "parquet" => "parquet.compression"
+  case "orc" => "orc.compress"
+}
+  }
+
+  private def getTableCompressionCodec(path: String, format: String): String = {
--- End diff --

Would changing it to `getHiveCompressPropName` be appropriate?


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-11-01 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
cc @gatorsmile @dongjoon-hyun
Is it ok now?


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-21 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@gatorsmile Could you help to review it? Thanks very much!


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158457806
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in 
[[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
-val codecName = parameters.getOrElse("compression",
-  sqlConf.parquetCompressionCodec).toLowerCase(Locale.ROOT)
+// `compression`, `parquet.compression`(i.e., ParquetOutputFormat.COMPRESSION), and
+// `spark.sql.parquet.compression.codec`
+// are in order of precedence from highest to lowest.
+val parquetCompressionConf = parameters.get(ParquetOutputFormat.COMPRESSION)
+val codecName = parameters
+  .get("compression")
+  .orElse(parquetCompressionConf)
--- End diff --

Yes, it's new. I guess `ParquetOptions` was not used when writing Hive tables before, because it's not visible to the hive module. I changed it to `public`.


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158458003
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala 
---
@@ -19,7 +19,16 @@ package org.apache.spark.sql.hive.execution
 
 import java.util.Locale
 
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}
--- End diff --

I will remove it.


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158458747
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala 
---
@@ -102,4 +111,18 @@ object HiveOptions {
 "collectionDelim" -> "colelction.delim",
 "mapkeyDelim" -> "mapkey.delim",
 "lineDelim" -> "line.delim").map { case (k, v) => 
k.toLowerCase(Locale.ROOT) -> v }
+
+  def getHiveWriteCompression(tableInfo: TableDesc, sqlConf: SQLConf): Option[(String, String)] = {
+tableInfo.getOutputFileFormatClassName.toLowerCase match {
+  case formatName if formatName.endsWith("parquetoutputformat") =>
+val compressionCodec = new ParquetOptions(tableInfo.getProperties.asScala.toMap,
+  sqlConf).compressionCodecClassName
--- End diff --

Yes, it looks better; I will change it.


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158459550
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala 
---
@@ -102,4 +111,18 @@ object HiveOptions {
 "collectionDelim" -> "colelction.delim",
 "mapkeyDelim" -> "mapkey.delim",
 "lineDelim" -> "line.delim").map { case (k, v) => 
k.toLowerCase(Locale.ROOT) -> v }
+
+  def getHiveWriteCompression(tableInfo: TableDesc, sqlConf: SQLConf): Option[(String, String)] = {
+tableInfo.getOutputFileFormatClassName.toLowerCase match {
+  case formatName if formatName.endsWith("parquetoutputformat") =>
+val compressionCodec = new ParquetOptions(tableInfo.getProperties.asScala.toMap,
+  sqlConf).compressionCodecClassName
+Option((ParquetOutputFormat.COMPRESSION, compressionCodec))
+  case formatName if formatName.endsWith("orcoutputformat") =>
+val compressionCodec = new OrcOptions(tableInfo.getProperties.asScala.toMap,
+  sqlConf).compressionCodec
--- End diff --

`compressionCodec` is used in several places; do you mean I should fix them all?


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158460931
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -35,7 +39,7 @@ case class TestData(key: Int, value: String)
 case class ThreeCloumntable(key: Int, value: String, key1: String)
 
 class InsertSuite extends QueryTest with TestHiveSingleton with 
BeforeAndAfter
-with SQLTestUtils {
+with ParquetTest {
--- End diff --

Ok, I will do it.


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158492627
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -35,7 +39,7 @@ case class TestData(key: Int, value: String)
 case class ThreeCloumntable(key: Int, value: String, key1: String)
 
 class InsertSuite extends QueryTest with TestHiveSingleton with 
BeforeAndAfter
-with SQLTestUtils {
+with ParquetTest {
--- End diff --

It seems a compressed table is not always smaller than an uncompressed one: with `SNAPPY`, the compressed size may be bigger than the uncompressed size when the amount of data is small. So I'd like to check that the sizes are not equal when the compression codecs are different.
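
Roughly the kind of check I mean (an illustrative sketch; the helper names are mine, not the actual test code):

```scala
import java.io.File

// Sum the data file sizes in a table directory, ignoring checksum and marker files.
def totalDataSize(dir: File): Long =
  dir.listFiles()
    .filter(f => f.isFile && !f.getName.endsWith(".crc") && f.getName != "_SUCCESS")
    .map(_.length())
    .sum

// Two writes with different codecs should not land on exactly the same size.
def assertCodecsDiffer(dirA: File, dirB: File): Unit =
  assert(totalDataSize(dirA) != totalDataSize(dirB),
    "tables written with different compression codecs should differ in size")
```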


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158574986
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in 
[[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
-val codecName = parameters.getOrElse("compression",
-  sqlConf.parquetCompressionCodec).toLowerCase(Locale.ROOT)
+// `compression`, `parquet.compression`(i.e., ParquetOutputFormat.COMPRESSION), and
+// `spark.sql.parquet.compression.codec`
+// are in order of precedence from highest to lowest.
+val parquetCompressionConf = parameters.get(ParquetOutputFormat.COMPRESSION)
+val codecName = parameters
+  .get("compression")
+  .orElse(parquetCompressionConf)
--- End diff --

If so, Parquet's table-level compression may be overwritten by this PR, and that may not be what we want.
Shall I fix it first in another PR?


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@gatorsmile 
I tested manually. When table-level compression is not configured, it always takes the session-level compression and ignores the compression of the existing files. That seems like a bug; however, table files with multiple compression codecs do not affect reading or writing.
Is it OK to add a test that checks reading and writing when there are multiple compression codecs among the existing table files?
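
Something like the following is what I have in mind (a sketch only; `t_mixed` and `table_source` are placeholder names): write into the same table twice with different session-level codecs, then read the whole table back.

```scala
// Assumes a Hive Parquet table t_mixed and a source view table_source already exist.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.sql("INSERT INTO TABLE t_mixed SELECT * FROM table_source")

spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
spark.sql("INSERT INTO TABLE t_mixed SELECT * FROM table_source")

// Files under t_mixed now carry two different codecs; a full scan should still return every row.
assert(spark.table("t_mixed").count() == 2 * spark.table("table_source").count())
```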


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-23 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
I mean several files with different compression codecs under the same table directory, like below:

![image](https://user-images.githubusercontent.com/26785576/34318192-6891bf5a-e7fc-11e7-82b9-b07b746a20d3.png)



---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-24 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/20076

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1. Also acquire 'compressionCodecClassName' from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.

2. Change `spark.sql.parquet.compression.codec` to support "none". Actually, in `ParquetOptions` we already support "none" as equivalent to "uncompressed", but it was not allowed to be configured to "none".

3. Change `compressionCode` to `compressionCodecClassName`.
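
To illustrate point 1, a hedged sketch of the described precedence (illustrative only, not the actual `ParquetOptions` code):

```scala
// Resolve the write codec: the data source option `compression` wins, then the Parquet
// property `parquet.compression`, then the session conf spark.sql.parquet.compression.codec.
def resolveParquetCodec(options: Map[String, String], sessionLevelCodec: String): String =
  options.get("compression")
    .orElse(options.get("parquet.compression"))
    .getOrElse(sessionLevelCodec)
    .toLowerCase

// e.g. resolveParquetCodec(Map("parquet.compression" -> "gzip"), "snappy") == "gzip"
// and  resolveParquetCodec(Map.empty, "none") == "none"   (now accepted, per point 2)
```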

## How was this patch tested?
Add test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark ParquetOptionIssue

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20076.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20076


commit 9bbfe6ef4b5a418373c2250ad676233fb05df7f7
Author: fjh100456 
Date:   2017-12-25T02:29:53Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".

## How was this patch tested?
Manual test.

commit 48cf108ed5c3298eb860d9735b439ac89d65765e
Author: fjh100456 
Date:   2017-12-25T02:30:24Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".

## How was this patch tested?
Manual test.

commit 5dbd3edf9e086433d3d3fe9c0ead887d799c61d3
Author: fjh100456 
Date:   2017-12-25T02:34:29Z

spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring 
'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to 
be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".

## How was this patch tested?
Manual test.

commit 5124f1b560e942c0dc23af31336317a4b995dd8f
Author: fjh100456 
Date:   2017-12-25T07:06:26Z

spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring 
'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to 
be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".
3.Change `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?
Manual test.




---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-24 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20076
  
cc @gatorsmile 
No ORC configuration was found in `sql-programming-guide.md`, so I did not add the precedence description for `spark.sql.orc.compression.codec`.


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-25 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r158628479
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in 
[[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
--- End diff --

@HyukjinKwon Seems you're right.
@gatorsmile Are we mistaken? Shouldn't we change ParquetOptions's `compressionCodec` to `compressionCodecClassName`, given that `OrcOptions` and `TextOptions` both use `compressionCodec`?


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-25 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r158631203
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in 
[[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
--- End diff --

So, change all `compressionCodecClassName` and `compressionCodec` to `compressionCodecName`? In `TextOptions`, `JSONOptions`, and `CSVOptions` too?


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-25 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r158632847
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in 
[[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
--- End diff --

@gatorsmile @HyukjinKwon 
In `TextOptions`, `JSONOptions`, and `CSVOptions` it's an `Option[String]`, but in `OrcOptions` and `ParquetOptions` it's a `String`.
Is it OK to just change `compressionCodecClassName` in `OrcOptions` and `ParquetOptions` to `compressionCodecName`?


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20076
  
Well, I'll revert the renaming. Any comments? @gatorsmile 



---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-26 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20076
  
Do you mean what we do in the test case of the other PR, #19218? @gatorsmile


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-26 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20076
  
@gatorsmile 
I tested it manually and found that a table-level compression property coming from a statement like the one below still cannot take effect, even though the table properties are passed to a `hadoopConf` (just like what I do in #19218 ), because the property cannot be found in the properties of `tableInfo`. I am not familiar with the SQL parsing; where is this attribute information stored when the SQL is parsed?
```CREATE table A using Parquet tblproperties (' parquet.compression ' = ' gzip ') ...```
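
For what it's worth, here is where I would look (this is my assumption about where the property ends up, not a confirmed answer): the TBLPROPERTIES seem to be kept with the table's catalog metadata rather than in the `tableInfo` seen by the Hive write path, and they can be inspected like this:

```scala
// Placeholder table name `a`; the property key mirrors the statement above.
spark.sql("CREATE TABLE a (x INT) USING parquet TBLPROPERTIES ('parquet.compression' = 'gzip')")

// Public way to see what was actually stored with the table definition.
spark.sql("SHOW TBLPROPERTIES a").show(truncate = false)

// Inside Spark's own code or tests, the same map is reachable via the session catalog, e.g.
// spark.sessionState.catalog.getTableMetadata(TableIdentifier("a")).properties
```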



---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-26 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
I'd finished writing the test case with a table containing mixed compression codecs. But I made a mistake: the original branch was deleted accidentally, so I will close this PR and create another one. Sorry.


---




[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-26 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/20087

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 
'spark.sql.orc.compression.codec' configuration doesn't take effect on hive 
table writing

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 
'spark.sql.orc.compression.codec' configuration doesn't take effect on hive 
table writing

What changes were proposed in this pull request?

Pass ‘spark.sql.parquet.compression.codec’ value to 
‘parquet.compression’.
Pass ‘spark.sql.orc.compression.codec’ value to ‘orc.compress’.

How was this patch tested?

Add test.

Note: 
This is the same issue mentioned in #19218 . That branch was deleted mistakenly, so this is a new PR instead.

@gatorsmile @maropu @dongjoon-hyun @discipleforteen


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark HiveTableWriting

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20087


commit 9bbfe6ef4b5a418373c2250ad676233fb05df7f7
Author: fjh100456 
Date:   2017-12-25T02:29:53Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".
    
## How was this patch tested?
Manual test.

commit 48cf108ed5c3298eb860d9735b439ac89d65765e
Author: fjh100456 
Date:   2017-12-25T02:30:24Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".

## How was this patch tested?
Manual test.

commit 5dbd3edf9e086433d3d3fe9c0ead887d799c61d3
Author: fjh100456 
Date:   2017-12-25T02:34:29Z

spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring 
'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to 
be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".

## How was this patch tested?
Manual test.

commit 5124f1b560e942c0dc23af31336317a4b995dd8f
Author: fjh100456 
Date:   2017-12-25T07:06:26Z

spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring 
'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to 
be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".
3.Change `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?
Manual test.

commit 6907a3ef86a2546fae91c22754796490a80e
Author: fjh100456 
Date:   2017-12-25T09:26:33Z

Make compression codec take effect in hive table writing

[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-26 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
Please go to #20087


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-26 Thread fjh100456
Github user fjh100456 closed the pull request at:

https://github.com/apache/spark/pull/19218


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2018-01-02 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r159218522
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/CompressionCodecPrecedenceSuite.scala
 ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
--- End diff --

Thank you. I have moved it to `org.apache.spark.sql.execution.datasources.parquet`.


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2018-01-02 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r159218648
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -364,7 +366,9 @@ object SQLConf {
   .createWithDefault(true)
 
   val ORC_COMPRESSION = buildConf("spark.sql.orc.compression.codec")
-.doc("Sets the compression codec use when writing ORC files. 
Acceptable values include: " +
+.doc("Sets the compression codec use when writing ORC files. If other 
compression codec " +
--- End diff --

Thank you. I have fixed them.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2018-01-02 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20076
  
@gatorsmile 
I have added two test cases. Please review them. Thank you very much.


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2018-01-04 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r159802320
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -27,7 +28,7 @@ import org.apache.spark.sql.internal.SQLConf
 /**
  * Options for the Parquet data source.
  */
-private[parquet] class ParquetOptions(
--- End diff --

Yes, it should be restored. Thanks.


---




[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-07 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r160086482
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,10 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+// Set compression by priority
+HiveOptions.getHiveWriteCompression(fileSinkConf.getTableInfo, sparkSession.sessionState.conf)
+  .foreach { case (compression, codec) => hadoopConf.set(compression, codec) }
--- End diff --

For Parquet, without the changes of this PR, the precedence is table-level compression > `mapreduce.output.fileoutputformat.compress`; `spark.sql.parquet.compression.codec` never takes effect. With this PR, `mapreduce.output.fileoutputformat.compress` will no longer take effect; instead, `spark.sql.parquet.compression.codec` will always take effect when there is no table-level compression.

For ORC, `hive.exec.compress.output` does not take effect, as explained in the code comments.

Shall we keep this precedence for Parquet? If so, how should we deal with ORC?


---




[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2018-01-08 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20087
  
Retest please. @gatorsmile 


---




[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2018-01-10 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20087
  
@gatorsmile 
I'd change the precedence as follows.
For Parquet, if `hive.exec.compress.output` is true, keep the old precedence; otherwise, get the compression from `HiveOptions`.
For ORC, because `hive.exec.compress.output` never takes effect, get the compression from `HiveOptions` directly.

If this is OK, I'll change the test case. Any suggestions?
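
To make the proposal concrete, an illustrative sketch of the decision (the helper and parameter names are mine, not the PR's):

```scala
// tableCodec: codec from the table's own properties, if any.
// mapredCodec: codec implied by the mapreduce.output.fileoutputformat.compress.* settings.
// sessionCodec: spark.sql.parquet.compression.codec / spark.sql.orc.compression.codec.
def parquetWriteCodec(
    hiveExecCompressOutput: Boolean,
    tableCodec: Option[String],
    mapredCodec: Option[String],
    sessionCodec: String): Option[String] =
  if (hiveExecCompressOutput) {
    // keep the old precedence: table-level property, then the mapreduce output settings
    tableCodec.orElse(mapredCodec)
  } else {
    // otherwise resolve through HiveOptions: table-level property, then the session-level conf
    Some(tableCodec.getOrElse(sessionCodec))
  }

def orcWriteCodec(tableCodec: Option[String], sessionCodec: String): String =
  // hive.exec.compress.output never applies here, so resolve directly
  tableCodec.getOrElse(sessionCodec)
```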


---




[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-18 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r162519320
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---
@@ -0,0 +1,321 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.Path
+import org.apache.orc.OrcConf.COMPRESS
+import org.apache.parquet.hadoop.ParquetOutputFormat
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.sql.execution.datasources.orc.OrcOptions
+import org.apache.spark.sql.execution.datasources.parquet.{ParquetOptions, 
ParquetTest}
+import org.apache.spark.sql.hive.orc.OrcFileOperator
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+
+class CompressionCodecSuite extends TestHiveSingleton with ParquetTest 
with BeforeAndAfterAll {
+  import spark.implicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+(0 until 
maxRecordNum).toDF("a").createOrReplaceTempView("table_source")
+  }
+
+  override def afterAll(): Unit = {
+try {
+  spark.catalog.dropTempView("table_source")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private val maxRecordNum = 500
+
+  private def getConvertMetastoreConfName(format: String): String = 
format.toLowerCase match {
+case "parquet" => HiveUtils.CONVERT_METASTORE_PARQUET.key
+case "orc" => HiveUtils.CONVERT_METASTORE_ORC.key
+  }
+
+  private def getSparkCompressionConfName(format: String): String = 
format.toLowerCase match {
+case "parquet" => SQLConf.PARQUET_COMPRESSION.key
+case "orc" => SQLConf.ORC_COMPRESSION.key
+  }
+
+  private def getHiveCompressPropName(format: String): String = 
format.toLowerCase match {
+case "parquet" => ParquetOutputFormat.COMPRESSION
+case "orc" => COMPRESS.getAttribute
+  }
+
+  private def normalizeCodecName(format: String, name: String): String = {
+format.toLowerCase match {
+  case "parquet" => 
ParquetOptions.shortParquetCompressionCodecNames(name).name()
+  case "orc" => OrcOptions.shortOrcCompressionCodecNames(name)
+}
+  }
+
+  private def getTableCompressionCodec(path: String, format: String): 
Seq[String] = {
+val hadoopConf = spark.sessionState.newHadoopConf()
+val codecs = format.toLowerCase match {
+  case "parquet" => for {
+footer <- readAllFootersWithoutSummaryFiles(new Path(path), 
hadoopConf)
+block <- footer.getParquetMetadata.getBlocks.asScala
+column <- block.getColumns.asScala
+  } yield column.getCodec.name()
+  case "orc" => new File(path).listFiles().filter{ file =>
+file.isFile && !file.getName.endsWith(".crc") && file.getName != 
"_SUCCESS"
+  }.map { orcFile =>
+
OrcFileOperator.getFileReader(orcFile.toPath.toString).get.getCompression.toString
+  }.toSeq
+}
+codecs.distinct
+  }
+
+  private def createTable(
+  rootDir: File,
+  tableName: String,
+  isPartitioned: Boolean,
+  format: String,
+  compressionCodec: Option[String]): Unit = {
+val tblProperties = compressionCodec match {
+  case Some(prop) => 
s"TBLPROPERTIES('${getHiveCompressPropName(format)}'='$prop')"
+  case _ => ""
+}
+val partitionCreate = if (isPartitioned) "PARTITIONED BY (p string)" 
else ""
+sql(
+  s"""
+   

[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-18 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r162520864
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---

[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-18 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r162528298
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---

[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-18 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r162551602
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---

[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-19 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r162563141
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---
+val partitionCreate = if (isPartitioned) "PARTITIONED BY (p string)" 
else ""
+sql(
+  s"""
+   

[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2018-01-19 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20087
  
@gatorsmile 
I have fixed the test case for CTAS, but it may not pass until the code of 
your PR #20120 is merged.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-20 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r162779218
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.Path
+import org.apache.orc.OrcConf.COMPRESS
+import org.apache.parquet.hadoop.ParquetOutputFormat
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.sql.execution.datasources.orc.OrcOptions
+import org.apache.spark.sql.execution.datasources.parquet.{ParquetOptions, 
ParquetTest}
+import org.apache.spark.sql.hive.orc.OrcFileOperator
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+
+class CompressionCodecSuite extends TestHiveSingleton with ParquetTest 
with BeforeAndAfterAll {
+  import spark.implicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+(0 until 
maxRecordNum).toDF("a").createOrReplaceTempView("table_source")
+  }
+
+  override def afterAll(): Unit = {
+try {
+  spark.catalog.dropTempView("table_source")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private val maxRecordNum = 50
+
+  private def getConvertMetastoreConfName(format: String): String = 
format.toLowerCase match {
+case "parquet" => HiveUtils.CONVERT_METASTORE_PARQUET.key
+case "orc" => HiveUtils.CONVERT_METASTORE_ORC.key
+  }
+
+  private def getSparkCompressionConfName(format: String): String = 
format.toLowerCase match {
+case "parquet" => SQLConf.PARQUET_COMPRESSION.key
+case "orc" => SQLConf.ORC_COMPRESSION.key
+  }
+
+  private def getHiveCompressPropName(format: String): String = 
format.toLowerCase match {
+case "parquet" => ParquetOutputFormat.COMPRESSION
+case "orc" => COMPRESS.getAttribute
+  }
+
+  private def normalizeCodecName(format: String, name: String): String = {
+format.toLowerCase match {
+  case "parquet" => ParquetOptions.getParquetCompressionCodecName(name)
+  case "orc" => OrcOptions.getORCCompressionCodecName(name)
+}
+  }
+
+  private def getTableCompressionCodec(path: String, format: String): 
Seq[String] = {
+val hadoopConf = spark.sessionState.newHadoopConf()
+val codecs = format.toLowerCase match {
+  case "parquet" => for {
+footer <- readAllFootersWithoutSummaryFiles(new Path(path), 
hadoopConf)
+block <- footer.getParquetMetadata.getBlocks.asScala
+column <- block.getColumns.asScala
+  } yield column.getCodec.name()
+  case "orc" => new File(path).listFiles().filter { file =>
+file.isFile && !file.getName.endsWith(".crc") && file.getName != 
"_SUCCESS"
+  }.map { orcFile =>
+
OrcFileOperator.getFileReader(orcFile.toPath.toString).get.getCompression.toString
+  }.toSeq
+}
+codecs.distinct
+  }
+
+  private def createTable(
+  rootDir: File,
+  tableName: String,
+  isPartitioned: Boolean,
+  format: String,
+  compressionCodec: Option[String]): Unit = {
+val tblProperties = compressionCodec match {
+  case Some(prop) => 
s"TBLPROPERTIES('${getHiveCompressPropName(format)}'='$prop')"
+  case _ => ""
+}
+val partitionCreate = if (isPartitioned) "PARTITIONED BY (p string)" 
else ""
+sql(
+  s"""
+   

[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-21 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r162836010
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -55,18 +55,28 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   customPartitionLocations: Map[TablePartitionSpec, String] = 
Map.empty,
   partitionAttributes: Seq[Attribute] = Nil): Set[String] = {
 
-val isCompressed = hadoopConf.get("hive.exec.compress.output", 
"false").toBoolean
+val isCompressed =
+  
fileSinkConf.getTableInfo.getOutputFileFormatClassName.toLowerCase(Locale.ROOT) 
match {
+case formatName if formatName.endsWith("orcoutputformat") =>
+  // For ORC,"mapreduce.output.fileoutputformat.compress",
+  // "mapreduce.output.fileoutputformat.compress.codec", and
+  // "mapreduce.output.fileoutputformat.compress.type"
+  // have no impact because it uses table properties to store 
compression information.
--- End diff --

Sure, I'll do it in the next few days.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r163132078
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -55,18 +55,28 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   customPartitionLocations: Map[TablePartitionSpec, String] = 
Map.empty,
   partitionAttributes: Seq[Attribute] = Nil): Set[String] = {
 
-val isCompressed = hadoopConf.get("hive.exec.compress.output", 
"false").toBoolean
+val isCompressed =
+  
fileSinkConf.getTableInfo.getOutputFileFormatClassName.toLowerCase(Locale.ROOT) 
match {
+case formatName if formatName.endsWith("orcoutputformat") =>
+  // For ORC,"mapreduce.output.fileoutputformat.compress",
+  // "mapreduce.output.fileoutputformat.compress.codec", and
+  // "mapreduce.output.fileoutputformat.compress.type"
+  // have no impact because it uses table properties to store 
compression information.
--- End diff --

For Parquet with a Hive client, `parquet.compression` has a higher priority 
than `mapreduce.output.fileoutputformat.compress`, and table-level compression 
(set via TBLPROPERTIES) has the highest priority. `parquet.compression` set 
from the CLI also takes precedence over 
`mapreduce.output.fileoutputformat.compress`.

After this PR the priority is unchanged. If table-level compression is set, no 
other compression setting takes effect, even when `mapreduce.output.*` is set, 
which matches Hive's behavior. However, `parquet.compression` set from the 
Spark CLI does not take effect unless `hive.exec.compress.output` is set to 
true. This is probably because we do not read `parquet.compression` from the 
session, and I wonder whether that is even necessary, since we have 
`spark.sql.parquet.compression.codec` instead.

For ORC, `hive.exec.compress.output` and `mapreduce.output.*` really have no 
impact, but table-level compression (set via TBLPROPERTIES) always takes 
effect. `orc.compress` set from the Spark CLI does not take effect either, 
even with `hive.exec.compress.output` set to true, which differs from Parquet.
One more question: the comment says `it uses table properties to store 
compression information`, but in manual tests I found that ORC tables can also 
contain files with mixed compression codecs, and the data can still be read 
back correctly.

My Hive version for this test is 1.1.0. It is actually a bit difficult for me 
to get a runnable Hive client of a newer version.
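
To make that precedence easier to reproduce, here is a minimal sketch 
(assuming a Hive-enabled SparkSession; the table name `tab` is only an 
example, not part of this PR):

```scala
// Minimal sketch, assuming a Hive-enabled SparkSession; the table name is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("compression-precedence-check")
  .enableHiveSupport()
  .getOrCreate()

// Session-level codec (lower priority than a table-level TBLPROPERTIES setting).
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

// Table-level codec declared at creation time (highest priority).
spark.sql(
  """CREATE TABLE IF NOT EXISTS tab (id INT)
    |PARTITIONED BY (p INT)
    |STORED AS PARQUET
    |TBLPROPERTIES ('parquet.compression'='SNAPPY')""".stripMargin)

spark.range(100).selectExpr("CAST(id AS INT) AS id").createOrReplaceTempView("src")
spark.sql("INSERT OVERWRITE TABLE tab PARTITION (p = 1) SELECT id FROM src")

// The written files are expected to use SNAPPY (the table property),
// not GZIP (the session setting), matching the Hive precedence described above.
```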


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2018-01-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20087#discussion_r163150428
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -55,18 +55,28 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   customPartitionLocations: Map[TablePartitionSpec, String] = 
Map.empty,
   partitionAttributes: Seq[Attribute] = Nil): Set[String] = {
 
-val isCompressed = hadoopConf.get("hive.exec.compress.output", 
"false").toBoolean
+val isCompressed =
+  
fileSinkConf.getTableInfo.getOutputFileFormatClassName.toLowerCase(Locale.ROOT) 
match {
+case formatName if formatName.endsWith("orcoutputformat") =>
+  // For ORC,"mapreduce.output.fileoutputformat.compress",
+  // "mapreduce.output.fileoutputformat.compress.codec", and
+  // "mapreduce.output.fileoutputformat.compress.type"
+  // have no impact because it uses table properties to store 
compression information.
--- End diff --

Ok, I'll try it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19189: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-09-11 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/19189

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' configuration 
doesn't take effect on tables with partition field(s)

## What changes were proposed in this pull request?
Pass the spark configuration ‘spark.sql.parquet.compression.codec’ to 
hive so that it can take effect.

## How was this patch tested?
Manually checked.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19189.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19189


commit f91ed07718d4462e487237353bf82c80c5e148f7
Author: fjh100456 
Date:   2017-06-19T08:53:06Z

[SPARK-21135][WEB UI] On history server page,duration of incompleted 
applications should be hidden instead of showing up as 0

## What changes were proposed in this pull request?

Hide duration of incomplete applications.

## How was this patch tested?

manual tests

commit faffde55e38d1db9f54338f29429f98baf73def5
Author: fjh100456 
Date:   2017-06-30T06:15:01Z

Merge pull request #1 from apache/master

Update code to 2017-06-30

commit 44ff604cb6fca37acafd018c3e25888fbe1df1ae
Author: fjh100456 
Date:   2017-09-08T09:02:59Z

Merge pull request #2 from apache/master

Merge spark master




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19189: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-09-11 Thread fjh100456
Github user fjh100456 closed the pull request at:

https://github.com/apache/spark/pull/19189


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-09-13 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/19218

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' configuration 
doesn't take effect on tables with partition field(s)

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' configuration 
doesn't take effect on tables with partition field(s)

## What changes were proposed in this pull request?
Pass ‘spark.sql.parquet.compression.codec’ value to 
‘parquet.compression’.

## How was this patch tested?
Manual test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19218.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19218


commit 677541b47f27fd85f44aa2e46ec44861579475a8
Author: fjh100456 
Date:   2017-09-13T09:24:15Z

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' configuration 
doesn't take effect on tables with partition field(s)




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-14 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
cc @maropu 

I have added the test. However, none of my local test cases run properly, so 
I'm not sure whether the new one will pass, but I will keep an eye on it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-15 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@gatorsmile 
1. Non-partitioned tables do not have this problem; 
'spark.sql.parquet.compression.codec' takes effect normally, because the 
data-writing path differs from that of a partitioned table.
2. ORC does not have a configuration similar to 'spark.sql.*'; it can only 
use 'orc.compress', which is not a Spark configuration.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-15 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@dongjoon-hyun Thank you very much,  I'll fix it now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@dongjoon-hyun 

I have run into a problem. There are two ways to specify the compression 
format:
1. CREATE TABLE Test(id int) STORED AS ORC TBLPROPERTIES 
('orc.compress'='SNAPPY');
2. set orc.compress=ZLIB;
If the table was already given a compression format when it was created, and 
another compression format is then specified by setting 'orc.compress', the 
latter takes effect.

So either the Spark side should not have a default value, and we can 
distinguish the unset case by 'undefined'; or we discard this change and 
explain in the documentation that 'spark.sql.parquet.compression.codec' does 
not take effect for partitioned tables and 'spark.sql.orc.compression.codec' 
is not valid for Hive tables. Or perhaps you have a better solution.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-18 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
I have encountered two problems:
1. I tried to fix it with the priority order 'compression' > 
'parquet.compression' > 'spark.sql.parquet.compression.codec', but found that 
'parquet.compression' may come either from a user 'set' command or from a 
table property specified at creation time, so which one should take 
precedence?

2. If I change the priorities, it seems that both partitioned and 
non-partitioned tables need to be fixed, so maybe I have to move the code 
earlier, for example into 'InsertIntoHadoopFsRelationCommand.scala'.

I would appreciate it if you could give me some advice. @gatorsmile 
@dongjoon-hyun 
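
For reference, a minimal sketch of one possible resolution order (the helper 
and its parameters are hypothetical illustrations, not code from this PR):

```scala
// Hypothetical sketch of codec resolution by priority; not code from this PR.
// Assumed order: writer option > table property > session configuration.
def resolveParquetCodec(
    writerOptions: Map[String, String],    // e.g. the "compression" option
    tableProperties: Map[String, String],  // e.g. "parquet.compression" from TBLPROPERTIES
    sessionCodec: String): String = {      // spark.sql.parquet.compression.codec
  writerOptions.get("compression")
    .orElse(tableProperties.get("parquet.compression"))
    .getOrElse(sessionCodec)
}

// Example: the table property wins over the session default.
resolveParquetCodec(Map.empty, Map("parquet.compression" -> "SNAPPY"), "gzip")
// => "SNAPPY"
```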


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-18 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
Thanks for your review. @gatorsmile 
In the first question I mean that 'parquet.compression' can be found in 
`table: TableDesc` (maybe similar to `CatalogTable`), and can also be found in 
`sparkSession.sessionState.conf` (set by the user through the command `set 
parquet.compression=xxx`); which one should take precedence?

This issue was originally only about Hive table writing, but after fixing the 
priority I found that non-partitioned tables did not pick up the right 
precedence, and writes to non-partitioned tables do not go through 
`InsertIntoHiveTable`. `InsertIntoHadoopFsRelationCommand.scala` is really not 
a proper place; is there any place that can handle both partitioned and 
non-partitioned tables?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-20 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@gatorsmile  @dongjoon-hyun  
 I've fixed it. Could you help me review it again? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-26 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@dongjoon-hyun 
Thank you very much, I'll fix them as soon as possible.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-27 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@dongjoon-hyun  @gatorsmile 
I've fixed them. Could you help me review them again? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17895: Branch 2.0

2017-05-07 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/17895

Branch 2.0

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17895.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17895


commit b57e2acb134d94dafc81686da875c5dd3ea35c74
Author: Jagadeesan 
Date:   2016-10-03T09:46:38Z

[SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…

## What changes were proposed in this pull request?

To build R docs (which are built when R tests are run), users need to 
install pandoc and rmarkdown. This was done for Jenkins in 
~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~

… pandoc]

Author: Jagadeesan 

Closes #15309 from jagadeesanas2/SPARK-17736.

(cherry picked from commit a27033c0bbaae8f31db9b91693947ed71738ed11)
Signed-off-by: Sean Owen 

commit 613863b116b6cbc9ac83845c68a2d11b3b02f7cb
Author: zero323 
Date:   2016-10-04T00:57:54Z

[SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow 
__getitem__ contract

## What changes were proposed in this pull request?

Replaces` ValueError` with `IndexError` when index passed to `ml` / `mllib` 
`SparseVector.__getitem__` is out of range. This ensures correct iteration 
behavior.

Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparkMatrix` 
in `ml` / `mllib`.

## How was this patch tested?

PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the 
problem has been resolved.

Author: zero323 

Closes #15144 from zero323/SPARK-17587.

(cherry picked from commit d8399b600cef706c22d381b01fab19c610db439a)
Signed-off-by: Joseph K. Bradley 

commit 5843932021cc8bbe0277943c6c480cfeae1b29e2
Author: Herman van Hovell 
Date:   2016-10-04T02:32:59Z

[SPARK-17753][SQL] Allow a complex expression as the input a value based 
case statement

## What changes were proposed in this pull request?
We currently only allow relatively simple expressions as the input for a 
value based case statement. Expressions like `case (a > 1) or (b = 2) when true 
then 1 when false then 0 end` currently fail. This PR adds support for such 
expressions.

## How was this patch tested?
Added a test to the ExpressionParserSuite.

Author: Herman van Hovell 

Closes #15322 from hvanhovell/SPARK-17753.

(cherry picked from commit 2bbecdec2023143fd144e4242ff70822e0823986)
Signed-off-by: Herman van Hovell 

commit 7429199e5b34d5594e3fcedb57eda789d16e26f3
Author: Dongjoon Hyun 
Date:   2016-10-04T04:28:16Z

[SPARK-17112][SQL] "select null" via JDBC triggers IllegalArgumentException 
in Thriftserver

## What changes were proposed in this pull request?

Currently, Spark Thrift Server raises `IllegalArgumentException` for 
queries whose column types are `NullType`, e.g., `SELECT null` or `SELECT 
if(true,null,null)`. This PR fixes that by returning `void` like Hive 1.2.

**Before**
```sql
$ bin/beeline -u jdbc:hive2://localhost:1 -e "select null"
Connecting to jdbc:hive2://localhost:1
Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Error: java.lang.IllegalArgumentException: Unrecognized type name: null 
(state=,code=0)
Closing: 0: jdbc:hive2://localhost:1

$ bin/beeline -u jdbc:hive2://localhost:1 -e "select if(true,null,null)"
Connecting to jdbc:hive2://localhost:1
Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Error: java.lang.IllegalArgumentException: Unrecognized type name: null 
(state=,code=0)
Closing: 0: jdbc:hive2://localhost:1
```

**After**
```sql
$ bin/beeline -u jdbc:hive2://localhost:1 -e "select null"
Connecting to jdbc:hive2://localhost:1
Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_

[GitHub] spark pull request #17895: Branch 2.0

2017-05-07 Thread fjh100456
Github user fjh100456 closed the pull request at:

https://github.com/apache/spark/pull/17895


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17895: Branch 2.0

2017-05-07 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/17895
  
Sorry, I made a mistake. I'll close it now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equ...

2017-05-09 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/17923

[SPARK-20591][WEB UI] Succeeded tasks num not equal in all jobs page and 
job detail page on spark web ui when speculative task(s) exist.

## What changes were proposed in this pull request?

Modified the succeeded-task count on the job detail page from "completed = 
stageData.completedIndices.size" to "completed = stageData.numCompleteTasks", 
which makes the succeeded-task counts on the all-jobs page and the job detail 
page consistent and makes it easier to find which stages the speculative 
task(s) were in.
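
To illustrate the difference with a hypothetical sketch (these are not the 
real Spark UI classes): with speculation enabled, counting successful task 
attempts can exceed the number of distinct task indices.

```scala
// Hypothetical sketch, not the actual Spark UI data model.
// numCompleteTasks counts every successful attempt (speculative duplicates included),
// while completedIndices tracks distinct task indices.
case class StageData(numCompleteTasks: Int, completedIndices: Set[Int])

val stage = StageData(numCompleteTasks = 101, completedIndices = (0 until 100).toSet)
val succeededByAttempts = stage.numCompleteTasks       // 101, can exceed the stage's task count
val succeededByIndices  = stage.completedIndices.size  // 100
```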

## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17923.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17923


commit 78b0f975cfa02ba7f5bfeb0aa5a07f12a228658d
Author: fjh100456 
Date:   2017-05-09T13:29:33Z

[SPARK-20591][WEB UI] Succeeded tasks num not equal in all jobs page and 
job detail page on spark web ui when speculative task(s) exist.

## What changes were proposed in this pull request?

Modified the succeeded-task count on the job detail page from "completed = 
stageData.completedIndices.size" to "completed = stageData.numCompleteTasks", 
which makes the succeeded-task counts on the all-jobs page and the job detail 
page consistent and makes it easier to find which stages the speculative 
task(s) were in.

## How was this patch tested?

manual tests




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-09 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/17923
  
I see, but it's really not easy to find the speculative tasks, especially 
when the succeeded-task counts are inconsistent between the all-jobs page and 
the job detail page.
Shall we make them look consistent by modifying the count on the all-jobs 
page? Or is that also intentional? @ajbozarth   @pwendell 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-10 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/17923
  

The description of @pwendell's JIRA reads as if the only concern is that the 
number of completed tasks can exceed the total number of tasks. But I think 
this excess can remind the user that speculation has been triggered, and users 
like me would then check whether it is a host problem, a data problem, or 
something else. The excess makes it more convenient for me to discover and 
locate speculative tasks.

So I prefer to allow the "excess". I have checked the code, and 
"completedIndices" is used only for calculating the number of tasks and 
nowhere else. If this is OK, I will remove "completedIndices".



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-11 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/17923
  
Well, I'll change the succeeded-task count on the all-jobs page to match the 
job detail page, and also record the completed task IDs in a 
"completedIndices".
Is that OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-16 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/17923
  
The build seems to have failed, but it doesn't appear to be related to my 
changes.
Why does it show "No changes"? Is something wrong?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-20 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/17923
  
Yes, I've fixed the use case and bugs. @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-20 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/17923
  

![test](https://cloud.githubusercontent.com/assets/26785576/26275734/78cd1674-3d9a-11e7-9b9c-fdb63bab0576.jpg)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-22 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/17923
  
It's done. 
Please rebuild again, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18351: [SPARK-21135][WEB UI] On history server page,du...

2017-06-19 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/18351

[SPARK-21135][WEB UI] On history server page,duration of incompleted 
applications should be hidden instead of showing up as 0

## What changes were proposed in this pull request?

Hide the duration of incomplete applications.

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18351.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18351


commit f91ed07718d4462e487237353bf82c80c5e148f7
Author: fjh100456 
Date:   2017-06-19T08:53:06Z

[SPARK-21135][WEB UI] On history server page,duration of incompleted 
applications should be hidden instead of showing up as 0

## What changes were proposed in this pull request?

Hide duration of incomplete applications.

## How was this patch tested?

manual tests




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-19 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
I have not found a similar problem on other pages. The CompleteTime on this 
page is already hidden, which suggests this was considered before. 
Have you considered the other question I mentioned in the JIRA about 
applications that were aborted abnormally?   @zhouliu  @srowen


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-19 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
Yes,it should be. @ajbozarth 

The screenshot:@zhuoliu


![default](https://user-images.githubusercontent.com/26785576/27312007-89a3eca6-5597-11e7-81fe-7dcff2c2a861.png)
 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-19 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
@guoxiaolongzte I have considered this problem, but there is no "EndTime" 
value for in-progress applications, including the abnormally aborted 
applications I mentioned. So in the Scala code this value is not computed but 
is set directly to 0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-20 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
@jerryshao I thought about doing it this way, but an abnormally aborted 
application will always be treated as an "incomplete application"; if we use 
"currTimeInMs - startTime", the duration of this kind of application will be 
wrong and look very strange. 
I have also considered using lastUpdated - startTime, but in some cases that 
is not very accurate.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-20 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
@jerryshao For example, if you use a shell command to kill the application's 
submit process from outside, the application is always treated as an 
"incomplete application" in the history server. In this case it would be wrong 
to use "currentTime - startTime": the duration would keep increasing with each 
refresh, even though it should actually stop at the kill.

If we must keep the "duration", I tend to use "lastUpdated - startTime"; it 
may sometimes be inaccurate, but it is at least closer to the real value and 
more acceptable than displaying a wrong value. What do you think?
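
A minimal sketch of the fallback being discussed (the case class and field 
names are hypothetical, not the actual history server data model):

```scala
// Hypothetical sketch of the duration fallback; not the real history server code.
case class AppAttemptInfo(startTime: Long, endTime: Long, lastUpdated: Long, completed: Boolean)

def durationMs(attempt: AppAttemptInfo): Option[Long] = {
  if (attempt.completed) {
    Some(attempt.endTime - attempt.startTime)      // normal case: a real end time exists
  } else if (attempt.lastUpdated > attempt.startTime) {
    Some(attempt.lastUpdated - attempt.startTime)  // fallback: last event-log update time
  } else {
    None                                           // unknown: hide instead of showing 0
  }
}
```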


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-20 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
Sorry, I forgot to explain that by "lastUpdated" I mean the value of 
`FileStatus.getModificationTime`. Maybe the value of the lastUpdated field 
needs a change too. @jerryshao 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-20 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
Oops, I didn't notice the comment. But it seems that if we don't change 
"lastUpdated", it is effectively just the current time.
Even so, I'd like to ask for other contributors' opinions: which is better, 
hiding the value or changing it to currentTime - startTime? After all, 
abnormal scenarios can happen quite often, such as the driver going down or 
the process being killed.

I tend to show correct content rather than more, but possibly incorrect, 
content. And the duration of in-progress applications can be seen on other 
pages, so it is not unknowable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-22 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  

According to @jerryshao, I should change it to use "currentTime - startTime", 
although in some scenarios that value is wrong.
But I tend to hide the duration for incomplete applications, because it is 
already shown on the master page; there is no need to display a value here 
that might not always be correct.

Can you help me decide? @srowen @ajbozarth @zhuoliu 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-25 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
Solving this problem completely seems to require a big change, and it may not 
be worth making such a big change for such a small problem, so I think we 
should hide it. 
cc @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-28 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/18351
  
Hi, @srowen , what's your opinion about this issue? There's been no 
progress for days.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-11-28 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@gatorsmile 
I tested the performance of the 'uncompressed', 'snappy', and 'gzip' 
compression codecs for Parquet with input data volumes of 22MB, 220MB, and 
1100MB, each run 10 times; overall 'snappy' came out best in most cases.
The test results are as follows (time unit: ms):

![default](https://user-images.githubusercontent.com/26785576/33362659-74c0cf06-d518-11e7-9907-2f353ffed37d.png)



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-05 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@gatorsmile  
Thank you very much. I'll fix all the problems later. Also, I'm not quite sure 
what you mean by 'workload'; over the last two days I changed the test into a 
benchmark, and the code and results are as follows:

``` scala
class InsertHiveTableBenchmark extends SQLTestUtils {
  lazy val spark = SparkSession.builder
.enableHiveSupport()
.master("local[1]")
.appName("microbenchmark")
.config("spark.sql.warehouse.dir", "file:/E:/study/temp/warehouse")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()

  /** Runs function `f` with different format. */
  private def runBenchmark(
name: String, cardinality: Long, tables: List[String])(f: String => 
Unit): Unit = {
val benchmark = new Benchmark(name, cardinality)
val numIters = 5

tables.map { table =>
  benchmark.addCase(s"To $table", numIters)(iter => f(table))
}

benchmark.run()
  }

  private val dataSourceName: String = "projdata"

  private def getTableByFormatCompression(format: String, compressionCodec: 
String): String = {
s"${dataSourceName}_${format}_${compressionCodec}"
  }

  private def createTable(format: String, compressionCodec: String): String 
= {
val tableName = getTableByFormatCompression(format, compressionCodec)
val compressionName = getCompressionNameByFormat(format)
spark.sql(
  s"""CREATE TABLE if not exists $tableName(
  /*--- About 74 fields omitted; I cannot show them because of company policy ---*/)
 |PARTITIONED BY (p_provincecode int)
 |STORED AS $format
 
|TBLPROPERTIES('$compressionName'='$compressionCodec')""".stripMargin)
tableName
  }

  private def insertOverwriteTable(tableName: String, dataSource: String): 
Unit = {
spark.sql(s"insert overwrite table $tableName 
PARTITION(p_provincecode=51) select * from $dataSource")
  }

  private def withCachedTable(sizeInMb: Int)(f: (String, Long) => Unit): 
Unit = {
val size = 22 //textfile data size in MB
val times = if (sizeInMb <= size) 1 else sizeInMb / size
val unions = (0 until times).map(_ => s"select * from $dataSourceName 
where p_provincecode=51")
val cache = spark.sql(unions.mkString(" union all ")).cache()
val tempViewName = s"${dataSourceName}_cache"
cache.createTempView(tempViewName)
try {
  val numRecords = cache.count()
  f(tempViewName, numRecords)
} finally {
  spark.sqlContext.dropTempTable(tempViewName)
  cache.unpersist()
}
  }

  private def getCompressionNameByFormat(format: String): String = {
format.toLowerCase match {
  case "parquet" => "parquet.compression"
  case "orc" => "orc.compress"
  case _ => throw new Exception(s"Invalid format $format")
}
  }

  private def getCompressionCodesByFormat(format: String): List[String] = {
format.toLowerCase match {
  case "parquet" => List("UNCOMPRESSED", "SNAPPY", "GZIP")
  case "orc" => List("NONE", "SNAPPY", "ZLIB")
  case _ => throw new Exception(s"Invalid format $format")
}
  }

  private def performance(dataSizeInMB: Int, format: String): Unit = {
withCachedTable(dataSizeInMB) { case (cacheName, numRecords) =>
  val tables = getCompressionCodesByFormat(format).map(codec => 
createTable(format, codec))
  withTable(tables:_*) {
runBenchmark(s"Inert overwrite ${dataSizeInMB}MB to hive table", 
numRecords, tables){ tableName =>
  insertOverwriteTable(tableName, cacheName)
}
  }
}
  }

  test("Insert 22MB textfile datasource into parquet hive table") {
performance(22, "parquet")
/*
Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13 on Windows 7 6.1
Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
Inert overwrite 22MB to hive table:  Best/Avg Time(ms)Rate(M/s) 
  Per Row(ns)   Relative


To projdata_parquet_UNCOMPRESSED  2043 / 2509  0.0   
45247.5 

[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-18 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
I will change the code following @gatorsmile's suggestion; it's been a little 
busy these days. I will do it tomorrow.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-19 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@gatorsmile @maropu  Does it look better now? Regarding the statistics issue, 
is there any suggestion?
 
@SparkQA Please start the test, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...

2018-10-09 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/22641
  
I think random tests are not a good solution. If the test really runs too 
slowly, it may be much better to reduce the number of codecs. We already have 
unit tests on the codec priority in 
`ParquetCompressionCodecPrecedenceSuite.scala` and `OrcSourceSuite.scala`; 
here we just want to test the correctness of the whole process, so it seems 
feasible to choose only 1 or 2 codec(s) for testing.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve te...

2018-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22641#discussion_r223914298
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---
@@ -262,7 +261,10 @@ class CompressionCodecSuite extends TestHiveSingleton 
with ParquetTest with Befo
 }
   }
 
-  def checkForTableWithCompressProp(format: String, compressCodecs: 
List[String]): Unit = {
+  def checkForTableWithCompressProp(
+  format: String,
+  tableCompressCodecs: List[String],
+  sessionCompressCodecs: List[String]): Unit = {
 Seq(true, false).foreach { isPartitioned =>
--- End diff --

This combination provides different test scenarios; they differ in how the 
data is written and how table properties are passed, both of which can affect 
the test results. So I think it's necessary to keep it.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22693: [SPARK-25701][SQL] Supports calculation of table ...

2018-10-10 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/22693

[SPARK-25701][SQL] Supports calculation of table statistics from 
partition's catalog statistics.

## What changes were proposed in this pull request?

When determining table statistics, if the `totalSize` of the table is not 
defined, we fall back to HDFS to get the table statistics when 
`spark.sql.statistics.fallBackToHdfs` is `true`; otherwise the default value 
(`spark.sql.defaultSizeInBytes`) is used, which means tables without the 
`totalSize` property may never be broadcast (except Parquet). 

Fortunately, in most cases the data is written into the table by an insertion 
command, which saves the data size in the metastore, so it is possible to use 
the metastore to calculate the table statistics.

## How was this patch tested?
Add test.
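
For illustration only, a minimal sketch of the idea of summing the 
partition-level catalog statistics (assuming a Hive-enabled SparkSession; the 
helper name is hypothetical and this is not the PR's actual implementation):

```scala
// Hypothetical sketch of the idea, not the PR's actual implementation.
// Assumes a Hive-enabled SparkSession; the table name is just an example.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

def totalSizeFromPartitionStats(spark: SparkSession, table: String): Option[BigInt] = {
  val catalog = spark.sessionState.catalog
  val parts = catalog.listPartitions(TableIdentifier(table))
  // Sum the per-partition sizes recorded in the metastore, if every partition has stats.
  val sizes = parts.map(_.stats.map(_.sizeInBytes))
  if (sizes.nonEmpty && sizes.forall(_.isDefined)) Some(sizes.flatten.sum) else None
}
```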


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark StatisticCommit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22693.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22693


commit e610477063b4f326b8261d59b55abce83cbb82e7
Author: fjh100456 
Date:   2018-10-11T06:43:52Z

[SPARK-25701][SQL] Supports calculation of table statistics from 
partition's catalog statistics.

## What changes were proposed in this pull request?

When obtaining table statistics, if the `totalSize` of the table is not 
defined, we fall back to HDFS to get the table statistics when 
`spark.sql.statistics.fallBackToHdfs` is `true`; otherwise the default 
value (`spark.sql.defaultSizeInBytes`) is taken.

Fortunately, in most cases the data is written into the table by an insertion 
command that saves the data size in the metastore, so it is possible to use the 
metastore to calculate the table statistics.

## How was this patch tested?
Add test.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

2018-10-12 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/22707

[SPARK-25717][SQL] Insert overwrite a recreated external and partitioned 
table may result in incorrect query results

## What changes were proposed in this pull request?

Consider the following scenario:

```
spark.range(100).createTempView("temp")
(0 until 3).foreach { _ =>
  spark.sql("drop table if exists tableA")
  spark.sql("create table if not exists tableA(a int) partitioned by (p int) location 'file:/e:/study/warehouse/tableA'")
  spark.sql("insert overwrite table tableA partition(p=1) select * from temp")
  spark.sql("select count(1) from tableA where p=1").show
}
```

We expect the count to always be 100, but the actual results are as follows:

```
+--------+
|count(1)|
+--------+
|     100|
+--------+

+--------+
|count(1)|
+--------+
|     200|
+--------+

+--------+
|count(1)|
+--------+
|     300|
+--------+
```

When Spark executes an `insert overwrite` command, it gets the historical 
partition first and then deletes it from the file system.

But for a recreated external, partitioned table, the partitions were all 
deleted by the `drop table` command while the data was left unremoved. So the 
historical data is preserved, which leads to incorrect query results.
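
A simplified sketch of the idea behind the fix (not the actual patch; it also ignores the partition-value escaping that `ExternalCatalogUtils.generatePartitionPath` handles): even when the partition is missing from the metastore, derive its expected location from the table location and the partition spec, and remove any leftover files before loading the new data.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Delete a possibly stale partition directory even if the metastore no longer
// knows about the partition. `tableLocation` and `partitionSpec` are assumed inputs.
def cleanStalePartitionDir(
    tableLocation: Path,
    partitionSpec: Seq[(String, String)],
    hadoopConf: Configuration): Unit = {
  // e.g. <tableLocation>/p=1 for a single partition column
  val partitionPath = partitionSpec.foldLeft(tableLocation) {
    case (parent, (col, value)) => new Path(parent, s"$col=$value")
  }
  val fs = partitionPath.getFileSystem(hadoopConf)
  if (fs.exists(partitionPath) && !fs.delete(partitionPath, true)) {
    throw new RuntimeException(s"Cannot remove partition directory '$partitionPath'")
  }
}
```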

## How was this patch tested?
Manual test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark InsertOverwriteCommit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22707
    

commit 8e1b1be1ac7f7cc4fe31ccf99fb92dd7d8fc8918
Author: fjh100456 
Date:   2018-10-12T09:45:39Z

[SPARK-25717][SQL] Insert overwrite a recreated external and partitioned 
table may result in incorrect query results.

## What changes were proposed in this pull request?

When Spark executes an `insert overwrite` command, it gets the historical 
partition first and then deletes it from the file system.

But for a recreated external, partitioned table, the partitions were all 
deleted by the `drop table` command while the data was left unremoved. So the 
historical data is preserved, which leads to incorrect query results.

## How was this patch tested?
Manual test.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

2018-10-14 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/22707
  
@wangyum I have added the test. Thank you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

2018-10-16 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22707#discussion_r225756219
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
           // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
           // version and we may not want to catch up new Hive version every time. We delete the
           // Hive partition first and then load data file into the Hive partition.
-          if (oldPart.nonEmpty && overwrite) {
-            oldPart.get.storage.locationUri.foreach { uri =>
-              val partitionPath = new Path(uri)
-              val fs = partitionPath.getFileSystem(hadoopConf)
-              if (fs.exists(partitionPath)) {
-                if (!fs.delete(partitionPath, true)) {
-                  throw new RuntimeException(
-                    "Cannot remove partition directory '" + partitionPath.toString)
-                }
-                // Don't let Hive do overwrite operation since it is slower.
-                doHiveOverwrite = false
+          if (overwrite) {
+            val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
+              .getOrElse {
+                ExternalCatalogUtils.generatePartitionPath(
+                  partitionSpec,
+                  partitionColumnNames,
+                  HiveClientImpl.toHiveTable(table).getDataLocation)
--- End diff --

> 
> 
> `HiveClientImpl.toHiveTable(table).getDataLocation` -> `new 
Path(table.location)`?

Yes, they get the same value. I'll change it, thank you very much.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22301: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec...

2018-08-31 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/22301

[SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test for CTAS

What changes were proposed in this pull request?
Since [#20522](https://github.com/apache/spark/pull/20522) by @dongjoon resolved 
the underlying issue, the compressionCodec test for CTAS can now be supported, so 
the CTAS scenario suggested by @gatorsmile in 
[20087](https://github.com/apache/spark/pull/20087#discussion_r162252598) 
should be enabled.
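
For illustration, a minimal CTAS scenario of the kind being covered (the object name, table name, and codecs are illustrative, not the test code itself):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone example: with both a session codec and a table-level
// compression property set, the Parquet files written by the CTAS are expected
// to follow the precedence resolved by #20522 (table property over session conf).
object CtasCodecExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("ctas-codec-example")
      .enableHiveSupport()
      .getOrCreate()
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
    spark.sql(
      """CREATE TABLE ctas_codec_demo
        |STORED AS PARQUET
        |TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
        |AS SELECT id FROM range(10)""".stripMargin)
    spark.stop()
  }
}
```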

How was this patch tested?
Add test.

cc @gatorsmile @dongjoon-hyun 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark CompressionCodecCommit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22301.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22301


commit 5244aafc2d7945c11c96398b8d5b752b45fd148c
Author: Xianjin YE 
Date:   2018-01-02T15:30:38Z

[SPARK-22897][CORE] Expose stageAttemptId in TaskContext

## What changes were proposed in this pull request?
stageAttemptId added in TaskContext and corresponding construction 
modification

## How was this patch tested?
Added a new test in TaskContextSuite, two cases are tested:
1. Normal case without failure
2. Exception case with resubmitted stages

Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897)

Author: Xianjin YE 

Closes #20082 from advancedxy/SPARK-22897.

(cherry picked from commit a6fc300e91273230e7134ac6db95ccb4436c6f8f)
Signed-off-by: Wenchen Fan 

commit b96a2132413937c013e1099be3ec4bc420c947fd
Author: Juliusz Sompolski 
Date:   2018-01-03T13:40:51Z

[SPARK-22938] Assert that SQLConf.get is accessed only on the driver.

## What changes were proposed in this pull request?

Assert if code tries to access SQLConf.get on executor.
This can lead to hard to detect bugs, where the executor will read 
fallbackConf, falling back to default config values, ignoring potentially 
changed non-default configs.
If a config is to be passed to executor code, it needs to be read on the 
driver, and passed explicitly.

## How was this patch tested?

Check in existing tests.

Author: Juliusz Sompolski 

Closes #20136 from juliuszsompolski/SPARK-22938.

(cherry picked from commit 247a08939d58405aef39b2a4e7773aa45474ad12)
Signed-off-by: Wenchen Fan 

commit a05e85ecb76091567a26a3a14ad0879b4728addc
Author: gatorsmile 
Date:   2018-01-03T14:09:30Z

[SPARK-22934][SQL] Make optional clauses order insensitive for CREATE TABLE 
SQL statement

## What changes were proposed in this pull request?
Currently, our CREATE TABLE syntax requires the EXACT order of clauses. It 
is pretty hard to remember the exact order. Thus, this PR makes the optional 
clauses order insensitive for the `CREATE TABLE` SQL statement.

```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING datasource
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
```

The proposal is to make the following clauses order insensitive.
```
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
```

The same idea is also applicable to Create Hive Table.
```
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION path]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
```

The proposal is to make the following clauses order insensitive.
```
[COMMENT table_comment]
[PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION path]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
```

## How was this patch tested?
Added test cases

Author: gatorsmile 

Closes #20133 from gatorsmile/createDataSourceTableDDL.

(cherry picked from

[GitHub] spark pull request #22301: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec...

2018-08-31 Thread fjh100456
Github user fjh100456 closed the pull request at:

https://github.com/apache/spark/pull/22301


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22301: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...

2018-08-31 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/22301
  
Ok, I will do it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec...

2018-08-31 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/22302

[SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test for CTAS

What changes were proposed in this pull request?
Since #20522 by @DongJoon resolved the underlying issue, the compressionCodec test 
for CTAS can now be supported, so the CTAS scenario suggested by @gatorsmile in #20087 
should be enabled.

How was this patch tested?
Add test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark compressionCodec

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22302.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22302


commit c3c5af2b9f27e4effcc8f44f1b98f24a6d1fafa7
Author: fjh100456 
Date:   2018-08-31T09:44:41Z

[SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test for CTAS

What changes were proposed in this pull request?
Since #20522 by @DongJoon resolved the underlying issue, the compressionCodec test 
for CTAS can now be supported, so the CTAS scenario suggested by @gatorsmile in #20087 
should be enabled.

How was this patch tested?
Add test.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...

2018-09-02 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/22302
  
@maropu I'll update the PR description, thank you!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...

2018-09-05 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/22302
  
@dongjoon-hyun @maropu I'm so sorry; I have changed the description. 
Is it ok now? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...

2018-09-05 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/22302
  
The PR [#20120](https://github.com/apache/spark/pull/20120) was closed; 
instead, [SPARK-23355](https://issues.apache.org/jira/browse/SPARK-23355) 
resolved it.
Does [#21269](https://github.com/apache/spark/pull/21269/files) mean the same thing?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


