[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22932


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r232444034
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +37,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)
 
   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
--- End diff --

Right. This was the only way to avoid reflection.


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r232443802
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +37,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)
 
   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
--- End diff --

This is basically copied from `getRecordWriter`.
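
For reference, the copied method looks roughly like this — a paraphrase of `org.apache.orc.mapreduce.OrcOutputFormat.getRecordWriter` transliterated to Scala, not the PR's code; it shows why inlining it is needed to get a handle on the `Writer`:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.orc.OrcFile
import org.apache.orc.mapred.{OrcStruct, OrcOutputFormat => OrcMapRedOutputFormat}
import org.apache.orc.mapreduce.OrcMapreduceRecordWriter

def getRecordWriter(context: TaskAttemptContext): RecordWriter[NullWritable, OrcStruct] = {
  val filename = getDefaultWorkFile(context, ".orc")
  val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
  // The Writer is created and immediately wrapped here, so callers of
  // getRecordWriter never get a handle on which to call addUserMetadata.
  val writer = OrcFile.createWriter(filename, options)
  new OrcMapreduceRecordWriter[OrcStruct](writer)
}
```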


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r232430599
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -274,6 +278,15 @@ private[orc] class OrcOutputWriter(
 
   override def close(): Unit = {
     if (recordWriterInstantiated) {
+      // Hive 1.2.1 ORC initializes its private `writer` field at the first write.
+      try {
+        val writerField = recordWriter.getClass.getDeclaredField("writer")
+        writerField.setAccessible(true)
+        val writer = writerField.get(recordWriter).asInstanceOf[Writer]
+        writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
+      } catch {
+        case NonFatal(e) => log.warn(e.toString, e)
+      }
--- End diff --

For this case, I'll refactor out all the new code (lines 281~289).
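
A sketch of how that extraction could look on the `sql/hive` side — the helper name `addSparkVersionMetadata` and its placement are illustrative, not necessarily the final code; `SPARK_VERSION_METADATA_KEY`, `SPARK_VERSION_SHORT`, and `log` come from the surrounding class as in the diff:

```scala
import java.nio.charset.StandardCharsets.UTF_8

import scala.util.control.NonFatal

import org.apache.hadoop.hive.ql.io.orc.Writer

// Hypothetical helper extracted from close(). Hive 1.2.1's record writer
// keeps its Writer in a private field that is only initialized on the first
// write, hence the reflection and the defensive try/catch.
private def addSparkVersionMetadata(recordWriter: AnyRef): Unit = {
  try {
    val writerField = recordWriter.getClass.getDeclaredField("writer")
    writerField.setAccessible(true)
    val writer = writerField.get(recordWriter).asInstanceOf[Writer]
    writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
  } catch {
    case NonFatal(e) => log.warn(e.toString, e)
  }
}
```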


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r232428893
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -274,6 +278,15 @@ private[orc] class OrcOutputWriter(
 
   override def close(): Unit = {
     if (recordWriterInstantiated) {
+      // Hive 1.2.1 ORC initializes its private `writer` field at the first write.
+      try {
+        val writerField = recordWriter.getClass.getDeclaredField("writer")
+        writerField.setAccessible(true)
+        val writer = writerField.get(recordWriter).asInstanceOf[Writer]
+        writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
+      } catch {
+        case NonFatal(e) => log.warn(e.toString, e)
+      }
--- End diff --

BTW, as you expected, we cannot use a single function for this. The `Writer` types are not the same.
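
For clarity (my summary, not quoted from the PR): the two code paths target distinct classes that merely share a simple name, so one shared helper can't cover both.

```scala
// Two different addUserMetadata targets with the same simple name:
object OrcWriterTypes {
  type NativeOrcWriter = org.apache.orc.Writer                   // sql/core ("native" implementation)
  type HiveOrcWriter   = org.apache.hadoop.hive.ql.io.orc.Writer // sql/hive (Hive 1.2.1 bundled ORC)
}
```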


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r232428173
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +41,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)
 
   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
+    writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
--- End diff --

Thank you for the review, @gatorsmile. Sure, I'll refactor out the following line:
```
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
```
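
On the `sql/core` side the extraction is straightforward, since the native `Writer` is already in hand. A minimal sketch (the helper name is illustrative; `SPARK_VERSION_METADATA_KEY` and `SPARK_VERSION_SHORT` come from the PR's diff):

```scala
import java.nio.charset.StandardCharsets.UTF_8

import org.apache.orc.Writer

// Hypothetical extraction of the metadata write shown in the diff above.
private def addSparkVersionMetadata(writer: Writer): Unit = {
  writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
}
```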


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r232424657
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -274,6 +278,15 @@ private[orc] class OrcOutputWriter(
 
   override def close(): Unit = {
     if (recordWriterInstantiated) {
+      // Hive 1.2.1 ORC initializes its private `writer` field at the first write.
+      try {
+        val writerField = recordWriter.getClass.getDeclaredField("writer")
+        writerField.setAccessible(true)
+        val writer = writerField.get(recordWriter).asInstanceOf[Writer]
+        writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
+      } catch {
+        case NonFatal(e) => log.warn(e.toString, e)
+      }
--- End diff --

The same comment here.


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r232424626
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +41,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)
 
   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
+    writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
--- End diff --

Could we create a separate function for adding this metadata?


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r231791191
  
--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values  [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics   1121 bytes, 3 rows
+Partition Statistics   1229 bytes, 3 rows
--- End diff --

It's filed, and I made [a PR for SPARK-25971](https://github.com/apache/spark/pull/22972) for `SQLQueryTestSuite`.


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r231777190
  
--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values  [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics   1121 bytes, 3 rows
+Partition Statistics   1229 bytes, 3 rows
--- End diff --

Hmmm... yeah, I think we should avoid that.


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r231775619
  
--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values  [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics   1121 bytes, 3 rows
+Partition Statistics   1229 bytes, 3 rows
--- End diff --

Nice catch! Hmm. I think we should not measure the bytes in the test case.


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r231775020
  
--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values  [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics   1121 bytes, 3 rows
+Partition Statistics   1229 bytes, 3 rows
--- End diff --

Hm, does this mean the tests will basically fail, or need to be fixed, for official releases (since the version string won't have `-SNAPSHOT`)?


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r230610261
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
@@ -44,4 +44,13 @@ package object sql {
   type Strategy = SparkStrategy
 
   type DataFrame = Dataset[Row]
+
+  /**
+   * Metadata key which is used to write the Spark version into the following:
+   * - Parquet file metadata
+   * - ORC file metadata
+   *
+   * Note that the Hive table property `spark.sql.create.version` also carries the Spark version.
+   */
+  private[sql] val CREATE_VERSION = "org.apache.spark.sql.create.version"
--- End diff --

Thank you for the review, @hvanhovell. Yes, we can use `org.apache.spark.version`, since this is a new key.

Although the Hive table property `spark.sql.create.version` has a `.create.` part, it seems we don't need to follow that convention here.


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-04 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r230604337
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
@@ -44,4 +44,13 @@ package object sql {
   type Strategy = SparkStrategy
 
   type DataFrame = Dataset[Row]
+
+  /**
+   * Metadata key which is used to write the Spark version into the following:
+   * - Parquet file metadata
+   * - ORC file metadata
+   *
+   * Note that the Hive table property `spark.sql.create.version` also carries the Spark version.
+   */
+  private[sql] val CREATE_VERSION = "org.apache.spark.sql.create.version"
--- End diff --

Is this a pre-existing key? It seems that `org.apache.spark.version` should be enough.


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r230564513
  
--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values  [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics   1121 bytes, 3 rows
+Partition Statistics   1229 bytes, 3 rows
--- End diff --

Right, @gatorsmile.


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r230563752
  
--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values  [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics   1121 bytes, 3 rows
+Partition Statistics   1229 bytes, 3 rows
--- End diff --

Is this caused by adding `org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT`?


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22932#discussion_r230547020
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -314,6 +316,21 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
       checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
     }
   }
+
--- End diff --

Please note that the following test case is executed twice: once in `OrcSourceSuite` and once in `HiveOrcSourceSuite`.
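
The stub in the diff above elides the test body; a minimal sketch of what such a round-trip check could look like, assuming the suite's usual `withTempPath` helper and the ORC `Reader` metadata API (the actual test may differ):

```scala
import java.nio.charset.StandardCharsets.UTF_8

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

test("Write Spark version into ORC file metadata") {
  withTempPath { dir =>
    spark.range(1).repartition(1).write.orc(dir.getCanonicalPath)

    // Open the single produced ORC file and read the user metadata back.
    val orcFile = dir.listFiles().filter(_.getName.endsWith(".orc")).head
    val reader = OrcFile.createReader(
      new Path(orcFile.getAbsolutePath), OrcFile.readerOptions(new Configuration()))
    assert(reader.hasMetadataValue(SPARK_VERSION_METADATA_KEY))
    val version = UTF_8.decode(reader.getMetadataValue(SPARK_VERSION_METADATA_KEY)).toString
    assert(version === SPARK_VERSION_SHORT)
  }
}
```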


---




[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...

2018-11-03 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/22932

[SPARK-25102][SQL] Write Spark version to ORC/Parquet file metadata

## What changes were proposed in this pull request?

Currently, Spark writes its version number into Hive table properties under the key `spark.sql.create.version`:
```
parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write the Spark version into ORC/Parquet file metadata under the key `org.apache.spark.sql.create.version`. The key differs from the Hive table property key `spark.sql.create.version`, but it seems we cannot change that one for backward-compatibility reasons.

**ORC (`native` and `hive` implementations)**
```
File Version: 0.12 with ORC_135
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
```

**PARQUET**
```
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```
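
For the Parquet side, a hedged sketch of how the `extra` entry above can be read back with parquet-mr (the API choice here is mine; the PR's tests may do this differently):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

// Reads the footer key/value metadata of one Parquet file and looks up
// the version key shown in the output above.
def readSparkVersion(conf: Configuration, file: Path): Option[String] = {
  val footer = ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER)
  Option(footer.getFileMetaData.getKeyValueMetaData.get("org.apache.spark.sql.create.version"))
}
```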

## How was this patch tested?

Passes Jenkins with newly added test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-25102

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22932.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22932


commit 601ccbb4e20a068469839bc71870230cfb6fd7a1
Author: Dongjoon Hyun 
Date:   2018-11-03T06:43:48Z

[SPARK-25102][SQL] Write Spark version to ORC/Parquet file metadata




---
