[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22932
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232444034

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +37,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)

   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
--- End diff --

Right. To avoid reflection, this was the only way.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232443802

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +37,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)

   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
--- End diff --

This is basically copied from `getRecordWriter`.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232430599

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -274,6 +278,15 @@ private[orc] class OrcOutputWriter(
 
   override def close(): Unit = {
     if (recordWriterInstantiated) {
+      // Hive 1.2.1 ORC initializes its private `writer` field at the first write.
+      try {
+        val writerField = recordWriter.getClass.getDeclaredField("writer")
+        writerField.setAccessible(true)
+        val writer = writerField.get(recordWriter).asInstanceOf[Writer]
+        writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
+      } catch {
+        case NonFatal(e) => log.warn(e.toString, e)
+      }
--- End diff --

For this case, I'll refactor out all of the new code (lines 281 ~ 289).
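As a rough sketch of what factoring those lines out could look like on the Hive 1.2.1 path: the object and method names below are hypothetical, `recordWriter` is taken as `AnyRef` since the field is reached via reflection anyway, and the `SPARK_VERSION_METADATA_KEY` constant is assumed to live in the `org.apache.spark.sql` package object (where this PR's diff places its metadata key). This is not necessarily the shape that was finally merged.

```scala
import java.nio.charset.StandardCharsets.UTF_8

import scala.util.control.NonFatal

import org.apache.hadoop.hive.ql.io.orc.Writer

import org.apache.spark.SPARK_VERSION_SHORT
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SPARK_VERSION_METADATA_KEY

// Hypothetical helper that encapsulates the reflective metadata write on the Hive 1.2.1
// ORC path; the private `writer` field only exists after the first record is written.
object HiveOrcVersionMetadata extends Logging {
  def addSparkVersionMetadata(recordWriter: AnyRef): Unit = {
    try {
      val writerField = recordWriter.getClass.getDeclaredField("writer")
      writerField.setAccessible(true)
      val writer = writerField.get(recordWriter).asInstanceOf[Writer]
      writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
    } catch {
      // Metadata is best-effort: never fail the write because of it.
      case NonFatal(e) => log.warn(e.toString, e)
    }
  }
}
```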
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232428893

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -274,6 +278,15 @@ private[orc] class OrcOutputWriter(
 
   override def close(): Unit = {
     if (recordWriterInstantiated) {
+      // Hive 1.2.1 ORC initializes its private `writer` field at the first write.
+      try {
+        val writerField = recordWriter.getClass.getDeclaredField("writer")
+        writerField.setAccessible(true)
+        val writer = writerField.get(recordWriter).asInstanceOf[Writer]
+        writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
+      } catch {
+        case NonFatal(e) => log.warn(e.toString, e)
+      }
--- End diff --

BTW, as you expected, we cannot use a single shared function for this: the `Writer` classes are not the same.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232428173

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +41,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)

   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
+    writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
--- End diff --

Thank you for the review, @gatorsmile. Sure. I'll refactor out the following line.
```
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
```
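A minimal sketch of the kind of separate function being asked for on the native ORC path, where the `org.apache.orc.Writer` is available directly. The object and method names are hypothetical, and the constant is assumed to live in the `org.apache.spark.sql` package object; the body is the single line from the diff above.

```scala
import java.nio.charset.StandardCharsets.UTF_8

import org.apache.orc.Writer

import org.apache.spark.SPARK_VERSION_SHORT
import org.apache.spark.sql.SPARK_VERSION_METADATA_KEY

// Hypothetical helper: record the Spark version that produced this ORC file
// in the file's user metadata, so downstream tools can tell which writer created it.
object OrcVersionMetadata {
  def addSparkVersionMetadata(writer: Writer): Unit = {
    writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
  }
}
```

With such a helper, the added line in the diff would become `OrcVersionMetadata.addSparkVersionMetadata(writer)` right after `OrcFile.createWriter`.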
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232424657

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -274,6 +278,15 @@ private[orc] class OrcOutputWriter(
 
   override def close(): Unit = {
     if (recordWriterInstantiated) {
+      // Hive 1.2.1 ORC initializes its private `writer` field at the first write.
+      try {
+        val writerField = recordWriter.getClass.getDeclaredField("writer")
+        writerField.setAccessible(true)
+        val writer = writerField.get(recordWriter).asInstanceOf[Writer]
+        writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
+      } catch {
+        case NonFatal(e) => log.warn(e.toString, e)
+      }
--- End diff --

The same comment here.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r232424626

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala ---
@@ -36,11 +41,17 @@ private[orc] class OrcOutputWriter(
   private[this] val serializer = new OrcSerializer(dataSchema)

   private val recordWriter = {
-    new OrcOutputFormat[OrcStruct]() {
+    val orcOutputFormat = new OrcOutputFormat[OrcStruct]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
         new Path(path)
       }
-    }.getRecordWriter(context)
+    }
+    val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
+    val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
+    val writer = OrcFile.createWriter(filename, options)
+    val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
+    writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
--- End diff --

Could we create a separate function for adding this metadata?
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r231791191

--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics 1121 bytes, 3 rows
+Partition Statistics 1229 bytes, 3 rows
--- End diff --

It's filed, and I made [a PR for SPARK-25971](https://github.com/apache/spark/pull/22972) for SQLQueryTestSuite.
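For context, the earlier comments in this thread suggest the golden files should stop recording exact byte counts, which is what the linked SQLQueryTestSuite change is about. A hedged sketch of that kind of output normalization follows; the object name, method name, regex, and replacement text are all assumptions for illustration, not the actual patch.

```scala
// Hypothetical normalization step for golden-file comparison: mask exact byte counts so
// the expected output does not depend on the physical file size, which changes whenever
// extra file metadata (such as the Spark version) is written.
object PartitionStatsNormalization {
  def normalize(output: String): String =
    output.replaceAll("\\d+ bytes", "[not included in comparison] bytes")
}
```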
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r231777190

--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics 1121 bytes, 3 rows
+Partition Statistics 1229 bytes, 3 rows
--- End diff --

Hmmm... yeah, I think we should avoid that.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r231775619

--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics 1121 bytes, 3 rows
+Partition Statistics 1229 bytes, 3 rows
--- End diff --

Nice catch! Hmm. I think we should not measure the bytes in the test case.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r231775020

--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics 1121 bytes, 3 rows
+Partition Statistics 1229 bytes, 3 rows
--- End diff --

Hm, does that mean these tests will basically fail (or need fixing) for official releases, since their version string doesn't have `-SNAPSHOT`?
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r230610261

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
@@ -44,4 +44,13 @@ package object sql {
 
   type Strategy = SparkStrategy
 
   type DataFrame = Dataset[Row]
+
+  /**
+   * Metadata key which is used to write Spark version in the followings:
+   * - Parquet file metadata
+   * - ORC file metadata
+   *
+   * Note that Hive table property `spark.sql.create.version` also has Spark version.
+   */
+  private[sql] val CREATE_VERSION = "org.apache.spark.sql.create.version"
--- End diff --

Thank you for the review, @hvanhovell. Yes, we can use `org.apache.spark.version` since this is a new key. Although the Hive table property `spark.sql.create.version` has a `.create.` part, it seems we don't need to follow that convention here.
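A minimal sketch of what the renamed key could look like in the `org.apache.spark.sql` package object, following @hvanhovell's suggestion. The constant name `SPARK_VERSION_METADATA_KEY` is taken from the later revisions of this diff; the doc comment wording is an assumption, and the real package object also defines `Strategy`, `DataFrame`, etc.

```scala
package org.apache.spark

package object sql {
  /**
   * Metadata key used to record the Spark version that wrote the file in:
   *  - Parquet file metadata
   *  - ORC file metadata
   *
   * Note that the Hive table property `spark.sql.create.version` also carries a Spark
   * version, but that pre-existing key is left unchanged for backward compatibility.
   */
  private[sql] val SPARK_VERSION_METADATA_KEY = "org.apache.spark.version"
}
```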
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r230604337

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
@@ -44,4 +44,13 @@ package object sql {
 
   type Strategy = SparkStrategy
 
   type DataFrame = Dataset[Row]
+
+  /**
+   * Metadata key which is used to write Spark version in the followings:
+   * - Parquet file metadata
+   * - ORC file metadata
+   *
+   * Note that Hive table property `spark.sql.create.version` also has Spark version.
+   */
+  private[sql] val CREATE_VERSION = "org.apache.spark.sql.create.version"
--- End diff --

Is this a pre-existing key? It seems that `org.apache.spark.version` should be enough.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r230564513

--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics 1121 bytes, 3 rows
+Partition Statistics 1229 bytes, 3 rows
--- End diff --

Right, @gatorsmile.
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r230563752

--- Diff: sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out ---
@@ -93,7 +93,7 @@ Partition Values [ds=2017-08-01, hr=10]
 Location [not included in comparison]sql/core/spark-warehouse/t/ds=2017-08-01/hr=10
 Created Time [not included in comparison]
 Last Access [not included in comparison]
-Partition Statistics 1121 bytes, 3 rows
+Partition Statistics 1229 bytes, 3 rows
--- End diff --

Is this caused by adding `org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT`?
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22932#discussion_r230547020

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -314,6 +316,21 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
       checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
     }
   }
+
--- End diff --

Please note that the following test case is executed twice: once in `OrcSourceSuite` and once in `HiveOrcSourceSuite`.
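A hedged sketch of what such a read-back check could look like using the ORC reader API; the object and method names are hypothetical, the key constant's location is assumed, and the test actually added by the PR may be structured differently.

```scala
import java.nio.charset.StandardCharsets.UTF_8

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

import org.apache.spark.SPARK_VERSION_SHORT
import org.apache.spark.sql.SPARK_VERSION_METADATA_KEY

object OrcVersionMetadataCheck {
  // Hypothetical check: open one ORC file written by Spark and verify that its user
  // metadata records the Spark version under the expected key.
  def assertHasSparkVersion(orcFile: String): Unit = {
    val reader = OrcFile.createReader(new Path(orcFile), OrcFile.readerOptions(new Configuration()))
    assert(reader.hasMetadataValue(SPARK_VERSION_METADATA_KEY), s"missing $SPARK_VERSION_METADATA_KEY")
    val version = UTF_8.decode(reader.getMetadataValue(SPARK_VERSION_METADATA_KEY)).toString
    assert(version == SPARK_VERSION_SHORT, s"unexpected version: $version")
  }
}
```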
[GitHub] spark pull request #22932: [SPARK-25102][SQL] Write Spark version to ORC/Par...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/22932

[SPARK-25102][SQL] Write Spark version to ORC/Parquet file metadata

## What changes were proposed in this pull request?

Currently, Spark writes the Spark version number into Hive table properties with `spark.sql.create.version`.
```
parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write the Spark version to ORC/Parquet file metadata with `org.apache.spark.sql.create.version`. It's different from the Hive table property key `spark.sql.create.version`, but it seems that we cannot change that one for backward compatibility.

**ORC (`native` and `hive` implementations)**
```
File Version: 0.12 with ORC_135
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
```

**PARQUET**
```
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25102

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22932.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22932

commit 601ccbb4e20a068469839bc71870230cfb6fd7a1
Author: Dongjoon Hyun
Date: 2018-11-03T06:43:48Z

    [SPARK-25102][SQL] Write Spark version to ORC/Parquet file metadata
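The `extra:` lines in the Parquet example above come from inspecting the file footer. A hedged sketch of dumping that key/value metadata with parquet-mr's public reader API follows; the object name is hypothetical and this is only a verification aid, not part of the PR itself.

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object ParquetVersionCheck {
  /** Prints the key/value metadata from a Parquet file's footer, e.g. the Spark version entry. */
  def main(args: Array[String]): Unit = {
    val inputFile = HadoopInputFile.fromPath(new Path(args(0)), new Configuration())
    val reader = ParquetFileReader.open(inputFile)
    try {
      val keyValues = reader.getFooter.getFileMetaData.getKeyValueMetaData.asScala
      keyValues.foreach { case (k, v) => println(s"$k = $v") }
    } finally {
      reader.close()
    }
  }
}
```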