[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-157027289 @marmbrus Is this one OK for branch-1.6? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-157027571 @HyukjinKwon Thanks! I've merged this one to master. And yes, please feel free to add the decimal test case(s).
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9060
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-157138358 Merging to branch-1.6.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-157108064 Sure
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156879272 I accidentally came across `TODO Adds test case for reading dictionary encoded decimals written as 'FIXED_LEN_BYTE_ARRAY'`. I will also add this test in the follow-up PR that uses the overloaded `writeMetaFile`.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156891545 **[Test build #45964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45964/consoleFull)** for PR 9060 at commit [`cea5034`](https://github.com/apache/spark/commit/cea50348da091e5d83c14474a76d4f49e1ff3c9b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156891628 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45964/ Test PASSed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156891627 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156879507 **[Test build #45964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45964/consoleFull)** for PR 9060 at commit [`cea5034`](https://github.com/apache/spark/commit/cea50348da091e5d83c14474a76d4f49e1ff3c9b).
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9060#discussion_r44765188

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
@@ -513,6 +515,41 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
     }
   }

+  test("SPARK-11044 Parquet writer version fixed as version1 ") {
+
+    // For dictionary encoding, Parquet changes the encoding types according to its writer version
+    // So, this test checks the encoding types in order to ensure that the file is written with
+    // writer version2.
+    withTempPath { dir =>
+      val clonedConf = new Configuration(hadoopConfiguration)
+      try {
+
+        // Write a Parquet file with writer version 2
+        hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
+          ParquetProperties.WriterVersion.PARQUET_2_0.toString)
+
+        // By default, dictionary encoding is enabled from Parquet 1.2.0 but
+        // it is enabled just in case.
+        hadoopConfiguration.setBoolean(ParquetOutputFormat.ENABLE_DICTIONARY, true)
+        val path = s"${dir.getCanonicalPath}/part-r-0.parquet"
+        sqlContext.range(1 << 16).selectExpr("(id % 4) AS i")
+          .coalesce(1).write.mode("overwrite").parquet(path)
+
+        val blockMetadata = readFooter(new Path(path), hadoopConfiguration).getBlocks.asScala.head
+        val columnChunkMetadata = blockMetadata.getColumns.asScala.head
+
+        // If the file is written with version 2, this should include
+        // [[Encoding.RLE_DICTIONARY]] type. For version 1, it is Encoding.PLAIN_DICTIONARY
--- End diff --

BTW, the `[[...]]` notation is only useful when writing ScalaDoc. In inline comments like this one, you may either omit the brackets or use backquotes to emphasize that the quoted part is a Scala/Java entity.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9060#discussion_r44764961

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
@@ -513,6 +515,41 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
     }
   }

+  test("SPARK-11044 Parquet writer version fixed as version1 ") {
+
+    // For dictionary encoding, Parquet changes the encoding types according to its writer version
+    // So, this test checks the encoding types in order to ensure that the file is written with
+    // writer version2.
+    withTempPath { dir =>
+      val clonedConf = new Configuration(hadoopConfiguration)
+      try {
+
--- End diff --

Nit: Remove this empty line.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9060#discussion_r44764956

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
@@ -513,6 +515,41 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
     }
   }

+  test("SPARK-11044 Parquet writer version fixed as version1 ") {
+
--- End diff --

Nit: Remove this empty line.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156379942 LGTM except for a few minor styling issues. I can merge it right after you fix them.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156072061

I think we can check for column encoding information, which is accessible from Parquet footers. For example, `PARQUET_2_0` uses `RLE_DICTIONARY` while `PARQUET_1_0` uses `PLAIN_DICTIONARY` (see [here][1]). The [parquet-meta CLI tool][2] can be a reference for how to inspect related metadata.

[1]: https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L116-L123
[2]: https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/util/MetadataUtils.java#L139
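For readers following the thread, the footer check suggested above might look roughly like the sketch below. It is not the code from the PR. It assumes parquet-hadoop 1.7.0 or later (the `org.apache.parquet` packages) and hadoop-common on the classpath, and it uses `ParquetFileReader.readFooter(Configuration, Path)`, which newer parquet-mr releases mark as deprecated. The names `FooterEncodings`, `columnEncodings` and `looksLikeWriterV2` are hypothetical.

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.Encoding
import org.apache.parquet.hadoop.ParquetFileReader

// Hypothetical helper object, for illustration only.
object FooterEncodings {

  /** Collects the encodings recorded in the footer for every column chunk of one file. */
  def columnEncodings(file: Path, conf: Configuration): Set[Encoding] = {
    val footer = ParquetFileReader.readFooter(conf, file)
    footer.getBlocks.asScala                   // row groups
      .flatMap(_.getColumns.asScala)           // column chunks of each row group
      .flatMap(_.getEncodings.asScala)         // encodings of each column chunk
      .toSet
  }

  /**
   * Heuristic taken from the comment above: a dictionary-encoded column written
   * with PARQUET_2_0 reports RLE_DICTIONARY, one written with PARQUET_1_0
   * reports PLAIN_DICTIONARY.
   */
  def looksLikeWriterV2(encodings: Set[Encoding]): Boolean =
    encodings.contains(Encoding.RLE_DICTIONARY)
}
```

The heuristic only says something about files that actually contain a dictionary-encoded column, which is why the thread goes on to build a low-cardinality test column.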
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156076494 Thank you very much. I will try it that way.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156077334

You may construct a Parquet file consisting of a single column with dictionary encoding using:

```scala
val path = "file:///tmp/parquet/dict"
sqlContext.range(1 << 16).selectExpr("(id % 4) AS i").coalesce(1).write.mode("overwrite").parquet(path)
```

And here are instructions for building and installing the parquet-tools CLI tool. Then you can inspect Parquet metadata using:

```
$ parquet-meta /tmp/parquet/dict
file: file:/private/tmp/parquet/dict/part-r-0-88498608-9eed-4728-b96a-b60bc5ebc2a8.gz.parquet
creator: parquet-mr version 1.6.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"i","type":"long","nullable":true,"metadata":{}}]}

file schema: root
--
i: OPTIONAL INT64 R:0 D:1

row group 1: RC:65536 TS:16615 OFFSET:4
--
i:INT64 GZIP DO:0 FPO:4 SZ:198/16615/83.91 VC:65536 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
```

The `ENC:...` part in the last line is column encoding information.
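As a small illustration of reading that `ENC:...` field, the sketch below pulls the encoding names out of one column line of `parquet-meta` output and reports which dictionary encoding, if any, is present. It is plain Scala with no Parquet dependency. The object and method names (`EncLine`, `encodings`, `dictionaryEncoding`) are hypothetical, and it assumes the line layout shown in the sample output above, which may differ between parquet-tools versions.

```scala
// Hypothetical helper, for illustration only.
object EncLine {

  /** Extracts the comma-separated encoding names that follow "ENC:" in one column line. */
  def encodings(line: String): Set[String] =
    line.split("\\s+")
      .find(_.startsWith("ENC:"))
      .map(_.stripPrefix("ENC:").split(",").filter(_.nonEmpty).toSet)
      .getOrElse(Set.empty[String])

  /** Returns the dictionary encoding named on the line, if there is one. */
  def dictionaryEncoding(line: String): Option[String] = {
    val encs = encodings(line)
    if (encs.contains("RLE_DICTIONARY")) Some("RLE_DICTIONARY")
    else if (encs.contains("PLAIN_DICTIONARY")) Some("PLAIN_DICTIONARY")
    else None
  }
}
```

Applied to the last line of the sample output, `EncLine.dictionaryEncoding` yields `Some("PLAIN_DICTIONARY")`, matching the `PARQUET_1_0` behaviour described earlier in the thread.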
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156306727 Fortunately, I have worked with parquet-tools once and looked through the Parquet code several times :). Thank you very much for your help. This could be done much more easily than I thought because of your help.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156306860 [Test build #45810 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45810/consoleFull) for PR 9060 at commit [`2d1d343`](https://github.com/apache/spark/commit/2d1d343ab4a0218cfcbc621c6fccb77397e7).
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156322308 Build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156322309 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45810/ Test FAILed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156322233 [Test build #45810 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45810/console) for PR 9060 at commit [`2d1d343`](https://github.com/apache/spark/commit/2d1d343ab4a0218cfcbc621c6fccb77397e7).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156354310 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45831/ Test PASSed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156327284 Merged build started.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156327273 Merged build triggered.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156354309 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156354224 **[Test build #45831 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45831/consoleFull)** for PR 9060 at commit [`78449ec`](https://github.com/apache/spark/commit/78449ec530007bbebf729c19e74364dd0e001b81).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class TypedColumn[-T, U](`
   * `class JavaTrackStateDStream[KeyType, ValueType, StateType, EmittedType](`
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156099372 Thanks! I will follow that approach.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156309719 **[Test build #45811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45811/consoleFull)** for PR 9060 at commit [`7e80ad6`](https://github.com/apache/spark/commit/7e80ad6082a9f5b53f08800bfb519a2a80632ec8).
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156325563 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156325565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45811/ Test PASSed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156325499 **[Test build #45811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45811/consoleFull)** for PR 9060 at commit [`7e80ad6`](https://github.com/apache/spark/commit/7e80ad6082a9f5b53f08800bfb519a2a80632ec8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156327584 **[Test build #45831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45831/consoleFull)** for PR 9060 at commit [`78449ec`](https://github.com/apache/spark/commit/78449ec530007bbebf729c19e74364dd0e001b81).
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156306712 Build started.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156306692 Build triggered.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156309106 Merged build triggered.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156309116 Merged build started.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155718158 @HyukjinKwon Oh yeah, sorry. Finally got some time to clean up my review queue :) I wonder whether there is an easy way to add a test case for this? At first I thought `WriterVersion` corresponded to the `version` field of the Thrift struct `FileMetaData` described in [parquet-format][1], but it doesn't. The only effect I found is that when `WriterVersion` is set to v2, the Thrift field `PageHeader.type` is set to `DATA_PAGE_V2`. [1]: https://github.com/apache/parquet-format#metadata
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155718167 ok to test
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155718924 Merged build triggered.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155718954 Merged build started.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155720490 I will try to find and test them first tomorrow before adding a commit!
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155719264 **[Test build #45626 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45626/consoleFull)** for PR 9060 at commit [`2eee7e3`](https://github.com/apache/spark/commit/2eee7e37b6f366336cbe19bd9545f07abb13f7db).
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155752417 **[Test build #45626 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45626/consoleFull)** for PR 9060 at commit [`2eee7e3`](https://github.com/apache/spark/commit/2eee7e37b6f366336cbe19bd9545f07abb13f7db).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155753066 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155753068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45626/ Test PASSed.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155973698 @liancheng I made a few attempts to figure out the version, but as you said, it is pretty tricky to check the writer version, since it only changes the version of the data page, which is visible only inside Parquet's internals. Would it be inappropriate to write Parquet files with both version1 and version2 and then check whether their sizes differ? Since the encoding types are different, the sizes should differ as well.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-154597634 @liancheng I assume you missed this.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-148994769 /cc @liancheng
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9060#discussion_r41705069
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystWriteSupport.scala ---
@@ -431,6 +431,7 @@ private[parquet] object CatalystWriteSupport {
     configuration.set(SPARK_ROW_SCHEMA, schema.json)
     configuration.set(
       ParquetOutputFormat.WRITER_VERSION,
-      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
+      configuration.get(ParquetOutputFormat.WRITER_VERSION,
--- End diff --
Yep, I just updated it.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/9060#discussion_r41695242
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystWriteSupport.scala ---
@@ -431,6 +431,7 @@ private[parquet] object CatalystWriteSupport {
     configuration.set(SPARK_ROW_SCHEMA, schema.json)
     configuration.set(
       ParquetOutputFormat.WRITER_VERSION,
-      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
+      configuration.get(ParquetOutputFormat.WRITER_VERSION,
--- End diff --
Can you just use `setIfUnset` here?
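For readers following the review, here is a minimal, self-contained sketch of the behavior difference behind the `setIfUnset` suggestion. It is not the code from the PR: `Conf` is a hypothetical stand-in for Hadoop's `Configuration` (whose real `get` returns `null` rather than an `Option`), the object and method names are made up for illustration, and the key and default strings are assumed to match `ParquetOutputFormat.WRITER_VERSION` and `ParquetProperties.WriterVersion.PARQUET_1_0.toString`.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Hadoop's Configuration, just enough to run the example.
final class Conf {
  private val entries = mutable.Map.empty[String, String]

  def set(name: String, value: String): Unit = entries.update(name, value)

  def get(name: String): Option[String] = entries.get(name)

  // Writes the value only when the key has no value yet.
  def setIfUnset(name: String, value: String): Unit =
    if (!entries.contains(name)) entries.update(name, value)
}

object WriterVersionDefault {
  // Assumption: same string as ParquetOutputFormat.WRITER_VERSION.
  val WriterVersionKey = "parquet.writer.version"
  // Assumption: same string as ParquetProperties.WriterVersion.PARQUET_1_0.toString.
  val DefaultVersion = "PARQUET_1_0"

  // Behavior before the PR: the default overwrites whatever the user configured.
  def initOverwriting(conf: Conf): Unit =
    conf.set(WriterVersionKey, DefaultVersion)

  // Behavior the review asks for: the default applies only when nothing is configured.
  def initPreserving(conf: Conf): Unit =
    conf.setIfUnset(WriterVersionKey, DefaultVersion)
}
```

With a user-supplied value such as `PARQUET_2_0` already in the configuration, `initOverwriting` replaces it with the default, while `initPreserving` leaves it in place and only fills in the default when the key is missing.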
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9060
[SPARK-11044][SQL] Parquet writer version fixed as version1
https://issues.apache.org/jira/browse/SPARK-11044
Spark always writes Parquet files with writer version1, ignoring the writer version given by the user. This PR keeps the writer version if one is given and sets version1 as the default.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-11044
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9060.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9060
commit 5e72fbc93ec0783d5a440f8f70c7653f8fc39d9a
Author: HyukjinKwon
Date: 2015-10-10T06:59:52Z
[SPARK-11044][SQL] Apply the writer version if given.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-147047845 Can one of the admins verify this patch?