[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21070 thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90375/ Test PASSed. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #90375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90375/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3045/

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #90375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90375/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21070 (While I am here) seems fine to me otherwise. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21070 LGTM except 2 comments --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90327/ Test PASSed. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #90327 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90327/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3007/

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #90327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90327/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 @gatorsmile ok, I'll do later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands,

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-06 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 > sql("select * from parquetTable where value = '0'") > sql("select * from parquetTable where value = 0") Semantically, they are different. The results should be also different. The

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21070 For documentation, I think it's fine. it's a trackable history in JIRA although it's a bit hard to search .. Also, there might be some diff we should check since I haven't checked the details

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-06 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 @HyukjinKwon Thanks for the explanation! I feel it's ok to keep this as it it for now. btw, we don't need to document the story somewhere? ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21070 @liancheng should also remember all the story too since I talked with him at that time IIRC. --- - To unsubscribe, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21070 Yup, https://github.com/apache/spark/pull/21070#issuecomment-386793202, just for clarification, the given attribute and literal is castable but they are not being as so, right? I

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-05 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 ok, I checked string pushdown worked well; https://gist.github.com/maropu/e7cbfb8dc73a91806088e159f64e113f --- - To

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-05 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 I looked over the code and I fond this issue could be fix by; https://github.com/apache/spark/compare/master...maropu:FixFilterPushdown ``` scala> sql("select * from parquetTable where

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-05 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 ok, I got the reason why; ``` $ ./bin/spark-shell --master=local[1] scala> import scala.util.Random scala> :paste def timer[R](f: => {}): Unit = { val count = 5

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-04 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 ok, give me more time to do so. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands,

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-04 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @maropu, I'd recommend looking at the Parquet files using [`parquet-cli`](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22parquet-cli%22) to see if you're getting reasonable min/max stats for your

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-04 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 @rdblue [I made the input test date sorted

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 aha, I see and it might to be true. I'll check the benchmark again. --- - To unsubscribe, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @maropu, I suspect that the problem is that comparison is different for strings: `"17297598712"` is less than `"5"` with string comparison. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 @rdblue Aha, thanks for the explanation! > I think that you were expecting a string comparison case to have a significant benefit over > non->pushdown. But I would only expect that if

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @gatorsmile, are you happy committing this with the benchmark results? @maropu, thanks for taking the time to add these benchmarks, it is really great to have them so we can monitor the

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 @rdblue ya, my bad for the simple scan case, you're right. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @maropu, looking at the pushdown benchmark, it looks like ORC and Parquet either both benefit or both do not benefit from pushdown. In some cases ORC is much faster, which is due to the fact that

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @maropu, are you sure about the INT and FLOAT columns? I think you might have that assessment backwards. Here's the INT results from the PR gist: ``` SQL Single INT Column Scan:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 Simple scan benchmarks: code: https://github.com/apache/spark/compare/master...maropu:DataSourceReadBenchmark master: https://gist.github.com/maropu/a767d21ed1dd047ec2bdca92915dc5c5 this

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-02 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 I'll put the benchmark results (the benchmark's currently running) today, just a sec. --- - To unsubscribe, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @mswit-databricks, I wouldn't worry about that. We've limited the length of binary and string fields. In the next version of Parquet, we're planning on releasing page indexes, which are

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-01 Thread mswit-databricks
Github user mswit-databricks commented on the issue: https://github.com/apache/spark/pull/21070 @rdblue Do you see any risk of additional overhead coming from the extra stats? For example, if the data contains very long strings, performing comparison on them to generate stats will be

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 TPC-DS is just an end-to-end perf testing. To ensure no perf regressions, we need micro-benchmark tests. That is just a normal procedure. We did the same in the ORC upgrade too. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Is it necessary to block this commit on benchmarks? We know that it doesn't degrade performance from the Parquet benchmarks and TPC-DS run. What do you think needs to be done before this can move

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 @maropu Could you help verify the perf improvement in this string filter pushdown in the microbenchmark? You can also refer to another microbenchmark we did in ORC.

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Yes, it is safe to use push-down for string columns in data written with this version. --- - To unsubscribe, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 @iddoav Thank you for your feedbacks. It sounds like CDH5.X's spark has a bug, right? Our Apache Spark does not have such an issue based on my understanding. @rdblue Does that means we

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-29 Thread iddoav
Github user iddoav commented on the issue: https://github.com/apache/spark/pull/21070 Our R in SimilarWeb have hard times with PARQUET-686, and merging this PR will help us a lot. Note, that unlike Spark 2.1+ readers which have read-time mitigations (SPARK-17213 et al), other systems

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-27 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 ok, I'll work on it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-27 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 @maropu Thank you for works! Could you also run/improve the micro-benchmark?

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 I just run `TPCDSQueryBenchmark` in an ec2 instance (m4.2xlarge). The job to check performance regression is too heavy, so we just check if `tpcds`, `tpch`, and `ssb` can be correctly compiled in

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread henryr
Github user henryr commented on the issue: https://github.com/apache/spark/pull/21070 yes, thanks @maropu! +1 to the idea of making this a jenkins job. --- - To unsubscribe, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Thank you @maropu! What resources does the run require? Is it something we could create a Jenkins job to run? --- - To

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 @gatorsmile @rdblue I checked the numbers and I found no regression at least in TPC-DS: to check the actual numbers, see https://issues.apache.org/jira/browse/SPARK-24070. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 The fix for PARQUET-686 was to suppress min/max stats. It is safe to push filters, but those filters can't be used without the stats. 1.10.0 has the correct stats and can use those filters. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 Regarding [`SPARK-17213`](https://issues.apache.org/jira/browse/SPARK-17213), we already enable the pushdown after we upgrading to 1.8.2. Is my understanding right? ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 There are two main reasons to update. First, the problem behind SPARK-17213 is fixed, hence the new min and max fields. Second, this updates the internal byte array management, which is needed for

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 Parquet is the default file format. Thus, the impact could be huge. We need to be more cautious. Since Parquet 1.10 was officially released this month. Thus, we might need more careful review

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 I'll check performance changes w/ this pr in days and put the result in https://issues.apache.org/jira/browse/SPARK-24070 --- -

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread henryr
Github user henryr commented on the issue: https://github.com/apache/spark/pull/21070 Ok, thanks for the context. I've worked on other projects where it's sometimes ok to take a small risk of a perf hit on trunk as long as the community is committed to addressing the issues before

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 > Based on the previous upgrade (e.g., #13280 (comment)), we hit the performance regressions when we upgrade Parquet and we did the revert at the end. I should point out that the regression

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 cc @liancheng @yhuai @rxin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands,

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070 Based on the previous upgrade (e.g., https://github.com/apache/spark/pull/13280#issuecomment-223022415), we hit the performance regressions when we upgrade Parquet and we did the revert at the

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread henryr
Github user henryr commented on the issue: https://github.com/apache/spark/pull/21070 @cloud-fan since there’s probably quite some time before this lands in a release, what do you think about merging this now if it’s ready, and filing the perf jira as a blocker against 2.4? My

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Okay, I don't have the time to set up and run benchmarks without a stronger case for this causing a regression (given the Parquet testing), but other people should feel free to pick this up. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21070 No. But there are a bunch of open source projects to help you easily run TPCDS on Spark with parquet files, e.g. https://github.com/maropu/spark-tpcds-datagen ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @cloud-fan, is there a Jenkins job to run it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-23 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21070 Can we run a TPCDS and show that this upgrade doesn't cause performance regression in Spark? I can see that this new version doesn't have perf regression at parquet side, just want to be sure the

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-23 Thread henryr
Github user henryr commented on the issue: https://github.com/apache/spark/pull/21070 This looks pretty good to me - are there any committers that can give it a (hopefully) final review? --- - To unsubscribe,

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89644/ Test PASSed. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89644/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2539/

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89644/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89600/ Test FAILed. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89600 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89600/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2504/

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89600 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89600/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89533/ Test FAILed. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89533/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89531/ Test FAILed. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89531 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89531/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread scottcarey
Github user scottcarey commented on the issue: https://github.com/apache/spark/pull/21070 > This is about getting Parquet updated, not about worrying whether users can easily add compression implementations to their classpath. Yes, of course. My hunch is that someone

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 I backported the Hadoop zstd codec to 2.7.3 without much trouble. But either way, I think that's a separate concern. This is about getting Parquet updated, not about worrying whether users can

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread scottcarey
Github user scottcarey commented on the issue: https://github.com/apache/spark/pull/21070 @rdblue The problem with zstd is that it is only in Hadoop 3.0, and dropping _that_ jar in breaks things as it is a major release. Extracting out only the ZStandardCodec from that and

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2455/

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89533/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @scottcarey, Parquet will use the compressors if they are available. You can add them from an external Jar and it will work. LZ4 should also work out of the box because it is included in Hadoop 2.7.

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2453/

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89531/testReport)** for PR 21070 at commit

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread scottcarey
Github user scottcarey commented on the issue: https://github.com/apache/spark/pull/21070 I tested this with the addition of some changes to ParquetOptions.scala, but this alone does not allow for writing or reading zstd compressed parquet files, because it is using reflection to

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-17 Thread henryr
Github user henryr commented on the issue: https://github.com/apache/spark/pull/21070 @scottcarey I agree that's important. Perhaps it could be done as a follow-up PR? --- - To unsubscribe, e-mail:

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-17 Thread scottcarey
Github user scottcarey commented on the issue: https://github.com/apache/spark/pull/21070 This PR should include changes to `ParquetOptions.scala` to expose the `LZ4`, `ZSTD` and `BROTLI` parquet compression codecs, or else spark users won't be able to leverage those parquet 1.10.0

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89415/ Test FAILed. ---

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89415/testReport)** for PR 21070 at commit

  1   2   >