Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21070
thanks, merging to master!
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90375/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #90375 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90375/testReport)**
for PR 21070 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3045/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #90375 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90375/testReport)**
for PR 21070 at commit
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21070
(While I am here) seems fine to me otherwise.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21070
LGTM except 2 comments
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90327/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #90327 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90327/testReport)**
for PR 21070 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3007/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #90327 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90327/testReport)**
for PR 21070 at commit
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
@gatorsmile ok, I'll do it later.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
> sql("select * from parquetTable where value = '0'")
> sql("select * from parquetTable where value = 0")
Semantically, they are different. The results should also be different. The
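The difference can be sketched in plain Scala (not Spark's actual cast rules; the sample values and the `toInt`-based helper below are illustrative assumptions, not from the PR). Comparing against the string literal `'0'` is exact string equality, while comparing against the numeric literal `0` implies casting the string column to a number first:

```scala
// Plain-Scala sketch of why the two predicates differ for a string column.
val values = Seq("0", "00", "+0", "1")

// where value = '0'  -> exact string equality
val stringMatches = values.filter(_ == "0")

// where value = 0    -> string cast to a number, then numeric equality
val numericMatches =
  values.filter(v => scala.util.Try(v.toInt).toOption.contains(0))

println(stringMatches)  // List(0)
println(numericMatches) // List(0, 00, +0)
```

Rows like `"00"` or `"+0"` satisfy the numeric predicate but not the string one, which is why the two queries must not be treated as equivalent.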
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21070
For documentation, I think it's fine; it's trackable history in JIRA,
although it's a bit hard to search. Also, there might be some diff we should
check, since I haven't checked the details
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
@HyukjinKwon Thanks for the explanation! I feel it's ok to keep this as it
is for now. BTW, shouldn't we document the story somewhere?
---
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21070
@liancheng should also remember the whole story, since I talked with him
at that time, IIRC.
---
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21070
Yup, https://github.com/apache/spark/pull/21070#issuecomment-386793202,
just for clarification: the given attribute and literal are castable but they
are not being cast as such, right?
I
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
ok, I checked that string pushdown worked well:
https://gist.github.com/maropu/e7cbfb8dc73a91806088e159f64e113f
---
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
I looked over the code and I found this issue could be fixed by:
https://github.com/apache/spark/compare/master...maropu:FixFilterPushdown
```
scala> sql("select * from parquetTable where
```
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
ok, I got the reason why;
```
$ ./bin/spark-shell --master=local[1]
scala> import scala.util.Random
scala> :paste
def timer[R](f: => R): Unit = {
  val count = 5
```
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
ok, give me more time to do so.
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu, I'd recommend looking at the Parquet files using
[`parquet-cli`](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22parquet-cli%22)
to see if you're getting reasonable min/max stats for your
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
@rdblue [I made the input test data sorted
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
aha, I see, and it might be true. I'll check the benchmark again.
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu, I suspect that the problem is that comparison is different for
strings: `"17297598712"` is less than `"5"` with string comparison.
---
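The point is easy to verify in any Scala REPL; this snippet (mine, not from the thread) contrasts the two orderings:

```scala
// Lexicographic (string) comparison looks at characters left to right,
// so "17297598712" < "5" because '1' < '5'. Numeric comparison disagrees.
val asStrings = "17297598712" < "5"   // true
val asNumbers = 17297598712L < 5L     // false

println((asStrings, asNumbers))
```

This is why min/max stats collected over string-typed numbers prune row groups differently than one might intuitively expect.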
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
@rdblue Aha, thanks for the explanation!
> I think that you were expecting a string comparison case to have a
significant benefit over
> non->pushdown. But I would only expect that if
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@gatorsmile, are you happy committing this with the benchmark results?
@maropu, thanks for taking the time to add these benchmarks, it is really
great to have them so we can monitor the
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
@rdblue ya, my bad for the simple scan case, you're right.
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu, looking at the pushdown benchmark, it looks like ORC and Parquet
either both benefit or both do not benefit from pushdown. In some cases ORC is
much faster, which is due to the fact that
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu, are you sure about the INT and FLOAT columns? I think you might
have that assessment backwards. Here's the INT results from the PR gist:
```
SQL Single INT Column Scan:
```
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
Simple scan benchmarks:
code:
https://github.com/apache/spark/compare/master...maropu:DataSourceReadBenchmark
master: https://gist.github.com/maropu/a767d21ed1dd047ec2bdca92915dc5c5
this
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
I'll put the benchmark results (the benchmark's currently running) today,
just a sec.
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@mswit-databricks, I wouldn't worry about that. We've limited the length of
binary and string fields.
In the next version of Parquet, we're planning on releasing page indexes,
which are
Github user mswit-databricks commented on the issue:
https://github.com/apache/spark/pull/21070
@rdblue Do you see any risk of additional overhead coming from the extra
stats? For example, if the data contains very long strings, performing
comparison on them to generate stats will be
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
TPC-DS is just end-to-end perf testing. To ensure there are no perf
regressions, we need micro-benchmark tests. That is just the normal procedure;
we did the same in the ORC upgrade too.
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Is it necessary to block this commit on benchmarks? We know that it doesn't
degrade performance from the Parquet benchmarks and TPC-DS run. What do you
think needs to be done before this can move
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu Could you help verify the perf improvement in this string filter
pushdown in the microbenchmark? You can also refer to another microbenchmark we
did in ORC.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Yes, it is safe to use push-down for string columns in data written with
this version.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
@iddoav Thank you for your feedback. It sounds like CDH 5.x's Spark has a
bug, right? Our Apache Spark does not have such an issue, based on my
understanding.
@rdblue Does that mean we
Github user iddoav commented on the issue:
https://github.com/apache/spark/pull/21070
Our R&D at SimilarWeb has a hard time with PARQUET-686, and merging this PR
will help us a lot. Note that, unlike Spark 2.1+ readers, which have read-time
mitigations (SPARK-17213 et al.), other systems
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
ok, I'll work on it.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu Thank you for your work! Could you also run/improve the
micro-benchmark?
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
I just ran `TPCDSQueryBenchmark` on an EC2 instance (m4.2xlarge). The job
to check performance regression is too heavy, so we just check whether `tpcds`,
`tpch`, and `ssb` can be correctly compiled in
Github user henryr commented on the issue:
https://github.com/apache/spark/pull/21070
yes, thanks @maropu! +1 to the idea of making this a jenkins job.
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Thank you @maropu!
What resources does the run require? Is it something we could create a
Jenkins job to run?
---
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
@gatorsmile @rdblue I checked the numbers and I found no regression at
least in TPC-DS: to check the actual numbers, see
https://issues.apache.org/jira/browse/SPARK-24070.
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
The fix for PARQUET-686 was to suppress min/max stats. It is safe to push
filters, but those filters can't be used without the stats. 1.10.0 has the
correct stats and can use those filters.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
Regarding
[`SPARK-17213`](https://issues.apache.org/jira/browse/SPARK-17213), we already
enabled the pushdown after upgrading to 1.8.2. Is my understanding right?
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
There are two main reasons to update. First, the problem behind SPARK-17213
is fixed, hence the new min and max fields. Second, this updates the internal
byte array management, which is needed for
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
Parquet is the default file format, so the impact could be huge and we need
to be more cautious. Since Parquet 1.10 was officially released just this
month, we might need a more careful review
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21070
I'll check performance changes with this PR in the coming days and put the
results in https://issues.apache.org/jira/browse/SPARK-24070
---
Github user henryr commented on the issue:
https://github.com/apache/spark/pull/21070
Ok, thanks for the context. I've worked on other projects where it's
sometimes ok to take a small risk of a perf hit on trunk as long as the
community is committed to addressing the issues before
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
> Based on the previous upgrade (e.g., #13280 (comment)), we hit
performance regressions when we upgraded Parquet and we did the revert at the
end.
I should point out that the regression
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
cc @liancheng @yhuai @rxin
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21070
Based on the previous upgrade (e.g.,
https://github.com/apache/spark/pull/13280#issuecomment-223022415), we hit
performance regressions when we upgraded Parquet and we did the revert at the
Github user henryr commented on the issue:
https://github.com/apache/spark/pull/21070
@cloud-fan since there's probably quite some time before this lands in a
release, what do you think about merging this now if it's ready, and filing
the perf JIRA as a blocker against 2.4? My
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Okay, I don't have the time to set up and run benchmarks without a stronger
case for this causing a regression (given the Parquet testing), but other
people should feel free to pick this up.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21070
No. But there are a bunch of open-source projects that help you easily run
TPC-DS on Spark with Parquet files, e.g.
https://github.com/maropu/spark-tpcds-datagen
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@cloud-fan, is there a Jenkins job to run it?
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21070
Can we run TPC-DS and show that this upgrade doesn't cause a performance
regression in Spark? I can see that this new version doesn't have a perf
regression on the Parquet side; I just want to be sure the
Github user henryr commented on the issue:
https://github.com/apache/spark/pull/21070
This looks pretty good to me - are there any committers that can give it a
(hopefully) final review?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89644/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89644 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89644/testReport)**
for PR 21070 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2539/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89644 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89644/testReport)**
for PR 21070 at commit
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Retest this please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89600/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89600 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89600/testReport)**
for PR 21070 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2504/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89600 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89600/testReport)**
for PR 21070 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89533/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89533 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89533/testReport)**
for PR 21070 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89531/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89531 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89531/testReport)**
for PR 21070 at commit
Github user scottcarey commented on the issue:
https://github.com/apache/spark/pull/21070
> This is about getting Parquet updated, not about worrying whether users
can easily add compression implementations to their classpath.
Yes, of course.
My hunch is that someone
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
I backported the Hadoop zstd codec to 2.7.3 without much trouble. But
either way, I think that's a separate concern. This is about getting Parquet
updated, not about worrying whether users can
Github user scottcarey commented on the issue:
https://github.com/apache/spark/pull/21070
@rdblue
The problem with zstd is that it is only in Hadoop 3.0, and dropping _that_
jar in breaks things as it is a major release. Extracting out only the
ZStandardCodec from that and
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2455/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89533 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89533/testReport)**
for PR 21070 at commit
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@scottcarey, Parquet will use the compressors if they are available. You
can add them from an external Jar and it will work. LZ4 should also work out of
the box because it is included in Hadoop 2.7.
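As a sketch of what that looks like from the Spark side (this assumes a spark-shell session where `spark` is predefined; the output path is an illustrative placeholder, and which codecs actually work depends on the classes available on the classpath, as noted above):

```scala
// Hedged sketch: choosing a Parquet compression codec at write time.
// "snappy" is Spark's default; others load only if their codec classes
// are on the classpath.
spark.range(100).toDF("id")
  .write
  .option("compression", "gzip") // or "snappy", "lz4", ... when available
  .parquet("/tmp/compression-demo")
```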
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2453/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89531 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89531/testReport)**
for PR 21070 at commit
Github user scottcarey commented on the issue:
https://github.com/apache/spark/pull/21070
I tested this with the addition of some changes to ParquetOptions.scala,
but this alone does not allow for writing or reading zstd compressed parquet
files, because it is using reflection to
Github user henryr commented on the issue:
https://github.com/apache/spark/pull/21070
@scottcarey I agree that's important. Perhaps it could be done as a
follow-up PR?
---
Github user scottcarey commented on the issue:
https://github.com/apache/spark/pull/21070
This PR should include changes to `ParquetOptions.scala` to expose the
`LZ4`, `ZSTD` and `BROTLI` Parquet compression codecs, or else Spark users
won't be able to leverage those Parquet 1.10.0
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21070
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89415/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21070
**[Test build #89415 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89415/testReport)**
for PR 21070 at commit