Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
I see. I opened a new PR, #20265.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
Yes. Please work on the perf tests and the benchmark test suite.
I think the priority of this PR is much higher than the test-only PR you
are working on
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18991
conventionally rc1 would fail so we still have time :)
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Oh, do we have time for 2.3?
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18991
I think we still have time for 2.3? I'm not worried about correctness, but
we should show people how much it improves.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Er, it's not record-level filtering. Maybe that's because I explained it too
abstractly
[here](https://github.com/apache/spark/pull/19943#discussion_r160251456).
It's stripe-level. So,
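To make the stripe-level versus record-level distinction concrete, here is a minimal toy sketch in plain Python. It is not ORC's reader or Spark's code: the `Stripe` class and `scan_greater_than` function are hypothetical names, and the only assumption carried over from the thread is that pruning decisions are made per stripe, using per-stripe min/max statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stripe:
    """Toy stand-in for a stripe: a batch of rows plus min/max statistics."""
    rows: List[int]

    @property
    def min_value(self) -> int:
        return min(self.rows)

    @property
    def max_value(self) -> int:
        return max(self.rows)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> List[int]:
    """Return every row of each stripe that might hold a value > threshold.

    A stripe is skipped only when its max proves that no row can match.
    Rows inside a surviving stripe are not filtered individually.
    """
    survivors = [s for s in stripes if s.max_value > threshold]
    return [row for s in survivors for row in s.rows]


stripes = [Stripe([1, 2, 3]), Stripe([4, 9, 5]), Stripe([10, 11, 12])]

# The first stripe is skipped; the second survives whole, so 4 and 5 come back too.
candidates = scan_greater_than(stripes, 8)

# The engine still applies the predicate itself to get the final answer.
result = [row for row in candidates if row > 8]
```

In this toy model the pushed-down predicate only reduces how much data is read; the final row-by-row filtering is a separate step.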
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18991
Yeah, let's add some. I'm curious to see how well PPD works in ORC, since PPD
doesn't work well for Parquet and we disable record-level filtering for Parquet.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
@gatorsmile . I don't have any numbers for PPD=false.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
What is the performance number when turning it on, compared with the off
mode?
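A hedged sketch of how such an on/off comparison could be framed. The harness below is generic Python, not Spark's benchmark suite: `best_time`, `compare_modes`, and `toy_query` are hypothetical names, and the workload is a stand-in. In Spark the switch under discussion is the `spark.sql.orc.filterPushdown` setting, which a real version of `query` would toggle before running the actual query.

```python
import time
from typing import Callable, Dict


def best_time(run: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_modes(query: Callable[[bool], object], repeats: int = 3) -> Dict[str, float]:
    """Time the same query once with pushdown on and once with it off."""
    return {
        "pushdown_on": best_time(lambda: query(True), repeats),
        "pushdown_off": best_time(lambda: query(False), repeats),
    }


def toy_query(pushdown: bool) -> int:
    """Stand-in workload: with pushdown 'on' it scans a tenth of the data."""
    n = 20_000 if pushdown else 200_000
    return sum(i for i in range(n) if i % 7 == 0)


report = compare_modes(toy_query)
for mode, seconds in report.items():
    print(f"{mode}: {seconds:.6f}s")
```

Taking the best of several runs is one common way to reduce noise; the absolute numbers from this toy mean nothing for ORC itself.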
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Yes, @cloud-fan . I added the same test coverage for ORC in Apache Spark.
Sorry, @gatorsmile . I always turned on PPD, so there is no perf number for
PPD=false.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18991
I'd expect ORC to have the same test coverage as Parquet. Is that true?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
Any perf numbers?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
I expect we can port more test cases to Spark, instead of relying on the
external data sources.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Maybe this is out of scope for Apache Spark 2.3.0, because that release is the
debut of Apache ORC 1.4.0. I'll close this PR. Thank you all for your advice on
this PR.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Of course, I want to make progress on any part of ORC!
As you know, I have made many attempts to get a review.
Some PRs got one, but other ORC PRs like #18953 didn't get any feedback
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
If you want to hold, we can wait for the completion of data source API v2.
Otherwise, we can start it now and change it if needed.
Conceptually, the test coverage improvement should be
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Is the plan aligned with the ongoing [Data Source V2](#19136) ?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
To avoid duplicating the efforts, we should have a unified testing
framework for covering the PPD of all the sources. Parquet and ORC should be
part of it. In the future, when we add the other
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
To be clearer, I mean the existing Parquet coverage in the Apache Spark code
base.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Test case coverage parity between Parquet and ORC should be the
criterion for this, right?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
Left a few comments in another PR:
https://github.com/apache/spark/pull/19060#discussion_r137060959.
I think it is the right time to improve the test case coverage before
turning
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Hi, @liancheng , @gatorsmile , @cloud-fan , @rxin , and @omalley .
I think you are the best people to ask about the ORC predicate pushdown issue.
Could you review this PR to turn on ORC PPD by
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Hi, @gatorsmile , @cloud-fan, @rxin , and @omalley .
#19060 shows that the behavior of Apache ORC 1.4.0 predicate push-down is
correct. #19060 will add more test cases for data source
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
orc_create.q and orc_people_create.txt are from Hive.
Writing test cases is pretty time consuming. I still hope we can get the
test cases from the other open source projects. Instead,
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Hi, @gatorsmile .
I opened #19060 as a step in that direction. Could you review it?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
@gatorsmile .
[orc_create.q](https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/orc_create.q)
and
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
+1, I couldn't agree more.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
Yes. Commercial DBMS products have very good, comprehensive test
coverage. So far, that is missing in Apache Spark. Basically, we simply trust the
underlying data sources, which are
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Wow. It's real commercial spec. Thank you! I understand.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
Since I see you are also working on enhancing the ORC reader/writer,
we need to check all the limits (value ranges). I am not sure how well Apache
ORC/Parquet did in their test case design.
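As an illustration of what checking "all the limits (value ranges)" might involve, the sketch below lists boundary values for signed integer columns of different widths. It is only a plain-Python example of boundary-value test design, not taken from any Spark or ORC test suite; `integer_boundaries` and the type-name mapping are hypothetical.

```python
from typing import Dict, List


def integer_boundaries(bits: int) -> List[int]:
    """Return edge-case values for a signed two's-complement integer type."""
    lowest = -(2 ** (bits - 1))
    highest = 2 ** (bits - 1) - 1
    return [lowest, lowest + 1, -1, 0, 1, highest - 1, highest]


# Hypothetical mapping from column type name to bit width.
widths: Dict[str, int] = {"byte": 8, "short": 16, "int": 32, "long": 64}

# One boundary list per type; each value would be written, read back,
# and used as a filter literal in a real test.
boundaries = {name: integer_boundaries(bits) for name, bits in widths.items()}
```

Values at and next to the minimum and maximum are where overflow and off-by-one mistakes in statistics or predicate evaluation would be most likely to show up.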
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Thank you for the comments and directions. Definitely, I'll try!
Since we depend on Apache ORC 1.4.0, I think I can add raw-level test
cases somewhere, for evaluation purposes only.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18991
If ORC incorrectly filters out extra rows, we might get incorrect
results. In addition, we do not know whether the pushdown actually yields a
performance gain. We saw the performance regression
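The correctness concern can be expressed as a simple differential check: a scan with pruning must return exactly what a full scan with the same predicate returns. The sketch below is a toy in plain Python under that assumption. All names are hypothetical, and the "broken" pruner is deliberately wrong to show what such a check would catch.

```python
from typing import Callable, List

Stripe = List[int]


def full_scan(stripes: List[Stripe], pred: Callable[[int], bool]) -> List[int]:
    """Reference answer: evaluate the predicate on every row."""
    return [row for stripe in stripes for row in stripe if pred(row)]


def pruned_scan(
    stripes: List[Stripe],
    pred: Callable[[int], bool],
    keep_stripe: Callable[[Stripe], bool],
) -> List[int]:
    """Skip stripes rejected by `keep_stripe`, then re-apply the predicate."""
    return [row for stripe in stripes if keep_stripe(stripe)
            for row in stripe if pred(row)]


def greater_than_8(row: int) -> bool:
    return row > 8


def sound_pruner(stripe: Stripe) -> bool:
    """Keep any stripe whose max could satisfy row > 8."""
    return max(stripe) > 8


def broken_pruner(stripe: Stripe) -> bool:
    """Deliberately wrong: tests the min, so it drops the stripe [4, 9, 5]."""
    return min(stripe) > 8


stripes = [[1, 2, 3], [4, 9, 5], [10, 11, 12]]
expected = full_scan(stripes, greater_than_8)
sound_ok = pruned_scan(stripes, greater_than_8, sound_pruner) == expected
broken_ok = pruned_scan(stripes, greater_than_8, broken_pruner) == expected
```

Here `sound_ok` is true and `broken_ok` is false: a pruner that drops too much loses the row 9 even though the predicate is re-applied afterwards, which is why over-aggressive pruning cannot be repaired downstream.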
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Hi, @gatorsmile .
Could you review this ORC PPD default configuration? Our data source
doesn't trust any data sources including Parquet/ORC. I think ORC PPD does no
harm to Spark.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18991
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18991
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81114/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18991
**[Test build #81114 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81114/testReport)**
for PR 18991 at commit
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Hi, @cloud-fan, @gatorsmile , @sameeragarwal , @rxin , @mridulm .
Could you review this one-liner PR about ORC PPD configuration?
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18991
**[Test build #81114 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81114/testReport)**
for PR 18991 at commit
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Retest this please.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Hi, @cloud-fan, @gatorsmile , @sameeragarwal , @rxin , @mridulm .
Could you review this ORC predicate pushdown configuration PR when you have
some time?
Thank you in advance!
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18991
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80934/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18991
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18991
**[Test build #80934 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80934/testReport)**
for PR 18991 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18991
**[Test build #80934 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80934/testReport)**
for PR 18991 at commit
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Retest this please.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Hi, @cloud-fan, @gatorsmile , @sameeragarwal , @rxin , @mridulm .
Could you review this?
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18991
Hi, @cloud-fan, @gatorsmile , @sameeragarwal , @rxin .
Could you review this? This will help our ORC transition a lot over the next
3 months.
If you don't want this, you can turn off
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18991
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18991
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80834/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18991
**[Test build #80834 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80834/testReport)**
for PR 18991 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18991
**[Test build #80834 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80834/testReport)**
for PR 18991 at commit