Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
Hey - yeah, definitely a real concern, since driver heap has to scale with
the size of the metadata of the table you are going to read in.
We could be creative and add heuristics around readin
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15835
This creates huge problems when the table is big, doesn't it? We just did a
big change to get rid of the per-table file status cache, because its existence
made Spark unstable when dealing with tables (
Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
I've updated the structure of the PR so that the cache is global across
instances of FileFormat, has expiry, and reuses known filters. Here is a new
benchmark to highlight the filter re-use (I fli
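The global cache with expiry described above could be sketched roughly as below. This is a minimal illustration, not the PR's actual code: the names `MetadataCache` and `FooterStub` are hypothetical, and the real change would key on Hadoop paths and store parsed Parquet footers.

```scala
import java.util.concurrent.ConcurrentHashMap

// Illustrative stand-in for parsed file metadata (not a real Parquet class).
final case class FooterStub(path: String, numRowGroups: Int)

// A process-wide cache shared across FileFormat instances, with time-based
// expiry so stale footers are re-read rather than held forever.
object MetadataCache {
  private val ttlMillis = 10 * 60 * 1000L // assumed 10-minute expiry
  private val entries = new ConcurrentHashMap[String, (Long, FooterStub)]()

  def getOrLoad(path: String, load: String => FooterStub,
                now: Long = System.currentTimeMillis()): FooterStub = {
    val cached = entries.get(path)
    if (cached != null && now - cached._1 < ttlMillis) {
      cached._2 // fresh hit: reuse the parsed footer, no file I/O
    } else {
      val loaded = load(path) // miss or expired: re-read the footer
      entries.put(path, (now, loaded))
      loaded
    }
  }
}
```

Because the object is global, repeated scans of the same files across query plans hit the cache instead of re-fetching metadata per scan.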
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/15835
I guess we should ping @liancheng as he was reviewing the previous one.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your pr
Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
@rxin @HyukjinKwon ready for more review on my end.
Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
I've pushed up the ability to configure whether this feature is enabled.
Here is a benchmark when writing out 200 files with this code:
```
withSQLConf(ParquetOutputFormat.ENABLE_JOB_S
```
Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
Cool - I've added the caching, fixed style issues, and added pruning to the
bucketed reads.
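The pruning for bucketed reads mentioned above can be illustrated with a simplified sketch: with an equality filter on the bucketing column, only one bucket's files can match, so the rest are dropped before scheduling. The hash below is a placeholder for illustration only; Spark's actual bucket id uses a Murmur3-based hash.

```scala
// Stand-in bucket id function (Spark uses a Murmur3-based hash in practice).
def bucketId(key: Int, numBuckets: Int): Int =
  ((key.hashCode % numBuckets) + numBuckets) % numBuckets

// Given (path, bucket id parsed from the file name) pairs and an equality
// filter value, keep only the files in the matching bucket.
def pruneBucketedFiles(files: Seq[(String, Int)],
                       filterKey: Int, numBuckets: Int): Seq[String] = {
  val wanted = bucketId(filterKey, numBuckets)
  files.collect { case (path, b) if b == wanted => path }
}
```

With 4 buckets, a filter like `key = 5` touches only the files whose name encodes bucket `bucketId(5, 4)`, skipping the other three quarters of the table.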
Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
Ah, awesome, thanks - my IDE settings must not be catching the style checks
properly - I'll fix it and make sure this doesn't happen in the future.
Also just took a look at the bucketing. I've implemente
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/15835
Ah, I missed the last comment from the old PR. Okay, we can shape this up
more nicely. BTW, Spark collects small partitions into each task, so I guess
this would not introduce a lot of tasks always b
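The point above about Spark collecting small partitions into each task can be sketched as a greedy bin-packing of file sizes, capped at a maximum number of bytes per partition. This is a deliberate simplification of Spark's `FilePartition` packing; the real logic also accounts for an open cost per file and different ordering.

```scala
// Greedily pack file sizes (largest first) into partitions of at most
// maxSplitBytes each, so many small files become a few tasks.
def packFiles(sizes: Seq[Long], maxSplitBytes: Long): Seq[Seq[Long]] = {
  val partitions = scala.collection.mutable.ArrayBuffer[Seq[Long]]()
  var current = scala.collection.mutable.ArrayBuffer[Long]()
  var bytes = 0L
  for (s <- sizes.sorted(Ordering[Long].reverse)) {
    if (bytes + s > maxSplitBytes && current.nonEmpty) {
      partitions += current.toSeq // current partition is full; start a new one
      current = scala.collection.mutable.ArrayBuffer[Long]()
      bytes = 0L
    }
    current += s
    bytes += s
  }
  if (current.nonEmpty) partitions += current.toSeq
  partitions.toSeq
}
```

So pruning splits driver-side does not translate one-to-one into more (or tinier) tasks, since the surviving splits are still packed together.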
Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
Hey @HyukjinKwon - appreciate the feedback!
Re: file touching - if I add the cache to the `_metadata` file, then this
PR will end up touching at most one file per rootPath on the driver side (genera
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/15835
I think we should also cc @liancheng, who has good insight into this area.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/15835
Hi @pwoody. So, if I understood this correctly, the original PR only
filters out the files to touch ahead of time, but this one also proposes
filtering splits via offsets from Parquet's metadata in drive
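The driver-side split filtering being discussed could look roughly like the sketch below: given per-row-group byte offsets and min/max statistics from the Parquet footer, keep only the byte ranges whose row groups can satisfy the predicate. `RowGroupMeta` is an illustrative type, not the actual Parquet API, and the predicate is reduced to a single equality on an `Int` column for brevity.

```scala
// Illustrative stand-in for row-group metadata read from a Parquet footer:
// byte range of the row group plus min/max stats for one column.
final case class RowGroupMeta(startOffset: Long, length: Long, min: Int, max: Int)

// Keep only the (offset, length) splits whose row-group stats could
// contain the filtered value; the rest never become scan tasks.
def filterSplits(groups: Seq[RowGroupMeta], value: Int): Seq[(Long, Long)] =
  groups.collect {
    case g if g.min <= value && value <= g.max => (g.startOffset, g.length)
  }
```

The win is that pruning happens before task scheduling: executors are never asked to open row groups whose statistics already rule them out.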
Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
There is a performance issue here in that we re-fetch the metadata file for
each file. My understanding is that FileFormat is meant to be stateless, but if
we can add in a cache then that would be ve
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15835
Can one of the admins verify this patch?
Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
@andreweduffy @robert3005 @HyukjinKwon