[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-17 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 Hey - yeah, definitely a real concern, as it needs driver heap to scale with the size of the metadata of the table you are going to read in. We could be creative and add heuristics around reading…
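A minimal illustration of the kind of heuristic being floated here, purely as a sketch and not anything from the PR: only collect Parquet footers on the driver when a rough estimate of their total size fits within a configured budget. The object name, the per-footer estimate, and the budget parameter are all hypothetical.

```scala
// Hypothetical heuristic sketch (not from the PR): decide on the driver
// whether loading per-file Parquet footers is affordable for this table.
import org.apache.hadoop.fs.FileStatus

object FooterLoadHeuristic {
  // Rough per-footer memory estimate, chosen only for illustration.
  private val EstimatedBytesPerFooter: Long = 64 * 1024

  def shouldLoadFooters(files: Seq[FileStatus], maxDriverBytes: Long): Boolean = {
    val estimated = files.size.toLong * EstimatedBytesPerFooter
    estimated <= maxDriverBytes
  }
}
```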

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-17 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15835 This creates huge problems when the table is big, doesn't it? We just did a big change to get rid of the per-table file status cache, because its existence made Spark unstable when dealing with tables…

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-17 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 I've updated the structure of the PR so that caching is global across instances of FileFormat, has expiry, and reuses known filters. Here is a new benchmark to highlight the filter re-use…
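A minimal sketch of the caching scheme described above (a process-wide cache with expiry, shared across FileFormat instances), assuming Guava's cache and a key of file path plus modification time; this is illustrative and not the PR's actual code.

```scala
// Sketch only: a process-wide footer cache with expiry, shared across
// FileFormat instances. Keyed on path + modification time so a rewritten
// file never reuses stale metadata.
import java.util.concurrent.{Callable, TimeUnit}
import com.google.common.cache.{Cache, CacheBuilder}
import org.apache.parquet.hadoop.metadata.ParquetMetadata

object GlobalFooterCache {
  private case class Key(path: String, modificationTime: Long)

  private val cache: Cache[Key, ParquetMetadata] = CacheBuilder.newBuilder()
    .maximumSize(10000)
    .expireAfterAccess(15, TimeUnit.MINUTES)
    .build[Key, ParquetMetadata]()

  def getOrLoad(path: String, modTime: Long)(load: => ParquetMetadata): ParquetMetadata =
    cache.get(Key(path, modTime), new Callable[ParquetMetadata] {
      override def call(): ParquetMetadata = load
    })
}
```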

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15835 I guess we should ping @liancheng as he was reviewing the previous one.

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-14 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 @rxin @HyukjinKwon ready for more review on my end.

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-11 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 I've pushed up the ability to configure whether this feature is enabled as well. Here is a benchmark when writing out 200 files with this code: `withSQLConf(ParquetOutputFormat.ENABLE_JOB_S…`
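The snippet above is cut off in the archive, so as a purely illustrative stand-in, here is how a feature like this could be toggled through a runtime SQL conf; the key name `spark.sql.parquet.filterPartitionsByMetadata` is a placeholder and not the flag the PR actually introduces.

```scala
// Illustration only: gate a feature behind a SQL conf for a single query,
// restoring the previous value afterwards (roughly what the withSQLConf
// test helper does). The conf key below is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

val key = "spark.sql.parquet.filterPartitionsByMetadata" // placeholder key
val previous = spark.conf.getOption(key)
spark.conf.set(key, "true")
try {
  spark.read.parquet("/tmp/table").where("id > 100").count()
} finally {
  previous match {
    case Some(v) => spark.conf.set(key, v)
    case None    => spark.conf.unset(key)
  }
}
```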

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-11 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 Cool - I've added the caching, fixed style issues, and added pruning to the bucketed reads.

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-10 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 Ah, awesome, thanks - my IDE settings must not be properly catching the style; I will fix that and make sure it doesn't happen in the future. Also just took a look at the bucketing. I've implemented…

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15835 Ah, I missed the last comment from the old PR. Okay, we can shape this up more nicely. BTW, Spark collects small partitions for each task, so I guess this would not always introduce a lot of tasks…
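For context, a simplified sketch of the packing behaviour referred to above: small splits are grouped into one partition until a byte budget is reached, so pruning into more (smaller) splits need not blow up the task count. The names and the largest-first ordering here are illustrative, not Spark's internal API.

```scala
// Simplified sketch of bin-packing file splits into partitions, illustrating
// why a larger number of small splits does not necessarily mean more tasks.
case class SplitSketch(path: String, start: Long, length: Long)

def packSplits(splits: Seq[SplitSketch], maxSplitBytes: Long): Seq[Seq[SplitSketch]] = {
  val partitions = Seq.newBuilder[Seq[SplitSketch]]
  var current = Vector.empty[SplitSketch]
  var currentSize = 0L

  // Largest-first keeps big splits from crowding out the packing of small ones.
  splits.sortBy(-_.length).foreach { split =>
    if (current.nonEmpty && currentSize + split.length > maxSplitBytes) {
      partitions += current
      current = Vector.empty
      currentSize = 0L
    }
    current = current :+ split
    currentSize += split.length
  }
  if (current.nonEmpty) partitions += current
  partitions.result()
}
```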

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-10 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 Hey @HyukjinKwon - appreciate the feedback! Re: file touching - if I add the cache to the `_metadata` file, then this PR will end up touching at most one file per rootPath on the driver side…
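To make the `_metadata` idea concrete, here is a hedged sketch (not the PR's code) of reading the Parquet summary file once per root path on the driver, using the parquet-hadoop footer API that was current in this era:

```scala
// Sketch: read the summary _metadata file once per root path on the driver,
// instead of opening a footer for every data file.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata

def readSummaryMetadata(rootPath: Path, conf: Configuration): ParquetMetadata = {
  val summaryPath = new Path(rootPath, "_metadata")
  // NO_FILTER keeps all row-group metadata, including column statistics.
  ParquetFileReader.readFooter(conf, summaryPath, ParquetMetadataConverter.NO_FILTER)
}
```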

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15835 I think we should cc @liancheng as well, who is insightful in this area.

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15835 Hi @pwoody. So, if I understood this correctly, the original PR only filters out the files to touch ahead of time, but this one also proposes to filter splits via offsets from Parquet's metadata on the driver…
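As a hedged illustration of split-level filtering from footer metadata (the names and the predicate representation are assumptions, not the PR's API), one could keep only the row groups whose statistics can possibly match and turn each survivor's offset and size into a split:

```scala
// Sketch: prune row groups on the driver using footer metadata, then emit
// one (start, length) split per surviving row group.
import scala.collection.JavaConverters._
import org.apache.parquet.hadoop.metadata.{BlockMetaData, ParquetMetadata}

case class RowGroupSplit(path: String, start: Long, length: Long)

def prunedSplits(
    path: String,
    footer: ParquetMetadata,
    rowGroupMightMatch: BlockMetaData => Boolean): Seq[RowGroupSplit] = {
  footer.getBlocks.asScala
    .filter(rowGroupMightMatch) // e.g. check min/max column statistics
    .map(block => RowGroupSplit(path, block.getStartingPos, block.getCompressedSize))
    .toSeq
}
```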

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 There is a performance issue here in that we re-fetch the metadata file for each file. My understanding is that FileFormat is meant to be stateless, but if we can add in a cache then that would be…

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15835 Can one of the admins verify this patch?

[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread pwoody
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 @andreweduffy @robert3005 @HyukjinKwon