GitHub user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649

Glad that helped, sorry it wasn't clearer. Agreed that writing summary metadata isn't always the best choice. In this patch, file pruning is only performed if the _metadata file exists for the dataset. At work we have it enabled since we have a query-heavy workload where new data lands only occasionally.

On Tue, Sep 27, 2016 at 10:13 AM -0700, "Cheng Lian" <notificati...@github.com> wrote:

@andreweduffy Thanks for the explanations! This makes much more sense to me now.

Although _metadata can be neat for the read path, it's a troublemaker for the write path:

- Writing summary files (either _metadata or _common_metadata) can be quite expensive when writing a large Parquet dataset, since it reads the footers from all files and tries to merge them. This can be especially frustrating when appending a small amount of data to an existing large dataset.

- Parquet doesn't always write the summary files, even if you explicitly set parquet.enable.summary-metadata to true. For example, when two files have different values for a single key in the user-defined key/value metadata section, Parquet simply gives up writing the summary files and deletes the existing ones. This may be quite common in the case of schema evolution. To make matters worse, an outdated _common_metadata might not be deleted properly due to PARQUET-359, which leaves the summary files out of sync.

However, I agree that with an existing, trustworthy _metadata file at hand, this patch is still very useful. I'll take a deeper look at this tomorrow.
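To make the two behaviours discussed above concrete, here is a minimal sketch in plain Python. It is not code from the patch or from Parquet: the function names (can_prune_with_summary, merge_key_value_metadata) and the dict-based representation of footer key/value metadata are assumptions made purely for illustration. The first function mirrors the guard described in the thread (prune only when _metadata exists), and the second mirrors the reported write-path behaviour (give up on the summary when two files disagree on a key).

```python
import tempfile
from pathlib import Path

SUMMARY_FILE = "_metadata"  # summary file name as mentioned in the thread


def can_prune_with_summary(dataset_dir):
    """Hypothetical guard: allow summary-based file pruning only when a
    _metadata file exists in the dataset's root directory."""
    return (Path(dataset_dir) / SUMMARY_FILE).is_file()


def merge_key_value_metadata(footers):
    """Hypothetical model of the merge step: combine the user-defined
    key/value metadata of every file. Returns None when two files hold
    different values for the same key, which stands in for "give up
    writing the summary files"."""
    merged = {}
    for key_values in footers:
        for key, value in key_values.items():
            if key in merged and merged[key] != value:
                return None  # conflicting values: no summary can be written
            merged[key] = value
    return merged


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as dataset_dir:
        print(can_prune_with_summary(dataset_dir))
        (Path(dataset_dir) / SUMMARY_FILE).touch()
        print(can_prune_with_summary(dataset_dir))
    print(merge_key_value_metadata([{"schema.version": "1"}, {"schema.version": "2"}]))
```

Note that the merge has to visit every footer, which is why the cost grows with the size of the dataset even when only a small amount of data is appended.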