Github user andreweduffy commented on the issue:

    https://github.com/apache/spark/pull/14649
  
    Glad that helped, and sorry it wasn't clearer. Agreed that writing 
summary metadata isn't always the best choice. This patch only performs 
file pruning if the _metadata file exists for the dataset. At work we keep 
summary metadata enabled, since we have a query-heavy workload where new 
data lands only occasionally. 
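A minimal sketch of the guard described above, not the patch's actual code: pruning is attempted only when a `_metadata` file is present at the dataset root, and otherwise every file is scanned. The function names and the `prune` callback are hypothetical, and a local filesystem path is assumed for simplicity.

```python
from pathlib import Path
from typing import Callable, List


def can_prune_with_summary(dataset_dir: str) -> bool:
    """Return True only when a _metadata summary file sits at the dataset root."""
    return (Path(dataset_dir) / "_metadata").is_file()


def files_to_scan(
    dataset_dir: str,
    all_files: List[str],
    prune: Callable[[List[str]], List[str]],
) -> List[str]:
    """Apply the (hypothetical) prune step only if the summary file exists."""
    if not can_prune_with_summary(dataset_dir):
        # No summary file: fall back to scanning every file.
        return list(all_files)
    return prune(all_files)
```

With this shape, a dataset written without summary files behaves exactly as it did before: nothing is pruned.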
    
    On Tue, Sep 27, 2016 at 10:13 AM -0700, "Cheng Lian" 
<notificati...@github.com> wrote:
    
    @andreweduffy Thanks for the explanations! This makes much 
more sense to me now. 
    
    Although _metadata can be neat for the read path, it's a troublemaker for 
the write path:
    
    - Writing summary files (either _metadata or _common_metadata) can be quite 
expensive when writing a large Parquet dataset, since it reads the footers of 
all files and tries to merge them. This can be especially frustrating when 
appending a small amount of data to an existing large dataset.
    - Parquet doesn't always write the summary files, even if you explicitly set 
parquet.enable.summary-metadata to true. For example, when two files have 
different values for the same key in the user-defined key/value metadata 
section, Parquet simply gives up writing the summary files and deletes the 
existing ones. This may be quite common in the case of schema evolution. Worse, 
an outdated _common_metadata might not be deleted properly due to 
PARQUET-359, which leaves the summary files out of sync.
    
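The conflict case in the second point can be illustrated with a small sketch. This is not Parquet's implementation; it only mimics the behaviour described above, where one key with two different values across footers means no summary can be produced. The function name and the use of plain dictionaries to stand in for footer key/value metadata are assumptions.

```python
from typing import Dict, Iterable, Optional


def merge_key_value_metadata(
    footers: Iterable[Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Merge per-file key/value metadata; return None on any conflicting key."""
    merged: Dict[str, str] = {}
    for key_values in footers:  # every footer is visited, hence the write cost
        for key, value in key_values.items():
            if key in merged and merged[key] != value:
                # Same key, different value: give up instead of guessing.
                return None
            merged[key] = value
    return merged
```

A caller that receives `None` would skip writing the summary file, which is why a dataset can end up without `_metadata` even when summary writing is enabled.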
    However, I agree that with an existing, trustworthy _metadata file at 
hand, this patch is very useful. I'll take a deeper look at this tomorrow.
    
    
    