Michael Culshaw-Maurer created ARROW-16598:
----------------------------------------------

             Summary: [R] Sorting data.frame prior to writing Parquet affects file size
                 Key: ARROW-16598
                 URL: https://issues.apache.org/jira/browse/ARROW-16598
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
         Environment: MacBook Pro (non-M1), other info in R file
            Reporter: Michael Culshaw-Maurer
         Attachments: arrow_parquet_bug.R

When using the arrow R package, sorting a data.frame prior to using 
write_parquet() results in different file sizes, depending on how the 
data.frame is sorted. I have attached a reproducible example showing how a few 
different sort orders can lead to 2- to 3-fold differences in .parquet file size.
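
For readers without the attachment, here is a minimal sketch of the kind of 
comparison described above. It is not the attached arrow_parquet_bug.R: the 
data, the column names (group, value) and the helpers sort_df() and 
parquet_size() are made up for illustration. It assumes only that the arrow 
package is installed and uses write_parquet() with its default settings. 
Parquet applies dictionary and run-length encoding plus compression per 
column, so rows ordered into long runs of repeated values would typically be 
expected to produce a smaller file.

```r
# Hypothetical reproduction sketch; not the attached arrow_parquet_bug.R.
library(arrow)

set.seed(42)
n <- 100000

# Made-up data: a low-cardinality key column plus a coarse numeric column.
df <- data.frame(
  group = sample(letters[1:10], n, replace = TRUE),
  value = round(runif(n), 2)
)

# Reorder the rows and reset row names, so that permuted row names are not a
# confounding difference between the data.frames being compared.
sort_df <- function(data, ...) {
  out <- data[order(...), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Write a data.frame to a temporary Parquet file and return its size in bytes.
parquet_size <- function(data) {
  path <- tempfile(fileext = ".parquet")
  on.exit(unlink(path))
  write_parquet(data, path)
  file.size(path)
}

sizes <- c(
  unsorted = parquet_size(df),
  by_group = parquet_size(sort_df(df, df$group)),
  by_value = parquet_size(sort_df(df, df$value)),
  by_both  = parquet_size(sort_df(df, df$group, df$value))
)

# Sizes in bytes, then each size relative to the smallest file.
print(sizes)
print(round(sizes / min(sizes), 2))
```

The exact ratios depend on the data, the arrow version and the compression 
codec, so the numbers printed here should not be read as matching the 2- to 
3-fold figure from the attached example.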

It may be that I don't know enough about Parquet internals, but at the very 
least, I think this behavior should be documented on the arrow R package site. 
Most R users tend to approach sorting as a convenience and don't expect it to 
lead to performance changes when writing to a file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
