Michael Culshaw-Maurer created ARROW-16598:
----------------------------------------------
             Summary: [R] Sorting data.frame prior to writing Parquet affects file size
                 Key: ARROW-16598
                 URL: https://issues.apache.org/jira/browse/ARROW-16598
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
         Environment: MacBook Pro (non-M1), other info in R file
            Reporter: Michael Culshaw-Maurer
         Attachments: arrow_parquet_bug.R

When using the arrow R package, sorting a data.frame prior to using write_parquet() results in different file sizes, depending on how the data.frame is sorted. I have attached a reproducible example showing how a few different sorting methods can lead to 2-3 fold changes in .parquet file size.

It may be that I don't know enough about Parquet internals, but at the very least, I think this behavior should be documented on the arrow R package site. Most R users tend to approach sorting as a convenience and don't expect it to lead to performance changes when writing to a file.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
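
A possible explanation for the size difference, offered as a sketch and not as a description of what the arrow writer does internally: Parquet stores data column by column and can apply dictionary/run-length encoding and compression, all of which tend to work better when equal values sit next to each other. The snippet below is not the attached arrow_parquet_bug.R and does not write Parquet at all. It uses zlib from the Python standard library as a stand-in for the compression step, and the column contents (100,000 rows, 10 distinct category codes) are invented for illustration.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size, in bytes, of a column of small integers.

    Each value must fit in one byte (0..255). zlib is only a stand-in here;
    it is not the codec or encoding that a Parquet writer actually uses.
    """
    raw = bytes(values)
    return len(zlib.compress(raw))


random.seed(42)  # fixed seed so repeated runs see the same data

# Hypothetical column: 100,000 rows drawn from 10 distinct category codes.
column = [random.randrange(10) for _ in range(100_000)]

# Same values, two row orders: as generated (effectively shuffled) and sorted.
shuffled_size = compressed_size(column)
sorted_size = compressed_size(sorted(column))

print("unsorted:", shuffled_size, "bytes")
print("sorted:  ", sorted_size, "bytes")
```

Both inputs hold exactly the same values, so any difference in the printed sizes comes from row order alone. How large the effect is in a real .parquet file would depend on the data and on the writer's encoding and compression settings.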