r2evans opened a new issue, #39671: URL: https://github.com/apache/arrow/issues/39671
### Describe the enhancement requested

I recognize that appending to parquet files is not on the roadmap. Is it possible to do an efficient concatenation of two parquet files, writing the output to a new parquet file? Brute-force methods exist (read all of "A", read all of "B", and row-concatenate them however the language allows), but they require loading all of the data into memory. (I'm specifically targeting R, where it is perhaps more difficult to use the lower-level API.)

Part of the suggested alternative to the "append" request (such as https://github.com/apache/arrow/issues/32708) is https://github.com/apache/arrow/issues/32708#issuecomment-1378120110:

> the pattern that Arrow enables is writing multiple files and then using open_dataset() to query them lazily

This works fine in concept, but as the number of files grows there is eventually a performance tradeoff. The penalty can be mitigated (e.g., `unify_schemas=FALSE`), yet at some point there may be a desire to reduce the number of files by combining them. The brute-force read of both files _works_, but it would be very nice to have a simple function that takes one or more input filenames and one (previously non-existent) output filename and concatenates the data as efficiently as possible (handling metadata, of course).

I'm guessing there would need to be assumptions/requirements about the schemas of the input files. A first guess would require them to be "effectively identical" (where "effectively" might allow differences such as `numeric`/`integer` or similar), but I'd still be very happy with "perfectly identical".

I'm specifically targeting R in my usage, though I imagine other languages could also take advantage of this. Thanks!

### Component(s)

Parquet, R

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
