Thanks to both of you, this is helpful.
On Wed, May 26, 2021 at 6:07 PM, Weston Pace <[email protected]> wrote:

> Elad's advice is very helpful. This is not a problem that Arrow solves today (to the best of my knowledge). It is a topic that comes up periodically [1][2][3]. If a crash happens while your Parquet stream writer is open, the most likely outcome is that you will be missing the footer (this gets written on close) and be unable to read the file (although it could presumably be recovered). The Parquet format may be able to support an append mode, but readers don't typically support it.
>
> I believe a common approach to this problem is to dump out lots of small files as the data arrives and then periodically batch them together. Kafka is a great way to do this, but it could be done with a single process as well. If you go very far down this path you will likely run into concerns like durability and schema evolution, so I don't mean to imply that it is trivial :)
>
> [1] https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file
> [2] https://issues.apache.org/jira/browse/PARQUET-1154
> [3] https://lists.apache.org/thread.html/r7efad314abec0219016886eaddc7ba79a451087b6324531bdeede1af%40%3Cdev.arrow.apache.org%3E
>
> On Wed, May 26, 2021 at 7:39 AM Elad Rosenheim <[email protected]> wrote:
>
> > Hi,
> >
> > While I'm not using the C++ version of Arrow, the issue you're talking about is a very common concern.
> >
> > There are a few points to discuss here:
> >
> > 1. Generally, Parquet files cannot be appended to. You could of course load the file into memory, add more information and re-save it, but that's not really what you're looking for... Tools like `parquet-tools` can concatenate files by creating a new file with two (or more) row groups, but that's not a very good solution either. Having multiple row groups in a single file is sometimes desirable, but in this case it would most probably just produce a less well-compressed file.
> >
> > 2. The other concern is reliability - a process that holds a big batch in memory and spills it to disk every X minutes/rows/bytes is bound to have issues when things crash, get stuck, or need to go down for maintenance. You probably want guarantees as close to "exactly once" as possible (the holy grail...). One common solution is to write to Kafka and have a consumer that periodically reads a batch of messages and stores them to a file. This is nowadays provided by Kafka Connect <https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/>, thankfully. Anyway, the "exactly once" part stops at this point; for anything that happens downstream you'd need to handle it yourself.
> >
> > 3. Then you're back to the question of many, many files per day... There is no magical solution to this. You may need a scheduled task that reads files every X hours (or every day?) and re-partitions the data in the way that makes the most sense for later processing/querying - perhaps by date, perhaps by customer, both, etc. There are various tools that help with this.
> >
> > Elad
> >
> > On Wed, May 26, 2021 at 7:32 PM Xander Dunn <[email protected]> wrote:
> >
> > > I have a very long-running (months) program that is streaming in data continually, processing it, and saving it to file using Arrow. My current solution is to buffer several million rows and write them to a new .parquet file each time. This works, but produces 1000+ files every day.
> > >
> > > If I could, I would just append to the same file for each day. I see an `arrow::fs::FileSystem::OpenAppendStream` - what file formats does this work with? Can I append to .parquet or .feather files? Googling seems to indicate these formats can't be appended to.
> > >
> > > Using the `parquet::StreamWriter <https://arrow.apache.org/docs/cpp/parquet.html?highlight=writetable#writetable>`, could I continually stream rows to a single file throughout the day? What happens if the program is unexpectedly terminated? Would everything in the currently open monolithic file be lost? I would be streaming rows to a single .parquet file for 24 hours.
> > >
> > > Thanks,
> > > Xander
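
In case it's useful to anyone who lands on this thread later, here is the rough shape of the "write small files, compact them periodically" approach described above, as an untested C++ sketch. It assumes all the small files share the same schema; the CompactFiles name and the chunk size are just placeholders I made up, and error handling is minimal.

// Untested sketch: merge a batch of small Parquet files into one larger file.
// Assumes every input file has the same schema.
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <memory>
#include <string>
#include <vector>

arrow::Status CompactFiles(const std::vector<std::string>& inputs,
                           const std::string& output_path) {
  std::vector<std::shared_ptr<arrow::Table>> tables;
  for (const auto& path : inputs) {
    ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(
        parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
    tables.push_back(std::move(table));
  }
  // Stitch the small tables together and rewrite them as a single file with
  // larger row groups.
  ARROW_ASSIGN_OR_RAISE(auto combined, arrow::ConcatenateTables(tables));
  ARROW_ASSIGN_OR_RAISE(auto outfile, arrow::io::FileOutputStream::Open(output_path));
  return parquet::arrow::WriteTable(*combined, arrow::default_memory_pool(), outfile,
                                    /*chunk_size=*/1024 * 1024);
}

A scheduled job (Elad's point 3) could call something like this every few hours and delete the small input files only after the combined file has been written successfully.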
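
On the parquet::StreamWriter part of my question: since the footer is only written when the writer is closed, my understanding from Weston's reply is that a crash partway through the day would most likely leave a single 24-hour file unreadable. A compromise I'm considering is to still use StreamWriter (so rows go to disk as they arrive instead of sitting in memory), but close and rotate to a new file every N rows, so only the file currently being written is at risk. Again an untested sketch - the two-column schema, the NextPath() naming helper and the one-million-row threshold are all made up:

// Untested sketch: stream rows with parquet::StreamWriter, rotating to a new
// file every kRowsPerFile rows. Each rotated file gets its footer on close,
// so a crash can only lose the rows in the file currently being written.
#include <arrow/io/file.h>
#include <parquet/exception.h>
#include <parquet/file_writer.h>
#include <parquet/schema.h>
#include <parquet/stream_writer.h>

#include <cstdint>
#include <memory>
#include <string>

std::shared_ptr<parquet::schema::GroupNode> MakeSchema() {
  // Placeholder two-column schema: an int64 id and a double value.
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "event_id", parquet::Repetition::REQUIRED, parquet::Type::INT64,
      parquet::ConvertedType::INT_64));
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "value", parquet::Repetition::REQUIRED, parquet::Type::DOUBLE,
      parquet::ConvertedType::NONE));
  return std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));
}

// Placeholder: real file names would encode the date and a sequence number.
std::string NextPath(int64_t file_index) {
  return "events-" + std::to_string(file_index) + ".parquet";
}

int main() {
  constexpr int64_t kRowsPerFile = 1000000;  // rotation threshold (made up)
  const auto schema = MakeSchema();
  int64_t event_id = 0;

  for (int64_t file_index = 0; ; ++file_index) {  // runs for as long as data arrives
    std::shared_ptr<arrow::io::FileOutputStream> outfile;
    PARQUET_ASSIGN_OR_THROW(outfile,
                            arrow::io::FileOutputStream::Open(NextPath(file_index)));

    // One StreamWriter per file: when `writer` goes out of scope at the end of
    // this iteration, the file is closed and its footer written.
    parquet::StreamWriter writer{parquet::ParquetFileWriter::Open(outfile, schema)};

    for (int64_t row = 0; row < kRowsPerFile; ++row, ++event_id) {
      double value = static_cast<double>(event_id) * 0.5;  // stand-in for real data
      writer << event_id << value << parquet::EndRow;
    }
  }
}

This still produces many files per day, but combined with a periodic compaction job like the one sketched here, they only need to live until the next compaction run.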
