Re: Long-Running Continuous Data Saving to File

2021-05-28 Thread Xander Dunn
Thanks to both of you, this is helpful. On Wed, May 26, 2021 at 6:07 PM, Weston Pace wrote: > Elad's advice is very helpful. This is not a problem that Arrow solves today (to the best of my knowledge). It is a topic that comes up periodically[1][2][3]. If a crash happens while your parqu...

Re: Long-Running Continuous Data Saving to File

2021-05-26 Thread Elad Rosenheim
I want to add a few notes from my experience with Kafka: 1. There's an ecosystem - having battle-tested consumers that write to various external systems, with known reliability guarantees, is very helpful. It's also then possible to have multiple consumers - some batch, some real-time streaming (e...

Re: Long-Running Continuous Data Saving to File

2021-05-26 Thread Weston Pace
Elad's advice is very helpful. This is not a problem that Arrow solves today (to the best of my knowledge). It is a topic that comes up periodically[1][2][3]. If a crash happens while your parquet stream writer is open then the most likely outcome is that you will be missing the footer (this get...
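A minimal sketch of the failure mode Weston describes, assuming the Arrow C++ API as it looked around the 2021 releases (FileWriter::Open returning a Status with an out-parameter; newer versions return a Result instead). The function name, output path, and chunk size are placeholders, not anything from the thread.

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

// Keeps one Parquet file open and appends row groups to it over time.
arrow::Status WriteStream(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::vector<std::shared_ptr<arrow::Table>>& batches) {
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("data.parquet"));
  std::unique_ptr<parquet::arrow::FileWriter> writer;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileWriter::Open(
      *schema, arrow::default_memory_pool(), sink,
      parquet::default_writer_properties(), &writer));

  for (const auto& table : batches) {
    // Row groups reach the file as they are written...
    ARROW_RETURN_NOT_OK(writer->WriteTable(*table, /*chunk_size=*/1 << 20));
  }
  // ...but the footer (schema and row-group metadata) is only written by
  // Close(). If the process crashes before this line, the file has no
  // footer and standard readers cannot open it.
  return writer->Close();
}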

Re: Long-Running Continuous Data Saving to File

2021-05-26 Thread Elad Rosenheim
Hi, While I'm not using the C++ version of Arrow, the issue you're talking about is a very common concern. There are a few points to discuss here: 1. Generally, Parquet files cannot be appended to. You could of course load the file to memory, add more information and re-save, but that's not real...
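A rough sketch of the "load, add more, re-save" workaround Elad mentions, to make the scaling problem concrete: every flush re-reads and re-writes all previously saved rows. The function name, path handling, and chunk size are illustrative only; in practice you would write to a temporary file and rename it so a crash mid-rewrite does not destroy the original.

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

// "Appends" to a Parquet file by reading it back, concatenating, and
// rewriting the whole file. Cost grows with the size of the existing file.
arrow::Status AppendByRewriting(const std::shared_ptr<arrow::Table>& new_rows,
                                const std::string& path) {
  // Read the existing file fully into memory.
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> existing;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&existing));

  // Concatenate old and new rows (the schemas must match).
  ARROW_ASSIGN_OR_RAISE(auto combined,
                        arrow::ConcatenateTables({existing, new_rows}));

  // Re-save everything; the entire file is rewritten on each call.
  ARROW_ASSIGN_OR_RAISE(auto outfile, arrow::io::FileOutputStream::Open(path));
  return parquet::arrow::WriteTable(*combined, arrow::default_memory_pool(),
                                    outfile, /*chunk_size=*/1 << 20);
}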

Long-Running Continuous Data Saving to File

2021-05-26 Thread Xander Dunn
I have a very long-running (months) program that is streaming in data continually, processing it, and saving it to file using Arrow. My current solution is to buffer several million rows and write them to a new .parquet file each time. This works, but produces 1000+ files every day. If I could, I...
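For context, a minimal sketch of the pattern described above, using Arrow C++'s parquet::arrow::WriteTable helper: each buffered batch goes to its own self-contained Parquet file (footer included), so a crash can only lose the rows still held in memory. The directory layout, naming scheme, and chunk size are made up for illustration.

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

#include <sstream>

// Flushes one buffered batch of rows to its own Parquet file.
arrow::Status FlushBatch(const std::shared_ptr<arrow::Table>& batch,
                         int64_t batch_index) {
  std::ostringstream path;
  path << "data/part-" << batch_index << ".parquet";
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open(path.str()));
  // WriteTable produces a complete file, footer and all, in a single call.
  return parquet::arrow::WriteTable(*batch, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/1 << 20);
}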