At first glance this sounds like a problem for a persistent queue such as Kafka or Google Cloud Pub/Sub. You could write the file's path to the queue upon download, which would trigger Beam to read the file, and then commit the offset back to the queue only once the read completes. If the read of the file fails, the offset won't get committed, so you get 'at least once' semantics. Just remember, unless you have unlimited memory/disk there's not really such a thing as 'exactly once', but it sounds like for your case you'd prefer 'at least once' over 'at most once'.
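To make the commit-after-success idea concrete, here's a minimal sketch in plain Python. It uses an in-memory deque to stand in for the Kafka/Pub/Sub topic, and the names (`FileQueue`, `drain`, etc.) are all hypothetical, not part of any real client library; a real consumer would use the broker's own ack/commit API, but the ordering of "process, then commit" is the same:

```python
from collections import deque


class FileQueue:
    """Stand-in for a persistent queue of downloaded file paths.

    An entry is removed (i.e. the offset is committed) only after the
    consumer explicitly calls ack(), so a failed read leaves the path
    queued for a retry on the next run.
    """

    def __init__(self):
        self._pending = deque()

    def publish(self, path):
        self._pending.append(path)

    def poll(self):
        # Peek without committing: the path stays queued until ack().
        return self._pending[0] if self._pending else None

    def ack(self):
        # Commit the offset; only called after a successful read.
        self._pending.popleft()


def drain(queue, read_file):
    """Process queued paths; commit each offset only after success.

    On failure we stop without acking, so the same path is re-delivered
    on the next run -- at-least-once, not exactly-once, semantics.
    """
    results = []
    while (path := queue.poll()) is not None:
        try:
            results.append(read_file(path))
        except IOError:
            break          # offset not committed; path re-read next run
        queue.ack()        # commit only after the read succeeded
    return results
```

Note the failure path: because the offset isn't committed, a file that failed once may be read again and then succeed, so any downstream writes (e.g. the database insert) should be idempotent.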
On Mon, Apr 13, 2020 at 4:53 PM Cameron Bateman <[email protected]> wrote:

> I have a use case where I'm regularly polling for and downloading data
> files from a public (government) web site. I then intake these files from
> a directory and pass them through a Beam pipeline, with the data ultimately
> being deposited into a database.
>
> As the files come in, I would like to track them somewhere like a database,
> perhaps with a checksum and some other metadata. When an intake through
> the pipeline succeeds, I would like to archive the file and delete it from
> the main intake directory. When an intake on the pipeline fails, I would
> like to keep the file, mark it as an error in that database, and either
> leave it at the intake dir or move it to another location for me to fix the
> problem, etc.
>
> Is there a framework that does something like this, ideally one with Beam
> integration? This seems like a common scenario (in a prior life, I did
> this sort of thing for a customer who sent CSV files once a day to a drop
> location, which we then processed). Yet I've always ended up writing
> something custom. Maybe I'm just using the wrong Google criteria.
>
> Thanks,
>
> Cameron

--
*~Vincent*
