Thanks Vincent. I looked briefly at Kafka. I might revisit it, but the learning curve looks steep and it would probably be overkill at the scale of this project. My intake right now is a few files a day that reduce to a few kilobytes' worth of data. I have future projects that involve many more files in a similar scenario, so I will revisit Kafka then.
Thanks,
Cameron

On Mon, Apr 13, 2020 at 5:28 PM Vincent Marquez <[email protected]> wrote:

> On first glance it sounds like a problem for a persistent queue such as
> Kafka or Google Cloud's Pub/Sub. You could write a path to the queue upon
> download, which would trigger Beam to read the file, and then bump the
> offset only upon completion of the read. If the read of the file fails,
> the offset won't get committed, so you get 'at least once' semantics.
> Just remember, unless you have unlimited memory/disk there's not really
> such a thing as 'exactly once', but it sounds like for your case you'd
> prefer 'at least once' over 'at most once'.
>
> On Mon, Apr 13, 2020 at 4:53 PM Cameron Bateman <[email protected]>
> wrote:
>
>> I have a use case where I'm regularly polling for and downloading data
>> files from a public (government) web site. I then take in these files
>> from a directory and pass them through a Beam pipeline, with the data
>> ultimately being deposited into a database.
>>
>> As the files come in, I would like to track them somewhere, such as a
>> database, perhaps with a checksum and some other metadata. When an
>> intake through the pipeline succeeds, I would like to archive the file
>> and delete it from the main intake directory. When an intake fails, I
>> would like to keep the file, mark it as an error in that database, and
>> either leave it in the intake directory or move it to another location
>> so I can fix the problem.
>>
>> Is there a framework that does something like this, ideally one with
>> Beam integration? This seems like a common scenario (in a prior life, I
>> did this sort of thing for a customer who sent CSV files once a day to
>> a drop location, which we then processed), yet I've always ended up
>> writing something custom. Maybe I'm just using the wrong Google search
>> terms.
>>
>> Thanks,
>>
>> Cameron
>
>
> --
> *~Vincent*
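
For anyone landing on this thread later, here is a minimal sketch of the manual offset-commit pattern Vincent describes, combined with the archive-on-success / keep-on-failure workflow from the original question. It assumes the kafka-python client and a topic whose messages are file paths written at download time; the topic name, directory layout, and the process_file/record_error helpers are hypothetical stand-ins for the Beam pipeline and the tracking database, not anything prescribed in the thread.

# At-least-once file intake driven by a Kafka topic of file paths.
# Assumes kafka-python. process_file() and record_error() are hypothetical
# placeholders for the Beam pipeline and the metadata/tracking database.
import hashlib
import shutil
from pathlib import Path

from kafka import KafkaConsumer

ARCHIVE_DIR = Path("/data/archive")  # assumed directory layout

def checksum(path: Path) -> str:
    # SHA-256 of the file contents, for the metadata record.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def process_file(path: Path) -> None:
    # Placeholder: run the Beam pipeline (or any loader) over one file.
    ...

def record_error(path: Path, digest: str, err: Exception) -> None:
    # Placeholder: mark the file as failed in the tracking database.
    ...

consumer = KafkaConsumer(
    "intake-paths",                    # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="file-intake",
    enable_auto_commit=False,          # commit manually, only after success
    value_deserializer=lambda b: b.decode("utf-8"),
)

for msg in consumer:
    path = Path(msg.value)
    digest = checksum(path)
    try:
        process_file(path)
        shutil.move(str(path), str(ARCHIVE_DIR / path.name))  # archive on success
        consumer.commit()  # offset advances only now; a crash before this
                           # point replays the file, i.e. at-least-once
    except Exception as err:
        record_error(path, digest, err)
        # No commit: the file stays in the intake directory, and the
        # uncommitted message is redelivered on restart, after the
        # problem has been fixed by hand.
        break

In a full Beam deployment the consumer loop would typically be replaced by KafkaIO (Java) or ReadFromKafka (Python) with the runner's own checkpointing; the hand-rolled loop above is only meant to make the offset mechanics Vincent describes explicit.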
