You might want to take a look at the Watch[1] transform which "scans" a directory for new files and allows you to process them as they arrive.
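For reference, a minimal Java sketch of that polling approach, using FileIO.match().continuously(...) (which is built on the Watch transform); the file pattern and poll interval are placeholders, and the result is an unbounded PCollection, so the runner must support streaming:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WatchIntakeDir {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Poll the intake directory every 30 seconds; new files are
    // emitted as they land, and the pipeline keeps running.
    PCollection<String> lines =
        p.apply(FileIO.match()
                .filepattern("/data/intake/*.csv")  // placeholder path
                .continuously(
                    Duration.standardSeconds(30),
                    Watch.Growth.never()))          // watch forever
         .apply(FileIO.readMatches())
         .apply(TextIO.readFiles());

    // ... parse `lines` and write to the database here ...

    p.run().waitUntilFinish();
  }
}
```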
1: https://beam.apache.org/documentation/patterns/file-processing/

On Mon, Apr 13, 2020 at 7:05 PM Cameron Bateman <[email protected]> wrote:

> Thanks, Vincent. I looked briefly at Kafka. I might revisit it, but the
> learning curve looks steep and it would probably be overkill at the scale
> I'm at with this project. My intake right now is a few files a day that
> reduce to a few kilobytes of data. I have future projects that involve
> many more files in a similar scenario, so I will revisit Kafka then.
>
> Thanks,
>
> Cameron
>
> On Mon, Apr 13, 2020 at 5:28 PM Vincent Marquez <[email protected]>
> wrote:
>
>> At first glance this sounds like a problem for a persistent queue such
>> as Kafka or Google Cloud's Pub/Sub. You could write a path to the queue
>> upon download, which would trigger Beam to read the file, and then bump
>> the queue offset only upon completion of the read. If the read of the
>> file fails, the offset won't get committed, so you get 'at least once'
>> semantics. Just remember: unless you have unlimited memory/disk there's
>> not really such a thing as 'exactly once', but it sounds like for your
>> case you'd prefer 'at least once' over 'at most once'.
>>
>> On Mon, Apr 13, 2020 at 4:53 PM Cameron Bateman <[email protected]>
>> wrote:
>>
>>> I have a use case where I'm regularly polling for and downloading data
>>> files from a public (government) web site. I then intake these files
>>> from a directory and pass them through a Beam pipeline, with the data
>>> ultimately being deposited into a database.
>>>
>>> As the files come in, I would like to track them somewhere like a
>>> database, perhaps with a checksum and some other metadata. When an
>>> intake through the pipeline succeeds, I would like to archive the file
>>> and delete it from the main intake directory. When an intake on the
>>> pipeline fails, I would like to keep the file, mark it as an error in
>>> that database, and either leave it in the intake dir or move it to
>>> another location so I can fix the problem.
>>>
>>> Is there a framework that does something like this, ideally one with
>>> Beam integration? This seems like a common scenario (in a prior life,
>>> I did this sort of thing for a customer who sent CSV files once a day
>>> to a drop location, which we then processed), yet I've always ended up
>>> writing something custom. Maybe I'm just using the wrong Google search
>>> criteria.
>>>
>>> Thanks,
>>>
>>> Cameron
>>>
>>
>>
>> --
>> *~Vincent*
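For reference, a minimal Java sketch of the queue-of-paths approach Vincent describes, assuming a Kafka topic whose messages are file paths; the broker address, topic, and group id are placeholders. KafkaIO's commitOffsetsInFinalize() commits offsets only when Beam finalizes the bundle, i.e. after the records have been durably processed, which gives the at-least-once behavior described above:

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.StringDeserializer;

public class QueuedIntake {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("localhost:9092")  // placeholder broker
            .withTopic("intake-paths")               // placeholder topic of file paths
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // A consumer group id is required for offset commits.
            .withConsumerConfigUpdates(
                Collections.<String, Object>singletonMap("group.id", "intake"))
            // Commit offsets only after Beam finalizes the bundle,
            // giving at-least-once delivery.
            .commitOffsetsInFinalize()
            .withoutMetadata())
     .apply(Values.create())      // keep just the file path
     .apply(FileIO.matchAll())    // resolve each path to file metadata
     .apply(FileIO.readMatches())
     .apply(TextIO.readFiles());
     // ... parse and write to the database here ...

    p.run().waitUntilFinish();
  }
}
```

If a file read fails and the bundle is retried, the same path will be reprocessed, so the downstream database write should be idempotent (e.g. keyed on the checksum Cameron mentions).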
