I have a use case where I'm regularly polling for and downloading data
files from a public (government) web site.  I then intake these files from
a directory and pass them through a Beam pipeline with the data ultimately
being deposited into a database.

As the files come in, I would like to track them somewhere like a database,
perhaps with a checksum and some other metadata.  When an intake through
the pipeline succeeds, I would like to archive the file and delete it from
the main intake directory.  When an intake on the pipeline fails, I would
like to keep the file, mark it as an error in that database, and either
leave it in the intake dir or move it to another location where I can fix
the problem, etc.
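For context, what I keep hand-rolling looks roughly like this minimal
sketch (the table schema, directory names, and the `process` callback that
stands in for the Beam pipeline run are all placeholders, not code from my
actual system):

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path


def checksum(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def record_and_route(db: sqlite3.Connection, path: Path,
                     archive_dir: Path, error_dir: Path,
                     process) -> None:
    """Track the file, run intake, then archive or quarantine it.

    `process` is a placeholder for kicking off the Beam pipeline on
    this one file; any exception it raises marks the file as an error.
    """
    digest = checksum(path)
    db.execute(
        "INSERT INTO intake_files (name, sha256, status) VALUES (?, ?, ?)",
        (path.name, digest, "pending"),
    )
    try:
        process(path)
    except Exception:
        # Intake failed: keep the file, but move it aside for repair.
        db.execute(
            "UPDATE intake_files SET status = 'error' WHERE sha256 = ?",
            (digest,),
        )
        shutil.move(str(path), str(error_dir / path.name))
    else:
        # Intake succeeded: archive the file out of the intake dir.
        db.execute(
            "UPDATE intake_files SET status = 'done' WHERE sha256 = ?",
            (digest,),
        )
        shutil.move(str(path), str(archive_dir / path.name))
    db.commit()
```

Nothing complicated, which is exactly why I'd expect a framework to
already cover it.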

Is there a framework that does something like this, ideally one with Beam
integration?  This seems like a common scenario (in a prior life, I did
this sort of thing for a customer who sent CSV files once a day to a drop
location, which we then processed).  Yet I've always ended up writing
something custom.  Maybe I'm just using the wrong Google search terms.

Thanks,

Cameron
