Thanks, Vincent.  I looked briefly at Kafka.  I might revisit it, but the
learning curve looks steep and it would probably be overkill at the scale
I'm at with this project.  My intake right now is a few files a day that
reduce to a few kilobytes' worth of data.  I have future projects that
involve a lot more files in a similar scenario, so I will revisit Kafka
then.

Thanks,

Cameron

On Mon, Apr 13, 2020 at 5:28 PM Vincent Marquez <[email protected]>
wrote:

> On first glance it sounds like a problem for a persistent queue such as
> Kafka or Google Cloud's Pub/Sub.  You could write each file's path to the
> queue upon download, which would trigger Beam to read the file, and then
> commit the offset back to the queue only upon completion of the read.  If
> the read of the file fails, the offset won't get committed, so you get
> 'at least once' semantics.  Just remember that unless you have unlimited
> memory/disk there's not really such a thing as 'exactly once', but it
> sounds like for your case you'd prefer 'at least once' over 'at most
> once'.
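>
> Something like this is roughly what I mean (Beam's Java SDK with KafkaIO;
> the broker, topic name, and consumer group below are made up, and the
> last step is wherever your own parsing/database logic would go):
>
> import java.util.Collections;
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.FileIO;
> import org.apache.beam.sdk.io.TextIO;
> import org.apache.beam.sdk.io.kafka.KafkaIO;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.Values;
> import org.apache.kafka.common.serialization.StringDeserializer;
>
> Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
>
> p.apply("ReadPathsFromQueue", KafkaIO.<String, String>read()
>         .withBootstrapServers("localhost:9092")   // placeholder broker
>         .withTopic("downloaded-files")            // hypothetical topic
>         .withKeyDeserializer(StringDeserializer.class)
>         .withValueDeserializer(StringDeserializer.class)
>         .withConsumerConfigUpdates(
>             Collections.<String, Object>singletonMap("group.id", "file-intake"))
>         .commitOffsetsInFinalize()  // offsets commit only after the bundle succeeds
>         .withoutMetadata())
>  .apply(Values.<String>create())    // keep just the file paths
>  .apply(FileIO.matchAll())          // resolve each path to file metadata
>  .apply(FileIO.readMatches())
>  .apply(TextIO.readFiles());        // file contents, ready for your parse/DB steps
>
> p.run().waitUntilFinish();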
>
> On Mon, Apr 13, 2020 at 4:53 PM Cameron Bateman <[email protected]>
> wrote:
>
>> I have a use case where I'm regularly polling for and downloading data
>> files from a public (government) web site.  I then intake these files from
>> a directory and pass them through a Beam pipeline with the data ultimately
>> being deposited into a database.
>>
>> As the files come in, I would like to track them somewhere like a
>> database, perhaps with a checksum and some other metadata.  When an
>> intake through the pipeline succeeds, I would like to archive the file
>> and delete it from the main intake directory.  When an intake on the
>> pipeline fails, I would like to keep the file, mark it as an error in
>> that database, and either leave it in the intake dir or move it to
>> another location for me to fix the problem, etc.
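>>
>> To make this concrete, below is the shape of what I keep re-writing by
>> hand; the recordIntake/markSuccess/markError calls stand in for my own
>> database code:
>>
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>> import java.nio.file.Paths;
>> import java.security.MessageDigest;
>>
>> void intake(Path file) {
>>     try {
>>         // Track the file up front with a checksum and metadata.
>>         byte[] checksum = MessageDigest.getInstance("SHA-256")
>>                 .digest(Files.readAllBytes(file));
>>         recordIntake(file, checksum);   // stand-in for a DB insert
>>         runPipeline(file);              // run the Beam pipeline over the file
>>         markSuccess(file);              // stand-in for a DB update
>>         // Success: archive the file out of the intake directory.
>>         Files.move(file, Paths.get("archive").resolve(file.getFileName()));
>>     } catch (Exception e) {
>>         // Failure: keep the file where it is and flag it for follow-up.
>>         markError(file, e);             // stand-in for a DB update
>>     }
>> }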
>>
>> Is there a framework that does something like this, ideally one with Beam
>> integration?  This seems like a common scenario (in a prior life, I did
>> this sort of thing for a customer who sent CSV files once a day to a drop
>> location, which we then processed), yet I've always ended up writing
>> something custom.  Maybe I'm just using the wrong Google search terms.
>>
>> Thanks,
>>
>> Cameron
>>
>
>
> --
> *~Vincent*
>
