You might want to take a look at the Watch[1] transform which "scans" a directory for new files and allows you to process them as they arrive.
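For reference, a minimal Java sketch of that polling approach, using FileIO.match().continuously(...) (which is built on the Watch transform); the file pattern and poll interval are placeholders, and the result is an unbounded PCollection, so the runner must support streaming:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WatchIntakeDir {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Poll the intake directory every 30 seconds; new files are
    // emitted as they land, and the pipeline keeps running.
    PCollection<String> lines =
        p.apply(FileIO.match()
                .filepattern("/data/intake/*.csv")  // placeholder path
                .continuously(
                    Duration.standardSeconds(30),
                    Watch.Growth.never()))          // watch forever
         .apply(FileIO.readMatches())
         .apply(TextIO.readFiles());

    // ... parse `lines` and write to the database here ...

    p.run().waitUntilFinish();
  }
}
```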
1: https://beam.apache.org/documentation/patterns/file-processing/

On Mon, Apr 13, 2020 at 7:05 PM Cameron Bateman <[email protected]> wrote:

> Thanks, Vincent. I looked briefly at Kafka. I might revisit it, but the
> learning curve looks steep and it would probably be overkill at the scale
> I'm at with this project. My intake right now is a few files a day that
> reduce to a few kilobytes of data. I have future projects that involve
> many more files in a similar scenario, so I will revisit Kafka then.
>
> Thanks,
>
> Cameron
>
> On Mon, Apr 13, 2020 at 5:28 PM Vincent Marquez <[email protected]>
> wrote:
>
>> At first glance this sounds like a problem for a persistent queue such
>> as Kafka or Google Cloud's Pub/Sub. You could write a path to the queue
>> upon download, which would trigger Beam to read the file, and then bump
>> the queue offset only upon completion of the read. If the read of the
>> file fails, the offset won't get committed, so you get 'at least once'
>> semantics. Just remember: unless you have unlimited memory/disk there's
>> not really such a thing as 'exactly once', but it sounds like for your
>> case you'd prefer 'at least once' over 'at most once'.
>>
>> On Mon, Apr 13, 2020 at 4:53 PM Cameron Bateman <[email protected]>
>> wrote:
>>
>>> I have a use case where I'm regularly polling for and downloading data
>>> files from a public (government) web site. I then intake these files
>>> from a directory and pass them through a Beam pipeline, with the data
>>> ultimately being deposited into a database.
>>>
>>> As the files come in, I would like to track them somewhere like a
>>> database, perhaps with a checksum and some other metadata. When an
>>> intake through the pipeline succeeds, I would like to archive the file
>>> and delete it from the main intake directory. When an intake on the
>>> pipeline fails, I would like to keep the file, mark it as an error in
>>> that database, and either leave it in the intake dir or move it to
>>> another location so I can fix the problem.
>>>
>>> Is there a framework that does something like this, ideally one with
>>> Beam integration? This seems like a common scenario (in a prior life,
>>> I did this sort of thing for a customer who sent CSV files once a day
>>> to a drop location, which we then processed), yet I've always ended up
>>> writing something custom. Maybe I'm just using the wrong Google search
>>> criteria.
>>>
>>> Thanks,
>>>
>>> Cameron
>>>
>>
>>
>> --
>> *~Vincent*
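For reference, a minimal Java sketch of the queue-of-paths approach Vincent describes, assuming a Kafka topic whose messages are file paths; the broker address, topic, and group id are placeholders. KafkaIO's commitOffsetsInFinalize() commits offsets only when Beam finalizes the bundle, i.e. after the records have been durably processed, which gives the at-least-once behavior described above:

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.StringDeserializer;

public class QueuedIntake {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("localhost:9092")  // placeholder broker
            .withTopic("intake-paths")               // placeholder topic of file paths
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // A consumer group id is required for offset commits.
            .withConsumerConfigUpdates(
                Collections.<String, Object>singletonMap("group.id", "intake"))
            // Commit offsets only after Beam finalizes the bundle,
            // giving at-least-once delivery.
            .commitOffsetsInFinalize()
            .withoutMetadata())
     .apply(Values.create())      // keep just the file path
     .apply(FileIO.matchAll())    // resolve each path to file metadata
     .apply(FileIO.readMatches())
     .apply(TextIO.readFiles());
     // ... parse and write to the database here ...

    p.run().waitUntilFinish();
  }
}
```

If a file read fails and the bundle is retried, the same path will be reprocessed, so the downstream database write should be idempotent (e.g. keyed on the checksum Cameron mentions).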
