Hi Pablo,

*Regarding elements coming from PubSub into your pipeline:*
Once the data enters your pipeline, it is 'acknowledged' on your PubSub
subscription, and you won't be able to retrieve it again from PubSub on the
same subscription.

This part differs from my understanding of how Pub/Sub messages are consumed
in a Dataflow pipeline. I believe a message is only acknowledged when a
PCollection in the pipeline gets materialized (
https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio).
This means that if the pipeline is not complicated, fusion optimization will
fuse multiple stages together, and if any of those fused stages throws an
exception, the Pub/Sub message won't be acknowledged. I've also verified
this behavior.
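
To illustrate, here's a minimal sketch with the Beam Java SDK (the
subscription and bucket names are made up). In a pipeline like this, the
PubsubIO read is fused with the ParDo that follows it, so an exception
thrown in processElement fails the whole bundle and the message is
redelivered instead of being acked:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class AckOnCommitExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubsub",
            PubsubIO.readStrings().fromSubscription(
                "projects/my-project/subscriptions/my-subscription"))
        .apply("MaybeFail", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // If this throws, the bundle fails, the fused read is retried,
            // and the message is NOT acked on the subscription.
            if (c.element().contains("bad")) {
              throw new RuntimeException("Simulated failure: " + c.element());
            }
            c.output(c.element());
          }
        }))
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply("WriteToGcs",
            TextIO.write().to("gs://my-bucket/output")
                .withWindowedWrites()
                .withNumShards(1));

    p.run();
  }
}

Once the failing stage is fixed (e.g. via a live update as you describe
below), the retried bundle should go through and only then does the message
get acked.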

Let me know if my understanding is correct. :)

Thanks,

Derek


On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada <[email protected]> wrote:

> I am not a streaming expert, but I will answer according to how I
> understand the system, and others can correct me if I get something wrong.
>
> *Regarding elements coming from PubSub into your pipeline:*
> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
> subscription, and you won't be able to retrieve it again from PubSub on the
> same subscription.
>
> *Regarding elements stuck within your pipeline:*
> Bundles in a streaming pipeline are executed and committed individually.
> This means that one bundle may be stuck, while all other bundles may be
> moving forward in your pipeline. In a case like this, you won't be able to
> drain the pipeline because there is one bundle that can't be drained out
> (because exceptions are thrown every time processing for it is attempted).
> On the other hand, if you cancel your pipeline, then the information
> regarding the progress made by each bundle will be lost, so you will drop
> the data that was stuck within your pipeline, and was never written out.
> (That data was also acked in your PubSub subscription, so it won't come out
> from PubSub if you reattach to the same subscription later). - So cancel
> may not be what you're looking for either.
>
> For cases like these, what you'd need to do is to live-update your
> pipeline with code that can handle the problems in your current pipeline.
> This new code will replace the code in your pipeline stages, and then
> Dataflow will continue processing your data in the state that it was in
> before the update. This means that if there's one bundle that was stuck, it
> will be retried against the new code, and it will finally make progress
> across your pipeline.
>
> If you want to completely change, or stop your pipeline without dropping
> stuck bundles, you will still need to live-update it, and then drain it.
>
> I hope that was clear. Let me know if you need more clarification - and
> perhaps others will have more to add / correct.
> Best!
> -P.
>
> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <[email protected]>
> wrote:
>
>> Hi,
>>
>> I'd like to confirm Beam's data guarantees when used with Google Cloud
>> PubSub and Cloud Storage and running on Dataflow. I can't find any explicit
>> documentation on it.
>>
>> If the Beam job is running successfully, then I believe all data will be
>> delivered to GCS at least once. If I stop the job with 'Drain', then any
>> inflight data will be processed and saved.
>>
>> What happens if the Beam job is not running successfully, and maybe
>> throwing exceptions? Will the data still be available in PubSub when I
>> cancel (not drain) the job? Does a drain work successfully if the data
>> cannot be written to GCS because of the exceptions?
>>
>> Thanks,
>> Andrew
>>
>


-- 
Derek Hao Hu

Software Engineer | Snapchat
Snap Inc.
