Oops - well, sorry about that! Glad Luke was able to clarify.
Best.
-P.

On Thu, Jan 4, 2018, 1:20 PM Lukasz Cwik <[email protected]> wrote:
> That is correct, Derek: Google Cloud Dataflow will only ack the message to Pubsub when a bundle completes.
> For very simple pipelines, fusion will make it so that all downstream actions happen within the same bundle.
>
> On Thu, Jan 4, 2018 at 1:00 PM, Derek Hao Hu <[email protected]> wrote:
>
>> Hi Pablo,
>>
>> *Regarding elements coming from PubSub into your pipeline:*
>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub subscription, and you won't be able to retrieve it again from PubSub on the same subscription.
>>
>> This part differs from my understanding of consuming Pub/Sub messages in the Dataflow pipeline. I think the message will only be committed when a PCollection in the pipeline gets materialized (https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio), which means that if the pipeline is not complicated, fusion optimization will fuse multiple stages together, and if any of these stages throws an exception, the Pub/Sub message won't be acknowledged. I've also verified this behavior.
>>
>> Let me know if my understanding is correct. :)
>>
>> Thanks,
>>
>> Derek
>>
>> On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada <[email protected]> wrote:
>>
>>> I am not a streaming expert, but I will answer according to how I understand the system, and others can correct me if I get something wrong.
>>>
>>> *Regarding elements coming from PubSub into your pipeline:*
>>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub subscription, and you won't be able to retrieve it again from PubSub on the same subscription.
>>>
>>> *Regarding elements stuck within your pipeline:*
>>> Bundles in a streaming pipeline are executed and committed individually. This means that one bundle may be stuck, while all other bundles may be moving forward in your pipeline. In a case like this, you won't be able to drain the pipeline, because there is one bundle that can't be drained out (exceptions are thrown every time processing for it is attempted). On the other hand, if you cancel your pipeline, the information regarding the progress made by each bundle will be lost, so you will drop the data that was stuck within your pipeline and was never written out. (That data was also acked in your PubSub subscription, so it won't come out of PubSub if you reattach to the same subscription later.) So cancel may not be what you're looking for either.
>>>
>>> For cases like these, what you'd need to do is live-update your pipeline with code that can handle the problems in your current pipeline. This new code will replace the code in your pipeline stages, and then Dataflow will continue processing your data from the state it was in before the update. This means that if there is one bundle that was stuck, it will be retried against the new code and will finally make progress across your pipeline.
>>>
>>> If you want to completely change or stop your pipeline without dropping stuck bundles, you will still need to live-update it and then drain it.
>>>
>>> I hope that was clear. Let me know if you need more clarification - and perhaps others will have more to add / correct.
>>> Best!
>>> -P.
>>> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to confirm Beam's data guarantees when used with Google Cloud PubSub and Cloud Storage and running on Dataflow. I can't find any explicit documentation on it.
>>>>
>>>> If the Beam job is running successfully, then I believe all data will be delivered to GCS at least once. If I stop the job with 'Drain', then any in-flight data will be processed and saved.
>>>>
>>>> What happens if the Beam job is not running successfully, and maybe throwing exceptions? Will the data still be available in PubSub when I cancel (not drain) the job? Does a drain work successfully if the data cannot be written to GCS because of the exceptions?
>>>>
>>>> Thanks,
>>>> Andrew
>>
>> --
>> Derek Hao Hu
>>
>> Software Engineer | Snapchat
>> Snap Inc.
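For reference, a minimal sketch of the Pub/Sub-to-GCS pipeline Andrew describes, assuming hypothetical subscription and bucket names, window size and shard count. Per the discussion above, an exception thrown in the fused formatting step fails the bundle that contains the Pub/Sub read, so those messages are not acked and will be redelivered rather than lost.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class PubsubToGcs {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromPubsub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))
     // If this step throws, the fused bundle fails and the messages
     // stay unacked on the subscription until processing succeeds.
     .apply("FormatLine", MapElements.into(TypeDescriptors.strings())
            .via((String msg) -> msg.trim()))
     // Streaming writes to GCS need windowing and an explicit shard count.
     .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
     .apply("WriteToGcs", TextIO.write()
            .to("gs://my-bucket/output/part")
            .withWindowedWrites()
            .withNumShards(1));

    p.run();
  }
}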
