Oops - well, sorry about that! Glad Luke was able to clarify.
Best.
-P.

On Thu, Jan 4, 2018, 1:20 PM Lukasz Cwik <[email protected]> wrote:
> That is correct, Derek: Google Cloud Dataflow will only ack the message to Pubsub when a bundle completes.
> For very simple pipelines, fusion will make it so that all downstream actions happen within the same bundle.
>
> On Thu, Jan 4, 2018 at 1:00 PM, Derek Hao Hu <[email protected]> wrote:
>
>> Hi Pablo,
>>
>> *Regarding elements coming from PubSub into your pipeline:*
>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub subscription, and you won't be able to retrieve it again from PubSub on the same subscription.
>>
>> This part differs from my understanding of consuming Pub/Sub messages in the Dataflow pipeline. I think the message will only be committed when a PCollection in the pipeline gets materialized (https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio), which means that if the pipeline is not complicated, fusion optimization will fuse multiple stages together, and if any of these stages throws an exception, the Pub/Sub message won't be acknowledged. I've also verified this behavior.
>>
>> Let me know if my understanding is correct. :)
>>
>> Thanks,
>>
>> Derek
>>
>> On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada <[email protected]> wrote:
>>
>>> I am not a streaming expert, but I will answer according to how I understand the system, and others can correct me if I get something wrong.
>>>
>>> *Regarding elements coming from PubSub into your pipeline:*
>>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub subscription, and you won't be able to retrieve it again from PubSub on the same subscription.
>>>
>>> *Regarding elements stuck within your pipeline:*
>>> Bundles in a streaming pipeline are executed and committed individually. This means that one bundle may be stuck, while all other bundles may be moving forward in your pipeline. In a case like this, you won't be able to drain the pipeline, because there is one bundle that can't be drained out (exceptions are thrown every time processing for it is attempted). On the other hand, if you cancel your pipeline, the information regarding the progress made by each bundle will be lost, so you will drop the data that was stuck within your pipeline and was never written out. (That data was also acked in your PubSub subscription, so it won't come out of PubSub if you reattach to the same subscription later.) So cancel may not be what you're looking for either.
>>>
>>> For cases like these, what you'd need to do is live-update your pipeline with code that can handle the problems in your current pipeline. This new code will replace the code in your pipeline stages, and then Dataflow will continue processing your data from the state it was in before the update. This means that if there is one bundle that was stuck, it will be retried against the new code and will finally make progress across your pipeline.
>>>
>>> If you want to completely change or stop your pipeline without dropping stuck bundles, you will still need to live-update it and then drain it.
>>>
>>> I hope that was clear. Let me know if you need more clarification - and perhaps others will have more to add / correct.
>>> Best!
>>> -P.
>>> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to confirm Beam's data guarantees when used with Google Cloud PubSub and Cloud Storage and running on Dataflow. I can't find any explicit documentation on it.
>>>>
>>>> If the Beam job is running successfully, then I believe all data will be delivered to GCS at least once. If I stop the job with 'Drain', then any in-flight data will be processed and saved.
>>>>
>>>> What happens if the Beam job is not running successfully, and maybe throwing exceptions? Will the data still be available in PubSub when I cancel (not drain) the job? Does a drain work successfully if the data cannot be written to GCS because of the exceptions?
>>>>
>>>> Thanks,
>>>> Andrew
>>
>> --
>> Derek Hao Hu
>>
>> Software Engineer | Snapchat
>> Snap Inc.
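For reference, a minimal sketch of the Pub/Sub-to-GCS pipeline Andrew describes, assuming hypothetical subscription and bucket names, window size and shard count. Per the discussion above, an exception thrown in the fused formatting step fails the bundle that contains the Pub/Sub read, so those messages are not acked and will be redelivered rather than lost.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class PubsubToGcs {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromPubsub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))
     // If this step throws, the fused bundle fails and the messages
     // stay unacked on the subscription until processing succeeds.
     .apply("FormatLine", MapElements.into(TypeDescriptors.strings())
            .via((String msg) -> msg.trim()))
     // Streaming writes to GCS need windowing and an explicit shard count.
     .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
     .apply("WriteToGcs", TextIO.write()
            .to("gs://my-bucket/output/part")
            .withWindowedWrites()
            .withNumShards(1));

    p.run();
  }
}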
