Hi Robert,

Sorry for the late reply. I just did a quick test (rough sketch below), and
it seems to work:
1. while the thread is blocked, checkpoints may time out, but once the
thread is no longer blocked, checkpointing continues
2. this guarantees the message ordering
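
In case it helps, here is a rough sketch of what I tested, following
Robert's suggestion. The class, type, and helper names are placeholders,
not our real code:

import org.apache.flink.api.common.functions.RichMapFunction;

// The validation map blocks (with retries) while the external system is
// down, which back-pressures the whole job including the Kafka source.
public class BlockingValidateFn
        extends RichMapFunction<Message, ValidatedMessage> {
    @Override
    public ValidatedMessage map(Message msg) throws Exception {
        while (true) {
            try {
                return validate(msg);   // placeholder: calls the external system
            } catch (ExternalSystemDownException e) {
                // block until the system is back; checkpoints may time out
                // here, but checkpointing resumes once the thread unblocks
                Thread.sleep(5_000L);
            }
        }
    }
}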

Thanks a lot!
Eleanore

On Tue, Dec 15, 2020 at 10:42 PM Robert Metzger <rmetz...@apache.org> wrote:

> What you can also do is rely on Flink's backpressure mechanism: If the map
> operator that validates the messages detects that the external system is
> down, it blocks until the system is up again.
> This effectively causes the whole streaming job to pause: the Kafka source
> won't read new messages.
>
> On Tue, Dec 15, 2020 at 3:07 AM Eleanore Jin <eleanore....@gmail.com>
> wrote:
>
>> Hi Guowei and Arvid,
>>
>> Thanks for the suggestion. I wonder if it would make sense (and be
>> possible) for the operator to produce a side-output message telling the
>> source to 'pause', and to feed that same side output back into the source
>> as a side input, based on which the source would pause and resume?
>>
>> Thanks a lot!
>> Eleanore
>>
>> On Sun, Nov 29, 2020 at 11:33 PM Arvid Heise <ar...@ververica.com> wrote:
>>
>>> Hi Eleanore,
>>>
>>> if the external system is down, you could simply fail the job after a
>>> given timeout (for example, using asyncIO). Then the job would restart
>>> according to the configured restart strategy.
>>>
>>> If your state is rather small (and thus recovery time is okay), you would
>>> pretty much get your desired behavior: the job would stop making progress
>>> until eventually the external system responds again.
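>>>
>>> A rough sketch of what I mean (the timeout values, capacity, and
>>> function names are only examples):
>>>
>>> // keep restarting, 10s apart, for as long as the failures persist
>>> env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
>>>     Integer.MAX_VALUE, Time.of(10, TimeUnit.SECONDS)));
>>>
>>> // an async timeout throws by default, which fails the job; the restart
>>> // strategy then keeps retrying until the external system is back
>>> DataStream<ValidatedMessage> validated = AsyncDataStream.orderedWait(
>>>     source, new ValidateAsyncFn(), 30, TimeUnit.SECONDS, 100);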
>>>
>>> On Mon, Nov 30, 2020 at 7:39 AM Guowei Ma <guowei....@gmail.com> wrote:
>>>
>>>> Hi, Eleanore
>>>>
>>>> 1. AFAIK, only the job itself can "pause" itself. For example, the
>>>> operator that queries the external system could pause when the external
>>>> system is down.
>>>> 2. Maybe you could try "iterate" and send the failed messages back to be
>>>> retried if you use the DataStream API (rough sketch below).
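>>>>
>>>> A rough sketch of the iterate idea (the tag, types, and sinks are
>>>> placeholders):
>>>>
>>>> IterativeStream<Message> it = source.iterate();
>>>> // ValidateFn sends messages that failed only because the external
>>>> // system was down to the RETRY side output
>>>> SingleOutputStreamOperator<ValidatedMessage> validated =
>>>>     it.process(new ValidateFn());
>>>> it.closeWith(validated.getSideOutput(RETRY));   // feed failures back in
>>>> validated.addSink(validTopicSink);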
>>>>
>>>> Best,
>>>> Guowei
>>>>
>>>>
>>>> On Mon, Nov 30, 2020 at 1:01 PM Eleanore Jin <eleanore....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi experts,
>>>>>
>>>>> Here is my use case: a stateless Flink streaming job for message
>>>>> validation (a skeleton of the job is sketched below).
>>>>> 1. read from a Kafka topic
>>>>> 2. perform validation of the message, which requires querying an
>>>>> external system
>>>>>        2a. the metadata from the external system is cached in memory
>>>>> for 15 minutes
>>>>>        2b. another stream sends updates to refresh the cache if the
>>>>> metadata changes within those 15 minutes
>>>>> 3. if the message is valid, publish it to the valid topic
>>>>> 4. if the message is invalid, publish it to the error topic
>>>>> 5. if the external system is down, the message is marked as invalid
>>>>> with a different error code and published to the same error topic.
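>>>>>
>>>>> A skeleton of the job (topic names, schemas, and classes are
>>>>> placeholders for our own):
>>>>>
>>>>> DataStream<Message> source = env.addSource(
>>>>>     new FlinkKafkaConsumer<>("input-topic", schema, kafkaProps));
>>>>> // ValidateFn routes invalid messages to the ERRORS side output
>>>>> SingleOutputStreamOperator<Message> valid =
>>>>>     source.process(new ValidateFn());
>>>>> valid.addSink(
>>>>>     new FlinkKafkaProducer<>("valid-topic", schema, kafkaProps));
>>>>> valid.getSideOutput(ERRORS).addSink(
>>>>>     new FlinkKafkaProducer<>("error-topic", schema, kafkaProps));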
>>>>>
>>>>> Ask:
>>>>> Messages that failed because of an external system failure currently
>>>>> require manual replay.
>>>>>
>>>>> Is there a way to pause the job if there is an external system
>>>>> failure, and resume once the external system is online?
>>>>>
>>>>> Or are there any other suggestions for automatically retrying such
>>>>> errors?
>>>>>
>>>>> Thanks a lot!
>>>>> Eleanore
>>>>>
>>>>
>>>
>>> --
>>>
>>> Arvid Heise | Senior Java Developer
>>>
>>> <https://www.ververica.com/>
>>>
>>> Follow us @VervericaData
>>>
>>> --
>>>
>>> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
>>> Conference
>>>
>>> Stream Processing | Event Driven | Real Time
>>>
>>> --
>>>
>>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>>>
>>> --
>>> Ververica GmbH
>>> Registered at Amtsgericht Charlottenburg: HRB 158244 B
>>> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
>>> (Toni) Cheng
>>>
>>
