Re: How to process mini-batch events in Flink with Datastream API

2023-02-10 Thread Leon Xu
Thanks Austin. Will take a look at the AsyncIO. Looks like a pretty cool
feature.

On Fri, Feb 10, 2023 at 1:31 PM Austin Cawley-Edwards <
austin.caw...@gmail.com> wrote:

> It's been a while, but I think I've done something similar before with
> Async I/O [1] and batching records with a window.
>
> This was years ago, so no idea if this was/is good practice, but
> essentially it was:
>
> -> Window by batch size (with a timeout trigger to maintain some SLA)
> -> Process function that just collects all records in the window
> -> Send the entire batch to the AsyncFunction
>
> This approach definitely has some downsides: you don't get to take
> advantage of some of the nice per-record features Async I/O gives you
> (ordering, retries, etc.), but it does greatly reduce the load on external
> services.
>
> Hope that helps,
> Austin
>
> [1]:
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/asyncio/
>
> On Fri, Feb 10, 2023 at 3:22 PM Leon Xu  wrote:
>
>> I wonder if windows would be the solution when it comes to the DataStream
>> API.
>>
>> On Fri, Feb 10, 2023 at 12:07 PM Leon Xu  wrote:
>>
>>> Hi Flink Users,
>>>
>>> We wanted to use Flink to run a decoration pipeline, where we would like
>>> to make calls to some external service to fetch data and alter the event in
>>> the Flink pipeline.
>>>
> >>> Since there's an external service call involved, we want to batch the
> >>> calls to reduce the load on the external service (batching multiple
> >>> Flink events into a single external service call).
>>>
> >>> It looks like mini-batching might be something we can leverage to
> >>> achieve that, but that feature seems to only exist in the Table API. We
> >>> are using the DataStream API and are wondering if there's any solution
> >>> or workaround for this?
>>>
>>>
>>> Thanks
>>> Leon
>>>
>>


Could savepoints contain in-flight data?

2023-02-10 Thread Alexis Sarda-Espinosa
Hello,

One feature of unaligned checkpoints is that the checkpoint barriers can
overtake in-flight data, so the buffers are persisted as part of the state.

The documentation for savepoints doesn't mention anything explicitly, so
just to be sure, will savepoints always wait for in-flight data to be
processed before they are completed, or could they also persist buffers in
certain situations (e.g. when there's backpressure)?

Regards,
Alexis.


Re: How to process mini-batch events in Flink with Datastream API

2023-02-10 Thread Austin Cawley-Edwards
It's been a while, but I think I've done something similar before with
Async I/O [1] and batching records with a window.

This was years ago, so no idea if this was/is good practice, but
essentially it was:

-> Window by batch size (with a timeout trigger to maintain some SLA)
-> Process function that just collects all records in the window
-> Send the entire batch to the AsyncFunction

This approach definitely has some downsides: you don't get to take
advantage of some of the nice per-record features Async I/O gives you
(ordering, retries, etc.), but it does greatly reduce the load on external
services.

Hope that helps,
Austin

[1]:
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/asyncio/
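The count-plus-timeout batching described above can be sketched independently of Flink. The snippet below is a simplified, framework-free illustration of the idea (the function name and thresholds are hypothetical, not from this thread); in an actual Flink job this logic would live in a count window with a processing-time timeout trigger, with each emitted batch handed to the AsyncFunction:

```python
import time

def batch_events(events, max_batch_size=100, max_wait_s=1.0):
    """Group events into batches by size, flushing early on a timeout
    so that a slow trickle of events still meets the SLA."""
    batch, deadline = [], time.monotonic() + max_wait_s
    for event in events:
        batch.append(event)
        if len(batch) >= max_batch_size or time.monotonic() >= deadline:
            yield batch  # one batch -> one external service call
            batch, deadline = [], time.monotonic() + max_wait_s
    if batch:  # flush whatever is left at end of input
        yield batch
```

Each yielded batch then corresponds to a single external service call, which is the load reduction Austin describes.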

On Fri, Feb 10, 2023 at 3:22 PM Leon Xu  wrote:

> I wonder if windows would be the solution when it comes to the DataStream
> API.
>
> On Fri, Feb 10, 2023 at 12:07 PM Leon Xu  wrote:
>
>> Hi Flink Users,
>>
>> We wanted to use Flink to run a decoration pipeline, where we would like
>> to make calls to some external service to fetch data and alter the event in
>> the Flink pipeline.
>>
>> Since there's an external service call involved, we want to batch the
>> calls to reduce the load on the external service (batching multiple Flink
>> events into a single external service call).
>>
>> It looks like mini-batching might be something we can leverage to achieve
>> that, but that feature seems to only exist in the Table API. We are using
>> the DataStream API and are wondering if there's any solution or workaround
>> for this?
>>
>>
>> Thanks
>> Leon
>>
>


Re: How to process mini-batch events in Flink with Datastream API

2023-02-10 Thread Leon Xu
I wonder if windows would be the solution when it comes to the DataStream
API.

On Fri, Feb 10, 2023 at 12:07 PM Leon Xu  wrote:

> Hi Flink Users,
>
> We wanted to use Flink to run a decoration pipeline, where we would like
> to make calls to some external service to fetch data and alter the event in
> the Flink pipeline.
>
> Since there's an external service call involved, we want to batch the
> calls to reduce the load on the external service (batching multiple Flink
> events into a single external service call).
>
> It looks like mini-batching might be something we can leverage to achieve
> that, but that feature seems to only exist in the Table API. We are using
> the DataStream API and are wondering if there's any solution or workaround
> for this?
>
>
> Thanks
> Leon
>


How to process mini-batch events in Flink with Datastream API

2023-02-10 Thread Leon Xu
Hi Flink Users,

We wanted to use Flink to run a decoration pipeline, where we would like to
make calls to some external service to fetch data and alter the event in
the Flink pipeline.

Since there's an external service call involved, we want to batch the calls
to reduce the load on the external service (batching multiple Flink events
into a single external service call).

It looks like mini-batching might be something we can leverage to achieve
that, but that feature seems to only exist in the Table API. We are using
the DataStream API and are wondering if there's any solution or workaround
for this?


Thanks
Leon
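For context, the Table API mini-batch feature referred to in this thread is enabled through configuration options on the TableEnvironment. A minimal PyFlink sketch (the option keys come from the Table API configuration docs; the latency and size values are arbitrary examples, not recommendations):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a streaming TableEnvironment and enable mini-batch aggregation
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
config = t_env.get_config()
config.set("table.exec.mini-batch.enabled", "true")       # turn mini-batching on
config.set("table.exec.mini-batch.allow-latency", "5 s")  # max buffering delay
config.set("table.exec.mini-batch.size", "1000")          # max records per batch
```

There is no direct DataStream API equivalent of these options, which is why the thread turns to windows and Async I/O as a workaround.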


Pyflink Side Output Question and/or suggested documentation change

2023-02-10 Thread Andrew Otto
Question about side outputs and OutputTags in PyFlink. The docs say we are
supposed to

yield output_tag, value

Docs then say:
> For retrieving the side output stream you use getSideOutput(OutputTag) on
the result of the DataStream operation.

From this, I'd expect that calling datastream.get_side_output would be
optional. However, it seems that if you do not call
datastream.get_side_output, then the main datastream will still contain the
record destined for the output tag, as a Tuple(output_tag, value). This
caused me great confusion for a while, as my downstream tasks would break
because of the unexpected Tuple type of the record.

Here's an example of the failure using side output and ProcessFunction in
the word count example.


I'd expect that just yielding an output_tag would make those records be in
a different datastream, but apparently this is not the case unless you call
get_side_output.
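The behavior described above can be illustrated without Flink at all. The plain-Python sketch below (function and tag names are hypothetical, chosen only for this illustration) mimics a ProcessFunction yielding tagged tuples, and shows that until something splits the tagged tuples out of the stream, as get_side_output does, they remain in the main output:

```python
OUTPUT_TAG = "negatives"  # stand-in for a PyFlink OutputTag

def process(records):
    """Mimic a ProcessFunction that routes negative values to a side output
    by yielding (tag, value) tuples alongside ordinary records."""
    for r in records:
        if r < 0:
            yield (OUTPUT_TAG, r)  # tagged record, intended for the side output
        else:
            yield r

def split_side_output(stream, tag):
    """Mimic get_side_output: pull tagged tuples out of the main stream."""
    def is_tagged(r):
        return isinstance(r, tuple) and len(r) == 2 and r[0] == tag
    main = [r for r in stream if not is_tagged(r)]
    side = [r[1] for r in stream if is_tagged(r)]
    return main, side
```

Without the splitting step, the raw output of `process([1, -2, 3])` still contains the tuple `("negatives", -2)`, which matches the downstream type errors described above.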

If this is the expected behavior, perhaps the docs should be updated to say
so?

-Andrew Otto
 Wikimedia Foundation


How to deploy kubernetes flink operator on GKE

2023-02-10 Thread P Singh
Hi Team,

I have tried many ways to deploy the Flink Kubernetes operator on GKE (link,
link2, link3) but am unable to do so, and am getting the error: "Error:
INSTALLATION FAILED: failed to download "helm/flink-kubernetes-operator""

Can someone help me?
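For reference, the operator's documented Helm installation first registers the Apache download site as a chart repository; the error above is consistent with installing a chart reference before the repo has been added. A sketch assuming operator version 1.3.1 (substitute the release you actually need):

```shell
# Register the chart repo for a specific operator release (version is an example)
helm repo add flink-operator-repo \
  https://downloads.apache.org/flink/flink-kubernetes-operator-1.3.1/
helm repo update

# Install the operator chart from that repo
helm install flink-kubernetes-operator \
  flink-operator-repo/flink-kubernetes-operator
```

If the repo name in the install command does not match a previously added repo, Helm fails with a "failed to download" error like the one quoted above.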


apache/flink docker images arm64

2023-02-10 Thread Roberts, Ben (Senior Developer) via user
Hi,

Would it be possible for the arm64/v8 architecture images to be published to 
dockerhub apache/flink:1.16 and 1.16.1 please?

I’m aware that the official docker flink image is now published in the arm64 
arch, but that image doesn’t include a JDK, so it’d be super helpful to have 
the apache/flink images published for that arch too.

It looks like it’s listed as a target architecture in the metadata: 
https://github.com/apache/flink-docker/blob/e348fd602cfe038402aeb574d1956762f4175af0/1.16/scala_2.12-java11-ubuntu/release.metadata#L2
But I can’t find it available to pull anywhere in the repo.

Thanks
--
Ben Roberts
