Re: Operator state in New Source API

Krzysztof Chmielewski Thu, 23 Dec 2021 03:12:52 -0800

Thank you both,
yes seems that the only option on a non keyed operate would be List State,
my bad.


Yun Gao,
I'm wondering from where you get the information that " Flink only support
in-memory operator state", can you point me to the documentation that says
that?
I cannot find any mention in the documentation about it regarding regular
operator state.
I know that Broadcast State which is special type of an Operator State is
kept in-memory [1].

What I was hoping to do is something similar to what is described here [2]
- Statefulf Source Functions. The List State in that example is really
always kept in memory?

Additionally I'm wondering Is it even possible to do something like [2] in
source that is implementing the new Source API [3]? Especially in Source
Enumerator implementation.

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/fault-tolerance/broadcast_state/#important-considerations
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/fault-tolerance/state/#stateful-source-functions
[3]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/sources/

Thanks,
Krzysztof Chmielewski




czw., 23 gru 2021 o 07:58 Yun Gao <yungao...@aliyun.com> napisał(a):

> Hi Krzysztof,
>
> If I understand right, I think managed operator state might not help here
> since currently Flink
> only support in-memory operator state.
>
> Is it possible currently we first have a customized SplitEnumerator to
> skip the processed files
> in some other way? For example, if these files have different created
> time, we may process them
> in time order, and only maintains the latest file created time and the
> list of processed files with the
> same time.
>
> Best,
> Yun
>
> ------------------Original Mail ------------------
> *Sender:*Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com>
> *Send Date:*Thu Dec 23 06:33:07 2021
> *Recipients:*user <user@flink.apache.org>
> *Subject:*Operator state in New Source API
>
>> Hi,
>> Is it possible to use managed operator state like MapState in an
>> implementation of new unified source interface [1]. I'm especially
>> interested with using Managed State in SplitEnumerator implementation.
>>
>> I have a use case that is a variation of File Source where I will have a
>> great number of files that I need to process, for example a million. I know
>> that FileSource maintains a collection of already processed paths
>> in ContinuousFileSplitEnumerator object.
>>
>> In my case I cannot afford to have all million Strings sitting on my
>> heap. I'm hoping to use an operator state for this and build splits in
>> batches, periodically adding new files to the alreadyProcessedPaths
>> collection.
>>
>> Regards,
>> Krzysztof Chmielewski
>>
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/sources/
>>
>

Re: Operator state in New Source API

Reply via email to