Hello Everyone,

I am trying to implement a custom source where split detection is an
expensive operation and I would like to benefit from the split reader
results to build my next splits.

Basically, my source is taking as input an id from my external system,
let's name it ID1.

>From ID1, I can get a list of other sub splits but getting this list is an
expensive operation so I want it to be done on a task manager during the
split reading of ID1. Now we can imagine sub splits of ID1 are ID1.1 and
ID1.2.
So, to sum up my split reader of ID1 will be responsible for:
1. Collecting content of ID1
2. Producing n sub splits
Then, the split enumerator will receive these sub splits and schedule
ID1.1, ... ID1.n for split reading.

As of now, I have implemented this mechanism using events between split
reader and split enumerator but I think there might be a better
architecture using Flink.

Finally, I have a second problem which is about avoiding extracting
multiple times the same split. We can imagine, based on my previous
explanation, that same ID might be detected through multiple parent splits.
To avoid losing time doing the same job multiple times, I need to avoid
extracting the same ID.
Actually, I am thinking about storing the already extracted ID into the
state and storing it into my state backend. What do you think about this ?

Thank you.

Benoit

Reply via email to