Re: Recursive Split Detection + same split optimization

Benoit Tailhades Mon, 10 Jul 2023 01:25:52 -0700

Thank you Hang for your answer.

Regarding your proposal 2, implementing such logic will prevent
parallelizing on TM, since from the 1st ID, I will fetch n IDs, but with
this approach, all IDs will finally be managed by the same TM.
However, I am not totally satisfied with the 1st choice which is the one I
implemented because it relies on events mechanism which is a custom
solution. There is no such flink mechanism to allow what I am trying to
achieve ?
By the way, the solution works perfectly, but using events is for me like a
bypass to a missing functionality.


Thank you for your precious help.

Benoit

Le lun. 10 juil. 2023 à 09:19, Hang Ruan <ruanhang1...@gmail.com> a écrit :

> Hi, Benoit.
>
> A split enumerator responsible for discovering the source splits, and
> assigning them to the reader. It seems like that your connector discovering
> splits in TM and assigning them in JM.
>
> I think there are 2 choices:
> 1. If you need the enumerator to assign splits, you have to send the
> events about the splits between the source reader and the enumerator.
> 2. If you can make use of the subtaskId and let every reader read some
> scope of the IDs, the enumerator is useless for you.
>
> I am not sure whether you are able to move the discovering splits task
> back to the enumerator by multi thread. Putting it to the TM may be
> weird and error-prone.
>
> Finally, I have a second problem which is about avoiding extracting
>> multiple times the same split. We can imagine, based on my previous
>> explanation, that same ID might be detected through multiple parent splits.
>> To avoid losing time doing the same job multiple times, I need to avoid
>> extracting the same ID.
>> Actually, I am thinking about storing the already extracted ID into the
>> state and storing it into my state backend. What do you think about this ?
>>
>
> For choice 1, put the information in the enumerator's state.
> For choice 2, no need to consider that issue.
>
> Best,
> Hang
>
> Benoit Tailhades <benoit.tailha...@gmail.com> 于2023年7月10日周一 12:59写道：
>
>> Hello Everyone,
>>
>> I am trying to implement a custom source where split detection is an
>> expensive operation and I would like to benefit from the split reader
>> results to build my next splits.
>>
>> Basically, my source is taking as input an id from my external system,
>> let's name it ID1.
>>
>> From ID1, I can get a list of other sub splits but getting this list is
>> an expensive operation so I want it to be done on a task manager during the
>> split reading of ID1. Now we can imagine sub splits of ID1 are ID1.1 and
>> ID1.2.
>> So, to sum up my split reader of ID1 will be responsible for:
>> 1. Collecting content of ID1
>> 2. Producing n sub splits
>> Then, the split enumerator will receive these sub splits and schedule
>> ID1.1, ... ID1.n for split reading.
>>
>> As of now, I have implemented this mechanism using events between split
>> reader and split enumerator but I think there might be a better
>> architecture using Flink.
>>
>> Finally, I have a second problem which is about avoiding extracting
>> multiple times the same split. We can imagine, based on my previous
>> explanation, that same ID might be detected through multiple parent splits.
>> To avoid losing time doing the same job multiple times, I need to avoid
>> extracting the same ID.
>> Actually, I am thinking about storing the already extracted ID into the
>> state and storing it into my state backend. What do you think about this ?
>>
>> Thank you.
>>
>> Benoit
>>
>>

Re: Recursive Split Detection + same split optimization

Reply via email to