Hi Lu,

I think it's OK to choose either way as long as it works.
Though I'm not sure how you would extend SplittableIterator in your case.
The underlying format there is ParallelIteratorInputFormat, and its
processing is not tied to a particular subtask index.
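
To illustrate the point: a SplittableIterator only decides how the data is
cut into partitions via split(numPartitions); ParallelIteratorInputFormat
then hands those partitions out as generic splits, so which subtask reads
which partition is not under your control. A minimal sketch of what
extending it could look like (the file list and the chunking logic here are
just illustrative):

import org.apache.flink.util.SplittableIterator;

import java.util.Arrays;
import java.util.Iterator;

// Illustrative only: the iterator controls how the data is cut into
// partitions, but ParallelIteratorInputFormat decides which subtask
// ends up reading which partition.
public class FileNameIterator extends SplittableIterator<String> {

    private final String[] files = {"file01", "file02", "file03", "file04"};
    private int pos = 0;

    @Override
    @SuppressWarnings("unchecked")
    public Iterator<String>[] split(int numPartitions) {
        // Cut the file list into numPartitions consecutive chunks.
        Iterator<String>[] parts = new Iterator[numPartitions];
        int chunk = (files.length + numPartitions - 1) / numPartitions;
        for (int i = 0; i < numPartitions; i++) {
            int from = Math.min(i * chunk, files.length);
            int to = Math.min(from + chunk, files.length);
            parts[i] = Arrays.asList(Arrays.copyOfRange(files, from, to)).iterator();
        }
        return parts;
    }

    @Override
    public int getMaximumNumberOfSplits() {
        return files.length;
    }

    @Override
    public boolean hasNext() {
        return pos < files.length;
    }

    @Override
    public String next() {
        return files[pos++];
    }
}

So even if split() groups file01/file02 and file03/file04 together, there is
no guarantee that subtask 0 is the one that ends up reading the first group.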

Thanks,
Zhu Zhu

On Fri, Aug 16, 2019 at 12:48 AM Lu Niu <qqib...@gmail.com> wrote:

> Hi, Zhu
>
> Thanks for the reply! I found that using SplittableIterator is also doable
> to some extent. How should I choose between these two?
>
> Best
> Lu
>
> On Wed, Aug 14, 2019 at 8:02 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Hi Lu,
>>
>> Implementing your own *InputFormat* together with the *InputSplitAssigner*
>> it creates (which has the interface "InputSplit getNextInputSplit(String
>> host, int taskId)") should work if you want to assign InputSplits to tasks
>> according to the task index and file name patterns.
>> To hand out 2 *InputSplit*s in one request, you can implement a new
>> *InputSplit* that wraps multiple *FileInputSplit*s. You may also need to
>> define in your *InputFormat* how to process the new *InputSplit*.
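>>
>> A rough sketch of that idea (the names MultiFileInputSplit and
>> TaskIndexAssigner are made up for illustration; also note that newer Flink
>> versions declare additional methods on InputSplitAssigner, e.g. for
>> returning splits, which would need implementing as well):
>>
>> import org.apache.flink.core.fs.FileInputSplit;
>> import org.apache.flink.core.io.InputSplit;
>> import org.apache.flink.core.io.InputSplitAssigner;
>>
>> import java.util.ArrayDeque;
>> import java.util.Arrays;
>> import java.util.HashMap;
>> import java.util.List;
>> import java.util.Map;
>> import java.util.Queue;
>>
>> // Wraps several FileInputSplits so that one split request hands a whole
>> // group of files (e.g. file01 + file02) to a task at once.
>> class MultiFileInputSplit implements InputSplit {
>>
>>     private final int splitNumber;
>>     private final List<FileInputSplit> files;
>>
>>     MultiFileInputSplit(int splitNumber, FileInputSplit... files) {
>>         this.splitNumber = splitNumber;
>>         this.files = Arrays.asList(files);
>>     }
>>
>>     List<FileInputSplit> getFiles() {
>>         return files;
>>     }
>>
>>     @Override
>>     public int getSplitNumber() {
>>         return splitNumber;
>>     }
>> }
>>
>> // Assigns splits purely by task index: the split whose splitNumber equals
>> // the requesting taskId is returned, regardless of host.
>> class TaskIndexAssigner implements InputSplitAssigner {
>>
>>     private final Map<Integer, Queue<MultiFileInputSplit>> splitsByTask =
>>         new HashMap<>();
>>
>>     TaskIndexAssigner(MultiFileInputSplit[] splits) {
>>         for (MultiFileInputSplit split : splits) {
>>             splitsByTask
>>                 .computeIfAbsent(split.getSplitNumber(), k -> new ArrayDeque<>())
>>                 .add(split);
>>         }
>>     }
>>
>>     @Override
>>     public InputSplit getNextInputSplit(String host, int taskId) {
>>         Queue<MultiFileInputSplit> queue = splitsByTask.get(taskId);
>>         return (queue == null || queue.isEmpty()) ? null : queue.poll();
>>     }
>> }
>>
>> Your custom InputFormat would then build one MultiFileInputSplit per
>> subtask in createInputSplits(...), return a TaskIndexAssigner from
>> getInputSplitAssigner(...), and read back and forth between the wrapped
>> FileInputSplits in open()/nextRecord().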
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Thu, Aug 15, 2019 at 12:26 AM Lu Niu <qqib...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a data set backed by a directory of files in which file names are
>>> meaningful.
>>>
>>> folder1
>>>    +-----file01
>>>    +-----file02
>>>    +-----file03
>>>    +-----file04
>>>
>>> I want to control the file assignment in my Flink application. For
>>> example, when the parallelism is 2, worker 1 gets file01 and file02 to read
>>> and worker 2 gets file03 and file04. Each worker should also receive its 2
>>> files all at once, because reading requires jumping back and forth between
>>> those two files.
>>>
>>> What's the best way to do this? It seems like FileInputFormat is not
>>> extensible enough for this case.
>>>
>>> Best
>>> Lu
>>>
>>>
>>>
