Hi Arjun,
Flink's FileSource doesn't move or delete files at the moment. It leaves the
files in place and remembers the names of the files it has already read in
checkpointed state, so that it doesn't read the same file twice, even after a
restart.
Flink's source API works as follows: a single SplitEnumerator runs on the
JobManager. The enumerator is responsible for listing the files and splitting
them into smaller units. Such a unit can be a complete file (for row formats)
or a split within a file (for bulk formats). The actual reading is done by
SplitReaders running in the TaskManagers. This design means that only the
reading itself happens concurrently, while the enumerator can track when each
file has been fully consumed.
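To make the pattern above concrete, here is a toy sketch in plain Java. It is not Flink's real API (Flink's actual classes are SplitEnumerator and SourceReader); it only illustrates the two ideas from the explanation: a single enumerator lists files and assigns them to parallel readers, and a set of already-processed file names (the part Flink keeps in checkpointed state) prevents any file from being read twice after a restart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the split-based source pattern (NOT Flink's real classes).
public class SplitAssignmentSketch {

    // State that Flink would snapshot in a checkpoint: files already handed out.
    private final Set<String> processedFiles = new HashSet<>();

    // Maps reader index -> files assigned to it (round-robin), skipping any
    // file already recorded in state so nothing is read twice on restart.
    public Map<Integer, List<String>> assign(List<String> listedFiles, int numReaders) {
        Map<Integer, List<String>> assignments = new HashMap<>();
        int next = 0;
        for (String file : listedFiles) {
            if (!processedFiles.add(file)) {
                continue; // already in checkpointed state; skip it
            }
            assignments.computeIfAbsent(next % numReaders, k -> new ArrayList<>())
                       .add(file);
            next++;
        }
        return assignments;
    }
}
```

With parallelism 2 and three listed files, two go to reader 0 and one to reader 1; if the directory is listed again after a simulated restart, only genuinely new files are assigned.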
You can read more about Flink sources and the FileSystem connector in the
Flink documentation; the FileSystem connector provides a unified Source and
Sink for BATCH and STREAMING execution.
On Thursday, 26 October, 2023 at 06:53:23 pm IST, arjun s
<[email protected]> wrote:
Hello team,
I'm currently in the process of configuring a Flink job. This job entails
reading files from a specified directory and then transmitting the data to a
Kafka sink. I've already successfully designed a Flink job that reads the file
contents in a streaming manner and effectively sends them to Kafka. However, my
specific requirement is a bit more intricate. I need the job to not only read
these files and push the data to Kafka but also relocate the processed file to
a different directory once all of its contents have been processed. Following
this, the job should seamlessly transition to processing the next file in the
source directory. Additionally, I have some concerns regarding how the job will
behave if it encounters a restart. Could you please advise if this is
achievable, and if so, provide guidance or code to implement it?
I'm also quite interested in how the job will handle situations where the
source has a parallelism greater than 2 or 3, and how it can accurately monitor
the completion of reading all contents in each file.
Thanks and Regards,
Arjun