Hi Arjun,
Flink's FileSource doesn't move or delete files at the moment. It leaves the
files in place and remembers the names of the files it has already read in
checkpointed state, so that it doesn't read the same file twice, even after a
restart.
Flink's source API works as follows: a single SplitEnumerator runs on the
JobManager. The enumerator is responsible for listing the files and splitting
them into smaller units. Such a unit can be a complete file (for row formats)
or a split within a file (for bulk formats). The actual reading is done by
SplitReaders running in the TaskManagers. This design means that only the
reading itself happens concurrently, while the enumerator can track when each
file has been fully consumed.
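To make the pattern above concrete, here is a toy sketch in plain Java. It is not Flink's real API (Flink's actual classes are SplitEnumerator and SourceReader); it only illustrates the two ideas from the explanation: a single enumerator lists files and assigns them to parallel readers, and a set of already-processed file names (the part Flink keeps in checkpointed state) prevents any file from being read twice after a restart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the split-based source pattern (NOT Flink's real classes).
public class SplitAssignmentSketch {

    // State that Flink would snapshot in a checkpoint: files already handed out.
    private final Set<String> processedFiles = new HashSet<>();

    // Maps reader index -> files assigned to it (round-robin), skipping any
    // file already recorded in state so nothing is read twice on restart.
    public Map<Integer, List<String>> assign(List<String> listedFiles, int numReaders) {
        Map<Integer, List<String>> assignments = new HashMap<>();
        int next = 0;
        for (String file : listedFiles) {
            if (!processedFiles.add(file)) {
                continue; // already in checkpointed state; skip it
            }
            assignments.computeIfAbsent(next % numReaders, k -> new ArrayList<>())
                       .add(file);
            next++;
        }
        return assignments;
    }
}
```

With parallelism 2 and three listed files, two go to reader 0 and one to reader 1; if the directory is listed again after a simulated restart, only genuinely new files are assigned.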
You can read more about Flink sources and the FileSystem connector in the
Flink documentation; the FileSystem connector provides a unified Source and
Sink for BATCH and STREAMING execution.
On Thursday, 26 October, 2023 at 06:53:23 pm IST, arjun s
<[email protected]> wrote:
Hello team,
I'm currently in the process of configuring a Flink job. This job entails
reading files from a specified directory and then transmitting the data to a
Kafka sink. I've already successfully designed a Flink job that reads the file
contents in a streaming manner and effectively sends them to Kafka. However, my
specific requirement is a bit more intricate. I need the job to not only read
these files and push the data to Kafka but also relocate the processed file to
a different directory once all of its contents have been processed. Following
this, the job should seamlessly transition to processing the next file in the
source directory. Additionally, I have some concerns regarding how the job will
behave if it encounters a restart. Could you please advise if this is
achievable, and if so, provide guidance or code to implement it?
I'm also quite interested in how the job will handle situations where the
source has a parallelism greater than 2 or 3, and how it can accurately monitor
the completion of reading all contents in each file.
Thanks and Regards,
Arjun