Hi team,
I'm also interested in finding out if there is Java code available to
determine the extent to which a Flink job has processed files within a
directory. Additionally, I'm curious about where the details of the
processed files are stored within Flink.
Thanks and regards,
Arjun S
On Mon, 30
Hi team,
I appreciate the information provided. I'm inquiring whether there exists a
method to automatically relocate processed files from a directory once a
Flink job has completed processing them. I'm particularly keen on
understanding how this particular use case is currently managed in
product
> Or was it the querying of the checkpoints you were advising against?
Yes, I meant the approach, not file removal itself. Mainly because how
exactly FileSource stores its state is an implementation detail and there
are no external guarantees for its consistency between even the minor
versions.
On
> This is not a robust solution, I would advise against it.
Oh no? Am curious as to why not. It seems not dissimilar to how Kafka
topic retention works: the messages are removed after some time period
(hopefully after they are processed), so why would it be bad to remove
files that are already pr
> I wonder if you could use this fact to query the committed checkpoints
and move them away after the job is done.
This is not a robust solution, I would advise against it.
Best,
Alexander
On Fri, 27 Oct 2023 at 16:41, Andrew Otto wrote:
> For moving the files:
> > It will keep the files as is
For moving the files:
> It will keep the files as is and remember the name of the file read in
checkpointed state to ensure it doesnt read the same file twice.
I wonder if you could use this fact to query the committed checkpoints and
move them away after the job is done. I think it should even b
Hi team, Thanks for your quick response.
I have an inquiry regarding file processing in the event of a job restart.
When the job is restarted, we encounter challenges in tracking which files
have been processed and which remain pending. Is there a method to
seamlessly resume processing files from w
Hi Arjun,
Flink's FileSource doesnt move or delete the files as of now. It will keep the
files as is and remember the name of the file read in checkpointed state to
ensure it doesnt read the same file twice.
Flink's source API works in a way that single Enumerator operates on the
JobManager. Th
Flink's FileSource will enumerate the files and keep track of the progress
in parallel for the individual files. Depending on the format you use, the
progress is tracked at the different level of granularity (TextLine being
the simplest one that tracks the progress based on the number of lines
proc