dongwoo.kim created FLINK-33715:
-----------------------------------
Summary: Enhance history server to archive multiple histories per
jobid
Key: FLINK-33715
URL: https://issues.apache.org/jira/browse/FLINK-33715
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: dongwoo.kim
Hello Flink team,
I'd like to propose an improvement to how the job manager archives job
histories and how the Flink history server fetches them.
Currently, only one job history per job ID can be archived and
fetched.
When a Flink job tries to archive its history more than once, a
'FileAlreadyExistsException' usually occurs.
This makes sense in most cases, since a job typically gets a new ID when it
is restarted from the latest checkpoint/savepoint.
However, there's a specific situation where this behavior can be problematic:
1) When we upgrade a job using the savepoint mode, the job's first history gets
successfully archived.
2) If the same job later fails due to an error, its history isn't archived
again because there's already a record with the same job ID.
This can be an issue because the most valuable information – why the job failed
– gets lost.
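The collision described above can be reproduced in isolation. The following is a minimal sketch that uses java.nio.file as a stand-in for Flink's own file system layer. The class ArchiveCollision and its methods are hypothetical and not part of Flink, and the sketch assumes the archive is a single file named after the job ID that is never overwritten.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ArchiveCollision {

    // Hypothetical stand-in for archiving: one file per job ID, never overwritten.
    // Returns true if the archive was written, false if one already existed.
    static boolean archive(Path archiveDir, String jobId) {
        try {
            Files.createFile(archiveDir.resolve(jobId));
            return true;
        } catch (FileAlreadyExistsException e) {
            // The second attempt for the same job ID ends up here,
            // so the later (failure) history is never stored.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Archives the same job ID twice into a fresh temporary directory
    // and reports whether each attempt succeeded.
    static boolean[] archiveTwice(String jobId) {
        try {
            Path dir = Files.createTempDirectory("archive-demo");
            boolean first = archive(dir, jobId);
            boolean second = archive(dir, jobId);
            return new boolean[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = archiveTwice("0123456789abcdef0123456789abcdef");
        System.out.println("first archived: " + r[0] + ", second archived: " + r[1]);
    }
}
```

With a file name derived only from the job ID, the first attempt succeeds and the second is rejected, which mirrors steps 1) and 2).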
As a simple fix, I suggest including currentTimeMillis in the history
filename along with the job ID ({jobid}-{currentTimeMillis}).
On the fetching side, the history server would parse the job ID before the
"-" delimiter and fetch all the histories for that job ID.
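The naming scheme and the parsing step could look roughly like the sketch below. The class ArchiveNames and its methods are hypothetical names chosen for illustration, not existing Flink code. The sketch assumes the job ID string itself contains no "-" and treats a name without a delimiter as an archive written under the current naming scheme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ArchiveNames {

    // Builds the proposed archive file name: {jobid}-{currentTimeMillis}.
    static String archiveName(String jobId, long timestampMillis) {
        return jobId + "-" + timestampMillis;
    }

    // Extracts the job ID, i.e. everything before the first "-".
    // A name without "-" is returned unchanged (archive in the old format).
    static String jobIdOf(String archiveName) {
        int idx = archiveName.indexOf('-');
        return idx < 0 ? archiveName : archiveName.substring(0, idx);
    }

    // Groups archive file names by job ID so that all histories
    // belonging to one job can be fetched together.
    static Map<String, List<String>> groupByJobId(List<String> archiveNames) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String name : archiveNames) {
            grouped.computeIfAbsent(jobIdOf(name), k -> new ArrayList<>()).add(name);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String jobId = "0123456789abcdef0123456789abcdef";
        List<String> names = new ArrayList<>();
        names.add(archiveName(jobId, System.currentTimeMillis()));
        names.add(archiveName(jobId, System.currentTimeMillis() + 1));
        System.out.println(groupByJobId(names));
    }
}
```

Two archives written at the same millisecond for the same job would still collide under this scheme, so a real implementation may need a tie-breaker.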
For the UI, we can keep the current display or enhance it by adding an extra
level of hierarchy under each job ID, since each job ID can now have multiple
histories.
If we can reach an agreement, I'll be glad to take on the implementation.
Thanks in advance.