dongwoo.kim created FLINK-33715:
-----------------------------------

             Summary: Enhance history server to archive multiple histories per 
jobid
                 Key: FLINK-33715
                 URL: https://issues.apache.org/jira/browse/FLINK-33715
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
            Reporter: dongwoo.kim


Hello Flink team,

I'd like to propose an improvement to how the job manager archives job 
histories and how the Flink history server fetches them.
Currently, only one job history per jobid can be archived and fetched.
When a Flink job tries to archive its history more than once, a 
'FileAlreadyExistsException' usually occurs.
This makes sense in most cases, since a job typically gets a new ID when it 
is restarted from the latest checkpoint/savepoint.

However, there's a specific situation where this behavior can be problematic:

1) When we upgrade a job using the savepoint mode, the job's first history gets 
successfully archived.
2) If the same job later fails due to an error, its history isn't archived 
again because there's already a record with the same job ID.

This can be an issue because the most valuable information – why the job failed 
– gets lost.

As a simple fix, I suggest including currentTimeMillis in the history 
filename along with the jobid ( {jobid}-{currentTimeMillis} ).
On the fetching side, the history server would parse the jobid before the "-" 
delimiter and fetch all the histories for that jobid.
For the UI, we can keep the current display or enhance it with an extra level 
of hierarchy per jobid, since each jobid can now have multiple histories.
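
To make the naming scheme concrete, here is a minimal sketch in Java of how the 
filename could be built and parsed. The class and method names 
(HistoryArchiveNames, archiveFileName, jobIdOf, groupByJobId) are hypothetical 
and not part of Flink; the sketch also assumes the jobid itself never contains 
"-", so the first "-" always marks the boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not existing Flink code: illustrates the proposed
// {jobid}-{currentTimeMillis} naming scheme only.
class HistoryArchiveNames {
    static final char DELIMITER = '-';

    // Archiving side: build a unique file name for one archive attempt.
    static String archiveFileName(String jobId, long timestampMillis) {
        return jobId + DELIMITER + timestampMillis;
    }

    // Fetching side: recover the jobid as everything before the first "-".
    // A name without a delimiter is treated as an old-style archive
    // whose file name is the bare jobid.
    static String jobIdOf(String fileName) {
        int idx = fileName.indexOf(DELIMITER);
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    // Fetching side: collect all archive file names belonging to each jobid.
    static Map<String, List<String>> groupByJobId(List<String> fileNames) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String name : fileNames) {
            grouped.computeIfAbsent(jobIdOf(name), k -> new ArrayList<>()).add(name);
        }
        return grouped;
    }
}
```

On the archiving side this would be called as 
archiveFileName(jobId, System.currentTimeMillis()). Treating names without a 
delimiter as plain jobids is one possible way to keep already archived 
histories readable.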

If we can reach an agreement, I'll be glad to take on the implementation.
Thanks in advance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
