[ 
https://issues.apache.org/jira/browse/FLINK-33715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongwoo.kim updated FLINK-33715:
--------------------------------
    Description: 
Hello Flink team,

I'd like to propose an improvement to how the JobManager archives job
histories and how the Flink History Server fetches them.

*Currently, only one job history per job ID can be archived and fetched.*
When a Flink job tries to archive its history more than once, the later
attempt usually fails with a 'FileAlreadyExistsException'.
This makes sense in most cases, since a job typically gets a new ID when it
is restarted from the latest checkpoint/savepoint.
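
To make the collision concrete, here is a minimal sketch. It uses plain
java.nio as a stand-in and is not Flink's actual archiving code; the class
name, the method name, and the sample job ID are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo; java.nio stands in for the archive file system. */
final class ArchiveCollisionDemo {

    /**
     * Tries to create an archive file named by the job ID only.
     * Returns false when a file with that name already exists.
     */
    static boolean tryArchive(Path archiveDir, String jobId) throws IOException {
        try {
            Files.createFile(archiveDir.resolve(jobId));
            return true;
        } catch (FileAlreadyExistsException e) {
            // A second history for the same job ID is rejected here.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("archive-demo");
        String jobId = "00000000000000000000000000000001"; // made-up job ID
        System.out.println("first attempt archived:  " + tryArchive(dir, jobId));
        System.out.println("second attempt archived: " + tryArchive(dir, jobId));
    }
}
```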

 

*_However, there's a specific situation where this behavior can be 
problematic:_*

*_1) When we upgrade a job using the savepoint mode, the job's first history 
gets successfully archived._*
*_2) If the same job later fails due to an error, its history isn't archived 
again because there's already a record with the same job ID._*

This can be an issue because the most valuable information – why the job failed 
– gets lost.

 

As a simple fix, I suggest including currentTimeMillis in the history
filename along with the job ID ( {jobid}-{currentTimeMillis} ).
On the fetching side, the history server would parse the job ID before the
*"-"* delimiter and fetch all the histories for that job ID.
For the UI, we can keep the current display, or enhance it by adding an extra
level of hierarchy per job ID, since each job ID can now have multiple
histories.
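
The naming and parsing idea can be sketched as follows. This is only an
illustration under stated assumptions, not a proposed patch: the class and
method names are hypothetical, and it assumes a job ID is a hex string that
never contains "-", so the first "-" separates the job ID from the timestamp.
File names without a delimiter are treated as existing archives named by job
ID only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for the {jobid}-{currentTimeMillis} scheme; not part of Flink. */
final class ArchiveNames {
    private static final char DELIMITER = '-';

    private ArchiveNames() {}

    /** Builds the archive file name "{jobid}-{timestampMillis}". */
    static String fileNameFor(String jobId, long timestampMillis) {
        return jobId + DELIMITER + timestampMillis;
    }

    /**
     * Returns the part of the file name before the first delimiter.
     * A name without a delimiter is returned unchanged (archive named by job ID only).
     */
    static String jobIdOf(String fileName) {
        int idx = fileName.indexOf(DELIMITER);
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    /** Groups archive file names by job ID, keeping the input order within each group. */
    static Map<String, List<String>> groupByJobId(List<String> fileNames) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String name : fileNames) {
            grouped.computeIfAbsent(jobIdOf(name), k -> new ArrayList<>()).add(name);
        }
        return grouped;
    }
}
```

On the archiving side, the name would be produced with something like
ArchiveNames.fileNameFor(jobId, System.currentTimeMillis()), and the fetching
side would call groupByJobId on the listed file names.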

 

If we can reach an agreement, I'd be glad to take on the implementation.
Thanks in advance.

> Enhance history server to archive multiple histories per jobid
> --------------------------------------------------------------
>
>                 Key: FLINK-33715
>                 URL: https://issues.apache.org/jira/browse/FLINK-33715
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: dongwoo.kim
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
