[jira] [Commented] (FLINK-9381) BlobServer data for a job is not getting cleaned up at JM

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478808#comment-16478808
 ] 

ASF GitHub Bot commented on FLINK-9381:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/6030


> BlobServer data for a job is not getting cleaned up at JM
> ---------------------------------------------------------
>
>                 Key: FLINK-9381
>                 URL: https://issues.apache.org/jira/browse/FLINK-9381
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.5.0
>         Environment: Flink 1.5.0 RC3 Commit e725269
>            Reporter: Amit Jain
>            Assignee: Till Rohrmann
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> We are running Flink 1.5.0 RC3 with YARN as the cluster manager and found
> that the Job Manager is getting killed due to an out-of-disk error.
>
> Upon further analysis, we found that blob server data for a job is not
> getting cleaned up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9381) BlobServer data for a job is not getting cleaned up at JM

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478670#comment-16478670
 ] 

ASF GitHub Bot commented on FLINK-9381:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/6030

[FLINK-9381] Release blobs after job termination

## What is the purpose of the change

Properly remove job blobs from the BlobServer after the job terminates. If the 
job reaches a globally terminal state, the HA blob store files are also 
cleared. If the job is suspended or has not finished (e.g. another process 
finishes the job concurrently), only the local blob server files are removed.

Additionally, we properly release the user code class loader registered in 
the JobManagerRunner when it closes.

Moreover, this commit extends the `BlobServer#cleanupJob` method to take a 
second argument which specifies whether the `BlobStore` files shall be cleaned 
up or not.
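
The rule described above can be sketched roughly as follows. This is a minimal, self-contained model and not the code from the pull request: `SketchBlobServer`, `SketchDispatcher`, the `String` job IDs, and the `boolean` return value are all hypothetical stand-ins. Only the idea of a `cleanupJob` call with a second flag for the `BlobStore` files comes from the description.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a blob server; NOT Flink's actual BlobServer. */
class SketchBlobServer {
    // Job IDs that still have blobs on local disk (assumed model).
    final Set<String> localBlobs = new HashSet<>();
    // Job IDs that still have blobs in the HA blob store (assumed model).
    final Set<String> haBlobs = new HashSet<>();

    /** Stores blobs for a job both locally and in the HA store. */
    void put(String jobId) {
        localBlobs.add(jobId);
        haBlobs.add(jobId);
    }

    /**
     * Local files are always removed; HA blob store files are removed
     * only when cleanupBlobStoreFiles is true.
     */
    boolean cleanupJob(String jobId, boolean cleanupBlobStoreFiles) {
        boolean removed = localBlobs.remove(jobId);
        if (cleanupBlobStoreFiles) {
            removed |= haBlobs.remove(jobId);
        }
        return removed;
    }
}

/** Hypothetical dispatcher that triggers the cleanup when a job is removed. */
class SketchDispatcher {
    private final SketchBlobServer blobServer;

    SketchDispatcher(SketchBlobServer blobServer) {
        this.blobServer = blobServer;
    }

    /** Clears HA files only if the job reached a globally terminal state. */
    void removeJob(String jobId, boolean globallyTerminal) {
        blobServer.cleanupJob(jobId, globallyTerminal);
    }
}

class BlobCleanupSketch {
    public static void main(String[] args) {
        SketchBlobServer server = new SketchBlobServer();
        SketchDispatcher dispatcher = new SketchDispatcher(server);
        server.put("finished-job");
        server.put("suspended-job");

        dispatcher.removeJob("finished-job", true);   // globally terminal
        dispatcher.removeJob("suspended-job", false); // suspended

        System.out.println("local: " + server.localBlobs);
        System.out.println("ha:    " + server.haBlobs);
    }
}
```

In this model a suspended job keeps its HA entry, so that another process could still recover it, while its local files no longer occupy disk space.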

## Brief change log

- Properly deregister user code class loader from `LibraryCacheManager` in 
`JobManagerRunner`
- Remove BlobServer files if the job is removed from the `Dispatcher` in 
the `removeJob` method
- Remove HA `BlobStore` files if the job reached a globally terminal state
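
The first item in the list above, deregistering the user code class loader when the runner closes, might look roughly like the sketch below. All class and method names here (`SketchLibraryCache`, `SketchJobRunner`, `registerJob`, `unregisterJob`, `hasClassLoader`) are hypothetical and chosen for illustration; they are not taken from the pull request or from Flink's `LibraryCacheManager` interface.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a library cache; NOT Flink's LibraryCacheManager. */
class SketchLibraryCache {
    // Jobs that currently hold a registered user code class loader (assumed model).
    private final Set<String> registeredJobs = new HashSet<>();

    void registerJob(String jobId) {
        registeredJobs.add(jobId);
    }

    void unregisterJob(String jobId) {
        registeredJobs.remove(jobId);
    }

    boolean hasClassLoader(String jobId) {
        return registeredJobs.contains(jobId);
    }
}

/** Hypothetical runner that holds one registration for its whole lifetime. */
class SketchJobRunner implements AutoCloseable {
    private final SketchLibraryCache cache;
    private final String jobId;

    SketchJobRunner(SketchLibraryCache cache, String jobId) {
        this.cache = cache;
        this.jobId = jobId;
        // Register on construction so the class loader exists while the job runs.
        cache.registerJob(jobId);
    }

    @Override
    public void close() {
        // Deregister on close so the class loader can be released.
        cache.unregisterJob(jobId);
    }
}

class LibraryCacheSketch {
    public static void main(String[] args) {
        SketchLibraryCache cache = new SketchLibraryCache();
        try (SketchJobRunner runner = new SketchJobRunner(cache, "job-a")) {
            System.out.println("while running: " + cache.hasClassLoader("job-a"));
        }
        System.out.println("after close:   " + cache.hasClassLoader("job-a"));
    }
}
```

Tying the deregistration to `close()` keeps registration and release in one place, which is the property the added `JobManagerRunnerTest#testLibraryCacheManagerRegistration` test presumably checks.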

## Verifying this change

- Added `JobManagerRunnerTest#testLibraryCacheManagerRegistration`
- Added `DispatcherResourceCleanupTest`

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink fixBlobRelease

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6030.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6030


commit 7b77fc85010c8831ecc3704a773f0f944da838a5
Author: Till Rohrmann 
Date:   2018-05-17T06:58:07Z

[FLINK-9381] Release blobs after job termination

Properly remove job blobs from the BlobServer after the job terminates. If the 
job reaches a globally terminal state, the HA blob store files are also 
cleared. If the job is suspended or has not finished (e.g. another process 
finishes the job concurrently), only the local blob server files are removed.

Additionally, we properly release the user code class loader registered in 
the JobManagerRunner when it closes.









[jira] [Commented] (FLINK-9381) BlobServer data for a job is not getting cleaned up at JM

2018-05-16 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477600#comment-16477600
 ] 

Till Rohrmann commented on FLINK-9381:
--

I think we don't properly clean up the job resources after the job has 
finished. I think this is indeed a release blocker. 

I will fix the problem.






[jira] [Commented] (FLINK-9381) BlobServer data for a job is not getting cleaned up at JM

2018-05-16 Thread Chesnay Schepler (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477313#comment-16477313
 ] 

Chesnay Schepler commented on FLINK-9381:
-

Searching for {{BlobServer#cleanupJob}} only yields results in the legacy 
{{JobManager}}; it is never called by the {{JobMaster}} or {{Dispatcher}}.

[~till.rohrmann] Is this simply missing or are we cleaning up things in another 
way?



