[ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069198#comment-17069198
 ] 

Yiqun Lin commented on HDDS-3241:
---------------------------------

Thanks for the comments, [~elek] / [~msingh].

Actually current SCM safemode can also protect this behavior once we startup 
SCM with wrong container/pipeline db files. And then leads large containers 
deleted.  This should not happen because SCM won't exit safemode firstly since 
DN containers reported will not reach the safemode threshold anyway.

Also I have mentioned another case that in large clusters, the node sent to 
repair and come back to cluster again. SCM deletion behavior can help 
automation cleanup Datanode stale container datas. This is also one common 
cases.

I have updated the PR to make this configurable and disabled by default. Please 
help have a look, thanks.

> Invalid container reported to SCM should be deleted
> ---------------------------------------------------
>
>                 Key: HDDS-3241
>                 URL: https://issues.apache.org/jira/browse/HDDS-3241
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>    Affects Versions: 0.4.1
>            Reporter: Yiqun Lin
>            Assignee: Yiqun Lin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For the invalid or out-updated container reported by Datanode, 
> ContainerReportHandler in SCM only prints error log and doesn't 
>  take any action.
> {noformat}
> 2020-03-15 05:19:41,072 ERROR 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received 
> container report for an unknown container 37 from datanode 
> 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
> networkLocation: /dc2/rack1, certSerialId: null}.
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container 
> with id #37 not found.
>         at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
>         at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
>         at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
>         at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
>         at 
> org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
>         at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
>         at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
>         at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
>         at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2020-03-15 05:19:41,073 ERROR 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received 
> container report for an unknown container 38 from datanode 
> 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
> networkLocation: /dc2/rack1, certSerialId: null}.
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container 
> with id #38 not found.
>         at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
>         at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
>         at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
>         at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
>         at 
> org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
>         at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
>         at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
>         at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
>         at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> Actually SCM should inform Datanode to delete its outdated container. 
> Otherwise, Datanode will always report this invalid container and this dirty 
> container data will be always kept in Datanode. Sometimes, we bring back a 
> node that be repaired and it maybe stores stale data and we should have a way 
> to auto-cleanup them.
> We could have a setting to control this auto-deletion behavior if this is a 
> little risk approach.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org

Reply via email to