[GitHub] [hadoop-ozone] sodonnel commented on pull request #1338: HDDS-4023. Delete closed container after all blocks have been deleted.

2020-09-29 Thread GitBox


sodonnel commented on pull request #1338:
URL: https://github.com/apache/hadoop-ozone/pull/1338#issuecomment-700582338


   > There is the following logic in ReplicationManager, which will handle 
replicas reported while the container state is DELETING.
   
   Sorry I missed that. You are correct. I am +1 on this change as it is now, 
so feel free to commit it.
   
   @linyiqun I agree this could be handled more cleanly and efficiently in the 
container report handler. However, it's probably not much of an overhead for 
the replication manager. I am happy for us to commit the change as it is, and 
we can see how it performs in practice. Worst case, we have to refactor the 
change out of RM into the report handler. What do you think?






[GitHub] [hadoop-ozone] sodonnel commented on pull request #1338: HDDS-4023. Delete closed container after all blocks have been deleted.

2020-09-28 Thread GitBox


sodonnel commented on pull request #1338:
URL: https://github.com/apache/hadoop-ozone/pull/1338#issuecomment-700016623


   This change looks almost good now. I wonder about two final things:
   
   1. In `updateContainerStats(...)`, do you think we should return without 
making any updates if the container is DELETING or DELETED? If the report comes 
from a stale replica, it may not be empty and hence could update the container 
stats incorrectly. If the container is already in the DELETING or DELETED 
state, then we can ignore any changes to it, as we know the stale replica will 
get removed anyway. See the sketch after this list.
   
   2. I am wondering what happens if we receive a stale replica while the 
container is DELETING. The replica would get added in 
`updateContainerReplica(...)`. Later the container will go to DELETED and the 
stale replica will get reported again - at that point we will send a delete 
command, but I think the replica will never get removed from memory. Would it 
make sense to send the delete command for any replicas received when the 
container is DELETING or DELETED?






[GitHub] [hadoop-ozone] sodonnel commented on pull request #1338: HDDS-4023. Delete closed container after all blocks have been deleted.

2020-09-25 Thread GitBox


sodonnel commented on pull request #1338:
URL: https://github.com/apache/hadoop-ozone/pull/1338#issuecomment-698259315


   Sorry for the slow reply on this. I have been caught up on some other things.
   
   > After a second thought, deleting the container record in the SCM DB 
immediately while keeping it in memory may be a better and cleaner choice. Then 
if there is a stale container replica, it can be deleted based on the in-memory 
information.
   
   I think this is a good enough approach for now. If SCM is up for a very long 
time, perhaps in the future we will want to add a thread to clear all the 
in-memory DELETED containers. One small concern: if a container goes DELETED 
and SCM is restarted soon after, and a DN is then restarted and reports a stale 
replica, it will just be seen as an unknown container. The default position 
there is to log a warning; the config hdds.scm.unknown-container.action 
controls this. This is all an edge case - most of the time, all DNs should be 
up anyway.
   
   I left just one comment on a suggested refactor in the container report 
handler when dealing with replicas from a DELETED container.
   
   Could you also add a test in TestContainerReportHandler to check the logic 
around deleting a replica from a DELETED container? A rough sketch of the shape 
I have in mind is below.
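   
   Something like this is what I am imagining - a sketch only, since the setup 
helpers (`getContainer`, `getContainerReportsProto`, `randomDatanodeDetails`) 
and the handler/publisher wiring are assumptions about how that test class is 
structured:
   
   ```
   @Test
   public void testStaleReplicaOfDeletedContainerIsDeleted() throws Exception {
     // A container SCM has already moved to DELETED (helper assumed).
     final ContainerInfo container = getContainer(LifeCycleState.DELETED);
     containerStateManager.loadContainer(container);

     // A restarted datanode reports a stale replica of that container.
     final DatanodeDetails datanode = randomDatanodeDetails();
     final ContainerReportsProto report = getContainerReportsProto(
         container.containerID(), State.CLOSED, datanode.getUuidString());

     reportHandler.onMessage(
         new ContainerReportFromDatanode(datanode, report), publisher);

     // SCM should respond with a delete command for the stale replica.
     verify(publisher, times(1)).fireEvent(
         eq(SCMEvents.DATANODE_COMMAND), any(CommandForDatanode.class));
   }
   ```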






[GitHub] [hadoop-ozone] sodonnel commented on pull request #1338: HDDS-4023. Delete closed container after all blocks have been deleted.

2020-09-08 Thread GitBox


sodonnel commented on pull request #1338:
URL: https://github.com/apache/hadoop-ozone/pull/1338#issuecomment-688978976


   Thinking some more about what to do with containers in the DELETED state in 
SCM. If we are not confident enough to delete them immediately and 
automatically, maybe we should provide a "purge empty containers" command which 
can be run by an admin, or via Recon. We should show in Recon how many empty 
containers are present in the cluster and then provide an option to remove 
them.
   
   Alternatively, we could have some threshold in SCM, so that when the number 
of DELETED containers crosses that threshold, they are all removed.
   
   We could do either of these in a separate jira if this idea makes sense.






[GitHub] [hadoop-ozone] sodonnel commented on pull request #1338: HDDS-4023. Delete closed container after all blocks have been deleted.

2020-09-08 Thread GitBox


sodonnel commented on pull request #1338:
URL: https://github.com/apache/hadoop-ozone/pull/1338#issuecomment-688774840


   As I said before, I have a feeling that it *might* be better to have a 
separate service for deleting empty containers, and it would perhaps naturally 
be part of a "merge small containers" service, which we will need to create 
eventually. However, the changes to the replication manager here are not too 
big, and you could argue the current solution fits here OK. Therefore I am OK 
with keeping this logic in the replication manager. However, when it comes to 
merging small containers in the future, I believe we should create a separate 
service and not clutter RM with that new feature too.
   
   The patch looks mostly good to me; there are just a few areas we need to 
consider.
   
   1. What do we do with the containers in the DELETED state, with no replicas? 
I don't think we should keep them in SCM forever, but we are also wary of 
removing them. That said, if the replicas have all been deleted, what use is 
there in keeping the container in SCM?
   
   2. Consider if a datanode was dead for some time, and is then restarted. It 
will start reporting a replica for the DELETED container, and with the current 
logic I don't think we will clean it up, as `isContainerEmpty(...)` will always 
return false because it checks only for CLOSED containers. I feel this should 
be handled in the ContainerReportHandler code: if a replica is reported for a 
container which is DELETED in SCM, then we should instruct the datanode to 
delete the replica. What do you think? The parameter 
hdds.scm.unknown-container.action currently controls what to do if a DN reports 
a container which SCM does not know about. The default is to log a warning, but 
for a DELETED container, I think we should just delete the replica.
   
   3. There is another corner case, where the dead node may have a container 
replica which is out of sync with the others, as it was dead when the blocks 
should have been removed. So we could have 3 empty container replicas, plus one 
non-empty dead one. The 3 empty replicas are removed, and the container is 
moved to DELETED. Then the dead node starts reporting a non-empty replica for 
the DELETED container. This is actually a wider problem, in that I believe 
there are scenarios where container replicas can get out of sync with one 
another and there is nothing to reconcile them and fix it.
   
   4. In ContainerSafeModeRule, I think we need to consider DELETING and 
DELETED. If there are a lot of DELETED containers in the cluster, they will 
never get replicas and the safemode rule will never be met. I think this could 
be as simple as excluding DELETING and DELETED when loading the containerMap 
(right now OPEN and CLOSING are excluded) so we don't wait for replicas on 
these containers. See the sketch after this list.
   
   5. In `ReplicationManager.isContainerUnderReplicated(...)`, should we 
immediately return true if the containerState is DELETING or DELETED?
   
   6. Are there any other places outside of ReplicationManager and 
ContainerSafeModeRule where we monitor under- or over-replicated containers 
that I may have missed?
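   
   For point 4, the change I am picturing is roughly the following - purely 
illustrative, as the actual loading code in ContainerSafeModeRule may be shaped 
differently:
   
   ```
   // When loading containers for the safemode rule, skip states that will
   // never (or no longer) have replicas reported, so the rule can be met.
   containers.stream()
       .filter(c -> {
         final HddsProtos.LifeCycleState state = c.getState();
         return state != HddsProtos.LifeCycleState.OPEN
             && state != HddsProtos.LifeCycleState.CLOSING
             // New: replicas of DELETING/DELETED containers are being
             // removed, so waiting on them would block safemode exit.
             && state != HddsProtos.LifeCycleState.DELETING
             && state != HddsProtos.LifeCycleState.DELETED;
       })
       .forEach(c -> containerMap.put(c.containerID(), c));
   ```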






[GitHub] [hadoop-ozone] sodonnel commented on pull request #1338: HDDS-4023. Delete closed container after all blocks have been deleted.

2020-08-19 Thread GitBox


sodonnel commented on pull request #1338:
URL: https://github.com/apache/hadoop-ozone/pull/1338#issuecomment-676523668


   I raised HDDS-4131 / https://github.com/apache/hadoop-ozone/pull/1339 to 
discuss and address the bytesUsed and keyCount metric problem I mentioned above.






[GitHub] [hadoop-ozone] sodonnel commented on pull request #1338: HDDS-4023. Delete closed container after all blocks have been deleted.

2020-08-19 Thread GitBox


sodonnel commented on pull request #1338:
URL: https://github.com/apache/hadoop-ozone/pull/1338#issuecomment-676472849


   Thanks for working on this @ChenSammi.
   
   I wonder if it would be simpler to remove empty containers as part of 
Container Report processing? In 
`AbstractContainerReportHandler#updateContainerState`, we could check the size 
and number of keys of the reported containers in the CLOSED branch of the 
switch statement, and then take action to delete an empty container there. I 
have a feeling it might be simpler, but I am not sure. The disadvantage of 
doing it in the Container Report processing is that we are dealing with only a 
single replica at that stage. However, if the container is CLOSED in SCM, and a 
report says it is empty, then we should be good to simply remove the container 
from SCM and issue the delete container command when processing the container 
report.
   
   In looking at how this all works, I noticed there is a bug in the existing 
method `AbstractContainerReportHandler#updateContainerStats`, which might stop 
your change from working:
   
   ```
private void updateContainerStats(final ContainerID containerId,
    final ContainerReplicaProto replicaProto)
    throws ContainerNotFoundException {
  if (isHealthy(replicaProto::getState)) {
    final ContainerInfo containerInfo = containerManager
        .getContainer(containerId);

    if (containerInfo.getSequenceId() <
        replicaProto.getBlockCommitSequenceId()) {
      containerInfo.updateSequenceId(
          replicaProto.getBlockCommitSequenceId());
    }
    if (containerInfo.getUsedBytes() < replicaProto.getUsed()) {
      containerInfo.setUsedBytes(replicaProto.getUsed());
    }
    if (containerInfo.getNumberOfKeys() < replicaProto.getKeyCount()) {
      containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
    }
  }
}
   ```
   
   This logic assumes the bytesUsed and keyCount on the container are always 
increasing. However in this case, where blocks are deleted, the keyCount and 
bytesUsed can decrease. Therefore I don't think the stats will ever get updated 
in SCM via the container reports when blocks are removed. A possible direction 
for a fix is sketched below.
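   
   The simplest fix I can think of is to compare with `!=` rather than `<`, so 
that decreases propagate as well as increases - a sketch, not tested:
   
   ```
   // Allow stats to move in both directions: block deletion legitimately
   // shrinks keyCount and usedBytes, so "only ever increase" is wrong here.
   if (containerInfo.getUsedBytes() != replicaProto.getUsed()) {
     containerInfo.setUsedBytes(replicaProto.getUsed());
   }
   if (containerInfo.getNumberOfKeys() != replicaProto.getKeyCount()) {
     containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
   }
   ```
   
   The subtlety is that with several replicas reporting different values, 
last-writer-wins could make the stats flap, so we would probably need to 
reconcile across all the known replicas rather than trust a single report.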
   
   I then noticed you recently made a change in ReplicationManager to update 
these counts here:
   
   
https://github.com/apache/hadoop-ozone/pull/1295/files#diff-218132dfe72e6e39b9eaaf59737adcf2R780
   
   I think the ReplicationManager part of that change should be reverted, and 
the counts handled via the container reports by fixing the bug I mentioned 
above. The only reason we need the ReplicationManager change is to work around 
that bug; if it were fixed, we wouldn't need that logic in RM.
   
   If we decide to keep your changes inside ReplicationManager, then we would 
need a few more tests added to TestReplicationManager to cover the new 
scenarios.
   
   I also have a question - when the replicas are all deleted, and we call:
   
   ```
   containerManager.updateContainerState(container.containerID(),
       HddsProtos.LifeCycleEvent.CLEANUP);
   ```
   
   Does this somehow cause the container to be removed from SCM memory and the 
persistent store?


