Re: [Discuss] - Merge Decommission and Maintenance - HDDS-1880

Stephen O'Donnell Tue, 27 Oct 2020 02:36:03 -0700

Hi Yiqun,

Thanks for taking a look.


> Can clients still read container data when the node holding the container
is in the DECOMMISSIONING/DECOMMISSIONED state? If the container cannot be
accessed, containers could be lost in a short time when multiple nodes are
decommissioning.

There is no limitation on the DN side for this. I need to check the SCM
read path to ensure nodes which are DECOMMISSIONING or ENTERING_MAINTENANCE
are still returned when OM requests the block locations. I agree this is
important and we need to ensure these nodes can still be read.
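
To make the read-path requirement concrete, here is a minimal sketch of the kind of filter I have in mind. Everything in it is hypothetical: the class, the `Node` type and the `READABLE` set are made up for illustration and are not the SCM code. It assumes nodes that are IN_SERVICE, DECOMMISSIONING or ENTERING_MAINTENANCE should still be returned for reads; how DECOMMISSIONED and IN_MAINTENANCE nodes should be treated is left open here.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch only; not the actual SCM read path. */
class ReadableNodeFilter {

  /** Operational states used in this sketch. */
  enum OpState {
    IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
    ENTERING_MAINTENANCE, IN_MAINTENANCE
  }

  /** A datanode and its current operational state. */
  static final class Node {
    public final String host;
    public final OpState state;

    Node(String host, OpState state) {
      this.host = host;
      this.state = state;
    }
  }

  /** Assumption: these states must still serve reads. */
  static final Set<OpState> READABLE = EnumSet.of(
      OpState.IN_SERVICE,
      OpState.DECOMMISSIONING,
      OpState.ENTERING_MAINTENANCE);

  /** Returns the hosts that may be handed out as block locations, in input order. */
  static List<String> readableHosts(List<Node> replicas) {
    List<String> hosts = new ArrayList<>();
    for (Node n : replicas) {
      if (READABLE.contains(n.state)) {
        hosts.add(n.host);
      }
    }
    return hosts;
  }
}
```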

> Do we have rate limiting for node decommission?

Not at the moment. I feel this is something we should control in Replication
Manager (RM) rather than in decommissioning. We have already seen issues with
RM where too many in-flight replication commands are sent to the DNs, which
cannot complete them in time, and then more get scheduled. Each DN has a
replication limit, so I think we need to enhance RM to hold back commands
until the DNs have capacity to service them. We may also want to give
containers that are under-replicated due to a dead node priority over
containers being replicated for decommission.
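
As a rough illustration of that idea, the sketch below holds commands back until the target DN has a free slot and always serves dead-node replication before decommission replication. It is an assumption-laden sketch, not the Replication Manager: the class name, the `Reason` enum and the single `perDnLimit` value are all invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch; none of these names come from the Ozone code base. */
class ThrottledReplicationQueue {

  /** Why a replica is needed; a lower ordinal is more urgent. */
  public enum Reason { DEAD_NODE, DECOMMISSION }

  /** A pending replication command aimed at one target datanode. */
  public static final class Command {
    public final long containerId;
    public final String targetDn;
    public final Reason reason;

    public Command(long containerId, String targetDn, Reason reason) {
      this.containerId = containerId;
      this.targetDn = targetDn;
      this.reason = reason;
    }
  }

  private final int perDnLimit;
  // Most urgent reason first; order within the same reason is unspecified.
  private final PriorityQueue<Command> pending =
      new PriorityQueue<>(Comparator.comparing((Command c) -> c.reason));
  private final Map<String, Integer> inFlight = new HashMap<>();

  ThrottledReplicationQueue(int perDnLimit) {
    this.perDnLimit = perDnLimit;
  }

  void submit(Command c) {
    pending.add(c);
  }

  /** Returns the commands that can be sent now; the rest stay queued. */
  List<Command> dispatch() {
    List<Command> ready = new ArrayList<>();
    List<Command> held = new ArrayList<>();
    Command c;
    while ((c = pending.poll()) != null) {
      int used = inFlight.getOrDefault(c.targetDn, 0);
      if (used < perDnLimit) {
        inFlight.put(c.targetDn, used + 1);
        ready.add(c);
      } else {
        held.add(c); // target DN is at its limit, hold the command back
      }
    }
    pending.addAll(held);
    return ready;
  }

  /** Frees one slot on the target DN once it reports the command finished. */
  void complete(Command c) {
    inFlight.computeIfPresent(c.targetDn, (dn, n) -> n > 1 ? n - 1 : null);
  }

  int pendingCount() {
    return pending.size();
  }
}
```

The point of keeping the limit inside the queue is that decommissioning can submit as many commands as it likes without flooding the DNs; only `dispatch()` decides what actually goes out.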

> For the command usage above, will we support passing the nodes via an
input node list file? That would be useful for admin users of this
feature.

That is certainly something that can be added, and I would see it as one of
the "usability enhancements" I mentioned. What we can do is create a new
epic Jira for "post branch merge enhancements" and start collecting these
suggestions there?
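
For what it is worth, parsing such a file should be simple. The sketch below assumes a format nobody has agreed on yet: one host per line, with blank lines and `#` comments ignored. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a host-list parser; the file format is an assumption. */
class HostListParser {

  /** Returns the hosts in file order, skipping blank lines and '#' comments. */
  static List<String> parse(List<String> lines) {
    List<String> hosts = new ArrayList<>();
    for (String line : lines) {
      int hash = line.indexOf('#');
      // Drop any trailing comment, then surrounding whitespace.
      String host = (hash >= 0 ? line.substring(0, hash) : line).trim();
      if (!host.isEmpty()) {
        hosts.add(host);
      }
    }
    return hosts;
  }
}
```

The lines could come from `java.nio.file.Files.readAllLines(...)`, and the resulting list would then be handed to the same code path that handles hosts given on the command line.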

Thanks,

Stephen.


On Tue, Oct 27, 2020 at 7:09 AM Lin, Yiqun <[email protected]> wrote:

> Hi Stephen,
>
> I haven't reviewed much of the decommission feature code, but I have had a
> look at the overview doc you attached.
>
> Just some questions and comments from me:
>
> * Can clients still read container data when the node holding the container
> is in the DECOMMISSIONING/DECOMMISSIONED state? If the container cannot be
> accessed, containers could be lost in a short time when multiple nodes are
> decommissioning.
> * Do we have rate limiting for node decommission? If a large number of nodes
> are decommissioned concurrently, lots of closed containers will be in
> replication, and I think this can impact the performance of SCM.
>
> Minor suggestion:
> ozone admin datanode decommission <list of nodes to remove>
> ozone admin datanode maintenance <list of nodes to put to maintenance>
> ozone admin datanode recommission <list of nodes to recommission>
>
> For the command usage above, will we support passing the nodes via an input
> node list file? That would be useful for admin users of this feature.
>
> Thanks,
> Yiqun
>
> On 2020/10/27, 2:09 AM, "Stephen O'Donnell" <[email protected]>
> wrote:
>
>     Someone reported that the attachment did not come through - perhaps the
>     mailing list strips out attachments?
>
>     I have attached it to the HDDS-1880 Jira - here is the direct link:
>
>     https://issues.apache.org/jira/secure/attachment/13014144/Decommission%20and%20Maintenance%20Overview.pdf
>
>     Thanks,
>
>     Stephen.
>
>     On Mon, Oct 26, 2020 at 5:47 PM Stephen O'Donnell <[email protected]>
>     wrote:
>
>     > Hi All,
>     >
>     > I am pleased to announce the Datanode Decommission and Maintenance
>     > feature for Ozone - https://issues.apache.org/jira/browse/HDDS-1880
>     >
>     > The feature is working in integration tests and also via docker-compose.
>     > There is still some work to improve monitoring and usability, but I
>     > believe the feature is now complete enough to merge into master and
>     > continue development there.
>     >
>     > I would like to use this thread to discuss the feature and agree on
>     > whether we can merge it into master. To help with the discussion, I
>     > have attached a short document describing the major changes.
>     >
>     > The decommission changes are all on the branch HDDS-1880-Decom.
>     >
>     > Please reply here with any questions and comments.
>     >
>     > Thanks,
>     >
>     > Stephen.
>     >
