This is my theory about what happened, let me know if it makes sense
(a question open to everybody reading, of course!):

* Any upgrade from 2.6 to something more recent needs to go through a
restructuring of the datanode volumes/directories, as described in
https://issues.apache.org/jira/browse/HDFS-3290 and
https://issues.apache.org/jira/browse/HDFS-6482.
* From https://issues.apache.org/jira/browse/HDFS-8782 it seems that
the procedure takes time, and until the volumes are upgraded the DN
doesn't register with the Namenode. This is what we observed during the
upgrade: a lot of DNs took a long time to register, but eventually
they all did (without hitting OOMs).
* https://issues.apache.org/jira/browse/HDFS-8578 was created to
process the datanode volumes/dirs in parallel on upgrade
(independently from the datanode dir structure upgrade mentioned
above, IIUC), but this parallelism may cause OOMs, as described in
https://issues.apache.org/jira/browse/HDFS-9536 (which looks like an
open problem).
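
If the parallel volume processing from HDFS-8578 really is the OOM
trigger, one mitigation worth testing is capping the number of upgrade
threads. A sketch of the hdfs-site.xml override (the property comes
from HDFS-8578; please double-check the name and default in
hdfs-default.xml for the exact release you run, the value of 1 here is
just an illustration):

```xml
<!-- hdfs-site.xml: limit the threads used to load/upgrade DataNode
     volumes, trading upgrade speed for lower peak memory during the
     layout upgrade. Property introduced by HDFS-8578. -->
<property>
  <name>dfs.datanode.parallel.volumes.load.threads.num</name>
  <value>1</value>
</property>
```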

In theory, then, any upgrade from a distro shipping 2.6 (like CDH 5)
needs to go through the directory restructuring, but any upgrade can
also hit OOMs due to the parallel processing of storage volumes/dirs.
Does that make sense?
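
For what it's worth, the heap bump we ended up with looks roughly like
the following in hadoop-env.sh. 16G is just the value that worked for
our cluster (60 workers, ~50M files), not a general recommendation,
and the variable name differs between Hadoop versions, so treat this
as a sketch only:

```shell
# hadoop-env.sh: raise the DataNode heap for the duration of the
# layout upgrade, then dial it back down afterwards.
# 16g worked for our cluster; your value will depend on block and
# volume counts, so treat it as a starting point, not a formula.
export HDFS_DATANODE_OPTS="-Xms16g -Xmx16g ${HDFS_DATANODE_OPTS}"
# On Hadoop 2.x the variable is HADOOP_DATANODE_OPTS instead.
```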

Luca

On Sat, Feb 13, 2021 at 7:11 PM Luca Toscano <[email protected]> wrote:
>
> Hi Jason,
>
> Thanks a lot for sharing your story too, I definitely feel way better
> about the upgrade plan that we used, knowing that the exact same
> issue happened to other people. I tried to check in Hadoop's Jira
> whether this upgrade memory requirement was mentioned, but didn't
> find anything. Do you have any more info to share about how to best
> size the DNs' JVM heap before the upgrade starts? In my case it was a
> restart/fail/double-the-heap procedure until we found that 16G was a
> good value for our DNs, but I see that in your case it was probably
> worse (4GB -> 64GB). I wouldn't really know what to suggest to
> somebody doing a similar upgrade and asking for advice, and since you
> encountered the issue upgrading to Hadoop 3.x this will also be
> relevant for people upgrading from Bigtop 1.4/1.5 to the future
> 3.x release. The more info we can collect, the better for the
> community in my opinion!
>
> Luca
>
> On Fri, Feb 12, 2021 at 7:49 PM Jason Wen <[email protected]> wrote:
> >
> > Hi Luca,
> >
> > Thanks for sharing your upgrade experience.
> > We hit the exact same HDFS inconsistent-status issue when we
> > upgraded one cluster from CDH 5.16.2 to CDH 6.3.2. At that time some
> > DNs crashed due to OOMs and some other DNs were still running but
> > failed to upgrade their volumes. We finally resolved the issue by
> > increasing the max heap size from 4GB to 64GB (our DNs have either
> > 256GB or 512GB of memory) and then restarting all the DNs.
> >
> > -Jason
> >
> > On 2/12/21, 12:52 AM, "Luca Toscano" <[email protected]> wrote:
> >
> >     Hi everybody,
> >
> >     We have finally migrated our CDH cluster to Bigtop 1.5, so I can say
> >     that we are now happy Bigtop users :)
> >
> >     The upgrade of the production cluster (60 worker nodes, ~50M files on
> >     HDFS) was harder than I expected, since we bumped into a strange
> >     performance issue that slowed down the HDFS upgrade. I wrote a summary
> >     in https://phabricator.wikimedia.org/T273711#6818136 for whoever is
> >     interested, it is surely something to highlight in the CDH->Bigtop
> >     guide. Speaking of which, the last thing that we did was starting
> >     https://docs.google.com/document/d/1fI1mvbR1mFLV6ohU5cIEnU5hFvEE7EWnKYWOkF55jtE/edit
> >     some time ago, so I am wondering if we could find a more permanent
> >     location. Would it make sense to start a wiki page somewhere? Or even
> >     a .md file in the GitHub repo, whichever you prefer (the latter
> >     would be more convenient for reviewers, etc.).
> >
> >     Anyway, thanks a lot to all for the support! It was a looong project
> >     but we eventually did it!
> >
> >     Luca
> >
