Here is my idea of what happened, let me know if it makes sense (an invitation extended to everybody reading, of course!):
* Any upgrade from 2.6 to something more recent needs to go through a
restructuring of the DataNode volumes/directories, as described in
https://issues.apache.org/jira/browse/HDFS-3290 and
https://issues.apache.org/jira/browse/HDFS-6482.

* From https://issues.apache.org/jira/browse/HDFS-8782 it seems that the
procedure takes time, and until the volumes are upgraded the DN doesn't
register with the NameNode. This is what we observed during the upgrade:
a lot of DNs took a long time to register, but eventually they all did
(without hitting OOMs).

* https://issues.apache.org/jira/browse/HDFS-8578 was created to process
the DataNode volumes/dirs in parallel on upgrade (independently from the
DataNode dir-structure upgrade mentioned above, IIUC), but this may cause
OOMs, as described in https://issues.apache.org/jira/browse/HDFS-9536
(which looks like an open problem).

In theory, then, upgrading from a distro shipping 2.6 (like CDH 5) needs
to go through the directory restructure, but any upgrade can also hit
OOMs due to the parallel processing of storage volumes/dirs.

Does it make sense?

Luca

On Sat, Feb 13, 2021 at 7:11 PM Luca Toscano <[email protected]> wrote:
>
> Hi Jason,
>
> Thanks a lot for sharing your story too; I definitely feel much better
> about the upgrade plan we used knowing that the exact same issue
> happened to other people. I checked Hadoop's Jira to see if this
> upgrade memory requirement was mentioned, but didn't find anything.
> Do you have any more info to share about how to best size DNs' JVM
> heaps before the upgrade starts? In my case it was a
> restart/fail/double-the-heap procedure until we found that 16G was a
> good value for our DNs, but it looks like your case was probably
> worse (4GB -> 64GB).
> I wouldn't really know what to suggest
> to somebody doing a similar upgrade and asking for advice, and
> since you hit the issue upgrading to Hadoop 3.x this will be
> relevant also for people upgrading from Bigtop 1.4/1.5 to the future
> 3.x release. The more info we can collect, the better for the
> community, in my opinion!
>
> Luca
>
> On Fri, Feb 12, 2021 at 7:49 PM Jason Wen <[email protected]> wrote:
> >
> > Hi Luca,
> >
> > Thanks for sharing your upgrade experience.
> > We hit the exact same HDFS inconsistent-status issue when we
> > upgraded one cluster from CDH 5.16.2 to CDH 6.3.2. At that time some DNs
> > crashed due to OOM and some other DNs were still running but failed to
> > upgrade their volumes. We finally resolved the issue by increasing the max
> > heap size from 4GB to 64GB (our DNs have either 256GB or 512GB of memory)
> > and then restarting all the DNs.
> >
> > -Jason
> >
> > On 2/12/21, 12:52 AM, "Luca Toscano" <[email protected]> wrote:
> >
> > Hi everybody,
> >
> > We have finally migrated our CDH cluster to Bigtop 1.5, so I can say
> > that we are now happy Bigtop users :)
> >
> > The upgrade of the production cluster (60 worker nodes, ~50M files on
> > HDFS) was harder than I expected, since we bumped into a strange
> > performance issue that slowed down the HDFS upgrade. I wrote a summary
> > in https://phabricator.wikimedia.org/T273711#6818136
> > for whoever is interested; it is surely something to highlight in the
> > CDH->Bigtop guide.
> > Speaking of which, the last thing we did was start
> > https://docs.google.com/document/d/1fI1mvbR1mFLV6ohU5cIEnU5hFvEE7EWnKYWOkF55jtE/edit
> > some time ago, so I am wondering if we could find a more permanent
> > location. Would it make sense to start a wiki page somewhere? Or even
> > a .md file in the GitHub repo, whichever you prefer (the latter would
> > be more convenient for reviewers etc.).
> >
> > Anyway, thanks a lot to all for the support! It was a looong project,
> > but we eventually did it!
> >
> > Luca
> >
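
P.S. for anyone else planning a similar upgrade: if I read the Jiras
correctly, the parallel volume processing from HDFS-8578 can be throttled
via hdfs-site.xml to bound the DataNode's memory use during the layout
upgrade. This is only a sketch based on the property documented in
hdfs-default.xml (the value of 2 is an illustrative choice, not a tested
recommendation for any specific cluster):

```xml
<!-- hdfs-site.xml: cap the number of threads the DataNode uses to
     load/upgrade its storage directories. The default is one thread
     per data dir, which seems to be what drives heap usage up during
     the upgrade (see HDFS-9536). -->
<property>
  <name>dfs.datanode.parallel.volumes.load.threads.num</name>
  <value>2</value>
</property>
```

Independently of that, pre-sizing the DataNode heap before the first
post-upgrade restart (as both Jason and we ended up doing) still seems
like the safer bet, since the threshold appears to depend on block count
per volume.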
