[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906285#comment-16906285 ]
Stephen O'Donnell commented on HDFS-14311:
------------------------------------------

I have tried to reproduce this in a unit test, but without success. The issue is also a little more subtle than I first suspected.

The doTransition method reads the layout version of the storage it is working on and stores it in the layoutVersion instance variable of the BlockPoolSliceStorage. It then submits a job to upgrade that storage. The upgrade job sets the same instance variable to the new layout version, but at the same time the next storage is having its layout version read into that same instance variable, so the variable flip-flops between the two values.

[~caiyicong] are you able to reproduce this problem easily, or do you see it frequently? It would be nice to be able to reproduce it via a unit or manual test.

> multi-threading conflict at layoutVersion when loading block pool storage
> -------------------------------------------------------------------------
>
>                 Key: HDFS-14311
>                 URL: https://issues.apache.org/jira/browse/HDFS-14311
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: rolling upgrades
>    Affects Versions: 2.9.2
>            Reporter: Yicong Cai
>            Assignee: Yicong Cai
>            Priority: Major
>         Attachments: HDFS-14311.1.patch
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on
> StorageInfo.layoutVersion while the block pool storage is being loaded.
> It causes this exception:
>
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] - Restored 36974 block files from trash before the layout upgrade. These blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] - Failed to analyze storage directories for block pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>     at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>     at java.lang.Thread.run(Thread.java:748)
> {panel}
>
> Root cause:
> The BlockPoolSliceStorage instance is shared across the recover transition
> of all storage locations. BlockPoolSliceStorage.doTransition reads the old
> layoutVersion from local storage, compares it with the current DataNode
> version, and then performs the upgrade. doUpgrade runs the transition work
> in a sub-thread, and that work sets the BlockPoolSliceStorage's
> layoutVersion to the current DN version. The transition check for the next
> storage directory therefore runs concurrently with the actual transition
> work of the previous storage directory, so the layoutVersion of the shared
> BlockPoolSliceStorage instance becomes inconsistent.
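
The interleaving described in the root cause can be illustrated with a small, self-contained sketch. This is not Hadoop code and not the attached patch: SharedStorage, secondDirChecksAgree and the latches are hypothetical names, and the latches exist only to force the unlucky ordering deterministically. The layout versions -57 and -63 are taken from the log above. The sketch assumes the transition check for a directory reads the shared field more than once, and contrasts that with checking a per-directory local copy, which is one possible way to avoid the conflict.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class LayoutVersionRaceSketch {
    static final int OLD_LV = -57; // on-disk layout version seen in the log
    static final int NEW_LV = -63; // layout version of the upgraded software

    /** Hypothetical stand-in for the storage object shared by all directories. */
    static final class SharedStorage {
        volatile int layoutVersion;
    }

    /** Waits on a latch without exposing a checked exception to callers. */
    static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /**
     * Simulates the transition check for a second directory while the upgrade
     * task of a first directory runs in another thread. Returns true only if
     * both checks conclude that the second directory still needs an upgrade.
     */
    static boolean secondDirChecksAgree(boolean useLocalCopy) {
        SharedStorage storage = new SharedStorage();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch firstCheckDone = new CountDownLatch(1);
        CountDownLatch upgradeDone = new CountDownLatch(1);
        try {
            // Directory 1: load its on-disk version, then queue its upgrade task.
            storage.layoutVersion = OLD_LV;
            pool.execute(() -> {
                try {
                    await(firstCheckDone);          // run between the two checks
                    storage.layoutVersion = NEW_LV; // upgrade overwrites the shared field
                } finally {
                    upgradeDone.countDown();
                }
            });

            // Directory 2: load its own on-disk version into the same field.
            int onDisk = OLD_LV;
            storage.layoutVersion = onDisk;

            // Layout versions are negative; a larger value is an older layout.
            boolean firstCheck = (useLocalCopy ? onDisk : storage.layoutVersion) > NEW_LV;
            firstCheckDone.countDown();
            await(upgradeDone);
            boolean secondCheck = (useLocalCopy ? onDisk : storage.layoutVersion) > NEW_LV;
            return firstCheck && secondCheck;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared field, checks agree: " + secondDirChecksAgree(false));
        System.out.println("local copy, checks agree: " + secondDirChecksAgree(true));
    }
}
```

With the shared field, the second check sees -63 instead of -57 and the two checks disagree (the first line prints false). With the local copy, both checks see -57 and the second line prints true.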