[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910187#comment-16910187 ]
Surendra Singh Lilhore commented on HDFS-14311:
-----------------------------------------------

Thanks [~caiyicong] for reporting this issue. We hit the same issue in our cluster, and this patch fixed it. I don't think it is easy to reproduce in a unit test.

> multi-threading conflict at layoutVersion when loading block pool storage
> -------------------------------------------------------------------------
>
>                 Key: HDFS-14311
>                 URL: https://issues.apache.org/jira/browse/HDFS-14311
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: rolling upgrades
>    Affects Versions: 2.9.2
>            Reporter: Yicong Cai
>            Assignee: Yicong Cai
>            Priority: Major
>         Attachments: HDFS-14311.1.patch
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] - Restored 36974 block files from trash before the layout upgrade.
> These blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] - Failed to analyze storage directories for block pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>     at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>     at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>     at java.lang.Thread.run(Thread.java:748)
> {panel}
>
> Root cause:
> A single BlockPoolSliceStorage instance is shared across the recover-transition of all storage locations. In BlockPoolSliceStorage.doTransition, the old layoutVersion is read from local storage and compared with the current DataNode version to decide whether to upgrade. doUpgrade then runs the actual transition work in a sub-thread, and that sub-thread sets the shared BlockPoolSliceStorage's layoutVersion to the current DN version. As a result, the transition check for the next storage directory runs concurrently with the real transition work of the previous directory, and the shared instance's layoutVersion becomes inconsistent.
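The race described in the root cause can be sketched as follows. This is a minimal, hypothetical simplification (the class and method names below are stand-ins, not the actual HDFS code): one shared object holds the mutable layoutVersion field, a background "upgrade" thread for the first storage directory bumps it to the current software version, and the check for the next directory then reads the already-bumped value instead of that directory's real on-disk state. The latch forces the bad interleaving deterministically for illustration; in the real DataNode the timing is nondeterministic.

```java
import java.util.concurrent.CountDownLatch;

public class LayoutVersionRaceSketch {
    // HDFS layout versions are negative; a more negative value is newer.
    static final int OLD_DISK_LV = -57; // on-disk layout before upgrade (from the log above)
    static final int CURRENT_LV  = -63; // software layout of the running DataNode

    // Stand-in for the single BlockPoolSliceStorage object shared by all
    // storage directories (hypothetical simplified shape, not HDFS code).
    static class SharedStorage {
        volatile int layoutVersion = OLD_DISK_LV;
    }

    // Simplified transition check: an upgrade appears necessary only while
    // the shared field still holds the old (less negative) on-disk version.
    static boolean needsUpgrade(SharedStorage storage) {
        return storage.layoutVersion > CURRENT_LV;
    }

    public static void main(String[] args) throws InterruptedException {
        SharedStorage storage = new SharedStorage();
        System.out.println("dir #1 check: needsUpgrade=" + needsUpgrade(storage)); // true

        // Background upgrade thread for dir #1 stamps the *shared* instance
        // with the new layout version, like doUpgrade's sub-thread does.
        CountDownLatch done = new CountDownLatch(1);
        Thread upgrade = new Thread(() -> {
            storage.layoutVersion = CURRENT_LV;
            done.countDown();
        });
        upgrade.start();
        done.await(); // force the problematic interleaving deterministically

        // dir #2's check now reads the already-bumped shared value and draws
        // the wrong conclusion about dir #2's actual on-disk state.
        System.out.println("dir #2 check: needsUpgrade=" + needsUpgrade(storage)); // false
        upgrade.join();
    }
}
```

The sketch suggests why the fix direction is to keep per-directory state out of the shared instance (or serialize the transitions) rather than to adjust the version comparison itself.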
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org