[ https://issues.apache.org/jira/browse/HDFS-16586?focusedWorklogId=773559&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-773559 ]
ASF GitHub Bot logged work on HDFS-16586: ----------------------------------------- Author: ASF GitHub Bot Created on: 23/May/22 15:51 Start Date: 23/May/22 15:51 Worklog Time Spent: 10m Work Description: saintstack closed pull request #4346: HDFS-16586. Purge FsDatasetAsyncDiskService threadgroup URL: https://github.com/apache/hadoop/pull/4346 Issue Time Tracking ------------------- Worklog Id: (was: 773559) Time Spent: 1h 10m (was: 1h) > Purge FsDatasetAsyncDiskService threadgroup; it causes > BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal > exception and exit' > ----------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HDFS-16586 > URL: https://issues.apache.org/jira/browse/HDFS-16586 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode > Affects Versions: 3.3.0, 3.2.3 > Reporter: Michael Stack > Assignee: Michael Stack > Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > The below failed block finalize is causing a downstreamer's test to fail when > it uses hadoop 3.2.3 or 3.3.0+: > {code:java} > 2022-05-19T18:21:08,243 INFO [Command processor] > impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica > FinalizedReplica, blk_1073741840_1016, FINALIZED > getNumBytes() = 52 > getBytesOnDisk() = 52 > getVisibleLength()= 52 > getVolume() = > /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2 > getBlockURI() = > file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840 > for deletion > 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] > metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 > (auth:SIMPLE) > 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] > top.TopAuditLogger(78): ------------------- logged event for top service: > allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete > src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp > dst=null perm=null > 2022-05-19T18:21:08,243 DEBUG [PacketResponder: > BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, > type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): > PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, > type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS > downstreamAckTimeNanos: 0 flag: 0 > 2022-05-19T18:21:08,243 DEBUG [PacketResponder: > BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, > type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): > PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, > type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write. > 2022-05-19T18:21:08,243 ERROR [Command processor] > datanode.BPServiceActor$CommandProcessingThread(1276): Command processor > encountered fatal exception and exit. > java.lang.IllegalThreadStateException: null > at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?] > at java.lang.Thread.<init>(Thread.java:430) ~[?:?] > at java.lang.Thread.<init>(Thread.java:704) ~[?:?] > at java.lang.Thread.<init>(Thread.java:525) ~[?:?] > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113) > ~[hadoop-hdfs-3.2.3.jar:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) > ~[?:?] > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2103) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1318) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1364) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1291) > ~[hadoop-hdfs-3.2.3.jar:?] > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1274) > ~[hadoop-hdfs-3.2.3.jar:?] > 2022-05-19T18:21:08,243 DEBUG [DataXceiver for client > DFSClient_NONMAPREDUCE_1036441577_16 at /127.0.0.1:54879 [Receiving block > BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006]] > datanode.BlockReceiver(532): Receiving one packet for block > BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006: PacketHeader with > packetLen=750 header data: offsetInBlock: 495104 > seqno: 902 > lastPacketInBlock: false > dataLen: 7382022-05-19T18:21:08,243 WARN [Command processor] > datanode.BPServiceActor$CommandProcessingThread(1279): Ending command > processor service for: Thread[Command processor,5,FailOnTimeoutGroup] {code} > Above, we are trying to run a FINALIZED 'command' asynchronously using a > thread from the volume's dedicated executor. The executor is trying to create > a thread because it has none-to-hand as it has not seen action in over > {color:#660e7a}THREADS_KEEP_ALIVE_SECONDS{color} (60seconds) so all live > threads have been let go (The volume executor as configured allows core > threads to die also). The thread creation fails because the > FsDatasetAsyncDiskService#threadGroup we are trying to associate the new > thread with is 'destroyed'. Threadgroups auto-'destroy' themselves after > their last member has > died|[https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846]|https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846].] > which can happen given the way this volume executor is configured. This > FsDatasetAsyncDiskService#threadGroup goes unexploited. If no thread group is > supplied, the executor makes a default that internally associates any created > threads with a dedicated executor 'threadgroup' but given how our executor is > configured to let all threads die in the name of economizing on total unning > threads, a threadgroup-for-each-volume-executor is likely to have the same > issue as this FsDatasetAsyncDiskService#threadGroup one. Let me put up a PR > to remove it and undo the cause of the above ITSE. > Of interest perhaps is that this IllegalThreadStateException is not new to > 3.2.3/3.0.0+. See the below from 3.2.2 when I run the same downstream test: > {{2022-05-19T17:06:11,140 ERROR [BP-1924619834-10.0.0.192-1653005021952 > heartbeating to localhost/127.0.0.1:54360] datanode.BPServiceActor(856): > Exception in BPOfferService for Block pool > BP-1924619834-10.0.0.192-1653005021952 (Datanode Uuid > bb5219c0-a444-4e5a-9455-6176ced7ffe7) service to localhost/127.0.0.1:54360}} > {{java.lang.IllegalThreadStateException: null}} > {{ at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]}} > {{ at java.lang.Thread.<init>(Thread.java:430) ~[?:?]}} > {{ at java.lang.Thread.<init>(Thread.java:704) ~[?:?]}} > {{ at java.lang.Thread.<init>(Thread.java:525) ~[?:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:105) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623) > ~[?:?]}} > {{ at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912) > ~[?:?]}} > {{ at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) > ~[?:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:181) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:230) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2155) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2074) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:888) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:854) > ~[hadoop-hdfs-3.2.2.jar:?]}} > {{ at java.lang.Thread.run(Thread.java:829) ~[?:?]}} > It too runs into the IllegalThreadStateException. > The difference is HDFS-14997 which nicely made command processing async. It > changed how commands work (of course). Now on an ITSE, the processor shuts > down completely with ERROR. In my test case, the FINALIZED of the block fails > and the procedure waiting on its completion times out, in turn timing out the > test. In the old days, before HDFS-14997, the then synchronus heartbeats > would pick up the ITSE earlier and would just keep retrying in spite of ITSE > – the block finalize happened to complete and the test passed (I did not do > deep analysis of how command-processnig behavior has changed. If interest, > I'd be up for digging in). > {color:#660e7a} {color} -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org