Michael Stack created HDFS-16586:
------------------------------------

             Summary: Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit'
                 Key: HDFS-16586
                 URL: https://issues.apache.org/jira/browse/HDFS-16586
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 3.2.3, 3.3.0
            Reporter: Michael Stack
            Assignee: Michael Stack
The below failed block finalize is causing a downstreamer's test to fail when it uses hadoop 3.2.3 or 3.3.0+:

{code:java}
2022-05-19T18:21:08,243 INFO  [Command processor] impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica FinalizedReplica, blk_1073741840_1016, FINALIZED
  getNumBytes()     = 52
  getBytesOnDisk()  = 52
  getVisibleLength()= 52
  getVolume()       = /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
  getBlockURI()     = file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840 for deletion
2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 (auth:SIMPLE)
2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] top.TopAuditLogger(78): ------------------- logged event for top service: allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp dst=null perm=null
2022-05-19T18:21:08,243 DEBUG [PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2022-05-19T18:21:08,243 DEBUG [PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
2022-05-19T18:21:08,243 ERROR [Command processor] datanode.BPServiceActor$CommandProcessingThread(1276): Command processor encountered fatal exception and exit.
java.lang.IllegalThreadStateException: null
	at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
	at java.lang.Thread.<init>(Thread.java:430) ~[?:?]
	at java.lang.Thread.<init>(Thread.java:704) ~[?:?]
	at java.lang.Thread.<init>(Thread.java:525) ~[?:?]
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113) ~[hadoop-hdfs-3.2.3.jar:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) ~[?:?]
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2103) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1318) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1364) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1291) ~[hadoop-hdfs-3.2.3.jar:?]
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1274) ~[hadoop-hdfs-3.2.3.jar:?]
2022-05-19T18:21:08,243 DEBUG [DataXceiver for client DFSClient_NONMAPREDUCE_1036441577_16 at /127.0.0.1:54879 [Receiving block BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006]] datanode.BlockReceiver(532): Receiving one packet for block BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006: PacketHeader with packetLen=750 header data: offsetInBlock: 495104 seqno: 902 lastPacketInBlock: false dataLen: 738
2022-05-19T18:21:08,243 WARN  [Command processor] datanode.BPServiceActor$CommandProcessingThread(1279): Ending command processor service for: Thread[Command processor,5,FailOnTimeoutGroup]
{code}

Above, we are trying to run a FINALIZED 'command' asynchronously using a thread from the volume's dedicated executor. The executor has to create a new thread because it has none to hand: it has seen no action in over {{THREADS_KEEP_ALIVE_SECONDS}} (60 seconds), so all of its live threads have been let go (the volume executor is configured to allow even core threads to die). Thread creation fails because the FsDatasetAsyncDiskService#threadGroup we try to associate the new thread with has been 'destroyed'. Threadgroups auto-'destroy' themselves after their last member has died ([https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846]), which can happen given the way this volume executor is configured. This FsDatasetAsyncDiskService#threadGroup otherwise goes unexploited.
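The precondition for the ITSE can be sketched outside of Hadoop. This is a minimal, illustrative sketch only — the class name, pool sizes, and the short keep-alive (standing in for the 60-second {{THREADS_KEEP_ALIVE_SECONDS}}) are mine, not Hadoop's code. It shows an executor configured like the volume executor (core threads allowed to time out) draining to zero threads, so that the next submitted task must mint a fresh Thread via the factory against the shared group:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class VolumeExecutorSketch {
    public static void main(String[] args) throws Exception {
        // A shared group, standing in for FsDatasetAsyncDiskService#threadGroup.
        ThreadGroup group = new ThreadGroup("sketch-group");

        // The factory pins every worker it makes to the shared group. Once that
        // group is destroyed (a ThreadGroup can be destroyed after its last
        // member dies), Thread.<init> would throw IllegalThreadStateException
        // from ThreadGroup#addUnstarted -- the 'fatal exception' in the report.
        ThreadFactory factory = r -> new Thread(group, r);

        // Volume-executor-style config: keep-alive applies to core threads too,
        // so the pool can drain to zero threads when idle.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 100, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(), factory);
        executor.allowCoreThreadTimeOut(true);

        executor.submit(() -> { }).get(); // first task: a worker is created
        Thread.sleep(1000);               // let the keep-alive expire

        // All workers have been let go; a subsequent submit() must create a
        // brand-new Thread through the factory above, against the shared group.
        System.out.println("pool size after keep-alive: " + executor.getPoolSize());
        executor.shutdown();
    }
}
```

With the group left alive the next submit succeeds; the bug needs the extra ingredient that the group was destroyed in the idle window, which the factory cannot recover from.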
If no thread group is supplied, the executor's default thread factory associates any created threads with its own dedicated 'threadgroup'; but given how our executor is configured to let all of its threads die in the name of economizing on total running threads, a threadgroup-per-volume-executor would likely hit the same issue as this FsDatasetAsyncDiskService#threadGroup. Let me put up a PR to remove it and so undo the cause of the above ITSE.

Of interest perhaps is that this IllegalThreadStateException is not new to 3.2.3/3.3.0+. See the below from 3.2.2 when I run the same downstream test:

{{2022-05-19T17:06:11,140 ERROR [BP-1924619834-10.0.0.192-1653005021952 heartbeating to localhost/127.0.0.1:54360] datanode.BPServiceActor(856): Exception in BPOfferService for Block pool BP-1924619834-10.0.0.192-1653005021952 (Datanode Uuid bb5219c0-a444-4e5a-9455-6176ced7ffe7) service to localhost/127.0.0.1:54360}}
{{java.lang.IllegalThreadStateException: null}}
{{ at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]}}
{{ at java.lang.Thread.<init>(Thread.java:430) ~[?:?]}}
{{ at java.lang.Thread.<init>(Thread.java:704) ~[?:?]}}
{{ at java.lang.Thread.<init>(Thread.java:525) ~[?:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:105) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623) ~[?:?]}}
{{ at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912) ~[?:?]}}
{{ at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) ~[?:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:181) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:230) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2155) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2074) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:888) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:854) ~[hadoop-hdfs-3.2.2.jar:?]}}
{{ at java.lang.Thread.run(Thread.java:829) ~[?:?]}}

It too runs into the IllegalThreadStateException. The difference is HDFS-14997, which nicely made command processing async. It changed how commands work (of course). Now, on an ITSE, the processor shuts down completely with an ERROR; in my test case, the finalize of the block fails, the procedure waiting on its completion times out, and that in turn times out the test. In the old days, before HDFS-14997, the then-synchronous heartbeats would pick up the ITSE earlier and would just keep retrying in spite of it – the block finalize happened to complete and the test passed. (I did not do a deep analysis of how command-processing behavior has changed. If there's interest, I'd be up for digging in.)

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org