[ 
https://issues.apache.org/jira/browse/HDFS-16586?focusedWorklogId=773961&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-773961
 ]

ASF GitHub Bot logged work on HDFS-16586:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/May/22 10:23
            Start Date: 24/May/22 10:23
    Worklog Time Spent: 10m 
      Work Description: hadoop-yetus commented on PR #4347:
URL: https://github.com/apache/hadoop/pull/4347#issuecomment-1135730139

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  9s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  41m 21s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   1m 31s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   1m  4s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 35s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 46s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   3m 43s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  28m 43s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  28m 11s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 227m 56s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4347/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 16s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 344m 49s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4347/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4347 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 5504327bf8bc 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 
17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / f3f74013f773e0dbe599a80cee462c73b230946a |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4347/2/testReport/ |
   | Max. process+thread count | 2138 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4347/2/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 773961)
    Time Spent: 2h 20m  (was: 2h 10m)

> Purge FsDatasetAsyncDiskService threadgroup; it causes 
> BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal 
> exception and exit' 
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16586
>                 URL: https://issues.apache.org/jira/browse/HDFS-16586
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.0, 3.2.3
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The below failed block finalize is causing a downstreamer's test to fail when 
> it uses hadoop 3.2.3 or 3.3.0+:
> {code:java}
> 2022-05-19T18:21:08,243 INFO  [Command processor] 
> impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica 
> FinalizedReplica, blk_1073741840_1016, FINALIZED
>   getNumBytes()     = 52
>   getBytesOnDisk()  = 52
>   getVisibleLength()= 52
>   getVolume()       = 
> /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
>   getBlockURI()     = 
> file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840
>  for deletion
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 
> (auth:SIMPLE)
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> top.TopAuditLogger(78): ------------------- logged event for top service: 
> allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete  
> src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp
>   dst=null  perm=null
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS 
> downstreamAckTimeNanos: 0 flag: 0
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
> 2022-05-19T18:21:08,243 ERROR [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1276): Command processor 
> encountered fatal exception and exit.
> java.lang.IllegalThreadStateException: null
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:430) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:704) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:525) ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2103)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1318)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1364)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1291)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1274)
>  ~[hadoop-hdfs-3.2.3.jar:?]
> 2022-05-19T18:21:08,243 DEBUG [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1036441577_16 at /127.0.0.1:54879 [Receiving block 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006]] 
> datanode.BlockReceiver(532): Receiving one packet for block 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006: PacketHeader with 
> packetLen=750 header data: offsetInBlock: 495104
> seqno: 902
> lastPacketInBlock: false
> dataLen: 7382022-05-19T18:21:08,243 WARN  [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1279): Ending command 
> processor service for: Thread[Command processor,5,FailOnTimeoutGroup] {code}
> Above, we are trying to run a FINALIZED 'command' asynchronously using a 
> thread from the volume's dedicated executor. The executor is trying to create 
> a thread because it has none-to-hand as it has not seen action in over 
> {color:#660e7a}THREADS_KEEP_ALIVE_SECONDS{color} (60seconds) so all live 
> threads have been let go (The volume executor as configured allows core 
> threads to die also). The thread creation fails because the 
> FsDatasetAsyncDiskService#threadGroup we are trying to associate the new 
> thread with is 'destroyed'. Threadgroups auto-'destroy' themselves after 
> their last member has 
> died|[https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846]|https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846].]
>  which can happen given the way this volume executor is configured. This 
> FsDatasetAsyncDiskService#threadGroup goes unexploited. If no thread group is 
> supplied, the executor makes a default that internally associates any created 
> threads with a dedicated executor 'threadgroup' but given how our executor is 
> configured to let all threads die in the name of economizing on total unning 
> threads, a threadgroup-for-each-volume-executor is likely to have the same 
> issue as this  FsDatasetAsyncDiskService#threadGroup one. Let me put up a PR 
> to remove it and undo the cause of the above ITSE.
> Of interest perhaps is that this IllegalThreadStateException is not new to 
> 3.2.3/3.0.0+. See the below from 3.2.2 when I run the same downstream test:
> {{2022-05-19T17:06:11,140 ERROR [BP-1924619834-10.0.0.192-1653005021952 
> heartbeating to localhost/127.0.0.1:54360] datanode.BPServiceActor(856): 
> Exception in BPOfferService for Block pool 
> BP-1924619834-10.0.0.192-1653005021952 (Datanode Uuid 
> bb5219c0-a444-4e5a-9455-6176ced7ffe7) service to localhost/127.0.0.1:54360}}
> {{java.lang.IllegalThreadStateException: null}}
> {{  at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:430) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:704) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:525) ~[?:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:105)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623)
>  ~[?:?]}}
> {{  at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]}}
> {{  at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:181)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:230)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2155)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2074)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:888)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:854)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at java.lang.Thread.run(Thread.java:829) ~[?:?]}}
> It too runs into the IllegalThreadStateException.
> The difference is HDFS-14997 which nicely made command processing async. It 
> changed how commands work (of course). Now on an ITSE, the processor shuts 
> down completely with ERROR. In my test case, the FINALIZED of the block fails 
> and the procedure waiting on its completion times out, in turn timing out the 
> test. In the old days, before HDFS-14997, the then synchronus heartbeats 
> would pick up the ITSE earlier and would just keep retrying in spite of ITSE 
> – the block finalize happened to complete and the test passed (I did not do 
> deep analysis of how command-processnig behavior has changed. If interest, 
> I'd be up for digging in).
> {color:#660e7a} {color}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to