[ 
https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370953#comment-17370953
 ] 

Daniel Zhi edited comment on HADOOP-17201 at 6/28/21, 11:43 PM:
----------------------------------------------------------------

Our tasks using StagingCommitter could be "stuck" in 
S3AFileSystem.deleteUnnecessaryFakeDirectories() for 300-500 seconds because 
the code retries with exponential delay (not a problem by itself) and the final 
exception is swallowed. The logic collects the O(depth) parent directories of an 
object path and calls s3.deleteObjects() on them. For example, putObjectDirect() of 
s3://bucket/05aee454/root/uuid/staging-uploads/_temporary/0/_temporary/file 
leads to a batch delete of the following 7 keys (see the sketch after the list):
 # {{s3://bucket/05aee454/root/uuid/staging-uploads/_temporary/0/_temporary/}}
 # {{s3://bucket/05aee454/root/uuid/staging-uploads/_temporary/0/}}
 # {{s3://bucket/05aee454/root/uuid/staging-uploads/_temporary/}}
 # {{s3://bucket/05aee454/root/uuid/staging-uploads/}}
 # {{s3://bucket/05aee454/root/uuid/}}
 # {{s3://bucket/05aee454/root/}}
 # {{s3://bucket/05aee454/}}
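
To make the pattern concrete, here is a minimal sketch (not the actual S3AFileSystem code) of how the O(depth) parent marker keys can be derived from an object key and bulk-deleted. It assumes the AWS SDK v1 AmazonS3 client; the class and method names below are illustrative only.

{code:java}
// Minimal sketch, NOT the real S3AFileSystem implementation: derive every
// ancestor "directory marker" key ("a/b/c/", "a/b/", ...) of an object key
// and issue one bulk delete for all of them.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;

import java.util.ArrayList;
import java.util.List;

public class FakeDirDeleteSketch {

  /** Collect one marker key per ancestor level of the given object key. */
  static List<String> parentMarkerKeys(String key) {
    List<String> markers = new ArrayList<>();
    int slash = key.lastIndexOf('/');
    while (slash > 0) {
      markers.add(key.substring(0, slash + 1));   // e.g. ".../_temporary/" then ".../0/" ...
      slash = key.lastIndexOf('/', slash - 1);
    }
    return markers;                               // O(depth) keys for one object
  }

  /** One bulk DELETE per write, issued whether or not the markers exist. */
  static void deleteUnnecessaryFakeDirectories(AmazonS3 s3, String bucket, String key) {
    List<DeleteObjectsRequest.KeyVersion> keys = new ArrayList<>();
    for (String marker : parentMarkerKeys(key)) {
      keys.add(new DeleteObjectsRequest.KeyVersion(marker));
    }
    s3.deleteObjects(new DeleteObjectsRequest(bucket).withKeys(keys));
  }
}
{code}

For the example path above, parentMarkerKeys() yields exactly the 7 keys listed.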

Such a batch delete happens on every putObjectDirect() and rename(), and the keys 
span multiple levels. Whether simply because there are too many such calls, or, 
more likely, because the deletes hit shared parent prefixes and therefore contend 
across parallel requests, they trigger a large number of SlowDown errors from S3. 
The retry with exponential delay (and the swallowed final exception) makes the 
task appear hung, making no progress for on the order of 500 seconds and showing 
no obvious failure at the end.
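
For illustration, here is a hedged sketch of the retry-and-swallow pattern described above (this is not the actual Invoker/S3ARetryPolicy code; the names are made up). With a handful of attempts and exponential backoff, a steady stream of 503 SlowDown responses keeps the thread sleeping for minutes, and swallowing the final exception means the caller never sees a failure.

{code:java}
// Hypothetical sketch of "retry with exponential delay, swallow the last error".
import com.amazonaws.AmazonServiceException;

public class RetrySketch {

  interface S3Call {
    void run() throws AmazonServiceException;
  }

  static void retryQuietly(S3Call call, int attempts, long baseDelayMs) {
    long delay = baseDelayMs;
    for (int i = 1; i <= attempts; i++) {
      try {
        call.run();
        return;
      } catch (AmazonServiceException e) {        // e.g. 503 SlowDown from S3
        if (i == attempts) {
          return;                                 // final exception swallowed: no visible failure
        }
        try {
          Thread.sleep(delay);                    // cumulative sleep can reach minutes
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return;
        }
        delay *= 2;                               // exponential backoff
      }
    }
  }
}
{code}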

Most of the fake parent paths ending with "/" do not actually exist in S3 
(except perhaps the immediate one created by createFakeDirectory). It is not 
clear to me what the exact purpose of deleteUnnecessaryFakeDirectories() is, but 
the impact is severe when S3 is used as the defaultFS (chosen for simplicity, to 
avoid managing HDFS, now that S3 is strongly consistent 
[https://aws.amazon.com/s3/consistency/]).
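
One quick way to check the observation above is to probe each parent "/" marker key directly. This is a rough sketch assuming the AWS SDK v1 client with default credentials; the class name is hypothetical.

{code:java}
// Probe every ancestor marker key of a given object key and report whether
// an object with that key actually exists in the bucket.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class MarkerProbe {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = args[0];
    String key = args[1];   // e.g. "05aee454/root/uuid/staging-uploads/_temporary/0/_temporary/file"
    for (int slash = key.lastIndexOf('/'); slash > 0; slash = key.lastIndexOf('/', slash - 1)) {
      String marker = key.substring(0, slash + 1);
      System.out.println(marker + " exists=" + s3.doesObjectExist(bucket, marker));
    }
  }
}
{code}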



> Spark job with s3acommitter stuck at the last stage
> ---------------------------------------------------
>
>                 Key: HADOOP-17201
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17201
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.1
>         Environment: we are on spark 2.4.5/hadoop 3.2.1 with s3a committer.
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
>            Reporter: Dyno
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log, 
> exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> usually our spark job takes 1 or 2 hours to finish; occasionally it runs for 
> more than 3 hours, and then we know it's stuck. The executor usually has a 
> stack like this
> {{
> "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 
> tid=0x00007f73e0005000 nid=0x12d waiting on condition [0x00007f74cb291000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>       at java.lang.Thread.sleep(Native Method)
>       at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
>       at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
>       at 
> org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
>       at 
> org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown
>  Source)
>       at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
>       at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
>       at 
> org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
>       at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
>       at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
>       at 
> org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
>       at 
> org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
>       at 
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
>       at 
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
>       at 
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
>       at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>       at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
>       at 
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
>       at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
>       at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
>       at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
>       at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
>       at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
>       at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>       at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>       at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
>       at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>       at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
>       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>       - <0x00000003a57332e0> (a 
> java.util.concurrent.ThreadPoolExecutor$Worker)
> }}
> I captured jstack on the stuck executors in case it's useful.


