[jira] [Commented] (HADOOP-17201) Spark job with s3acommitter stuck at the last stage
[ https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182187#comment-17182187 ]

Dyno commented on HADOOP-17201:
---

What I can confirm now:
- The bucket is versioned.
- HADOOP-17063 is also mine.
- The problem happens from time to time, but not every time.

Auth is unlikely to be the problem: we use an AWS access key that grants almost all permissions.

For the S3 bucket:
```
Effect: Allow
Action:
  - "s3:GetObject"
  - "s3:PutObject"
  - "s3:DeleteObject"
  - "s3:PutObjectAcl"
  - "s3:ListBucket"
  - "s3:ListBucketMultipartUploads"
  - "s3:ListMultipartUploadParts"
  - "s3:AbortMultipartUpload"
```
For S3Guard DynamoDB:
```
  Effect: Allow
  Action:
    - "dynamodb:List*"
    - "dynamodb:DescribeReservedCapacity*"
    - "dynamodb:DescribeLimits"
    - "dynamodb:DescribeTimeToLive"
  Resource: "*"
- Effect: Allow
  Action:
    - "dynamodb:BatchGet*"
    - "dynamodb:DescribeStream"
    - "dynamodb:DescribeTable"
    - "dynamodb:Get*"
    - "dynamodb:Query"
    - "dynamodb:Scan"
    - "dynamodb:BatchWrite*"
    - "dynamodb:CreateTable"
    - "dynamodb:Delete*"
    - "dynamodb:Update*"
    - "dynamodb:PutItem"
```
I won't be able to work on this in the next 2 weeks. I will try Hadoop 3.3.0 and see how it goes.

> Spark job with s3acommitter stuck at the last stage
> ---
>
> Key: HADOOP-17201
> URL: https://issues.apache.org/jira/browse/HADOOP-17201
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.2.1
> Environment: we are on spark 2.4.5/hadoop 3.2.1 with s3a committer.
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true' > spark.hadoop.fs.s3a.committer.name: magic >Reporter: Dyno >Priority: Major > Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log, > exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log > > > usually our spark job took 1 hour or 2 to finish, occasionally it runs for > more than 3 hour and then we know it's stuck and usually the executor has > stack like this > {{ > "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 > tid=0x7f73e0005000 nid=0x12d waiting on condition [0x7f74cb291000] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349) > at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751) > at > org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238) > at > org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown > Source) > at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109) > at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265) > at > org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source) > at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322) > at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261) > at > org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226) > at > org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271) > at > 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660) > at > org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521) > at > org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at > org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101) > at > org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64) > at > org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685) > at >
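The top frames of the quoted trace show the task thread sleeping inside a retry loop (`Invoker.retryUntranslated`) wrapped around `S3AFileSystem.deleteObjects`. Purely as an illustration of that pattern, here is a minimal Java sketch of a bounded retry with exponential backoff. The class `BoundedRetry`, its methods, and the delay values are hypothetical names chosen for this example; this is not Hadoop's `Invoker`, and S3A's actual retry behaviour is governed by its own configuration.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: hypothetical names, not Hadoop's Invoker. */
final class BoundedRetry {
    private BoundedRetry() {}

    /**
     * Calls the operation until it succeeds or maxAttempts is reached,
     * doubling the sleep between attempts.
     */
    static <T> T retry(int maxAttempts, Duration initialDelay, Callable<T> operation)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMillis = initialDelay.toMillis();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // give up: rethrow below instead of sleeping again
                }
                // While sleeping here, a thread dump reports TIMED_WAITING (sleeping).
                Thread.sleep(delayMillis);
                delayMillis *= 2;
            }
        }
        throw last;
    }

    /**
     * Demo helper: simulates an operation that fails a fixed number of times.
     * Returns the number of calls made, negated if the retry gave up.
     */
    static int countAttempts(int failuresBeforeSuccess, int maxAttempts) {
        AtomicInteger calls = new AtomicInteger();
        try {
            retry(maxAttempts, Duration.ofMillis(1), () -> {
                if (calls.incrementAndGet() <= failuresBeforeSuccess) {
                    throw new IOException("simulated transient failure");
                }
                return "ok";
            });
            return calls.get();
        } catch (Exception e) {
            return -calls.get();
        }
    }

    public static void main(String[] args) {
        System.out.println(countAttempts(2, 5));
        System.out.println(countAttempts(10, 3));
    }
}
```

With a bound on attempts, a call that keeps failing surfaces as an exception rather than as a thread that sleeps and retries for hours.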
[jira] [Updated] (HADOOP-17201) Spark job with s3acommitter stuck at the last stage
[ https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dyno updated HADOOP-17201: -- Summary: Spark job with s3acommitter stuck at the last stage (was: Spark job stuck at the last stage) > Spark job with s3acommitter stuck at the last stage > --- > > Key: HADOOP-17201 > URL: https://issues.apache.org/jira/browse/HADOOP-17201 > Project: Hadoop Common > Issue Type: Bug > Components: fs/s3 >Affects Versions: 3.2.1 > Environment: we are on spark 2.4.5/hadoop 3.2.1 with s3a committer. > spark.hadoop.fs.s3a.committer.magic.enabled: 'true' > spark.hadoop.fs.s3a.committer.name: magic >Reporter: Dyno >Priority: Major > Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log, > exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log > > > usually our spark job took 1 hour or 2 to finish, occasionally it runs for > more than 3 hour and then we know it's stuck and usually the executor has > stack like this > {{ > "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 > tid=0x7f73e0005000 nid=0x12d waiting on condition [0x7f74cb291000] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349) > at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751) > at > org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238) > at > org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown > Source) > at 
org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109) > at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265) > at > org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source) > at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322) > at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261) > at > org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226) > at > org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271) > at > org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660) > at > org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521) > at > org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at > org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101) > at > org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64) > at > org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122) > at > org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42) > at > org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57) > at > org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170) > at >
[jira] [Updated] (HADOOP-17201) Spark job stuck at the last stage
[ https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dyno updated HADOOP-17201: -- Description: usually our spark job took 1 hour or 2 to finish, occasionally it runs for more than 3 hour and then we know it's stuck and usually the executor has stack like this {{ "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 tid=0x7f73e0005000 nid=0x12d waiting on condition [0x7f74cb291000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349) at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285) at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457) at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717) at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785) at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751) at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238) at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source) at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109) at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265) at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source) at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322) at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261) at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226) at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271) at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660) at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521) at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101) at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64) at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685) at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122) at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42) at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57) at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x0003a57332e0> (a java.util.concurrent.ThreadPoolExecutor$Worker) }} captured jstack on the stuck executors in case it's useful. was: usually our
[jira] [Updated] (HADOOP-17201) Spark job stuck at the last stage
[ https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dyno updated HADOOP-17201:
--
Description:
Usually our Spark job takes an hour or two to finish. Occasionally it runs for more than 3 hours, at which point we know it is stuck; the executor usually has a stack like this:

{{
"Executor task launch worker for task 78384" #272 daemon prio=5 os_prio=0 tid=0x7f73e000d000 nid=0x134 waiting on condition [0x7f73601ef000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
    at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
    at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
    at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
    at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
    at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
    at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
    at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    <0x0003a573b988> (a java.util.concurrent.ThreadPoolExecutor$Worker)
}}

Captured jstack on the stuck executors in case it's useful.

was:
usually our spark job took 1 hour or 2 to finish, occasionally it runs for more than 3 hour and then we know it's stuck and
[jira] [Updated] (HADOOP-17201) Spark job stuck at the last stage
[ https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dyno updated HADOOP-17201: -- Description: usually our spark job took 1 hour or 2 to finish, occasionally it runs for more than 3 hour and then we know it's stuck and usually the executor has stack like this {{ "Executor task launch worker for task 78384" #272 daemon prio=5 os_prio=0 tid=0x7f73e000d000 nid=0x134 waiting on condition [0x7f73601ef000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349) at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285) at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457) at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717) at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785) at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751) at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238) at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source) at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109) at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265) at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source) at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322) at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261) at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226) at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271) at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660) at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521) at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101) at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64) at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685) at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122) at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42) at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57) at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x0003a573b988> (a java.util.concurrent.ThreadPoolExecutor$Worker) }} captured jstack on the stuck executors in case it's useful. was: usually our
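For triaging jstack captures like the attached exec-*.log files, it can help to list which threads contain a given frame, for example `S3AFileSystem.deleteObjects`. The following is a minimal sketch, assuming the usual HotSpot thread-dump layout seen in the trace above (a quoted thread name on the header line, followed by `at ...` frames). `StuckThreadFinder` and `threadsContaining` are hypothetical names for this example, not part of Hadoop or Spark.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: scans a jstack-style dump for threads containing a frame. */
final class StuckThreadFinder {
    private StuckThreadFinder() {}

    /** Returns the names of threads whose stack has a frame containing frameFragment. */
    static List<String> threadsContaining(String dump, String frameFragment) {
        List<String> matches = new ArrayList<>();
        String currentThread = null;
        boolean alreadyMatched = false;
        for (String line : dump.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"")) {
                // Thread header line: "name" #id daemon prio=...
                int closingQuote = trimmed.indexOf('"', 1);
                currentThread = closingQuote > 0
                        ? trimmed.substring(1, closingQuote)
                        : trimmed;
                alreadyMatched = false;
            } else if (currentThread != null
                    && !alreadyMatched
                    && trimmed.startsWith("at ")
                    && trimmed.contains(frameFragment)) {
                // Record each thread once, on its first matching frame.
                matches.add(currentThread);
                alreadyMatched = true;
            }
        }
        return matches;
    }

    /** Usage: java StuckThreadFinder <dump-file> <frame-fragment> */
    public static void main(String[] args) throws IOException {
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String name : threadsContaining(dump, args[1])) {
            System.out.println(name);
        }
    }
}
```

Running it over several dumps taken a few minutes apart would show whether the same task threads stay under the same frame, which is one way to distinguish a slow retry from a genuine hang.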
[jira] [Created] (HADOOP-17201) Spark job stuck at the last stage
Dyno created HADOOP-17201:
-
Summary: Spark job stuck at the last stage
Key: HADOOP-17201
URL: https://issues.apache.org/jira/browse/HADOOP-17201
Project: Hadoop Common
Issue Type: Bug
Components: fs/s3
Affects Versions: 3.2.1
Environment: we are on spark 2.4.5/hadoop 3.2.1 with s3a committer.
    spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
    spark.hadoop.fs.s3a.committer.name: magic
Reporter: Dyno
Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log, exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log

Usually our Spark job takes an hour or two to finish. Occasionally it runs for more than 3 hours, at which point we know it is stuck; the executor usually has a stack like this:

```
"Executor task launch worker for task 78384" #272 daemon prio=5 os_prio=0 tid=0x7f73e000d000 nid=0x134 waiting on condition [0x7f73601ef000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
    at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
    at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
    at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
    at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
    at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
    at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
    at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at
```
[jira] [Commented] (HADOOP-17063) S3A deleteObjects hanging/retrying forever
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134609#comment-17134609 ]

Dyno commented on HADOOP-17063:
-------------------------------

Switching to the magic committer looks like it works. Thanks for your help, [~ste...@apache.org].

> S3A deleteObjects hanging/retrying forever
> ------------------------------------------
>
> Key: HADOOP-17063
> URL: https://issues.apache.org/jira/browse/HADOOP-17063
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.2.1
> Environment: hadoop 3.2.1
> spark 2.4.4
> Reporter: Dyno
> Priority: Minor
> Attachments: jstack_exec-34.log, jstack_exec-40.log, jstack_exec-74.log
>
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
> com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> org.apache.spark.scheduler.Task.run(Task.scala:123)
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
>
> We are using Spark 2.4.4 with Hadoop 3.2.1 on Kubernetes/spark-operator. Sometimes we see this hang with the stack trace above. It looks like putObject never returns; we have to kill the executor to make the job move forward.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-17063) S3A deleteObjects hanging/retrying forever
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133778#comment-17133778 ]

Dyno commented on HADOOP-17063:
-------------------------------

We do have S3Guard turned on. Let me try whether the magic committer works on our setup.

core-site.xml
{noformat}
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
  <description>If true, the S3A client will create the table if it does not already exist.</description>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.region</name>
  <value>us-east-1</value>
</property>
{noformat}
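The S3Guard settings in the comment above are given as core-site.xml properties. As a minimal sketch, the same keys could instead be passed through Spark by adding the `spark.hadoop.` prefix, which is the form the HADOOP-17201 environment uses for its committer keys. The helper name `to_spark_conf` is hypothetical and only illustrates the key mapping; the property names and values are the ones quoted in the comment.

```python
# Hadoop properties copied from the core-site.xml snippet in the comment above.
S3GUARD_PROPS = {
    "fs.s3a.metadatastore.impl": "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore",
    "fs.s3a.s3guard.ddb.table.create": "true",
    "fs.s3a.s3guard.ddb.region": "us-east-1",
}


def to_spark_conf(hadoop_props):
    """Hypothetical helper: prefix each Hadoop key with 'spark.hadoop.'."""
    return {"spark.hadoop." + key: value for key, value in hadoop_props.items()}


if __name__ == "__main__":
    for key, value in sorted(to_spark_conf(S3GUARD_PROPS).items()):
        print(f"{key}: '{value}'")
```

This only shows how the keys line up; it does not check that the DynamoDB table or region exists.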
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132746#comment-17132746 ]

Dyno commented on HADOOP-17063:
-------------------------------

> important: you are using the classic FileOutputCommitter; this is slow and
> unsafe on s3. Can you move to the S3a zero rename committer?

{noformat}
fs.s3a.committer.name   Committer
directory               directory staging committer
partitioned             partition staging committer (for use in Spark only)
magic                   the “magic” committer
file                    the original and unsafe File committer; (default)
{noformat}

Our setup is Kubernetes / Spark 2.4.4 / Hadoop 3.2.1. According to
https://hadoop.apache.org/docs/r3.2.1/hadoop-aws/tools/hadoop-aws/committers.html
https://github.com/aws-samples/eks-spark-benchmark/blob/master/performance/s3.md
directory/partitioned needs shared storage and magic is only supported in Spark 3.0, so I think the only option for us is file.

When you say "S3a zero rename committer", do you mean the directory one?

> S3ABlockOutputStream.putObject looks stuck and never timeout
> ------------------------------------------------------------
>
> Key: HADOOP-17063
> URL: https://issues.apache.org/jira/browse/HADOOP-17063
> Project: Hadoop Common
> Issue Type: Sub-task
> Affects Versions: 3.2.1
> Environment: hadoop 3.2.1
> spark 2.4.4
> Reporter: Dyno
> Priority: Minor
> Attachments: jstack_exec-34.log, jstack_exec-40.log, jstack_exec-74.log
>
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
> com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> org.apache.spark.scheduler.Task.run(Task.scala:123)
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
>
> We are using Spark 2.4.4 with Hadoop 3.2.1 on Kubernetes/spark-operator. Sometimes we see this hang with the stack trace above. It looks like putObject never returns; we have to kill the executor to make the job move forward.
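The committer choice discussed in the comment above comes down to a couple of configuration keys. The following is a minimal sketch, assuming the four committer names from the quoted `fs.s3a.committer.name` listing and the two `spark.hadoop.` keys shown in the HADOOP-17201 environment. The helpers `committer_conf` and `as_submit_args` are hypothetical names used for illustration. Whether a given Spark build honors these keys depends on its S3A committer bindings, which this sketch does not cover.

```python
# Committer names taken from the fs.s3a.committer.name listing quoted above.
KNOWN_COMMITTERS = {"directory", "partitioned", "magic", "file"}


def committer_conf(name):
    """Hypothetical helper: build Spark conf entries that select an S3A committer."""
    if name not in KNOWN_COMMITTERS:
        raise ValueError(f"unknown committer: {name!r}")
    conf = {"spark.hadoop.fs.s3a.committer.name": name}
    if name == "magic":
        # Key and value as listed in the HADOOP-17201 environment.
        conf["spark.hadoop.fs.s3a.committer.magic.enabled"] = "true"
    return conf


def as_submit_args(conf):
    """Render the entries as spark-submit style --conf arguments, sorted by key."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(as_submit_args(committer_conf("magic"))))
```

Rejecting unknown names up front is a design choice of this sketch; it keeps a typo in the committer name from silently falling back to some other behavior.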
[jira] [Comment Edited] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129802#comment-17129802 ]

Dyno edited comment on HADOOP-17063 at 6/9/20, 9:03 PM:
--------------------------------------------------------

It happened again; I attached the jstack output. Thanks for looking into it.

I was trying to implement the change you suggested, but the test instructions are not quite clear. Is it enough to run the tests under hadoop-tools/hadoop-aws/?

was (Author: fu):
It happened again; I attached the jstack output. Thanks for looking into it.

I was trying to implement the change you suggested, but the test instructions are not quite clear.
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129802#comment-17129802 ]

Dyno commented on HADOOP-17063:
-------------------------------

It happened again; I attached the jstack output. Thanks for looking into it.

I was trying to implement the change you suggested, but the test instructions are not quite clear.
[jira] [Updated] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dyno updated HADOOP-17063:
--------------------------
Attachment: jstack_exec-74.log
            jstack_exec-40.log
            jstack_exec-34.log
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125413#comment-17125413 ]

Dyno commented on HADOOP-17063:
-------------------------------

Log in the Spark executor:

Jun 3, 2020 @ 22:57:23.032 2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
Jun 3, 2020 @ 22:57:23.032 2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...105
Jun 3, 2020 @ 22:57:23.032 2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
Jun 3, 2020 @ 22:57:22.973 2020-06-03 22:57:22,973 INFO util.ShutdownHookManager: Deleting directory /var/data/spark-ff208630-1fc7-48bc-93a1-6bdf94921c64/spark-eb4613f4-a41a-4985-845b-34b58ae95c50
Jun 3, 2020 @ 22:57:22.972 2020-06-03 22:57:22,972 INFO util.ShutdownHookManager: Shutdown hook called
Jun 3, 2020 @ 22:57:22.964 2020-06-03 22:57:22,964 INFO storage.DiskBlockManager: Shutdown hook called
Jun 3, 2020 @ 22:57:22.960 2020-06-03 22:57:22,960 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM <-- kill to make it move forward
Jun 3, 2020 @ 21:52:40.550 2020-06-03 21:52:40,549 INFO executor.Executor: Finished task 2927.0 in stage 12.0 (TID 31771). 4696 bytes result sent to driver
Jun 3, 2020 @ 21:52:40.541 2020-06-03 21:52:40,541 INFO output.FileOutputCommitter: Saved output of task 'attempt_20200603213232_0012_m_002927_31771' to s3a://com
Jun 3, 2020 @ 21:52:40.541 2020-06-03 21:52:40,541 INFO mapred.SparkHadoopMapRedUtil: attempt_20200603213232_0012_m_002927_31771: Committed
Jun 3, 2020 @ 21:52:34.972 2020-06-03 21:52:34,971 INFO executor.Executor: Finished task 2922.0 in stage 12.0 (TID 31766). 4696 bytes result sent to driver
Jun 3, 2020 @ 21:52:34.963 2020-06-03 21:52:34,962 INFO output.FileOutputCommitter: Saved out ...
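The log above runs newest entry first, so the length of the hang can be read off as the gap between the last "Finished task" line and the SIGTERM line. A small worked calculation follows, assuming no log lines were omitted between those two entries and using the log4j-style timestamps (comma before the milliseconds) exactly as they appear in the excerpt. The constant and function names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the executor log excerpt.
LAST_TASK_FINISHED = "2020-06-03 21:52:40,549"  # "Finished task 2927.0 in stage 12.0"
SIGTERM_RECEIVED = "2020-06-03 22:57:22,960"    # "RECEIVED SIGNAL TERM"


def parse_log_time(stamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS,mmm' timestamp as used in the log lines."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


idle = parse_log_time(SIGTERM_RECEIVED) - parse_log_time(LAST_TASK_FINISHED)

if __name__ == "__main__":
    print(idle)  # 1:04:42.411000
```

Under those assumptions the executor logged nothing for roughly 1 hour and 5 minutes before it was killed.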
> S3ABlockOutputStream.putObject looks stuck and never timeout
> ------------------------------------------------------------
>
> Key: HADOOP-17063
> URL: https://issues.apache.org/jira/browse/HADOOP-17063
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 3.2.1
> Environment: hadoop 3.2.1
> spark 2.4.4
> Reporter: Dyno
> Priority: Major
>
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
> com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> org.apache.spark.scheduler.Task.run(Task.scala:123)
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>
[jira] [Created] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
Dyno created HADOOP-17063:
--------------------------

Summary: S3ABlockOutputStream.putObject looks stuck and never timeout
Key: HADOOP-17063
URL: https://issues.apache.org/jira/browse/HADOOP-17063
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 3.2.1
Environment: hadoop 3.2.1
spark 2.4.4
Reporter: Dyno

{code}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:123)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
{code}

We are using Spark 2.4.4 with Hadoop 3.2.1 on Kubernetes/spark-operator. Sometimes we see this hang with the stack trace above. It looks like putObject never returns; we have to kill the executor to make the job move forward.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
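Since the report relies on spotting a particular frame in a thread dump before killing the executor, here is a minimal sketch of how that check could be scripted. It assumes a plain-text jstack dump in which each thread block starts with a quoted thread name and blocks are separated by blank lines. The marker list contains only frames that appear in the traces in this archive, and the function name `find_stuck_s3a_threads` is hypothetical.

```python
import re

# Frames seen in the hung traces in this archive.
S3A_MARKERS = (
    "org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject",
    "org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects",
)


def find_stuck_s3a_threads(dump):
    """Return (thread name, marker frame) pairs for threads whose stack has a marker."""
    hits = []
    # Assumption: thread blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump):
        name = re.match(r'\s*"([^"]+)"', block)
        if name is None:
            continue  # not a thread block (for example the dump header)
        for marker in S3A_MARKERS:
            if marker in block:
                hits.append((name.group(1), marker))
                break
    return hits
```

A match only shows that a thread is inside one of these calls at the moment of the dump; comparing two dumps taken some minutes apart would be needed to tell a hang from a slow request.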