[jira] [Commented] (HADOOP-17201) Spark job with s3acommitter stuck at the last stage

2020-08-21 Thread Dyno (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182187#comment-17182187
 ] 

Dyno commented on HADOOP-17201:
---

What I can confirm now:
- the bucket is versioned.
- HADOOP-17063 is also mine.
- the problem happens from time to time, but not every time.

Auth is unlikely to be the problem: we use an AWS access key that grants almost all permissions.

For the S3 bucket:
```
- Effect: Allow
  Action:
    - "s3:GetObject"
    - "s3:PutObject"
    - "s3:DeleteObject"
    - "s3:PutObjectAcl"
    - "s3:ListBucket"
    - "s3:ListBucketMultipartUploads"
    - "s3:ListMultipartUploadParts"
    - "s3:AbortMultipartUpload"
```

For the S3Guard DynamoDB table:
```
- Effect: Allow
  Action:
    - "dynamodb:List*"
    - "dynamodb:DescribeReservedCapacity*"
    - "dynamodb:DescribeLimits"
    - "dynamodb:DescribeTimeToLive"
  Resource: "*"
- Effect: Allow
  Action:
    - "dynamodb:BatchGet*"
    - "dynamodb:DescribeStream"
    - "dynamodb:DescribeTable"
    - "dynamodb:Get*"
    - "dynamodb:Query"
    - "dynamodb:Scan"
    - "dynamodb:BatchWrite*"
    - "dynamodb:CreateTable"
    - "dynamodb:Delete*"
    - "dynamodb:Update*"
    - "dynamodb:PutItem"
```
I won't be able to work on this in the next two weeks; I will try Hadoop 3.3.0 and see how it goes.
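
Two quick checks that might help narrow this down on a versioned bucket while a job is stuck (a sketch only; bucket and prefix names are placeholders):

```
# hypothetical sanity checks; bucket/prefix names are placeholders
# 1. are stuck commits leaving multipart uploads behind?
aws s3api list-multipart-uploads --bucket my-bucket

# 2. on a versioned bucket, every DeleteObject call adds a delete marker,
#    including the deleteObjects retries visible in the stack below
aws s3api list-object-versions --bucket my-bucket --prefix spark/output/ \
  --query 'length(DeleteMarkers)'
```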

> Spark job with s3acommitter stuck at the last stage
> ---
>
> Key: HADOOP-17201
> URL: https://issues.apache.org/jira/browse/HADOOP-17201
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs/s3
>Affects Versions: 3.2.1
> Environment: We are on Spark 2.4.5 / Hadoop 3.2.1 with the S3A committer.
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
>Reporter: Dyno
>Priority: Major
> Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log, 
> exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log
>
>
> Usually our Spark job takes an hour or two to finish; occasionally it runs
> for more than three hours, which is how we know it is stuck. The stuck
> executor usually has a stack trace like this:
> {{
> "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 
> tid=0x7f73e0005000 nid=0x12d waiting on condition [0x7f74cb291000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
>   at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
>   at 
> org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
>   at 
> org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown
>  Source)
>   at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
>   at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
>   at 
> org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
>   at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
>   at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
>   at 
> org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
>   at 
> org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
>   at 
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
>   at 
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
>   at 
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
>   at 
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
>   at 
> 

[jira] [Updated] (HADOOP-17201) Spark job with s3acommitter stuck at the last stage

2020-08-10 Thread Dyno (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dyno updated HADOOP-17201:
--
Summary: Spark job with s3acommitter stuck at the last stage  (was: Spark 
job stuck at the last stage)


[jira] [Updated] (HADOOP-17201) Spark job stuck at the last stage

2020-08-10 Thread Dyno (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dyno updated HADOOP-17201:
--
Description: 
Usually our Spark job takes an hour or two to finish; occasionally it runs for more than three hours, which is how we know it is stuck. The stuck executor usually has a stack trace like this:

{{
"Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 
tid=0x7f73e0005000 nid=0x12d waiting on condition [0x7f74cb291000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown
 Source)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at 
org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at 
org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
at 
org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x0003a57332e0> (a 
java.util.concurrent.ThreadPoolExecutor$Worker)


}}
I captured jstack on the stuck executors, in case it's useful.
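
For anyone reproducing this on Kubernetes, a sketch of how such a jstack can be captured (pod name, JVM pid, and log name are placeholders; this assumes a JDK with jstack inside the executor image):

{noformat}
# placeholders: executor pod name and the JVM pid inside the container
kubectl exec <executor-pod> -- jstack <java-pid> > exec-N.log
{noformat}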

  was:
usually our 

[jira] [Updated] (HADOOP-17201) Spark job stuck at the last stage

2020-08-10 Thread Dyno (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dyno updated HADOOP-17201:
--
Description: 
Usually our Spark job takes an hour or two to finish; occasionally it runs for more than three hours, which is how we know it is stuck. The stuck executor usually has a stack trace like this:

{{
"Executor task launch worker for task 78384" #272 daemon prio=5 os_prio=0 tid=0x7f73e000d000 nid=0x134 waiting on condition [0x7f73601ef000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x0003a573b988> (a java.util.concurrent.ThreadPoolExecutor$Worker)
}}

I captured jstack on the stuck executors, in case it's useful.

  was:
usually our spark job took 1 hour or 2 to finish, occasionally it runs for more 
than 3 hour and then we know it's stuck and 

[jira] [Updated] (HADOOP-17201) Spark job stuck at the last stage

2020-08-10 Thread Dyno (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dyno updated HADOOP-17201:
--
Description: 
Usually our Spark job takes an hour or two to finish; occasionally it runs for more than three hours, which is how we know it is stuck. The stuck executor usually has a stack trace like this:



{{
"Executor task launch worker for task 78384" #272 daemon prio=5 os_prio=0 
tid=0x7f73e000d000 nid=0x134 waiting on condition [0x7f73601ef000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown
 Source)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at 
org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at 
org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
at 
org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x0003a573b988> (a 
java.util.concurrent.ThreadPoolExecutor$Worker)
}}

I captured jstack on the stuck executors, in case it's useful.

  was:
usually our 

[jira] [Updated] (HADOOP-17201) Spark job stuck at the last stage

2020-08-10 Thread Dyno (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dyno updated HADOOP-17201:
--
Description: 
Usually our Spark job takes an hour or two to finish; occasionally it runs for more than three hours, which is how we know it is stuck. The stuck executor usually has a stack trace like this:

{{
"Executor task launch worker for task 78384" #272 daemon prio=5 os_prio=0 tid=0x7f73e000d000 nid=0x134 waiting on condition [0x7f73601ef000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x0003a573b988> (a java.util.concurrent.ThreadPoolExecutor$Worker)
}}

I captured jstack on the stuck executors, in case it's useful.

  was:
usually our spark job took 1 hour or 2 to finish, occasionally it runs for more 
than 3 hour and then we know it's stuck and usually 

[jira] [Created] (HADOOP-17201) Spark job stuck at the last stage

2020-08-10 Thread Dyno (Jira)
Dyno created HADOOP-17201:
-

 Summary: Spark job stuck at the last stage
 Key: HADOOP-17201
 URL: https://issues.apache.org/jira/browse/HADOOP-17201
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs/s3
Affects Versions: 3.2.1
 Environment: We are on Spark 2.4.5 / Hadoop 3.2.1 with the S3A committer.
spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
spark.hadoop.fs.s3a.committer.name: magic
Reporter: Dyno
 Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log, 
exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log

Usually our Spark job takes an hour or two to finish; occasionally it runs for more than three hours, which is how we know it is stuck. The stuck executor usually has a stack trace like this:


```
"Executor task launch worker for task 78384" #272 daemon prio=5 os_prio=0 
tid=0x7f73e000d000 nid=0x134 waiting on condition [0x7f73601ef000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown
 Source)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at 
org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at 
org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
at 
org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at 

[jira] [Commented] (HADOOP-17063) S3A deleteObjects hanging/retrying forever

2020-06-12 Thread Dyno (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134609#comment-17134609
 ] 

Dyno commented on HADOOP-17063:
---

Switching to the magic committer looks to be working. Thanks for your help, [~ste...@apache.org].

> S3A deleteObjects hanging/retrying forever
> --
>
> Key: HADOOP-17063
> URL: https://issues.apache.org/jira/browse/HADOOP-17063
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 3.2.1
> Environment: hadoop 3.2.1
> spark 2.4.4
>  
>Reporter: Dyno
>Priority: Minor
> Attachments: jstack_exec-34.log, jstack_exec-40.log, 
> jstack_exec-74.log
>
>
> {code}
> sun.misc.Unsafe.park(Native Method) 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523) 
> com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
>  
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
>  
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
>  
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>  org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101) 
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
>  org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685) 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
>  
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
>  
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
>  
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
>  
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
>  
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
>  
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>  
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>  
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
>  
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
> org.apache.spark.scheduler.Task.run(Task.scala:123) 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  java.lang.Thread.run(Thread.java:748)
> {code}
>  
> We are using Spark 2.4.4 with Hadoop 3.2.1 on Kubernetes (spark-operator);
> sometimes we see this hang, with the stack trace above. It looks like
> putObject never returns, and we have to kill the executor to make the job
> move forward.
>  






[jira] [Commented] (HADOOP-17063) S3A deleteObjects hanging/retrying forever

2020-06-11 Thread Dyno (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133778#comment-17133778
 ] 

Dyno commented on HADOOP-17063:
---

We do have S3Guard turned on; let me try whether the magic committer works on our setup.

core-site.xml

{noformat}
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
  <description>If true, the S3A client will create the table if it does not
  already exist.</description>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.region</name>
  <value>us-east-1</value>
</property>
{noformat}
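
A sketch of the magic-committer switch to be tried, using the standard fs.s3a committer keys alongside the S3Guard settings above (hypothetical values, not from the original message):

{noformat}
<!-- hypothetical sketch: enable the magic committer -->
<property>
  <name>fs.s3a.committer.name</name>
  <value>magic</value>
</property>
<property>
  <name>fs.s3a.committer.magic.enabled</name>
  <value>true</value>
</property>
{noformat}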








[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout

2020-06-10 Thread Dyno (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132746#comment-17132746
 ] 

Dyno commented on HADOOP-17063:
---

> important: you are using the classic FileOutputCommitter; this is slow and 
> unsafe on s3. Can you move to the S3a zero rename committer?

{noformat}
fs.s3a.committer.name  Committer
---------------------  ---------
directory              directory staging committer
partitioned            partition staging committer (for use in Spark only)
magic                  the “magic” committer
file                   the original and unsafe File committer (default)
{noformat}
Our setup is Kubernetes / Spark 2.4.4 / Hadoop 3.2.1.

According to
https://hadoop.apache.org/docs/r3.2.1/hadoop-aws/tools/hadoop-aws/committers.html
and
https://github.com/aws-samples/eks-spark-benchmark/blob/master/performance/s3.md,
directory/partitioned needs shared storage, and magic is only supported in
Spark 3.0, so I think the only option for us is file. When you say "S3a zero
rename committer", do you mean the directory one?








[jira] [Comment Edited] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout

2020-06-09 Thread Dyno (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129802#comment-17129802
 ] 

Dyno edited comment on HADOOP-17063 at 6/9/20, 9:03 PM:


It happened again; I have attached the jstack. Thanks for looking into it.

I was trying to implement the change you suggested, but the test instructions
do not look quite clear.

Is it enough to run the tests under hadoop-tools/hadoop-aws/?
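
For reference, a sketch of the usual way to run that module's integration tests, assuming a test bucket is configured in src/test/resources/auth-keys.xml as the hadoop-aws testing documentation describes:

{noformat}
# assumes auth-keys.xml points at a test bucket you own
cd hadoop-tools/hadoop-aws
mvn clean verify -Dparallel-tests -DtestsThreadCount=8
{noformat}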


was (Author: fu):
It happened again; I have attached the jstack. Thanks for looking into it.

I was trying to implement the change you suggested, but the test instructions
do not look quite clear.







[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout

2020-06-09 Thread Dyno (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129802#comment-17129802
 ] 

Dyno commented on HADOOP-17063:
---

It happened again; I have attached the jstack. Thanks for looking into it.

I was trying to implement the change you suggested, but the test instructions
do not look quite clear.







[jira] [Updated] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout

2020-06-09 Thread Dyno (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dyno updated HADOOP-17063:
--
Attachment: jstack_exec-74.log
jstack_exec-40.log
jstack_exec-34.log

> S3ABlockOutputStream.putObject looks stuck and never timeout
> 
>
> Key: HADOOP-17063
> URL: https://issues.apache.org/jira/browse/HADOOP-17063
> Project: Hadoop Common
>  Issue Type: Sub-task
>Affects Versions: 3.2.1
> Environment: hadoop 3.2.1
> spark 2.4.4
>  
>Reporter: Dyno
>Priority: Minor
> Attachments: jstack_exec-34.log, jstack_exec-40.log, 
> jstack_exec-74.log
>
>
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
> com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> org.apache.spark.scheduler.Task.run(Task.scala:123)
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
>  
> we are using spark 2.4.4 with hadoop 3.2.1 on kubernetes/spark-operator, and
> sometimes we see this hang with the stack trace above. it looks like
> putObject never returns; we have to kill the executor to make the job move
> forward.
>  
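The top frames explain why this never times out on its own: the close() path parks in an untimed Guava AbstractFuture.get(), so no timeout can ever fire. A minimal, self-contained Java sketch of the difference between an untimed and a timed get() (illustrative only, not the S3A code itself):

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class UntimedGetDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for an upload that never completes.
        Future<String> upload = pool.submit(() -> {
            Thread.sleep(Long.MAX_VALUE);
            return "done";
        });
        // upload.get() would park this thread forever -- which is what the
        // sun.misc.Unsafe.park / AbstractFuture.get frames above show.
        try {
            // A timed get() at least surfaces the hang as an exception.
            upload.get(30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.err.println("upload still pending after 30s; cancelling");
            upload.cancel(true);
        }
        pool.shutdownNow();
    }
}
{code}

In the demo the caller gets a TimeoutException it can act on; with the untimed variant the only recourse is exactly what is described above: killing the executor from outside.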






[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout

2020-06-03 Thread Dyno (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125413#comment-17125413
 ] 

Dyno commented on HADOOP-17063:
---

log from the spark executor (newest entries first):

2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
2020-06-03 22:57:22,973 INFO util.ShutdownHookManager: Deleting directory /var/data/spark-ff208630-1fc7-48bc-93a1-6bdf94921c64/spark-eb4613f4-a41a-4985-845b-34b58ae95c50
2020-06-03 22:57:22,972 INFO util.ShutdownHookManager: Shutdown hook called
2020-06-03 22:57:22,964 INFO storage.DiskBlockManager: Shutdown hook called
2020-06-03 22:57:22,960 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM  <-- kill to make it move forward
2020-06-03 21:52:40,549 INFO executor.Executor: Finished task 2927.0 in stage 12.0 (TID 31771). 4696 bytes result sent to driver
2020-06-03 21:52:40,541 INFO output.FileOutputCommitter: Saved output of task 'attempt_20200603213232_0012_m_002927_31771' to s3a://com
2020-06-03 21:52:40,541 INFO mapred.SparkHadoopMapRedUtil: attempt_20200603213232_0012_m_002927_31771: Committed
2020-06-03 21:52:34,971 INFO executor.Executor: Finished task 2922.0 in stage 12.0 (TID 31766). 4696 bytes result sent to driver
2020-06-03 21:52:34,962 INFO output.FileOutputCommitter: Saved out ...
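Note the gap: the last commit is at 21:52 and the next line is the 22:57 SIGTERM, so the executor log alone shows nothing about what the S3A client was doing during the stuck hour. It may help to capture the next occurrence with client-side request logging turned up; a sketch of the log4j.properties entries (assumed standard S3A / AWS SDK logger names; adjust to your own logging setup):

{code}
# illustrative log4j.properties additions -- verbose, for diagnosis only
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
log4j.logger.com.amazonaws.request=DEBUG
{code}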

> S3ABlockOutputStream.putObject looks stuck and never timeout
> 
>
> Key: HADOOP-17063
> URL: https://issues.apache.org/jira/browse/HADOOP-17063
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 3.2.1
> Environment: hadoop 3.2.1
> spark 2.4.4
>  
>Reporter: Dyno
>Priority: Major
>
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
> com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> org.apache.spark.scheduler.Task.run(Task.scala:123)
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)

[jira] [Created] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout

2020-06-03 Thread Dyno (Jira)
Dyno created HADOOP-17063:
-

 Summary: S3ABlockOutputStream.putObject looks stuck and never timeout
 Key: HADOOP-17063
 URL: https://issues.apache.org/jira/browse/HADOOP-17063
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 3.2.1
 Environment: hadoop 3.2.1
spark 2.4.4
Reporter: Dyno


{code}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:123)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
{code}

 
we are using spark 2.4.4 with hadoop 3.2.1 on kubernetes/spark-operator, and
sometimes we see this hang with the stack trace above. it looks like
putObject never returns; we have to kill the executor to make the job move
forward.
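When an executor wedges like this, a full thread dump is the most useful artifact to attach (as the jstack_exec-*.log files later were). If running jstack inside the pod is awkward, the same information can be pulled in-process through the standard JMX thread bean; a minimal sketch of such a helper (illustrative only, not part of Spark or Hadoop):

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    /** Print the name, state, and full stack of every live thread to stderr. */
    public static void dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.err.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.err.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dumpAllThreads();
    }
}
{code}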
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org