[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142932#comment-17142932 ]
Jim Huang commented on SPARK-31995:
-----------------------------------

Thank you for providing a helpful perspective. I was able to locate HDFS-11486 using the search string you provided, and it was "resolved in (Hadoop) 2.7.4+." I agree with you that the HDFS-11486 fix will definitely improve the HDFS replica exception handling. This issue is pretty unique, and I am not sure I am equipped to create and induce such a rare corner case. Spark 3.0.0 was just released this past week, and I will need additional application development time to migrate to the Spark 3.x (Delta 0.7.0+) ecosystem; I will be able to upgrade to Spark 2.4.6 sooner.

From the Spark Structured Streaming application continuity perspective, the thread that ran this task was terminated with an ERROR, but to YARN it is still an active running job, even though my Spark Structured Streaming job is no longer doing any processing. If monitoring of the Spark Structured Streaming job is done only from the YARN job perspective, it may report a false status. In this situation, should the Spark Structured Streaming application fail hard and completely (failed by the Spark framework or by application exception handling)? Or should I investigate and develop a monitoring implementation with the right level of specificity to detect Spark Structured Streaming task-level failures (a minimal listener sketch appears below the quoted stack trace)? Any references on these topics are much appreciated.

> Spark Structured Streaming CheckpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have enough number of replicas
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31995
>                 URL: https://issues.apache.org/jira/browse/SPARK-31995
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.5
>         Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop
> Hadoop 2.7.3 - YARN cluster
> delta-core_2.11:0.6.1
>            Reporter: Jim Huang
>            Priority: Major
>
> I am using Spark 2.4.5's Structured Streaming with a Delta table (0.6.1) as the sink, running in a YARN cluster on Hadoop 2.7.3. I have been using Spark Structured Streaming for several months in this runtime environment, until this new corner case left my Spark Structured Streaming job in a partially working state.
>
> I have included the ERROR message and stack trace. I did a quick search using the string "MicroBatchExecution: Query terminated with error" but did not find any existing Jira that looks like my stack trace.
>
> Based on a naive look at this error message and stack trace, is it possible for Spark's CheckpointFileManager to handle this HDFS exception better, simply by waiting a little longer for HDFS's pipeline to complete the replicas?
>
> Being new to this code, where can I find the configuration parameter that sets the replica count for the `streaming.HDFSMetadataLog`? I am just trying to understand whether the current code already provides some holistic configuration tuning variable(s) to handle this IOException more gracefully. Hopefully experts can provide some pointers or directions.
>
> {code:java}
> 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = yarn-job-id-redacted, runId = run-id-redacted] terminated with error
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
>   at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511)
>   at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472)
>   at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437)
>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>   at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145)
>   at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
>   at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
>   at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>   at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:545)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
>   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
> {code}
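
On the question in the quoted description about a replica-count setting for `streaming.HDFSMetadataLog`: as far as I can tell the metadata log has no replica setting of its own; it simply writes through the Hadoop FileSystem of the checkpoint location, so the relevant knobs are HDFS client properties, which Spark forwards via the `spark.hadoop.*` prefix. The sketch below is illustrative only; the specific values are assumptions, not recommendations.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: streaming.HDFSMetadataLog writes through the checkpoint directory's Hadoop
// FileSystem, so block replication is governed by HDFS client settings, not by Spark.
// Any "spark.hadoop.*" entry is copied into the Hadoop Configuration that FileSystem uses.
// The values below are illustrative assumptions, not recommendations.
val spark = SparkSession.builder()
  .appName("checkpoint-hdfs-tuning-sketch")
  // Replication factor for files this client creates (cluster default is typically 3).
  .config("spark.hadoop.dfs.replication", "3")
  // How many times the HDFS client retries completeFile() while waiting for the last block's
  // replicas before throwing "Unable to close file because the last block does not have
  // enough number of replicas" (HDFS client default is 5).
  .config("spark.hadoop.dfs.client.block.write.locateFollowingBlock.retries", "10")
  .getOrCreate()
{code}

The same keys could equally be set in the client-side hdfs-site.xml; raising the retry count only lengthens the client-side wait for the last block's replicas, it does not change how Spark reacts if the close still fails.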
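
On the monitoring question in the comment above: one common pattern (a sketch, not a fix for this issue) is to surface terminal query failures from inside the driver rather than relying on the YARN application state, for example with a StreamingQueryListener, and to let StreamingQuery.awaitTermination() rethrow the failure so the driver exits and YARN marks the application as failed. The pushAlert helper and the paths below are purely illustrative placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object MonitoringSketch {
  // Placeholder for whatever alerting integration is actually in use (hypothetical helper).
  def pushAlert(msg: String): Unit = System.err.println(s"ALERT: $msg")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("structured-streaming-monitoring-sketch").getOrCreate()

    // Detect query-level failure inside the driver instead of trusting the YARN state,
    // which can stay RUNNING after the stream's MicroBatchExecution thread has died.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
        // exception is Some(stack trace string) when the query died with an error,
        // e.g. the IOException from HDFSMetadataLog in the stack trace above.
        event.exception.foreach(err => pushAlert(s"Query ${event.id} terminated: $err"))
      }
    })

    // ... define the input DataFrame and start the query writing to the Delta sink here ...
    // val query = df.writeStream.format("delta")
    //   .option("checkpointLocation", "hdfs:///path/to/checkpoint")  // illustrative path
    //   .start("hdfs:///path/to/table")
    //
    // awaitTermination() rethrows the query's failure as a StreamingQueryException, so the
    // driver dies and YARN reports the application as failed instead of leaving it "running".
    // query.awaitTermination()
  }
}
{code}

A listener like this complements rather than replaces external monitoring; letting awaitTermination() propagate the failure is what actually turns the YARN application status into a failure instead of a silently idle "running" job.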