[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881949#comment-16881949 ] Steve Loughran commented on SPARK-25966: [~joshrosen] HADOOP-16109 is *only* for the S3AInputStream class, not going to be an issue for swift > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) > at >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881625#comment-16881625 ] Josh Rosen commented on SPARK-25966: Cross-post: there's discussion of a similar issue at https://issues.apache.org/jira/browse/SPARK-27570. Based on that, I suspect that https://issues.apache.org/jira/browse/HADOOP-16109 may fix this problem. > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681297#comment-16681297 ] Steve Loughran commented on SPARK-25966: bq. Hadoop 3.1.x is not yet officially supported in Spark. true, but there were some changes in that S3A input stream which it is good to see if it caused this h3. better recovery of failures in the underlying read() call Before: [https://github.com/apache/hadoop/blob/branch-2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L382] After: [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L364] h3. AWS SDK++ an update to a more recent AWS SDK. (1.11.271), which complains a lot more if you close in input stream while there's data h3. Adaptive seek policy When you start off with fadvise=normal the first read is the full file, but if you do a backward seek is switches to random IO (fs.s3a.experimental.fadvise=random): [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L281] Unless the fadvise=random is set (Best) or fadvise=sequential (completely wrong for striped columnar formats), the parquet reader is following that codepath. [~andrioni]: can you put the log {{org.apache.hadoop.fs.s3a.S3AInputStream}} into DEBUG and see what it says on these failures? > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680901#comment-16680901 ] Xiao Li commented on SPARK-25966: - Thank you for reporting this. I think this is not an issue. Please provide more info and then we can investigate more. > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) > at >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680816#comment-16680816 ] Hyukjin Kwon commented on SPARK-25966: -- No one can reproduce this or makes sure if it's really fixed or not. Please avoid to just simply copy and paste the error messages in JIRA. Please take a look for "Contributing Bug Reports" in https://spark.apache.org/contributing.html when the JIRA is filed. Simply encountering an error does not mean a bug should be reported. {quote} Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on top of a Mesos cluster. Both input and output Parquet files are on S3 {quote} Hadoop 3.1.x is not yet officially supported in Spark. > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679611#comment-16679611 ] Steve Loughran commented on SPARK-25966: bq. It looks to me like a problem in closing the file or with an executor dying before finishing a file. If that happened and the data wasn't cleaned up, then it could lead to this problem. though if S3 was the destination of the write, that'd be an atomic PUT wouldn't it? Corruption would have to happen in either the spark/parquet writer, or in the s3 client uploading the data (or RAM corruption, which I'm ignoring) > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678728#comment-16678728 ] Ryan Blue commented on SPARK-25966: --- [~andrioni], were there any failed tasks or executors in the job that wrote this file? It looks to me like a problem in closing the file or with an executor dying before finishing a file. If that happened and the data wasn't cleaned up, then it could lead to this problem. > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678595#comment-16678595 ] Cheng Lian commented on SPARK-25966: [~andrioni], just realized that I might misunderstand this part of your statement: {quote} This job used to work fine with Spark 2.2.1 [...] {quote} I thought you could read the same problematic files using Spark 2.2.1. Now I guess you probably only meant that the same job worked fine with Spark 2.2.1 previously (with different sets of historical files). > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542 ] Cheng Lian commented on SPARK-25966: Hey, [~andrioni], if you still have the original (potentially) corrupted Parquet files at hand, could you please try reading them again with Spark 2.4 but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it works fine, it might be an issue in the vectorized reader. Also, any chances that you can share a sample problematic file? Since the same workload worked fine with Spark 2.2.1, I doubt whether this is really a file corruption issue. Unless somehow Spark 2.4 is reading more columns/row groups than Spark 2.2.1 for the same job, which would also indicate an optimizer side issue (predicate push-down and column pruning). > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678527#comment-16678527 ] Xiao Li commented on SPARK-25966: - Do you still have the file that fail your job? Can you use the previous version of Spark to read it? > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) > at >
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678475#comment-16678475 ] Yuming Wang commented on SPARK-25966: - Thanks [~andrioni] Is there an easy way to reproduce it? > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159) > at