[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2019-07-10 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881949#comment-16881949
 ] 

Steve Loughran commented on SPARK-25966:


[~joshrosen] HADOOP-16109 is *only* for the S3AInputStream class; it is not going to 
be an issue for Swift.

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   (...)
> {code}

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2019-07-09 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881625#comment-16881625
 ] 

Josh Rosen commented on SPARK-25966:


Cross-post: there's discussion of a similar issue at 
https://issues.apache.org/jira/browse/SPARK-27570. Based on that, I suspect 
that https://issues.apache.org/jira/browse/HADOOP-16109 may fix this problem.

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-09 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681297#comment-16681297
 ] 

Steve Loughran commented on SPARK-25966:


bq. Hadoop 3.1.x is not yet officially supported in Spark.

True, but there were some changes to the S3A input stream in that release, and it is 
worth checking whether they caused this.

h3. Better recovery from failures in the underlying read() call

Before: 
[https://github.com/apache/hadoop/blob/branch-2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L382]

After: 
[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L364]

h3. AWS SDK++

An update to a more recent AWS SDK (1.11.271), which complains a lot more if 
you close an input stream while there is still data left to read.

h3.  Adaptive seek policy

When you start off with fadvise=normal, the first read is of the full file, but if 
you do a backward seek it switches to random IO 
(fs.s3a.experimental.fadvise=random): 
[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L281]
 

Unless fadvise=random is set (best) or fadvise=sequential (completely wrong 
for striped columnar formats), the Parquet reader follows that codepath.
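
To rule the seek policy in or out, here is a minimal sketch of forcing random IO for a whole session; it assumes the usual {{spark.hadoop.*}} pass-through of Hadoop configuration keys, and the bucket path is only a placeholder.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: force the S3A random-IO read policy up front instead of relying
// on the adaptive switch. fs.s3a.experimental.fadvise is the key named above;
// the spark.hadoop. prefix pass-through and the bucket path are assumptions.
val spark = SparkSession.builder()
  .appName("s3a-fadvise-random-check")
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()

val df = spark.read.parquet("s3a://example-bucket/path/to/parquet")
println(df.rdd.count()) // force every row group to be read through S3AInputStream
{code}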

[~andrioni]: can you set the {{org.apache.hadoop.fs.s3a.S3AInputStream}} log 
to DEBUG and see what it says on these failures?
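
For reference, one way to do that from the driver is through the log4j 1.x API bundled with Spark 2.x (an equivalent line in {{conf/log4j.properties}} works too); this is a sketch, not a prescribed procedure.

{code:scala}
import org.apache.log4j.{Level, Logger}

// Sketch: raise only the S3A input stream logger to DEBUG on the driver so the
// open/seek/re-open decisions show up around the EOFException. Executors need
// the same change via their own log4j configuration.
Logger.getLogger("org.apache.hadoop.fs.s3a.S3AInputStream").setLevel(Level.DEBUG)
{code}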

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-08 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680901#comment-16680901
 ] 

Xiao Li commented on SPARK-25966:

Thank you for reporting this. I think this is not an issue. Please provide more 
info and then we can investigate more. 

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   at 
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680816#comment-16680816
 ] 

Hyukjin Kwon commented on SPARK-25966:

No one can reproduce this or confirm whether it's really fixed or not. Please 
avoid simply copying and pasting error messages into JIRA. Please take a 
look at "Contributing Bug Reports" in 
https://spark.apache.org/contributing.html before filing a JIRA. Simply 
encountering an error does not mean a bug should be reported.

{quote}
Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on top of a Mesos 
cluster. Both input and output Parquet files are on S3
{quote}

Hadoop 3.1.x is not yet officially supported in Spark.

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-08 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679611#comment-16679611
 ] 

Steve Loughran commented on SPARK-25966:


bq.  It looks to me like a problem in closing the file or with an executor 
dying before finishing a file. If that happened and the data wasn't cleaned up, 
then it could lead to this problem.

Though if S3 was the destination of the write, that would be an atomic PUT, wouldn't 
it? Corruption would have to happen either in the Spark/Parquet writer or in 
the S3 client uploading the data (or RAM corruption, which I'm ignoring).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678728#comment-16678728
 ] 

Ryan Blue commented on SPARK-25966:

[~andrioni], were there any failed tasks or executors in the job that wrote 
this file? It looks to me like a problem in closing the file or with an 
executor dying before finishing a file. If that happened and the data wasn't 
cleaned up, then it could lead to this problem.

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678595#comment-16678595
 ] 

Cheng Lian commented on SPARK-25966:


[~andrioni], just realized that I might have misunderstood this part of your 
statement:
{quote}
This job used to work fine with Spark 2.2.1
[...]
{quote}
I thought you could read the same problematic files using Spark 2.2.1. Now I 
guess you probably only meant that the same job worked fine with Spark 2.2.1 
previously (with different sets of historical files).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian commented on SPARK-25966:


Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.
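
For reference, a minimal sketch of that check, assuming an existing {{spark}} session (e.g. in spark-shell) and a placeholder path.

{code:scala}
// Sketch: re-read the suspect files with the vectorized reader disabled so the
// plain parquet-mr reader is used instead. The path is a made-up placeholder.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet("s3a://example-bucket/path/to/suspect-parquet")
println(df.rdd.count()) // materialise every row group; reproduces the EOFException if it is reader-independent
{code}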

Also, any chance that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt that this is 
really a file corruption issue, unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, which would instead point 
to an optimizer-side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678527#comment-16678527
 ] 

Xiao Li commented on SPARK-25966:

Do you still have the file that failed your job? Can you use the previous version 
of Spark to read it?

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   at 
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678475#comment-16678475
 ] 

Yuming Wang commented on SPARK-25966:

Thanks, [~andrioni]. Is there an easy way to reproduce it?

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
>   at