[ https://issues.apache.org/jira/browse/HADOOP-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373213#comment-17373213 ]
Arghya Saha commented on HADOOP-17755:
--------------------------------------

[~ste...@apache.org] Apologies for the delay, I was struggling to build Spark. I have now built Spark 3.1.1 with Hadoop 3.3.1 and ran the same job. The good news is that we no longer hit the error. However, I have noticed a concerning issue: the input data reported by Spark (Hadoop 3.3.1) was almost double, and the read runtime also increased (by around 20%), compared to Spark (Hadoop 3.2.0) with the exact same resources and configuration. This is also happening with other jobs that were not affected by the readFully error described above. *I was seeing the exact same issue with Hadoop 3.2.0 when using the workaround fs.s3a.readahead.range = 1G.*

Further details below:

|Hadoop Version|Actual size of the files (SQL tab)|Reported size of the files (Stages tab)|Time to complete the stage|fs.s3a.readahead.range|
|Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
|Hadoop 3.3.1|29.3 GiB|*58.7 GiB*|*27 min*|64K|
|Hadoop 3.2.0|29.3 GiB|*58.7 GiB*|*~27 min*|1G|

I was expecting some improvement over (or at least parity with) Hadoop 3.2.0 for read operations; please suggest how to approach and resolve this.
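For reference, a minimal sketch (not the actual job code) of how the fs.s3a.readahead.range workaround can be passed to an S3A-backed Spark job through the spark.hadoop.* prefix; the application name and input path below are placeholders, only the property name and the 1G value come from the discussion above:

{code:java}
// Sketch only: Spark copies "spark.hadoop."-prefixed keys into the Hadoop
// Configuration it hands to the FileSystem layer, which is how fs.s3a.* settings
// reach the S3A connector.
import org.apache.spark.sql.SparkSession;

public class ReadaheadRangeSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("s3a-readahead-sketch")                      // placeholder app name
        // S3A default is 64K; the workaround discussed above raises it to 1G.
        .config("spark.hadoop.fs.s3a.readahead.range", "1G")
        .getOrCreate();

    // Placeholder path, mirroring the redacted s3a path from this issue.
    long rows = spark.read().orc("s3a://<bucket-with-prefix>/").count();
    System.out.println("rows = " + rows);

    spark.stop();
  }
}
{code}

The same setting can also be supplied at submit time, e.g. --conf spark.hadoop.fs.s3a.readahead.range=1G.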
> EOF reached error reading ORC file on S3A
> -----------------------------------------
>
>                 Key: HADOOP-17755
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17755
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.0
>        Environment: Hadoop 3.2.0
>           Reporter: Arghya Saha
>           Priority: Major
>
> Hi, I am trying to do some transformation using Spark 3.1.1 / Hadoop 3.2 on K8s, using s3a.
> I have around 700 GB of data to read and around 200 executors (5 vCores and 30 GB each).
> It is able to read most of the files in the problematic stage (Scan orc => Filter => Project) but fails on a few files at the end with the error below. The size of the file mentioned in the error is around 140 MB, and all other files are of similar size.
> I am able to read and rewrite the specific file mentioned, which suggests the file is not corrupted.
> Let me know if further information is required.
>
> {code:java}
> java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
>     at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
>     at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.io.EOFException: End of file reached before reading fully.
>     at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
>     at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
>     at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
>     at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
>     at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
>     at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
>     at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
>     at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
>     ... 20 more
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org