[ https://issues.apache.org/jira/browse/HADOOP-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373239#comment-17373239 ]
Dongjoon Hyun edited comment on HADOOP-17755 at 7/2/21, 5:39 AM:
-----------------------------------------------------------------

Could you share the other Hadoop-related configuration you used, [~arghya18]? I'm using vanilla Apache Hadoop 3.3.1 with the following Hadoop-related configuration in an EKS environment; everything else is left at the defaults. For Spark, it's Spark 3.1.2.

{code}
-c spark.hadoop.fs.s3a.experimental.input.fadvise=random \
-c spark.hadoop.fs.s3a.downgrade.syncable.exceptions=true \
-c spark.kubernetes.driverEnv.AWS_REGION=us-west-2 \
-c spark.executorEnv.AWS_REGION=us-west-2 \
{code}

was (Author: dongjoon):
Could you share the other Hadoop-related configuration you used, [~arghya18]? I'm using vanilla Apache Hadoop 3.3.1 with the following Hadoop-related configuration in an EKS environment; everything else is left at the defaults.

{code}
-c spark.hadoop.fs.s3a.experimental.input.fadvise=random \
-c spark.hadoop.fs.s3a.downgrade.syncable.exceptions=true \
-c spark.kubernetes.driverEnv.AWS_REGION=us-west-2 \
-c spark.executorEnv.AWS_REGION=us-west-2 \
{code}


> EOF reached error reading ORC file on S3A
> -----------------------------------------
>
>                 Key: HADOOP-17755
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17755
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.0
>        Environment: Hadoop 3.2.0
>            Reporter: Arghya Saha
>            Priority: Major
>
> Hi, I am trying to do some transformation using Spark 3.1.1 (Hadoop 3.2) on K8s, using s3a.
> I have around 700 GB of data to read and around 200 executors (5 vCores and 30 GB each).
> The problematic stage (Scan orc => Filter => Project) is able to read most of the files, but fails on a few files at the end with the error below. The file mentioned in the error is around 140 MB, and all the other files are of similar size.
> I am able to read and rewrite the specific file mentioned, which suggests the file is not corrupted.
> Let me know if further information is required.
>
> {code:java}
> java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
>   at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
>   at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
>   at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
>   at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>   at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.io.EOFException: End of file reached before reading fully.
>   at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
>   at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
>   at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
>   at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
>   at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
>   at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
>   at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
>   at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
>   at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
>   ... 20 more
> {code}
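
For reference, below is a minimal, hypothetical sketch of applying the two S3A options from the comment above through a SparkSession and re-reading the single file named in the stack trace. It is not from the issue: the object name {{OrcS3AReadCheck}} and the application name are illustrative, the {{<bucket-with-prefix>}} placeholder is kept as in the report, and it assumes Spark 3.1.x with the hadoop-aws jars on the classpath.

{code:scala}
import org.apache.spark.sql.SparkSession

object OrcS3AReadCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-s3a-read-check") // illustrative name
      // Same S3A settings as the spark-submit flags quoted in the comment above
      .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
      .config("spark.hadoop.fs.s3a.downgrade.syncable.exceptions", "true")
      .getOrCreate()

    // Read only the file named in the stack trace; counting forces every
    // stripe to be read, which is where the EOFException was reported.
    val df = spark.read.orc(
      "s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc")
    println(s"rows: ${df.count()}")

    spark.stop()
  }
}
{code}

Running the same single-file read with and without {{fs.s3a.experimental.input.fadvise=random}} may help narrow down whether the random-read input path is involved.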