Arghya Saha created HADOOP-17755:
------------------------------------
Summary: EOF reached error reading ORC file on S3A
Key: HADOOP-17755
URL: https://issues.apache.org/jira/browse/HADOOP-17755
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 3.2.0
Environment: Hadoop 3.2.0
Reporter: Arghya Saha
Hi, I am trying to do some transformation using Spark 3.1.1 with Hadoop 3.2 on K8s,
reading via s3a.
I have around 700 GB of data to read and around 200 executors (5 vCores and 30 GB
each).
The problematic stage (Scan orc => Filter => Project) reads most of the files
successfully but fails on a few files at the end with the error below.
I am able to read and rewrite the specific file mentioned, which suggests the
file is not corrupted.
Let me know if further information is required.
{code:java}
java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.EOFException: End of file reached before reading fully.
	at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
	at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
	at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
	... 20 more
{code}
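For context on what the root cause means: {{readFully}} (unlike a plain {{read}}) throws {{EOFException}} when the stream ends before the requested number of bytes arrives, so here ORC asked {{S3AInputStream}} for a byte range extending past the apparent end of the object. A minimal local sketch of that contract, using {{java.io.RandomAccessFile}} (which shares {{readFully}} semantics with {{FSDataInputStream}}; the class and file names here are illustrative, not from the report):

{code:java}
import java.io.EOFException;
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

public class ReadFullyDemo {
    // Returns true if readFully throws EOFException when the requested
    // length extends past the end of the file.
    static boolean readPastEof() throws Exception {
        File f = File.createTempFile("readfully-demo", ".bin");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[10]); // the file holds only 10 bytes
        }
        byte[] buf = new byte[16];   // but we ask for 16
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            in.readFully(buf);       // same contract as FSDataInputStream.readFully
            return false;
        } catch (EOFException e) {
            return true;             // "End of file reached before reading fully"
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("EOFException on short read: " + readPastEof());
    }
}
{code}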
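A few s3a settings that may help narrow this down. These are standard Hadoop S3A options, but whether any of them affects this particular failure is an assumption on my part, not a confirmed workaround:

{code}
# Hypothetical diagnostic settings -- real s3a option names, but their
# relevance to this failure is an assumption, not a confirmed fix.
spark-submit \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  --conf spark.hadoop.fs.s3a.readahead.range=1M \
  --conf spark.hadoop.fs.s3a.change.detection.mode=warn \
  ...
{code}

The change-detection setting would surface cases where the object was overwritten between the split planning and the read, which is one way the reader's view of the object length could go stale.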
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]