atifiu opened a new issue, #8399:
URL: https://github.com/apache/iceberg/issues/8399
### Apache Iceberg version
1.3.0
### Query engine
Spark
### Please describe the bug 🐞
We are trying to read Hive data and write it into an Iceberg table using either CTAS or INSERT INTO, and the job fails with the error message below. The same data can be read without issue from Hive/Spark/Trino, i.e. without the Iceberg jars. Our environment has erasure coding enabled on the HDFS path. To confirm whether the problem is related to erasure coding, we ran the same test on another environment with an identical setup but without erasure coding, and there it works fine. So it does appear to be an issue with **erasure coding**. Is this a known issue, and is there a jar or fix that handles it? A sketch of the statements we run is included below the environment details.
Env Details:
1. Spark 3.3.1
2. Iceberg 1.3.0
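For reference, a minimal sketch of the kind of job we run. The catalog configuration, database and table names below are placeholders, not the real ones from our environment, and the Iceberg Spark 3.3 runtime jar is assumed to be on the classpath:

```python
# Minimal sketch of the failing workload (placeholder catalog/db/table names).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.ice", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice.type", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Variant 1: CTAS from the existing Hive table into a new Iceberg table.
spark.sql("""
    CREATE TABLE ice.db.core_page_view_iceberg
    USING iceberg
    AS SELECT * FROM db.core_page_view
""")

# Variant 2: INSERT INTO an already-created Iceberg table.
spark.sql("""
    INSERT INTO ice.db.core_page_view_iceberg
    SELECT * FROM db.core_page_view
""")
```

Both variants fail with the error below on the erasure-coded cluster; the same statements succeed on the cluster without erasure coding.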
```
py4j.protocol.Py4JJavaError: An error occurred while calling o104.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 113 in stage 3.0 failed 4 times, most recent failure: Lost task 113.3 in stage 3.0 (TID 662) (servername_redacted executor 17): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-177356802-10.28.113.126-1620307273641:blk_-9223372036848879072_382398 file=/hdfspath_redacted/core_page_view/eventdate=20210421/000005_0
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:976)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1083)
    at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1439)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1402)
    at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)
    at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
    at org.apache.orc.impl.ReaderImpl.read(ReaderImpl.java:701)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:793)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:146)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:146)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-177356802-10.28.113.126-1620307273641:blk_-9223372036848879072_382398 file=/hdfs_path_redacted/core_page_view/eventdate=20210421/000005_0
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:976)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1083)
    at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1439)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1402)
    at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)
    at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
    at org.apache.orc.impl.ReaderImpl.read(ReaderImpl.java:701)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:793)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:146)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:146)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
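For completeness, this is roughly how we confirmed that the source directory is erasure coded on the failing cluster; a sketch assuming the `hdfs` CLI is available on the host, using the redacted placeholder path from the trace above:

```python
# Sketch: query the erasure coding policy of the source directory via the
# `hdfs ec -getPolicy` command (assumes the hdfs client is on PATH).
import subprocess

def ec_policy(path: str) -> str:
    """Return the output of `hdfs ec -getPolicy -path <path>`."""
    out = subprocess.run(
        ["hdfs", "ec", "-getPolicy", "-path", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# On the failing cluster this reports an EC policy for the directory
# (e.g. RS-6-3-1024k); on the cluster where the job succeeds, it does not.
print(ec_policy("/hdfspath_redacted/core_page_view"))
```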