atifiu opened a new issue, #8399:
URL: https://github.com/apache/iceberg/issues/8399
### Apache Iceberg version
1.3.0
### Query engine
Spark
### Please describe the bug 🐞
We are trying to read Hive data and write it into an Iceberg table using either CTAS or INSERT INTO, and the job fails with the error message below. The same data can be read without issue from Hive/Spark/Trino, i.e. without the Iceberg jars. Our environment has erasure coding enabled on the HDFS path. To confirm whether the problem is related to erasure coding, we ran the same test on another environment with an identical setup but without erasure coding, and there it works fine. So it does appear to be an issue with **erasure coding**. Is this a known issue, and is there a jar or fix that handles it? A sketch of the statements we run is included below the environment details.
Env Details:
1. Spark 3.3.1
2. Iceberg 1.3.0
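For reference, a minimal sketch of the kind of job we run. The catalog configuration, database and table names below are placeholders, not the real ones from our environment, and the Iceberg Spark 3.3 runtime jar is assumed to be on the classpath:

```python
# Minimal sketch of the failing workload (placeholder catalog/db/table names).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.ice", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice.type", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Variant 1: CTAS from the existing Hive table into a new Iceberg table.
spark.sql("""
    CREATE TABLE ice.db.core_page_view_iceberg
    USING iceberg
    AS SELECT * FROM db.core_page_view
""")

# Variant 2: INSERT INTO an already-created Iceberg table.
spark.sql("""
    INSERT INTO ice.db.core_page_view_iceberg
    SELECT * FROM db.core_page_view
""")
```

Both variants fail with the error below on the erasure-coded cluster; the same statements succeed on the cluster without erasure coding.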
```
py4j.protocol.Py4JJavaError: An error occurred while calling o104.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 113 in stage 3.0 failed 4 times, most recent failure: Lost task 113.3 in stage 3.0 (TID 662) (servername_redacted executor 17): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-177356802-10.28.113.126-1620307273641:blk_-9223372036848879072_382398 file=/hdfspath_redacted/core_page_view/eventdate=20210421/000005_0
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:976)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1083)
    at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1439)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1402)
    at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)
    at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
    at org.apache.orc.impl.ReaderImpl.read(ReaderImpl.java:701)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:793)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:146)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:146)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-177356802-10.28.113.126-1620307273641:blk_-9223372036848879072_382398 file=/hdfs_path_redacted/core_page_view/eventdate=20210421/000005_0
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:976)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1083)
    at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1439)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1402)
    at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)
    at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
    at org.apache.orc.impl.ReaderImpl.read(ReaderImpl.java:701)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:793)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:146)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:146)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
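For completeness, this is roughly how we confirmed that the source directory is erasure coded on the failing cluster; a sketch assuming the `hdfs` CLI is available on the host, using the redacted placeholder path from the trace above:

```python
# Sketch: query the erasure coding policy of the source directory via the
# `hdfs ec -getPolicy` command (assumes the hdfs client is on PATH).
import subprocess

def ec_policy(path: str) -> str:
    """Return the output of `hdfs ec -getPolicy -path <path>`."""
    out = subprocess.run(
        ["hdfs", "ec", "-getPolicy", "-path", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# On the failing cluster this reports an EC policy for the directory
# (e.g. RS-6-3-1024k); on the cluster where the job succeeds, it does not.
print(ec_policy("/hdfspath_redacted/core_page_view"))
```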