pengxianzi commented on issue #12585:
URL: https://github.com/apache/hudi/issues/12585#issuecomment-2579007727
@danny0405
Thank you for the suggestion! Following your advice, we set the slots per
TaskManager to 1 and `read.tasks` to 10. However, when the Flink job starts,
the full table scan reads upstream files, which may include Parquet files that
are still being merged or log files that have just been written. These files
have an actual size of 0 MB, which causes read errors and task failures.
Below are some error summaries:
1. Parquet file error:
   ```
   org.apache.flink.runtime.taskmanager.task [] - split_reader switched from RUNNING to FAILED with failure cause: java.lang.RuntimeException: hdfs://node1/XXX.parquet is not a Parquet file (length is too low: 0)
   ```
2. Log file error:
   ```
   ERROR org.apache.hudi.exception.HoodieIOException: IOException when reading logfile HoodieLogFile{pathStr='hdfs://node1/XXXX.log.1_0-1-0', fileLen=-1}
   ```
Problem analysis:
The full table scan triggered at job startup reads all files, including
Parquet files that are still being merged and log files that have just been
written. Because these files may not have finished writing or merging, they
can have a size of 0 MB or otherwise be invalid, which causes read failures.
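For context, the setup described above corresponds roughly to the following Flink SQL definition. This is only a sketch: the table name, schema, and path are hypothetical placeholders, and only `read.tasks` reflects a value stated above.

```sql
-- Sketch of the streaming read setup; table name, schema, and path are placeholders.
CREATE TABLE hudi_source (
  id STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://node1/path/to/table',   -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.tasks' = '10'
);
```

The one-slot-per-TaskManager setting lives in `flink-conf.yaml` as `taskmanager.numberOfTaskSlots: 1`.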
We would like to know:
1. How can we avoid reading incomplete files (e.g., Parquet or log files with
a size of 0 MB)?
2. Is there a mechanism to ensure that files are read only after they have
finished writing or merging?
3. Is there a way to optimize the full table scan at job startup so that it
skips invalid files?
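As a diagnostic aid for question 1, the check below walks a local copy of the table directory and lists empty data files. This is only a sketch of the kind of check one could run (the HDFS equivalent would use `hdfs dfs -ls -R` or the Hadoop `FileSystem` API); the suffix matching is a substring test because Hudi log files carry version suffixes like `.log.1_0-1-0`.

```python
import os


def find_zero_length_files(root, suffixes=(".parquet", ".log")):
    """Walk a directory tree and return paths of empty data files.

    Substring matching (not endswith) is used so that names such as
    'xxx.log.1_0-1-0' are still matched.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(suffix in name for suffix in suffixes):
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) == 0:
                    empty.append(path)
    return empty
```

Running this against the table path before the Flink job starts would confirm which files were picked up by the scan before they were fully written.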
Looking forward to your reply!