pengxianzi commented on issue #12585:
URL: https://github.com/apache/hudi/issues/12585#issuecomment-2579007727
@danny0405
Thank you for the suggestion! Following your advice, we set the slots per
TaskManager to 1 and `read.tasks` to 10. However, when the Flink job starts,
the full table scan reads upstream files, which may include Parquet files that
are still being merged or log files that have just been written. These files
have an actual size of 0 MB, which causes read errors and task failures.
Below are some error summaries:
1. Parquet file error:
   ```
   org.apache.flink.runtime.taskmanager.task [] - split_reader switched from RUNNING to FAILED with failure cause: java.lang.RuntimeException: hdfs://node1/XXX.parquet is not a Parquet file (length is too low: 0)
   ```
2. Log file error:
   ```
   ERROR org.apache.hudi.exception.HoodieIOException: IOException when reading logfile HoodieLogFile{pathStr='hdfs://node1/XXXX.log.1_0-1-0', fileLen=-1}
   ```
Problem analysis:
The full table scan triggered at job startup reads all files, including
Parquet files that are still being merged and log files that have just been
written. Because these files may not have finished writing or merging, they
can have a size of 0 MB or otherwise be invalid, which causes read failures.
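For context, the setup described above corresponds roughly to the following Flink SQL definition. This is only a sketch: the table name, schema, and path are hypothetical placeholders, and only `read.tasks` reflects a value stated above.

```sql
-- Sketch of the streaming read setup; table name, schema, and path are placeholders.
CREATE TABLE hudi_source (
  id STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://node1/path/to/table',   -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.tasks' = '10'
);
```

The one-slot-per-TaskManager setting lives in `flink-conf.yaml` as `taskmanager.numberOfTaskSlots: 1`.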
We would like to know:
1. How can we avoid reading incomplete files (e.g., Parquet or log files with
a size of 0 MB)?
2. Is there a mechanism to ensure that files are read only after they have
finished writing or merging?
3. Is there a way to optimize the full table scan at job startup so that it
skips invalid files?
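As a diagnostic aid for question 1, the check below walks a local copy of the table directory and lists empty data files. This is only a sketch of the kind of check one could run (the HDFS equivalent would use `hdfs dfs -ls -R` or the Hadoop `FileSystem` API); the suffix matching is a substring test because Hudi log files carry version suffixes like `.log.1_0-1-0`.

```python
import os


def find_zero_length_files(root, suffixes=(".parquet", ".log")):
    """Walk a directory tree and return paths of empty data files.

    Substring matching (not endswith) is used so that names such as
    'xxx.log.1_0-1-0' are still matched.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(suffix in name for suffix in suffixes):
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) == 0:
                    empty.append(path)
    return empty
```

Running this against the table path before the Flink job starts would confirm which files were picked up by the scan before they were fully written.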
Looking forward to your reply!