pengxianzi opened a new issue, #12585:
URL: https://github.com/apache/hudi/issues/12585

   We are using Apache Hudi to build a data lake and writing data to a Kudu 
table downstream. The following two scenarios exhibit different behaviors:
    Scenario 1: The upstream writes data using a regular MOR (Merge On Read) 
Hudi table, and the downstream reads the Hudi table and writes to the Kudu 
table without any issues.
    Scenario 2: The upstream writes data using a bucketed table, and when the 
downstream reads the Hudi table and attempts to write to the Kudu table, the 
task fails with the following warning:
    `caution: the reader has fallen behind too much from the writer, tweak 
'read.tasks' option to add parallelism of read tasks`
   
   We have tried setting the read.tasks parameter to 10, but the issue 
persists. Below are our configuration and environment details:
   
    Hudi version : 0.14.0
   
   Spark version : 2.4.7
   
   Hive version : 3.1.3
   
   Hadoop version : 3.1.1
   
   Storage Format: HDFS
   
   Downstream Storage: Apache Kudu
   
   Bucketed Table Configuration: Number of buckets is 10
   
   Configuration Information
   
   Below is our Hudi table configuration:
    
            Map<String, String> options = new HashMap<>();
            options.put(FlinkOptions.PATH.key(), basePath + tableName);
            options.put(FlinkOptions.TABLE_TYPE.key(), name);
            options.put(FlinkOptions.READ_AS_STREAMING.key(), "true");
            options.put(FlinkOptions.PRECOMBINE_FIELD.key(), precombing);
            options.put(FlinkOptions.READ_START_COMMIT.key(), "20210316134557");
            options.put("read.streaming.skip_clustering", "true");
            options.put("read.streaming.skip_compaction", "true");
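   For comparison, a minimal sketch of the same reader options with the attempted `read.tasks` override, built with only the JDK so the key/value pairs are easy to inspect. Assumptions: the plain string keys match the `FlinkOptions.*.key()` names in Hudi 0.14 (e.g. `path`, `table.type`, `read.streaming.enabled`, `read.start-commit`); verify against your Hudi version before use.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class HudiReadOptions {
       // Builds the streaming-read option map from this issue, plus the
       // 'read.tasks' override that was attempted. String keys are assumed
       // to match FlinkOptions.*.key() in Hudi 0.14 -- verify before use.
       static Map<String, String> build(String basePath, String tableName) {
           Map<String, String> options = new HashMap<>();
           options.put("path", basePath + tableName);
           options.put("table.type", "MERGE_ON_READ");
           options.put("read.streaming.enabled", "true");      // FlinkOptions.READ_AS_STREAMING
           options.put("read.start-commit", "20210316134557"); // FlinkOptions.READ_START_COMMIT
           options.put("read.streaming.skip_clustering", "true");
           options.put("read.streaming.skip_compaction", "true");
           options.put("read.tasks", "10");                    // match the bucket count (10)
           return options;
       }

       public static void main(String[] args) {
           Map<String, String> opts = build("hdfs:///lake/", "t1");
           System.out.println(opts.get("path") + " -> read.tasks=" + opts.get("read.tasks"));
       }
   }
   ```

   Matching `read.tasks` to the bucket count keeps one read task per bucket; a lower value forces single tasks to consume multiple buckets' file groups, which is one plausible reason the reader lags the writer.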
   
   
   Steps to Reproduce
   
    1. The upstream writes data to a Hudi MOR table using a bucketed table.
    2. The downstream reads the Hudi table and attempts to write the data to 
the Kudu table.
    3. The task fails with the warning: reader has fallen behind too much from 
the writer.
   
   Attempted Solutions
    1. Set the read.tasks parameter to 10, but the issue persists.
    2. Checked the data distribution of the bucketed table to ensure there is 
no data skew.
    3. Checked the file layout of the Hudi table to ensure there are no 
excessive small files.
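    Since the log below also shows "Exceeded checkpoint tolerable failure threshold", one hedged workaround (it keeps the job alive while tuning, but does not fix the fallen-behind reader itself) is to relax Flink's checkpoint failure tolerance in flink-conf.yaml; the exact defaults vary by Flink release, so verify for your version:

    ```yaml
    # Allow a few checkpoint failures before the job is failed outright.
    execution.checkpointing.tolerable-failed-checkpoints: 3
    # Give slow streaming reads more time to complete a checkpoint.
    execution.checkpointing.timeout: 10min
    ```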
   
   Expected Behavior
    The downstream should be able to read the Hudi MOR table written by the 
bucketed table and write the data to the Kudu table normally.
   
   Actual Behavior
    The downstream read task fails with the warning: reader has fallen behind 
too much from the writer.
   
   Log Snippet
    Below is a snippet of the log when the task fails:
   
   
    Caused by: org.apache.flink.runtime.resourcemanager.exceptions.UnknownTaskExecutorException: No TaskExecutor registered under container_e38_1734494154374_0718_01_000002
    Caused by: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold
    Caused by: java.util.concurrent.TimeoutException
    caution: the reader has fallen behind too much from the writer, tweak 'read.tasks' option to add parallelism of read tasks
   
   ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down ...
   org.apache.flink.util.FlinkException: The TaskExecutor's registration at the ResourceManager akka.tcp://node1:7791/user/rpc/resourcemanager_0 has been rejected: Rejected TaskExecutor registration at the ResourceManager because: The ResourceManager does not recognize this TaskExecutor
   
    Summary of Questions
    We would like to know:
    1. Why does the read task fall behind when using a bucketed table?
    2. Are there any other configurations besides read.tasks that can optimize 
read performance?
    3. Are there any known issues or limitations related to the combination of 
bucketed tables and MOR tables?
   

