Prashant Wason created HUDI-1351:
------------------------------------

             Summary: Improvements required to hudi-test-suite for scalable and 
repeated testing
                 Key: HUDI-1351
                 URL: https://issues.apache.org/jira/browse/HUDI-1351
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason


There are some shortcomings of the hudi-test-suite that would be good to fix:

1. When running the same DAG repeatedly, the input and output directories have to 
be cleaned manually between runs, which is cumbersome.

2. When running a long test, the input data generated by older DAG nodes is not 
deleted, which leads to a high file count on the HDFS cluster. The older files can 
be deleted once their data has been ingested (a cleanup sketch along these lines 
appears after this list).

3. When generating input data, if the number of insert/update partitions is 
less than Spark's default parallelism, a number of empty Avro files are 
created. This also causes scalability issues on the HDFS cluster: creating a 
large number of small Avro files is slower and less scalable than writing a 
single larger Avro file (see the coalesce sketch after this list).

4. When generating data to be inserted, we cannot control which partition the 
data will be generated for, nor add a new partition. Hence we need a start_offset 
parameter to control the partition offset (see the partition-selection sketch 
after this list).

5. BUG: The correct number of insert partitions is not generated because the 
partition number is chosen as a random long (also covered by the 
partition-selection sketch after this list).

6. BUG: Integer division used within Math.ceil in a couple of places is 
incorrect and yields 0. Math.ceil(5/10) == 0, not 1 (as intended), because 5 
and 10 are both integers and the division truncates before Math.ceil is applied.
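
As an illustration of point 6, a minimal sketch of the truncation (the variable 
names are made up for illustration, not taken from the test-suite sources):

{code:java}
public class CeilExample {
  public static void main(String[] args) {
    int numRecords = 5;
    int batchSize = 10;

    // Integer division truncates before Math.ceil ever sees the value.
    double wrong = Math.ceil(numRecords / batchSize);          // 5 / 10 == 0, so ceil(0) == 0.0

    // Promoting one operand to double keeps the fraction, so ceil rounds up as intended.
    double right = Math.ceil((double) numRecords / batchSize); // ceil(0.5) == 1.0

    System.out.println("wrong = " + wrong + ", right = " + right);
  }
}
{code}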
 
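For point 2, a minimal sketch of the kind of cleanup that could run after a batch 
has been ingested, using the Hadoop FileSystem API. The per-batch directory layout 
and the deleteBatchInputDir name are assumptions for illustration, not the actual 
test-suite code.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputCleanerSketch {
  /** Recursively deletes the input directory of a batch that has already been ingested,
   *  so old Avro files do not pile up on HDFS during long-running tests. */
  public static void deleteBatchInputDir(Configuration conf, String inputBasePath, int batchId)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path batchDir = new Path(inputBasePath, String.valueOf(batchId));
    if (fs.exists(batchDir)) {
      fs.delete(batchDir, true); // true == recursive
    }
  }
}
{code}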

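For point 3, a sketch of the idea behind avoiding empty files: when the number of 
target partitions is smaller than Spark's default parallelism, coalescing the 
generated RDD down to that number means no task writes an empty Avro file. The 
method and parameter names here are assumptions, not existing test-suite APIs.

{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class InputParallelismSketch {
  /** Caps the number of Spark partitions at the number of Hudi partitions being written,
   *  so idle tasks do not each emit an empty Avro file. */
  public static JavaRDD<GenericRecord> limitParallelism(JavaSparkContext jsc,
      JavaRDD<GenericRecord> records, int numTargetPartitions) {
    if (numTargetPartitions < jsc.defaultParallelism()) {
      return records.coalesce(numTargetPartitions);
    }
    return records;
  }
}
{code}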

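For points 4 and 5, a sketch of how a start_offset and a properly bounded random 
index could be combined when choosing the partition for a generated record. All 
names here (startPartitionOffset, numPartitions) are illustrative assumptions about 
the intended fix, not existing code.

{code:java}
import java.util.concurrent.ThreadLocalRandom;

public class PartitionPickerSketch {
  /** Picks the partition for a generated record. nextInt(numPartitions) bounds the random
   *  value to exactly numPartitions distinct partitions (a random long would not), and
   *  startPartitionOffset shifts the range so new partitions can be introduced. */
  public static long pickPartition(long startPartitionOffset, int numPartitions) {
    return startPartitionOffset + ThreadLocalRandom.current().nextInt(numPartitions);
  }
}
{code}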

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
