[ https://issues.apache.org/jira/browse/HUDI-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-1351:
---------------------------------
    Status: Patch Available  (was: In Progress)

> Improvements required to hudi-test-suite for scalable and repeated testing
> --------------------------------------------------------------------------
>
>                 Key: HUDI-1351
>                 URL: https://issues.apache.org/jira/browse/HUDI-1351
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>
> The hudi-test-suite has several shortcomings that would be good to fix:
> 1. When repeating a test with the same DAG, the input and output directories
> must be cleaned manually. This is cumbersome for repeated testing.
> 2. During a long test, the input data generated by older DAG nodes is not
> deleted, leading to a high file count on the HDFS cluster. These older files
> can be deleted once their data has been ingested.
> 3. When generating input data, if the number of insert/update partitions is
> less than Spark's default parallelism, a number of empty avro files are
> created. This also causes scalability issues on the HDFS cluster, since
> creating a large number of small AVRO files is slower and less scalable
> than writing a single AVRO file.
> 4. When generating data to be inserted, we cannot control which partition
> the data is generated for, nor add a new partition. Hence we need a
> start_offset parameter to control the partition offset.
> 5. BUG: The correct number of insert partitions is not generated, because
> the partition number is chosen as a random long.
> 6. BUG: Integer division inside Math.ceil in a couple of places yields 0:
> Math.ceil(5/10) == 0, not 1 (as intended), because 5 and 10 are integers
> and 5/10 truncates to 0 before Math.ceil is applied.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
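The integer-division bug in item 6 can be reproduced with a minimal sketch; the variable names below are illustrative, not taken from the hudi-test-suite code:

```java
public class CeilDivDemo {
    public static void main(String[] args) {
        int numRecords = 5;      // hypothetical counts, not from the Hudi code
        int recordsPerFile = 10;

        // BUG: 5 / 10 is integer division, which truncates to 0 before
        // Math.ceil ever sees the value, so the result is 0 instead of 1.
        long wrong = (long) Math.ceil(numRecords / recordsPerFile);

        // FIX: promote one operand to double so the fraction survives;
        // Math.ceil(0.5) == 1.0 as intended.
        long right = (long) Math.ceil((double) numRecords / recordsPerFile);

        System.out.println("wrong=" + wrong + " right=" + right);
        // prints "wrong=0 right=1"
    }
}
```

The same effect can be had with `(numRecords + recordsPerFile - 1) / recordsPerFile`, which stays in integer arithmetic and avoids the double round-trip.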