GitHub user geserdugarov edited a comment on the discussion: Spark DataSource V2 read and write benchmarks?

To start, I will use a **read from a Kafka topic** (8 partitions) and a direct **write to a Hudi table** (MOR, upsert, bucket index, 16 buckets) for benchmarking:
https://github.com/geserdugarov/test-hudi-issues/blob/main/common/read-from-kafka-write-to-hudi.py
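For reference, below is a minimal sketch of what such a script can look like. This is not the linked script itself: the bootstrap servers, topic name, schema, record key, precombine field, and table path are all placeholder assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType

spark = SparkSession.builder.appName("kafka-to-hudi-benchmark").getOrCreate()

# Batch read of the whole topic (8 partitions).
kafka_df = (spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-host:9092")  # placeholder
    .option("subscribe", "lineitem")                        # placeholder topic name
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load())

# Parse the message payload into lineitem columns (schema abbreviated).
schema = StructType([
    StructField("l_orderkey", LongType()),
    StructField("l_linenumber", LongType()),
    StructField("l_quantity", DoubleType()),
    StructField("l_shipdate", StringType()),
])
records = (kafka_df
    .select(from_json(col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

# MOR table, upsert operation, simple bucket index with 16 buckets.
hudi_opts = {
    "hoodie.table.name": "lineitem_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "l_orderkey,l_linenumber",  # placeholder key
    "hoodie.datasource.write.precombine.field": "l_shipdate",              # placeholder
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "16",
}
(records.write.format("hudi")
    .options(**hudi_opts)
    .mode("append")
    .save("hdfs://namenode:8020/benchmarks/lineitem_mor"))  # placeholder path
```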

This PySpark script will run on a local PC, which acts as the driver, and will submit the job to a remote Spark cluster (Spark 3.5.7) with 8 executors (3 CPUs and 8 GB of memory each):
https://github.com/geserdugarov/test-hudi-issues/blob/main/utils/spark_configuration.py
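A rough sketch of the session setup for such a cluster, assuming a standalone master URL and the standard Hudi-on-Spark settings (the actual values are in the linked spark_configuration.py):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("hudi-write-benchmark")
    .master("spark://spark-master:7077")  # placeholder cluster URL
    # 8 executors, 3 CPUs and 8 GB of memory each
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "3")
    .config("spark.executor.memory", "8g")
    # standard settings for Hudi on Spark
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.hudi.catalog.HoodieCatalog")
    .getOrCreate())
```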

The data in the Kafka topic is the `lineitem` table from the TPC-H benchmark (scale factor = 10, 60 mln records). All records are unique for now.

Write scenario (4 commits in total):
- (4.5 mln * 8) = 36 mln records in the 1st commit,
- (1 mln * 8) = 8 mln records per commit in the following 3 commits,
- all table services (compaction, cleaning, compaction scheduling) are disabled (see the sketch after this list).
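The table services can be switched off through write options; the keys below are standard Hudi configs, though I am only assuming this is how the benchmark script sets them:

```python
# Assumed write options for disabling table services during the benchmark.
no_table_services_opts = {
    "hoodie.table.services.enabled": "false",   # master switch for all table services
    "hoodie.compact.schedule.inline": "false",  # no inline compaction scheduling
    "hoodie.compact.inline": "false",           # no inline compaction
    "hoodie.clean.automatic": "false",          # no automatic cleaning
}
```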

The Hudi table, the Spark event log directory, and the SQL warehouse directory are placed on a separate HDFS cluster to prevent any data transfer to the driver.
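In terms of configuration, that placement looks roughly like this (the HDFS URIs are placeholders, not the paths used in the benchmark):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("hudi-write-benchmark")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs://namenode:8020/spark-events")        # placeholder
    .config("spark.sql.warehouse.dir", "hdfs://namenode:8020/sql-warehouse")  # placeholder
    .getOrCreate())

# The Hudi table base path also lives on the same HDFS cluster.
hudi_table_path = "hdfs://namenode:8020/benchmarks/lineitem_mor"              # placeholder
```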

For Hudi 1.1.0 (DataSource V1 is used), the **total time is about 17 min**.

GitHub link: https://github.com/apache/hudi/discussions/13955#discussioncomment-15059391
