GitHub user geserdugarov edited a comment on the discussion: Spark DataSource 
V2 read and write benchmarks?

To start, I will benchmark reading from a Kafka topic (8 partitions) and writing directly to a Hudi table (MOR, upsert, bucket index, 16 buckets):
https://github.com/geserdugarov/test-hudi-issues/blob/main/common/read-from-kafka-write-to-hudi.py
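For readers who don't want to open the repo, the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the linked script: the table name, record/precombine key fields, topic name, and paths are my assumptions (TPC-H `lineitem` is commonly keyed by `l_orderkey` + `l_linenumber`), and the real script may parse Kafka values differently.

```python
# Hedged sketch of the Kafka -> Hudi write path described above.
# Option keys are standard Hudi write options; key fields and names are assumptions.
HUDI_WRITE_OPTIONS = {
    "hoodie.table.name": "lineitem",                       # assumed table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ", # MOR, as stated
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "BUCKET",                         # bucket index
    "hoodie.bucket.index.num.buckets": "16",               # 16 buckets
    # Assumed composite key for TPC-H lineitem (not taken from the repo):
    "hoodie.datasource.write.recordkey.field": "l_orderkey,l_linenumber",
    "hoodie.datasource.write.precombine.field": "l_shipdate",
}

def run_pipeline(spark, kafka_servers: str, topic: str, hudi_path: str) -> None:
    """Batch-read the topic and upsert into Hudi; requires pyspark + Hudi jars."""
    df = (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", kafka_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
    )
    # Kafka values arrive as bytes; the real script would parse them
    # into typed lineitem columns before writing.
    records = df.selectExpr("CAST(value AS STRING) AS value")
    (
        records.write.format("hudi")
        .options(**HUDI_WRITE_OPTIONS)
        .mode("append")
        .save(hudi_path)
    )
```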

This PySpark script will run on a local PC, which acts as the driver, and will 
submit a job to a remote Spark cluster (Spark 3.5.7) with 8 executors (3 
CPUs and 8 GB of memory each):
https://github.com/geserdugarov/test-hudi-issues/blob/main/utils/spark_configuration.py
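A configuration along these lines would reproduce the setup described (8 executors, 3 cores and 8 GB each, Hudi enabled, event log and warehouse on a separate HDFS cluster). This is a sketch, not the contents of `spark_configuration.py`: the master URL, HDFS host, and paths are placeholders, while the `spark.serializer`/`spark.sql.extensions`/`spark.sql.catalog.spark_catalog` entries are Hudi's standard Spark integration settings.

```python
# Hedged sketch of the driver-side Spark configuration; HDFS paths and
# the master URL are placeholders, not values from the linked repo.
SPARK_CONF = {
    "spark.executor.instances": "8",
    "spark.executor.cores": "3",
    "spark.executor.memory": "8g",
    # Standard settings for using Hudi with Spark:
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.hudi.HoodieSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    # Event log and SQL warehouse on the separate HDFS cluster (placeholder paths):
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": "hdfs://namenode:8020/spark-events",
    "spark.sql.warehouse.dir": "hdfs://namenode:8020/warehouse",
}

def build_session(master_url: str):
    """Build a SparkSession against the remote cluster; requires pyspark."""
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName("kafka-to-hudi-benchmark").master(master_url)
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```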

The data in the Kafka topic is the `lineitem` table from the TPC-H benchmark (scale 
factor = 10, about 60 million records).

The Hudi table, the Spark event log directory, and the SQL warehouse directory are 
placed on a separate HDFS cluster.

For Hudi 1.1.0 (DataSource V1 is used), the total time is about 17 minutes.

GitHub link: 
https://github.com/apache/hudi/discussions/13955#discussioncomment-15059391
