GitHub user geserdugarov edited a comment on the discussion: Spark DataSource V2 read and write benchmarks?
To start, I will use a **read from a Kafka topic** (8 partitions) and a direct **write to a Hudi table** (MOR, upsert, bucket index, 16 buckets) for benchmarking: https://github.com/geserdugarov/test-hudi-issues/blob/main/common/read-from-kafka-write-to-hudi.py

This PySpark script will be run on a local PC, which acts as the driver and submits the job to a remote Spark cluster (Spark 3.5.7) with 8 executors (3 CPUs and 8 GB of memory each): https://github.com/geserdugarov/test-hudi-issues/blob/main/utils/spark_configuration.py

The data in the Kafka topic is the `lineitem` table from the TPC-H benchmark (scale factor = 10, 60 million records). All records are unique for now.

Write scenario (4 commits in total):
- (4.5 million * 8) = 36 million records in the 1st commit,
- (1 million * 8) = 8 million records per commit in each of the following 3 commits,
- all table services (compaction, cleaning, compaction scheduling) are disabled.

The Hudi table, the Spark event log directory, and the SQL warehouse directory are placed on a separate HDFS cluster to prevent any data transfer to the driver.

For Hudi 1.1.0 (Spark DataSource V1 is used), the **total time is about 17 min**.

GitHub link: https://github.com/apache/hudi/discussions/13955#discussioncomment-15059391
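For reference, below is a minimal PySpark sketch of the Kafka-read / Hudi-write flow described above, not the actual benchmark script (which is in the linked repository). The bootstrap servers, topic name, schema, record key, precombine field, and HDFS path are placeholders/assumptions; the Hudi options reflect the stated setup (MOR, upsert, bucket index with 16 buckets, table services disabled).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, IntegerType, DoubleType, StringType

spark = (
    SparkSession.builder
    .appName("kafka-to-hudi-benchmark")
    # Hudi's Spark bundle and serializer are assumed to be configured
    # on the cluster (spark.jars.packages, spark.serializer, etc.).
    .getOrCreate()
)

# Schema sketch covering only a few TPC-H lineitem columns;
# the real script would declare all 16 columns.
lineitem_schema = StructType([
    StructField("l_orderkey", LongType()),
    StructField("l_linenumber", IntegerType()),
    StructField("l_quantity", DoubleType()),
    StructField("l_shipdate", StringType()),
])

# Batch read of one benchmark slice from the 8-partition Kafka topic.
raw = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder address
    .option("subscribe", "lineitem")                   # placeholder topic name
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
)

lineitem = raw.select(
    from_json(col("value").cast("string"), lineitem_schema).alias("r")
).select("r.*")

# Hudi write options: MOR table, upsert, bucket index with 16 buckets,
# and table services (compaction, its scheduling, cleaning) disabled.
hudi_options = {
    "hoodie.table.name": "lineitem_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "l_orderkey,l_linenumber",  # assumed composite key
    "hoodie.datasource.write.precombine.field": "l_shipdate",              # assumed ordering field
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "16",
    "hoodie.compact.inline": "false",
    "hoodie.compact.schedule.inline": "false",
    "hoodie.clean.automatic": "false",
}

(
    lineitem.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("hdfs:///benchmarks/hudi/lineitem_mor")       # placeholder HDFS path
)
```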
