GitHub user geserdugarov edited a comment on the discussion: Spark DataSource V2 read and write benchmarks?
To start, I will benchmark reading from a Kafka topic (8 partitions) and writing directly to a Hudi table (MOR, upsert, bucket index, 16 buckets): https://github.com/geserdugarov/test-hudi-issues/blob/main/common/read-from-kafka-write-to-hudi.py

This PySpark script runs on a local PC, which acts as the driver and submits the job to a remote Spark cluster (Spark 3.5.7) with 8 executors (3 CPUs and 8 GB of memory each): https://github.com/geserdugarov/test-hudi-issues/blob/main/utils/spark_configuration.py

The data in the Kafka topic is the `lineitem` table from the TPC-H benchmark (scale factor = 10, 60 million records). The Hudi table, the Spark event log directory, and the SQL warehouse directory are placed on a separate HDFS cluster.

For Hudi 1.1.0 (DataSource V1 is used), the total time is about 17 min.

GitHub link: https://github.com/apache/hudi/discussions/13955#discussioncomment-15059391
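For readers who don't want to open the linked script, the shape of the benchmark job can be sketched as below. This is a minimal sketch only: the record key, precombine field, table path, and topic name are assumptions for illustration, not values taken from the actual script.

```python
# Sketch of the Kafka -> Hudi upsert benchmark described above.
# Assumed names: topic, key/precombine fields, and paths are illustrative.

def build_hudi_options(table_name: str) -> dict:
    """Hudi write options matching the setup: MOR, upsert, bucket index, 16 buckets."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        # Assumed key/ordering fields for TPC-H lineitem:
        "hoodie.datasource.write.recordkey.field": "l_orderkey,l_linenumber",
        "hoodie.datasource.write.precombine.field": "l_shipdate",
        "hoodie.index.type": "BUCKET",
        "hoodie.bucket.index.num.buckets": "16",
    }


def run(spark, kafka_servers: str, topic: str, table_path: str) -> None:
    # Batch-read the whole topic (8 partitions in the benchmark setup) ...
    df = (spark.read.format("kafka")
          .option("kafka.bootstrap.servers", kafka_servers)
          .option("subscribe", topic)
          .option("startingOffsets", "earliest")
          .load())
    # ... parse `value` into lineitem columns (omitted here), then
    # upsert into the Hudi table on HDFS.
    (df.write.format("hudi")
       .options(**build_hudi_options("lineitem"))
       .mode("append")
       .save(table_path))
```

The `run` function would be invoked from the driver with the SparkSession configured as in `spark_configuration.py` (8 executors, 3 CPUs / 8 GB each).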
