Hello
I found that both spark-shell and spark-submit are slow. It takes 1 minute to
insert 10,000 rows into a 10-million-row dataset. Is this normal?



ma...@bonc.com.cn
 
From: lamberken
Date: 2019-12-26 16:18
To: dev@hudi.apache.org
Subject: Re: insert too slow
Hi @mayu1,
 
 
Here is the Tuning Guide[1]; it may help you improve performance.
 
 
[1]https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
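As a rough illustration of the kind of settings that guide covers, the sketch below
raises the write parallelism (getQuickstartWriteConfigs keeps it very small) and uses
the bulk_insert write operation, which is intended for large initial loads. The memory
sizes and parallelism numbers are only placeholders to tune for your data and cluster,
and it assumes the same imports, df, tableName and basePath as the spark-shell example
earlier in this thread.
========================================================================
${SPARK_HOME}/bin/spark-shell \
--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
--driver-memory 4G \
--executor-memory 4G \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

// reuse the imports, tableName, basePath and df from the earlier example
df.write.format("org.apache.hudi").
  options(getQuickstartWriteConfigs).
  // raise the shuffle parallelism that getQuickstartWriteConfigs sets to a small value
  option("hoodie.insert.shuffle.parallelism", "200").
  option("hoodie.upsert.shuffle.parallelism", "200").
  option("hoodie.bulkinsert.shuffle.parallelism", "200").
  // bulk_insert is the write operation meant for large initial loads
  option(OPERATION_OPT_KEY, "bulk_insert").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
========================================================================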
 
 
 
 
best,
lamber-ken
 
 
 
 
On 12/26/2019 15:46,ma...@bonc.com.cn<ma...@bonc.com.cn> wrote:
Thank you for your reply, it really works. And how can I insert 100 million
records without OOM?
 
 
 
ma...@bonc.com.cn
 
From: lamberken
Date: 2019-12-26 15:33
To: dev@hudi.apache.org
Subject: Re:insert too slow
 
 
Hi @mayu1,
 
 
Can you run the program below in the console? Looking forward to your feedback.
 
 
========================================================================
${SPARK_HOME}/bin/spark-shell \
--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 
 
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._
import scala.collection.JavaConversions._
 
 
val tableName = "tableName"
val basePath = "file:///tmp/data"
 
 
for (i <- 1 to 1) {
  println("start:" + System.currentTimeMillis())
  val dataGen = new DataGenerator
  val inserts = convertToStringList(dataGen.generateInserts(10000))
  println("start insert:" + System.currentTimeMillis())
  val df = spark.read.json(spark.sparkContext.parallelize(inserts, 32))
  df.write.format("org.apache.hudi").
    options(getQuickstartWriteConfigs).
    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
    option(TABLE_NAME, tableName).
    mode(Append).
    save(basePath)
  println("finish" + i + " " + System.currentTimeMillis())
}
========================================================================
 
 
best,
lamber-ken
On 12/26/2019 15:08,ma...@bonc.com.cn<ma...@bonc.com.cn> wrote:
Hello!
What is the throughput of Hudi? I currently use Spark to insert 10,000 records
(300 bytes each), which takes one minute. Is it too slow?
My program:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
 
object HudiDataGen {
  def main(args: Array[String]): Unit = {
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.QuickstartUtils._
    import org.apache.hudi.config.HoodieWriteConfig._
    import org.apache.spark.sql.SaveMode._

    import scala.collection.JavaConversions._

    // Initialization
    val conf = new SparkConf().setAppName("HudiTest")
      .setMaster("local[*]")
    // Use the Kryo serialization library
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    val spark = new SQLContext(sc)

    // Set the table name, base path, and data generator used to generate records for this guide.
    val tableName = Constant.tableName
    //        val basePath = Constant.hdfsPath
    val basePath = args(0)
    //        val basePath = "file:///e:/hudi_cow_table"
    val count = args(1)
    for (i <- 1 to count.toInt) {
      println("start:" + System.currentTimeMillis())
      val dataGen = new DataGenerator
      // Generate some new trip samples, load them into a DataFrame, then write the DataFrame into the Hudi dataset, as shown below.
      val inserts = convertToStringList(dataGen.generateInserts(10000))
      //        println("insert:"+System.currentTimeMillis())
      println("start insert:" + System.currentTimeMillis())
      val df = spark.read.json(spark.sparkContext.parallelize(inserts, 32))
      df.write.format("org.apache.hudi").
        options(getQuickstartWriteConfigs).
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        option(TABLE_NAME, tableName).
        //                option(STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ").
        mode(Append).
        save(basePath)
      println("finish" + i + " " + System.currentTimeMillis())
    }

  }
}
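For reference, a standalone job like this would be launched roughly as below; the jar
name and output path are placeholders, args(0) is the base path and args(1) the number
of 10,000-record batches to write.
========================================================================
${SPARK_HOME}/bin/spark-submit \
--class HudiDataGen \
--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
hudi-datagen.jar file:///tmp/hudi_cow_table 1
========================================================================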
 
 
ma...@bonc.com.cn
