[ https://issues.apache.org/jira/browse/HUDI-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437732#comment-17437732 ]

Sagar Sumit commented on HUDI-943:
----------------------------------

[~h117561964] [~vbalaji] Is this still an issue? HoodieSparkSqlWriter has 
changed significantly since this issue was reported.

> Slow performance observed when inserting data into Hudi table
> -------------------------------------------------------------
>
>                 Key: HUDI-943
>                 URL: https://issues.apache.org/jira/browse/HUDI-943
>             Project: Apache Hudi
>          Issue Type: Test
>          Components: Performance
>            Reporter: Sam Huang
>            Assignee: Balaji Varadarajan
>            Priority: Blocker
>              Labels: performance, sev:critical, user-support-issues
>
> I am using the Datasource Writer API to write 5000 records into a Hudi 
> copy-on-write table; each record has 8 columns, and the total size is less than 1 MB. 
> Please refer to the code below.
>  
> {code:java}
> Dataset<Row> ds1 = spark.read().json(jsc.parallelize(records, 2));
> DataFrameWriter<Row> writer = ds1.write().format("org.apache.hudi")
>     .option("hoodie.insert.shuffle.parallelism", 2)
>     .option("hoodie.upsert.shuffle.parallelism", 2)
>     .option(DataSourceWriteOptions.OPERATION_OPT_KEY(),
>         DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL())
>     .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), tableType)
>     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), recordKey)
>     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), partitionPath)
>     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), precombineKey)
>     .option(HoodieWriteConfig.TABLE_NAME, hudiTableName)
>     .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, hudiWriteCompactEnabled)
>     .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), hudiTableName)
>     .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), hiveDatabase)
>     .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), hiveServerUrl)
>     .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY(), hiveUser)
>     .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), hivePassword)
>     .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), hiveSyncEnabled)
>     .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), partitionPath)
>     .mode(SaveMode.Append);
> writer.save(basePath);
> {code}
>  
> Initially the insert operation takes only 3~4 seconds, but it gets 
> progressively slower, reaching around 30 seconds after 5 minutes.
> The Spark logs below show that most of the time is spent on the count task in 
> HoodieSparkSqlWriter.
>  
> {noformat}
> 2020-05-25 16:36:37,851 | INFO  | [dag-scheduler-event-loop] | Adding task 
> set 185.0 with 1 tasks | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2020-05-25 16:36:37,851 | INFO  | [dispatcher-event-loop-0] | Starting task 
> 0.0 in stage 185.0 (TID 190, node-ana-corepOlf, executor 2, partition 0, 
> NODE_LOCAL, 7651 bytes) | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2020-05-25 16:36:37,858 | INFO  | [dispatcher-event-loop-1] | Added 
> broadcast_124_piece0 in memory on node-ana-corepOlf:36554 (size: 138.1 KB, 
> free: 29.2 GB) | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2020-05-25 16:36:37,887 | INFO  | [dispatcher-event-loop-1] | Asked to send 
> map output locations for shuffle 53 to 10.155.114.97:32461 | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2020-05-25 16:37:11,098 | INFO  | [dispatcher-event-loop-0] | Added rdd_381_0 
> in memory on node-ana-corepOlf:36554 (size: 387.0 B, free: 29.2 GB) | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2020-05-25 16:37:11,111 | INFO  | [task-result-getter-2] | Finished task 0.0 
> in stage 185.0 (TID 190) in 33260 ms on node-ana-corepOlf (executor 2) (1/1) 
> | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2020-05-25 16:37:11,111 | INFO  | [task-result-getter-2] | Removed TaskSet 
> 185.0, whose tasks have all completed, from pool  | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2020-05-25 16:37:11,112 | INFO  | [dag-scheduler-event-loop] | ResultStage 
> 185 (count at HoodieSparkSqlWriter.scala:254) finished in 33.308 s | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2020-05-25 16:37:11,113 | INFO  | [Driver] | Job 70 finished: count at 
> HoodieSparkSqlWriter.scala:254, took 33.438673 s | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> {noformat}
>  
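> The stage above corresponds to a count() inside HoodieSparkSqlWriter, and in 
> Spark the time charged to such an action includes recomputing any uncached 
> lineage behind it. Below is a minimal sketch of one thing worth trying, under 
> the (unconfirmed) assumption that part of the cost comes from re-evaluating 
> the input on each action: cache the input Dataset once before handing it to 
> the writer.
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> 
> // Sketch only: persist the input so later actions in the write path do not
> // recompute the parallelize + JSON-parse lineage each time.
> Dataset<Row> ds1 = spark.read().json(jsc.parallelize(records, 2));
> ds1.cache();
> ds1.count();            // materialize once up front
> writer.save(basePath);  // Hudi write, configured as in the snippet above
> ds1.unpersist();
> {code}
> 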
> I tried tuning the parameter hoodie.insert.shuffle.parallelism up to 20, but 
> it did not help. CPU and heap usage are both normal.
>  
> Below are the application settings (see the sketch after this list):
> Executor instances: 2
> Executor memory: 55g
> Executor cores: 4
> Driver memory: 4g
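> 
> A minimal sketch of the same resources expressed as standard Spark 
> configuration keys in Java (the application name is hypothetical, and 
> spark.driver.memory normally has to be passed at submit time, before the 
> driver JVM starts):
> {code:java}
> import org.apache.spark.sql.SparkSession;
> 
> // Builder mirroring the reported resources:
> // 2 executor instances, 55g executor memory, 4 executor cores, 4g driver memory.
> SparkSession spark = SparkSession.builder()
>     .appName("hudi-insert-test")              // hypothetical application name
>     .config("spark.executor.instances", "2")
>     .config("spark.executor.memory", "55g")
>     .config("spark.executor.cores", "4")
>     .config("spark.driver.memory", "4g")      // usually set via spark-submit instead
>     .getOrCreate();
> {code}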
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
