[ https://issues.apache.org/jira/browse/KUDU-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke updated KUDU-2672:
------------------------------
    Fix Version/s:     (was: 1.8.0)

> Spark write to kudu, too many machines write to one tserver.
> ------------------------------------------------------------
>
>                 Key: KUDU-2672
>                 URL: https://issues.apache.org/jira/browse/KUDU-2672
>             Project: Kudu
>          Issue Type: Improvement
>          Components: java, spark
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Priority: Major
>              Labels: performance
>
> In the Spark use case, we sometimes use Spark to write data to Kudu, for example importing a Hive table into a Kudu table. The current implementation has two problems:
> # It uses FlushMode.AUTO_FLUSH_BACKGROUND, which makes error handling inefficient. When an error such as a timeout occurs, the task still flushes all of its buffered data and then fails, so retries happen at the task level.
> # For the write path, Spark splits the data into partitions with its default hash partitioning, which does not necessarily match the tablet distribution. For example, a 500 GB Hive table may produce 2000 tasks while there are only 20 tserver machines, so up to 2000 executors may write to 20 tservers at the same time. This hurts performance in two ways. First, the tserver takes a row lock per primary key, so there is heavy lock contention; in the worst case, write operations keep timing out. Second, many machines write to each tserver concurrently, with no throttling anywhere in the code.
> We therefore suggest two changes:
> # Change the flush mode to MANUAL_FLUSH and handle errors first at the row level, falling back to the task level.
> # Add an optional repartition step in Spark that repartitions the data by the tablet distribution, so that only one machine writes to each tserver and the lock contention disappears.
> We have used this feature for some time.
> It has solved problems when writing large tables to Kudu with Spark, and I hope this feature will be useful for the community, many of whom use Spark together with Kudu.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
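The first suggestion (manual flush with row-level retry before task-level failure) can be sketched as follows. This is a hedged, self-contained simulation: `MockSession`, `RowResult`, and the "rows containing bad fail once" rule are all hypothetical stand-ins, not the real Kudu client API, which would use `KuduSession` with `SessionConfiguration.FlushMode.MANUAL_FLUSH` and inspect `OperationResponse` row errors after `flush()`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ManualFlushSketch {
    // One flush response for a buffered row: either ok or carrying an error.
    record RowResult(String row, boolean failed) {}

    // Hypothetical stand-in for a Kudu session in MANUAL_FLUSH mode:
    // rows containing "bad" fail on their first flush and succeed on retry.
    static class MockSession {
        private final List<String> buffer = new ArrayList<>();
        private final Set<String> seenFailures = new HashSet<>();

        void apply(String row) { buffer.add(row); }

        List<RowResult> flush() {
            List<RowResult> out = new ArrayList<>();
            for (String row : buffer) {
                boolean fail = row.contains("bad") && seenFailures.add(row);
                out.add(new RowResult(row, fail));
            }
            buffer.clear();
            return out;
        }
    }

    // Buffer the batch, flush manually, then retry only the failed rows one
    // by one (row-level retry); escalate to a task-level failure only if a
    // retried row fails again. Returns the number of row-level retries.
    static int writeBatch(MockSession session, List<String> rows) {
        rows.forEach(session::apply);
        int retries = 0;
        for (RowResult r : session.flush()) {
            if (r.failed()) {
                session.apply(r.row());           // retry just this row
                retries++;
                for (RowResult again : session.flush()) {
                    if (again.failed()) {
                        throw new RuntimeException("task-level failure: " + again.row());
                    }
                }
            }
        }
        return retries;
    }
}
```

With AUTO_FLUSH_BACKGROUND the equivalent of `writeBatch` has no chance to retry an individual row, which is exactly the inefficiency the issue describes.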
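The fan-in problem behind the second suggestion can be illustrated with a small simulation. The numbers (2000 tasks, 20 tservers) come from the issue; the hash function and the one-task-per-tserver repartitioning are hypothetical stand-ins for Spark's default hash partitioning and the proposed repartition-by-tablet-distribution step.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FanInSimulation {
    static final int TASKS = 2000, TSERVERS = 20, ROWS_PER_TASK = 100;

    // Default behaviour: each task hashes its rows across all tservers,
    // so nearly every task ends up writing to every tserver.
    static int maxWritersPerTserverHashed() {
        Map<Integer, Set<Integer>> writersByTserver = new HashMap<>();
        for (int task = 0; task < TASKS; task++) {
            for (int row = 0; row < ROWS_PER_TASK; row++) {
                int tserver = (task * 31 + row) % TSERVERS; // stand-in hash
                writersByTserver.computeIfAbsent(tserver, k -> new HashSet<>())
                                .add(task);
            }
        }
        return writersByTserver.values().stream()
                               .mapToInt(Set::size).max().orElse(0);
    }

    // Suggested behaviour: repartition the data by tablet distribution so
    // each tserver is written to by exactly one partition (one machine).
    static int maxWritersPerTserverRepartitioned() {
        Map<Integer, Set<Integer>> writersByTserver = new HashMap<>();
        for (int partition = 0; partition < TSERVERS; partition++) {
            writersByTserver.computeIfAbsent(partition, k -> new HashSet<>())
                            .add(partition); // partition i -> tserver i
        }
        return writersByTserver.values().stream()
                               .mapToInt(Set::size).max().orElse(0);
    }
}
```

Under the hashed layout every one of the 2000 tasks touches every tserver, which is the source of the row-lock contention described above; after repartitioning, each tserver sees a single writer.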