yangz created KUDU-2672:
---------------------------

             Summary: Spark writes to Kudu: too many machines write to one tserver.
                 Key: KUDU-2672
                 URL: https://issues.apache.org/jira/browse/KUDU-2672
             Project: Kudu
          Issue Type: Improvement
          Components: java, spark
    Affects Versions: 1.8.0
            Reporter: yangz
             Fix For: 1.8.0


For the Spark use case: we sometimes use Spark to write data to Kudu, for
example importing a Hive table into a Kudu table.

There are two problems in the current implementation:
 # It uses FlushMode.AUTO_FLUSH_BACKGROUND, which is not efficient for error 
handling. When an error such as a timeout happens, all of the task's data has 
already been flushed, so the task simply fails and the retry happens at the 
task level (see the sketch after this list).
 # For the write path, Spark uses its default hash partitioning to split the 
data into partitions, and that hashing does not necessarily match the tablet 
distribution. For example, a 500 GB Hive table may produce 2000 tasks while we 
only have 20 tserver machines, so up to 2000 writers may be writing to 20 
tservers at the same time. This hurts performance in two ways. First, primary 
key locking: the tserver takes row locks, so there is a lot of lock waiting, 
and in the worst case write operations keep timing out. Second, with so many 
machines writing to each tserver at the same time, there is no throttling 
anywhere in the code.
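As a minimal sketch of problem 1 (not the actual kudu-spark writer code), 
assuming the Kudu Java client API and a made-up master address, table name and 
key column: with AUTO_FLUSH_BACKGROUND, row errors such as timeouts only 
become visible through getPendingErrors() after the buffered rows have already 
been sent, so the only practical reaction is to fail and retry the whole Spark 
task.

{code:java}
import org.apache.kudu.client.*;

public class BackgroundFlushWrite {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("impala::default.events");  // hypothetical table
    KuduSession session = client.newSession();
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

    for (long id = 0; id < 1_000_000; id++) {
      Insert insert = table.newInsert();
      insert.getRow().addLong("id", id);      // hypothetical key column
      session.apply(insert);                  // rows are buffered, errors are deferred
    }
    session.flush();                          // remaining buffer is sent here

    // Row-level failures only show up now, after everything has been sent.
    RowErrorsAndOverflowStatus pending = session.getPendingErrors();
    if (pending.getRowErrors().length > 0 || pending.isOverflowed()) {
      // No per-row recovery is possible at this point; the writer has to fail
      // the whole task and let Spark retry the entire partition.
      throw new RuntimeException("row errors after flush: " + pending.getRowErrors().length);
    }
    session.close();
    client.shutdown();
  }
}
{code}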

So we suggest two things:
 # Change the flush mode to MANUAL_FLUSH and handle errors at the row level 
first, escalating to a task-level failure only as a last resort (see the 
sketch after this list).
 # Add an optional repartition step in Spark that repartitions the data by the 
tablet distribution. Then only one machine writes to each tserver and the lock 
contention disappears.
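A minimal sketch of suggestion 1, again assuming the Kudu Java client API with 
a made-up master address, table name, key column and batch size: with 
MANUAL_FLUSH, flush() returns one OperationResponse per applied row, so each 
error can be inspected and handled at the row level, and the task only fails 
as a last resort.

{code:java}
import java.util.List;
import org.apache.kudu.client.*;

public class ManualFlushWrite {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("impala::default.events");  // hypothetical table
    KuduSession session = client.newSession();
    session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
    session.setMutationBufferSpace(2000);     // headroom above the 1000-row batches below

    int buffered = 0;
    for (long id = 0; id < 10_000; id++) {
      Insert insert = table.newInsert();
      insert.getRow().addLong("id", id);      // hypothetical key column
      session.apply(insert);
      if (++buffered == 1000) {
        handleRowErrors(session.flush());     // errors handled per batch, per row
        buffered = 0;
      }
    }
    handleRowErrors(session.flush());         // flush the tail
    session.close();
    client.shutdown();
  }

  // Inspect each response; only escalate to a task-level failure as a last resort.
  static void handleRowErrors(List<OperationResponse> responses) {
    for (OperationResponse resp : responses) {
      if (resp.hasRowError()) {
        RowError err = resp.getRowError();
        // Row-level handling: log, skip, or re-apply just this row.
        System.err.println("row error: " + err.getErrorStatus());
      }
    }
  }
}
{code}

For suggestion 2, the repartition step would need the table's tablet layout so 
that each Spark partition maps to exactly one tserver; no sketch is given here 
because the way to obtain that layout depends on the client version.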

We have been using this feature for some time, and it solved several problems 
when writing large tables from Spark to Kudu. I hope it will be useful for the 
community members who use Spark with Kudu heavily.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
