[ https://issues.apache.org/jira/browse/KUDU-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke updated KUDU-2672:
------------------------------
    Fix Version/s:     (was: 1.8.0)

> Spark write to kudu, too many machines write to one tserver.
> ------------------------------------------------------------
>
>                 Key: KUDU-2672
>                 URL: https://issues.apache.org/jira/browse/KUDU-2672
>             Project: Kudu
>          Issue Type: Improvement
>          Components: java, spark
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Priority: Major
>              Labels: performance
>
> In the Spark use case, we sometimes use Spark to write data to Kudu, for example importing a Hive table into a Kudu table. The current implementation has two problems:
> # It uses FlushMode.AUTO_FLUSH_BACKGROUND, which makes error handling inefficient. When an error such as a timeout occurs, the task still flushes all of its buffered data and then fails, so retries happen at the task level.
> # For the write path, Spark splits the data into partitions with its default hash partitioning, which does not necessarily match the tablet distribution. For example, a 500 GB Hive table may produce 2000 tasks while there are only 20 tserver machines, so up to 2000 executors may write to 20 tservers at the same time. This hurts performance in two ways. First, the tserver takes a row lock per primary key, so there is heavy lock contention; in the worst case, write operations keep timing out. Second, many machines write to each tserver concurrently, with no throttling anywhere in the code.
> We therefore suggest two changes:
> # Change the flush mode to MANUAL_FLUSH and handle errors first at the row level, falling back to the task level.
> # Add an optional repartition step in Spark that repartitions the data by the tablet distribution, so that only one machine writes to each tserver and the lock contention disappears.
> We have used this feature for some time.
> It has solved problems when writing large tables to Kudu with Spark, and I hope this feature will be useful for the community, many of whom use Spark together with Kudu.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
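The first suggestion (manual flush with row-level retry before task-level failure) can be sketched as follows. This is a hedged, self-contained simulation: `MockSession`, `RowResult`, and the "rows containing bad fail once" rule are all hypothetical stand-ins, not the real Kudu client API, which would use `KuduSession` with `SessionConfiguration.FlushMode.MANUAL_FLUSH` and inspect `OperationResponse` row errors after `flush()`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ManualFlushSketch {
    // One flush response for a buffered row: either ok or carrying an error.
    record RowResult(String row, boolean failed) {}

    // Hypothetical stand-in for a Kudu session in MANUAL_FLUSH mode:
    // rows containing "bad" fail on their first flush and succeed on retry.
    static class MockSession {
        private final List<String> buffer = new ArrayList<>();
        private final Set<String> seenFailures = new HashSet<>();

        void apply(String row) { buffer.add(row); }

        List<RowResult> flush() {
            List<RowResult> out = new ArrayList<>();
            for (String row : buffer) {
                boolean fail = row.contains("bad") && seenFailures.add(row);
                out.add(new RowResult(row, fail));
            }
            buffer.clear();
            return out;
        }
    }

    // Buffer the batch, flush manually, then retry only the failed rows one
    // by one (row-level retry); escalate to a task-level failure only if a
    // retried row fails again. Returns the number of row-level retries.
    static int writeBatch(MockSession session, List<String> rows) {
        rows.forEach(session::apply);
        int retries = 0;
        for (RowResult r : session.flush()) {
            if (r.failed()) {
                session.apply(r.row());           // retry just this row
                retries++;
                for (RowResult again : session.flush()) {
                    if (again.failed()) {
                        throw new RuntimeException("task-level failure: " + again.row());
                    }
                }
            }
        }
        return retries;
    }
}
```

With AUTO_FLUSH_BACKGROUND the equivalent of `writeBatch` has no chance to retry an individual row, which is exactly the inefficiency the issue describes.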
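The fan-in problem behind the second suggestion can be illustrated with a small simulation. The numbers (2000 tasks, 20 tservers) come from the issue; the hash function and the one-task-per-tserver repartitioning are hypothetical stand-ins for Spark's default hash partitioning and the proposed repartition-by-tablet-distribution step.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FanInSimulation {
    static final int TASKS = 2000, TSERVERS = 20, ROWS_PER_TASK = 100;

    // Default behaviour: each task hashes its rows across all tservers,
    // so nearly every task ends up writing to every tserver.
    static int maxWritersPerTserverHashed() {
        Map<Integer, Set<Integer>> writersByTserver = new HashMap<>();
        for (int task = 0; task < TASKS; task++) {
            for (int row = 0; row < ROWS_PER_TASK; row++) {
                int tserver = (task * 31 + row) % TSERVERS; // stand-in hash
                writersByTserver.computeIfAbsent(tserver, k -> new HashSet<>())
                                .add(task);
            }
        }
        return writersByTserver.values().stream()
                               .mapToInt(Set::size).max().orElse(0);
    }

    // Suggested behaviour: repartition the data by tablet distribution so
    // each tserver is written to by exactly one partition (one machine).
    static int maxWritersPerTserverRepartitioned() {
        Map<Integer, Set<Integer>> writersByTserver = new HashMap<>();
        for (int partition = 0; partition < TSERVERS; partition++) {
            writersByTserver.computeIfAbsent(partition, k -> new HashSet<>())
                            .add(partition); // partition i -> tserver i
        }
        return writersByTserver.values().stream()
                               .mapToInt(Set::size).max().orElse(0);
    }
}
```

Under the hashed layout every one of the 2000 tasks touches every tserver, which is the source of the row-lock contention described above; after repartitioning, each tserver sees a single writer.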