[ https://issues.apache.org/jira/browse/KUDU-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754376#comment-16754376 ]
Grant Henke commented on KUDU-2670:
-----------------------------------

[~yangz] If you have an older, un-rebased WIP patch, I would be interested to see it. This work would be very beneficial to the backup jobs that we are working on now. I am happy to help get this work into Kudu any way I can.

> Splitting more tasks for spark job, and add more concurrent for scan operation
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2670
>                 URL: https://issues.apache.org/jira/browse/KUDU-2670
>             Project: Kudu
>          Issue Type: Improvement
>          Components: java, spark
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Priority: Major
>              Labels: backup, performance
>
> Refer to KUDU-2437, "Split a tablet into primary key ranges by size".
> We need a Java client implementation that supports splitting a tablet scan
> operation.
> We suggest two new features for the Java client:
> # A ConcurrentKuduScanner that runs several scanners at the same time to
> read data in parallel. This is useful in one case in particular: we scan for
> only one row, but the predicate doesn't contain the primary key. In this case
> we send a lot of scan requests but only one row is returned, and sending that
> many scan requests one by one is slow, so we need a concurrent way. In our
> tests of this case on a 10G tablet, it saves a lot of time on one machine.
> # A way to split a scan into more Spark tasks. To do so, we get scan tokens
> in two steps: first we ask the tserver for key ranges, then we use those
> ranges to get more scan tokens. In our usage a tablet is 10G, but we split the
> work so that each task processes only 1G of data, which gives us better
> performance.
> All of these features have run well for us for half a year. We hope they
> will be useful to the community.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
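
The two ideas in the quoted description (cut one tablet's key range into smaller pieces, then scan the pieces at the same time) can be illustrated with a small, self-contained Java sketch. It does not use the Kudu client at all: `SplitScanSketch`, `KeyRange`, `split`, `scan`, and `concurrentScan` are hypothetical names invented for this illustration, the key space is modeled as plain `long` values, and `scan` is a placeholder for a real scan request. It is only meant to show the shape of the approach, not how the patch implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SplitScanSketch {
    /** Hypothetical half-open key range [start, end) within one tablet. */
    static final class KeyRange {
        final long start;
        final long end;
        KeyRange(long start, long end) { this.start = start; this.end = end; }
    }

    /** Cut [0, tabletSize) into consecutive ranges of at most chunkSize. */
    static List<KeyRange> split(long tabletSize, long chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be > 0");
        List<KeyRange> ranges = new ArrayList<>();
        for (long s = 0; s < tabletSize; s += chunkSize) {
            ranges.add(new KeyRange(s, Math.min(s + chunkSize, tabletSize)));
        }
        return ranges;
    }

    /** Placeholder for a real scan: returns 1 if the wanted key is in the range. */
    static long scan(KeyRange r, long wantedKey) {
        return (wantedKey >= r.start && wantedKey < r.end) ? 1L : 0L;
    }

    /** Scan all ranges on a fixed-size thread pool and sum the matches. */
    static long concurrentScan(List<KeyRange> ranges, long wantedKey, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (KeyRange r : ranges) {
                Callable<Long> task = () -> scan(r, wantedKey);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // wait for each sub-scan to finish
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("sub-scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Mirrors the 10G tablet / 1G per task ratio from the description.
        List<KeyRange> ranges = split(10, 1);
        System.out.println(ranges.size() + " ranges, "
                + concurrentScan(ranges, 7, 4) + " matching row(s)");
    }
}
```

In a Spark job, each `KeyRange` would correspond to one task rather than one thread; the thread pool here stands in for the concurrent scanner on a single machine.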