[jira] [Commented] (KUDU-2670) Splitting more tasks for spark job, and add more concurrent for scan operation

Grant Henke (JIRA) Mon, 28 Jan 2019 13:32:34 -0800


    [ 
https://issues.apache.org/jira/browse/KUDU-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754376#comment-16754376
 ]


Grant Henke commented on KUDU-2670:
-----------------------------------

[~yangz] If you have an older, un-rebased WIP patch, I would be interested to 
see it. This work would be very beneficial to the backup jobs that we are 
working on now. I am happy to help get this work into Kudu any way I can.



> Splitting more tasks for spark job, and add more concurrent for scan operation
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2670
>                 URL: https://issues.apache.org/jira/browse/KUDU-2670
>             Project: Kudu
>          Issue Type: Improvement
>          Components: java, spark
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Priority: Major
>              Labels: backup, performance
>
> See KUDU-2437 (Split a tablet into primary key ranges by size).
> We need a Java client implementation that supports splitting the tablet scan 
> operation.
> We suggest two new implementations for the Java client.
>  # A ConcurrentKuduScanner that lets multiple scanners read data at the same 
> time. This is useful in one case in particular: we scan for only one row, but 
> the predicate doesn't contain the primary key. In this case we send a lot of 
> scanner requests but only one row is returned, and sending so many scanner 
> requests one by one is slow, so we need a concurrent way to send them. In our 
> tests of this case on a 10G tablet, it saves a lot of time on one machine.
>  # A way to split a Spark job into more tasks. To do so, we get the scanner 
> tokens in two steps: first we ask the tserver for the key ranges, then we use 
> these ranges to get more scanner tokens. In our usage each tablet is 10G, but 
> each task processes only 1G of data, so we get better performance.
> These features have run well for us for half a year. We hope they will be 
> useful for the community.
>  
>  
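
The two proposals quoted above (split one tablet scan into size-based ranges, 
then scan those ranges concurrently) can be illustrated with a minimal sketch. 
It does not use the Kudu client and is not the reporter's patch: the class 
SplitScanSketch, the methods splitBySize and scanConcurrently, the byte-offset 
ranges standing in for primary key ranges, and the fake scan function are all 
hypothetical names and assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Kudu client API.
final class SplitScanSketch {

    // Step 1 (stand-in for asking the tserver for key ranges by size):
    // cut [0, tabletBytes) into half-open ranges of at most chunkBytes each.
    public static List<long[]> splitBySize(long tabletBytes, long chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < tabletBytes; start += chunkBytes) {
            ranges.add(new long[] {start, Math.min(start + chunkBytes, tabletBytes)});
        }
        return ranges;
    }

    // Step 2: scan every range on a fixed-size thread pool instead of one by
    // one. Results are collected in range order.
    public static <R> List<R> scanConcurrently(List<long[]> ranges, int threads,
                                               Function<long[], List<R>> scanOneRange)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (long[] range : ranges) {
                futures.add(pool.submit(() -> scanOneRange.apply(range)));
            }
            List<R> rows = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                rows.addAll(future.get());
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long gib = 1L << 30;
        // A 10 GiB tablet processed in 1 GiB pieces, as in the description.
        List<long[]> ranges = splitBySize(10 * gib, gib);
        // Fake scan: only the range containing offset 7.5 GiB has a matching row.
        long target = 7 * gib + gib / 2;
        List<String> rows = scanConcurrently(ranges, 4, r ->
                (r[0] <= target && target < r[1]) ? List.of("matching-row") : List.of());
        System.out.println(ranges.size() + " ranges, " + rows.size() + " row(s)");
        // prints "10 ranges, 1 row(s)"
    }
}
```

In a real job, each range would presumably become its own scan token and its 
own Spark task, and the scan function would open a scanner limited to that 
range. How ranges are actually obtained from the tserver is left to KUDU-2437.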



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
