[jira] [Updated] (KUDU-2437) Split a tablet into primary key ranges by size

Xu Yao (JIRA) Tue, 19 Jun 2018 21:03:15 -0700


     [ 
https://issues.apache.org/jira/browse/KUDU-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xu Yao updated KUDU-2437:
-------------------------
    Description: 
When reading data from a Kudu table with Spark, a scan takes a long time if 
the tablet holds a large amount of data. The reason is that KuduRDD generates 
one scanToken per tablet, so a single Spark task has to process all the data 
in that tablet. 

We think the TabletServer should provide an RPC interface that splits a 
tablet into multiple primary key ranges by size. The kudu-client can then 
choose whether to scan those ranges in parallel, depending on the case.
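
To make the client-side choice concrete, here is a minimal Python sketch. It is not the kudu-client API: `scan_tablet`, `scan_range`, and `KeyRange` are hypothetical names, and a thread pool stands in for Spark tasks. It assumes the client already holds a list of contiguous [start, stop) key ranges for one tablet and falls back to a plain sequential scan when there is nothing to parallelize.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical range type: [start, stop) encoded keys; b"" means unbounded.
KeyRange = Tuple[bytes, bytes]


def scan_tablet(
    ranges: List[KeyRange],
    scan_range: Callable[[KeyRange], list],
    max_workers: int = 4,
) -> list:
    """Scan every key range and return the rows in range order.

    `scan_range` is an assumed callback that scans a single key range.
    """
    # With zero or one range, or a single worker, a sequential scan is enough.
    if len(ranges) <= 1 or max_workers <= 1:
        return [row for r in ranges for row in scan_range(r)]

    # Otherwise scan each range in its own worker; map() keeps range order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_range = pool.map(scan_range, ranges)
        return [row for rows in per_range for row in rows]
```

In a Spark setting each range would more likely become its own partition rather than a thread, but the decision is the same: one task per tablet when the tablet is small, one task per key range when it is large.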

RPC interface:
{code:java}
// A split key range request. Splits a tablet into key ranges; the request
// doesn't change the layout of the tablet.
message SplitKeyRangeRequestPB {
 required bytes tablet_id = 1;

 // Encoded primary key to begin scanning at (inclusive).
 optional bytes start_primary_key = 2 [(kudu.REDACT) = true];
 // Encoded primary key to stop scanning at (exclusive).
 optional bytes stop_primary_key = 3 [(kudu.REDACT) = true];

 // Number of bytes to try to return in each chunk. This is a hint.
 // The tablet server may return chunks larger or smaller than this value.
 optional uint64 target_chunk_size_bytes = 4;

 // The columns to consider when chunking.
 // If specified, then the size estimate used for 'target_chunk_size_bytes'
 // should only include these columns. This can be used if a query will
 // only scan a certain subset of the columns.
 repeated ColumnSchemaPB columns = 5;
}

message SplitKeyRangeResponsePB {
 // The error, if an error occurred with this request.
 optional TabletServerErrorPB error = 1;

 // The primary key ranges that the tablet was split into.
 repeated KeyRangePB ranges = 2;
}
{code}
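
As a rough illustration of how these request fields could interact, here is a minimal Python sketch. It is not Kudu's implementation: it assumes a hypothetical in-memory list of rows sorted by encoded key, each carrying per-column byte sizes, and `split_key_range` and `Row` are made-up names. It shows the inclusive start and exclusive stop bounds, `target_chunk_size_bytes` acting as a hint rather than a hard limit, and the size estimate being restricted to the requested columns.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical row model: (encoded primary key, {column name: size in bytes}).
Row = Tuple[bytes, Dict[str, int]]


def split_key_range(
    rows: List[Row],
    target_chunk_size_bytes: int,
    start_primary_key: Optional[bytes] = None,
    stop_primary_key: Optional[bytes] = None,
    columns: Optional[List[str]] = None,
) -> List[Tuple[bytes, bytes]]:
    """Greedily cut key-sorted rows into [start, stop) ranges near the target size.

    An empty bytes value (b"") stands for an unbounded start or stop key.
    """
    if target_chunk_size_bytes <= 0:
        raise ValueError("target_chunk_size_bytes must be positive")

    lower = start_primary_key or b""
    upper = stop_primary_key or b""

    ranges: List[Tuple[bytes, bytes]] = []
    chunk_start = lower
    chunk_bytes = 0
    for key, sizes in rows:
        if key < lower:
            continue  # before the inclusive start key
        if upper and key >= upper:
            break  # at or past the exclusive stop key
        # Once the target is reached, close the chunk in front of this row,
        # so this row's key becomes the chunk's exclusive stop.
        if chunk_bytes >= target_chunk_size_bytes:
            ranges.append((chunk_start, key))
            chunk_start = key
            chunk_bytes = 0
        # Only the requested columns count toward the size estimate.
        wanted = sizes if not columns else {c: sizes.get(c, 0) for c in columns}
        chunk_bytes += sum(wanted.values())
    ranges.append((chunk_start, upper))
    return ranges
```

The returned ranges are contiguous and together cover the requested key space, so scanning all of them reads the same rows as one scan over the whole interval. The tablet itself is never modified, matching the note that the request doesn't change the tablet's layout.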
 

  was:
When reading data from a Kudu table with Spark, a scan takes a long time if 
the tablet holds a large amount of data. The reason is that KuduRDD generates 
one scanToken per tablet, so a single Spark task has to process all the data 
in that tablet. 

We think the TabletServer should provide an RPC interface that splits a 
tablet into multiple primary key ranges by size. The kudu-client can then 
choose whether to scan those ranges in parallel, depending on the case.

RPC interface:

 


> Split a tablet into primary key ranges by size
> ----------------------------------------------
>
>                 Key: KUDU-2437
>                 URL: https://issues.apache.org/jira/browse/KUDU-2437
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client, tablet
>            Reporter: Xu Yao
>            Assignee: Xu Yao
>            Priority: Major
>
> When reading data from a Kudu table with Spark, a scan takes a long time if 
> the tablet holds a large amount of data. The reason is that KuduRDD generates 
> one scanToken per tablet, so a single Spark task has to process all the data 
> in that tablet. 
> We think the TabletServer should provide an RPC interface that splits a 
> tablet into multiple primary key ranges by size. The kudu-client can then 
> choose whether to scan those ranges in parallel, depending on the case.
> RPC interface:
> {code:java}
> // A split key range request. Splits a tablet into key ranges; the request
> // doesn't change the layout of the tablet.
> message SplitKeyRangeRequestPB {
>  required bytes tablet_id = 1;
>  // Encoded primary key to begin scanning at (inclusive).
>  optional bytes start_primary_key = 2 [(kudu.REDACT) = true];
>  // Encoded primary key to stop scanning at (exclusive).
>  optional bytes stop_primary_key = 3 [(kudu.REDACT) = true];
>  // Number of bytes to try to return in each chunk. This is a hint.
>  // The tablet server may return chunks larger or smaller than this value.
>  optional uint64 target_chunk_size_bytes = 4;
>  // The columns to consider when chunking.
>  // If specified, then the size estimate used for 'target_chunk_size_bytes'
>  // should only include these columns. This can be used if a query will
>  // only scan a certain subset of the columns.
>  repeated ColumnSchemaPB columns = 5;
> }
> message SplitKeyRangeResponsePB {
>  // The error, if an error occurred with this request.
>  optional TabletServerErrorPB error = 1;
>  // The primary key ranges that the tablet was split into.
>  repeated KeyRangePB ranges = 2;
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
