[ https://issues.apache.org/jira/browse/KUDU-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xu Yao updated KUDU-2437:
-------------------------
    Description: 
When reading data from a Kudu table using Spark, if a tablet holds a large amount of data, reading it takes a long time. The reason is that KuduRDD generates one scanToken per tablet, so a single Spark task has to process all the data in that tablet.

We think that TabletServer should provide an RPC interface that can split a tablet into multiple primary key ranges by size. The kudu-client can then choose whether to perform a parallel scan, depending on the case.

RPC interface:
{code:java}
// A split key range request. Splits a tablet into key ranges; the request
// doesn't change the layout of the tablet.
message SplitKeyRangeRequestPB {
  required bytes tablet_id = 1;

  // Encoded primary key to begin scanning at (inclusive).
  optional bytes start_primary_key = 2 [(kudu.REDACT) = true];

  // Encoded primary key to stop scanning at (exclusive).
  optional bytes stop_primary_key = 3 [(kudu.REDACT) = true];

  // Number of bytes to try to return in each chunk. This is a hint.
  // The tablet server may return chunks larger or smaller than this value.
  optional uint64 target_chunk_size_bytes = 4;

  // The columns to consider when chunking.
  // If specified, then the size estimate used for 'target_chunk_size_bytes'
  // should only include these columns. This can be used if a query will
  // only scan a certain subset of the columns.
  repeated ColumnSchemaPB columns = 5;
}

message SplitKeyRangeResponsePB {
  // The error, if an error occurred with this request.
  optional TabletServerErrorPB error = 1;

  repeated KeyRangePB ranges = 2;
}
{code}

  was:
When reading data from a Kudu table using Spark, if a tablet holds a large amount of data, reading it takes a long time. The reason is that KuduRDD generates one scanToken per tablet, so a single Spark task has to process all the data in that tablet.

We think that TabletServer should provide an RPC interface that can split a tablet into multiple primary key ranges by size.
The kudu-client can then choose whether to perform a parallel scan, depending on the case.

RPC interface:


> Split a tablet into primary key ranges by size
> ----------------------------------------------
>
>                 Key: KUDU-2437
>                 URL: https://issues.apache.org/jira/browse/KUDU-2437
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client, tablet
>            Reporter: Xu Yao
>            Assignee: Xu Yao
>            Priority: Major
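
To illustrate the size-based splitting idea in the description, here is a minimal sketch in Python. It is not Kudu code: the `KeyRange` class stands in for `KeyRangePB` (whose fields are not given in this issue, so the field names are assumptions), and `split_key_range` plus the `(first_key, size)` block list are hypothetical. The sketch only shows how `target_chunk_size_bytes` can act as a hint when grouping consecutive key-ordered blocks into ranges.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class KeyRange:
    """Hypothetical stand-in for KeyRangePB; field names are assumptions."""
    start_primary_key: bytes  # inclusive; b"" means "from the tablet start"
    stop_primary_key: bytes   # exclusive; b"" means "to the tablet end"
    size_bytes: int           # estimated size of the data in this range


def split_key_range(
    blocks: Sequence[Tuple[bytes, int]],
    target_chunk_size_bytes: int,
) -> List[KeyRange]:
    """Group consecutive blocks into ranges of roughly the target size.

    `blocks` is a list of (first_encoded_key, size_in_bytes) pairs sorted by
    key; each block is assumed to cover keys up to the next block's first key.
    The target is only a hint, so a range may be larger or smaller than it.
    """
    if target_chunk_size_bytes <= 0:
        raise ValueError("target_chunk_size_bytes must be positive")

    ranges: List[KeyRange] = []
    start = b""      # the first range begins at the tablet start
    accumulated = 0  # bytes collected for the range being built
    for i, (_, size) in enumerate(blocks):
        accumulated += size
        is_last = i == len(blocks) - 1
        if accumulated >= target_chunk_size_bytes and not is_last:
            stop = blocks[i + 1][0]  # cut at the next block's first key
            ranges.append(KeyRange(start, stop, accumulated))
            start, accumulated = stop, 0
    if blocks:
        # The final range is open-ended and holds whatever is left over.
        ranges.append(KeyRange(start, b"", accumulated))
    return ranges


# Made-up block sizes, purely for illustration.
blocks = [(b"a", 40), (b"f", 70), (b"m", 30), (b"s", 90), (b"x", 20)]
ranges = split_key_range(blocks, target_chunk_size_bytes=100)
for r in ranges:
    print(r)
```

A client could then build one scan (and so one Spark task) per returned range instead of one per tablet. Because the layout of the tablet is not changed, the ranges are only a way to partition the read.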