[ 
https://issues.apache.org/jira/browse/KUDU-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900644#comment-16900644
 ] 

Xu Yao commented on KUDU-2917:
------------------------------

Hmm, maybe we can also solve the long-tail problem by recording the original 
size of the data in the CFile. KUDU-2917 can remain a separate feature. :)


> Split a tablet into primary key ranges by number of row
> -------------------------------------------------------
>
>                 Key: KUDU-2917
>                 URL: https://issues.apache.org/jira/browse/KUDU-2917
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Xu Yao
>            Assignee: Xu Yao
>            Priority: Major
>
> Since we implemented 
> [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and 
> [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], Spark jobs 
> can read the data inside a tablet in parallel. In practice, however, we found 
> that splitting key ranges by size can leave Spark with long-tail tasks: some 
> tasks read more data than others even though the data size of each key range 
> is roughly the same.
> I think this is caused by column-wise encoding and compression. For example, 
> suppose we store 1000 rows column-wise. If most of the values in a column are 
> the same, little storage space is required; if the values are mostly 
> different, more storage is needed. So splitting the primary key range by 
> number of rows might be a better choice.
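
The skew described above can be illustrated with a small sketch. This is not 
Kudu's key-range splitting code: the block statistics, the `split` helper and 
the targets are all hypothetical, chosen only to show how equal-size ranges 
can hold very different numbers of rows once encoding and compression are 
taken into account.

```python
# Hypothetical per-block stats: (row_count, stored_bytes).
# Repetitive column values encode/compress well, so the same number of
# rows takes far fewer bytes than rows with mostly distinct values.
blocks = [
    (1000, 10_000),  # highly repetitive values
    (1000, 10_000),
    (1000, 10_000),
    (1000, 10_000),
    (1000, 40_000),  # mostly distinct values
    (1000, 40_000),
]


def split(blocks, weight, target):
    """Greedily group consecutive blocks until a group's weight reaches target."""
    groups, current, total = [], [], 0
    for block in blocks:
        current.append(block)
        total += weight(block)
        if total >= target:
            groups.append(current)
            current, total = [], 0
    if current:  # keep any trailing partial group
        groups.append(current)
    return groups


def rows_per_group(groups):
    """Number of rows a task would have to read for each group."""
    return [sum(rows for rows, _ in group) for group in groups]


by_size = split(blocks, weight=lambda b: b[1], target=40_000)
by_rows = split(blocks, weight=lambda b: b[0], target=2000)

print("rows per task, split by size:", rows_per_group(by_size))
print("rows per task, split by rows:", rows_per_group(by_rows))
```

With these made-up numbers, every size-based range holds 40,000 stored bytes, 
yet one task ends up with four times as many rows as the others. Splitting by 
row count gives every task the same number of rows.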



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
