[ https://issues.apache.org/jira/browse/KUDU-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900644#comment-16900644 ]
Xu Yao commented on KUDU-2917:
------------------------------

Hmm, maybe we can also solve the long-tail problem by recording the original size of the data in the CFile. KUDU-2917 can then be treated as a separate feature. :)

> Split a tablet into primary key ranges by number of row
> -------------------------------------------------------
>
>                 Key: KUDU-2917
>                 URL: https://issues.apache.org/jira/browse/KUDU-2917
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Xu Yao
>            Assignee: Xu Yao
>            Priority: Major
>
> Since we implemented
> [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and
> [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], a Spark job
> can read the data inside a tablet in parallel. However, we found in
> practice that splitting key ranges by size can produce long-tail Spark
> tasks: some tasks read more data than others even though the data size
> of each KeyRange is basically the same.
> I think this issue is caused by column-wise encoding and compression.
> For example, suppose we store 1000 rows of data in columnar form. If most
> of the values in these columns are the same, less storage space is
> required. If instead the values differ, more storage is needed. So I
> think splitting the primary key range by the number of rows might be a
> good choice.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)