[ https://issues.apache.org/jira/browse/KUDU-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke updated KUDU-2917:
------------------------------
    Component/s: spark perf

> Split a tablet into primary key ranges by number of row
> -------------------------------------------------------
>
>                 Key: KUDU-2917
>                 URL: https://issues.apache.org/jira/browse/KUDU-2917
>             Project: Kudu
>          Issue Type: Improvement
>          Components: perf, spark
>            Reporter: Xu Yao
>            Assignee: Xu Yao
>            Priority: Major
>              Labels: impala
>
> Since we implemented
> [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and
> [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], a Spark job
> can read the data inside a tablet in parallel. However, we found in actual
> use that splitting key ranges by size can leave the Spark job with
> long-tail tasks: some tasks end up reading more data than others even
> though the data size of each KeyRange is basically the same.
> I think this issue is caused by column-wise encoding and compression.
> For example, suppose we store 1000 rows of data in columnar form. If most
> of these columns hold the same values, less storage space is required. If
> instead these columns hold different values, more storage is needed. So I
> think splitting the primary key range by the number of rows might be a
> good choice.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
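
The skew described in the issue can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kudu's actual splitting logic: it assumes a hypothetical list of per-row encoded sizes in bytes, and the helper names `split_by_size` and `split_by_rows` are invented for illustration. It compares how many rows land in each chunk when a range is cut by accumulated bytes versus by row count.

```python
from typing import List


def split_by_size(row_sizes: List[int], target_bytes: int) -> List[int]:
    """Return the row count of each chunk when a chunk is closed as soon
    as its accumulated encoded size reaches target_bytes."""
    chunks, rows, acc = [], 0, 0
    for size in row_sizes:
        rows += 1
        acc += size
        if acc >= target_bytes:
            chunks.append(rows)
            rows, acc = 0, 0
    if rows:
        # Leftover rows that did not fill a whole chunk.
        chunks.append(rows)
    return chunks


def split_by_rows(row_sizes: List[int], target_rows: int) -> List[int]:
    """Return the row count of each chunk when a chunk is closed every
    target_rows rows, regardless of encoded size."""
    total = len(row_sizes)
    return [min(target_rows, total - start)
            for start in range(0, total, target_rows)]


# Hypothetical tablet (assumed numbers): 1000 rows that compress well
# (1 byte each after encoding) followed by 1000 rows that compress
# poorly (10 bytes each).
row_sizes = [1] * 1000 + [10] * 1000

by_size = split_by_size(row_sizes, target_bytes=1000)
by_rows = split_by_rows(row_sizes, target_rows=500)

print("rows per chunk, split by size:", by_size)
print("rows per chunk, split by rows:", by_rows)
```

With this made-up data, the size-based split yields one chunk of 1000 rows and ten chunks of 100 rows each, although every chunk holds 1000 bytes. The task that gets the first chunk has to decode ten times as many rows. The row-based split yields four chunks of 500 rows each.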