[ 
https://issues.apache.org/jira/browse/KUDU-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2917:
------------------------------
    Component/s: spark
                 perf

> Split a tablet into primary key ranges by number of row
> -------------------------------------------------------
>
>                 Key: KUDU-2917
>                 URL: https://issues.apache.org/jira/browse/KUDU-2917
>             Project: Kudu
>          Issue Type: Improvement
>          Components: perf, spark
>            Reporter: Xu Yao
>            Assignee: Xu Yao
>            Priority: Major
>              Labels: impala
>
> Since we implemented 
> [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and 
> [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], a Spark job 
> can read the data inside a tablet in parallel. However, we found in practice 
> that splitting key ranges by size can leave a Spark job with long-tail tasks. 
> (Some tasks read more rows than others even though the data size of each 
> KeyRange is roughly the same.)
> I think this issue is caused by columnar encoding and compression. For 
> example, suppose we store 1000 rows in columnar format. If most values in a 
> column are the same, less storage space is required. If the values differ, 
> more storage is needed. So I think splitting the primary key range by the 
> number of rows might be a better choice.
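
The skew described above can be illustrated with a small, self-contained sketch. This is not Kudu's key-range splitting code: the function names, the per-row byte sizes, and the split targets are all hypothetical, chosen only to show how equal-size ranges can hold very different row counts once compression is uneven.

```python
from typing import List


def split_by_size(row_sizes: List[int], target_bytes: int) -> List[int]:
    """Cut a new range each time the accumulated on-disk size reaches
    target_bytes. Returns the number of rows in each range."""
    counts, rows, acc = [], 0, 0
    for size in row_sizes:
        rows += 1
        acc += size
        if acc >= target_bytes:
            counts.append(rows)
            rows, acc = 0, 0
    if rows:
        counts.append(rows)  # trailing partial range
    return counts


def split_by_rows(row_sizes: List[int], target_rows: int) -> List[int]:
    """Cut a new range every target_rows rows, ignoring on-disk size.
    Returns the number of rows in each range."""
    n = len(row_sizes)
    return [min(target_rows, n - start) for start in range(0, n, target_rows)]


# Hypothetical tablet of 1000 rows: the first 900 compress well
# (1 byte each on disk), the last 100 compress poorly (9 bytes each).
row_sizes = [1] * 900 + [9] * 100

by_size = split_by_size(row_sizes, target_bytes=900)
by_rows = split_by_rows(row_sizes, target_rows=500)

print("rows per range, split by size:", by_size)
print("rows per range, split by rows:", by_rows)
```

With these made-up numbers, both size-based ranges hold 900 bytes, yet one task would scan nine times as many rows as the other, while the row-based split gives every task the same number of rows.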



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
