For sorting, I think more optimization we can do, I am currently thinking these:
1. Do not sort the whole TablePage, only KeyPage is required as the sort key

2. Should find a more memory efficient sorting algorithm than System.arraycopy 
which requires doubling space.

3. Should try to hold the KeyPage as well as the RowId in a compact data 
structure, it is best if it fits in CPU cache. Modern L3 CPU is larger than 
8MB. For this, I am thinking to have a 8 bytes encoded format that includes 
RowID and SORT_COLUMNS (partial or full), for example, 2 bytes for RowId, 
remaining 6 bytes for 2 to 3 columns after dictionary encoding.
     a) If we can hold the RowID + whole SORT_COLUMNS in 8 bytes, it will be 
most efficient to leverage CPU cache to do sorting, use in-place update 
approach while sorting. So no extra storage is needed, and the RowID + whole 
SORT_COLUMNS will be sorted.
     b) If we can only hold the RowID + partial SORT_COLUMNS in 8 bytes, we can 
employ strategy like the sorting in Spark Tungsten project. (first compare the 
8 bytes in cache, if  it equals then compare remaining bytes in memory)

Regards,
Jacky

> 在 2017年5月22日,上午10:19,David CaiQiang <david.c...@gmail.com> 写道:
> 
> As I known, System.arrayCopy of object array is a shallow copy, so I think
> both KeyPage and TablePage maybe have the same performance on Arrays.sort. 
> 
> 
> -----
> Best Regards
> David Cai
> --
> View this message in context: 
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-Data-loading-improvement-tp11429p13056.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list archive 
> at Nabble.com.

Reply via email to