Hi all,

Currently in CarbonData, 'local_sort' is the default sort_scope, and by default all dimension columns are selected as sort_columns. This slows down data loading.

To give users the best performance with default values, we can change the default sort_scope to 'no_sort' and stop using all dimensions as sort_columns by default. Also, if sort_columns is specified but sort_scope is not specified by the user, sort_scope needs to be implicitly treated as 'local_sort'.

These default values apply to carbonsession, spark file format and SDK as well (all will have the same behavior).
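To make the proposed defaulting rule concrete, here is a rough sketch in Python. It is not CarbonData code: the function name `resolve_sort_defaults` and its arguments are hypothetical and only illustrate the rule described above. How an explicitly set sort_scope combined with empty sort_columns is handled is not covered here; the sketch simply passes the user's value through.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_sort_defaults(
    sort_scope: Optional[str] = None,
    sort_columns: Optional[Sequence[str]] = None,
) -> Tuple[str, List[str]]:
    """Hypothetical helper: apply the proposed defaults to user options."""
    # Proposed default: no column is a sort column unless the user names it.
    columns = [c.strip() for c in (sort_columns or []) if c.strip()]

    if sort_scope is None:
        # sort_columns given without sort_scope -> implicitly 'local_sort';
        # nothing given at all -> 'no_sort'.
        scope = "local_sort" if columns else "no_sort"
    else:
        # Assumption: an explicit sort_scope is kept as the user wrote it.
        scope = sort_scope
    return scope, columns


print(resolve_sort_defaults())                                # ('no_sort', [])
print(resolve_sort_defaults(sort_columns=["c1", "c2"]))       # ('local_sort', ['c1', 'c2'])
print(resolve_sort_defaults("global_sort", ["c1"]))           # ('global_sort', ['c1'])
```

The same resolution would need to happen in one shared place so that carbonsession, spark file format and SDK all end up with identical behavior.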
With these changes, below are the performance results of TPCH queries on 500GB of data:

* Load time is improved by nearly 4 times.
* Total query time across all queries is improved. (50% of queries are faster with no_sort; the other 50% are slightly degraded or the same. Overall performance is better.)

Also, when I made this change, I found a few major issues in the existing code in the 'no_sort' and empty sort_columns flow. I have fixed those as well. Below are the issues found:

* [CARBONDATA-3162] Range filters don't remove null values for no_sort direct dictionary dimension columns.
* [CARBONDATA-3163] If tables have different time formats, for no_sort columns data goes as bad record (null) for the second table when loaded after the first table.
* [CARBONDATA-3164] During no_sort, an exception that happens at the converter step does not reach the user. The same problem exists in the SDK and spark file format flows.
* Also fixed multiple test case issues.

I have already opened a PR for fixing these issues:
https://github.com/apache/carbondata/pull/2966

Let me know if you have any suggestions about these changes.

Thanks,
Ajantha