We can remove the merge operation on data files in an SI segment if we avoid creating small files during the SI load itself, using one of the following methods.
a) Estimate the SI load size and launch tasks based on the block size threshold for SI. For example: if the block size for SI is 1 GB and the SI segment load size is 3 GB, launch 3 tasks; if the block size for SI is 1 GB and the SI segment load size is 512 MB, launch 1 task.

Problem with this method: we can only estimate the uncompressed size of an SI segment load. For example, suppose the uncompressed SI segment load size is 3 GB and the block size for SI is 1 GB. In this scenario we will launch 3 tasks, but it is possible that after compression the 3 GB reduces to 1 GB. We then end up with 3 files of about 333 MB each, so this approach launches more tasks than required.

b) Hardcode the number of tasks using 1 node 1 task logic. Here we launch as many tasks as there are nodes in the cluster.

1. If SI is created with local/global sort and the main table is a non-partition table: this approach is beneficial if the number of nodes in the cluster is small. But if the number of nodes is large (100 nodes) and the data is small (1 GB), it will result in many small files.

2. If SI is created with local/global sort and the main table is a partition table: data in the main table is partitioned on the partition column, but data in the SI segment is not partitioned. So the main table segment can contain many small carbondata files, depending on the cardinality of the partition column. The 1 node 1 task logic can be beneficial here if the number of nodes is small. But if the number of nodes is greater than or equal to the cardinality of the partition column in the main table, it will again create many small files.
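To make the arithmetic behind (a) and (b) concrete, here is a small Python sketch. It is only an illustration of the task-count reasoning above, not CarbonData code: the function names (tasks_by_estimated_size, tasks_by_node_count, avg_file_size) are hypothetical, sizes are in bytes using binary units, and the 3:1 compression in the example is just the assumption from the scenario in (a).

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def tasks_by_estimated_size(uncompressed_bytes, si_block_bytes):
    """Method (a): one task per SI block, computed from the UNCOMPRESSED estimate."""
    return max(1, math.ceil(uncompressed_bytes / si_block_bytes))


def tasks_by_node_count(num_nodes):
    """Method (b): 1 node 1 task, so the task count equals the node count."""
    return max(1, num_nodes)


def avg_file_size(compressed_bytes, num_tasks):
    """Average output file size if each task writes one file (an assumption)."""
    return compressed_bytes / num_tasks


# Method (a): 3 GB uncompressed load with a 1 GB SI block size gives 3 tasks,
# and a 512 MB load gives 1 task.
tasks_a = tasks_by_estimated_size(3 * GB, 1 * GB)
tasks_small = tasks_by_estimated_size(512 * MB, 1 * GB)

# If the 3 GB actually compresses to 1 GB, each of the 3 files is roughly
# a third of a GB, well below the 1 GB block size.
size_a_mb = avg_file_size(1 * GB, tasks_a) / MB

# Method (b): 100 nodes and 1 GB of data gives 100 files of about 10 MB each.
tasks_b = tasks_by_node_count(100)
size_b_mb = avg_file_size(1 * GB, tasks_b) / MB

print(tasks_a, tasks_small, round(size_a_mb), tasks_b, round(size_b_mb, 2))
```

In both cases the average file size falls far below the SI block size, which is exactly the small-file problem each method runs into.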