Re: [Discussion] About carbon.si.segment.merge feature

Karan-c980 Fri, 29 Jan 2021 03:06:08 -0800

We can remove merge operation of data files in SI segment, if we avoid small
file creation during SI load itself by following methods.


a) By estimating the SI load size and launch task based on Block size
threshold for SI. For eg:
if blocklsize for SI is 1Gb and SI segment load size is 3GB then launch 3
task
if blocklsize for SI is 1Gb and SI segment load size is 512MB then launch 1
task.

Problem with this method : We can only estimate Uncompressed size for a SI
segment load. For eg: In Uncompressed form SI segment load size 3GB and
blocksize for SI is 1GB. For this scenario we will launch 3 tasks, but it is
possible that after compression this 3GB size reduces to 1GB. So again we
will be having 3 files of 333MB (approx) each. So in this approach we are
launching more tasks than required.
 
b) Hardcode the number of tasks by 1 node 1 task logic. Here we will launch
tasks equal to number of nodes in a cluster.

1. If SI is created with local/global sort and main table is non-partition
table --> This approach will give benefit if number of nodes in cluster are
less. But if number of nodes are more(100 nodes) and data is less(1GB) this
will result in creating small small files. 
2. If SI is created with local/global sort and main table is partition table
--> Data in main table is partitioned over partition column. But data in SI
segment is not partitioned. So there can be many small small carbondata
files present inside main table segment that depends on cardinality of
partition column. So 1 node 1 task logic can give benefit here if number of
nodes are less. But again if number of nodes are greater than or equal to
the cardinality of partition column in main table. It will create many small
files.



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] About carbon.si.segment.merge feature

Reply via email to