Hi,

Problem:
 The first query on a carbon table is very slow, because the driver has to
read many small carbonindex files and cache them on first access.
 Many carbonindex files are created in two cases:
 Case 1: Loading data in a large cluster
   For example, if the cluster has 100 nodes, then each load creates 100
index files per segment. So after 100 loads, the number of carbonindex
files reaches 10000.
 Case 2: Frequent loads
   For example, if a load happens every 5 minutes in a 4-node cluster,
there will be more than 10000 index files after 10 days, even with only
4 nodes.

Reading all these files from the driver is slow because it requires a
large number of namenode calls and IO operations.
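
To make the two cases concrete, here is a small sketch of the arithmetic. It
assumes one carbonindex file per node per load, as in the examples above; the
function and variable names are only for illustration.

```python
def index_file_count(nodes, loads):
    # Assumption from the examples above: every load writes one
    # carbonindex file per node.
    return nodes * loads

MINUTES_PER_DAY = 24 * 60

# Case 1: 100-node cluster, 100 loads.
case1 = index_file_count(nodes=100, loads=100)

# Case 2: 4-node cluster, one load every 5 minutes for 10 days.
loads_in_10_days = 10 * MINUTES_PER_DAY // 5
case2 = index_file_count(nodes=4, loads=loads_in_10_days)

print(case1)  # 10000
print(case2)  # 11520
```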

Solution:
Merge the carbonindex files at two levels, so that we reduce the number of
IO calls to the namenode and improve read performance.

Level 1: Merge within a segment.
Merge the carbonindex files of a segment into a single file immediately
after the load completes. The merged file would be named as a
.carbonindexmerge file. This is not a true data merge but a simple file
merge, so the current structure of the carbonindex files does not change.
While reading, we read just one file per segment instead of many
carbonindex files.
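
A rough sketch of what "simple file merge" could look like: the bytes of each
carbonindex file are copied unchanged into one container file, with the file
name and length written in front of each entry so a reader can split them
again. The container layout, the function names and the merged file name
below are assumptions for illustration only, not the proposed on-disk format.

```python
import os
import struct

def merge_index_files(segment_dir, merged_name="segment.carbonindexmerge"):
    """Copy every *.carbonindex file in segment_dir into one container file.

    Hypothetical layout: entry count, then for each entry the name length,
    the name, the payload length and the untouched payload bytes.
    """
    names = sorted(n for n in os.listdir(segment_dir)
                   if n.endswith(".carbonindex"))
    merged_path = os.path.join(segment_dir, merged_name)
    with open(merged_path, "wb") as out:
        out.write(struct.pack(">I", len(names)))
        for name in names:
            with open(os.path.join(segment_dir, name), "rb") as f:
                payload = f.read()
            encoded = name.encode("utf-8")
            out.write(struct.pack(">I", len(encoded)))
            out.write(encoded)
            out.write(struct.pack(">Q", len(payload)))
            out.write(payload)  # index content is not re-encoded
    return merged_path

def read_merged_index(merged_path):
    """Read the container with one open call; return {file name: bytes}."""
    entries = {}
    with open(merged_path, "rb") as f:
        (count,) = struct.unpack(">I", f.read(4))
        for _ in range(count):
            (name_len,) = struct.unpack(">I", f.read(4))
            name = f.read(name_len).decode("utf-8")
            (size,) = struct.unpack(">Q", f.read(8))
            entries[name] = f.read(size)
    return entries
```

Since each payload is stored as-is, the existing carbonindex parsing code
could in principle be applied to every entry without changes.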

Level 2: Merge across segments.
The already merged carbonindex files of each segment would be merged again
once a configurable number of segments is reached. These files are placed
under the metadata folder of the table, and the information about them is
recorded in the table status file. While reading the carbonindex files, we
first check the tablestatus for the availability of a merged file and read
using the information available in it.
For example, if the configurable number of segments to merge across is 100,
then for every 100 segments one new merged index file is created under the
metadata folder, and the tablestatus entries of these 100 segments are
updated with the information of this file.
This file is not updatable, and it would be removed only when all the
segments of this merged index file are removed. This file is also a simple
file merge, not an actual data merge. By default this level is disabled,
and the user can enable it from the carbon properties.
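
The Level 2 rules can be sketched as below. The table status is modelled as a
plain dict, and the field names ("merged_index_file", "removed"), the paths
and the function names are all hypothetical; the real tablestatus format is
not described here.

```python
def group_segments(segment_ids, merge_threshold):
    """Return only full batches of merge_threshold segments.

    Leftover segments wait until enough new segments arrive.
    """
    full = len(segment_ids) - len(segment_ids) % merge_threshold
    return [segment_ids[i:i + merge_threshold]
            for i in range(0, full, merge_threshold)]

def index_files_to_read(table_status):
    """Pick the index file for each live segment.

    Prefer the cross-segment merged file recorded in the table status,
    otherwise fall back to the segment's own merged file.
    """
    files = []
    for segment_id, entry in table_status.items():
        if entry["removed"]:
            continue
        merged = entry.get("merged_index_file")
        path = merged or "Segment_%s/segment.carbonindexmerge" % segment_id
        if path not in files:  # a shared merged file is read only once
            files.append(path)
    return files

def can_delete_merged_file(table_status, merged_file):
    """A cross-segment merged file goes away only with all its segments."""
    return all(entry["removed"] for entry in table_status.values()
               if entry.get("merged_index_file") == merged_file)
```

With a threshold of 100, 250 segments would give two merged files and 50
segments that still use their own per-segment merged file.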

There is also an issue in the driver cache for old segments. It is not
necessary to cache old segments if the queries are not interested in
them. I will start another discussion for this cache issue.

-- 
Thanks & Regards
Ravindra
