Re: [Discussion] Merging carbonindex files for each segments and across segments

岑玉海 Sat, 21 Oct 2017 05:09:51 -0700

A very good feature! I think case 1 and case 2 can be handled.
We can merge data files and index files automatically after we insert into HDFS.
In case 1:
If the data files are not small, there will be 100 data files and 1 index file.
If the data files are small, there will be (dataSize/sizePerFile) data files
and 1 index file.



When you start to develop this feature, I need this...


Best regards!
Yuhai Cen


On 21 Oct 2017, at 13:02, Jacky Li <[email protected]> wrote:
Hi Ravindra,

I doubt whether the Level 2 merge is required. If the intention is to solve the
problem in case 2, the user can perform data compaction, so that both data and
index files are merged using the Level 1 merge. That avoids both small data
files and small index files, right?

Regards,
Jacky Li

> On 20 Oct 2017, at 9:43 PM, Ravindra Pesala <[email protected]> wrote:
> 
> Hi,
> 
> Problem :
> The first query on a carbon table is very slow, because many small
> carbonindex files are read and cached in the driver on first access.
> Many carbonindex files are created in two cases:
> Case 1: Loading data in a large cluster
>   For example, if the cluster has 100 nodes, then each load creates 100
> index files per segment. So after 100 loads, the number of
> carbonindex files reaches 10000.
> Case 2: Frequent loads
>   For example, if a load happens every 5 minutes in a 4-node cluster,
> there will be more than 10000 index files after 10 days.
> 
> Reading all these files from the driver is slow because it requires many
> namenode calls and IO operations.
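
The two counts quoted above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes exactly one index file per node per load, and the function names are made up for illustration.

```python
# Back-of-the-envelope count of carbonindex files.
# Assumption: every load writes exactly one index file per node.

def index_file_count(nodes: int, loads: int) -> int:
    """Index files accumulated after `loads` loads on `nodes` nodes."""
    return nodes * loads

def loads_in(days: int, interval_minutes: int) -> int:
    """Number of loads when one load starts every `interval_minutes`."""
    return days * 24 * 60 // interval_minutes

# Case 1: 100 nodes, 100 loads.
case1 = index_file_count(nodes=100, loads=100)
# Case 2: 4 nodes, one load every 5 minutes for 10 days.
case2 = index_file_count(nodes=4, loads=loads_in(days=10, interval_minutes=5))

print(case1)  # 10000
print(case2)  # 11520
```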
> 
> Solution :
> Merge the carbonindex files at two levels, so that we reduce the IO
> calls to the namenode and improve read performance.
> 
> Level 1: Merge within a segment.
> Merge the carbonindex files of the segment into a single file immediately
> after the load completes. It would be named as a .carbonindexmerge file. It
> is not a true data merge but a simple file merge, so the current structure
> of the carbonindex files does not change. While reading, we read just one
> file instead of many carbonindex files within the segment.
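
To illustrate what a "simple file merge" could look like, here is a minimal sketch. The container layout (length-prefixed name followed by length-prefixed content), the function names, and the default merged file name are all assumptions made for this example; they are not the actual .carbonindexmerge format. The point is only that each original index file is stored byte for byte, so its internal structure is untouched and one read replaces many.

```python
import struct
from pathlib import Path

def merge_index_files(segment_dir: Path,
                      merged_name: str = "segment.carbonindexmerge") -> Path:
    """Concatenate every *.carbonindex file of a segment into one file.

    Hypothetical layout per entry: 4-byte name length, name bytes,
    8-byte content length, then the unchanged content bytes.
    """
    merged_path = segment_dir / merged_name
    with merged_path.open("wb") as out:
        for index_file in sorted(segment_dir.glob("*.carbonindex")):
            name = index_file.name.encode("utf-8")
            content = index_file.read_bytes()
            out.write(struct.pack(">I", len(name)) + name)
            out.write(struct.pack(">Q", len(content)) + content)
    return merged_path

def read_merged(merged_path: Path) -> dict:
    """Read the merged file back as {original file name: original bytes}."""
    data = merged_path.read_bytes()
    entries, pos = {}, 0
    while pos < len(data):
        (name_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        name = data[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (content_len,) = struct.unpack_from(">Q", data, pos)
        pos += 8
        entries[name] = data[pos:pos + content_len]
        pos += content_len
    return entries
```

With this shape, the reader opens a single file per segment and recovers each original index file's bytes from it.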
> 
> Level 2: Merge across segments.
> The already merged carbonindex files of each segment are merged again
> once a configurable number of segments is reached. These files are placed
> under the metadata folder of the table, and the information about these
> merged carbonindex files is recorded in the table status file. While
> reading the carbonindex files, we first check the tablestatus for the
> availability of the merged file and read using the information in it.
> For example, if the configurable number of segments to merge across is
> 100, then for every 100 segments one new merged index file is created
> under the metadata folder and the tablestatus entries of these 100
> segments are updated with the information of this file.
> This file is not updatable and is removed only when all the segments
> of this merged index file are removed. This file is also a simple file
> merge, not an actual data merge. By default this is disabled, and the user
> can enable it from the carbon properties.
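
The Level 2 bookkeeping can be sketched as follows. This is an illustration under assumptions, not the proposed implementation: the table status is modelled as a plain dict from segment id to merged file name, and the file naming scheme and all function names are hypothetical. The batch size of 100 comes from the example in the mail.

```python
from typing import Dict, Iterable, List, Optional

def assign_merged_files(segment_ids: List[str],
                        batch_size: int = 100) -> Dict[str, Optional[str]]:
    """Map each segment to its cross-segment merged index file, if any.

    Only complete batches of `batch_size` segments are merged; leftover
    segments map to None and keep using their own per-segment file.
    """
    status: Dict[str, Optional[str]] = {seg: None for seg in segment_ids}
    full = len(segment_ids) - len(segment_ids) % batch_size
    for start in range(0, full, batch_size):
        batch = segment_ids[start:start + batch_size]
        merged_name = f"{batch[0]}_{batch[-1]}.carbonindexmerge"  # hypothetical name
        for seg in batch:
            status[seg] = merged_name
    return status

def files_to_read(status: Dict[str, Optional[str]],
                  valid_segments: Iterable[str]) -> List[str]:
    """Distinct index files a query has to open for the given segments."""
    files: List[str] = []
    for seg in valid_segments:
        # Fall back to the per-segment Level 1 file (hypothetical name).
        target = status.get(seg) or f"{seg}.carbonindexmerge"
        if target not in files:
            files.append(target)
    return files

def can_delete(merged_name: str, status: Dict[str, Optional[str]],
               removed_segments: Iterable[str]) -> bool:
    """A merged file may go only once all of its segments are removed."""
    removed = set(removed_segments)
    members = [seg for seg, name in status.items() if name == merged_name]
    return bool(members) and all(seg in removed for seg in members)
```

For 250 segments and a batch size of 100, a full scan under this sketch opens 2 cross-segment files plus 50 per-segment files instead of 250.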
> 
> There is also an issue in the driver cache for old segments. It is not
> necessary to cache old segments if the queries are not interested in
> them. I will start another discussion for this cache issue.
> 
> -- 
> Thanks & Regards
> Ravindra
