Hi @Jacky, I feel level 2 merging is also required, as level 1 does not resolve the problem completely. And yes, compaction might solve the issue, but in some use cases users do not compact at all.
@yaojinguo If the table already has many index files, then new loads after the upgrade will generate level 2 files across segments.
@cenyuhai11 We will start developing this feature very soon and it will be delivered in the next carbon version.

Regards,
Ravindra.

On 21 October 2017 at 17:39, 岑玉海 <cenyuha...@163.com> wrote:

> A very good feature! I think case 1 and case 2 can be handled.
> We can merge data files and index files automatically after we insert into
> HDFS.
> In case 1:
> if the data files are not small, there will be 100 data files and 1 index
> file.
> if the data files are small, there will be (dataSize/sizePerFile) data
> files and 1 index file.
>
> When you start to develop this feature, I need this...
>
> Best regards!
> Yuhai Cen
>
> On 21 October 2017 at 13:02, Jacky Li <jacky.li...@qq.com> wrote:
> Hi Ravindra,
>
> I doubt whether the Level 2 merge is required. If the intention is to solve
> the problem of case 2, the user can perform data compaction, so that both
> data and index will be merged using the level 1 merge. So it can avoid both
> small data files and small index files, right?
>
> Regards,
> Jacky Li
>
> > On 20 October 2017 at 9:43 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> >
> > Hi,
> >
> > Problem:
> > The first-time query of carbon becomes very slow. This is because many
> > small carbonindex files are read and cached in the driver at the first
> > query. Many carbonindex files are created in two cases.
> > Case 1: Loading data in a large cluster
> > For example, if the cluster size is 100 nodes, then for each load 100
> > index files are created per segment. So after 100 loads, the number of
> > carbonindex files becomes 10000.
> > Case 2: Frequent loads
> > For example, if a load happens every 5 minutes in a 4 node cluster,
> > there will be more than 10000 index files after 10 days, even in such a
> > small cluster.
> >
> > Reading all these files from the driver is slow since it involves a lot
> > of namenode calls and IO operations.
> >
> > Solution:
> > Merge the carbonindex files in two levels so that we can reduce the IO
> > calls to the namenode and improve the read performance.
> >
> > Level 1: Merge within a segment.
> > Merge the carbonindex files within the segment into a single file
> > immediately after the load completes. It would be named a
> > .carbonindexmerge file. It is not a true data merge but a simple file
> > merge, so the current structure of the carbonindex files does not change.
> > While reading, we just read one file instead of many carbonindex files
> > within the segment.
> >
> > Level 2: Merge across segments.
> > The already merged carbonindex files of each segment would be merged
> > again after a configurable number of segments is reached. These files are
> > placed under the metadata folder of the table, and the information about
> > these merged carbonindex files is updated in the table status file. While
> > reading the carbonindex files, we first check the tablestatus for the
> > availability of the merged file and read using the information available
> > in it.
> > For example, if the configured number of segments to merge across is 100,
> > then for every 100 segments one new merged index file will be created
> > under the metadata folder, and the tablestatus of these 100 segments is
> > updated with the information of this file.
> > This file is not updatable, and it would be removed only if all the
> > segments of this merged index file are removed. This file is also a
> > simple file merge, not an actual data merge.
> > By default this is disabled, and the user can enable it from the carbon
> > properties.
> >
> > There is also an issue with the driver cache for old segments. It is not
> > necessary to cache the old segments if the queries are not interested in
> > them. I will start another discussion for this cache issue.
> >
> > --
> > Thanks & Regards
> > Ravindra


--
Thanks & Regards,
Ravi
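To make the Level 1 idea in the proposal above concrete, here is a minimal sketch of a simple file merge: the small .carbonindex files of one segment are concatenated into a single merge file together with a small header recording each original name and length, so a reader can still recover the individual index blobs. The file layout, class, and method names below are illustrative assumptions, not CarbonData's actual merge format.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexMergeSketch {

  /** Merge all .carbonindex files in a segment folder into one merge file. */
  public static void mergeSegmentIndexes(File segmentDir, File mergeFile) throws IOException {
    File[] indexFiles = segmentDir.listFiles((d, name) -> name.endsWith(".carbonindex"));
    if (indexFiles == null || indexFiles.length == 0) {
      return;
    }
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(mergeFile))) {
      // Header: entry count, then (name, length) per entry, followed by the
      // raw bytes of each original index file in the same order.
      out.writeInt(indexFiles.length);
      for (File f : indexFiles) {
        out.writeUTF(f.getName());
        out.writeLong(f.length());
      }
      for (File f : indexFiles) {
        out.write(Files.readAllBytes(f.toPath()));
      }
    }
  }

  /** Read the merge file back into (original file name -> raw index bytes). */
  public static Map<String, byte[]> readMergedIndexes(File mergeFile) throws IOException {
    Map<String, byte[]> result = new LinkedHashMap<>();
    try (DataInputStream in = new DataInputStream(new FileInputStream(mergeFile))) {
      int count = in.readInt();
      String[] names = new String[count];
      long[] lengths = new long[count];
      for (int i = 0; i < count; i++) {
        names[i] = in.readUTF();
        lengths[i] = in.readLong();
      }
      for (int i = 0; i < count; i++) {
        byte[] content = new byte[(int) lengths[i]];
        in.readFully(content);
        result.put(names[i], content);
      }
    }
    return result;
  }
}
```

Because the original index bytes are copied unchanged, the structure of the carbonindex content itself does not change; only the number of files the driver has to open, and hence the number of namenode calls, goes down.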
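Similarly, a rough sketch of the Level 2 read path described above: for each segment, the driver consults its entry in the table status to see whether a cross-segment merged index file exists under the table's metadata folder and reads that one file, falling back to the per-segment (possibly level-1 merged) index files otherwise. The SegmentStatus class, its fields, and the file extensions checked here are invented for illustration; the real tablestatus layout may differ.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class Level2ReadSketch {

  /** Minimal stand-in for one segment's entry in the table status file. */
  static class SegmentStatus {
    String segmentPath;
    String mergedIndexFileName; // null if this segment is not covered by a level-2 merge
  }

  /** Collect the index files the driver has to read for the given segments. */
  static List<File> indexFilesToRead(File metadataDir, List<SegmentStatus> segments) {
    List<File> files = new ArrayList<>();
    for (SegmentStatus segment : segments) {
      if (segment.mergedIndexFileName != null) {
        // A level-2 merged file covers this segment: one read from the metadata folder.
        File merged = new File(metadataDir, segment.mergedIndexFileName);
        if (!files.contains(merged)) {
          files.add(merged);
        }
      } else {
        // Fall back to the index files stored inside the segment folder itself.
        File[] perSegment = new File(segment.segmentPath).listFiles(
            (d, name) -> name.endsWith(".carbonindex") || name.endsWith(".carbonindexmerge"));
        if (perSegment != null) {
          for (File f : perSegment) {
            files.add(f);
          }
        }
      }
    }
    return files;
  }
}
```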