Hi @Jacky, I feel level 2 merging is also required, as level 1 does not resolve the problem completely. And yes, compaction might solve the issue, but in some use cases users do not compact at all.
@yaojinguo If the table already has many index files, then new loads after the upgrade will generate level 2 files across segments.
@cenyuhai11 We will start developing this feature very soon and it will be delivered in the next carbon version.

Regards,
Ravindra.

On 21 October 2017 at 17:39, 岑玉海 <cenyuha...@163.com> wrote:

> A very good feature! I think case 1 and case 2 can be handled.
> We can merge data files and index files automatically after we insert into
> HDFS.
> In case 1:
> if the data files are not small, there will be 100 data files and 1 index
> file.
> if the data files are small, there will be (dataSize/sizePerFile) data
> files and 1 index file.
>
> When you start to develop this feature, I need this...
>
> Best regards!
> Yuhai Cen
>
> On 21 October 2017 at 13:02, Jacky Li <jacky.li...@qq.com> wrote:
> Hi Ravindra,
>
> I doubt whether the Level 2 merge is required. If the intention is to solve
> the problem of case 2, the user can perform data compaction, so that both
> data and index will be merged using the level 1 merge. So it can avoid both
> small data files and small index files, right?
>
> Regards,
> Jacky Li
>
> > On 20 October 2017 at 9:43 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> >
> > Hi,
> >
> > Problem:
> > The first-time query of carbon becomes very slow. This is because many
> > small carbonindex files are read and cached in the driver at the first
> > query. Many carbonindex files are created in two cases.
> > Case 1: Loading data in a large cluster
> > For example, if the cluster size is 100 nodes, then for each load 100
> > index files are created per segment. So after 100 loads, the number of
> > carbonindex files becomes 10000.
> > Case 2: Frequent loads
> > For example, if a load happens every 5 minutes in a 4 node cluster,
> > there will be more than 10000 index files after 10 days, even in such a
> > small cluster.
> >
> > Reading all these files from the driver is slow since it involves a lot
> > of namenode calls and IO operations.
> >
> > Solution:
> > Merge the carbonindex files in two levels so that we can reduce the IO
> > calls to the namenode and improve the read performance.
> >
> > Level 1: Merge within a segment.
> > Merge the carbonindex files within the segment into a single file
> > immediately after the load completes. It would be named a
> > .carbonindexmerge file. It is not a true data merge but a simple file
> > merge, so the current structure of the carbonindex files does not change.
> > While reading, we just read one file instead of many carbonindex files
> > within the segment.
> >
> > Level 2: Merge across segments.
> > The already merged carbonindex files of each segment would be merged
> > again after a configurable number of segments is reached. These files are
> > placed under the metadata folder of the table, and the information about
> > these merged carbonindex files is updated in the table status file. While
> > reading the carbonindex files, we first check the tablestatus for the
> > availability of the merged file and read using the information available
> > in it.
> > For example, if the configured number of segments to merge across is 100,
> > then for every 100 segments one new merged index file will be created
> > under the metadata folder, and the tablestatus of these 100 segments is
> > updated with the information of this file.
> > This file is not updatable, and it would be removed only if all the
> > segments of this merged index file are removed. This file is also a
> > simple file merge, not an actual data merge.
> > By default this is disabled, and the user can enable it from the carbon
> > properties.
> >
> > There is also an issue with the driver cache for old segments. It is not
> > necessary to cache the old segments if the queries are not interested in
> > them. I will start another discussion for this cache issue.
> >
> > --
> > Thanks & Regards
> > Ravindra


--
Thanks & Regards,
Ravi
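To make the Level 1 idea in the proposal above concrete, here is a minimal sketch of a simple file merge: the small .carbonindex files of one segment are concatenated into a single merge file together with a small header recording each original name and length, so a reader can still recover the individual index blobs. The file layout, class, and method names below are illustrative assumptions, not CarbonData's actual merge format.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexMergeSketch {

  /** Merge all .carbonindex files in a segment folder into one merge file. */
  public static void mergeSegmentIndexes(File segmentDir, File mergeFile) throws IOException {
    File[] indexFiles = segmentDir.listFiles((d, name) -> name.endsWith(".carbonindex"));
    if (indexFiles == null || indexFiles.length == 0) {
      return;
    }
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(mergeFile))) {
      // Header: entry count, then (name, length) per entry, followed by the
      // raw bytes of each original index file in the same order.
      out.writeInt(indexFiles.length);
      for (File f : indexFiles) {
        out.writeUTF(f.getName());
        out.writeLong(f.length());
      }
      for (File f : indexFiles) {
        out.write(Files.readAllBytes(f.toPath()));
      }
    }
  }

  /** Read the merge file back into (original file name -> raw index bytes). */
  public static Map<String, byte[]> readMergedIndexes(File mergeFile) throws IOException {
    Map<String, byte[]> result = new LinkedHashMap<>();
    try (DataInputStream in = new DataInputStream(new FileInputStream(mergeFile))) {
      int count = in.readInt();
      String[] names = new String[count];
      long[] lengths = new long[count];
      for (int i = 0; i < count; i++) {
        names[i] = in.readUTF();
        lengths[i] = in.readLong();
      }
      for (int i = 0; i < count; i++) {
        byte[] content = new byte[(int) lengths[i]];
        in.readFully(content);
        result.put(names[i], content);
      }
    }
    return result;
  }
}
```

Because the original index bytes are copied unchanged, the structure of the carbonindex content itself does not change; only the number of files the driver has to open, and hence the number of namenode calls, goes down.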
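Similarly, a rough sketch of the Level 2 read path described above: for each segment, the driver consults its entry in the table status to see whether a cross-segment merged index file exists under the table's metadata folder and reads that one file, falling back to the per-segment (possibly level-1 merged) index files otherwise. The SegmentStatus class, its fields, and the file extensions checked here are invented for illustration; the real tablestatus layout may differ.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class Level2ReadSketch {

  /** Minimal stand-in for one segment's entry in the table status file. */
  static class SegmentStatus {
    String segmentPath;
    String mergedIndexFileName; // null if this segment is not covered by a level-2 merge
  }

  /** Collect the index files the driver has to read for the given segments. */
  static List<File> indexFilesToRead(File metadataDir, List<SegmentStatus> segments) {
    List<File> files = new ArrayList<>();
    for (SegmentStatus segment : segments) {
      if (segment.mergedIndexFileName != null) {
        // A level-2 merged file covers this segment: one read from the metadata folder.
        File merged = new File(metadataDir, segment.mergedIndexFileName);
        if (!files.contains(merged)) {
          files.add(merged);
        }
      } else {
        // Fall back to the index files stored inside the segment folder itself.
        File[] perSegment = new File(segment.segmentPath).listFiles(
            (d, name) -> name.endsWith(".carbonindex") || name.endsWith(".carbonindexmerge"));
        if (perSegment != null) {
          for (File f : perSegment) {
            files.add(f);
          }
        }
      }
    }
    return files;
  }
}
```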