A very good feature! I think case 1 and case 2 can both be handled: we can merge the data files and index files automatically after we insert into HDFS. In case 1, if the data files are not small, there will be 100 data files and 1 index file. If the data files are small, there will be (dataSize/sizePerFile) data files and 1 index file.
When you start developing this feature, I need this...

Best regards!
Yuhai Cen

On 21 Oct 2017, at 13:02, Jacky Li <jacky.li...@qq.com> wrote:

Hi Ravindra,

I doubt whether Level 2 merge is required. If the intention is to solve the problem in case 2, the user can perform data compaction, so that both data and index files are merged using Level 1 merge. That avoids both small data files and small index files, right?

Regards,
Jacky Li

> On 20 Oct 2017, at 9:43 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
>
> Hi,
>
> Problem:
> The first query in carbon is very slow, because many small
> carbonindex files are read and cached in the driver the first time.
> Many carbonindex files are created in two cases:
> Case 1: Loading data in a large cluster
> For example, if the cluster has 100 nodes, then each load creates 100
> index files per segment. So after 100 loads, the number of
> carbonindex files reaches 10000.
> Case 2: Frequent loads
> For example, if a load happens every 5 minutes in a 4-node cluster,
> there will be more than 10000 index files after 10 days, even with only 4 nodes.
>
> Reading all these files from the driver is slow because it takes a lot of
> namenode calls and IO operations.
>
> Solution:
> Merge the carbonindex files in two levels, so that we reduce the IO
> calls to the namenode and improve read performance.
>
> Level 1: Merge within a segment.
> Merge the carbonindex files into a single file immediately after the load
> completes within the segment. It would be named as a .carbonindexmerge file.
> It is not a true data merge but a simple file merge, so the
> current structure of carbonindex files does not change. While reading, we
> read just one file instead of many carbonindex files within the segment.
>
> Level 2: Merge across segments.
> The already merged carbonindex files of each segment are merged again
> once a configurable number of segments is reached.
> These files are placed
> under the metadata folder of the table, and the information about these merged
> carbonindex files is updated in the table status file. While reading
> the carbonindex files, we first check the tablestatus for the availability
> of the merged file and read using the information available in it.
> For example, if the configurable number of segments for merging index files
> is 100, then for every 100 segments one new merged index file is
> created under the metadata folder, and the tablestatus entries of these 100
> segments are updated with the information about this file.
> This file is not updatable, and it is removed only when all the segments
> of this merged index file are removed. This file is also a simple file merge,
> not an actual data merge. By default this is disabled, and the user can
> enable it from the carbon properties.
>
> There is also an issue in the driver cache for old segments. It is not
> necessary to cache the old segments if the queries are not interested in
> them. I will start another discussion for this cache issue.
>
> --
> Thanks & Regards
> Ravindra
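
As a rough check on the numbers in the proposal above, here is a small Python sketch that estimates how many index files the driver has to read on the first query. It is only back-of-the-envelope arithmetic, not CarbonData code: the function name `index_file_count` and its parameters are made up for illustration, and it assumes one segment per load, one carbonindex file per node per load, and (for Level 2) that segments which have not yet reached the merge threshold keep their own Level 1 merged file.

```python
def index_file_count(nodes, loads, merge_level=0, segments_per_merge=100):
    """Estimate the number of index files read on the first query.

    Hypothetical helper. Assumes one segment per load and one
    carbonindex file per node per load.
    """
    if merge_level == 0:
        # No merging: every node writes one index file for every load.
        return nodes * loads
    if merge_level == 1:
        # Level 1: one merged index file per segment.
        return loads
    if merge_level == 2:
        # Level 2: one merged file per full group of segments; leftover
        # segments are assumed to keep their Level 1 merged file.
        full_groups, leftover = divmod(loads, segments_per_merge)
        return full_groups + leftover
    raise ValueError("merge_level must be 0, 1, or 2")


# Case 1: 100 nodes, 100 loads.
case1 = index_file_count(nodes=100, loads=100)

# Case 2: a load every 5 minutes for 10 days on 4 nodes.
loads_in_10_days = 10 * 24 * 60 // 5
case2 = index_file_count(nodes=4, loads=loads_in_10_days)
```

Under these assumptions case 1 gives 10000 files and case 2 gives 11520, which matches the "more than 10000" in the proposal. With Level 1 merge the same workloads would need 100 and 2880 files, and with Level 2 merge at a threshold of 100 segments they would need 1 and 108.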