Hi, thank you for starting this discussion. This proposal is for improving data update performance, right?
Regards,
Liang

Linwood wrote
> *[Background]*
> The update operation cleans up delta files before updating (see
> cleanUpDeltaFiles(carbonTable, false)). It loops over the metadata path
> and the segment path many times. When there are too many files, the
> overhead grows and the update takes longer.
>
> *[Motivation & Goal]*
> During the update process, reduce the repeated traversals, or move
> cleanUpDeltaFiles into another method.
>
> *[Modification]*
> There are several possible solutions:
>
> Solution 1:
>
> Some of the file-listing calls in cleanUpDeltaFiles do the same work, e.g.
> updateStatusManager.getUpdateDeltaFilesList(segment, false,
> CarbonCommonConstants.UPDATE_DELTA_FILE_EXT, true, allSegmentFiles, true)
> and updateStatusManager.getUpdateDeltaFilesList(segment, false,
> CarbonCommonConstants.UPDATE_INDEX_FILE_EXT, true, allSegmentFiles, true).
> They differ only in the file extension, yet they traverse the segment
> path twice. We can merge them into a single traversal.
>
> Solution 2:
>
> On top of Solution 1, use Spark or MapReduce to distribute the cleanup
> work to other nodes.
>
> Solution 3:
>
> Submit cleanUpDeltaFiles as a separate task and run it in the early
> morning, or when the cluster is not busy.
>
> Solution 4:
>
> Establish a garbage collection bin that provides interfaces for our
> program to decide when files enter the bin and how they are handled.
>
> Please vote on these solutions.
>
> Best Regards,
> LinWood
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
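To make Solution 1 concrete, here is a minimal sketch of the idea of merging the two extension-filtered traversals into a single pass over the segment's file list. This is not CarbonData's actual API: the class name, the helper method, and the literal extension values below are illustrative assumptions; only the idea of grouping by extension in one scan maps back to the proposal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MergedDeltaScan {
    // Illustrative stand-ins for CarbonCommonConstants.UPDATE_DELTA_FILE_EXT
    // and UPDATE_INDEX_FILE_EXT; the real constant values may differ.
    static final String UPDATE_DELTA_FILE_EXT = ".deletedelta";
    static final String UPDATE_INDEX_FILE_EXT = ".carbonindex";

    // One pass over allSegmentFiles, bucketing files by the extensions we
    // care about, instead of calling getUpdateDeltaFilesList once per type
    // (which traverses the segment path once per call).
    static Map<String, List<String>> groupByExtension(List<String> allSegmentFiles,
                                                      List<String> extensions) {
        return allSegmentFiles.stream()
                .filter(f -> extensions.stream().anyMatch(f::endsWith))
                .collect(Collectors.groupingBy(
                        f -> extensions.stream().filter(f::endsWith).findFirst().get()));
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "part-0-0.deletedelta", "part-0-0.carbonindex", "part-0-0.carbondata");
        Map<String, List<String>> grouped = groupByExtension(
                files, Arrays.asList(UPDATE_DELTA_FILE_EXT, UPDATE_INDEX_FILE_EXT));
        // Delta files and index files now come from the same single scan.
        System.out.println(grouped.get(UPDATE_DELTA_FILE_EXT));
        System.out.println(grouped.get(UPDATE_INDEX_FILE_EXT));
    }
}
```

The caller then reads both buckets from the returned map, so the directory listing cost is paid once regardless of how many file types the cleanup needs.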