Hi Team,

I would like to discuss a new compactor implementation that runs major compactions through a MapReduce job (MapReduce is a good fit for merge-sort style workloads).
I have a high-level plan and would like to check with you before proceeding with the detailed design and implementation, to learn of any challenges or any similar solutions you are aware of.

High-level plan: we add a new compactor implementation that creates a MapReduce job to run the major compaction and waits in a thread for the job to complete. The MapReduce job would work as follows:

1) Since a major compaction needs to read through all the files in a column family, we can pass the column family directory to the MapReduce job as input. File filters might be required to exclude newly created HFiles.

2) We can derive the partitions (input splits) from the HFile boundaries and use the existing HFileInputFormat to scan each HFile partition, so that each mapper sorts the data within its partition's key range.

3) If possible, we can use a combiner to remove old versions and deleted cells.

4) In the reducer, we can use HFileOutputFormat to create a new HFile in a tmp directory and write the sorted cells coming from the mappers into it.

Once the HFile has been created in the tmp directory and the MapReduce job has completed, we move the compacted file to the column family location, move the old files out, and refresh the HFiles, the same as the default implementation does.

One tradeoff with this solution is that intermediate copies of the data are required while the MapReduce job runs, even though the HFiles already contain sorted data.

Thanks,
Rajeshbabu.
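P.S. To make the input-split idea (step 2 above) concrete, here is a rough sketch in plain Java, with no Hadoop dependencies. The SplitPlanner/splitBoundaries names are mine for illustration, not HBase or Hadoop API: given the (first key, last key) range of each HFile in the column family, we collect all boundary keys and pair adjacent keys into non-overlapping split ranges, so each mapper sorts cells within exactly one partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only: derive non-overlapping input-split key ranges
// from the first/last keys of the HFiles in a column family. A real
// implementation would work with HBase byte[] keys and InputSplit objects.
public class SplitPlanner {

    // Each element of hfileRanges is {firstKey, lastKey} of one HFile.
    // Returns adjacent-boundary pairs as {startKey, endKey} split ranges.
    public static List<String[]> splitBoundaries(List<String[]> hfileRanges) {
        // Collect every distinct boundary key in sorted order.
        TreeSet<String> keys = new TreeSet<>();
        for (String[] r : hfileRanges) {
            keys.add(r[0]);  // first key of the HFile
            keys.add(r[1]);  // last key of the HFile
        }
        // Pair adjacent keys: overlapping HFiles end up sharing split ranges,
        // which is what lets each mapper merge-sort its own key range.
        List<String[]> splits = new ArrayList<>();
        String prev = null;
        for (String k : keys) {
            if (prev != null) {
                splits.add(new String[] { prev, k });
            }
            prev = k;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Two overlapping HFiles: [a, m] and [f, z].
        List<String[]> ranges = Arrays.asList(
            new String[] { "a", "m" },
            new String[] { "f", "z" });
        for (String[] s : splitBoundaries(ranges)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

With the two overlapping files above this yields three split ranges (a..f, f..m, m..z), so every mapper reads from each HFile that overlaps its range and the reducer receives already-partitioned, sorted data.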
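And for the combiner idea (step 3 above), a rough sketch of the pruning rule it would apply, again in plain Java with no HBase dependencies. The VersionPruner name and the simplified {timestamp, isDelete} cell representation are mine for illustration; real HBase cells carry full row/family/qualifier/type information, and delete-marker semantics are more nuanced than shown here:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: given the cells for one row/qualifier sorted
// newest-first, keep at most maxVersions puts and drop any put masked by
// a delete marker (a delete at timestamp T masks puts with timestamp <= T).
public class VersionPruner {

    // Each cell is {timestamp, isDelete} where isDelete is 0 or 1.
    // Input must be sorted by timestamp descending.
    public static List<long[]> prune(List<long[]> cells, int maxVersions) {
        List<long[]> out = new ArrayList<>();
        int kept = 0;
        long deleteBelow = Long.MIN_VALUE;  // puts at or below this ts are masked
        for (long[] c : cells) {
            if (c[1] == 1) {
                // Delete marker: remember it, and drop the marker itself
                // (a major compaction can discard delete markers entirely).
                deleteBelow = Math.max(deleteBelow, c[0]);
                continue;
            }
            if (c[0] <= deleteBelow) {
                continue;  // put masked by a delete marker
            }
            if (kept < maxVersions) {
                out.add(c);
                kept++;
            }
        }
        return out;
    }
}
```

One caveat worth discussing: a combiner only sees the cells of a single map task, so it can safely drop excess versions within one split but must be conservative about delete markers whose masked puts might live in another HFile; the final say would stay with the reducer.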