Hi Team,

I would like to discuss a new compactor implementation that runs major
compactions through a MapReduce job (MapReduce is a good fit for merge-sort
style workloads).

I have a high-level plan and would like to check with you before proceeding
to detailed design and implementation, to learn about any challenges or any
similar solutions you are aware of.

High level plan:

We should have a new compactor implementation which creates the MapReduce job
for running the major compaction and waits for the job to complete in a
thread. A rough sketch of that compactor side follows.
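
A minimal sketch of what that could look like, assuming a plain driver class
that submits the job and blocks in its own thread. MapReduceMajorCompactor and
buildCompactionJob are placeholder names, and the hook into the store engine's
actual Compactor API is left out:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Placeholder class: submits the major compaction job and waits for it in a
// dedicated thread so the calling code is not blocked.
public class MapReduceMajorCompactor {

  public Thread startCompaction(final Configuration conf, final Path cfDir,
      final Path tmpOutputDir) {
    Thread worker = new Thread(() -> {
      try {
        // buildCompactionJob() is a stub here; the input/output wiring is
        // sketched after the numbered list below.
        Job job = buildCompactionJob(conf, cfDir, tmpOutputDir);
        boolean success = job.waitForCompletion(true); // block until the job finishes
        if (success) {
          // commit step: move the new HFile in, archive the old ones (sketched later)
        }
      } catch (Exception e) {
        // the real implementation should surface this to the compaction framework
        e.printStackTrace();
      }
    }, "mr-major-compaction");
    worker.start();
    return worker;
  }

  private Job buildCompactionJob(Configuration conf, Path cfDir, Path tmpOutputDir)
      throws Exception {
    Job job = Job.getInstance(conf, "major-compaction-" + cfDir.getName());
    // input format, mapper, combiner, reducer and output format go here
    return job;
  }
}
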
The MapReduce job implementation is as follows:
1) Since we need to read through all the files in a column family for a major
compaction, we can pass the column family folder as the input to the MapReduce
job. If needed, file filters can be used to skip newly created HFiles.
2) We can identify the partitions or input splits based on HFile boundaries
and use the existing HFileInputFormat to scan each HFile partition, so that
each mapper sorts the data within its partition range.
3) If possible, we can use a combiner to remove old versions or deleted cells.
4) We can use HFileOutputFormat to create the new HFile in a tmp directory,
with the reducer writing the sorted cells coming from the mappers into it
(a sketch of the job wiring is included right after this list).
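
To make the steps above a bit more concrete, here is a rough sketch of the job
wiring. Assumptions: the HFile-based input format referenced above can take the
column family directory as input, the mapper emits row key -> Cell pairs, and
the reducer side reuses CellSortReducer plus HFileOutputFormat2 the way bulk
load does (class names vary by HBase version). CompactionMapper and
VersionPruningCombiner are hypothetical names, and the Cell serialization that
HFileOutputFormat2.configureIncrementalLoad would normally set up is omitted:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.CellSortReducer;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompactionJobBuilder {

  public Job build(Configuration conf, Path cfDir, Path tmpOutputDir) throws Exception {
    Job job = Job.getInstance(conf, "major-compaction-" + cfDir.getName());
    job.setJarByClass(CompactionJobBuilder.class);

    // (1) The whole column family directory is the job input; a PathFilter could
    //     be added to skip newly created HFiles.
    FileInputFormat.addInputPath(job, cfDir);
    // (2) One split per HFile (or per HFile boundary range) would come from the
    //     HFile-based input format, e.g. job.setInputFormatClass(HFileInputFormat.class).

    job.setMapperClass(CompactionMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Cell.class);

    // (3) Combiner to drop old versions / deleted cells before the shuffle.
    job.setCombinerClass(VersionPruningCombiner.class);

    // (4) Reducer writes the sorted cells into new HFiles under the tmp directory,
    //     reusing the classes the bulk-load path already uses.
    job.setReducerClass(CellSortReducer.class);
    job.setOutputFormatClass(HFileOutputFormat2.class);
    FileOutputFormat.setOutputPath(job, tmpOutputDir);
    return job;
  }

  // Hypothetical pass-through mapper; the real one would take whatever key/value
  // types the HFile input format produces and key each cell by its row.
  public static class CompactionMapper
      extends Mapper<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell> {
    @Override
    protected void map(ImmutableBytesWritable row, Cell cell, Context context)
        throws IOException, InterruptedException {
      context.write(row, cell);
    }
  }

  // Hypothetical combiner; the version/TTL/delete-marker pruning logic would go
  // here. This stub just forwards the cells unchanged.
  public static class VersionPruningCombiner
      extends Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell> {
    @Override
    protected void reduce(ImmutableBytesWritable row, Iterable<Cell> cells, Context context)
        throws IOException, InterruptedException {
      for (Cell cell : cells) {
        context.write(row, cell);
      }
    }
  }
}
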

Once the HFile has been created in the tmp directory and the MapReduce job has
completed, we can move the compacted file into the column family location,
move the old files out, and refresh the store's HFile list, the same as the
default implementation does. A rough sketch of this commit step is below.
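
This sketch uses raw FileSystem renames just to illustrate the idea; a real
implementation should go through the store's file management (HRegionFileSystem
and the HFile archiver) rather than renaming paths by hand, and the directory
layout here is simplified:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompactionCommitter {

  // Moves the HFile(s) written by the job from the tmp output directory into the
  // column family directory, then moves the HFiles that were compacted out to an
  // archive directory, mirroring what the default compactor does with its output.
  public void commit(Configuration conf, Path tmpOutputDir, Path cfDir,
      List<Path> compactedInputFiles, Path archiveDir) throws IOException {
    FileSystem fs = cfDir.getFileSystem(conf);

    // Move each new HFile produced by the reducer into the column family directory.
    // (HFileOutputFormat2 actually writes into a per-family subdirectory and adds a
    // _SUCCESS marker, so a real implementation has to account for that layout.)
    for (FileStatus newFile : fs.listStatus(tmpOutputDir)) {
      if (newFile.isFile() && !newFile.getPath().getName().startsWith("_")) {
        fs.rename(newFile.getPath(), new Path(cfDir, newFile.getPath().getName()));
      }
    }

    // Move only the files that were inputs to this compaction out of the way;
    // HFiles flushed while the job was running must stay.
    fs.mkdirs(archiveDir);
    for (Path oldFile : compactedInputFiles) {
      fs.rename(oldFile, new Path(archiveDir, oldFile.getName()));
    }

    // After this, the store's HFile list needs to be refreshed, as in the default
    // implementation.
  }
}
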

There are trade-offs with this solution: intermediate copies of the data are
required while running the MapReduce job, even though the HFiles already
contain sorted data.

Thanks,
Rajeshbabu.
