Hi Andrew and Nick, thanks for your valuable feedback and information.

*Problem statement:*

By default, compactions run with the pressure-aware throughput controller to
limit the load compactions put on the cluster, even under heavy workloads.
Because of this throttling, a compaction can sometimes take a very long time
(hours in some cases) when it is triggered during peak hours or when a minor
compaction is promoted to a major compaction, and this also leads to large
compaction queues.


I don't want to replace the existing compactions with MapReduce-style jobs,
but would like to add this as an external tool. It could be a YARN-based or
Spark-based job; I only mentioned a MapReduce-based job because it is a good
fit for merge-sort kinds of applications (sorry, my bad).

The external tool can read the HFiles from HDFS or any other storage, directly
create the compacted HFiles, and load them into HBase. For this, as @Nick
Dimiduk <[email protected]> mentioned, we might need new APIs: a compaction
context carrying the files to be compacted and the additional metadata needed
to complete the compaction, along with APIs for loading the newly created
compacted HFiles and archiving the files that were compacted.
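
To make this concrete, here is a very rough sketch of the shape such APIs
could take. All of the names below (ExternalCompactionApiSketch,
ExternalCompactionContext, ExternalCompactionCommitter) are made up for
illustration only; nothing like this exists in HBase today, so please treat it
purely as a starting point for discussion:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;

/** Hypothetical sketch only -- illustrates the shape of the proposed APIs. */
public final class ExternalCompactionApiSketch {

  /** Context handed to the external tool: what to compact plus the metadata needed. */
  public interface ExternalCompactionContext {
    TableName getTableName();        // table being compacted
    byte[] getFamily();              // column family being compacted
    List<Path> getFilesToCompact();  // store files selected for this compaction
    long getMaxSequenceId();         // metadata needed when writing the new HFile
  }

  /** Completion API: load the externally written HFiles and archive the old ones. */
  public interface ExternalCompactionCommitter {
    void commitCompaction(ExternalCompactionContext context, List<Path> compactedFiles)
        throws IOException;
  }
}

The idea would be that the region server (or master) builds the context from
the store's selected files, and the external tool calls the committer once the
new HFiles are ready.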

I will spend some time on this and come back with more details and a plan.

Thanks,
Rajeshbabu.



On Thu, Mar 2, 2023, 10:32 AM Nick Dimiduk <[email protected]> wrote:

> Hi Rajeshbabu,
>
> I think that compaction management and execution are important areas for
> experimentation and growth of HBase. I’m more interested in the harness and
> APIs that make an implementation possible than in any specific
> implementation. I’d also like to see consideration for a cluster-wide
> compaction scheduler, something to prioritize allocation of precious IO
> resources.
>
> I agree with Andrew that MapReduce is unlikely to be a popular compute
> runtime for this feature, but I also have no statistics about which
> runtimes are commonly available.
>
> I look forward to seeing how your design proposal develops.
>
> Thanks,
> Nick
>
> On Thu, Mar 2, 2023 at 02:46 Andrew Purtell <[email protected]> wrote:
>
> > Hi Rajeshbabu,
> >
> > You have proposed a solution without describing the problem. Please do
> that
> > first.
> >
> > That said, compaction is fundamental to HBase operation and should have
> no
> > external dependency on a particular compute framework. Especially
> > MapReduce, which is out of favor and deprecated in many places. If this
> is
> > an optional feature it could be fine. So perhaps you could also explain
> how
> > you see this potential feature fitting into the long term roadmap for the
> > project.
> >
> >
> >
> > On Wed, Mar 1, 2023 at 3:54 PM [email protected] <
> > [email protected]> wrote:
> >
> > > Hi Team,
> > >
> > > I would like to discuss a new compactor implementation that runs major
> > > compactions through a MapReduce job (which is a good fit for merge-sort
> > > applications).
> > >
> > > I have a high-level plan and would like to check with you, before
> > > proceeding with the detailed design and implementation, whether there
> > > are any challenges or similar solutions you are aware of.
> > >
> > > High level plan:
> > >
> > > We should have a new compactor implementation which creates the
> > > MapReduce job for running the major compaction and waits in a thread
> > > for the job to complete.
> > > The MapReduce job implementation is as follows:
> > > 1) Since we need to read all the files in a column family for a major
> > > compaction, we can pass the column family directory to the MapReduce
> > > job. File filters might be required so that newly created HFiles are
> > > not picked up.
> > > 2) We can identify the partitions (input splits) based on HFile
> > > boundaries and use the existing HFileInputFormat to scan each HFile
> > > partition, so that each mapper sorts the data within its partition
> > > range.
> > > 3) If possible, we can use a combiner to remove old versions and
> > > deleted cells.
> > > 4) We can use HFileOutputFormat2 to create the new HFile in a tmp
> > > directory, with the reducer writing out the cells it reads, already
> > > sorted, from the mappers.
> > >
> > > Once the HFile is created in the tmp directory and the MapReduce job
> > > has completed, we can move the compacted file into the column family
> > > location, move the old files out, and refresh the store files, the same
> > > as the default implementation.
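> > >
> > > To illustrate the plan, here is a rough driver sketch written against
> > > the hbase-mapreduce classes (HFileInputFormat, CellSortReducer,
> > > HFileOutputFormat2) as I understand them. The ExternalCompactionJob and
> > > CompactionMapper names are made up, and the exact key/value and
> > > serialization wiring would need verification; please treat it as a
> > > sketch, not a working implementation:
> > >
> > > import java.io.IOException;
> > > import java.util.List;
> > >
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.hadoop.hbase.Cell;
> > > import org.apache.hadoop.hbase.CellUtil;
> > > import org.apache.hadoop.hbase.TableName;
> > > import org.apache.hadoop.hbase.client.Connection;
> > > import org.apache.hadoop.hbase.client.ConnectionFactory;
> > > import org.apache.hadoop.hbase.client.RegionLocator;
> > > import org.apache.hadoop.hbase.client.Table;
> > > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > > import org.apache.hadoop.hbase.mapreduce.CellSortReducer;
> > > import org.apache.hadoop.hbase.mapreduce.HFileInputFormat;
> > > import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
> > > import org.apache.hadoop.hbase.util.MapReduceExtendedCell;
> > > import org.apache.hadoop.io.NullWritable;
> > > import org.apache.hadoop.mapreduce.Job;
> > > import org.apache.hadoop.mapreduce.Mapper;
> > > import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> > > import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> > >
> > > /** Sketch of a driver for an external, MapReduce-based major compaction. */
> > > public class ExternalCompactionJob {
> > >
> > >   /** Re-keys each cell by row so the shuffle merge-sorts cells across HFiles. */
> > >   public static class CompactionMapper
> > >       extends Mapper<NullWritable, Cell, ImmutableBytesWritable, Cell> {
> > >     @Override
> > >     protected void map(NullWritable key, Cell cell, Context context)
> > >         throws IOException, InterruptedException {
> > >       context.write(new ImmutableBytesWritable(CellUtil.cloneRow(cell)),
> > >           new MapReduceExtendedCell(cell));
> > >     }
> > >   }
> > >
> > >   public static Job createJob(Configuration conf, TableName table,
> > >       List<Path> filesToCompact, Path tmpOutputDir) throws IOException {
> > >     Job job = Job.getInstance(conf, "external-major-compaction-" + table);
> > >     job.setJarByClass(ExternalCompactionJob.class);
> > >
> > >     // One mapper per HFile: HFileInputFormat does not split files, so
> > >     // each mapper streams one already-sorted HFile.
> > >     job.setInputFormatClass(HFileInputFormat.class);
> > >     for (Path hfile : filesToCompact) {
> > >       FileInputFormat.addInputPath(job, hfile);
> > >     }
> > >
> > >     job.setMapperClass(CompactionMapper.class);
> > >     job.setMapOutputKeyClass(ImmutableBytesWritable.class);
> > >     job.setMapOutputValueClass(MapReduceExtendedCell.class);
> > >
> > >     // configureIncrementalLoad wires the TotalOrderPartitioner,
> > >     // CellSortReducer and HFileOutputFormat2 so the compacted HFiles
> > >     // written to the tmp directory line up with region boundaries.
> > >     try (Connection conn = ConnectionFactory.createConnection(conf);
> > >         Table htable = conn.getTable(table);
> > >         RegionLocator locator = conn.getRegionLocator(table)) {
> > >       HFileOutputFormat2.configureIncrementalLoad(job, htable, locator);
> > >     }
> > >     FileOutputFormat.setOutputPath(job, tmpOutputDir);
> > >     return job;
> > >   }
> > > }
> > >
> > > The compactor thread would then simply call job.waitForCompletion(true)
> > > and, on success, swap in the HFiles from the tmp directory.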
> > >
> > > A trade-off of this solution is that intermediate copies of the data
> > > are required while running the MapReduce job, even though the HFiles
> > > already contain sorted data.
> > >
> > > Thanks,
> > > Rajeshbabu.
> > >
> >
> >
> > --
> > Best regards,
> > Andrew
> >
> > Unrest, ignorance distilled, nihilistic imbeciles -
> >     It's what we’ve earned
> > Welcome, apocalypse, what’s taken you so long?
> > Bring us the fitting end that we’ve been counting on
> >    - A23, Welcome, Apocalypse
> >
>
