Hi Nicolas

First, thanks for the information and the pointers; second, it's nice to have
a discussion with the compaction expert :-)

It seems that version 0.90.1, which I'm using, handles compaction management
very differently from the current release or the planned 0.92 release.
The 3 triggers are clear, but according to the code the majorCompaction
indicator is switched back to true because of the number of files. The result
is that major compactions run at times that are far from ideal.
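
To illustrate what I mean, here is a simplified sketch of the behavior as I
read it from the 0.90.x Store code (paraphrased in my own words, class and
method names are mine, not the literal code):

import java.util.List;

public class MajorPromotionSketch {
  /**
   * Paraphrased sketch of the 0.90.x behavior as I understand it (not the
   * literal Store.java code): if the normal selection ends up covering every
   * store file of the store, the compaction is silently promoted to a major
   * one, even when hbase.hregion.majorcompaction is set to 0.
   */
  public static boolean isPromotedToMajor(List<Long> selectedFileSizes,
                                          List<Long> allStoreFileSizes,
                                          boolean timeBasedMajorDue) {
    return timeBasedMajorDue
        || selectedFileSizes.size() == allStoreFileSizes.size();
  }
}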

Let me explain a little bit about the use case and my problem...

The system is an OLTP system with strict latency and throughput
requirements; regions are pre-split and throughput is controlled.

The system has heavy-load periods lasting a few hours; by heavy load I mean a
high proportion of inserts/updates and a small proportion of reads.

With the default parameters for log rolling, log size and compaction
thresholds, the system suffers from performance problems due to the following:

Logs are rolled every 30 secs (given our load), and very soon memstore
flushes occur in order to free the logs. Nevertheless, under this heavy load
a lot of store files are created on the FS and eventually trigger a major
compaction (because the number of files is reached, the third trigger).
Since the volume of the store is large (>1 GB) it takes more or less
1'40'' to handle a single region compaction. In the meantime other store
files are created. As a result we fall into memstore flush throttling
("will wait 90000 ms before flushing the memstore"), retaining more logs and
triggering more flushes that can't be flushed... adding pressure on the
system memory (the memstore is not flushed on time).
Since the cluster is uniformly loaded, when compaction occurs on one RS it
happens on all of them as well, adding network traffic (and saturating the
network).
As a result we have a degradation in performance!
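
For reference, these are the knobs I believe are involved (property names as
I understand the 0.90 defaults; the values below are only illustrative, and
they would normally go into hbase-site.xml rather than code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionTuningSketch {
  public static Configuration tunedConf() {
    Configuration conf = HBaseConfiguration.create();
    // Store files a store may accumulate before flushes are delayed (default 7, I believe).
    conf.setInt("hbase.hstore.blockingStoreFiles", 15);
    // How long a flush is delayed when that limit is hit; this is the
    // "will wait 90000 ms before flushing the memstore" message (default 90000).
    conf.setInt("hbase.hstore.blockingWaitTime", 30000);
    // Number of store files that makes a store a compaction candidate (default 3).
    conf.setInt("hbase.hstore.compactionThreshold", 5);
    // Cap on the number of HLogs kept before memstores are force-flushed (default 32).
    conf.setInt("hbase.regionserver.maxlogs", 64);
    return conf;
  }
}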

Please remember I'm on 0.90.1, so when a major compaction is running, minor
compactions are blocked, and when the memstore for one column family is
flushed, the memstores for all the other column families are flushed as well
(no matter whether they are smaller or not).
As you already wrote, the best way is to manage compactions yourself, and
that is what I tried to do.
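
Concretely, this is roughly what I ended up doing to disable the time-based
majors at the table level through the Admin API (0.90 client API; the table
name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class DisableTimedMajors {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] tableName = Bytes.toBytes("my_table");  // placeholder table name

    // Read the current descriptor, override the major-compaction interval at
    // the table level, then push the change back (the table must be disabled
    // for modifyTable in 0.90).
    HTableDescriptor htd = admin.getTableDescriptor(tableName);
    htd.setValue("hbase.hregion.majorcompaction", "0");

    admin.disableTable(tableName);
    admin.modifyTable(tableName, htd);
    admin.enableTable(tableName);
  }
}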

Regarding the need for compaction pluggability:
Suppose the data you are inserting into different column families has
different patterns. For example, in CF1 (column family #1) you update fields
of the same row key, while in CF2 you add new fields each time, or CF2 gets
new rows and older rows are never updated. Wouldn't you use different
algorithms for compacting these CFs? One would merge the records and clean up
older versions, the other would just append the fields to the record.
What I'm trying to say is that different approaches could be used, allowing a
better fit to the data profile/application profile.
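
Just to illustrate, here is a purely hypothetical sketch of such a
plug-point; none of these types exist in HBase, they only show what a per-CF
compaction strategy could look like:

import java.util.List;

/**
 * Hypothetical plug-point (these types do NOT exist in HBase); it only
 * illustrates what a per-column-family compaction strategy could look like.
 */
public interface CompactionPolicy {
  /** Pick which store files of one CF should be rewritten together. */
  List<String> selectFiles(String columnFamily, List<String> storeFilePaths);
}

/** CF1: rows are updated in place, so merge aggressively and drop old versions. */
class MergeOldVersionsPolicy implements CompactionPolicy {
  public List<String> selectFiles(String columnFamily, List<String> storeFilePaths) {
    return storeFilePaths; // rewrite everything, keeping only the newest versions
  }
}

/** CF2: data is append-only, so leave the older immutable files alone. */
class AppendOnlyPolicy implements CompactionPolicy {
  public List<String> selectFiles(String columnFamily, List<String> storeFilePaths) {
    int n = storeFilePaths.size();
    return storeFilePaths.subList(Math.max(0, n - 3), n); // only the newest few files
  }
}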

Regarding minor vs. major compaction:
Isn't HBASE-3797 all about this? But suppose they work in a similar way
(besides TTL and tombstone handling): a major compaction works on all the
store files, so it can produce a single store file per CF by the end, while a
minor compaction only works on up to a configured upper limit of files (as
far as I understand). So if you have a lot of store files, a major compaction
will take longer to finish (and will block memstore flushes for a longer
period) than a compaction working on only a partial set of the files.
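
For my own understanding, here is a toy version of the selection rule you
described (sum(file[0:i]) * ratio > file[i+1], reverse iteration); it is only
a sketch of my reading of it, not the actual Store code:

import java.util.List;

public class SelectionSketch {
  /**
   * Toy version of the selection rule as Nicolas described it: walking i from
   * high to low, if sum(file[0..i]) * ratio > file[i+1], then files 0..i+1
   * are compacted together, and the first (highest) i that matches wins. If
   * every file ends up included, the compaction is effectively a major one.
   *
   * @param sizes store-file sizes, oldest first
   * @param ratio the compaction ratio (1.2 by default in the newer algorithm, I believe)
   * @return number of files (counted from index 0) selected for compaction
   */
  public static int filesToCompact(List<Long> sizes, double ratio) {
    for (int i = sizes.size() - 2; i >= 0; i--) {
      long sum = 0;
      for (int j = 0; j <= i; j++) {
        sum += sizes.get(j);
      }
      if (sum * ratio > sizes.get(i + 1)) {
        return i + 2; // files 0 .. i+1 inclusive
      }
    }
    return 0; // nothing selected
  }
}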

Regarding the queue size: the behavior is understood, but if you want to
build a utility that major compacts regions one after another (not all at the
same time) you need to know whether a region/the server is currently
compacting or not. So either a new indicator is needed or a new queue metric
is needed (like compactingFileSize).
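
This is the kind of utility I have in mind, against the 0.90 client API (the
table name and the pause are placeholders); because there is no way to tell
when a region has finished compacting, the sketch can only fall back to a
blind pause between regions, which is exactly the limitation I'm describing:

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HServerAddress;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;

public class RollingMajorCompact {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTable table = new HTable(conf, "my_table");  // placeholder table name
    long pauseMs = 2 * 60 * 1000L;                // guess: ~1'40'' per region plus margin

    Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
    for (HRegionInfo region : regions.keySet()) {
      // The request is asynchronous; there is no API to ask "is this region
      // still compacting?", hence the blind sleep between regions.
      admin.majorCompact(region.getRegionName());
      Thread.sleep(pauseMs);
    }
    table.close();
  }
}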

Finally, the schema design is guided by the ACID properties of a row. We have
only 2 CFs; the two CFs hold different volumes of data even though they are
updated with approximately the same amount of data (cells updated vs. cells
created).
Regarding the config params, I'm sure they are not well tuned yet, but they
will not resolve all the problems...

Regards,

Mikael.S

On Mon, Jan 9, 2012 at 9:42 PM, Nicolas Spiegelberg <nspiegelb...@fb.com> wrote:

> Significant compaction JIRAs:
>  - HBASE-2462 : original formulation of current compaction algorithm
>  - HBASE-3209 : implementation
>  - HBASE-1476 : multithreaded compactions
>  - HBASE-3797 : storefile-based compaction selection
>
>
> On 1/9/12 11:37 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
> >Nicolas:
> >Thanks for your insight.
> >
> >Can you point Mikael to a few of the JIRAs where algorithm mentioned in #1
> >was implemented ?
> >
> >On Mon, Jan 9, 2012 at 10:55 AM, Nicolas Spiegelberg
> ><nspiegelb...@fb.com> wrote:
> >
> >> Mikael,
> >>
> >> Hi, I wrote the current compaction algorithm, so I should be able to
> >> answer most questions that you have about the feature.  It sounds like
> >> you're creating quite a task list of work to do, but I don't understand
> >> what your use case is, so a lot of that work may not be critical and
> >>you
> >> can leverage existing functionality.  A better description of your
> >>system
> >> requirements is a must for getting a good solution.
> >>
> >> 1. Major compactions are triggered by 3 methods: user issued, timed, and
> >> size-based.  You are probably hitting size-based compactions where your
> >> config is disabling time-based compactions.  Minor compactions are
> >>issued
> >> on a size-based threshold.  The algorithm sees if sum(file[0:i] *
> >>ratio) >
> >> file[i+1] and includes file[0:i+1] if so.  This is a reverse iteration,
> >>so
> >> the highest 'i' value is used.  If all files match, then you can remove
> >> delete markers [which is the difference between a major and minor
> >> compaction].  Major compactions aren't a bad or time-intensive thing,
> >>it's
> >> just delete marker removal.
> >>
> >> As a note, we use timed majors in an OLTP production environment.  They
> >> are less useful if you're doing bulk imports or have an OLAP environment
> >> where you're either running a read-intensive test or the cluster is
> >>idle.
> >> In that case, it's definitely best to disable compactions and run them
> >> when you're not using the cluster very much.
> >>
> >> 2. See HBASE-4418 for showing all configuration options in the Web UI.
> >> This is in 0.92 however.
> >>
> >> 4. The compaction queue shows compactions that are waiting to happen.
> >>If
> >> you invoke a compaction and the queue is empty, the thread will
> >> immediately pick up your request and the queue will remain empty.
> >>
> >> 8. A patch for pluggable compactions had been thrown up in the past.  It
> >> was not well-tested and the compaction algorithm was undergoing major
> >> design changes at the time that clashed with the patch.  I think it's
> >>been
> >> a low priority because there are many other ways to get big performance
> >> wins from HBase outside of pluggable compactions.  Most people don't
> >> understand how to optimize the current algorithm, which is well-known
> >> (very similar to BigTable's).  I think bigger wins can come from
> >>correctly
> >> laying out a good schema and understanding the config knobs currently at
> >> our disposal.
> >>
> >>
> >>
> >> On 1/8/12 7:25 AM, "Mikael Sitruk" <mikael.sit...@gmail.com> wrote:
> >>
> >> >Hi
> >> >
> >> >
> >> >
> >> >I have some concern regarding major compactions below...
> >> >
> >> >
> >> >   1. According to best practices from the mailing list and from the
> >>book,
> >> >   automatic major compaction should be disabled. This can be done by
> >> >setting
> >> >   the property 'hbase.hregion.majorcompaction' to '0'. Nevertheless
> >>even
> >> >   after having done this I STILL see "major compaction" messages in
> >> >logs.
> >> >   therefore it is unclear how I can manage major compactions. (The
> >> >system has
> >> >   heavy inserts - uniformly on the cluster, and major compactions affect
> >> >the
> >> >   performance of the system).
> >> >   If I'm not wrong it seems from the code that: even if not requested
> >>and
> >> >   even if the indicator is set to '0' (no automatic major compaction),
> >> >major
> >> >   compaction can be triggered by the code in case all store files are
> >> >   candidate for a compaction (from Store.compact(final boolean
> >> >forceMajor)).
> >> >   Shouldn't the code add a condition that automatic major compaction
> >>is
> >> >   disabled??
> >> >
> >> >   2. I tried to check the parameter 'hbase.hregion.majorcompaction'
> >>at
> >> >   runtime using several approaches - to validate that the server
> >>indeed
> >> >   loaded the parameter.
> >> >
> >> >a. Using a connection created from local config
> >> >
> >> >*conn = (HConnection) HConnectionManager.getConnection(m_hbConfig);*
> >> >
> >> >*conn.getConfiguration().get("hbase.hregion.majorcompaction")*
> >> >
> >> >returns the parameter from local config and not from cluster. Is it a
> >>bug?
> >> >If I set the property via the configuration shouldn't all the cluster
> >>be
> >> >aware of? (supposing that the connection indeed connected to the
> >>cluster)
> >> >
> >> >b.  fetching the property from the table descriptor
> >> >
> >> >*HTableDescriptor hTableDescriptor =
> >> >conn.getHTableDescriptor(Bytes.toBytes("my table"));*
> >> >
> >> >*hTableDescriptor.getValue("hbase.hregion.majorcompaction")*
> >> >
> >> >This will return the default parameter value (1 day), not the parameter
> >> >from the configuration (on the cluster). It seems to be a bug, isn't
> >>it?
> >> >(the parameter from the config, should be the default if not set at the
> >> >table level)
> >> >
> >> >c. The only way I could set the parameter to 0 and really see it is via
> >> >the
> >> >Admin API, updating the table descriptor or the column descriptor. Now
> >>I
> >> >could see the parameter on the web UI. So is this the only way to set
> >> >the parameter correctly? If setting the parameter via the configuration
> >> >file, shouldn't the web UI show this on any table created?
> >> >
> >> >d. I also tried to set the parameter via the hbase shell, but setting such
> >> >properties is not supported. (do you plan to add such support via the
> >> >shell?)
> >> >
> >> >e. Generally is it possible to get via API the configuration used by
> >>the
> >> >servers? (at cluster/server level)
> >> >
> >> >    3.  I ran major compaction requests both from the shell and from the
> >>API
> >> >but since both are async there is no progress indication. Neither the
> >>JMX
> >> >nor the web UI will help here since you don't know if a compaction task is
> >> >running. Tailing the logs is not an efficient way to do this either.
> >>The
> >> >point is that I would like to automate the process and avoid compaction
> >> >storm. So I want to do that region by region, but if I don't know when a
> >> >compaction started/ended I can't automate it.
> >> >
> >> >4.       In case there are no compaction files in the queue (but still you
> >>have
> >> >more than 1 storefile per store e.g. minor compaction just finished)
> >>then
> >> >invoking major_compact will indeed decrease the number of store files,
> >>but
> >> >the compaction queue will remain at 0 during the compaction task
> >> >(shouldn't
> >> >the compaction queue increase by the number of file to compact and be
> >> >reduced when the task ended?)
> >> >
> >> >
> >> >5.       I saw already HBASE-3965 for getting status of major
> >>compaction,
> >> >nevertheless it has been removed from 0.92, is it possible to put it
> >>back?
> >> >Even sooner than 0.92?
> >> >
> >> >6.       In case a compaction (major) is running it seems there is no
> >>way
> >> >to stop it. Do you plan to add such a feature?
> >> >
> >> >7.       Do you plan to add functionality via JMX (starting/stopping
> >> >compaction, splitting....)
> >> >
> >> >8.       Finally there were some requests for allowing custom
> >>compaction,
> >> >part of this was given via the RegionObserver in HBASE-2001,
> >>nevertheless
> >> >do you consider adding support for custom compaction (providing a real
> >> >pluggable compaction strategy, not just an observer)?
> >> >
> >> >
> >> >Regards,
> >> >Mikael.S
> >>
> >>
>
>
