We use the SummingCombiner to aggregate stats on high-cardinality
properties of a streaming dataset. The use case is generating histograms
over a fixed period, so we age off these stats after a certain time.

We run into some unexpected behaviour: the age-off does not physically
happen unless we trigger a manual compaction using EverythingStrategy
instead of the DefaultStrategy. We combine this with fairly large split
sizes (50-100G) to prevent tablets from splitting further.

The default strategy with a majc ratio of 3 and table.file.max=15 seems to
result in a scenario where the tablets over time end up with one
reasonably large file, e.g. 20G (A-*), and several smaller files (C-*) of
1 to 5, maybe 8 GB. It will take a very long time before these C-* files
sum up to <ratio> x <largest file>, so the 20G file will almost never be
considered for compaction. Over time this hurts query performance because
of all the aged-off data that has to be skipped in a scan.
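
As a rough illustration of the arithmetic, here is a minimal sketch of a
ratio-based selection rule. It assumes the rule is "compact a set of files
when the sum of their sizes is at least ratio x the largest file in the
set, otherwise drop the largest file and try again". It is not Accumulo's
actual code; the class name, method name and file sizes are made up for
the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RatioSelectionSketch {

    // Returns the file sizes (in GB) this simplified rule would compact,
    // or an empty list if no set of files qualifies.
    static List<Double> select(List<Double> sizesGb, double ratio) {
        List<Double> candidates = new ArrayList<>(sizesGb);
        candidates.sort(Collections.reverseOrder()); // largest file first

        double total = 0;
        for (double size : candidates) {
            total += size;
        }

        while (candidates.size() > 1) {
            double largest = candidates.get(0);
            if (total >= ratio * largest) {
                return candidates; // this set qualifies as a whole
            }
            // The largest file is too big relative to the rest: leave it out.
            total -= largest;
            candidates.remove(0);
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        // Hypothetical tablet: one 20G file plus a few smaller ones.
        List<Double> files = List.of(20.0, 5.0, 4.0, 4.0, 3.0);
        System.out.println(select(files, 3.0)); // 20G file is left out
        System.out.println(select(files, 1.5)); // 20G file is included
    }
}
```

Under these assumptions the small files keep getting merged among
themselves at ratio 3, while the 20G file is only picked up once the whole
set reaches 60G. A lower ratio pulls it in sooner, at the cost of rewriting
the large file more often.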

A manual compaction corrects this, but it is only a matter of time before
we run into the same problem again. What is the best approach to let
Accumulo handle this automatically? Is it a matter of lowering the ratio so
the 20G file is reached sooner, while guarding against continuously running
compactions? Or of writing a custom CompactionStrategy?
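
To make the custom-strategy option more concrete, below is one
hypothetical shape for the decision such a strategy could make: force a
full compaction once the oldest file is older than the age-off period. It
is plain Java and deliberately not written against Accumulo's
CompactionStrategy API. The class, the method and the way the file
creation time is obtained are all assumptions for the sake of the example.

```java
import java.time.Duration;
import java.time.Instant;

public class AgeOffTriggerSketch {

    // Hypothetical check: true when the oldest file was written longer ago
    // than the age-off period, so a full compaction would likely drop
    // expired data from it.
    static boolean shouldCompactAll(Instant oldestFileCreated,
                                    Instant now,
                                    Duration ageOffPeriod) {
        Duration fileAge = Duration.between(oldestFileCreated, now);
        return fileAge.compareTo(ageOffPeriod) > 0;
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2024-01-01T00:00:00Z");
        Instant now = Instant.parse("2024-01-31T00:00:00Z");

        // The file is 30 days old in this made-up example.
        System.out.println(shouldCompactAll(created, now, Duration.ofDays(14)));
        System.out.println(shouldCompactAll(created, now, Duration.ofDays(60)));
    }
}
```

A check like this would tie compaction of the large file to the age-off
period rather than to file-size ratios, which is the part the ratio alone
does not seem to capture.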
