We have a scenario where we use the SummingCombiner to aggregate stats on high-cardinality properties of a streaming dataset. The use case is generating histograms over a certain period, so we age off these stats after a certain time.
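For context, the combiner/age-off stack is configured roughly like the following (the table name, iterator names, priorities and TTL below are illustrative placeholders, not our exact values; properties are set via the shell's `config` command):

```
# Illustrative only: table name, iterator names, priorities and TTL are placeholders.
config -t stats -s table.iterator.majc.sum=10,org.apache.accumulo.core.iterators.user.SummingCombiner
config -t stats -s table.iterator.majc.sum.opt.all=true
config -t stats -s table.iterator.majc.sum.opt.type=STRING
config -t stats -s table.iterator.majc.ageoff=20,org.apache.accumulo.core.iterators.user.AgeOffFilter
config -t stats -s table.iterator.majc.ageoff.opt.ttl=86400000
```

The same iterators are also attached at the scan and minc scopes so reads see combined, non-expired values; the key point is that expired entries are only physically dropped when a major compaction actually rewrites the file they live in.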
We run into some unexpected behaviour where the age-off does not physically happen unless we trigger a manual compaction using EverythingStrategy as opposed to the DefaultStrategy. This is in combination with fairly large split sizes (50-100 GB) to prevent tablets from splitting further.

The default strategy with a majc ratio of 3 and table.file.max=15 seems to result in a scenario where, over time, the tablet servers contain one reasonably large file, e.g. 20 GB (A-*), plus several smaller files (C-*) of 1 to 5 and maybe 8 GB. It takes a very long time before these C-* files sum up to <ratio> x <largest file>, so the 20 GB file is almost never considered for compaction, and over time this hurts query performance because of all the aged-off data that needs to be skipped in a scan. A manual compaction corrects this, but it is only a matter of time before we run into the same problem again.

What is the best approach to let Accumulo handle this automatically? Is this a matter of lowering the ratio so the 20 GB file qualifies sooner, while guarding against continuously running compactions? Or should we write a custom CompactionStrategy?
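To make the ratio math concrete, here is a small standalone sketch of the selection test as we understand it (assumption: the default strategy compacts a candidate set of files only when largest file size × ratio ≤ total size of the set; the class and method names are just illustrative):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RatioCheck {
    // Simplified model of the majc ratio test: a set of files qualifies
    // for compaction when largest * ratio <= sum of all sizes in the set.
    static boolean qualifies(List<Long> sizesGb, double ratio) {
        long largest = Collections.max(sizesGb);
        long total = sizesGb.stream().mapToLong(Long::longValue).sum();
        return largest * ratio <= total;
    }

    public static void main(String[] args) {
        // One 20 GB file plus several 1-8 GB files, as in the scenario above
        // (total 39 GB).
        List<Long> files = Arrays.asList(20L, 8L, 5L, 3L, 2L, 1L);
        System.out.println(qualifies(files, 3.0)); // false: needs >= 60 GB total
        System.out.println(qualifies(files, 1.5)); // true:  needs >= 30 GB total
    }
}
```

So with ratio 3 the small files would have to accumulate to 60 GB before the 20 GB file is ever included, which matches what we observe; lowering table.compaction.major.ratio (e.g. `config -t stats -s table.compaction.major.ratio=1.5`) would let that file qualify much sooner, at the cost of more frequent rewrites.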
