Bryan Beaudreault created HBASE-28440:
-----------------------------------------
Summary: Add support for using mapreduce sort in HFileOutputFormat2
Key: HBASE-28440
URL: https://issues.apache.org/jira/browse/HBASE-28440
Project: HBase
Issue Type: Improvement
Components: backup&restore
Reporter: Bryan Beaudreault
Currently HFileOutputFormat2 uses CellSortReducer, which attempts to sort all
of the cells of a row in memory using a TreeSet. There is a warning in the
javadoc: "If lots of columns per row, it will use lots of memory sorting." This
can be problematic for WALPlayer, which uses HFileOutputFormat2. You could have
a reasonably sized row that simply receives many edits during the time period
covered by the WALs being replayed, and that would cause an OOM. We are seeing
this in some cases with incremental backups.
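To illustrate the failure mode, here is a minimal, standalone sketch of per-row sorting in a TreeSet. It is not the CellSortReducer source: SimpleCell, ORDER, and sortRow are hypothetical stand-ins, and the ordering is a simplification of real HBase cell ordering. The point is only that every edit to the row is buffered before anything can be written.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class InMemoryRowSortSketch {
    // Hypothetical, simplified stand-in for an HBase cell.
    public static final class SimpleCell {
        public final String row;
        public final String qualifier;
        public final long timestamp;

        public SimpleCell(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // Assumed ordering: qualifier ascending, then newest timestamp first.
    public static final Comparator<SimpleCell> ORDER = (a, b) -> {
        int c = a.qualifier.compareTo(b.qualifier);
        return c != 0 ? c : Long.compare(b.timestamp, a.timestamp);
    };

    // Buffers every cell of one row in memory before returning the sorted set.
    public static TreeSet<SimpleCell> sortRow(Iterable<SimpleCell> cellsOfOneRow) {
        TreeSet<SimpleCell> sorted = new TreeSet<>(ORDER);
        for (SimpleCell cell : cellsOfOneRow) {
            sorted.add(cell);
        }
        return sorted;
    }

    public static void main(String[] args) {
        // One small row that received many edits within the replayed period.
        List<SimpleCell> edits = new ArrayList<>();
        for (long ts = 1; ts <= 100_000; ts++) {
            edits.add(new SimpleCell("row1", "q", ts));
        }
        // The buffered set holds one entry per edit, not one per column.
        System.out.println("cells held in memory: " + sortRow(edits).size());
    }
}
```

Memory use here grows with the number of edits in the replayed WALs, even though the row itself has a single column.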
MapReduce has built-in sorting capabilities which are not limited to sorting in
memory. It can spill to disk as necessary to sort very large datasets. We can
get this capability in HFileOutputFormat2 with a couple of changes:
# Add support for a KeyOnlyCellComparable type as the map output key
# When configured, use
job.setSortComparatorClass(CellWritableComparator.class) and
job.setReducerClass(PreSortedCellsReducer.class)
# Update WALPlayer to have a mode which can output this new comparable instead
of ImmutableBytesWritable
CellWritableComparator exists already for the Import job, so there is some
prior art.
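The spill-to-disk idea behind the proposal can be sketched without Hadoop. The following is a conceptual model only, not MapReduce's shuffle implementation and not the proposed HBase classes: SpillSortSketch and its methods are hypothetical, keys are plain strings standing in for a full cell key (row plus cell coordinates), and plain lexicographic order stands in for a cell comparator. Keys are assumed not to contain newlines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

public class SpillSortSketch {
    // One open sorted run on disk plus its current head line.
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    // Sorts keys while buffering at most maxInMemory of them at a time,
    // then streams them to the sink in fully sorted order.
    public static void sortWithSpills(Iterable<String> keys, int maxInMemory, Consumer<String> sink) {
        List<Path> spills = new ArrayList<>();
        try {
            try {
                List<String> buffer = new ArrayList<>();
                for (String key : keys) {
                    buffer.add(key);
                    if (buffer.size() >= maxInMemory) {
                        spills.add(spill(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) {
                    spills.add(spill(buffer));
                }
                merge(spills, sink);
            } finally {
                for (Path p : spills) {
                    Files.deleteIfExists(p);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sorts the in-memory buffer and writes it out as one sorted run.
    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path p = Files.createTempFile("spill-", ".txt");
        Files.write(p, buffer);
        return p;
    }

    // K-way merge: only one line per run is held in memory at a time.
    private static void merge(List<Path> spills, Consumer<String> sink) throws IOException {
        PriorityQueue<Run> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try {
            for (Path p : spills) {
                BufferedReader reader = Files.newBufferedReader(p);
                readers.add(reader);
                Run run = new Run(reader);
                if (run.head != null) {
                    heap.add(run);
                }
            }
            while (!heap.isEmpty()) {
                Run run = heap.poll();
                sink.accept(run.head); // a presorted consumer can write this immediately
                if (run.advance()) {
                    heap.add(run);
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("row2/a", "row1/b", "row1/a", "row3/a", "row1/c");
        sortWithSpills(keys, 2, System.out::println);
    }
}
```

Because the sort key covers the whole cell rather than just the row, the consumer receives cells already in final order and can write each one as it arrives, which is the role the proposed PreSortedCellsReducer would play.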