[ https://issues.apache.org/jira/browse/HBASE-28440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang reopened HBASE-28440:
-------------------------------
Missed in branch-3.
> Add support for using mapreduce sort in HFileOutputFormat2
> ----------------------------------------------------------
>
> Key: HBASE-28440
> URL: https://issues.apache.org/jira/browse/HBASE-28440
> Project: HBase
> Issue Type: Improvement
> Components: backup&restore
> Reporter: Bryan Beaudreault
> Assignee: Hernan Gelaf-Romer
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0-alpha-1, 2.7.0, 3.0.0-beta-2, 2.6.4
>
>
> Currently HFileOutputFormat2 uses CellSortReducer, which attempts to sort all
> of the cells of a row in memory using a TreeSet. There is a warning in the
> javadoc: "If lots of columns per row, it will use lots of memory sorting."
> This can be problematic for WALPlayer, which uses HFileOutputFormat2. A
> reasonably sized row could receive many edits during the time period covered
> by the WALs being replayed, and sorting them in memory would cause an OOM. We
> are seeing this in some cases with incremental backups.
> MapReduce has built-in sorting capabilities which are not limited to sorting
> in memory. It can spill to disk as necessary to sort very large datasets. We
> can get this capability in HFileOutputFormat2 with a couple of changes:
> # Add support for a KeyOnlyCellComparable type as the map output key
> # When configured, use
> job.setSortComparatorClass(CellWritableComparator.class) and
> job.setReducerClass(PreSortedCellsReducer.class)
> # Update WALPlayer to have a mode which can output this new comparable
> instead of ImmutableBytesWritable
> CellWritableComparator exists already for the Import job, so there is some
> prior art.
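
For illustration only, here is a minimal plain-Java sketch of the difference between the two approaches described above. It does not use the HBase or Hadoop APIs. CellKey, compareCells, sortInReducer and emitPreSorted are hypothetical names, the key is simplified to row, qualifier and timestamp, and List.sort stands in for the MapReduce shuffle sort (which, unlike List.sort, can spill to disk).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

final class PreSortedSketch {

    /** Hypothetical, simplified stand-in for a cell key (no family or type). */
    static final class CellKey {
        final String row;
        final String qualifier;
        final long timestamp;

        CellKey(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + qualifier + "/" + timestamp;
        }
    }

    /** Assumed ordering: row, then qualifier, then newest timestamp first. */
    static int compareCells(CellKey a, CellKey b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) {
            return c;
        }
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) {
            return c;
        }
        return Long.compare(b.timestamp, a.timestamp);
    }

    static final Comparator<CellKey> CELL_ORDER = PreSortedSketch::compareCells;

    /** Reducer-side sort: every cell of the row is buffered before any is emitted. */
    static List<CellKey> sortInReducer(Iterator<CellKey> cellsOfOneRow) {
        TreeSet<CellKey> buffer = new TreeSet<>(CELL_ORDER);
        while (cellsOfOneRow.hasNext()) {
            buffer.add(cellsOfOneRow.next());
        }
        return new ArrayList<>(buffer);
    }

    /** Pre-sorted input: cells are passed through, only the previous one is kept for an order check. */
    static List<CellKey> emitPreSorted(Iterator<CellKey> sortedCells) {
        List<CellKey> out = new ArrayList<>(); // stands in for the output writer
        CellKey previous = null;
        while (sortedCells.hasNext()) {
            CellKey current = sortedCells.next();
            if (previous != null && compareCells(previous, current) > 0) {
                throw new IllegalStateException(
                    "input not sorted: " + previous + " before " + current);
            }
            out.add(current);
            previous = current;
        }
        return out;
    }

    public static void main(String[] args) {
        List<CellKey> mapOutput = new ArrayList<>();
        mapOutput.add(new CellKey("row1", "b", 1L));
        mapOutput.add(new CellKey("row1", "a", 2L));
        mapOutput.add(new CellKey("row1", "a", 5L));

        // Current approach: the reducer sorts in memory.
        List<CellKey> viaTreeSet = sortInReducer(mapOutput.iterator());

        // Proposed approach: the sort happens before the reducer sees the data.
        List<CellKey> shuffled = new ArrayList<>(mapOutput);
        shuffled.sort(CELL_ORDER);
        List<CellKey> viaShuffle = emitPreSorted(shuffled.iterator());

        System.out.println(viaTreeSet);
        System.out.println(viaShuffle);
    }
}
```

Both paths emit the cells in the same order. The difference is where the sorting cost sits: sortInReducer holds a whole row in memory, while emitPreSorted relies on the ordering having been established upstream and only checks it.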