[ 
https://issues.apache.org/jira/browse/HBASE-28440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang reopened HBASE-28440:
-------------------------------

Missed in branch-3.

> Add support for using mapreduce sort in HFileOutputFormat2
> ----------------------------------------------------------
>
>                 Key: HBASE-28440
>                 URL: https://issues.apache.org/jira/browse/HBASE-28440
>             Project: HBase
>          Issue Type: Improvement
>          Components: backup&restore
>            Reporter: Bryan Beaudreault
>            Assignee: Hernan Gelaf-Romer
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0-alpha-1, 2.7.0, 3.0.0-beta-2, 2.6.4
>
>
> Currently HFileOutputFormat2 uses CellSortReducer, which attempts to sort all 
> of the cells of a row in memory using a TreeSet. There is a warning in the 
> javadoc "If lots of columns per row, it will use lots of memory sorting." 
> This can be problematic for WALPlayer, which uses HFileOutputFormat2. You 
> could have a reasonably sized row that simply receives many edits in the time 
> period of the WALs being replayed, and that would cause an OOM. We are seeing 
> this in some cases with incremental backups.
> MapReduce has built-in sorting capabilities which are not limited to sorting 
> in memory. It can spill to disk as necessary to sort very large datasets. We 
> can get this capability in HFileOutputFormat2 with a couple of changes:
>  # Add support for a KeyOnlyCellComparable type as the map output key
>  # When configured, use 
> job.setSortComparatorClass(CellWritableComparator.class) and 
> job.setReducerClass(PreSortedCellsReducer.class)
>  # Update WALPlayer to have a mode which can output this new comparable 
> instead of ImmutableBytesWritable
> CellWritableComparator exists already for the Import job, so there is some 
> prior art. 
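
To make the sort-key idea above concrete, here is a minimal, dependency-free sketch of the ordering a key-only map output key would need to carry. It is not the patch's code: the issue does not spell out how KeyOnlyCellComparable compares, so the class and member names below (KeyOnlySortSketch, CellKey, KEY_ORDER) are hypothetical, and the ordering is an assumption that mirrors HBase's usual cell order of row, family, qualifier, then newest timestamp first (the cell type tie-break is omitted). In a real job the MapReduce framework would apply such a comparator during the shuffle and could spill to disk; the in-memory list sort here only demonstrates the ordering.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class KeyOnlySortSketch {

  /** Hypothetical stand-in for the key part of a cell; it carries no value bytes. */
  static final class CellKey {
    final byte[] row;
    final byte[] family;
    final byte[] qualifier;
    final long timestamp;

    CellKey(String row, String family, String qualifier, long timestamp) {
      this.row = row.getBytes(StandardCharsets.UTF_8);
      this.family = family.getBytes(StandardCharsets.UTF_8);
      this.qualifier = qualifier.getBytes(StandardCharsets.UTF_8);
      this.timestamp = timestamp;
    }

    @Override
    public String toString() {
      return new String(row, StandardCharsets.UTF_8) + "/"
          + new String(family, StandardCharsets.UTF_8) + ":"
          + new String(qualifier, StandardCharsets.UTF_8) + "/" + timestamp;
    }
  }

  /** Lexicographic comparison that treats each byte as unsigned. */
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  /** Assumed order: row, family, qualifier ascending, then timestamp descending. */
  static final Comparator<CellKey> KEY_ORDER = (x, y) -> {
    int c = compareBytes(x.row, y.row);
    if (c != 0) {
      return c;
    }
    c = compareBytes(x.family, y.family);
    if (c != 0) {
      return c;
    }
    c = compareBytes(x.qualifier, y.qualifier);
    if (c != 0) {
      return c;
    }
    return Long.compare(y.timestamp, x.timestamp); // newer edits sort first
  };

  /** Returns a sorted copy; the input list is left untouched. */
  static List<CellKey> sorted(List<CellKey> keys) {
    List<CellKey> copy = new ArrayList<>(keys);
    copy.sort(KEY_ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<CellKey> keys = new ArrayList<>();
    keys.add(new CellKey("row2", "f", "a", 10L));
    keys.add(new CellKey("row1", "f", "b", 5L));
    keys.add(new CellKey("row1", "f", "a", 5L));
    keys.add(new CellKey("row1", "f", "a", 9L));
    for (CellKey k : sorted(keys)) {
      System.out.println(k);
    }
  }
}
```

Because the key already carries everything needed for ordering, a reducer that receives keys in this order can write cells as they arrive instead of collecting a whole row in a TreeSet first.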



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
