Thanks for the prompt reply. I think it may be hard for an HBase user to choose a proper number of mappers per region, and a per-region setting could make small regions produce many small map tasks. However, we could simply let the user specify a fixed number of mappers that applies only when the scan range falls within a single region, which may be a common production scenario for large regions (>30 GB). What do you think?
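
To make the idea concrete, here is a minimal sketch of how a scan range inside one region could be cut into a fixed number of segments. It is plain Java with no HBase dependency. The class name ScanRangeSplitter and the method splitPoints are hypothetical, not existing HBase API. It assumes the start and stop rows are both non-empty, and it treats row keys as unsigned big-endian numbers right-padded with zero bytes, which is only an approximation of how the data is really distributed.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: computes boundary keys that cut
// the scan range [startRow, stopRow) into a fixed number of segments.
final class ScanRangeSplitter {

    // Writes an unsigned value into exactly len bytes (big-endian).
    static byte[] toFixedBytes(BigInteger value, int len) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter than len
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }

    // Returns segments + 1 boundary keys; the first is startRow and the last is stopRow.
    // Assumption: both keys are non-empty and startRow sorts before stopRow.
    static List<byte[]> splitPoints(byte[] startRow, byte[] stopRow, int segments) {
        if (segments < 1) {
            throw new IllegalArgumentException("segments must be >= 1");
        }
        // Right-pad both keys with 0x00 to a common length so they compare as numbers.
        int len = Math.max(startRow.length, stopRow.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(startRow, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(stopRow, len));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("startRow must sort before stopRow");
        }
        BigInteger width = hi.subtract(lo);
        BigInteger count = BigInteger.valueOf(segments);

        List<byte[]> points = new ArrayList<>();
        points.add(startRow);
        for (int i = 1; i < segments; i++) {
            // Evenly spaced in key space; a very narrow range can yield duplicate points.
            BigInteger offset = width.multiply(BigInteger.valueOf(i)).divide(count);
            points.add(toFixedBytes(lo.add(offset), len));
        }
        points.add(stopRow);
        return points;
    }
}
```

Each adjacent pair of boundary keys would become the start and stop row of one input split, so splitPoints(start, stop, 4) gives five boundaries and therefore four mappers. The segments are even in key space, not necessarily in data volume, so the mappers may still be unbalanced.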

2017-09-04 17:13 GMT+08:00 Chia-Ping Tsai <[email protected]>:

> That sounds good. There are some related issues; see
> https://issues.apache.org/jira/browse/HBASE-4914 and
> https://issues.apache.org/jira/browse/HBASE-4063.
>
> On 2017-09-04 15:06, libis <[email protected]> wrote:
> > Hi
> >
> > When TableInputFormat is used to source an HBase table in a MapReduce job,
> > its splitter will make a map task for each region of the table. However, in
> > some cases the user's scan range may fall within a single region, resulting
> > in only one mapper. For example, if the rowkey of the table is
> > 'md5(userid) + timestamp' and a client wants to scan the data of a specified
> > user for the latest month with MR, it is very likely that only one mapper
> > will be working.
> >
> > In order to scan data in parallel when the user's scan range is located in a
> > single region, should we split the scan range into several segments within
> > a region?
> >
> > Best,
> >
> > xinxin
