[ https://issues.apache.org/jira/browse/HBASE-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870334#action_12870334 ]

Todd Lipcon commented on HBASE-2468:
------------------------------------

For TIF (TableInputFormat), I don't see any reason to prefetch meta. Each mapper taking *input* from a table accesses only a small subset of regions, and scanning the whole thing is total overkill.

For TOF (TableOutputFormat), or jobs that simply access HBase but don't use either of our provided output formats, the main benefit of a prewarmed meta cache is to avoid DDoSing the META server. On a 500 node MR cluster, you might well have 4000 mappers start within a period of just a few seconds and overwhelm META pretty fast. This is a real use case for many people - imagine storing a dimension table in HBase and doing a hash join against it from a MR job processing many TB of logs.

So, for the MR case, I think we should provide the option to serialize META to disk, put it in the DistributedCache, and then prewarm the meta cache from there. This would reduce the number of mappers actually hitting META to nearly 0. Doing a scan of META to fetch *all* rows for this table, though, probably makes the problem even worse, especially for jobs that only access a subset of the content.

My opinion is that we should:

a) prefetch ahead a few rows on any META miss, since it will fill the cache up faster and catch split children. Perhaps we can do a benchmark to see whether a 10-row scan is any harder to service than a 2-row scan - my hunch is that the load on the server is mostly dominated by constant time, so we may as well scan ahead a bit.

b) allow the *option* to fetch a row range (which includes the full table range) into the cache. This could be used in startup of long-running processes (e.g. the thrift gateway or stargate may prefer to warm up its cache before it starts accepting any user requests). This should not be the default. I think of this as the equivalent of posix_fadvise(..., POSIX_FADV_WILLNEED). Providing the API as a range will also allow us to do it for multiregion scans, etc.

c) allow the option to serialize meta rows into a SequenceFile and then load them back. This provides the improvement I mentioned above for large MR jobs randomly accessing a cluster.

I see the above 3 things as separate tasks. It sounds like the current patch can do the first of the three, so maybe we should separate that out, get it committed, and then move on to b and c?

> Improvements to prewarm META cache on clients
> ---------------------------------------------
>
>                 Key: HBASE-2468
>                 URL: https://issues.apache.org/jira/browse/HBASE-2468
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: client
>            Reporter: Todd Lipcon
>            Assignee: Mingjie Lai
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2468-trunk.patch
>
>
> A couple different use cases cause storms of reads to META during startup.
> For example, a large MR job will cause each map task to hit meta since it
> starts with an empty cache.
> A couple possible improvements have been proposed:
>  - MR jobs could ship a copy of META for the table in the DistributedCache
>  - Clients could prewarm cache by doing a large scan of all the meta for the
>    table instead of random reads for each miss
>  - Each miss could fetch ahead some number of rows in META

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
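
As a rough illustration of option (a) from the comment above (scan ahead a few rows on any META miss), here is a minimal Java sketch. It deliberately uses no HBase client API: META is modelled as an in-memory TreeMap, and every name in it (MetaPrefetchSketch, Region, locate, metaScans, demo) is hypothetical and exists only for this example. It only counts how many scans reach the stand-in META; it says nothing about the per-scan cost on a real server, which is what the proposed 2-row versus 10-row benchmark would have to measure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch: a client-side region cache that scans ahead on a miss. */
class MetaPrefetchSketch {

    /** One META row: a region's key range [startKey, endKey) and its server. */
    static final class Region {
        final String startKey;
        final String endKey;   // exclusive; null means "end of table"
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }

        boolean contains(String rowKey) {
            return startKey.compareTo(rowKey) <= 0
                && (endKey == null || rowKey.compareTo(endKey) < 0);
        }
    }

    // Stand-in for the META table, keyed by region start key (an assumption).
    private final NavigableMap<String, Region> meta = new TreeMap<>();
    // Client-side location cache, keyed the same way.
    private final NavigableMap<String, Region> cache = new TreeMap<>();
    private final int prefetchRows;
    private int metaScans = 0;

    MetaPrefetchSketch(List<Region> regions, int prefetchRows) {
        for (Region r : regions) {
            meta.put(r.startKey, r);
        }
        this.prefetchRows = Math.max(1, prefetchRows); // always fetch the wanted row
    }

    /** Returns the region holding rowKey, consulting META only on a cache miss. */
    Region locate(String rowKey) {
        Map.Entry<String, Region> cached = cache.floorEntry(rowKey);
        if (cached != null && cached.getValue().contains(rowKey)) {
            return cached.getValue(); // cache hit: META is not touched
        }
        // Cache miss: one scan that starts at the region containing rowKey
        // and reads up to prefetchRows consecutive META rows.
        metaScans++;
        String first = meta.floorKey(rowKey);
        if (first == null) {
            throw new IllegalArgumentException("no region for row " + rowKey);
        }
        int fetched = 0;
        for (Region r : meta.tailMap(first, true).values()) {
            if (fetched++ >= prefetchRows) {
                break;
            }
            cache.put(r.startKey, r);
        }
        return cache.get(first);
    }

    int metaScans() {
        return metaScans;
    }

    /** Builds a five-region table with made-up split points and server names. */
    static MetaPrefetchSketch demo(int prefetchRows) {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("", "d", "rs1"));
        regions.add(new Region("d", "h", "rs2"));
        regions.add(new Region("h", "m", "rs3"));
        regions.add(new Region("m", "r", "rs4"));
        regions.add(new Region("r", null, "rs5"));
        return new MetaPrefetchSketch(regions, prefetchRows);
    }
}
```

With the made-up table in demo(), looking up the rows "b", "e", "k", "p", "z" in that order costs two scans when prefetchRows is 3, and five scans when prefetchRows is 1, because each of those rows falls in a different region. Whether the wider scan is really "free" on the server side is exactly the open question in (a).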