[ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848481#comment-13848481 ]
Enis Soztutar commented on HBASE-8369:
--------------------------------------

bq. Maybe Enis Soztutar can mention the logic on why for some of these kinds of things?

Here is the list of high-level changes in the final version (v11) of the patch that differ from Bryan's version (trunk-v3):
 - ClientScanner / AbstractClientScanner / TableRecordReaderImpl changes: the ClientSideRegionScanner keeps track of ScanMetrics, and exports those via MR job counters or Scan.
 - CellUtil changes: these are at a different place in Bryan's patch.
 - PB of MR data
 - HDFSBlocksDistribution: in v3, we provide the 3 servers with the highest locality to the input split. In v11, we use all servers that have at least 80% of the locality of the top-locality server. This ensures better locality.
 - ClientSideRegionScanner / TableSnapshotScanner: not present in v3. ClientSideRegionScanner is an internal class that does the scanning. Both TableSnapshotScanner and TableSnapshotInputFormat use it. TableSnapshotScanner is a client API for scanning snapshots without MR.
 - TableMapReduceUtil changes (other than the new method): needed in case security is enabled. We should not talk to the HBase cluster at all.
 - HRegion changes: the v3 patch sends the parent dir for the region snapshot by assuming that the table dir is the parent dir of the region dir. We do not want to make that assumption in trunk.
 - RestoreSnapshotHelper / ModifyRegionUtils: code organization.
 - Other than these: general tests, integration tests, and performance evaluation tools.

For 0.94, we can do a less intrusive patch that combines some of the changes above (like the RestoreSnapshotHelper changes going into the new classes) and drops others, like the HRegion changes.
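
The difference between the two locality heuristics in the HDFSBlocksDistribution item can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from either patch: the class name LocalitySketch, both method names, and the sample host weights are hypothetical, and the 0.8 cutoff is simply the 80% figure mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and weights are illustrative, not from the patch.
class LocalitySketch {

    // Hosts ordered by block weight, highest first (ties broken by host name).
    static List<Map.Entry<String, Long>> sortedByWeight(Map<String, Long> weightByHost) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(weightByHost.entrySet());
        sorted.sort((a, b) -> {
            int byWeight = Long.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });
        return sorted;
    }

    // v3-style heuristic: always hand the split the n best hosts.
    static List<String> topN(Map<String, Long> weightByHost, int n) {
        List<String> picked = new ArrayList<>();
        for (Map.Entry<String, Long> e : sortedByWeight(weightByHost)) {
            if (picked.size() == n) break;
            picked.add(e.getKey());
        }
        return picked;
    }

    // v11-style heuristic: every host whose weight is at least
    // cutoff * (weight of the best host), best host first.
    static List<String> withinCutoff(Map<String, Long> weightByHost, double cutoff) {
        List<String> picked = new ArrayList<>();
        List<Map.Entry<String, Long>> sorted = sortedByWeight(weightByHost);
        if (sorted.isEmpty()) return picked;
        long top = sorted.get(0).getValue();
        for (Map.Entry<String, Long> e : sorted) {
            if (e.getValue() < cutoff * top) break; // sorted, so the rest are lower
            picked.add(e.getKey());
        }
        return picked;
    }

    public static void main(String[] args) {
        // Made-up weights (e.g. bytes of the region's blocks stored on each host).
        Map<String, Long> weights = new LinkedHashMap<>();
        weights.put("host1", 100L);
        weights.put("host2", 90L);
        weights.put("host3", 50L);
        weights.put("host4", 10L);
        System.out.println("top 3:      " + topN(weights, 3));
        System.out.println("80% cutoff: " + withinCutoff(weights, 0.8));
    }
}
```

With these sample weights, the fixed top-3 rule still advertises host3, which holds only half as much data as the best host, while the cutoff rule drops it. When many hosts are nearly equal, the cutoff rule can return more than three.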

> MapReduce over snapshot files
> -----------------------------
>
>                 Key: HBASE-8369
>                 URL: https://issues.apache.org/jira/browse/HBASE-8369
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, snapshots
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0
>
>         Attachments: HBASE-8369-0.94.patch, HBASE-8369-0.94_v2.patch,
> HBASE-8369-0.94_v3.patch, HBASE-8369-0.94_v4.patch, HBASE-8369-0.94_v5.patch,
> HBASE-8369-trunk_v1.patch, HBASE-8369-trunk_v2.patch,
> HBASE-8369-trunk_v3.patch, hbase-8369_v0.patch, hbase-8369_v11.patch,
> hbase-8369_v5.patch, hbase-8369_v6.patch, hbase-8369_v7.patch,
> hbase-8369_v8.patch, hbase-8369_v9.patch
>
>
> The idea is to add an InputFormat, which can run the mapreduce job over
> snapshot files directly bypassing hbase server layer. The IF is similar in
> usage to TableInputFormat, taking a Scan object from the user, but instead of
> running from an online table, it runs from a table snapshot. We do one split
> per region in the snapshot, and open an HRegion inside the RecordReader. A
> RegionScanner is used internally for doing the scan without any HRegionServer
> bits.
> Users have been asking and searching for ways to run MR jobs by reading
> directly from hfiles, so this allows new use cases if reading from stale data
> is ok:
>  - Take snapshots periodically, and run MR jobs only on snapshots.
>  - Export snapshots to remote hdfs cluster, run the MR jobs at that cluster
> without HBase cluster.
>  - (Future use case) Combine snapshot data with online hbase data: Scan from
> yesterday's snapshot, but read today's data from online hbase cluster.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)