[ 
https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848481#comment-13848481
 ] 

Enis Soztutar commented on HBASE-8369:
--------------------------------------

bq. Maybe Enis Soztutar can mention the logic on why for some of these kinds of 
things?
This is the list of high-level changes in the final version (v11) of the patch 
that differ from Bryan's version (trunk-v3):
 - ClientScanner / AbstractClientScanner / TableRecordReaderImpl changes: 
ClientSideRegionScanner keeps track of ScanMetrics and exports them via MR 
job counters or Scan. 
 - CellUtil changes : these are at a different place in Bryan's patch. 
 - PB of MR data 
 - HDFSBlocksDistribution: in v3, we provide the 3 servers with the highest 
locality to the input split. In v11, we use all the servers whose locality is 
at least 80% of that of the top-locality server. This ensures better locality. 
 - ClientSideRegionScanner / TableSnapshotScanner: not present in v3. 
ClientSideRegionScanner is an internal class to do the scanning. Both 
TableSnapshotScanner and TableSnapshotInputFormat use it. TableSnapshotScanner 
is a client API, to scan snapshots without MR. 
 - TableMapreduceUtil changes (other than the new method): needed in case 
security is enabled. We should not talk with the HBase cluster at all. 
 - HRegion changes: the v3 patch passes the parent dir for the region snapshot 
by assuming that the table dir is the parent dir of the region dir. We do not 
want to make that assumption in trunk. 
 - RestoreSnapshotHelper / ModifyRegionUtils : code organization 
 - Other than these: general tests, integration tests, and performance 
evaluation tools. 
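
As a rough illustration of the v11 locality rule in the HDFSBlocksDistribution 
item above, here is a minimal Java sketch. It is not the code from the patch: 
the class name, method name, host names and weights are made up for the 
example, and it assumes the per-host locality weights have already been 
computed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalityCutoffSketch {
    // Cutoff from the comment above: keep hosts with at least 80% of the top host's locality.
    static final double CUTOFF = 0.8;

    /**
     * Returns the hosts whose weight is at least CUTOFF times the highest weight,
     * ordered by descending weight. The weights are hypothetical per-host locality
     * figures (for example, bytes of the region's files stored on that host).
     */
    static List<String> bestLocations(Map<String, Long> hostWeights) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(hostWeights.entrySet());
        // Highest locality first.
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        List<String> result = new ArrayList<>();
        if (entries.isEmpty()) {
            return result;
        }
        long top = entries.get(0).getValue();
        for (Map.Entry<String, Long> e : entries) {
            if (e.getValue() >= top * CUTOFF) {
                result.add(e.getKey());
            } else {
                break; // entries are sorted, so every remaining host is below the cutoff
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> weights = new LinkedHashMap<>();
        weights.put("host-a", 1000L);
        weights.put("host-b", 850L);
        weights.put("host-c", 400L);
        weights.put("host-d", 820L);
        System.out.println(bestLocations(weights)); // prints [host-a, host-b, host-d]
    }
}
```

With these made-up weights, a fixed "top 3" rule and the 80% rule happen to 
pick the same hosts. The difference shows up when many hosts are close to the 
top one (all of them are kept) or when only one host has good locality (only 
that one is kept).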

For 0.94, we can do a less intrusive patch which combines some of the changes 
above (like RestoreSnapshotHelper changes going into the new classes), and get 
rid of some of the changes like HRegion changes. 

> MapReduce over snapshot files
> -----------------------------
>
>                 Key: HBASE-8369
>                 URL: https://issues.apache.org/jira/browse/HBASE-8369
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, snapshots
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0
>
>         Attachments: HBASE-8369-0.94.patch, HBASE-8369-0.94_v2.patch, 
> HBASE-8369-0.94_v3.patch, HBASE-8369-0.94_v4.patch, HBASE-8369-0.94_v5.patch, 
> HBASE-8369-trunk_v1.patch, HBASE-8369-trunk_v2.patch, 
> HBASE-8369-trunk_v3.patch, hbase-8369_v0.patch, hbase-8369_v11.patch, 
> hbase-8369_v5.patch, hbase-8369_v6.patch, hbase-8369_v7.patch, 
> hbase-8369_v8.patch, hbase-8369_v9.patch
>
>
> The idea is to add an InputFormat that can run a MapReduce job over 
> snapshot files directly, bypassing the HBase server layer. The IF is similar in 
> usage to TableInputFormat, taking a Scan object from the user, but instead of 
> running from an online table, it runs from a table snapshot. We do one split 
> per region in the snapshot, and open an HRegion inside the RecordReader. A 
> RegionScanner is used internally for doing the scan without any HRegionServer 
> bits. 
> Users have been asking and searching for ways to run MR jobs by reading 
> directly from hfiles, so this allows new use cases if reading from stale data 
> is ok:
>  - Take snapshots periodically, and run MR jobs only on snapshots.
>  - Export snapshots to a remote HDFS cluster and run the MR jobs on that 
> cluster without an HBase cluster.
>  - (Future use case) Combine snapshot data with online hbase data: Scan from 
> yesterday's snapshot, but read today's data from online hbase cluster. 



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
