[ 
https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635708#comment-13635708
 ] 

Enis Soztutar commented on HBASE-8369:
--------------------------------------

bq. So would this completely bypass security?
Underlying hFiles are owned by the hbase user. For reading the files from MR 
jobs, a couple of options come to mind:
(1) Open the files directly from HDFS, in which case the user has to be in the 
same group and have group permissions to read the files, or the user has to be 
the hbase user. This is similar to current SSR.
(2) Have the HBase servers open the files and pass the file handles to the MR 
job, similar to the approach in HDFS-347. This is obviously more involved and 
requires a live HBase cluster.
(3) Copy the snapshot files as a different user. This will only be applicable 
to exported snapshots. Copying data for in-place snapshots would be costly.
Any other ideas?
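
To make the permission requirement in option (1) concrete, here is a minimal sketch of the owner/group/other read check that applies when a job opens the files directly. It is a simplified model, not HDFS code: the class SnapshotFileAccess and the method canRead are hypothetical names, and ACLs and the HDFS superuser are ignored.

```java
import java.util.Set;

/** Hypothetical helper: simplified POSIX-style read check, ignoring ACLs and superuser. */
final class SnapshotFileAccess {
    private SnapshotFileAccess() {}

    /**
     * Returns true if the user may read a file with the given owner, group and mode.
     * Exactly one permission class is consulted: owner, else group, else other.
     *
     * @param mode POSIX-style permission bits, e.g. 0640
     */
    static boolean canRead(String user, Set<String> userGroups,
                           String owner, String group, int mode) {
        if (user.equals(owner)) {
            return (mode & 0400) != 0;   // owner read bit
        }
        if (userGroups.contains(group)) {
            return (mode & 0040) != 0;   // group read bit
        }
        return (mode & 0004) != 0;       // other read bit
    }
}
```

With files owned by hbase:hbase and mode 0640, canRead returns true for the hbase user and for a user in the hbase group, and false for everyone else. With mode 0600, only the hbase user could run the MR job.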
                
> MapReduce over snapshot files
> -----------------------------
>
>                 Key: HBASE-8369
>                 URL: https://issues.apache.org/jira/browse/HBASE-8369
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, snapshots
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: hbase-8369_v0.patch
>
>
> The idea is to add an InputFormat that can run a MapReduce job over 
> snapshot files directly, bypassing the HBase server layer. The IF is similar in 
> usage to TableInputFormat, taking a Scan object from the user, but instead of 
> running from an online table, it runs from a table snapshot. We do one split 
> per region in the snapshot, and open an HRegion inside the RecordReader. A 
> RegionScanner is used internally for doing the scan without any HRegionServer 
> bits. 
> Users have been asking and searching for ways to run MR jobs by reading 
> directly from hfiles, so this allows new use cases if reading from stale data 
> is ok:
>  - Take snapshots periodically, and run MR jobs only on snapshots.
>  - Export snapshots to a remote HDFS cluster, and run the MR jobs on that 
> cluster without an HBase cluster.
>  - (Future use case) Combine snapshot data with online HBase data: scan from 
> yesterday's snapshot, but read today's data from the online HBase cluster. 
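
The "one split per region" idea in the description can be sketched roughly as follows. This is an illustration only, not the code in the attached patch: SnapshotSplits, Region and splitsFor are hypothetical names, row keys are modeled as Strings instead of byte arrays (so ordering is String ordering, not byte-wise), and skipping regions outside the scan range is an assumption rather than something the description states. An empty key means "unbounded", mirroring HBase's empty start/end row convention.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: choose one split per snapshot region that overlaps a scan range. */
final class SnapshotSplits {
    private SnapshotSplits() {}

    /** A region covering [startKey, endKey); an empty key means unbounded on that side. */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Returns the regions overlapping the scan range [scanStart, scanStop), one split each. */
    static List<Region> splitsFor(List<Region> regions, String scanStart, String scanStop) {
        List<Region> splits = new ArrayList<>();
        for (Region r : regions) {
            // The region must begin before the scan stops ...
            boolean startsBeforeScanStop =
                scanStop.isEmpty() || r.startKey.compareTo(scanStop) < 0;
            // ... and end after the scan begins.
            boolean endsAfterScanStart =
                r.endKey.isEmpty() || r.endKey.compareTo(scanStart) > 0;
            if (startsBeforeScanStop && endsAfterScanStart) {
                splits.add(r);
            }
        }
        return splits;
    }
}
```

In the proposed InputFormat, each selected region would then be opened as an HRegion inside the RecordReader and scanned with a RegionScanner, as described above.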
