[ https://issues.apache.org/jira/browse/HBASE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310953#comment-14310953 ]

David Witten commented on HBASE-10216:
--------------------------------------

I was wondering whether any of you have thoughts on the validity of my initial 
assumption: that reading from disk locally on each replica, and thus incurring 
the overhead of the extra seeks and disk reads, is preferable to having one node 
do the reads and send the data over the network to multiple nodes that do the 
writes.  In other words, is it better to reduce disk seeks and reads, or to 
reduce network bandwidth and overhead?
Doing local reads and writes certainly seems more in line with the philosophy of 
Hadoop.  It also seems true that you would want to conserve network bandwidth 
when working across a wide-area network, but I don't really know the trade-offs 
and wondered what others think.
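
For concreteness, a rough back-of-envelope comparison (my own illustrative 
numbers, not from the proposal): it assumes HDFS's default replication factor of 
3, a compacted output roughly the size of the input, and that the first write 
pipeline replica lands on the local data node.

{code:java}
// Rough I/O accounting for compacting S bytes of HFiles (illustrative only).
public class CompactionIoEstimate {
    public static void main(String[] args) {
        long inputBytes = 10L << 30;   // assume 10 GiB of HFiles to compact
        int replication = 3;           // HDFS default replication factor

        // Today: the region server reads one copy (locally, if a replica is
        // co-located) and writes the compacted file through the HDFS pipeline;
        // one replica lands on the local data node, the others cross the network.
        long currentDiskRead  = inputBytes;
        long currentDiskWrite = (long) replication * inputBytes;
        long currentNetwork   = (long) (replication - 1) * inputBytes;

        // Proposed local compaction: each data node holding a replica merges its
        // own copies, so disk reads triple but almost nothing crosses the network.
        long localDiskRead  = (long) replication * inputBytes;
        long localDiskWrite = (long) replication * inputBytes;
        long localNetwork   = 0L;

        System.out.printf("current: read=%dGiB write=%dGiB network=%dGiB%n",
                currentDiskRead >> 30, currentDiskWrite >> 30, currentNetwork >> 30);
        System.out.printf("local  : read=%dGiB write=%dGiB network=%dGiB%n",
                localDiskRead >> 30, localDiskWrite >> 30, localNetwork >> 30);
    }
}
{code}

Under those assumptions the proposal trades roughly 2x the input size in network 
traffic for roughly 2x the input size in extra disk reads spread across the 
replica nodes, so the answer presumably depends on whether a given cluster is 
disk-bound or network-bound.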

> Change HBase to support local compactions
> -----------------------------------------
>
>                 Key: HBASE-10216
>                 URL: https://issues.apache.org/jira/browse/HBASE-10216
>             Project: HBase
>          Issue Type: New Feature
>          Components: Compaction
>         Environment: All
>            Reporter: David Witten
>
> As I understand it, compactions read data from DFS and write back to DFS.  
> This means that even when the reading occurs on the local host (because the 
> region server has a local copy), all the writing must go over the network to 
> the other replicas.  This proposal suggests that HBase would perform much 
> better if all the reading and writing occurred locally and did not go over 
> the network. 
> I propose that the DFS interface be extended with a method that merges files, 
> so that the merging and deleting can be performed on the local data nodes with 
> no file contents moving over the network.  The method would take the list of 
> paths to be merged and deleted, the path of the merged output file, and an 
> indication of a file-format-aware class to be run on each data node to perform 
> the merge.  The merge method provided by that class would be passed one file 
> open for reading for each file to be merged and one file open for writing; it 
> would read all the input files and append to the output file using some 
> standard API that works across all DFS implementations.  The DFS would ensure 
> that the merge had completed properly on all replicas before returning to the 
> caller.  Greater resiliency could perhaps be achieved by implementing the 
> deletion as a separate phase that runs only after enough of the replicas have 
> completed the merge. 
> HBase would be changed to use the new merge method for compactions, and would 
> provide an implementation of the merging class that works with HFiles.
> This proposal would require custom code that understands the file format to 
> be runnable by the data nodes to manage the merge.  So there would need to be 
> a facility to load classes into DFS if there isn't such a facility already.  
> Or, less generally, HDFS could build in support for HFile merging.
> The merge method might be optional.  If the DFS implementation did not 
> provide it, a generic fallback that performs the merge on top of the regular 
> DFS interfaces would be used.
> It may be that this method needs to be tweaked or bypassed when the region 
> server does not have a local copy of the data, so that, as happens currently, 
> one copy of the data moves to the region server.
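
For reference, here is a minimal sketch of what the merge hook described above 
might look like.  This is purely illustrative: the interface and class names, 
method signatures, and the HFile-aware merger below are assumptions of mine, not 
existing HDFS or HBase APIs.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

// Hypothetical format-aware merge callback that the DFS would run on each
// data node holding replicas of the files being merged.
public interface FileMerger {
    // inputs: one readable stream per file to be merged (the local replica copies).
    // output: the merged file, written locally.
    void merge(List<InputStream> inputs, OutputStream output) throws IOException;
}

// Hypothetical extension to the DFS interface.  The DFS would run the given
// merger class on every data node holding replicas of the input paths, verify
// that the merge completed on all (or enough) replicas, and only then delete
// the input files in a separate phase.
interface LocalMergeCapableFileSystem {
    void mergeLocally(List<String> inputPaths,
                      String mergedPath,
                      Class<? extends FileMerger> mergerClass) throws IOException;
}

// Sketch of the HFile-aware implementation HBase would supply.  A real version
// would merge sorted KeyValues with the HFile reader/writer APIs; this
// placeholder just copies bytes to keep the example self-contained.
class HFileMerger implements FileMerger {
    @Override
    public void merge(List<InputStream> inputs, OutputStream output) throws IOException {
        byte[] buf = new byte[64 * 1024];
        for (InputStream in : inputs) {
            int n;
            while ((n = in.read(buf)) != -1) {
                output.write(buf, 0, n);
            }
        }
    }
}
{code}

If a DFS implementation did not provide something like mergeLocally, HBase could 
fall back to the existing read-and-rewrite compaction path, as the description 
already suggests.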



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
