[ 
https://issues.apache.org/jira/browse/HBASE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321639#comment-14321639
 ] 

David Witten commented on HBASE-10216:
--------------------------------------

Replying to Andrew.  Thanks for your comments.  I've included much of your 
message and will reply in-line.

We could propose a new HDFS API that "would merge files so that the merging and 
deleting can be performed on local data nodes with no file contents moving over 
the network", but does this not only push something implemented today in the 
HBase regionserver down into the HDFS datanodes?
[Witten] It is indeed a re-implementation of an existing facility.  The 
proposal involves a callback called on each replica to do the actual merge.  
The implementation of the callback would be implemented by the HBase team.  So 
the policies about how merging is done and what the file format is will still 
be made by the HBase team.  Which files are to be merged is specified by the 
HDFS client and would also be decided within HBase.

Could a merge as described be safely executed in parallel on multiple datanodes 
without coordination? No, because the result is not a 1:1 map of input block to 
output block. 
[Witten] As I think about it each replica can merge the input files to the 
output files without much coordination.  When the merge is finished it needs to 
notify the merge coordinator about the new blocks and files.  The name node 
needs to be notified about the new files and their blocks from the merge 
coordinator.  I agree that there isn't a design for this yet.  But the data 
involved in the coordination will involve block ids and file ids and not block 
content.  So the amount of data moving across the net is substantially reduced. 
 Regarding the 1:1 comment.  It would be required that the merging callback 
produce identical output files for a given set of input files.  This means that 
all replicas will produce exactly the same output.  

Therefore in a realistic implementation (IMHO) a single datanode would handle 
the merge procedure. 
[Witten] Yes,
>From a block device and network perspective nothing would change.
[Witten] The amount of data moving over the net would be massively reduced.

Set the above aside. We can't push something as critical to HBase as compaction 
down into HDFS.
[Witten] HDFS would be providing a generic facility to merge files.  HDFS would 
not provide any policy file formats or merger behavior.  As such the important 
code will be retained by HBase.  If there is a significant performance 
improvement, without loss of reliability, HBase can't not seriously consider it.

First, the HDFS project is unlikely to accept the idea or implement it in the 
first place. Even in the unlikely event that happens,
[Witten] If there is a significant improvement to HBase, and other HDFS clients 
which do merging (Does Parquet or other higher level storage clients?).  I 
would think they'd be eager.

we would need reimplement compaction using the new HDFS facility to take 
advantage of it, yet we will need to support older versions of HDFS without the 
new API for a while,
[Witten]  Actually, we'd need to keep it around forever.  There would be a DFS 
interface that HDFS would implement, but other implementations of DFS may not.  
So HBase would need to keep its current implementation if the new APIs were not 
implemented by some DFS.  It may also be that we'd want to keep the existing 
implementation when the current environment cares less about network IO.

and if the new HDFS API ever doesn't perfectly address the minutiae of HBase 
compaction then or going forward we would be back where we started.
[Witten] I think the objection is handled by the callback nature of the new API.

...
Only then are we really saving on IO.
[Witten] The original proposal saves network IO not only on the initial 
compactions but also on all the other compactions.
...

[Witten] The performance value of the original proposal is not certain (as can 
be seen by repeated use of IF in my comments here).  I think it hinges on the 
question I asked in my last comment post: Is it better to have local reads and 
writes and reduce network overhead, or is it better to limit disk reads by 
having them only occur on one replica (as the current implementation does) and 
thereby reduce disk read overhead.  There may not be a simple answer to this 
question.  It may be that the original proposal is worse on a LAN but when 
replicas are geographically far away reducing network IO is worth the cost of 
extra local disk reads.  I'm still interested in comments about this question.


> Change HBase to support local compactions
> -----------------------------------------
>
>                 Key: HBASE-10216
>                 URL: https://issues.apache.org/jira/browse/HBASE-10216
>             Project: HBase
>          Issue Type: New Feature
>          Components: Compaction
>         Environment: All
>            Reporter: David Witten
>
> As I understand it compactions will read data from DFS and write to DFS.  
> This means that even when the reading occurs on the local host (because 
> region server has a local copy) all the writing must go over the network to 
> the other replicas.  This proposal suggests that HBase would perform much 
> better if all the reading and writing occurred locally and did not go over 
> the network. 
> I propose that the DFS interface be extended to provide method that would 
> merge files so that the merging and deleting can be performed on local data 
> nodes with no file contents moving over the network.  The method would take a 
> list of paths to be merged and deleted and the merged file path and an 
> indication of a file-format-aware class that would be run on each data node 
> to perform the merge.  The merge method provided by this merging class would 
> be passed files open for reading for all the files to be merged and one file 
> open for writing.  The custom class provided merge method would read all the 
> input files and append to the output file using some standard API that would 
> work across all DFS implementations.  The DFS would ensure that the merge had 
> happened properly on all replicas before returning to the caller.  It could 
> be that greater resiliency could be achieved by implementing the deletion as 
> a separate phase that is only done after enough of the replicas had completed 
> the merge. 
> HBase would be changed to use the new merge method for compactions, and would 
> provide an implementation of the merging class that works with HFiles.
> This proposal would require a custom code that understands the file format to 
> be runnable by the data nodes to manage the merge.  So there would need to be 
> a facility to load classes into DFS if there isn't such a facility already.  
> Or, less generally, HDFS could build in support for HFile merging.
> The merge method might be optional.  If the DFS implementation did not 
> provide it a generic version that performed the merge on top of the regular 
> DFS interfaces would be used.
> It may be that this method needs to be tweaked or ignored when the region 
> server does not have a local copy data so that, as happens currently, one 
> copy of the data moves to the region server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to