Re: Cross-data centre DFS communication?

2008-02-28 Thread Owen O'Malley


On Feb 28, 2008, at 2:43 AM, Miles Osborne wrote:


Currently, we have the following setup:

--cluster A, running Nutch: small RAM per node

--cluster B, just running Hadoop:  lots of RAM per node

At some point in the future we will want cluster B to talk to
cluster A, and ideally this should be DFS-to-DFS.

Is this possible?  Or do we need to do something like:

Cluster A -- Unix filesystem -- Cluster B

via hadoop dfs -cat / -put operations, etc.?
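
Spelled out, that workaround would presumably be something like the
following for each file (hostnames and paths invented, and I'm assuming
the dfs shell accepts fully-qualified hdfs:// paths):

# stage each file through local disk: cat from cluster A, put to cluster B
hadoop dfs -cat hdfs://namenodeA:8020/segments/part-00000 > /tmp/part-00000
hadoop dfs -put /tmp/part-00000 hdfs://namenodeB:8020/segments/part-00000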


To copy between clusters, there is a tool called distcp. Look at
bin/hadoop distcp. It runs a map/reduce job that copies a group of
files. It can also be used to copy between versions of hadoop, if the
source file system is hftp, which uses xml to read hdfs.
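
For the simple same-version case, an invocation would look something
like this (hostnames, ports, and paths here are made up; 8020 is the
usual namenode port):

bin/hadoop distcp hdfs://namenodeA:8020/user/nutch/segments \
                  hdfs://namenodeB:8020/user/nutch/segments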


-- Owen


Re: Cross-data centre DFS communication?

2008-02-28 Thread Steve Sapovits

Owen O'Malley wrote:

To copy between clusters, there is a tool called distcp. Look at 
bin/hadoop distcp. It runs a map/reduce job that copies a group of 
files. It can also be used to copy between versions of hadoop, if the 
source file system is hftp, which uses xml to read hdfs.


Can you further explain the hftp part of this?  I'm not familiar with
that. We have a similar need to go cross-data center.  In an earlier
post it was suggested that there was no map/reduce model for that, so
this sounds more like what we're looking for.


--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]


Re: Cross-data centre DFS communication?

2008-02-28 Thread Owen O'Malley


On Feb 28, 2008, at 8:20 AM, Steve Sapovits wrote:

Can you further explain the hftp part of this?  I'm not familiar with
that. We have a similar need to go cross-data center.


Sure, the info server on the name node of HDFS has a read-only
interface that lists directories in xml and allows the client to read
files over http. There is a FileSystem implementation that provides
the client-side interface to the xml/http access.


To use it, you need a path with hftp as the protocol:
hadoop distcp hftp://namenode1:50070/foo/bar hdfs://namenode2:8020/foo
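
You can also poke at the read-only interface directly with any http
client. The servlet paths below (listPaths for the xml directory
listing, data for file contents) are from memory, so check them
against your namenode's version:

curl 'http://namenode1:50070/listPaths/foo/bar?recursive=yes'   # xml listing
curl 'http://namenode1:50070/data/foo/bar/somefile'             # file contents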



In an earlier post it was suggested that there was no map/reduce model
for that, so this sounds more like what we're looking for.


It isn't a good idea to run map/reduce jobs across clusters, so you
usually need to copy the data locally.
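
In other words, the usual pattern is pull first, compute second.
Something like this (jar and class names are placeholders):

# copy the remote input down over hftp, then run the job on the local cluster
hadoop distcp hftp://remote-nn:50070/user/nutch/crawl hdfs://local-nn:8020/staging/crawl
hadoop jar my-analysis.jar MyAnalysisJob /staging/crawl /staging/crawl-out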


-- Owen


Re: Cross-data centre DFS communication?

2008-02-28 Thread Steve Sapovits

Owen O'Malley wrote:

Sure, the info server on the name node of HDFS has a read-only
interface that lists directories in xml and allows the client to read
files over http. There is a FileSystem implementation that provides
the client-side interface to the xml/http access.


To use it, you need a path with hftp as the protocol:
hadoop distcp hftp://namenode1:50070/foo/bar hdfs://namenode2:8020/foo


Very useful.  Thanks.

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]