Re: Could hadoop do word count in all files under two-level sub folders?

2009-05-23 Thread Zhengguo 'Mike' SUN
Check TextInputFormat. You could override it to achieve that.
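For the two-level layout described below, the stock example can also (if I recall the old API correctly) just be given each subfolder as an input path, e.g. `FileInputFormat.setInputPaths(conf, new Path("input/input1"), new Path("input/input2"))`, or a glob such as `input/*`, since FileInputFormat lists each input directory non-recursively. A truly recursive variant would override the input format's file listing; the traversal it needs can be sketched with plain java.nio (no Hadoop dependency, class name is illustrative):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecursiveInputLister {

    // Collect every regular file under root, however deeply nested --
    // the same traversal an overridden listStatus() would perform
    // against the Hadoop FileSystem API.
    public static List<Path> listFiles(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build the layout from the question: input/input1 and input/input2.
        Path root = Files.createTempDirectory("input");
        Files.createDirectories(root.resolve("input1"));
        Files.createDirectories(root.resolve("input2"));
        Files.write(root.resolve("input1/a.txt"), "hello world".getBytes());
        Files.write(root.resolve("input2/b.txt"), "hello hadoop".getBytes());
        System.out.println(listFiles(root).size()); // prints 2
    }
}
```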





From: Kunsheng Chen 
To: core-user@hadoop.apache.org
Sent: Saturday, May 23, 2009 5:04:50 PM
Subject: Could hadoop do word count in all files under two-level sub folders?


Hello everyone,


I referred to the Hadoop tutorial online and found the wordcount example. It 
seems to me that all input files have to be directly under one folder for it to work.


I am not sure whether that wordcount example would work with multiple subfolders.

For example, if the input folder is 'input', and I have two subfolders 
'input/input1' and 'input/input2', will it work if I only give the program the 
folder 'input'?  Or do I have to do a little programming myself for that?


I know it is a simple question that I could try myself, but the thing is 
that I haven't decided whether to use Hadoop for my project and haven't 
installed it yet.


So any simple answer or idea is well appreciated. 

Thanks a lot!

-Kun 


  

Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread Raghu Angadi

Raghu Angadi wrote:
> As a hack, you could tunnel NN traffic from GridFTP clients through a
> different machine (by changing fs.default.name). Alternately these
> clients could use a SOCKS proxy.

A SOCKS proxy would not be useful, though, since you don't want datanode 
traffic to go through the proxy.

> The amount of traffic to NN is not much and tunneling should not affect
> performance.

Raghu.

Brian Bockelman wrote:

Hey all,

Had a problem I wanted to ask advice on.  The Caltech site I work with 
currently has a few GridFTP servers which are on the same physical 
machines as the Hadoop datanodes, and a few that aren't.  The GridFTP 
server has a libhdfs backend which writes incoming network data into 
HDFS.


They've found that the GridFTP servers which are co-located with an HDFS 
datanode have poor performance because data is coming in at a much 
faster rate than the HDD can handle.  The standalone GridFTP servers, 
however, push data out to multiple nodes at once, and can handle the 
incoming data just fine (>200MB/s).


Is there any way to turn off the preference for the local node?  Can 
anyone think of a good workaround to trick HDFS into thinking the 
client isn't on the same node?


Brian







Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread Raghu Angadi
As a hack, you could tunnel NN traffic from GridFTP clients through a 
different machine (by changing fs.default.name). Alternately these 
clients could use a SOCKS proxy.


The amount of traffic to NN is not much and tunneling should not affect 
performance.


Raghu.
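The fs.default.name hack above amounts to something like the following in the client-side hadoop-site.xml, assuming a tunnel such as `ssh -L 9000:namenode:9000 gateway` is already up (hostnames and port numbers here are made up for illustration):

```xml
<!-- Client-side hadoop-site.xml: point the NameNode RPC address at the
     local end of the tunnel instead of at the real NameNode.
     Host and port below are illustrative only. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
```

Block transfers still go directly to the datanodes, which is why this only redirects the (light) NameNode traffic.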

Brian Bockelman wrote:

Hey all,

Had a problem I wanted to ask advice on.  The Caltech site I work with 
currently has a few GridFTP servers which are on the same physical 
machines as the Hadoop datanodes, and a few that aren't.  The GridFTP 
server has a libhdfs backend which writes incoming network data into HDFS.


They've found that the GridFTP servers which are co-located with an HDFS 
datanode have poor performance because data is coming in at a much faster 
rate than the HDD can handle.  The standalone GridFTP servers, however, 
push data out to multiple nodes at once, and can handle the incoming data 
just fine (>200MB/s).


Is there any way to turn off the preference for the local node?  Can 
anyone think of a good workaround to trick HDFS into thinking the client 
isn't on the same node?


Brian




Could hadoop do word count in all files under two-level sub folders?

2009-05-23 Thread Kunsheng Chen

Hello everyone,


I referred to the Hadoop tutorial online and found the wordcount example. It 
seems to me that all input files have to be directly under one folder for it to work.


I am not sure whether that wordcount example would work with multiple subfolders.

For example, if the input folder is 'input', and I have two subfolders 
'input/input1' and 'input/input2', will it work if I only give the program the 
folder 'input'?  Or do I have to do a little programming myself for that?


I know it is a simple question that I could try myself, but the thing is 
that I haven't decided whether to use Hadoop for my project and haven't 
installed it yet.


So any simple answer or idea is well appreciated. 

Thanks a lot!

-Kun 


  


Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread Tom White
You can't use it yet, but
https://issues.apache.org/jira/browse/HADOOP-3799 (Design a pluggable
interface to place replicas of blocks in HDFS) would enable you to
write your own policy so that blocks are never placed locally. It might be
worth following its development to check that it can meet your needs.

Cheers,
Tom
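The pluggable interface that JIRA proposes had not landed at the time of this thread, so the sketch below defines its own toy interface purely to illustrate the idea of a policy that skips the writer's own host. Every name and signature here is made up for illustration; none of it is the real HDFS API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy stand-in for the pluggable placement interface HADOOP-3799
// proposes; the name and signature are illustrative, not the real API.
interface PlacementPolicy {
    List<String> chooseTargets(String writerHost, List<String> liveNodes,
                               int replicas);
}

// A policy that never places a replica on the writer's own host,
// which is the behavior the co-located GridFTP nodes want.
class NoLocalPlacement implements PlacementPolicy {
    private final Random rng = new Random();

    public List<String> chooseTargets(String writerHost,
                                      List<String> liveNodes, int replicas) {
        List<String> candidates = new ArrayList<>(liveNodes);
        candidates.remove(writerHost);              // skip the local datanode
        Collections.shuffle(candidates, rng);       // spread load randomly
        return candidates.subList(0, Math.min(replicas, candidates.size()));
    }
}

public class PlacementDemo {
    public static void main(String[] args) {
        List<String> nodes = List.of("node-a", "node-b", "node-c", "node-d");
        PlacementPolicy p = new NoLocalPlacement();
        // Writing from node-a: all replicas land on the other nodes.
        System.out.println(p.chooseTargets("node-a", nodes, 3));
    }
}
```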

On Sat, May 23, 2009 at 8:06 PM, jason hadoop  wrote:
> Can you give your machines multiple IP addresses, and bind the grid server
> to a different IP than the datanode? With Solaris you could put it in a
> different zone.
>
> On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman 
> wrote:
>
>> Hey all,
>>
>> Had a problem I wanted to ask advice on.  The Caltech site I work with
>> currently has a few GridFTP servers which are on the same physical machines
>> as the Hadoop datanodes, and a few that aren't.  The GridFTP server has a
>> libhdfs backend which writes incoming network data into HDFS.
>>
>> They've found that the GridFTP servers which are co-located with an HDFS
>> datanode have poor performance because data is coming in at a much faster
>> rate than the HDD can handle.  The standalone GridFTP servers, however, push
>> data out to multiple nodes at once, and can handle the incoming data just
>> fine (>200MB/s).
>>
>> Is there any way to turn off the preference for the local node?  Can anyone
>> think of a good workaround to trick HDFS into thinking the client isn't on
>> the same node?
>>
>> Brian
>
>
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
>


Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread jason hadoop
Can you give your machines multiple IP addresses, and bind the grid server
to a different IP than the datanode? With Solaris you could put it in a
different zone.

On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman wrote:

> Hey all,
>
> Had a problem I wanted to ask advice on.  The Caltech site I work with
> currently has a few GridFTP servers which are on the same physical machines
> as the Hadoop datanodes, and a few that aren't.  The GridFTP server has a
> libhdfs backend which writes incoming network data into HDFS.
>
> They've found that the GridFTP servers which are co-located with an HDFS
> datanode have poor performance because data is coming in at a much faster
> rate than the HDD can handle.  The standalone GridFTP servers, however, push
> data out to multiple nodes at once, and can handle the incoming data just
> fine (>200MB/s).
>
> Is there any way to turn off the preference for the local node?  Can anyone
> think of a good workaround to trick HDFS into thinking the client isn't on
> the same node?
>
> Brian




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Circumventing Hadoop's data placement policy

2009-05-23 Thread Brian Bockelman

Hey all,

Had a problem I wanted to ask advice on.  The Caltech site I work with  
currently has a few GridFTP servers which are on the same physical  
machines as the Hadoop datanodes, and a few that aren't.  The GridFTP  
server has a libhdfs backend which writes incoming network data into  
HDFS.


They've found that the GridFTP servers which are co-located with an HDFS  
datanode have poor performance because data is coming in at a much  
faster rate than the HDD can handle.  The standalone GridFTP servers,  
however, push data out to multiple nodes at once, and can handle the  
incoming data just fine (>200MB/s).


Is there any way to turn off the preference for the local node?  Can  
anyone think of a good workaround to trick HDFS into thinking the  
client isn't on the same node?


Brian
