Re: Could hadoop do word count in all files under two-level sub folders?
Check TextInputFormat. You could override it to achieve that.

From: Kunsheng Chen
To: core-user@hadoop.apache.org
Sent: Saturday, May 23, 2009 5:04:50 PM
Subject: Could hadoop do word count in all files under two-level sub folders?

Hello everyone,

I referred to the Hadoop tutorial online and found the wordcount example. It seems to me that all files have to be under a certain folder to make it work, and I am not sure whether that wordcount example could work for multiple subfolders. For example, if the input folder is 'input' and I have two subfolders 'input/input1' and 'input/input2', will it work if I only tell the program the folder 'input'? Or do I have to program a little bit myself for that?

I know it is a simple question that I could try myself, but the thing is that I haven't decided whether to use Hadoop for my project and have not yet installed it. So any simple answer or idea is well appreciated.

Thanks a lot!

-Kun
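Besides overriding the input format, a path glob is often enough for one level of subfolders, since FileInputFormat expands globs even though it does not recurse on its own. A sketch of invoking the stock wordcount example this way (the jar name and paths here are hypothetical, not taken from the thread):

```shell
# FileInputFormat does not descend into subdirectories by default,
# but it does expand glob patterns, so naming the subfolder level
# explicitly picks up the files in input/input1 and input/input2:
hadoop jar hadoop-examples.jar wordcount 'input/*' output
```

The quotes matter: they keep the local shell from expanding the glob, so the pattern reaches Hadoop and is matched against HDFS paths instead.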
Re: Circumventing Hadoop's data placement policy
Raghu Angadi wrote:
> As a hack, you could tunnel NN traffic from GridFTP clients through a different machine (by changing fs.default.name). Alternately, these clients could use a SOCKS proxy.

A SOCKS proxy would not be useful, since you don't want datanode traffic to go through the proxy.

Raghu.

> The amount of traffic to the NN is not much, and tunneling should not affect performance.
>
> Raghu.
>
> Brian Bockelman wrote:
>> Hey all,
>>
>> Had a problem I wanted to ask advice on. The Caltech site I work with currently has a few GridFTP servers that are on the same physical machines as the Hadoop datanodes, and a few that aren't. The GridFTP server has a libhdfs backend which writes incoming network data into HDFS.
>>
>> They've found that the GridFTP servers which are co-located with an HDFS datanode have poor performance, because data comes in at a much faster rate than the HDD can handle. The standalone GridFTP servers, however, push data out to multiple nodes at once, and can handle the incoming data just fine (>200MB/s).
>>
>> Is there any way to turn off the preference for the local node? Can anyone think of a good workaround to trick HDFS into thinking the client isn't on the same node?
>>
>> Brian
Re: Circumventing Hadoop's data placement policy
As a hack, you could tunnel NN traffic from GridFTP clients through a different machine (by changing fs.default.name). Alternately, these clients could use a SOCKS proxy. The amount of traffic to the NN is not much, and tunneling should not affect performance.

Raghu.

Brian Bockelman wrote:
> Hey all,
>
> Had a problem I wanted to ask advice on. The Caltech site I work with currently has a few GridFTP servers that are on the same physical machines as the Hadoop datanodes, and a few that aren't. The GridFTP server has a libhdfs backend which writes incoming network data into HDFS.
>
> They've found that the GridFTP servers which are co-located with an HDFS datanode have poor performance, because data comes in at a much faster rate than the HDD can handle. The standalone GridFTP servers, however, push data out to multiple nodes at once, and can handle the incoming data just fine (>200MB/s).
>
> Is there any way to turn off the preference for the local node? Can anyone think of a good workaround to trick HDFS into thinking the client isn't on the same node?
>
> Brian
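The fs.default.name suggestion amounts to a one-property change in the client-side configuration, so the namenode sees the tunnel host's address instead of the datanode's. A sketch, with a hypothetical hostname and port:

```xml
<!-- Client-side hadoop-site.xml for the GridFTP hosts: point NN RPC
     at the tunnel endpoint (host and port here are hypothetical).
     Block data still flows directly between client and datanodes,
     so only the lightweight NN traffic takes the detour. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://tunnel-host:9000/</value>
</property>
```

Because the NN infers "local" from the address the RPC connection arrives from, a client connecting via another machine no longer looks co-located, and the first replica is placed elsewhere.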
Could hadoop do word count in all files under two-level sub folders?
Hello everyone,

I referred to the Hadoop tutorial online and found the wordcount example. It seems to me that all files have to be under a certain folder to make it work, and I am not sure whether that wordcount example could work for multiple subfolders. For example, if the input folder is 'input' and I have two subfolders 'input/input1' and 'input/input2', will it work if I only tell the program the folder 'input'? Or do I have to program a little bit myself for that?

I know it is a simple question that I could try myself, but the thing is that I haven't decided whether to use Hadoop for my project and have not yet installed it. So any simple answer or idea is well appreciated.

Thanks a lot!

-Kun
Re: Circumventing Hadoop's data placement policy
You can't use it yet, but https://issues.apache.org/jira/browse/HADOOP-3799 (Design a pluggable interface to place replicas of blocks in HDFS) would enable you to write your own policy so blocks are never placed locally. It might be worth following its development to check whether it can meet your need.

Cheers,
Tom

On Sat, May 23, 2009 at 8:06 PM, jason hadoop wrote:
> Can you give your machines multiple IP addresses, and bind the grid server to a different IP than the datanode? With Solaris you could put it in a different zone.
>
> On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman wrote:
>> Hey all,
>>
>> Had a problem I wanted to ask advice on. The Caltech site I work with currently has a few GridFTP servers that are on the same physical machines as the Hadoop datanodes, and a few that aren't. The GridFTP server has a libhdfs backend which writes incoming network data into HDFS.
>>
>> They've found that the GridFTP servers which are co-located with an HDFS datanode have poor performance, because data comes in at a much faster rate than the HDD can handle. The standalone GridFTP servers, however, push data out to multiple nodes at once, and can handle the incoming data just fine (>200MB/s).
>>
>> Is there any way to turn off the preference for the local node? Can anyone think of a good workaround to trick HDFS into thinking the client isn't on the same node?
>>
>> Brian
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
Re: Circumventing Hadoop's data placement policy
Can you give your machines multiple IP addresses, and bind the grid server to a different IP than the datanode? With Solaris you could put it in a different zone.

On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman wrote:
> Hey all,
>
> Had a problem I wanted to ask advice on. The Caltech site I work with currently has a few GridFTP servers that are on the same physical machines as the Hadoop datanodes, and a few that aren't. The GridFTP server has a libhdfs backend which writes incoming network data into HDFS.
>
> They've found that the GridFTP servers which are co-located with an HDFS datanode have poor performance, because data comes in at a much faster rate than the HDD can handle. The standalone GridFTP servers, however, push data out to multiple nodes at once, and can handle the incoming data just fine (>200MB/s).
>
> Is there any way to turn off the preference for the local node? Can anyone think of a good workaround to trick HDFS into thinking the client isn't on the same node?
>
> Brian

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
Circumventing Hadoop's data placement policy
Hey all,

Had a problem I wanted to ask advice on. The Caltech site I work with currently has a few GridFTP servers that are on the same physical machines as the Hadoop datanodes, and a few that aren't. The GridFTP server has a libhdfs backend which writes incoming network data into HDFS.

They've found that the GridFTP servers which are co-located with an HDFS datanode have poor performance, because data comes in at a much faster rate than the HDD can handle. The standalone GridFTP servers, however, push data out to multiple nodes at once, and can handle the incoming data just fine (>200MB/s).

Is there any way to turn off the preference for the local node? Can anyone think of a good workaround to trick HDFS into thinking the client isn't on the same node?

Brian