Re: basic questions about Hadoop!
Gerardo,

Thanks for the information. I've had success with remote writing to HDFS using the following steps:

1. Install the latest stable version (Hadoop 0.17.2.1) on the data nodes and on the client machine.
2. Open ports 50010, 50070, 54310 and 54311 on the data node machines so they can be reached from the client machine and from the other data nodes.
3. Use ./bin/hadoop dfs -put to send files to the remote HDFS.

Hope this helps you.

Thanks,
Victor
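For anyone following along, here is a minimal sketch of the client-side setup Victor describes. The hostname namenode-host and the file paths are placeholders, and the namenode port is assumed to be the 54310 opened in step 2:

  conf/hadoop-site.xml on the client machine:

    <?xml version="1.0"?>
    <configuration>
      <!-- point the client at the remote namenode (0.17-era property name) -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode-host:54310</value>
      </property>
    </configuration>

  then, from the client's Hadoop directory (step 3):

    bin/hadoop dfs -put /tmp/report.txt /user/hadoop/input/report.txt

With that configuration the dfs commands talk to the namenode for metadata and stream the file contents straight to the datanodes, which is why the datanode ports also need to be reachable from the client.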
Re: basic questions about Hadoop!
On Sat, Aug 30, 2008 at 10:12 AM, Gerardo Velez [EMAIL PROTECTED] wrote:

Hi Victor!

I had problems with remote writing as well, so I tried to go further on this and I'd like to share what I did; maybe you'll have more luck than me.

1) As I'm working as user gvelez on the remote host, I had to give write access to all, like this:

bin/hadoop dfs -chmod -R a+w input

2) After that there is no more "connection refused" error, but instead I got the following exception:

$ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
08/08/29 19:06:51 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hadoop/input/README.txt could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

How many datanodes do you have? Only one, I guess. Modify your $HADOOP_HOME/conf/hadoop-site.xml and look up the property:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

and set the value to 1, so the replication factor does not exceed the number of datanodes you actually have.
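Independent of the replication setting, the "could only be replicated to 0 nodes" message means the namenode does not see any live datanodes at that moment. A quick way to check, run from a machine that has the cluster configuration (a sketch; the log path assumes the stock layout):

  bin/hadoop dfsadmin -report
  # a line like "Datanodes available: 0" means no datanode has registered
  # with the namenode, so every write will fail exactly as above; in that
  # case look at the datanode's own log on the slave machine:
  tail -n 100 logs/hadoop-*-datanode-*.log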
Re: basic questions about Hadoop!
Jeff,

Thanks for the detailed instructions, but on a machine that is not a Hadoop server I got this error:

~/hadoop-0.17.2$ ./bin/hadoop dfs -copyFromLocal NOTICE.txt test
08/08/29 19:33:07 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/08/29 19:33:07 INFO dfs.DFSClient: Abandoning block blk_-7622891475776838399

The odd thing is that the file was created, but with zero size. Do you have any ideas why this happened?

Thanks,
Victor
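A "Connection refused" in createBlockOutputStream usually means the namenode accepted the file creation (which is why the zero-length entry shows up) but the client could not open a data connection to any datanode. Two quick checks from the client machine, with datanode-host as a placeholder:

  # is the datanode's data-transfer port (50010 by default) reachable from here?
  telnet datanode-host 50010

  # is the DataNode process actually running on that machine?
  ssh datanode-host jps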
Re: basic questions about Hadoop!
Hi Victor!

I had problems with remote writing as well, so I tried to go further on this and I'd like to share what I did; maybe you'll have more luck than me.

1) As I'm working as user gvelez on the remote host, I had to give write access to all, like this:

bin/hadoop dfs -chmod -R a+w input

2) After that there is no more "connection refused" error, but instead I got the following exception:

$ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
08/08/29 19:06:51 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hadoop/input/README.txt could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
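A small sanity check for step 1 (paths as in the message above): listing the parent directory shows whether the mode bits actually changed before retrying the copy.

  bin/hadoop dfs -ls /user/hadoop
  # expect an entry roughly like:
  #   drwxrwxrwx   - hadoop supergroup  ...  /user/hadoop/input

The error in step 2, on the other hand, is about datanode availability rather than permissions, as discussed elsewhere in the thread.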
Re: basic questions about Hadoop!
Gerardo:

I can't really speak to all of your questions, but the master/slave issue is a common concern with Hadoop.

A cluster has a single namenode and therefore a single point of failure. There is also a secondary namenode process, which runs on the same machine as the namenode in most default configurations; you can move it to a different machine by adjusting the masters file. One of the more experienced lurkers should feel free to correct me, but my understanding is that the secondary namenode keeps track of the same index information used by the primary namenode. So if the namenode fails there is no automatic recovery, but you can always tweak your cluster configuration to make the secondary namenode the primary and safely restart the cluster.

As for the storage of files, the namenode is really just the traffic cop for HDFS. No HDFS file data is actually stored on that machine; it is basically used as a directory and lock manager, etc. The files are stored on multiple datanodes, and I'm pretty sure all the actual file I/O happens directly between the client and the respective datanodes.

Perhaps one of the more hardcore Hadoop people on here will point it out if I'm giving bad advice.

On Thu, Aug 28, 2008 at 2:28 PM, Gerardo Velez [EMAIL PROTECTED] wrote:

Hi Everybody!

I'm a newbie with Hadoop. I've installed it as a single node in a pseudo-distributed environment, but I would like to go further and configure a complete Hadoop cluster. I have the following questions:

1.- I understand that HDFS has a master/slave architecture, and the master server manages the file system namespace and regulates access to files by clients. So what happens in a cluster environment if the master server fails or is down due to network issues? Does a slave become the master server or something?

2.- What about the Hadoop filesystem from the client's point of view? Should the client only store files in HDFS through the master server, or can clients store the files to be processed through a slave server as well?

3.- Until now, what I'm doing to run Hadoop is:
1.- Copy the file to be processed from the Linux file system to HDFS.
2.- Run the Hadoop shell: hadoop jar <jarfile> input output
3.- The results are stored in the output directory.
Is there any way to run Hadoop as a daemon, so that when a file is stored in HDFS it is processed automatically (without running the Hadoop shell every time)?

4.- What happens with processed files, are they deleted from HDFS automatically?

Thanks in advance!

-- Gerardo Velez

--
Jeffrey Payne
Lead Software Engineer
Eyealike, Inc.
[EMAIL PROTECTED]
www.eyealike.com
(206) 257-8708

Anything worth doing is worth overdoing. -H. Lifter
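Regarding Jeff's point that the secondary namenode can be moved to a different machine by adjusting the masters file, a minimal sketch for a 0.17-era install (the hostname is a placeholder):

  # conf/masters lists the host(s) where bin/start-dfs.sh will start the
  # secondary namenode daemon; the namenode itself runs on the machine
  # where the start script is invoked
  echo "secondary-nn-host" > conf/masters

  # restart HDFS so the change takes effect
  bin/stop-dfs.sh
  bin/start-dfs.sh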
Re: basic questions about Hadoop!
Hi Jeff, thank you for answering!

What about remote writing on HDFS? Let's suppose I have an application server on Linux server A, and a Hadoop cluster on servers B (master), C (slave) and D (slave).

What I would like is to send some files from server A to be processed by Hadoop. In order to do so, what do I need to do? Do I need to send those files to the master server first and then copy them into HDFS, or can I pass those files to any slave server?

Basically I'm looking for remote writing, because the files to be processed are not generated on any Hadoop server.

Thanks again!

-- Gerardo
Re: basic questions about Hadoop!
You can use the Hadoop command line on machines that aren't Hadoop servers. If you copy the Hadoop configuration from one of your master servers or data nodes to the client machine and run the command-line dfs tools there, they will copy the files directly to the data nodes.

Or you could use one of the client libraries. The Java client, for example, allows you to open an output stream and start dumping bytes into it.

--
Jeffrey Payne
Lead Software Engineer
Eyealike, Inc.
[EMAIL PROTECTED]
www.eyealike.com
(206) 257-8708

Anything worth doing is worth overdoing. -H. Lifter
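A sketch of the first option Jeff mentions (copying the cluster configuration to a machine that is not part of the cluster and using the dfs tools from there); the hostnames and paths here are made up:

  # grab the working configuration from the master
  scp hadoop-master:/opt/hadoop-0.17.2/conf/hadoop-site.xml conf/

  # sanity check: can this machine reach the namenode?
  bin/hadoop dfs -ls /

  # then write straight into HDFS; the file contents go to the datanodes directly
  bin/hadoop dfs -copyFromLocal report.txt /user/hadoop/input/report.txt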
Re: basic questions about Hadoop!
Thanks Jeff, and sorry for bothering you again!

The remote writing into HDFS is clear to me now, but what about the Hadoop processing itself? Once the file has been copied to HDFS, do I still need to run hadoop jar <jarfile> input output every time? And if I do need to run it every time, should I run it from the remote server as well?

Thanks for helping, and for your patience.

-- Gerardo
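There is no built-in "watch this directory" daemon in this version of Hadoop, so the usual workaround for running the job automatically is a small wrapper that polls an agreed-upon HDFS directory and launches the job whenever new files show up. A rough sketch only; the incoming directory, jar name and job arguments are all made up for illustration:

  #!/bin/sh
  # poll an HDFS "incoming" directory once a minute and run the job on new data
  while true; do
    if bin/hadoop dfs -ls /user/hadoop/incoming 2>/dev/null | grep -q '/user/hadoop/incoming/'; then
      out=/user/hadoop/output-$(date +%s)        # output directory must not already exist
      bin/hadoop jar myjob.jar /user/hadoop/incoming "$out"
      bin/hadoop dfs -rmr /user/hadoop/incoming  # processed input is NOT removed automatically
    fi
    sleep 60
  done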