Re: basic questions about Hadoop!

2008-09-01 Thread Victor Samoylov
Gerardo,

Thanks for the information.
I've had success with remote writing to HDFS using the following steps:
1. Install the latest stable version (hadoop 0.17.2.1) on the data nodes
and the client machine.
2. Open ports 50010, 50070, 54310 and 54311 on the data node machines so they
can be reached from the client machine and the other data nodes.
3. Use the ./bin/hadoop dfs -put command to send files to the remote HDFS (a
programmatic alternative is sketched below).
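
For reference, the same remote write can also be done programmatically with the
Hadoop Java client that Jeff mentioned. A minimal sketch, assuming the NameNode
runs on a host called "master" on port 54310 (one of the ports opened in step 2)
and that the Hadoop jars are on the client's classpath; the class name and paths
are only examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemotePut {
        public static void main(String[] args) throws Exception {
            // Point the client at the remote NameNode (host/port are assumptions).
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://master:54310");

            FileSystem fs = FileSystem.get(conf);
            // Equivalent of: bin/hadoop dfs -put README.txt /user/hadoop/input/README.txt
            fs.copyFromLocalFile(new Path("README.txt"),
                                 new Path("/user/hadoop/input/README.txt"));
            fs.close();
        }
    }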

Hope this helps you.

Thanks,
Victor

Re: basic questions about Hadoop!

2008-09-01 Thread Mafish Liu
On Sat, Aug 30, 2008 at 10:12 AM, Gerardo Velez [EMAIL PROTECTED] wrote:

 Hi Victor!

 I ran into problems with remote writing as well, so I tried to dig into this
 a bit further; I'd like to share what I did, and maybe you'll have more luck
 than me.

 1) Since I'm working as user gvelez on the remote host, I had to give write
 access to everyone, like this:

    bin/hadoop dfs -chmod -R a+w input

 2) After that there is no more "connection refused" error, but instead I get
 the following exception:



 $ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
 cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
 08/08/29 19:06:51 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hadoop/input/README.txt could only be replicated to 0 nodes, instead of 1
         at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:585)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

How many datanodes do you have? Only one, I guess.
Modify your $HADOOP_HOME/conf/hadoop-site.xml and look up

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

and set the value to 0.



Re: basic questions about Hadoop!

2008-08-29 Thread Victor Samoylov
Jeff,

Thanks for the detailed instructions, but on a machine that is not a hadoop
server I got this error:
~/hadoop-0.17.2$ ./bin/hadoop dfs -copyFromLocal NOTICE.txt test
08/08/29 19:33:07 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/08/29 19:33:07 INFO dfs.DFSClient: Abandoning block blk_-7622891475776838399
The thing is that the file was created, but with zero size.

Do you have any idea why this happened?

Thanks,
Victor



Re: basic questions about Hadoop!

2008-08-29 Thread Gerardo Velez
Hi Victor!

I ran into problems with remote writing as well, so I tried to dig into this
a bit further; I'd like to share what I did, and maybe you'll have more luck
than me.

1) Since I'm working as user gvelez on the remote host, I had to give write
access to everyone, like this:

bin/hadoop dfs -chmod -R a+w input

2) After that there is no more "connection refused" error, but instead I get
the following exception:



$ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
08/08/29 19:06:51 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hadoop/input/README.txt could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
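
The "could only be replicated to 0 nodes, instead of 1" message generally means
the NameNode does not see any live DataNodes at the time of the write; one way
to check whether the DataNode processes ever registered is to run
bin/hadoop dfsadmin -report against the same configuration, which lists the
DataNodes the NameNode currently considers available.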




Re: basic questions about Hadoop!

2008-08-28 Thread Jeff Payne
Gerardo:

I can't really speak to all of your questions, but the master/slave issue is
a common concern with hadoop.  A cluster has a single namenode and therefore
a single point of failure.  There is also a secondary name node process
which runs on the same machine as the name node in most default
configurations.  You can make it a different machine by adjusting the master
file.  One of the more experienced lurkers should feel free to correct me,
but my understanding is that the secondary name node keeps track of all the
same index information used by the primary name node.  So, if the namenode
fails, there is no automatic recovery, but you can always tweak your cluster
configuration to make the secondary namenode the primary and safely restart
the cluster.

As for the storage of files, the name node is really just the traffic cop
for HDFS.  No HDFS files are actually stored on that machine.  It's
basically used as a directory and lock manager, etc.  The files are stored
on multiple datanodes and I'm pretty sure all the actual file I/O happens
directly between the client and the respective datanodes.
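
To make that concrete, here is a minimal sketch of a read through the Java
FileSystem API: the client contacts the NameNode only for metadata, and the
bytes themselves are streamed from the DataNodes that hold the blocks (the path
is just an example, and the configuration is assumed to point at the cluster):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemoteRead {
        public static void main(String[] args) throws Exception {
            // Picks up hadoop-site.xml from the classpath; metadata calls go to the NameNode.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The stream pulls the block data directly from the DataNodes.
            FSDataInputStream in = fs.open(new Path("/user/hadoop/input/README.txt"));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
            fs.close();
        }
    }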

Perhaps one of the more hardcore hadoop people on here will point it out if
I'm giving bad advice.


On Thu, Aug 28, 2008 at 2:28 PM, Gerardo Velez [EMAIL PROTECTED] wrote:

 Hi Everybody!

 I'm a newbie with Hadoop. I've installed it on a single node as a
 pseudo-distributed environment, but I would like to go further and
 configure a complete hadoop cluster. I have the following questions.

 1.- I understand that HDFS has a master/slave architecture, where the master
 server manages the file system namespace and regulates access to files by
 clients. So, what happens in a cluster environment if the master server
 fails or is down due to network issues? Does a slave become the master
 server or something?


 2.- What about the Hadoop filesystem from the client's point of view: should
 the client only store files in HDFS through the master server, or can
 clients store the files to be processed in HDFS through a slave server as
 well?


 3.- Until now, what I'm doing to run hadoop is:

    1.- Copy the file to be processed from the Linux file system to HDFS
    2.- Run the hadoop shell:   hadoop   -jarfile  input output
    3.- The results are stored in the output directory


 Is there any way to have hadoop run as a daemon, so that when a file is
 stored in HDFS it is processed automatically with hadoop?

 (without running the hadoop shell every time)


 4.- What happens with processed files, are they deleted from HDFS
 automatically?


 Thanks in advance!


 -- Gerardo Velez




-- 
Jeffrey Payne
Lead Software Engineer
Eyealike, Inc.
[EMAIL PROTECTED]
www.eyealike.com
(206) 257-8708


Anything worth doing is worth overdoing.
-H. Lifter


Re: basic questions about Hadoop!

2008-08-28 Thread Gerardo Velez
Hi Jeff, thank you for answering!

What about remote writing to HDFS? Let's suppose I have an application server
on Linux server A, and a Hadoop cluster on servers B (master), C (slave) and
D (slave).

What I would like is to send some files from server A to be processed by
hadoop. In order to do so, do I need to send those files to the master server
first and then copy them into HDFS?

Or can I pass those files to any slave server?

Basically I'm looking for remote writing because the files to be processed are
not being generated on any hadoop server.

Thanks again!

-- Gerardo





Re: basic questions about Hadoop!

2008-08-28 Thread Jeff Payne
You can use the hadoop command line on machines that aren't hadoop servers.
If you copy the hadoop configuration from one of your master servers or data
nodes to the client machine and run the command-line dfs tools, it will copy
the files directly to the data nodes.

Or, you could use one of the client libraries.  The java client, for
example, allows you to open up an output stream and start dumping bytes on
it.
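
A minimal sketch of that output-stream approach, assuming the cluster's
configuration has been copied onto the client's classpath as described above;
the file name and contents are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamUpload {
        public static void main(String[] args) throws Exception {
            // Reads the copied hadoop-site.xml, so the client knows where the NameNode is.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Open an output stream in HDFS and start dumping bytes on it.
            FSDataOutputStream out = fs.create(new Path("/user/hadoop/input/data.txt"));
            out.write("bytes generated on the client machine\n".getBytes("UTF-8"));
            out.close();
            fs.close();
        }
    }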


-- 
Jeffrey Payne
Lead Software Engineer
Eyealike, Inc.
[EMAIL PROTECTED]
www.eyealike.com
(206) 257-8708


Anything worth doing is worth overdoing.
-H. Lifter


Re: basic questions about Hadoop!

2008-08-28 Thread Gerardo Velez
Thanks Jeff and sorry for bothering you again!

Remote writing into HDFS is clear to me now, but what about the hadoop
processing itself?

Once the file has been copied to HDFS, do I still need to run

hadoop -jarfile  input output   every time?

If I need to do it every time, should I do it from the remote server as well?


Thanks for helping, and for your patience.


-- Gerardo
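
One way to avoid invoking the shell by hand for every new file is to submit the
job from Java with the old mapred API and call it from the same code that
uploads the file. A rough sketch, using identity mapper/reducer classes as
stand-ins for a real job and assuming the 0.17-era JobConf path setters (later
releases moved these to FileInputFormat/FileOutputFormat):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitJob.class);
            conf.setJobName("process-input");

            // Identity mapper/reducer only demonstrate the submission path;
            // a real job would plug in its own classes here.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            conf.setInputPath(new Path("input"));
            conf.setOutputPath(new Path("output"));

            // Blocks until the job finishes, so it can run right after the upload.
            JobClient.runJob(conf);
        }
    }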

