Re: HDFS architecture based on GFS?
"If there was a malicious process though, then I imagine it could talk to a datanode directly and request a specific block." I didn't understand the usage of "malicious" here, but any process using the HDFS API should first ask the NameNode where the file's replicas are. Then - I assume - the NameNode returns the IP of the best DataNode (or all the IPs), and then the call to the specific DataNode is made. Please correct me if I'm wrong. Cheers, Rasit 2009/2/16 Matei Zaharia : > In general, yeah, the scripts can access any resource they want (within the > permissions of the user that the task runs as). It's also possible to access > HDFS from scripts because HDFS provides a FUSE interface that can make it > look like a regular file system on the machine. (The FUSE module in turn > talks to the namenode as a regular HDFS client.) > > On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana wrote: > >> I don't know much about Hadoop streaming and have a quick question here. >> >> The snippets of code/programs that you attach into the map reduce job might >> want to access outside resources (like you mentioned). Now these might not >> need to go to the namenode, right? For example a python script. How would it >> access the data? Would it ask the parent java process (in the tasktracker) >> to get the data or would it go and do stuff on its own? >> >> >> Amandeep Khurana >> Computer Science Graduate Student >> University of California, Santa Cruz >> >> >> On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia wrote: >> >> > Nope, typically the JobTracker just starts the process, and the >> tasktracker >> > talks directly to the namenode to get a pointer to the datanode, and then >> > directly to the datanode. >> > >> > On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana >> > wrote: >> > >> > > Alright.. Got it. >> > > >> > > Now, do the task trackers talk to the namenode and the data node >> directly >> > > or >> > > do they go through the job tracker for it?
So, if my code is such that >> I >> > > need to access more files from the hdfs, would the job tracker get >> > involved >> > > or not? >> > > >> > > >> > > >> > > >> > > Amandeep Khurana >> > > Computer Science Graduate Student >> > > University of California, Santa Cruz >> > > >> > > >> > > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia >> > wrote: >> > > >> > > > Normally, HDFS files are accessed through the namenode. If there was >> a >> > > > malicious process though, then I imagine it could talk to a datanode >> > > > directly and request a specific block. >> > > > >> > > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana >> > > > wrote: >> > > > >> > > > > Ok. Got it. >> > > > > >> > > > > Now, when my job needs to access another file, does it go to the >> > > Namenode >> > > > > to >> > > > > get the block ids? How does the java process know where the files >> are >> > > and >> > > > > how to access them? >> > > > > >> > > > > >> > > > > Amandeep Khurana >> > > > > Computer Science Graduate Student >> > > > > University of California, Santa Cruz >> > > > > >> > > > > >> > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia > > >> > > > wrote: >> > > > > >> > > > > > I mentioned this case because even jobs written in Java can use >> the >> > > > HDFS >> > > > > > API >> > > > > > to talk to the NameNode and access the filesystem. People often >> do >> > > this >> > > > > > because their job needs to read a config file, some small data >> > table, >> > > > etc >> > > > > > and use this information in its map or reduce functions. In this >> > > case, >> > > > > you >> > > > > > open the second file separately in your mapper's init function >> and >> > > read >> > > > > > whatever you need from it. 
In general I wanted to point out that >> > you >> > > > > can't >> > > > > > know which files a job will access unless you look at its source >> > code >> > > > or >> > > > > > monitor the calls it makes; the input file(s) you provide in the >> > job >> > > > > > description are a hint to the MapReduce framework to place your >> job >> > > on >> > > > > > certain nodes, but it's reasonable for the job to access other >> > files >> > > as >> > > > > > well. >> > > > > > >> > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana < >> > ama...@gmail.com> >> > > > > > wrote: >> > > > > > >> > > > > > > Another question that I have here - When the jobs run arbitrary >> > > code >> > > > > and >> > > > > > > access data from the HDFS, do they go to the namenode to get >> the >> > > > block >> > > > > > > information? >> > > > > > > >> > > > > > > >> > > > > > > Amandeep Khurana >> > > > > > > Computer Science Graduate Student >> > > > > > > University of California, Santa Cruz >> > > > > > > >> > > > > > > >> > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana < >> > > ama...@gmail.com> >> > > > > > > wrote: >> > > > > > > >> > > > > > > > Assuming that the job is purely in Java and not involving >> > > streaming >> > > > > or >> > > > > > > > pipes, wouldnt the resources (files) required by
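The two-step read path Rasit describes (metadata from the NameNode, block data straight from a DataNode) can be sketched as a toy model. Everything here is invented for illustration -- the file name, block ids, and the two map "servers" -- and a real client simply calls `FileSystem.open()`, with the Hadoop client library doing these two steps under the hood:

```java
import java.util.*;

// Toy model of the HDFS read path discussed in this thread: the client
// first asks the "NameNode" (metadata only) where a file's blocks live,
// then fetches each block directly from a "DataNode". All names and data
// below are made up; this only illustrates the control flow.
public class HdfsReadSketch {
    // "NameNode" table: file name -> ordered list of [blockId, datanodeHost].
    static Map<String, List<String[]>> nameNode = new HashMap<>();
    // "DataNode" storage: blockId -> block contents.
    static Map<String, String> dataNodes = new HashMap<>();

    static String read(String file) {
        List<String[]> blocks = nameNode.get(file);   // step 1: metadata lookup
        StringBuilder out = new StringBuilder();
        for (String[] blk : blocks)                   // step 2: read each block
            out.append(dataNodes.get(blk[0]));        // directly from a datanode
        return out.toString();
    }

    static String demo() {
        nameNode.put("/user/demo/part-0", Arrays.asList(
            new String[]{"blk_1", "dn1"}, new String[]{"blk_2", "dn2"}));
        dataNodes.put("blk_1", "hello ");
        dataNodes.put("blk_2", "world");
        return read("/user/demo/part-0");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "hello world"
    }
}
```

This also makes Matei's point concrete: any code path that can reach the NameNode (a mapper opening a second file, a FUSE mount, a streaming script) goes through the same lookup, so the framework cannot know in advance which files a job will touch.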
Re: HADOOP-2536 supports Oracle too?
@Amandeep Hi, I'm new to Hadoop and am trying to run a simple database connectivity program on it. Could you please tell me how you went about it? My mail id is "sandys_cr...@yahoo.com" . A copy of your code that successfully connected to MySQL will also be helpful. Thanks, Sandhiya Enis Soztutar-2 wrote: > > From the exception : > > java.io.IOException: ORA-00933: SQL command not properly ended > > I would broadly guess that the Oracle JDBC driver might be complaining that > the statement does not end with ";", or something similar. You can > 1. download the latest source code of hadoop > 2. add a print statement printing the query (probably in > DBInputFormat:119) > 3. build the hadoop jar > 4. use the new hadoop jar to see the actual SQL query > 5. run the query on Oracle to see if it gives an error. > > Enis > > > Amandeep Khurana wrote: >> Ok. I created the same database in a MySQL database and ran the same >> hadoop >> job against it. It worked. So, that means there is some Oracle-specific >> issue. It can't be an issue with the JDBC drivers since I am using the >> same >> drivers in a simple JDBC client. >> >> What could it be? >> >> Amandeep >> >> >> Amandeep Khurana >> Computer Science Graduate Student >> University of California, Santa Cruz >> >> >> On Wed, Feb 4, 2009 at 10:26 AM, Amandeep Khurana >> wrote: >> >> >>> Ok. I'm not sure if I got it correct. Are you saying I should test the >>> statement that hadoop creates directly with the database? >>> >>> Amandeep >>> >>> >>> Amandeep Khurana >>> Computer Science Graduate Student >>> University of California, Santa Cruz >>> >>> >>> On Wed, Feb 4, 2009 at 7:13 AM, Enis Soztutar >>> wrote: >>> >>> Hadoop-2536 connects to the db via JDBC, so in theory it should work with proper JDBC drivers. It has been tested against MySQL, Hsqldb, and PostgreSQL, but not Oracle.
To answer your earlier question, the actual SQL statements might not be recognized by Oracle, so I suggest the best way to test this is to insert print statements, and run the actual SQL statements against Oracle to see if the syntax is accepted. We would appreciate if you publish your results. Enis Amandeep Khurana wrote: > Does the patch HADOOP-2536 support connecting to Oracle databases as > well? > Or is it just limited to MySQL? > > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > > >> >> > > -- View this message in context: http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22032715.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
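If Enis's broad guess is right -- that ORA-00933 comes from the generated statement ending in ";" -- the workaround is simply to trim the trailing semicolon before the SQL reaches the Oracle driver. A hedged sketch; the class and method names are invented, and a real fix would go into the query-building code in DBInputFormat itself:

```java
// Sketch of the workaround suggested by Enis's guess: Oracle may reject
// SQL that ends with ";" (ORA-00933) even though other databases tolerate
// it, so strip a trailing semicolon before handing the statement to JDBC.
// The class and method names here are invented for illustration.
public class SqlTrim {
    static String stripTrailingSemicolon(String sql) {
        String s = sql.trim();
        return s.endsWith(";") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingSemicolon("SELECT id, name FROM employees;"));
    }
}
```

Printing the query as Enis suggests (step 2 above) would confirm whether the semicolon is actually present before patching anything.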
Re: Namenode not listening for remote connections to port 9000
Hmmm - I checked all the /etc/hosts files, and they're all fine. Then I switched the conf/hadoop-site.xml to specify IP addresses instead of host names. Then, oddly enough, it starts working... Now the funny thing is this: It's fine ssh-ing to the correct machines to start up datanodes, but when the datanode thread tries to make the connection back to the namenode (from within a java app I assume) it doesn't resolve the names correctly. Just looking at this makes me want to think that the namenode does its ssh work in some non-java way, actually checking the hosts file, while the datanode does its thing in Java, which doesn't seem to check the hosts file. Could this be some Java funniness where it's not checking the hosts file? Now there's just one sucky thing about my setup: If I change my file with a list of datanodes (dfs.hosts property) to also have IP addresses instead of hostnames, then it fails. So parts of my config are specifying hostnames, and other parts are specifying IP addresses. Oh well - for development purposes this is good enough, 'cause out in the real world I won't be using the hosts file to string it all together. Thanks for the responses. Mark Kerzner wrote: I had a problem where it listened only on 8020, even though I told it to use 9000. On Fri, Feb 13, 2009 at 7:50 AM, Norbert Burger wrote: On Fri, Feb 13, 2009 at 8:37 AM, Steve Loughran wrote: Michael Lynch wrote: Hi, As far as I can tell I've followed the setup instructions for a hadoop cluster to the letter, but I find that the datanodes can't connect to the namenode on port 9000 because it is only listening for connections from localhost. In my case, the namenode is called centos1, and the datanode is called centos2. They are centos 5.1 servers with an unmodified sun java 6 runtime. fs.default.name takes a URL to the filesystem, such as hdfs://centos1:9000/. If the machine is only binding to localhost, that may mean DNS fun.
Try a fully qualified name instead (fs.default.name is defined in conf/hadoop-site.xml, overriding entries from conf/hadoop-default.xml). Also, check your /etc/hosts file on both machines. Could be that you have an incorrect setup where both localhost and the namenode hostname (centos1) are aliased to 127.0.0.1. Norbert
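On the question of whether Java "checks the hosts file": the JVM resolves names through the system resolver, which normally does consult /etc/hosts, so a quick probe on each machine shows what the datanode's JVM actually sees. A small sketch; "localhost" is just a placeholder -- on a real cluster you would pass the namenode's hostname (e.g. centos1):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Prints what the JVM resolves a hostname to. Java's InetAddress goes
// through the system resolver, which normally includes /etc/hosts, so
// running this on each node reveals whether a datanode would dial the
// loopback address instead of the namenode's real address. "localhost"
// is a stand-in; use the namenode's hostname on a real cluster.
public class ResolveCheck {
    static String resolve(String host) throws UnknownHostException {
        return InetAddress.getByName(host).getHostAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println("localhost -> " + resolve("localhost"));
    }
}
```

If this prints 127.0.0.1 for the namenode's hostname on a datanode, the hosts file (or its ordering) is the culprit, matching Norbert's diagnosis above.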
Re: Hostnames on MapReduce Web UI
Thanks, this did it. I changed my /etc/hosts file on each node from

127.0.0.1 localhost localhost.localdomain
127.0.0.1

to just switch the order:

127.0.0.1
127.0.0.1 localhost localhost.localdomain

This did the trick! I vaguely recall from somewhere that I need the localhost localhost.localdomain line, so I thought I'd better avoid removing it altogether. Thanks, John On Sun, Feb 15, 2009 at 10:38 AM, Nick Cen wrote: > Try commenting out the localhost definition in your /etc/hosts file. > > 2009/2/14 S D > > > I'm reviewing the task trackers on the web interface ( > > http://jobtracker-hostname:50030/) for my cluster of 3 machines. The > names > > of the task trackers do not list real domain names; e.g., one of the task > > trackers is listed as: > > > > tracker_localhost:localhost/127.0.0.1:48167 > > > > I believe that the networking on my machines is set correctly. What do I > > need to configure so that the listing above will show the actual domain > > name? This will help me in diagnosing where problems are occurring in my > > cluster. Note that at the top of the page the hostname (in my case > "storm") > > is properly listed; e.g., > > > > storm Hadoop Machine List > > > > Thanks, > > John > > > > > > -- > http://daily.appspot.com/food/ >
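Spelled out, the reordered file John describes would look something like the fragment below. This is a hypothetical reconstruction: "storm" (the hostname shown in the web UI above) stands in for each node's real hostname, which the quoted message does not show.

```
# /etc/hosts -- hypothetical example; "storm" stands in for each node's
# actual hostname. The line carrying the real hostname comes first, so a
# reverse lookup of 127.0.0.1 returns the machine name rather than
# "localhost", which is what the MapReduce web UI displays.
127.0.0.1   storm
127.0.0.1   localhost localhost.localdomain
```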
Re: HDFS architecture based on GFS?
In general, yeah, the scripts can access any resource they want (within the permissions of the user that the task runs as). It's also possible to access HDFS from scripts because HDFS provides a FUSE interface that can make it look like a regular file system on the machine. (The FUSE module in turn talks to the namenode as a regular HDFS client.)
Re: Race Condition?
I'm having difficulty capturing the output of any of the dfs commands (either in Ruby or on the command line). Supposedly the output is being sent to stdout yet just running any of the commands on the command line does not display the output nor does redirecting to a file (e.g., hadoop dfs -copyToLocal src dest > out.txt). I'm not sure what I'm missing here... John On Sun, Feb 15, 2009 at 11:28 PM, Matei Zaharia wrote: > I would capture the output of the dfs -copyToLocal command, because I still > think that is the most likely cause of the data not making it. I don't know > how to capture this output in Ruby but I'm sure it's possible. You want to > capture both standard out and standard error. > One other slim possibility is that if your localdir is a fixed absolute > path, multiple map tasks on the machine may be trying to access it > concurrently, and maybe one of them deletes it when it's done and one > doesn't. Normally each task should run in its own temp directory though. > > On Sun, Feb 15, 2009 at 2:51 PM, S D wrote: > > > I was not able to determine the command shell return value for > > > > hadoop dfs -copyToLocal #{s3dir} #{localdir} > > > > but I did print out several variables after the call and determined that > > the > > call apparently did not go through successfully. In particular, prior to > my > > processData(localdir) command I use Ruby's puts to print out the contents > > of > > several directories including 'localdir' and '../localdir' - here is the > > weird thing: if I execute the following > > list = `ls -l "#{localdir}"` > > puts "List: #{list}" > > (where 'localdir' is the directory I need as an arg for processData) the > > processData command will execute properly. 
At first I thought that > running > > the puts command was allowing enough time to elapse for a race condition > to > > be avoided so that 'localdir' was ready when the processData command was > > called (I know that in certain ways that doesn't make sense given that > > hadoop dfs -copyToLocal blocks until it completes...) but then I tried > > other > > time consuming commands such as > > list = `ls -l "../#{localdir}"` > > puts "List: #{list}" > > and running processData(localdir) led to an error: > > 'localdir' not found > > > > Any clues on what could be going on? > > > > Thanks, > > John > > > > > > > > On Sat, Feb 14, 2009 at 6:45 PM, Matei Zaharia > wrote: > > > > > Have you logged the output of the dfs command to see whether it's > always > > > succeeded the copy? > > > > > > On Sat, Feb 14, 2009 at 2:46 PM, S D wrote: > > > > > > > In my Hadoop 0.19.0 program each map function is assigned a directory > > > > (representing a data location in my S3 datastore). The first thing > each > > > map > > > > function does is copy the particular S3 data to the local machine > that > > > the > > > > map task is running on and then being processing the data; e.g., > > > > > > > > command = "hadoop dfs -copyToLocal #{s3dir} #{localdir}" > > > > system "#{command}" > > > > > > > > In the above, "s3dir" is a directory that creates "localdir" - my > > > > expectation is that "localdir" is created in the work directory for > the > > > > particular task attempt. Following this copy command I then run a > > > function > > > > that processes the data; e.g., > > > > > > > > processData(localdir) > > > > > > > > In some instances my map/reduce program crashes and when I examine > the > > > logs > > > > I get a message saying that "localdir" can not be found. This > confuses > > me > > > > since the hadoop shell command above is blocking so that localdir > > should > > > > exist by the time processData() is called. 
I've found that if I add > in > > > some > > > > diagnostic lines prior to processData() such as puts statements to > > print > > > > out > > > > variables, I never run into the problem of the localdir not being > > found. > > > It > > > > is almost as if localdir needs time to be created before the call to > > > > processData(). > > > > > > > > Has anyone encountered anything like this? Any suggestions on what > > could > > > be > > > > wrong are appreciated. > > > > > > > > Thanks, > > > > John > > > > > > > > > >
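One likely reason the redirection in this thread captures nothing is that `hadoop dfs` writes its diagnostics to stderr rather than stdout, and the exit status goes unchecked, so a failed copy looks identical to a successful one. The general pattern, sketched in Java for concreteness (from Ruby the equivalent is `Open3.capture3` plus checking the exit status; the `echo` command below is a stand-in for the real `hadoop dfs -copyToLocal src dest`):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

// Run an external command, capturing BOTH streams and the exit status.
// A plain "> out.txt" redirect captures only stdout; error messages
// typically go to stderr, so merge the streams and check the exit code
// before trusting that the command (e.g. an HDFS copy) succeeded.
public class RunAndCapture {
    static String run(String... cmd) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true);              // fold stderr into stdout
        Process p = pb.start();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        InputStream in = p.getInputStream();
        byte[] b = new byte[4096];
        int n;
        while ((n = in.read(b)) != -1) buf.write(b, 0, n);
        int exit = p.waitFor();
        if (exit != 0)                             // don't silently continue
            throw new RuntimeException("command failed (" + exit + "): " + buf);
        return buf.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("echo", "ok"));
    }
}
```

With the copy's exit status checked this way, a transient S3 or HDFS failure would surface immediately instead of later as a mysterious "localdir not found".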
Re: HDFS architecture based on GFS?
I dont know much about Hadoop streaming and have a quick question here. The snippets of code/programs that you attach into the map reduce job might want to access outside resources (like you mentioned). Now these might not need to go to the namenode right? For example a python script. How would it access the data? Would it ask the parent java process (in the tasktracker) to get the data or would it go and do stuff on its own? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia wrote: > Nope, typically the JobTracker just starts the process, and the tasktracker > talks directly to the namenode to get a pointer to the datanode, and then > directly to the datanode. > > On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana > wrote: > > > Alright.. Got it. > > > > Now, do the task trackers talk to the namenode and the data node directly > > or > > do they go through the job tracker for it? So, if my code is such that I > > need to access more files from the hdfs, would the job tracker get > involved > > or not? > > > > > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia > wrote: > > > > > Normally, HDFS files are accessed through the namenode. If there was a > > > malicious process though, then I imagine it could talk to a datanode > > > directly and request a specific block. > > > > > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana > > > wrote: > > > > > > > Ok. Got it. > > > > > > > > Now, when my job needs to access another file, does it go to the > > Namenode > > > > to > > > > get the block ids? How does the java process know where the files are > > and > > > > how to access them? 
> > > > > > > > > > > > Amandeep Khurana > > > > Computer Science Graduate Student > > > > University of California, Santa Cruz > > > > > > > > > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia > > > wrote: > > > > > > > > > I mentioned this case because even jobs written in Java can use the > > > HDFS > > > > > API > > > > > to talk to the NameNode and access the filesystem. People often do > > this > > > > > because their job needs to read a config file, some small data > table, > > > etc > > > > > and use this information in its map or reduce functions. In this > > case, > > > > you > > > > > open the second file separately in your mapper's init function and > > read > > > > > whatever you need from it. In general I wanted to point out that > you > > > > can't > > > > > know which files a job will access unless you look at its source > code > > > or > > > > > monitor the calls it makes; the input file(s) you provide in the > job > > > > > description are a hint to the MapReduce framework to place your job > > on > > > > > certain nodes, but it's reasonable for the job to access other > files > > as > > > > > well. > > > > > > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana < > ama...@gmail.com> > > > > > wrote: > > > > > > > > > > > Another question that I have here - When the jobs run arbitrary > > code > > > > and > > > > > > access data from the HDFS, do they go to the namenode to get the > > > block > > > > > > information? > > > > > > > > > > > > > > > > > > Amandeep Khurana > > > > > > Computer Science Graduate Student > > > > > > University of California, Santa Cruz > > > > > > > > > > > > > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana < > > ama...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Assuming that the job is purely in Java and not involving > > streaming > > > > or > > > > > > > pipes, wouldnt the resources (files) required by the job as > > inputs > > > be > > > > > > known > > > > > > > beforehand? 
So, if the map task is accessing a second file, how > > > does > > > > it > > > > > > make > > > > > > > it different except that there are multiple files. The > JobTracker > > > > would > > > > > > know > > > > > > > beforehand that multiple files would be accessed. Right? > > > > > > > > > > > > > > I am slightly confused why you have mentioned this case > > > separately... > > > > > Can > > > > > > > you elaborate on it a little bit? > > > > > > > > > > > > > > Amandeep > > > > > > > > > > > > > > > > > > > > > Amandeep Khurana > > > > > > > Computer Science Graduate Student > > > > > > > University of California, Santa Cruz > > > > > > > > > > > > > > > > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia < > > ma...@cloudera.com > > > > > > > > > > wrote: > > > > > > > > > > > > > >> Typically the data flow is like this:1) Client submits a job > > > > > description > > > > > > >> to > > > > > > >> the JobTracker. > > > > > > >> 2) JobTracker figures out block locations for the input > file(s) > > by > > > > > > talking > > > > > > >> to HDFS NameNode. > > > > > > >> 3) JobTracker creates a job description file in HDFS which > will > > be > > > > > read > > >
Re: JvmMetrics
Hey David -- In case no one has pointed you to this: you can submit this through JIRA. Brian On Feb 14, 2009, at 12:07 AM, David Alves wrote: Hi, I ran into a use case where I need to keep two contexts for metrics: one being Ganglia and the other being a file context (to do offline metrics analysis). I altered JvmMetrics to allow the user to supply a context instead of getting one by name, and altered FileContext so that it can timestamp metrics collection (like log4j does). Would be glad to submit a patch if anyone is interested. Regards
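David's two-context use case is exactly what the stock metrics configuration cannot express: conf/hadoop-metrics.properties selects a single context class per metrics group. A sketch of that file (property names follow the 0.19-era metrics framework; the Ganglia server name and file path are invented examples):

```
# conf/hadoop-metrics.properties -- one context class per metrics group.
# Either send JVM metrics to Ganglia ...
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=ganglia-host:8649

# ... or log them to a file for offline analysis (but not both at once,
# which is the limitation David's patch works around):
# jvm.class=org.apache.hadoop.metrics.file.FileContext
# jvm.fileName=/tmp/jvm-metrics.log
# jvm.period=10
```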
Re: Race Condition?
I would capture the output of the dfs -copyToLocal command, because I still think that is the most likely cause of the data not making it. I don't know how to capture this output in Ruby but I'm sure it's possible. You want to capture both standard out and standard error. One other slim possibility is that if your localdir is a fixed absolute path, multiple map tasks on the machine may be trying to access it concurrently, and maybe one of them deletes it when it's done and one doesn't. Normally each task should run in its own temp directory though.
Re: HDFS on non-identical nodes
On Feb 15, 2009, at 3:21 AM, Deepak wrote: Thanks Brian and Chen! I finally sorted out why the cluster stopped after running out of space: it's because of a master failure due to disk space. Regarding the automatic balancer, I guess in our case the rate of copying is faster than the balancer's rate; we found the balancer does start but couldn't perform its job. There are parameters you can set which control how quickly the balancer is allowed to copy files about. Nevertheless, you shouldn't rely on it to work for anything performance critical -- you'll probably want to ensure there's enough space around to do your work in the short-term. Brian Anyways thanks for your help! It helped me sort out some things. Cheers, Deepak On Thu, Feb 12, 2009 at 5:32 PM, He Chen wrote: I think you should confirm your balancer is still running. Did you change the threshold of the HDFS balancer? Maybe it is too large? The balancer will stop working when it meets one of 5 conditions: 1. Datanodes are balanced (obviously you are not this kind); 2. No more blocks to be moved (all blocks on unbalanced nodes are busy or recently used); 3. No more blocks to be moved in 20 minutes and 5 consecutive attempts; 4. Another balancer is working; 5. I/O exception. The default setting is 10% for each datanode: for 1TB it is 100GB, for 3TB it is 300GB, and for 60GB it is 6GB. Hope this is helpful. On Thu, Feb 12, 2009 at 10:06 AM, Brian Bockelman wrote: On Feb 12, 2009, at 2:54 AM, Deepak wrote: Hi, We're running a Hadoop cluster on 4 nodes; our primary purpose is to provide a distributed storage solution for internal applications here at TellyTopia Inc. Our cluster consists of non-identical nodes (one with 1TB, another two with 3TB, and one more with 60GB). While copying data to HDFS we noticed that the node with 60GB of storage ran out of disk space, and even the balancer couldn't balance because the cluster was stopped. Now my questions are: 1. Is Hadoop suitable for non-identical cluster nodes? Yes.
Our cluster has between 60GB and 40TB on our nodes. The majority have around 3TB. 2. Is there any way to automatically balance the nodes? We have a cron script which automatically starts the Balancer. It's dirty, but it works. 3. Why does the Hadoop cluster stop when one node runs out of disk? That's not normal. Trust me, if that was always true, we'd be perpetually screwed :) There might be some other underlying error you're missing... Brian Any further inputs are appreciated! Cheers, Deepak TellyTopia Inc. -- Chen He RCF CSE Dept. University of Nebraska-Lincoln US -- Deepak TellyTopia Inc.
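The cron script Brian mentions can be as small as a single crontab line. Everything here is a hypothetical example -- the install path, log path, schedule, and the 5% value -- but the -threshold argument is the per-datanode deviation He Chen describes (default 10%):

```
# Hypothetical crontab entry: start the balancer nightly at 02:00 with a
# 5% threshold (the default is 10%). start-balancer.sh simply exits if
# another balancer is already running, so overlapping runs are harmless.
0 2 * * * /opt/hadoop/bin/start-balancer.sh -threshold 5 >> /var/log/hadoop-balancer.log 2>&1
```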
Re: HDFS architecture based on GFS?
Nope, typically the JobTracker just starts the process, and the tasktracker talks directly to the namenode to get a pointer to the datanode, and then directly to the datanode. On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana wrote: > Alright.. Got it. > > Now, do the task trackers talk to the namenode and the data node directly > or > do they go through the job tracker for it? So, if my code is such that I > need to access more files from the hdfs, would the job tracker get involved > or not? > > > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia wrote: > > > Normally, HDFS files are accessed through the namenode. If there was a > > malicious process though, then I imagine it could talk to a datanode > > directly and request a specific block. > > > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana > > wrote: > > > > > Ok. Got it. > > > > > > Now, when my job needs to access another file, does it go to the > Namenode > > > to > > > get the block ids? How does the java process know where the files are > and > > > how to access them? > > > > > > > > > Amandeep Khurana > > > Computer Science Graduate Student > > > University of California, Santa Cruz > > > > > > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia > > wrote: > > > > > > > I mentioned this case because even jobs written in Java can use the > > HDFS > > > > API > > > > to talk to the NameNode and access the filesystem. People often do > this > > > > because their job needs to read a config file, some small data table, > > etc > > > > and use this information in its map or reduce functions. In this > case, > > > you > > > > open the second file separately in your mapper's init function and > read > > > > whatever you need from it. 
In general I wanted to point out that you > > > can't > > > > know which files a job will access unless you look at its source code > > or > > > > monitor the calls it makes; the input file(s) you provide in the job > > > > description are a hint to the MapReduce framework to place your job > on > > > > certain nodes, but it's reasonable for the job to access other files > as > > > > well. > > > > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana > > > > wrote: > > > > > > > > > Another question that I have here - When the jobs run arbitrary > code > > > and > > > > > access data from the HDFS, do they go to the namenode to get the > > block > > > > > information? > > > > > > > > > > > > > > > Amandeep Khurana > > > > > Computer Science Graduate Student > > > > > University of California, Santa Cruz > > > > > > > > > > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana < > ama...@gmail.com> > > > > > wrote: > > > > > > > > > > > Assuming that the job is purely in Java and not involving > streaming > > > or > > > > > > pipes, wouldnt the resources (files) required by the job as > inputs > > be > > > > > known > > > > > > beforehand? So, if the map task is accessing a second file, how > > does > > > it > > > > > make > > > > > > it different except that there are multiple files. The JobTracker > > > would > > > > > know > > > > > > beforehand that multiple files would be accessed. Right? > > > > > > > > > > > > I am slightly confused why you have mentioned this case > > separately... > > > > Can > > > > > > you elaborate on it a little bit? 
> > > > > > > > > > > > Amandeep > > > > > > > > > > > > > > > > > > Amandeep Khurana > > > > > > Computer Science Graduate Student > > > > > > University of California, Santa Cruz > > > > > > > > > > > > > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia < > ma...@cloudera.com > > > > > > > > wrote: > > > > > > > > > > > >> Typically the data flow is like this:1) Client submits a job > > > > description > > > > > >> to > > > > > >> the JobTracker. > > > > > >> 2) JobTracker figures out block locations for the input file(s) > by > > > > > talking > > > > > >> to HDFS NameNode. > > > > > >> 3) JobTracker creates a job description file in HDFS which will > be > > > > read > > > > > by > > > > > >> the nodes to copy over the job's code etc. > > > > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with > > the > > > > > >> appropriate data blocks. > > > > > >> 5) After running, maps create intermediate output files on those > > > > slaves. > > > > > >> These are not in HDFS, they're in some temporary storage used by > > > > > >> MapReduce. > > > > > >> 6) JobTracker starts reduces on a series of slaves, which copy > > over > > > > the > > > > > >> appropriate map outputs, apply the reduce function, and write > the > > > > > outputs > > > > > >> to > > > > > >> HDFS (one output file per reducer). > > > > > >> 7) Some logs for the job may also be put into HDFS by the > > > JobTracker. > > > > > >> > > > > > >> However, there is a big caveat, which is that the map and reduce > > > tasks > > > > > run > > > > > >> arbitrary code. It is not unusu
Re: HDFS architecture based on GFS?
Alright.. Got it. Now, do the task trackers talk to the namenode and the data node directly or do they go through the job tracker for it? So, if my code is such that I need to access more files from the hdfs, would the job tracker get involved or not? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia wrote: > Normally, HDFS files are accessed through the namenode. If there was a > malicious process though, then I imagine it could talk to a datanode > directly and request a specific block. > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana > wrote: > > > Ok. Got it. > > > > Now, when my job needs to access another file, does it go to the Namenode > > to > > get the block ids? How does the java process know where the files are and > > how to access them? > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia > wrote: > > > > > I mentioned this case because even jobs written in Java can use the > HDFS > > > API > > > to talk to the NameNode and access the filesystem. People often do this > > > because their job needs to read a config file, some small data table, > etc > > > and use this information in its map or reduce functions. In this case, > > you > > > open the second file separately in your mapper's init function and read > > > whatever you need from it. In general I wanted to point out that you > > can't > > > know which files a job will access unless you look at its source code > or > > > monitor the calls it makes; the input file(s) you provide in the job > > > description are a hint to the MapReduce framework to place your job on > > > certain nodes, but it's reasonable for the job to access other files as > > > well. 
> > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana > > > wrote: > > > > > > > Another question that I have here - When the jobs run arbitrary code > > and > > > > access data from the HDFS, do they go to the namenode to get the > block > > > > information? > > > > > > > > > > > > Amandeep Khurana > > > > Computer Science Graduate Student > > > > University of California, Santa Cruz > > > > > > > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana > > > > wrote: > > > > > > > > > Assuming that the job is purely in Java and not involving streaming > > or > > > > > pipes, wouldnt the resources (files) required by the job as inputs > be > > > > known > > > > > beforehand? So, if the map task is accessing a second file, how > does > > it > > > > make > > > > > it different except that there are multiple files. The JobTracker > > would > > > > know > > > > > beforehand that multiple files would be accessed. Right? > > > > > > > > > > I am slightly confused why you have mentioned this case > separately... > > > Can > > > > > you elaborate on it a little bit? > > > > > > > > > > Amandeep > > > > > > > > > > > > > > > Amandeep Khurana > > > > > Computer Science Graduate Student > > > > > University of California, Santa Cruz > > > > > > > > > > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia > > > > > wrote: > > > > > > > > > >> Typically the data flow is like this:1) Client submits a job > > > description > > > > >> to > > > > >> the JobTracker. > > > > >> 2) JobTracker figures out block locations for the input file(s) by > > > > talking > > > > >> to HDFS NameNode. > > > > >> 3) JobTracker creates a job description file in HDFS which will be > > > read > > > > by > > > > >> the nodes to copy over the job's code etc. > > > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with > the > > > > >> appropriate data blocks. > > > > >> 5) After running, maps create intermediate output files on those > > > slaves. 
> > > > >> These are not in HDFS, they're in some temporary storage used by > > > > >> MapReduce. > > > > >> 6) JobTracker starts reduces on a series of slaves, which copy > over > > > the > > > > >> appropriate map outputs, apply the reduce function, and write the > > > > outputs > > > > >> to > > > > >> HDFS (one output file per reducer). > > > > >> 7) Some logs for the job may also be put into HDFS by the > > JobTracker. > > > > >> > > > > >> However, there is a big caveat, which is that the map and reduce > > tasks > > > > run > > > > >> arbitrary code. It is not unusual to have a map that opens a > second > > > HDFS > > > > >> file to read some information (e.g. for doing a join of a small > > table > > > > >> against a big file). If you use Hadoop Streaming or Pipes to write > a > > > job > > > > >> in > > > > >> Python, Ruby, C, etc, then you are launching arbitrary processes > > which > > > > may > > > > >> also access external resources in this manner. Some people also > > > > read/write > > > > >> to DBs (e.g. MySQL) from their tasks. A comprehensive security > > > solution > > > > >> would ideally deal with these cases too. > > >
Re: HDFS architecture based on GFS?
Normally, HDFS files are accessed through the namenode. If there was a malicious process though, then I imagine it could talk to a datanode directly and request a specific block. On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana wrote: > Ok. Got it. > > Now, when my job needs to access another file, does it go to the Namenode > to > get the block ids? How does the java process know where the files are and > how to access them? > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia wrote: > > > I mentioned this case because even jobs written in Java can use the HDFS > > API > > to talk to the NameNode and access the filesystem. People often do this > > because their job needs to read a config file, some small data table, etc > > and use this information in its map or reduce functions. In this case, > you > > open the second file separately in your mapper's init function and read > > whatever you need from it. In general I wanted to point out that you > can't > > know which files a job will access unless you look at its source code or > > monitor the calls it makes; the input file(s) you provide in the job > > description are a hint to the MapReduce framework to place your job on > > certain nodes, but it's reasonable for the job to access other files as > > well. > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana > > wrote: > > > > > Another question that I have here - When the jobs run arbitrary code > and > > > access data from the HDFS, do they go to the namenode to get the block > > > information? 
> > > > > > > > > Amandeep Khurana > > > Computer Science Graduate Student > > > University of California, Santa Cruz > > > > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana > > > wrote: > > > > > > > Assuming that the job is purely in Java and not involving streaming > or > > > > pipes, wouldnt the resources (files) required by the job as inputs be > > > known > > > > beforehand? So, if the map task is accessing a second file, how does > it > > > make > > > > it different except that there are multiple files. The JobTracker > would > > > know > > > > beforehand that multiple files would be accessed. Right? > > > > > > > > I am slightly confused why you have mentioned this case separately... > > Can > > > > you elaborate on it a little bit? > > > > > > > > Amandeep > > > > > > > > > > > > Amandeep Khurana > > > > Computer Science Graduate Student > > > > University of California, Santa Cruz > > > > > > > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia > > > wrote: > > > > > > > >> Typically the data flow is like this:1) Client submits a job > > description > > > >> to > > > >> the JobTracker. > > > >> 2) JobTracker figures out block locations for the input file(s) by > > > talking > > > >> to HDFS NameNode. > > > >> 3) JobTracker creates a job description file in HDFS which will be > > read > > > by > > > >> the nodes to copy over the job's code etc. > > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the > > > >> appropriate data blocks. > > > >> 5) After running, maps create intermediate output files on those > > slaves. > > > >> These are not in HDFS, they're in some temporary storage used by > > > >> MapReduce. > > > >> 6) JobTracker starts reduces on a series of slaves, which copy over > > the > > > >> appropriate map outputs, apply the reduce function, and write the > > > outputs > > > >> to > > > >> HDFS (one output file per reducer). > > > >> 7) Some logs for the job may also be put into HDFS by the > JobTracker. 
> > > >> > > > >> However, there is a big caveat, which is that the map and reduce > tasks > > > run > > > >> arbitrary code. It is not unusual to have a map that opens a second > > HDFS > > > >> file to read some information (e.g. for doing a join of a small > table > > > >> against a big file). If you use Hadoop Streaming or Pipes to write a > > job > > > >> in > > > >> Python, Ruby, C, etc, then you are launching arbitrary processes > which > > > may > > > >> also access external resources in this manner. Some people also > > > read/write > > > >> to DBs (e.g. MySQL) from their tasks. A comprehensive security > > solution > > > >> would ideally deal with these cases too. > > > >> > > > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana > > > > >> wrote: > > > >> > > > >> > A quick question here. How does a typical hadoop job work at the > > > system > > > >> > level? What are the various interactions and how does the data > flow? > > > >> > > > > >> > Amandeep > > > >> > > > > >> > > > > >> > Amandeep Khurana > > > >> > Computer Science Graduate Student > > > >> > University of California, Santa Cruz > > > >> > > > > >> > > > > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana < > ama...@gmail.com > > > > > > >> > wrote: > > > >> > > > > >> > > Thanks Matei. If the basic architecture is similar to the Google > > > >> stuff, I > > >
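The client side of step 1 in Matei's data flow can be sketched as follows, using the old (0.19-era) `org.apache.hadoop.mapred` API; the class name and HDFS paths are made up for illustration.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        // Build the job description (step 1).
        JobConf conf = new JobConf(ExampleDriver.class);
        conf.setJobName("example-job");

        // These input paths are only a placement hint to the framework;
        // as discussed above, tasks may still open other files at runtime.
        FileInputFormat.setInputPaths(conf, new Path("/user/amandeep/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/amandeep/output"));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // JobClient submits the description to the JobTracker, which then
        // asks the NameNode for block locations (step 2) and schedules
        // map tasks on TaskTrackers near those blocks (step 4).
        JobClient.runJob(conf);
    }
}
```

This is a sketch only; it needs the Hadoop jars on the classpath and a running cluster (or local job runner) to execute.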
Re: setting up networking and ssh on multinode cluster...
okay, i will heed the tip on the 127 address set. here is the result of ssh 192.168.0.2... a...@node0:~$ ssh 192.168.0.2 ssh: connect to host 192.168.0.2 port 22: Connection timed out a...@node0:~$ the boxes are just connected with a cat5 cable. i have not done this with the hadoop account but af is my normal account and i figure it should work too. /etc/init.d/interfaces is empty/does not exist on the machines. (i am using ubuntu 8.10) please advise. Norbert Burger wrote: > > Fwiw, the extra references to 127.0.1.1 in each host file aren't > necessary. > > From node0, does 'ssh 192.168.0.2' work? If not, then the issue isn't > name > resolution -- take look at the network configs (eg., > /etc/init.d/interfaces) > on each machine. > > Norbert > > On Sun, Feb 15, 2009 at 7:31 PM, zander1013 wrote: > >> >> okay, >> >> i have changed /etc/hosts to look like this for node0... >> >> 127.0.0.1 localhost >> 127.0.1.1 node0 >> >> # /etc/hosts (for hadoop master and slave) >> 192.168.0.1 node0 >> 192.168.0.2 node1 >> #end hadoop section >> >> # The following lines are desirable for IPv6 capable hosts >> ::1 ip6-localhost ip6-loopback >> fe00::0 ip6-localnet >> ff00::0 ip6-mcastprefix >> ff02::1 ip6-allnodes >> ff02::2 ip6-allrouters >> ff02::3 ip6-allhosts >> >> ...and this for node1... >> >> 127.0.0.1 localhost >> 127.0.1.1 node1 >> >> # /etc/hosts (for hadoop master and slave) >> 192.168.0.1 node0 >> 192.168.0.2 node1 >> #end hadoop section >> >> # The following lines are desirable for IPv6 capable hosts >> ::1 ip6-localhost ip6-loopback >> fe00::0 ip6-localnet >> ff00::0 ip6-mcastprefix >> ff02::1 ip6-allnodes >> ff02::2 ip6-allrouters >> ff02::3 ip6-allhosts >> >> ... the machines are connected by a cat5 cable, they have wifi and are >> showing that they are connected to my wlan. also i have enabled all the >> user >> privleges in the user manager on both machines. here are the results from >> ssh on node0... 
>> >> had...@node0:~$ ssh node0 >> Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686 >> >> The programs included with the Ubuntu system are free software; >> the exact distribution terms for each program are described in the >> individual files in /usr/share/doc/*/copyright. >> >> Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by >> applicable law. >> >> To access official Ubuntu documentation, please visit: >> http://help.ubuntu.com/ >> Last login: Sun Feb 15 16:00:28 2009 from node0 >> had...@node0:~$ exit >> logout >> Connection to node0 closed. >> had...@node0:~$ ssh node1 >> ssh: connect to host node1 port 22: Connection timed out >> had...@node0:~$ >> >> i will look into the link that you gave. >> >> -zander >> >> >> Norbert Burger wrote: >> > >> >> >> >> i have commented out the 192. addresses and changed 127.0.1.1 for >> node0 >> >> and >> >> 127.0.1.2 for node0 (in /etc/hosts). with this done i can ssh from one >> >> machine to itself and to the other but the prompt does not change when >> i >> >> ssh >> >> to the other machine. i don't know if there is a firewall preventing >> me >> >> from >> >> ssh or not. i have not set any up to prevent ssh and i have not taken >> >> action >> >> to specifically allow ssh other than what was prescribed in the >> tutorial >> >> for >> >> a single node cluster for both machines. >> > >> > >> > Why are you using 127.* addresses for your nodes? These fall into the >> > block >> > of IPs reserved for the loopback network (see >> > http://en.wikipedia.org/wiki/Loopback). >> > >> > Try changing both nodes back to 192.168, and re-starting Hadoop. >> > >> > Norbert >> > >> > >> >> -- >> View this message in context: >> http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22029812.html >> Sent from the Hadoop core-user mailing list archive at Nabble.com. 
>> >> > > -- View this message in context: http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22031038.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
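A "Connection timed out" on port 22 with a working loopback ssh usually means either no SSH daemon on the target or no usable route over the cable. A diagnostic sketch for this thread's setup (assuming Ubuntu 8.10 on both nodes; IPs and hostnames are the ones from the messages above):

```shell
# 1. Ubuntu desktop installs ship without an SSH daemon; install it on node1:
sudo apt-get install openssh-server

# 2. On Ubuntu the interface configuration lives in /etc/network/interfaces
#    (not /etc/init.d/interfaces). Give the direct cat5 link a static IP,
#    e.g. on node1:
#      auto eth0
#      iface eth0 inet static
#          address 192.168.0.2
#          netmask 255.255.255.0
#    then: sudo /etc/init.d/networking restart

# 3. From node0, confirm the link first, then the daemon:
ping -c 3 192.168.0.2
nc -zv 192.168.0.2 22   # should report the port open once sshd is running
```

If ping already fails, the problem is the interface config or the cable, not ssh or /etc/hosts.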
Re: HDFS architecture based on GFS?
Ok. Got it. Now, when my job needs to access another file, does it go to the Namenode to get the block ids? How does the java process know where the files are and how to access them? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia wrote: > I mentioned this case because even jobs written in Java can use the HDFS > API > to talk to the NameNode and access the filesystem. People often do this > because their job needs to read a config file, some small data table, etc > and use this information in its map or reduce functions. In this case, you > open the second file separately in your mapper's init function and read > whatever you need from it. In general I wanted to point out that you can't > know which files a job will access unless you look at its source code or > monitor the calls it makes; the input file(s) you provide in the job > description are a hint to the MapReduce framework to place your job on > certain nodes, but it's reasonable for the job to access other files as > well. > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana > wrote: > > > Another question that I have here - When the jobs run arbitrary code and > > access data from the HDFS, do they go to the namenode to get the block > > information? > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana > > wrote: > > > > > Assuming that the job is purely in Java and not involving streaming or > > > pipes, wouldnt the resources (files) required by the job as inputs be > > known > > > beforehand? So, if the map task is accessing a second file, how does it > > make > > > it different except that there are multiple files. The JobTracker would > > know > > > beforehand that multiple files would be accessed. Right? > > > > > > I am slightly confused why you have mentioned this case separately... 
> Can > > > you elaborate on it a little bit? > > > > > > Amandeep > > > > > > > > > Amandeep Khurana > > > Computer Science Graduate Student > > > University of California, Santa Cruz > > > > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia > > wrote: > > > > > >> Typically the data flow is like this:1) Client submits a job > description > > >> to > > >> the JobTracker. > > >> 2) JobTracker figures out block locations for the input file(s) by > > talking > > >> to HDFS NameNode. > > >> 3) JobTracker creates a job description file in HDFS which will be > read > > by > > >> the nodes to copy over the job's code etc. > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the > > >> appropriate data blocks. > > >> 5) After running, maps create intermediate output files on those > slaves. > > >> These are not in HDFS, they're in some temporary storage used by > > >> MapReduce. > > >> 6) JobTracker starts reduces on a series of slaves, which copy over > the > > >> appropriate map outputs, apply the reduce function, and write the > > outputs > > >> to > > >> HDFS (one output file per reducer). > > >> 7) Some logs for the job may also be put into HDFS by the JobTracker. > > >> > > >> However, there is a big caveat, which is that the map and reduce tasks > > run > > >> arbitrary code. It is not unusual to have a map that opens a second > HDFS > > >> file to read some information (e.g. for doing a join of a small table > > >> against a big file). If you use Hadoop Streaming or Pipes to write a > job > > >> in > > >> Python, Ruby, C, etc, then you are launching arbitrary processes which > > may > > >> also access external resources in this manner. Some people also > > read/write > > >> to DBs (e.g. MySQL) from their tasks. A comprehensive security > solution > > >> would ideally deal with these cases too. > > >> > > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana > > >> wrote: > > >> > > >> > A quick question here. 
How does a typical hadoop job work at the > > system > > >> > level? What are the various interactions and how does the data flow? > > >> > > > >> > Amandeep > > >> > > > >> > > > >> > Amandeep Khurana > > >> > Computer Science Graduate Student > > >> > University of California, Santa Cruz > > >> > > > >> > > > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana > > > >> > wrote: > > >> > > > >> > > Thanks Matei. If the basic architecture is similar to the Google > > >> stuff, I > > >> > > can safely just work on the project using the information from the > > >> > papers. > > >> > > > > >> > > I am aware of the 4487 jira and the current status of the > > permissions > > >> > > mechanism. I had a look at them earlier. > > >> > > > > >> > > Cheers > > >> > > Amandeep > > >> > > > > >> > > > > >> > > Amandeep Khurana > > >> > > Computer Science Graduate Student > > >> > > University of California, Santa Cruz > > >> > > > > >> > > > > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia < > ma...@cloudera.com> > > >>
Re: HDFS architecture based on GFS?
I mentioned this case because even jobs written in Java can use the HDFS API to talk to the NameNode and access the filesystem. People often do this because their job needs to read a config file, some small data table, etc and use this information in its map or reduce functions. In this case, you open the second file separately in your mapper's init function and read whatever you need from it. In general I wanted to point out that you can't know which files a job will access unless you look at its source code or monitor the calls it makes; the input file(s) you provide in the job description are a hint to the MapReduce framework to place your job on certain nodes, but it's reasonable for the job to access other files as well. On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana wrote: > Another question that I have here - When the jobs run arbitrary code and > access data from the HDFS, do they go to the namenode to get the block > information? > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana > wrote: > > > Assuming that the job is purely in Java and not involving streaming or > > pipes, wouldnt the resources (files) required by the job as inputs be > known > > beforehand? So, if the map task is accessing a second file, how does it > make > > it different except that there are multiple files. The JobTracker would > know > > beforehand that multiple files would be accessed. Right? > > > > I am slightly confused why you have mentioned this case separately... Can > > you elaborate on it a little bit? > > > > Amandeep > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia > wrote: > > > >> Typically the data flow is like this:1) Client submits a job description > >> to > >> the JobTracker. 
> >> 2) JobTracker figures out block locations for the input file(s) by > talking > >> to HDFS NameNode. > >> 3) JobTracker creates a job description file in HDFS which will be read > by > >> the nodes to copy over the job's code etc. > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the > >> appropriate data blocks. > >> 5) After running, maps create intermediate output files on those slaves. > >> These are not in HDFS, they're in some temporary storage used by > >> MapReduce. > >> 6) JobTracker starts reduces on a series of slaves, which copy over the > >> appropriate map outputs, apply the reduce function, and write the > outputs > >> to > >> HDFS (one output file per reducer). > >> 7) Some logs for the job may also be put into HDFS by the JobTracker. > >> > >> However, there is a big caveat, which is that the map and reduce tasks > run > >> arbitrary code. It is not unusual to have a map that opens a second HDFS > >> file to read some information (e.g. for doing a join of a small table > >> against a big file). If you use Hadoop Streaming or Pipes to write a job > >> in > >> Python, Ruby, C, etc, then you are launching arbitrary processes which > may > >> also access external resources in this manner. Some people also > read/write > >> to DBs (e.g. MySQL) from their tasks. A comprehensive security solution > >> would ideally deal with these cases too. > >> > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana > >> wrote: > >> > >> > A quick question here. How does a typical hadoop job work at the > system > >> > level? What are the various interactions and how does the data flow? > >> > > >> > Amandeep > >> > > >> > > >> > Amandeep Khurana > >> > Computer Science Graduate Student > >> > University of California, Santa Cruz > >> > > >> > > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana > >> > wrote: > >> > > >> > > Thanks Matei. 
If the basic architecture is similar to the Google > >> stuff, I > >> > > can safely just work on the project using the information from the > >> > papers. > >> > > > >> > > I am aware of the 4487 jira and the current status of the > permissions > >> > > mechanism. I had a look at them earlier. > >> > > > >> > > Cheers > >> > > Amandeep > >> > > > >> > > > >> > > Amandeep Khurana > >> > > Computer Science Graduate Student > >> > > University of California, Santa Cruz > >> > > > >> > > > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia > >> > wrote: > >> > > > >> > >> Forgot to add, this JIRA details the latest security features that > >> are > >> > >> being > >> > >> worked on in Hadoop trunk: > >> > >> https://issues.apache.org/jira/browse/HADOOP-4487. > >> > >> This document describes the current status and limitations of the > >> > >> permissions mechanism: > >> > >> > >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html. > >> > >> > >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia > > >> > >> wrote: > >> > >> > >> > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS > at > >> > the > >> > >> > le
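The pattern Matei describes -- opening a second HDFS file in the mapper's init function -- looks roughly like this with the old (0.19-era) `org.apache.hadoop.mapred` API. The side-table path and the tab-separated record format are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> sideTable = new HashMap<String, String>();

    // configure() is the mapper's init hook; it runs once per task before
    // any map() calls.
    public void configure(JobConf job) {
        try {
            // This goes through the normal HDFS client path: the NameNode
            // is asked for the block locations, then the blocks are read
            // directly from the datanodes.
            FileSystem fs = FileSystem.get(job);
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    fs.open(new Path("/data/lookup.txt"))));
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) sideTable.put(kv[0], kv[1]);
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("failed to load side table", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Join each input record against the small table loaded above.
        String[] kv = value.toString().split("\t", 2);
        String joined = sideTable.get(kv[0]);
        if (joined != null) out.collect(new Text(kv[0]), new Text(joined));
    }
}
```

This is exactly the access the thread is asking about: nothing goes through the JobTracker, and the framework has no way of knowing from the job description that `/data/lookup.txt` will be read.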
Re: setting up networking and ssh on multinode cluster...
FWIW, the extra references to 127.0.1.1 in each host file aren't necessary. From node0, does 'ssh 192.168.0.2' work? If not, then the issue isn't name resolution -- take a look at the network configs (e.g., /etc/network/interfaces) on each machine. Norbert On Sun, Feb 15, 2009 at 7:31 PM, zander1013 wrote: > > okay, > > i have changed /etc/hosts to look like this for node0... > > 127.0.0.1 localhost > 127.0.1.1 node0 > > # /etc/hosts (for hadoop master and slave) > 192.168.0.1 node0 > 192.168.0.2 node1 > #end hadoop section > > # The following lines are desirable for IPv6 capable hosts > ::1 ip6-localhost ip6-loopback > fe00::0 ip6-localnet > ff00::0 ip6-mcastprefix > ff02::1 ip6-allnodes > ff02::2 ip6-allrouters > ff02::3 ip6-allhosts > > ...and this for node1... > > 127.0.0.1 localhost > 127.0.1.1 node1 > > # /etc/hosts (for hadoop master and slave) > 192.168.0.1 node0 > 192.168.0.2 node1 > #end hadoop section > > # The following lines are desirable for IPv6 capable hosts > ::1 ip6-localhost ip6-loopback > fe00::0 ip6-localnet > ff00::0 ip6-mcastprefix > ff02::1 ip6-allnodes > ff02::2 ip6-allrouters > ff02::3 ip6-allhosts > > ... the machines are connected by a cat5 cable, they have wifi and are > showing that they are connected to my wlan. also i have enabled all the > user > privileges in the user manager on both machines. here are the results from > ssh on node0... > > had...@node0:~$ ssh node0 > Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686 > > The programs included with the Ubuntu system are free software; > the exact distribution terms for each program are described in the > individual files in /usr/share/doc/*/copyright. > > Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by > applicable law. > > To access official Ubuntu documentation, please visit: > http://help.ubuntu.com/ > Last login: Sun Feb 15 16:00:28 2009 from node0 > had...@node0:~$ exit > logout > Connection to node0 closed. 
> had...@node0:~$ ssh node1 > ssh: connect to host node1 port 22: Connection timed out > had...@node0:~$ > > i will look into the link that you gave. > > -zander > > > Norbert Burger wrote: > > > >> > >> i have commented out the 192. addresses and changed 127.0.1.1 for node0 > >> and > >> 127.0.1.2 for node0 (in /etc/hosts). with this done i can ssh from one > >> machine to itself and to the other but the prompt does not change when i > >> ssh > >> to the other machine. i don't know if there is a firewall preventing me > >> from > >> ssh or not. i have not set any up to prevent ssh and i have not taken > >> action > >> to specifically allow ssh other than what was prescribed in the tutorial > >> for > >> a single node cluster for both machines. > > > > > > Why are you using 127.* addresses for your nodes? These fall into the > > block > > of IPs reserved for the loopback network (see > > http://en.wikipedia.org/wiki/Loopback). > > > > Try changing both nodes back to 192.168, and re-starting Hadoop. > > > > Norbert > > > > > > -- > View this message in context: > http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22029812.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > >
Re: HDFS architecture based on GFS?
Another question that I have here - When the jobs run arbitrary code and access data from the HDFS, do they go to the namenode to get the block information? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana wrote: > Assuming that the job is purely in Java and not involving streaming or > pipes, wouldnt the resources (files) required by the job as inputs be known > beforehand? So, if the map task is accessing a second file, how does it make > it different except that there are multiple files. The JobTracker would know > beforehand that multiple files would be accessed. Right? > > I am slightly confused why you have mentioned this case separately... Can > you elaborate on it a little bit? > > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia wrote: > >> Typically the data flow is like this:1) Client submits a job description >> to >> the JobTracker. >> 2) JobTracker figures out block locations for the input file(s) by talking >> to HDFS NameNode. >> 3) JobTracker creates a job description file in HDFS which will be read by >> the nodes to copy over the job's code etc. >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the >> appropriate data blocks. >> 5) After running, maps create intermediate output files on those slaves. >> These are not in HDFS, they're in some temporary storage used by >> MapReduce. >> 6) JobTracker starts reduces on a series of slaves, which copy over the >> appropriate map outputs, apply the reduce function, and write the outputs >> to >> HDFS (one output file per reducer). >> 7) Some logs for the job may also be put into HDFS by the JobTracker. >> >> However, there is a big caveat, which is that the map and reduce tasks run >> arbitrary code. 
It is not unusual to have a map that opens a second HDFS >> file to read some information (e.g. for doing a join of a small table >> against a big file). If you use Hadoop Streaming or Pipes to write a job >> in >> Python, Ruby, C, etc, then you are launching arbitrary processes which may >> also access external resources in this manner. Some people also read/write >> to DBs (e.g. MySQL) from their tasks. A comprehensive security solution >> would ideally deal with these cases too. >> >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana >> wrote: >> >> > A quick question here. How does a typical hadoop job work at the system >> > level? What are the various interactions and how does the data flow? >> > >> > Amandeep >> > >> > >> > Amandeep Khurana >> > Computer Science Graduate Student >> > University of California, Santa Cruz >> > >> > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana >> > wrote: >> > >> > > Thanks Matei. If the basic architecture is similar to the Google >> stuff, I >> > > can safely just work on the project using the information from the >> > papers. >> > > >> > > I am aware of the 4487 jira and the current status of the permissions >> > > mechanism. I had a look at them earlier. >> > > >> > > Cheers >> > > Amandeep >> > > >> > > >> > > Amandeep Khurana >> > > Computer Science Graduate Student >> > > University of California, Santa Cruz >> > > >> > > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia >> > wrote: >> > > >> > >> Forgot to add, this JIRA details the latest security features that >> are >> > >> being >> > >> worked on in Hadoop trunk: >> > >> https://issues.apache.org/jira/browse/HADOOP-4487. >> > >> This document describes the current status and limitations of the >> > >> permissions mechanism: >> > >> >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html. 
>> > >> >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia >> > >> wrote: >> > >> >> > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at >> > the >> > >> > level described in those papers. In particular, in HDFS, there is a >> > >> master >> > >> > node containing metadata and a number of slave nodes (datanodes) >> > >> containing >> > >> > blocks, as in GFS. Clients start by talking to the master to list >> > >> > directories, etc. When they want to read a region of some file, >> they >> > >> tell >> > >> > the master the filename and offset, and they receive a list of >> block >> > >> > locations (datanodes). They then contact the individual datanodes >> to >> > >> read >> > >> > the blocks. When clients write a file, they first obtain a new >> block >> > ID >> > >> and >> > >> > list of nodes to write it to from the master, then contact the >> > datanodes >> > >> to >> > >> > write it (actually, the datanodes pipeline the write as in GFS) and >> > >> report >> > >> > when the write is complete. HDFS actually has some security >> mechanisms >> > >> built >> > >> > in, authenticating users based on their Unix ID and providing >> > Unix-like >> > >> file >> > >> > permissions. I don't
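The map-side join Matei mentions (opening a small second file and joining it against the big input) can be sketched framework-agnostically. This is illustrative Python with invented function names, not the Hadoop API; in a real Java job the small table would be read from HDFS once in the mapper's setup/configure method.

```python
def load_small_table(lines):
    """Load the small side of the join into memory once, at mapper
    start-up (the 'open the second file in init' step)."""
    table = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        table[key] = value
    return table

def mapper(records, small_table):
    """Join each (key, value) input record against the in-memory table."""
    for key, value in records:
        if key in small_table:
            yield key, (value, small_table[key])

# Hypothetical data: a small user table joined against a stream of clicks.
users = load_small_table(["u1\talice", "u2\tbob"])
clicks = [("u1", "/home"), ("u3", "/404"), ("u2", "/about")]
print(list(mapper(clicks, users)))
# [('u1', ('/home', 'alice')), ('u2', ('/about', 'bob'))]
```

The security-relevant point from the thread is that this second read is an ordinary HDFS client access made from inside a task, outside anything the JobTracker planned for.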
Re: HDFS architecture based on GFS?
Assuming that the job is purely in Java and not involving streaming or pipes, wouldn't the resources (files) required by the job as inputs be known beforehand? So, if the map task is accessing a second file, how is that different, except that there are multiple files? The JobTracker would know beforehand that multiple files would be accessed. Right? I am slightly confused why you have mentioned this case separately... Can you elaborate on it a little bit? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia wrote: > Typically the data flow is like this: 1) Client submits a job description to > the JobTracker. > 2) JobTracker figures out block locations for the input file(s) by talking > to HDFS NameNode. > 3) JobTracker creates a job description file in HDFS which will be read by > the nodes to copy over the job's code etc. > 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the > appropriate data blocks. > 5) After running, maps create intermediate output files on those slaves. > These are not in HDFS, they're in some temporary storage used by MapReduce. > 6) JobTracker starts reduces on a series of slaves, which copy over the > appropriate map outputs, apply the reduce function, and write the outputs > to > HDFS (one output file per reducer). > 7) Some logs for the job may also be put into HDFS by the JobTracker. > > However, there is a big caveat, which is that the map and reduce tasks run > arbitrary code. It is not unusual to have a map that opens a second HDFS > file to read some information (e.g. for doing a join of a small table > against a big file). If you use Hadoop Streaming or Pipes to write a job in > Python, Ruby, C, etc, then you are launching arbitrary processes which may > also access external resources in this manner. Some people also read/write > to DBs (e.g. MySQL) from their tasks. 
A comprehensive security solution > would ideally deal with these cases too. > > On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana > wrote: > > > A quick question here. How does a typical hadoop job work at the system > > level? What are the various interactions and how does the data flow? > > > > Amandeep > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana > > wrote: > > > > > Thanks Matei. If the basic architecture is similar to the Google stuff, > I > > > can safely just work on the project using the information from the > > papers. > > > > > > I am aware of the 4487 jira and the current status of the permissions > > > mechanism. I had a look at them earlier. > > > > > > Cheers > > > Amandeep > > > > > > > > > Amandeep Khurana > > > Computer Science Graduate Student > > > University of California, Santa Cruz > > > > > > > > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia > > wrote: > > > > > >> Forgot to add, this JIRA details the latest security features that are > > >> being > > >> worked on in Hadoop trunk: > > >> https://issues.apache.org/jira/browse/HADOOP-4487. > > >> This document describes the current status and limitations of the > > >> permissions mechanism: > > >> > http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html. > > >> > > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia > > >> wrote: > > >> > > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at > > the > > >> > level described in those papers. In particular, in HDFS, there is a > > >> master > > >> > node containing metadata and a number of slave nodes (datanodes) > > >> containing > > >> > blocks, as in GFS. Clients start by talking to the master to list > > >> > directories, etc. 
When they want to read a region of some file, they > > >> tell > > >> > the master the filename and offset, and they receive a list of block > > >> > locations (datanodes). They then contact the individual datanodes to > > >> read > > >> > the blocks. When clients write a file, they first obtain a new block > > ID > > >> and > > >> > list of nodes to write it to from the master, then contact the > > datanodes > > >> to > > >> > write it (actually, the datanodes pipeline the write as in GFS) and > > >> report > > >> > when the write is complete. HDFS actually has some security > mechanisms > > >> built > > >> > in, authenticating users based on their Unix ID and providing > > Unix-like > > >> file > > >> > permissions. I don't know much about how these are implemented, but > > they > > >> > would be a good place to start looking. > > >> > > > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana > >> >wrote: > > >> > > > >> >> Thanks Matie > > >> >> > > >> >> I had gone through the architecture document online. I am currently > > >> >> working > > >> >> on a project towards Security in Hadoop. I do know how the data > moves > > >> >> around > > >> >> in the GFS but wasnt sure how
Re: HDFS architecture based on GFS?
This is good information! Thanks a ton. I'll take all this into account. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia wrote: > Typically the data flow is like this:1) Client submits a job description to > the JobTracker. > 2) JobTracker figures out block locations for the input file(s) by talking > to HDFS NameNode. > 3) JobTracker creates a job description file in HDFS which will be read by > the nodes to copy over the job's code etc. > 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the > appropriate data blocks. > 5) After running, maps create intermediate output files on those slaves. > These are not in HDFS, they're in some temporary storage used by MapReduce. > 6) JobTracker starts reduces on a series of slaves, which copy over the > appropriate map outputs, apply the reduce function, and write the outputs > to > HDFS (one output file per reducer). > 7) Some logs for the job may also be put into HDFS by the JobTracker. > > However, there is a big caveat, which is that the map and reduce tasks run > arbitrary code. It is not unusual to have a map that opens a second HDFS > file to read some information (e.g. for doing a join of a small table > against a big file). If you use Hadoop Streaming or Pipes to write a job in > Python, Ruby, C, etc, then you are launching arbitrary processes which may > also access external resources in this manner. Some people also read/write > to DBs (e.g. MySQL) from their tasks. A comprehensive security solution > would ideally deal with these cases too. > > On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana > wrote: > > > A quick question here. How does a typical hadoop job work at the system > > level? What are the various interactions and how does the data flow? 
> > > > Amandeep > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana > > wrote: > > > > > Thanks Matei. If the basic architecture is similar to the Google stuff, > I > > > can safely just work on the project using the information from the > > papers. > > > > > > I am aware of the 4487 jira and the current status of the permissions > > > mechanism. I had a look at them earlier. > > > > > > Cheers > > > Amandeep > > > > > > > > > Amandeep Khurana > > > Computer Science Graduate Student > > > University of California, Santa Cruz > > > > > > > > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia > > wrote: > > > > > >> Forgot to add, this JIRA details the latest security features that are > > >> being > > >> worked on in Hadoop trunk: > > >> https://issues.apache.org/jira/browse/HADOOP-4487. > > >> This document describes the current status and limitations of the > > >> permissions mechanism: > > >> > http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html. > > >> > > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia > > >> wrote: > > >> > > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at > > the > > >> > level described in those papers. In particular, in HDFS, there is a > > >> master > > >> > node containing metadata and a number of slave nodes (datanodes) > > >> containing > > >> > blocks, as in GFS. Clients start by talking to the master to list > > >> > directories, etc. When they want to read a region of some file, they > > >> tell > > >> > the master the filename and offset, and they receive a list of block > > >> > locations (datanodes). They then contact the individual datanodes to > > >> read > > >> > the blocks. 
When clients write a file, they first obtain a new block > > ID > > >> and > > >> > list of nodes to write it to from the master, then contact the > > datanodes > > >> to > > >> > write it (actually, the datanodes pipeline the write as in GFS) and > > >> report > > >> > when the write is complete. HDFS actually has some security > mechanisms > > >> built > > >> > in, authenticating users based on their Unix ID and providing > > Unix-like > > >> file > > >> > permissions. I don't know much about how these are implemented, but > > they > > >> > would be a good place to start looking. > > >> > > > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana > >> >wrote: > > >> > > > >> >> Thanks Matie > > >> >> > > >> >> I had gone through the architecture document online. I am currently > > >> >> working > > >> >> on a project towards Security in Hadoop. I do know how the data > moves > > >> >> around > > >> >> in the GFS but wasnt sure how much of that does HDFS follow and how > > >> >> different it is from GFS. Can you throw some light on that? > > >> >> > > >> >> Security would also involve the Map Reduce jobs following the same > > >> >> protocols. Thats why the question about how does the Hadoop > framework > > >> >> integrate with the HDFS, and how different is it from Map Reduce > and > > >> GFS. > > >> >> The GFS and M
Re: HDFS architecture based on GFS?
Typically the data flow is like this:
1) Client submits a job description to the JobTracker.
2) JobTracker figures out block locations for the input file(s) by talking to HDFS NameNode.
3) JobTracker creates a job description file in HDFS which will be read by the nodes to copy over the job's code etc.
4) JobTracker starts map tasks on the slaves (TaskTrackers) with the appropriate data blocks.
5) After running, maps create intermediate output files on those slaves. These are not in HDFS, they're in some temporary storage used by MapReduce.
6) JobTracker starts reduces on a series of slaves, which copy over the appropriate map outputs, apply the reduce function, and write the outputs to HDFS (one output file per reducer).
7) Some logs for the job may also be put into HDFS by the JobTracker.
However, there is a big caveat, which is that the map and reduce tasks run arbitrary code. It is not unusual to have a map that opens a second HDFS file to read some information (e.g. for doing a join of a small table against a big file). If you use Hadoop Streaming or Pipes to write a job in Python, Ruby, C, etc, then you are launching arbitrary processes which may also access external resources in this manner. Some people also read/write to DBs (e.g. MySQL) from their tasks. A comprehensive security solution would ideally deal with these cases too. On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana wrote: > A quick question here. How does a typical hadoop job work at the system > level? What are the various interactions and how does the data flow? > > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana > wrote: > > > Thanks Matei. If the basic architecture is similar to the Google stuff, I > > can safely just work on the project using the information from the > papers. > > > > I am aware of the 4487 jira and the current status of the permissions > > mechanism. 
I had a look at them earlier. > > > > Cheers > > Amandeep > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia > wrote: > > > >> Forgot to add, this JIRA details the latest security features that are > >> being > >> worked on in Hadoop trunk: > >> https://issues.apache.org/jira/browse/HADOOP-4487. > >> This document describes the current status and limitations of the > >> permissions mechanism: > >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html. > >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia > >> wrote: > >> > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at > the > >> > level described in those papers. In particular, in HDFS, there is a > >> master > >> > node containing metadata and a number of slave nodes (datanodes) > >> containing > >> > blocks, as in GFS. Clients start by talking to the master to list > >> > directories, etc. When they want to read a region of some file, they > >> tell > >> > the master the filename and offset, and they receive a list of block > >> > locations (datanodes). They then contact the individual datanodes to > >> read > >> > the blocks. When clients write a file, they first obtain a new block > ID > >> and > >> > list of nodes to write it to from the master, then contact the > datanodes > >> to > >> > write it (actually, the datanodes pipeline the write as in GFS) and > >> report > >> > when the write is complete. HDFS actually has some security mechanisms > >> built > >> > in, authenticating users based on their Unix ID and providing > Unix-like > >> file > >> > permissions. I don't know much about how these are implemented, but > they > >> > would be a good place to start looking. > >> > > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana >> >wrote: > >> > > >> >> Thanks Matie > >> >> > >> >> I had gone through the architecture document online. 
I am currently > >> >> working > >> >> on a project towards Security in Hadoop. I do know how the data moves > >> >> around > >> >> in the GFS but wasnt sure how much of that does HDFS follow and how > >> >> different it is from GFS. Can you throw some light on that? > >> >> > >> >> Security would also involve the Map Reduce jobs following the same > >> >> protocols. Thats why the question about how does the Hadoop framework > >> >> integrate with the HDFS, and how different is it from Map Reduce and > >> GFS. > >> >> The GFS and Map Reduce papers give a good information on how those > >> systems > >> >> are designed but there is nothing that concrete for Hadoop that I > have > >> >> been > >> >> able to find. > >> >> > >> >> Amandeep > >> >> > >> >> > >> >> Amandeep Khurana > >> >> Computer Science Graduate Student > >> >> University of California, Santa Cruz > >> >> > >> >> > >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia > >> >> wrote: > >> >> > >> >> > Hi Amandeep, > >> >> > Hadoop is d
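Steps 4-6 of Matei's data flow can be mimicked in a single process as a rough illustration (hypothetical Python; the character-sum partitioner is an invented deterministic stand-in for Hadoop's hash partitioner, and the "output files" are just dicts):

```python
def run_job(input_blocks, map_fn, reduce_fn, n_reducers=2):
    """Toy single-process model of steps 4-6: one map task per input
    block, intermediate output partitioned by key, one output per reducer."""
    # Steps 4-5: run the maps, partitioning intermediate (key, value) pairs.
    partitions = [dict() for _ in range(n_reducers)]
    for block in input_blocks:
        for k, v in map_fn(block):
            part = sum(map(ord, k)) % n_reducers  # deterministic partitioner
            partitions[part].setdefault(k, []).append(v)
    # Step 6: each reducer merges its partition into one output "file".
    return [{k: reduce_fn(k, vs) for k, vs in sorted(p.items())}
            for p in partitions]

# Word count over two input "blocks":
outputs = run_job(["the quick fox", "the lazy the"],
                  lambda block: [(w, 1) for w in block.split()],
                  lambda k, vs: sum(vs))
merged = {k: v for out in outputs for k, v in out.items()}
print(merged)  # {'lazy': 1, 'fox': 1, 'quick': 1, 'the': 3}
```

The model deliberately keeps the intermediate partitions outside the "HDFS" outputs, matching step 5's point that map output lives in temporary local storage, not HDFS.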
Re: setting up networking and ssh on multnode cluster...
okay, i have changed /etc/hosts to look like this for node0... 127.0.0.1 localhost 127.0.1.1 node0 # /etc/hosts (for hadoop master and slave) 192.168.0.1 node0 192.168.0.2 node1 #end hadoop section # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters ff02::3 ip6-allhosts ...and this for node1... 127.0.0.1 localhost 127.0.1.1 node1 # /etc/hosts (for hadoop master and slave) 192.168.0.1 node0 192.168.0.2 node1 #end hadoop section # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters ff02::3 ip6-allhosts ... the machines are connected by a cat5 cable, they have wifi and are showing that they are connected to my wlan. also i have enabled all the user privileges in the user manager on both machines. here are the results from ssh on node0... had...@node0:~$ ssh node0 Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686 The programs included with the Ubuntu system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright. Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law. To access official Ubuntu documentation, please visit: http://help.ubuntu.com/ Last login: Sun Feb 15 16:00:28 2009 from node0 had...@node0:~$ exit logout Connection to node0 closed. had...@node0:~$ ssh node1 ssh: connect to host node1 port 22: Connection timed out had...@node0:~$ i will look into the link that you gave. -zander Norbert Burger wrote: > >> >> i have commented out the 192. addresses and changed 127.0.1.1 for node0 >> and >> 127.0.1.2 for node1 (in /etc/hosts). with this done i can ssh from one >> machine to itself and to the other but the prompt does not change when i >> ssh >> to the other machine. 
i don't know if there is a firewall preventing me >> from >> ssh or not. i have not set any up to prevent ssh and i have not taken >> action >> to specifically allow ssh other than what was prescribed in the tutorial >> for >> a single node cluster for both machines. > > > Why are you using 127.* addresses for your nodes? These fall into the > block > of IPs reserved for the loopback network (see > http://en.wikipedia.org/wiki/Loopback). > > Try changing both nodes back to 192.168, and re-starting Hadoop. > > Norbert > > -- View this message in context: http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22029812.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: setting up networking and ssh on multnode cluster...
> > i have commented out the 192. addresses and changed 127.0.1.1 for node0 and > 127.0.1.2 for node1 (in /etc/hosts). with this done i can ssh from one > machine to itself and to the other but the prompt does not change when i > ssh > to the other machine. i don't know if there is a firewall preventing me > from > ssh or not. i have not set any up to prevent ssh and i have not taken > action > to specifically allow ssh other than what was prescribed in the tutorial > for > a single node cluster for both machines. Why are you using 127.* addresses for your nodes? These fall into the block of IPs reserved for the loopback network (see http://en.wikipedia.org/wiki/Loopback). Try changing both nodes back to 192.168, and re-starting Hadoop. Norbert
Re: HDFS architecture based on GFS?
A quick question here. How does a typical hadoop job work at the system level? What are the various interactions and how does the data flow? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana wrote: > Thanks Matei. If the basic architecture is similar to the Google stuff, I > can safely just work on the project using the information from the papers. > > I am aware of the 4487 jira and the current status of the permissions > mechanism. I had a look at them earlier. > > Cheers > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia wrote: > >> Forgot to add, this JIRA details the latest security features that are >> being >> worked on in Hadoop trunk: >> https://issues.apache.org/jira/browse/HADOOP-4487. >> This document describes the current status and limitations of the >> permissions mechanism: >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html. >> >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia >> wrote: >> >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at the >> > level described in those papers. In particular, in HDFS, there is a >> master >> > node containing metadata and a number of slave nodes (datanodes) >> containing >> > blocks, as in GFS. Clients start by talking to the master to list >> > directories, etc. When they want to read a region of some file, they >> tell >> > the master the filename and offset, and they receive a list of block >> > locations (datanodes). They then contact the individual datanodes to >> read >> > the blocks. When clients write a file, they first obtain a new block ID >> and >> > list of nodes to write it to from the master, then contact the datanodes >> to >> > write it (actually, the datanodes pipeline the write as in GFS) and >> report >> > when the write is complete. 
HDFS actually has some security mechanisms >> built >> > in, authenticating users based on their Unix ID and providing Unix-like >> file >> > permissions. I don't know much about how these are implemented, but they >> > would be a good place to start looking. >> > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana > >wrote: >> > >> >> Thanks Matie >> >> >> >> I had gone through the architecture document online. I am currently >> >> working >> >> on a project towards Security in Hadoop. I do know how the data moves >> >> around >> >> in the GFS but wasnt sure how much of that does HDFS follow and how >> >> different it is from GFS. Can you throw some light on that? >> >> >> >> Security would also involve the Map Reduce jobs following the same >> >> protocols. Thats why the question about how does the Hadoop framework >> >> integrate with the HDFS, and how different is it from Map Reduce and >> GFS. >> >> The GFS and Map Reduce papers give a good information on how those >> systems >> >> are designed but there is nothing that concrete for Hadoop that I have >> >> been >> >> able to find. >> >> >> >> Amandeep >> >> >> >> >> >> Amandeep Khurana >> >> Computer Science Graduate Student >> >> University of California, Santa Cruz >> >> >> >> >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia >> >> wrote: >> >> >> >> > Hi Amandeep, >> >> > Hadoop is definitely inspired by MapReduce/GFS and aims to provide >> those >> >> > capabilities as an open-source project. HDFS is similar to GFS (large >> >> > blocks, replication, etc); some notable things missing are read-write >> >> > support in the middle of a file (unlikely to be provided because few >> >> Hadoop >> >> > applications require it) and multiple appenders (the record append >> >> > operation). You can read about HDFS architecture at >> >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html. 
The >> >> MapReduce >> >> > part of Hadoop interacts with HDFS in the same way that Google's >> >> MapReduce >> >> > interacts with GFS (shipping computation to the data), although >> Hadoop >> >> > MapReduce also supports running over other distributed filesystems. >> >> > >> >> > Matei >> >> > >> >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana > > >> >> > wrote: >> >> > >> >> > > Hi >> >> > > >> >> > > Is the HDFS architecture completely based on the Google Filesystem? >> If >> >> it >> >> > > isnt, what are the differences between the two? >> >> > > >> >> > > Secondly, is the coupling between Hadoop and HDFS same as how it is >> >> > between >> >> > > the Google's version of Map Reduce and GFS? >> >> > > >> >> > > Amandeep >> >> > > >> >> > > >> >> > > Amandeep Khurana >> >> > > Computer Science Graduate Student >> >> > > University of California, Santa Cruz >> >> > > >> >> > >> >> >> > >> > >> > >
Re: HDFS architecture based on GFS?
Thanks Matei. If the basic architecture is similar to the Google stuff, I can safely just work on the project using the information from the papers. I am aware of the 4487 jira and the current status of the permissions mechanism. I had a look at them earlier. Cheers Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia wrote: > Forgot to add, this JIRA details the latest security features that are > being > worked on in Hadoop trunk: > https://issues.apache.org/jira/browse/HADOOP-4487. > This document describes the current status and limitations of the > permissions mechanism: > http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html. > > On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia wrote: > > > I think it's safe to assume that Hadoop works like MapReduce/GFS at the > > level described in those papers. In particular, in HDFS, there is a > master > > node containing metadata and a number of slave nodes (datanodes) > containing > > blocks, as in GFS. Clients start by talking to the master to list > > directories, etc. When they want to read a region of some file, they tell > > the master the filename and offset, and they receive a list of block > > locations (datanodes). They then contact the individual datanodes to read > > the blocks. When clients write a file, they first obtain a new block ID > and > > list of nodes to write it to from the master, then contact the datanodes > to > > write it (actually, the datanodes pipeline the write as in GFS) and > report > > when the write is complete. HDFS actually has some security mechanisms > built > > in, authenticating users based on their Unix ID and providing Unix-like > file > > permissions. I don't know much about how these are implemented, but they > > would be a good place to start looking. 
> > > > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana >wrote: > > > >> Thanks Matie > >> > >> I had gone through the architecture document online. I am currently > >> working > >> on a project towards Security in Hadoop. I do know how the data moves > >> around > >> in the GFS but wasnt sure how much of that does HDFS follow and how > >> different it is from GFS. Can you throw some light on that? > >> > >> Security would also involve the Map Reduce jobs following the same > >> protocols. Thats why the question about how does the Hadoop framework > >> integrate with the HDFS, and how different is it from Map Reduce and > GFS. > >> The GFS and Map Reduce papers give a good information on how those > systems > >> are designed but there is nothing that concrete for Hadoop that I have > >> been > >> able to find. > >> > >> Amandeep > >> > >> > >> Amandeep Khurana > >> Computer Science Graduate Student > >> University of California, Santa Cruz > >> > >> > >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia > >> wrote: > >> > >> > Hi Amandeep, > >> > Hadoop is definitely inspired by MapReduce/GFS and aims to provide > those > >> > capabilities as an open-source project. HDFS is similar to GFS (large > >> > blocks, replication, etc); some notable things missing are read-write > >> > support in the middle of a file (unlikely to be provided because few > >> Hadoop > >> > applications require it) and multiple appenders (the record append > >> > operation). You can read about HDFS architecture at > >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html. The > >> MapReduce > >> > part of Hadoop interacts with HDFS in the same way that Google's > >> MapReduce > >> > interacts with GFS (shipping computation to the data), although Hadoop > >> > MapReduce also supports running over other distributed filesystems. 
> >> > > >> > Matei > >> > > >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana > >> > wrote: > >> > > >> > > Hi > >> > > > >> > > Is the HDFS architecture completely based on the Google Filesystem? > If > >> it > >> > > isnt, what are the differences between the two? > >> > > > >> > > Secondly, is the coupling between Hadoop and HDFS same as how it is > >> > between > >> > > the Google's version of Map Reduce and GFS? > >> > > > >> > > Amandeep > >> > > > >> > > > >> > > Amandeep Khurana > >> > > Computer Science Graduate Student > >> > > University of California, Santa Cruz > >> > > > >> > > >> > > > > >
Re: setting up networking and ssh on multnode cluster...
hi, sshd is running on both machines. i am using the default ubuntu 8.10 workstation install with openssh-server installed via "apt-get install". i have tried with the machines connected through both a switch and just plugging the ethernet cable from one into the other. right now i have just one cat-5 cable running from one machine to the other. i have commented out the 192. addresses and changed 127.0.1.1 for node0 and 127.0.1.2 for node1 (in /etc/hosts). with this done i can ssh from one machine to itself and to the other but the prompt does not change when i ssh to the other machine. i don't know if there is a firewall preventing me from ssh or not. i have not set any up to prevent ssh and i have not taken action to specifically allow ssh other than what was prescribed in the tutorial for a single node cluster for both machines. here is what i get with 127.0.1.1 for node0 and 127.0.1.2 for node1 (the 192 addresses commented out) in /etc/hosts on both machines. it seems okay but the prompt does not change to node1 when i ssh to that node. that is why i am concerned.

had...@node0:~$ ssh node0
Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686

The programs included with the Ubuntu system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.

To access official Ubuntu documentation, please visit: http://help.ubuntu.com/
Last login: Sun Feb 15 14:35:46 2009 from node1
had...@node0:~$ exit
logout
Connection to node0 closed.
had...@node0:~$ ssh node1
Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686

The programs included with the Ubuntu system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.
To access official Ubuntu documentation, please visit: http://help.ubuntu.com/
Last login: Sun Feb 15 14:38:41 2009 from node0
had...@node0:~$

i hope you can advise. i don't know if this means there is no firewall to prevent ssh or not but it is the best i can do in trying to determine if there is or not without further advice. -zander james warren-2 wrote: > > Hi Zander - > Two simple explanations come to mind: > > * Is sshd running on your boxes? > * If so, do you have a firewall preventing ssh access? > > cheers, > -jw > > On Sat, Feb 14, 2009 at 7:50 PM, zander1013 wrote: >> >> hi, >> >> am going through the tutorial on multinode cluster setup by m. noll... >> >> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) >> >> ... i am at the networking and ssh section. i am having trouble >> configuring >> the /etc/hosts file. my machines are node0 and node1 not master and >> slave. >> i >> am taking node0 as master and node1 as slave. >> >> i added the lines... >> # /etc/hosts (for master AND slave) >> 192.168.0.1node0 >> 192.168.0.2node1 >> ... to /etc/hosts. but when i try to ssh (either way) i get the error... >> "ssh: connect to host node1 (node0) port 22: Network is unreachable" >> >> /etc/hosts looks like this for node0... >> >> # /etc/hosts (for master and slave) >> 192.168.0.1 node0 >> 192.168.0.2 node1 >> #end hadoop section >> 127.0.0.1 localhost >> 127.0.1.1 node0 >> >> # The following lines are desirable for IPv6 capable hosts >> ::1 ip6-localhost ip6-loopback >> fe00::0 ip6-localnet >> ff00::0 ip6-mcastprefix >> ff02::1 ip6-allnodes >> ff02::2 ip6-allrouters >> ff02::3 ip6-allhosts >> >> >> and this for node1...
>> >> # /etc/hosts (for master and slave) >> 192.168.0.1 node0 >> 192.168.0.2 node1 >> #end hadoop section >> 127.0.0.1 localhost >> 127.0.1.1 node1 >> >> # The following lines are desirable for IPv6 capable hosts >> ::1 ip6-localhost ip6-loopback >> fe00::0 ip6-localnet >> ff00::0 ip6-mcastprefix >> ff02::1 ip6-allnodes >> ff02::2 ip6-allrouters >> ff02::3 ip6-allhosts >> >> ... even after rebooting and connecting the machines with a switch using >> cat >> 6 cable i get the error i quoted above. >> >> please advise. >> >> >> thank you, >> >> -zander >> -- >> View this message in context: >> http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22019559.html >> Sent from the Hadoop core-user mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p2202.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
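One detail worth noting about the transcript above: every 127.x.x.x address is the local loopback, so with node1 mapped to 127.0.1.2 in /etc/hosts, "ssh node1" connects back to the same machine — which matches the banner reporting "Linux node0" and the prompt staying at node0. A sketch of the layout the tutorial is driving at (illustrative only, and it assumes the 192.168.0.x addresses are actually assigned to the machines' network interfaces, which the earlier "Network is unreachable" error suggests they may not be):

```
# /etc/hosts on node0 (node1 analogous) -- illustrative sketch only;
# the 192.168.0.x addresses must also be configured on the NICs
127.0.0.1    localhost
192.168.0.1  node0
192.168.0.2  node1
```

With this layout, "ssh node1" from node0 should go over the wire to the other box rather than loop back, and the prompt should change accordingly.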
Re: Race Condition?
I was not able to determine the command shell return value for hadoop dfs -copyToLocal #{s3dir} #{localdir} but I did print out several variables after the call and determined that the call apparently did not go through successfully. In particular, prior to my processData(localdir) command I use Ruby's puts to print out the contents of several directories including 'localdir' and '../localdir' - here is the weird thing: if I execute the following list = `ls -l "#{localdir}"` puts "List: #{list}" (where 'localdir' is the directory I need as an arg for processData) the processData command will execute properly. At first I thought that running the puts command was allowing enough time to elapse for a race condition to be avoided so that 'localdir' was ready when the processData command was called (I know that in certain ways that doesn't make sense given that hadoop dfs -copyToLocal blocks until it completes...) but then I tried other time consuming commands such as list = `ls -l "../#{localdir}"` puts "List: #{list}" and running processData(localdir) led to an error: 'localdir' not found Any clues on what could be going on? Thanks, John On Sat, Feb 14, 2009 at 6:45 PM, Matei Zaharia wrote: > Have you logged the output of the dfs command to see whether it's always > succeeded the copy? > > On Sat, Feb 14, 2009 at 2:46 PM, S D wrote: > > > In my Hadoop 0.19.0 program each map function is assigned a directory > > (representing a data location in my S3 datastore). The first thing each > map > > function does is copy the particular S3 data to the local machine that > the > > map task is running on and then being processing the data; e.g., > > > > command = "hadoop dfs -copyToLocal #{s3dir} #{localdir}" > > system "#{command}" > > > > In the above, "s3dir" is a directory that creates "localdir" - my > > expectation is that "localdir" is created in the work directory for the > > particular task attempt. 
Following this copy command I then run a > function > > that processes the data; e.g., > > > > processData(localdir) > > > > In some instances my map/reduce program crashes and when I examine the > logs > > I get a message saying that "localdir" can not be found. This confuses me > > since the hadoop shell command above is blocking so that localdir should > > exist by the time processData() is called. I've found that if I add in > some > > diagnostic lines prior to processData() such as puts statements to print > > out > > variables, I never run into the problem of the localdir not being found. > It > > is almost as if localdir needs time to be created before the call to > > processData(). > > > > Has anyone encountered anything like this? Any suggestions on what could > be > > wrong are appreciated. > > > > Thanks, > > John > > >
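One way to rule out a silent copy failure in the script above is to check the exit status the shelled-out command reports, instead of assuming the copy succeeded. A minimal sketch in Ruby (the commented-out usage mirrors the thread's flow; `s3dir`, `localdir`, and `process_data` are placeholders from the discussion, not real definitions):

```ruby
# Minimal helper: run an external command and fail loudly unless it
# exits 0. In Ruby, `system` returns true only on exit status 0
# (false on a non-zero status, nil if the command could not be
# started), and `$?` holds the Process::Status of the last child.
def run!(*cmd)
  ok = system(*cmd)
  raise "command failed (#{$?.inspect}): #{cmd.join(' ')}" unless ok
end

# Hypothetical usage mirroring the thread's script:
#
#   run!("hadoop", "dfs", "-copyToLocal", s3dir, localdir)
#   raise "#{localdir} missing after copy" unless File.directory?(localdir)
#   process_data(localdir)
```

If `-copyToLocal` really does block until completion, a failed copy then shows up immediately as a non-zero exit status rather than as a mysteriously missing directory later; and if the copy reports success while `localdir` is still not found, that points elsewhere — for instance, the script's current working directory may not be the task's work directory, which would also explain why `ls ../#{localdir}` behaved differently from `ls #{localdir}`.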
Re: HDFS architecture based on GFS?
Forgot to add, this JIRA details the latest security features that are being worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487. This document describes the current status and limitations of the permissions mechanism: http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html. On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia wrote: > I think it's safe to assume that Hadoop works like MapReduce/GFS at the > level described in those papers. In particular, in HDFS, there is a master > node containing metadata and a number of slave nodes (datanodes) containing > blocks, as in GFS. Clients start by talking to the master to list > directories, etc. When they want to read a region of some file, they tell > the master the filename and offset, and they receive a list of block > locations (datanodes). They then contact the individual datanodes to read > the blocks. When clients write a file, they first obtain a new block ID and > list of nodes to write it to from the master, then contact the datanodes to > write it (actually, the datanodes pipeline the write as in GFS) and report > when the write is complete. HDFS actually has some security mechanisms built > in, authenticating users based on their Unix ID and providing Unix-like file > permissions. I don't know much about how these are implemented, but they > would be a good place to start looking. > > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana wrote: > >> Thanks Matie >> >> I had gone through the architecture document online. I am currently >> working >> on a project towards Security in Hadoop. I do know how the data moves >> around >> in the GFS but wasnt sure how much of that does HDFS follow and how >> different it is from GFS. Can you throw some light on that? >> >> Security would also involve the Map Reduce jobs following the same >> protocols. 
Thats why the question about how does the Hadoop framework >> integrate with the HDFS, and how different is it from Map Reduce and GFS. >> The GFS and Map Reduce papers give a good information on how those systems >> are designed but there is nothing that concrete for Hadoop that I have >> been >> able to find. >> >> Amandeep >> >> >> Amandeep Khurana >> Computer Science Graduate Student >> University of California, Santa Cruz >> >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia >> wrote: >> >> > Hi Amandeep, >> > Hadoop is definitely inspired by MapReduce/GFS and aims to provide those >> > capabilities as an open-source project. HDFS is similar to GFS (large >> > blocks, replication, etc); some notable things missing are read-write >> > support in the middle of a file (unlikely to be provided because few >> Hadoop >> > applications require it) and multiple appenders (the record append >> > operation). You can read about HDFS architecture at >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html. The >> MapReduce >> > part of Hadoop interacts with HDFS in the same way that Google's >> MapReduce >> > interacts with GFS (shipping computation to the data), although Hadoop >> > MapReduce also supports running over other distributed filesystems. >> > >> > Matei >> > >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana >> > wrote: >> > >> > > Hi >> > > >> > > Is the HDFS architecture completely based on the Google Filesystem? If >> it >> > > isnt, what are the differences between the two? >> > > >> > > Secondly, is the coupling between Hadoop and HDFS same as how it is >> > between >> > > the Google's version of Map Reduce and GFS? >> > > >> > > Amandeep >> > > >> > > >> > > Amandeep Khurana >> > > Computer Science Graduate Student >> > > University of California, Santa Cruz >> > > >> > >> > >
Re: HDFS architecture based on GFS?
I think it's safe to assume that Hadoop works like MapReduce/GFS at the level described in those papers. In particular, in HDFS, there is a master node containing metadata and a number of slave nodes (datanodes) containing blocks, as in GFS. Clients start by talking to the master to list directories, etc. When they want to read a region of some file, they tell the master the filename and offset, and they receive a list of block locations (datanodes). They then contact the individual datanodes to read the blocks. When clients write a file, they first obtain a new block ID and list of nodes to write it to from the master, then contact the datanodes to write it (actually, the datanodes pipeline the write as in GFS) and report when the write is complete. HDFS actually has some security mechanisms built in, authenticating users based on their Unix ID and providing Unix-like file permissions. I don't know much about how these are implemented, but they would be a good place to start looking. On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana wrote: > Thanks Matie > > I had gone through the architecture document online. I am currently working > on a project towards Security in Hadoop. I do know how the data moves > around > in the GFS but wasnt sure how much of that does HDFS follow and how > different it is from GFS. Can you throw some light on that? > > Security would also involve the Map Reduce jobs following the same > protocols. Thats why the question about how does the Hadoop framework > integrate with the HDFS, and how different is it from Map Reduce and GFS. > The GFS and Map Reduce papers give a good information on how those systems > are designed but there is nothing that concrete for Hadoop that I have been > able to find. 
> > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia > wrote: > > > Hi Amandeep, > > Hadoop is definitely inspired by MapReduce/GFS and aims to provide those > > capabilities as an open-source project. HDFS is similar to GFS (large > > blocks, replication, etc); some notable things missing are read-write > > support in the middle of a file (unlikely to be provided because few > Hadoop > > applications require it) and multiple appenders (the record append > > operation). You can read about HDFS architecture at > > http://hadoop.apache.org/core/docs/current/hdfs_design.html. The > MapReduce > > part of Hadoop interacts with HDFS in the same way that Google's > MapReduce > > interacts with GFS (shipping computation to the data), although Hadoop > > MapReduce also supports running over other distributed filesystems. > > > > Matei > > > > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana > > wrote: > > > > > Hi > > > > > > Is the HDFS architecture completely based on the Google Filesystem? If > it > > > isnt, what are the differences between the two? > > > > > > Secondly, is the coupling between Hadoop and HDFS same as how it is > > between > > > the Google's version of Map Reduce and GFS? > > > > > > Amandeep > > > > > > > > > Amandeep Khurana > > > Computer Science Graduate Student > > > University of California, Santa Cruz > > > > > >
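The read path described above — metadata from the master (namenode), bulk data fetched directly from the datanodes — can be illustrated with a toy model. This is a sketch of the protocol's shape only, with made-up names and block IDs, not real HDFS client code:

```ruby
# Toy model of the HDFS read path: the "namenode" maps a filename to an
# ordered list of blocks with replica locations; "datanodes" hold the
# actual block contents. All names here are illustrative.
NAMENODE = {
  "/logs/part-0" => [
    { block: "blk_1", datanodes: ["dn1", "dn2"] },
    { block: "blk_2", datanodes: ["dn2", "dn3"] },
  ],
}
DATANODES = {
  "dn1" => { "blk_1" => "hello " },
  "dn2" => { "blk_1" => "hello ", "blk_2" => "world" },
  "dn3" => { "blk_2" => "world" },
}

def read_file(namenode, datanodes, path)
  # Step 1: metadata-only call to the master -- filename in,
  # block list and replica locations out.
  blocks = namenode.fetch(path)
  # Step 2: block contents come straight from the datanodes;
  # the master never sees file data.
  blocks.map do |b|
    replica = b[:datanodes].find { |dn| datanodes[dn].key?(b[:block]) }
    datanodes[replica][b[:block]]
  end.join
end

puts read_file(NAMENODE, DATANODES, "/logs/part-0")  # prints "hello world"
```

The write path sketched in the message follows the same split: the client gets a block ID and a target datanode list from the master, then streams the data to the datanodes (which pipeline it among themselves) and reports back when done.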
Re: datanode not being started
just some more information: hadoop fsck produces:

 Status: HEALTHY
 Total size: 0 B
 Total dirs: 9
 Total files: 0 (Files currently being written: 1)
 Total blocks (validated): 0
 Minimally replicated blocks: 0
 Over-replicated blocks: 0
 Under-replicated blocks: 0
 Mis-replicated blocks: 0
 Default replication factor: 1
 Average block replication: 0.0
 Corrupt blocks: 0
 Missing replicas: 0
 Number of data-nodes: 0
 Number of racks: 0
The filesystem under path '/' is HEALTHY

on the newly formatted hdfs. jps says:

4723 Jps
4527 NameNode
4653 JobTracker

I can't copy files onto the dfs since I get "NotReplicatedYetExceptions", which I suspect has to do with the fact that there are no datanodes. My "cluster" is a single MacPro with 8 cores. I haven't had to do anything extra before in order to get datanodes to be generated.

09/02/15 15:56:27 WARN dfs.DFSClient: Error Recovery for block null bad datanode[0]
copyFromLocal: Could not get block locations. Aborting...

The corresponding error in the logs is:

2009-02-15 15:56:27,123 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call addBlock(/user/hadoop/input/.DS_Store, DFSClient_755366230) from 127.0.0.1:49796: error: java.io.IOException: File /user/hadoop/input/.DS_Store could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /user/hadoop/input/.DS_Store could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1120)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

On Sun, Feb 15, 2009 at 3:26 PM, Sandy wrote: > Thanks for
your responses. > > I checked in the namenode and jobtracker logs and both say: > > INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call > delete(/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system, true) from > 127.0.0.1:61086: error: org.apache.hadoop.dfs.SafeModeException: Cannot > delete /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node > is in safe mode. > The ratio of reported blocks 0. has not reached the threshold 0.9990. > Safe mode will be turned off automatically. > org.apache.hadoop.dfs.SafeModeException: Cannot delete > /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node is in > safe mode. > The ratio of reported blocks 0. has not reached the threshold 0.9990. > Safe mode will be turned off automatically. > at > org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1505) > at > org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1477) > at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:425) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) > > > I think this is a continuation of my running problem. The nodes stay in > safe mode, but won't come out, even after several minutes. I believe this is > due to the fact that it keep trying to contact a datanode that does not > exist. Any suggestions on what I can do? > > I have recently tried to reformat the hdfs, using bin/hadoop namenode > -format. From the output directed to standard out, I thought this completed > correctly: > > Re-format filesystem in /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name > ? 
(Y or N) Y > 09/02/15 15:16:39 INFO fs.FSNamesystem: > fsOwner=hadoop,staff,_lpadmin,com.apple.sharepoint.group.8,com.apple.sharepoint.group.3,com.apple.sharepoint.group.4,com.apple.sharepoint.group.2,com.apple.sharepoint.group.6,com.apple.sharepoint.group.9,com.apple.sharepoint.group.1,com.apple.sharepoint.group.5 > 09/02/15 15:16:39 INFO fs.FSNamesystem: supergroup=supergroup > 09/02/15 15:16:39 INFO fs.FSNamesystem: isPermissionEnabled=true > 09/02/15 15:16:39 INFO dfs.Storage: Image file of size 80 saved in 0 > seconds. > 09/02/15 15:16:39 INFO dfs.Storage: Storage directory > /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name has been successfully > formatted. > 09/02/15 15:16:39 INFO dfs.NameNode: SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down NameNode at > loteria.cs.tamu.edu/128.194.143.170 > / > > However, after reformatting, I find that I have the same problems. > > Thanks, > SM > > On Fri, Feb 13, 2009 at 5:
Re: setting up networking and ssh on multnode cluster...
Hi Zander - Two simple explanations come to mind: * Is sshd running on your boxes? * If so, do you have a firewall preventing ssh access? cheers, -jw On Sat, Feb 14, 2009 at 7:50 PM, zander1013 wrote: > > hi, > > am going through the tutorial on multinode cluster setup by m. noll... > > http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) > > ... i am at the networking and ssh section. i am having trouble > configuring > the /etc/hosts file. my machines are node0 and node1 not master and slave. > i > am taking node0 as master and node1 as slave. > > i added the lines... > # /etc/hosts (for master AND slave) > 192.168.0.1node0 > 192.168.0.2node1 > ... to /etc/hosts. but when i try to ssh (either way) i get the error... > "ssh: connect to host node1 (node0) port 22: Network is unreachable" > > /etc/hosts looks like this for node0... > > # /etc/hosts (for master and slave) > 192.168.0.1 node0 > 192.168.0.2 node1 > #end hadoop section > 127.0.0.1 localhost > 127.0.1.1 node0 > > # The following lines are desirable for IPv6 capable hosts > ::1 ip6-localhost ip6-loopback > fe00::0 ip6-localnet > ff00::0 ip6-mcastprefix > ff02::1 ip6-allnodes > ff02::2 ip6-allrouters > ff02::3 ip6-allhosts > > > and this for node1... > > # /etc/hosts (for master and slave) > 192.168.0.1 node0 > 192.168.0.2 node1 > #end hadoop section > 127.0.0.1 localhost > 127.0.1.1 node1 > > # The following lines are desirable for IPv6 capable hosts > ::1 ip6-localhost ip6-loopback > fe00::0 ip6-localnet > ff00::0 ip6-mcastprefix > ff02::1 ip6-allnodes > ff02::2 ip6-allrouters > ff02::3 ip6-allhosts > > ... even after rebooting and connecting the machines with a switch using > cat > 6 cable i get the error i quoted above. > > please advise. 
> > > thank you, > > -zander > -- > View this message in context: > http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22019559.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > >
Re: HDFS architecture based on GFS?
Thanks Matei. I had gone through the architecture document online. I am currently working on a project towards Security in Hadoop. I do know how the data moves around in GFS but wasn't sure how much of that HDFS follows and how different it is from GFS. Can you throw some light on that? Security would also involve the Map Reduce jobs following the same protocols. That's why the question about how the Hadoop framework integrates with HDFS, and how different that is from Map Reduce and GFS. The GFS and Map Reduce papers give good information on how those systems are designed, but there is nothing as concrete for Hadoop that I have been able to find. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia wrote: > Hi Amandeep, > Hadoop is definitely inspired by MapReduce/GFS and aims to provide those > capabilities as an open-source project. HDFS is similar to GFS (large > blocks, replication, etc); some notable things missing are read-write > support in the middle of a file (unlikely to be provided because few Hadoop > applications require it) and multiple appenders (the record append > operation). You can read about HDFS architecture at > http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce > part of Hadoop interacts with HDFS in the same way that Google's MapReduce > interacts with GFS (shipping computation to the data), although Hadoop > MapReduce also supports running over other distributed filesystems. > > Matei > > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana > wrote: > > > Hi > > > > Is the HDFS architecture completely based on the Google Filesystem? If it > > isn't, what are the differences between the two? > > > > Secondly, is the coupling between Hadoop and HDFS the same as that > between > > Google's version of Map Reduce and GFS? 
> > > > Amandeep > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > >
Re: datanode not being started
Thanks for your responses. I checked in the namenode and jobtracker logs and both say:

INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call delete(/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system, true) from 127.0.0.1:61086: error: org.apache.hadoop.dfs.SafeModeException: Cannot delete /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically.
org.apache.hadoop.dfs.SafeModeException: Cannot delete /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1505)
        at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1477)
        at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:425)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

I think this is a continuation of my running problem. The nodes stay in safe mode, but won't come out, even after several minutes. I believe this is due to the fact that it keeps trying to contact a datanode that does not exist. Any suggestions on what I can do?

I have recently tried to reformat the hdfs, using bin/hadoop namenode -format. From the output directed to standard out, I thought this completed correctly:

Re-format filesystem in /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name ? 
(Y or N) Y 09/02/15 15:16:39 INFO fs.FSNamesystem: fsOwner=hadoop,staff,_lpadmin,com.apple.sharepoint.group.8,com.apple.sharepoint.group.3,com.apple.sharepoint.group.4,com.apple.sharepoint.group.2,com.apple.sharepoint.group.6,com.apple.sharepoint.group.9,com.apple.sharepoint.group.1,com.apple.sharepoint.group.5 09/02/15 15:16:39 INFO fs.FSNamesystem: supergroup=supergroup 09/02/15 15:16:39 INFO fs.FSNamesystem: isPermissionEnabled=true 09/02/15 15:16:39 INFO dfs.Storage: Image file of size 80 saved in 0 seconds. 09/02/15 15:16:39 INFO dfs.Storage: Storage directory /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name has been successfully formatted. 09/02/15 15:16:39 INFO dfs.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at loteria.cs.tamu.edu/128.194.143.170 / However, after reformatting, I find that I have the same problems. Thanks, SM On Fri, Feb 13, 2009 at 5:39 PM, james warren wrote: > Sandy - > > I suggest you take a look into your NameNode and DataNode logs. From the > information posted, these likely would be at > > > /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-namenode-loteria.cs.tamu.edu.log > > /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-jobtracker-loteria.cs.tamu.edu.log > > If the cause isn't obvious from what you see there, could you please post > the last few lines from each log? > > -jw > > On Fri, Feb 13, 2009 at 3:28 PM, Sandy wrote: > > > Hello, > > > > I would really appreciate any help I can get on this! I've suddenly ran > > into > > a very strange error. > > > > when I do: > > bin/start-all > > I get: > > hadoop$ bin/start-all.sh > > starting namenode, logging to > > > > > /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-namenode-loteria.cs.tamu.edu.out > > starting jobtracker, logging to > > > > > /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-jobtracker-loteria.cs.tamu.edu.out > > > > No datanode, secondary namenode or jobtracker are being started. 
> > > > When I try to upload anything on the dfs, I get a "node in safemode" > error > > (even after waiting 5 minutes), presumably because it's trying to reach a > > datanode that does not exist. The same "safemode" error occurs when I > try > > to run jobs. > > > > I have tried bin/stop-all and then bin/start-all again. I get the same > > problem! > > > > This is incredibly strange, since I was previously able to start and run > > jobs without any issue using this version on this machine. I am running > > jobs > > on a single Mac Pro running OS X 10.5 > > > > I have tried updating to hadoop-0.19.0, and I get the same problem. I > have > > even tried this using previous versions, and I'm getting the same > problem! > > > > Anyone have any idea why this suddenly could be happening? What am I > doing > > wrong? > > > > For convenience, I'm including portions of both conf/hadoop-env.sh and > > conf/hadoop-site.xml: > > > > --- hadoop-env.sh --- > > # Set Hadoop-specific environment variables here. > > > > # The only required environment var
Re: question about hadoop and amazon ec2 ?
1. They are related in that one can use EC2 to serve as the computation platform for Hadoop. Refer: http://wiki.apache.org/hadoop/AmazonEC2
2. Yes. Refer: http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
3. Because one can use EC2 to serve as the computation platform for Hadoop.

--nitesh On Sun, Feb 15, 2009 at 2:18 PM, buddha1021 wrote: > > hi: > What is the relationship between the hadoop and the amazon ec2 ? > Can hadoop run on the common pc (but not server ) directly ? > Why someone says hadoop run on the amazon ec2 ? > thanks! > -- > View this message in context: > http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p22020652.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > > -- Nitesh Bhatia Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar Gujarat "Life is never perfect. It just depends where you draw the line." visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Re: HDFS architecture based on GFS?
Hi Amandeep, Hadoop is definitely inspired by MapReduce/GFS and aims to provide those capabilities as an open-source project. HDFS is similar to GFS (large blocks, replication, etc); some notable things missing are read-write support in the middle of a file (unlikely to be provided because few Hadoop applications require it) and multiple appenders (the record append operation). You can read about HDFS architecture at http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce part of Hadoop interacts with HDFS in the same way that Google's MapReduce interacts with GFS (shipping computation to the data), although Hadoop MapReduce also supports running over other distributed filesystems. Matei On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana wrote: > Hi > > Is the HDFS architecture completely based on the Google Filesystem? If it > isnt, what are the differences between the two? > > Secondly, is the coupling between Hadoop and HDFS same as how it is between > the Google's version of Map Reduce and GFS? > > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz >
HDFS architecture based on GFS?
Hi

Is the HDFS architecture completely based on the Google Filesystem? If it isn't, what are the differences between the two?

Secondly, is the coupling between Hadoop and HDFS the same as that between Google's version of Map Reduce and GFS?

Amandeep

Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: can't edit the file that mounted by fuse_dfs by editor
I followed these instructions http://wiki.apache.org/hadoop/MountableHDFS and was able to get things working with 0.19.0 on Fedora. The only problem I ran into was the AMD64 issue on one of my boxes (see the note on the above link); I edited the Makefile and set OSARCH as suggested but couldn't get things working on my AMD box. Aside from that, are you certain that vi is a supported command with Fuse-DFS? It is not listed in the commands on the above link. I'd first see if you can get things working with one of the commands they mention (e.g., cp, ls, more, cat, find, less, rm, mkdir, mv, rmdir). John On Wed, Feb 11, 2009 at 8:25 PM, zhuweimin wrote: > Hey all > > I was trying to edit the file that mounted by fuse_dfs by vi editor, but > the > contents could not save. > The command is like the following: > [had...@vm-centos-5-shu-4 src]$ vi /mnt/dfs/test.txt > The error message from system log (/var/log/messages) is the following: > Feb 12 09:53:48 VM-CentOS-5-SHU-4 fuse_dfs: ERROR: could not connect open > file fuse_dfs.c:1340 > > I using the hadoop0.19.0 and fuse-dfs version 26 with centos5.2. > Does anyone have an idea as to what could be wrong! > > Thanks! > zhuweimin > > >
Re: Hostnames on MapReduce Web UI
Try commenting out the localhost definition in your /etc/hosts file. 2009/2/14 S D > I'm reviewing the task trackers on the web interface ( > http://jobtracker-hostname:50030/) for my cluster of 3 machines. The names > of the task trackers do not list real domain names; e.g., one of the task > trackers is listed as: > > tracker_localhost:localhost/127.0.0.1:48167 > > I believe that the networking on my machines is set correctly. What do I > need to configure so that the listing above will show the actual domain > name? This will help me in diagnosing where problems are occurring in my > cluster. Note that at the top of the page the hostname (in my case "storm") > is properly listed; e.g., > > storm Hadoop Machine List > > Thanks, > John > -- http://daily.appspot.com/food/
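To make the suggestion concrete: if /etc/hosts maps the machine's real hostname to the loopback address, the tasktracker resolves its own name to 127.0.0.1 and registers as tracker_localhost. A sketch of the change, using the "storm" hostname from the question (the LAN address below is hypothetical):

```
# Before: hostname resolves to loopback, so the tasktracker
# reports itself as "localhost"
127.0.0.1   localhost storm

# After: keep loopback for localhost only; map the real hostname
# to the machine's LAN address
127.0.0.1    localhost
192.168.1.10 storm
```

Restart the tasktracker after the change so it re-registers under the corrected name.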
Some Storage communication related questions
Hi, I have multiple questions: Does Hadoop use some parallel technique for CopyFromLocal and CopyToLocal (like DistCp), or is it simple single-stream writing? For Amazon S3 to local system communication, does Hadoop use a REST service interface or SOAP? Are there some new storage systems currently in the pipeline to be interfaced with Hadoop? Thanks, Wasim
Re: HDFS on non-identical nodes
Thanks Brian and Chen! I finally sorted out why the cluster stopped after running out of space: it is because of master failure due to disk space. Regarding the automatic balancer, I guess in our case the rate of copying is faster than the balancer rate; we found the balancer does start but couldn't perform its job. Anyway, thanks for your help! It helped me sort some things out. Cheers, Deepak On Thu, Feb 12, 2009 at 5:32 PM, He Chen wrote: > I think you should confirm your balancer is still running. Did you change > the threshold of the HDFS balancer? Maybe it is too large? > > The balancer will stop working when it meets any of 5 conditions: > > 1. Datanodes are balanced (obviously you are not this kind); > 2. No more blocks to be moved (all blocks on unbalanced nodes are busy or > recently used) > 3. No more blocks to be moved in 20 minutes and 5 consecutive attempts > 4. Another balancer is working > 5. I/O exception > > > The default setting is 10% for each datanode; for 1TB that is 100GB, for 3TB > it is 300GB, and for 60GB it is 6GB > > Hope this helps > > > On Thu, Feb 12, 2009 at 10:06 AM, Brian Bockelman wrote: >> >> On Feb 12, 2009, at 2:54 AM, Deepak wrote: >> >> Hi, >>> >>> We're running a Hadoop cluster on 4 nodes; our primary purpose is >>> to provide a distributed storage solution for internal >>> applications here at TellyTopia Inc. >>> >>> Our cluster consists of non-identical nodes (one with 1TB, another two >>> with 3TB, and one more with 60GB). While copying data onto HDFS we >>> noticed that the node with 60GB storage ran out of disk space and even the >>> balancer couldn't balance because the cluster was stopped. Now my >>> questions are: >>> >>> 1. Is Hadoop suitable for non-identical cluster nodes? >>> >> >> Yes. Our cluster has between 60GB and 40TB on our nodes. The majority >> have around 3TB. >> >> >>> 2. Is there any way to automatically balance the nodes? >>> >> >> We have a cron script which automatically starts the Balancer. It's dirty, >> but it works. >> >> >>> 3.
Why does the Hadoop cluster stop when one node runs out of disk? >>> >>> >> That's not normal. Trust me, if that was always true, we'd be perpetually >> screwed :) >> >> There might be some other underlying error you're missing... >> >> Brian >> >> >> Any further inputs are appreciated! >>> >>> Cheers, >>> Deepak >>> TellyTopia Inc. >>> >> >> > > > -- > Chen He > RCF CSE Dept. > University of Nebraska-Lincoln > US > -- Deepak TellyTopia Inc.
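Chen's threshold explanation can be made concrete with a small sketch. This is illustrative Python, not the actual Balancer source: a datanode counts as balanced when its disk utilization is within `threshold` percentage points of the cluster-wide average utilization (the default threshold is 10%):

```python
# Illustrative sketch of the HDFS balancer's threshold check (simplified,
# not the real implementation). capacity_used maps each node to a
# (capacity_gb, used_gb) pair.

def unbalanced_nodes(capacity_used, threshold=10.0):
    """Return nodes whose utilization deviates from the cluster-wide
    average by more than `threshold` percentage points."""
    total_cap = sum(cap for cap, _ in capacity_used.values())
    total_used = sum(used for _, used in capacity_used.values())
    avg = 100.0 * total_used / total_cap  # cluster-wide average utilization
    out = []
    for node, (cap, used) in capacity_used.items():
        util = 100.0 * used / cap
        if abs(util - avg) > threshold:
            out.append(node)
    return out

# A nearly full 60GB node stands out against larger, emptier nodes,
# as in the cluster described in this thread:
cluster = {"n1": (1000, 300), "n2": (3000, 900), "n3": (60, 58)}
print(unbalanced_nodes(cluster))
# -> ['n3']  (96.7% used vs. a ~31% cluster average)
```

This also shows why a small node can fill up long before the balancer considers the big nodes worth draining into: with a 10% threshold, the 60GB node must drift 10 percentage points from the average (only 6GB of data) before it is a candidate for balancing.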
question about hadoop and amazon ec2 ?
hi: What is the relationship between Hadoop and Amazon EC2? Can Hadoop run on a common PC (not a server) directly? Why do some people say Hadoop runs on Amazon EC2? thanks! -- View this message in context: http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p22020652.html Sent from the Hadoop core-user mailing list archive at Nabble.com.