Re: HDFS architecture based on GFS?

2009-02-15 Thread Rasit OZDAS
"If there was a
malicious process though, then I imagine it could talk to a datanode
directly and request a specific block."

I didn't understand the use of "malicious" here,
but any process using the HDFS API should first ask the NameNode where the
file's block replicas are.
Then - I assume - the NameNode returns the IP of the best DataNode (or all IPs),
and then the call to that specific DataNode is made.
Please correct me if I'm wrong.
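
(For reference, that read path through the public API looks roughly like the
sketch below - 0.19-era org.apache.hadoop.fs classes, with a made-up path.
FileSystem.get() picks up fs.default.name from the config, open() asks the
NameNode for block locations, and the returned stream then reads the bytes
directly from the DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads fs.default.name etc.
    FileSystem fs = FileSystem.get(conf);       // HDFS client handle
    // open() contacts the NameNode for block locations; the stream then
    // fetches the actual bytes from the DataNodes holding those blocks.
    FSDataInputStream in = fs.open(new Path("/user/example/part-00000"));
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      System.out.write(buf, 0, n);
    }
    in.close();
    fs.close();
  }
}

A malicious process would be one that skips this API and speaks to a DataNode
directly, which is what the quoted sentence is getting at.)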

Cheers,
Rasit

2009/2/16 Matei Zaharia :
> In general, yeah, the scripts can access any resource they want (within the
> permissions of the user that the task runs as). It's also possible to access
> HDFS from scripts because HDFS provides a FUSE interface that can make it
> look like a regular file system on the machine. (The FUSE module in turn
> talks to the namenode as a regular HDFS client.)
>
> On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana  wrote:
>
>> I dont know much about Hadoop streaming and have a quick question here.
>>
>> The snippets of code/programs that you attach into the map reduce job might
>> want to access outside resources (like you mentioned). Now these might not
>> need to go to the namenode right? For example a python script. How would it
>> access the data? Would it ask the parent java process (in the tasktracker)
>> to get the data or would it go and do stuff on its own?
>>
>>
>> Amandeep Khurana
>> Computer Science Graduate Student
>> University of California, Santa Cruz
>>
>>
>> On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia  wrote:
>>
>> > Nope, typically the JobTracker just starts the process, and the
>> tasktracker
>> > talks directly to the namenode to get a pointer to the datanode, and then
>> > directly to the datanode.
>> >
>> > On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana 
>> > wrote:
>> >
>> > > Alright.. Got it.
>> > >
>> > > Now, do the task trackers talk to the namenode and the data node
>> directly
>> > > or
>> > > do they go through the job tracker for it? So, if my code is such that
>> I
>> > > need to access more files from the hdfs, would the job tracker get
>> > involved
>> > > or not?
>> > >
>> > >
>> > >
>> > >
>> > > Amandeep Khurana
>> > > Computer Science Graduate Student
>> > > University of California, Santa Cruz
>> > >
>> > >
>> > > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia 
>> > wrote:
>> > >
>> > > > Normally, HDFS files are accessed through the namenode. If there was
>> a
>> > > > malicious process though, then I imagine it could talk to a datanode
>> > > > directly and request a specific block.
>> > > >
>> > > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana 
>> > > > wrote:
>> > > >
>> > > > > Ok. Got it.
>> > > > >
>> > > > > Now, when my job needs to access another file, does it go to the
>> > > Namenode
>> > > > > to
>> > > > > get the block ids? How does the java process know where the files
>> are
>> > > and
>> > > > > how to access them?
>> > > > >
>> > > > >
>> > > > > Amandeep Khurana
>> > > > > Computer Science Graduate Student
>> > > > > University of California, Santa Cruz
>> > > > >
>> > > > >
>> > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia > >
>> > > > wrote:
>> > > > >
>> > > > > > I mentioned this case because even jobs written in Java can use
>> the
>> > > > HDFS
>> > > > > > API
>> > > > > > to talk to the NameNode and access the filesystem. People often
>> do
>> > > this
>> > > > > > because their job needs to read a config file, some small data
>> > table,
>> > > > etc
>> > > > > > and use this information in its map or reduce functions. In this
>> > > case,
>> > > > > you
>> > > > > > open the second file separately in your mapper's init function
>> and
>> > > read
>> > > > > > whatever you need from it. In general I wanted to point out that
>> > you
>> > > > > can't
>> > > > > > know which files a job will access unless you look at its source
>> > code
>> > > > or
>> > > > > > monitor the calls it makes; the input file(s) you provide in the
>> > job
>> > > > > > description are a hint to the MapReduce framework to place your
>> job
>> > > on
>> > > > > > certain nodes, but it's reasonable for the job to access other
>> > files
>> > > as
>> > > > > > well.
>> > > > > >
>> > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <
>> > ama...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Another question that I have here - When the jobs run arbitrary
>> > > code
>> > > > > and
>> > > > > > > access data from the HDFS, do they go to the namenode to get
>> the
>> > > > block
>> > > > > > > information?
>> > > > > > >
>> > > > > > >
>> > > > > > > Amandeep Khurana
>> > > > > > > Computer Science Graduate Student
>> > > > > > > University of California, Santa Cruz
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <
>> > > ama...@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Assuming that the job is purely in Java and not involving
>> > > streaming
>> > > > > or
>> > > > > > > > pipes, wouldnt the resources (files) required by 

Re: HADOOP-2536 supports Oracle too?

2009-02-15 Thread sandhiya

@Amandeep
Hi,
I'm new to Hadoop and am trying to run a simple database connectivity
program on it. Could you please tell me how you went about it? My mail id is
"sandys_cr...@yahoo.com". A copy of your code that successfully connected
to MySQL would also be helpful.
Thanks,
Sandhiya
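
(For anyone else landing on this thread, a minimal, untested sketch of a
HADOOP-2536-style job against MySQL, using the 0.19 org.apache.hadoop.mapred.lib.db
API - the table, columns, connection URL and class names below are made up:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DbJobSketch {
  // One row of a hypothetical "employees" table.
  public static class EmployeeRecord implements Writable, DBWritable {
    long id;
    String name;
    // DBWritable: how a row is read from / written to JDBC.
    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      name = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, name);
    }
    // Writable: how the record is serialized between map and reduce.
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(name);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(DbJobSketch.class);
    job.setInputFormat(DBInputFormat.class);
    // JDBC driver class, connection URL, user, password.
    DBConfiguration.configureDB(job,
        "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/testdb", "user", "pass");
    // Value class, table name, optional WHERE conditions, ORDER BY column, fields.
    DBInputFormat.setInput(job, EmployeeRecord.class,
        "employees", null, "id", new String[] { "id", "name" });
    // ... set mapper/reducer/output as usual (map keys arrive as LongWritable
    // row ids, values as EmployeeRecord), then JobClient.runJob(job).
  }
}

The MySQL connector jar also has to be on the job's classpath, e.g. via -libjars
or by putting it in the cluster's lib/ directory.)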

Enis Soztutar-2 wrote:
> 
>  From the exception :
> 
> java.io.IOException: ORA-00933: SQL command not properly ended
> 
> I would broadly guess that the Oracle JDBC driver might be complaining that
> the statement does not end with ";", or something similar. You can
> 1. download the latest source code of hadoop
> 2. add a print statement printing the query (probably in
> DBInputFormat:119)
> 3. build the hadoop jar
> 4. use the new hadoop jar to see the actual SQL query
> 5. run the query on Oracle to see if it gives an error.
> 
> Enis
> 
> 
> Amandeep Khurana wrote:
>> Ok. I created the same database in a MySQL database and ran the same
>> hadoop
>> job against it. It worked. So, that means there is some Oracle specific
>> issue. It cant be an issue with the JDBC drivers since I am using the
>> same
>> drivers in a simple JDBC client.
>>
>> What could it be?
>>
>> Amandeep
>>
>>
>> Amandeep Khurana
>> Computer Science Graduate Student
>> University of California, Santa Cruz
>>
>>
>> On Wed, Feb 4, 2009 at 10:26 AM, Amandeep Khurana 
>> wrote:
>>
>>   
>>> Ok. I'm not sure if I got it correct. Are you saying, I should test the
>>> statement that hadoop creates directly with the database?
>>>
>>> Amandeep
>>>
>>>
>>> Amandeep Khurana
>>> Computer Science Graduate Student
>>> University of California, Santa Cruz
>>>
>>>
>>> On Wed, Feb 4, 2009 at 7:13 AM, Enis Soztutar 
>>> wrote:
>>>
>>> 
 Hadoop-2536 connects to the db via JDBC, so in theory it should work
 with
 proper jdbc drivers.
 It has been tested against MySQL, Hsqldb, and PostgreSQL, but not
 Oracle.

 To answer your earlier question, the actual SQL statements might not be
 recognized by Oracle, so I suggest the best way to test this is to
 insert
 print statements, and run the actual SQL statements against Oracle to
 see if
 the syntax is accepted.

 We would appreciate if you publish your results.

 Enis


 Amandeep Khurana wrote:

   
> Does the patch HADOOP-2536 support connecting to Oracle databases as
> well?
> Or is it just limited to MySQL?
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
>
> 
>>
>>   
> 
> 

-- 
View this message in context: 
http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22032715.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Namenode not listening for remote connections to port 9000

2009-02-15 Thread Michael Lynch

Hmmm - I checked all the /etc/hosts files, and they're all fine. Then I 
switched the
conf/hadoop-site.xml to specify ip addresses instead of host names. Then oddly 
enough
it starts working...

Now the funny thing is this: It's fine ssh-ing to the correct machines to start 
up
datanodes, but when the datanode thread tries to make the connection back to the
namenode (from within a java app I assume) it doesn't resolve the names 
correctly.

Just looking at this makes me think that the namenode does its ssh work in some
non-Java way, actually consulting the hosts file, while the datanode resolves
names from within Java, which doesn't seem to check the hosts file. Could this
be some Java funniness where it ignores the hosts file?

Now there's just one sucky thing about my setup: if I change my file with the
list of datanodes (the dfs.hosts property) to also use IP addresses instead of
hostnames, then it fails. So parts of my config specify hostnames, and other
parts specify IP addresses.
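
(For reference, the two settings being mixed here live in conf/hadoop-site.xml;
a sketch using the host name from this thread - the dfs.hosts path is made up:

<property>
  <name>fs.default.name</name>
  <value>hdfs://centos1:9000/</value>
</property>
<property>
  <name>dfs.hosts</name>
  <!-- file listing the datanodes permitted to connect, one host per line -->
  <value>/etc/hadoop/conf/datanode-allow-list</value>
</property>

The namenode checks incoming datanode registrations against that dfs.hosts
list, which might be why switching only that file to IP addresses behaved
differently.)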

Oh well - for development purposes this is good enough, 'cause out in the real 
world
I won't be using the hosts file to string it all together.

Thanks for the responses.


Mark Kerzner wrote:

I had a problem that it listened only on 8020, even though I told it to use
9000

On Fri, Feb 13, 2009 at 7:50 AM, Norbert Burger wrote:


On Fri, Feb 13, 2009 at 8:37 AM, Steve Loughran  wrote:


Michael Lynch wrote:


Hi,

As far as I can tell I've followed the setup instructions for a hadoop
cluster to the letter,
but I find that the datanodes can't connect to the namenode on port 9000
because it is only
listening for connections from localhost.

In my case, the namenode is called centos1, and the datanode is called
centos2. They are
centos 5.1 servers with an unmodified sun java 6 runtime.


fs.default.name takes a URL to the filesystem, such as
hdfs://centos1:9000/

If the machine is only binding to localhost, that may mean DNS fun. Try a
fully qualified name instead


(fs.default.name is defined in conf/hadoop-site.xml, overriding entries
from
conf/hadoop-default.xml).

Also, check your /etc/hosts file on both machines.  Could be that you have an
incorrect setup where both localhost and the namenode hostname (centos1)
are
aliased to 127.0.0.1.

Norbert





Re: Hostnames on MapReduce Web UI

2009-02-15 Thread S D
Thanks, this did it. I changed my /etc/hosts file on each node from
 127.0.0.1 localhost localhost.localdomain
 127.0.0.1 <hostname>
to just the reverse order,
 127.0.0.1 <hostname>
 127.0.0.1 localhost localhost.localdomain
and that did the trick! I vaguely recall from somewhere that I need the
localhost localhost.localdomain line, so I thought I'd better avoid removing it
altogether.

Thanks,
John



On Sun, Feb 15, 2009 at 10:38 AM, Nick Cen  wrote:

> Try comment out te localhost definition in your /etc/hosts file.
>
> 2009/2/14 S D 
>
> > I'm reviewing the task trackers on the web interface (
> > http://jobtracker-hostname:50030/) for my cluster of 3 machines. The
> names
> > of the task trackers do not list real domain names; e.g., one of the task
> > trackers is listed as:
> >
> > tracker_localhost:localhost/127.0.0.1:48167
> >
> > I believe that the networking on my machines is set correctly. What do I
> > need to configure so that the listing above will show the actual domain
> > name? This will help me in diagnosing where problems are occurring in my
> > cluster. Note that at the top of the page the hostname (in my case
> "storm")
> > is properly listed; e.g.,
> >
> > storm Hadoop Machine List
> >
> > Thanks,
> > John
> >
>
>
>
> --
> http://daily.appspot.com/food/
>


Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
In general, yeah, the scripts can access any resource they want (within the
permissions of the user that the task runs as). It's also possible to access
HDFS from scripts because HDFS provides a FUSE interface that can make it
look like a regular file system on the machine. (The FUSE module in turn
talks to the namenode as a regular HDFS client.)
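
(A streaming script can also skip the FUSE mount and just shell out to the
command-line client, which goes through the namenode like any other HDFS
client - e.g. something like

hadoop dfs -cat /user/example/side-data.txt

from inside the task, assuming the hadoop binary and cluster config are
available on the task machines; the path here is made up.)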

On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana  wrote:

> I dont know much about Hadoop streaming and have a quick question here.
>
> The snippets of code/programs that you attach into the map reduce job might
> want to access outside resources (like you mentioned). Now these might not
> need to go to the namenode right? For example a python script. How would it
> access the data? Would it ask the parent java process (in the tasktracker)
> to get the data or would it go and do stuff on its own?
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia  wrote:
>
> > Nope, typically the JobTracker just starts the process, and the
> tasktracker
> > talks directly to the namenode to get a pointer to the datanode, and then
> > directly to the datanode.
> >
> > On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana 
> > wrote:
> >
> > > Alright.. Got it.
> > >
> > > Now, do the task trackers talk to the namenode and the data node
> directly
> > > or
> > > do they go through the job tracker for it? So, if my code is such that
> I
> > > need to access more files from the hdfs, would the job tracker get
> > involved
> > > or not?
> > >
> > >
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia 
> > wrote:
> > >
> > > > Normally, HDFS files are accessed through the namenode. If there was
> a
> > > > malicious process though, then I imagine it could talk to a datanode
> > > > directly and request a specific block.
> > > >
> > > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana 
> > > > wrote:
> > > >
> > > > > Ok. Got it.
> > > > >
> > > > > Now, when my job needs to access another file, does it go to the
> > > Namenode
> > > > > to
> > > > > get the block ids? How does the java process know where the files
> are
> > > and
> > > > > how to access them?
> > > > >
> > > > >
> > > > > Amandeep Khurana
> > > > > Computer Science Graduate Student
> > > > > University of California, Santa Cruz
> > > > >
> > > > >
> > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia  >
> > > > wrote:
> > > > >
> > > > > > I mentioned this case because even jobs written in Java can use
> the
> > > > HDFS
> > > > > > API
> > > > > > to talk to the NameNode and access the filesystem. People often
> do
> > > this
> > > > > > because their job needs to read a config file, some small data
> > table,
> > > > etc
> > > > > > and use this information in its map or reduce functions. In this
> > > case,
> > > > > you
> > > > > > open the second file separately in your mapper's init function
> and
> > > read
> > > > > > whatever you need from it. In general I wanted to point out that
> > you
> > > > > can't
> > > > > > know which files a job will access unless you look at its source
> > code
> > > > or
> > > > > > monitor the calls it makes; the input file(s) you provide in the
> > job
> > > > > > description are a hint to the MapReduce framework to place your
> job
> > > on
> > > > > > certain nodes, but it's reasonable for the job to access other
> > files
> > > as
> > > > > > well.
> > > > > >
> > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <
> > ama...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Another question that I have here - When the jobs run arbitrary
> > > code
> > > > > and
> > > > > > > access data from the HDFS, do they go to the namenode to get
> the
> > > > block
> > > > > > > information?
> > > > > > >
> > > > > > >
> > > > > > > Amandeep Khurana
> > > > > > > Computer Science Graduate Student
> > > > > > > University of California, Santa Cruz
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <
> > > ama...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Assuming that the job is purely in Java and not involving
> > > streaming
> > > > > or
> > > > > > > > pipes, wouldnt the resources (files) required by the job as
> > > inputs
> > > > be
> > > > > > > known
> > > > > > > > beforehand? So, if the map task is accessing a second file,
> how
> > > > does
> > > > > it
> > > > > > > make
> > > > > > > > it different except that there are multiple files. The
> > JobTracker
> > > > > would
> > > > > > > know
> > > > > > > > beforehand that multiple files would be accessed. Right?
> > > > > > > >
> > > > > > > > I am slightly confused why you have mentioned this case
> > > > separately...
> > > > > > Can
> > > > > > > > you elaborate on it a little bit?
> > > > > > > >
> > > > > > > > Amandeep
> >

Re: Race Condition?

2009-02-15 Thread S D
I'm having difficulty capturing the output of any of the dfs commands
(either in Ruby or on the command line). Supposedly the output is being
sent to stdout, yet just running any of the commands on the command line does
not display the output, nor does redirecting to a file (e.g., hadoop dfs
-copyToLocal src dest > out.txt). I'm not sure what I'm missing here...

John

On Sun, Feb 15, 2009 at 11:28 PM, Matei Zaharia  wrote:

> I would capture the output of the dfs -copyToLocal command, because I still
> think that is the most likely cause of the data not making it. I don't know
> how to capture this output in Ruby but I'm sure it's possible. You want to
> capture both standard out and standard error.
> One other slim possibility is that if your localdir is a fixed absolute
> path, multiple map tasks on the machine may be trying to access it
> concurrently, and maybe one of them deletes it when it's done and one
> doesn't. Normally each task should run in its own temp directory though.
>
> On Sun, Feb 15, 2009 at 2:51 PM, S D  wrote:
>
> > I was not able to determine the command shell return value for
> >
> > hadoop dfs -copyToLocal #{s3dir} #{localdir}
> >
> > but I did print out several variables after the call and determined that
> > the
> > call apparently did not go through successfully. In particular, prior to
> my
> > processData(localdir) command I use Ruby's puts to print out the contents
> > of
> > several directories including 'localdir' and '../localdir' - here is the
> > weird thing: if I execute the following
> > list = `ls -l "#{localdir}"`
> > puts "List: #{list}"
> > (where 'localdir' is the directory I need as an arg for processData) the
> > processData command will execute properly. At first I thought that
> running
> > the puts command was allowing enough time to elapse for a race condition
> to
> > be avoided so that 'localdir' was ready when the processData command was
> > called (I know that in certain ways that doesn't make sense given that
> > hadoop dfs -copyToLocal blocks until it completes...) but then I tried
> > other
> > time consuming commands such as
> > list = `ls -l "../#{localdir}"`
> > puts "List: #{list}"
> > and running processData(localdir) led to an error:
> > 'localdir' not found
> >
> > Any clues on what could be going on?
> >
> > Thanks,
> > John
> >
> >
> >
> > On Sat, Feb 14, 2009 at 6:45 PM, Matei Zaharia 
> wrote:
> >
> > > Have you logged the output of the dfs command to see whether it's
> always
> > > succeeded the copy?
> > >
> > > On Sat, Feb 14, 2009 at 2:46 PM, S D  wrote:
> > >
> > > > In my Hadoop 0.19.0 program each map function is assigned a directory
> > > > (representing a data location in my S3 datastore). The first thing
> each
> > > map
> > > > function does is copy the particular S3 data to the local machine
> that
> > > the
> > > > map task is running on and then being processing the data; e.g.,
> > > >
> > > > command = "hadoop dfs -copyToLocal #{s3dir} #{localdir}"
> > > > system "#{command}"
> > > >
> > > > In the above, "s3dir" is a directory that creates "localdir" - my
> > > > expectation is that "localdir" is created in the work directory for
> the
> > > > particular task attempt. Following this copy command I then run a
> > > function
> > > > that processes the data; e.g.,
> > > >
> > > > processData(localdir)
> > > >
> > > > In some instances my map/reduce program crashes and when I examine
> the
> > > logs
> > > > I get a message saying that "localdir" can not be found. This
> confuses
> > me
> > > > since the hadoop shell command above is blocking so that localdir
> > should
> > > > exist by the time processData() is called. I've found that if I add
> in
> > > some
> > > > diagnostic lines prior to processData() such as puts statements to
> > print
> > > > out
> > > > variables, I never run into the problem of the localdir not being
> > found.
> > > It
> > > > is almost as if localdir needs time to be created before the call to
> > > > processData().
> > > >
> > > > Has anyone encountered anything like this? Any suggestions on what
> > could
> > > be
> > > > wrong are appreciated.
> > > >
> > > > Thanks,
> > > > John
> > > >
> > >
> >
>


Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
I don't know much about Hadoop streaming and have a quick question here.

The snippets of code/programs that you attach into the map reduce job might
want to access outside resources (like you mentioned). Now these might not
need to go to the namenode right? For example a python script. How would it
access the data? Would it ask the parent java process (in the tasktracker)
to get the data or would it go and do stuff on its own?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia  wrote:

> Nope, typically the JobTracker just starts the process, and the tasktracker
> talks directly to the namenode to get a pointer to the datanode, and then
> directly to the datanode.
>
> On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana 
> wrote:
>
> > Alright.. Got it.
> >
> > Now, do the task trackers talk to the namenode and the data node directly
> > or
> > do they go through the job tracker for it? So, if my code is such that I
> > need to access more files from the hdfs, would the job tracker get
> involved
> > or not?
> >
> >
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia 
> wrote:
> >
> > > Normally, HDFS files are accessed through the namenode. If there was a
> > > malicious process though, then I imagine it could talk to a datanode
> > > directly and request a specific block.
> > >
> > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana 
> > > wrote:
> > >
> > > > Ok. Got it.
> > > >
> > > > Now, when my job needs to access another file, does it go to the
> > Namenode
> > > > to
> > > > get the block ids? How does the java process know where the files are
> > and
> > > > how to access them?
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia 
> > > wrote:
> > > >
> > > > > I mentioned this case because even jobs written in Java can use the
> > > HDFS
> > > > > API
> > > > > to talk to the NameNode and access the filesystem. People often do
> > this
> > > > > because their job needs to read a config file, some small data
> table,
> > > etc
> > > > > and use this information in its map or reduce functions. In this
> > case,
> > > > you
> > > > > open the second file separately in your mapper's init function and
> > read
> > > > > whatever you need from it. In general I wanted to point out that
> you
> > > > can't
> > > > > know which files a job will access unless you look at its source
> code
> > > or
> > > > > monitor the calls it makes; the input file(s) you provide in the
> job
> > > > > description are a hint to the MapReduce framework to place your job
> > on
> > > > > certain nodes, but it's reasonable for the job to access other
> files
> > as
> > > > > well.
> > > > >
> > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <
> ama...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Another question that I have here - When the jobs run arbitrary
> > code
> > > > and
> > > > > > access data from the HDFS, do they go to the namenode to get the
> > > block
> > > > > > information?
> > > > > >
> > > > > >
> > > > > > Amandeep Khurana
> > > > > > Computer Science Graduate Student
> > > > > > University of California, Santa Cruz
> > > > > >
> > > > > >
> > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <
> > ama...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Assuming that the job is purely in Java and not involving
> > streaming
> > > > or
> > > > > > > pipes, wouldnt the resources (files) required by the job as
> > inputs
> > > be
> > > > > > known
> > > > > > > beforehand? So, if the map task is accessing a second file, how
> > > does
> > > > it
> > > > > > make
> > > > > > > it different except that there are multiple files. The
> JobTracker
> > > > would
> > > > > > know
> > > > > > > beforehand that multiple files would be accessed. Right?
> > > > > > >
> > > > > > > I am slightly confused why you have mentioned this case
> > > separately...
> > > > > Can
> > > > > > > you elaborate on it a little bit?
> > > > > > >
> > > > > > > Amandeep
> > > > > > >
> > > > > > >
> > > > > > > Amandeep Khurana
> > > > > > > Computer Science Graduate Student
> > > > > > > University of California, Santa Cruz
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <
> > ma...@cloudera.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > >> Typically the data flow is like this:1) Client submits a job
> > > > > description
> > > > > > >> to
> > > > > > >> the JobTracker.
> > > > > > >> 2) JobTracker figures out block locations for the input
> file(s)
> > by
> > > > > > talking
> > > > > > >> to HDFS NameNode.
> > > > > > >> 3) JobTracker creates a job description file in HDFS which
> will
> > be
> > > > > read
> > > 

Re: JvmMetrics

2009-02-15 Thread Brian Bockelman

Hey David --

In case no one has pointed you to this, you can submit this through
JIRA.
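
(For anyone following along, the limitation David describes below comes from
conf/hadoop-metrics.properties binding a single context class per metrics
group, roughly like this - host name and file path are made up:

# send the "jvm" group to ganglia ...
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=ganglia-host:8649

# ... or dump it to a file instead - but not both at once
# jvm.class=org.apache.hadoop.metrics.file.FileContext
# jvm.fileName=/tmp/jvmmetrics.log
# jvm.period=10

which is exactly what the proposed patch works around.)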


Brian

On Feb 14, 2009, at 12:07 AM, David Alves wrote:


Hi

	I ran into a use case where I need to keep two contexts for
metrics: one being ganglia and the other being a file context (to do
offline metrics analysis).
	I altered JvmMetrics to allow the user to supply a context
instead of it getting one by name, and altered the file context so it
can timestamp metrics collection (like log4j does).

Would be glad to submit a patch if anyone is interested.

Regards




Re: Race Condition?

2009-02-15 Thread Matei Zaharia
I would capture the output of the dfs -copyToLocal command, because I still
think that is the most likely cause of the data not making it. I don't know
how to capture this output in Ruby but I'm sure it's possible. You want to
capture both standard out and standard error.
One other slim possibility is that if your localdir is a fixed absolute
path, multiple map tasks on the machine may be trying to access it
concurrently, and maybe one of them deletes it when it's done and one
doesn't. Normally each task should run in its own temp directory though.
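
(On the Ruby side, a minimal sketch - backticks capture stdout, the 2>&1 folds
stderr into it, and $? carries the exit status; s3dir and localdir are the
variables from the quoted code below:

command = "hadoop dfs -copyToLocal #{s3dir} #{localdir} 2>&1"
output  = `#{command}`           # stdout + stderr of the copy
status  = $?.exitstatus          # non-zero means the copy failed
raise "copyToLocal failed (#{status}): #{output}" unless status == 0

Note that the fs shell prints nothing on a successful copy, so an empty capture
with a zero exit status just means it worked.)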

On Sun, Feb 15, 2009 at 2:51 PM, S D  wrote:

> I was not able to determine the command shell return value for
>
> hadoop dfs -copyToLocal #{s3dir} #{localdir}
>
> but I did print out several variables after the call and determined that
> the
> call apparently did not go through successfully. In particular, prior to my
> processData(localdir) command I use Ruby's puts to print out the contents
> of
> several directories including 'localdir' and '../localdir' - here is the
> weird thing: if I execute the following
> list = `ls -l "#{localdir}"`
> puts "List: #{list}"
> (where 'localdir' is the directory I need as an arg for processData) the
> processData command will execute properly. At first I thought that running
> the puts command was allowing enough time to elapse for a race condition to
> be avoided so that 'localdir' was ready when the processData command was
> called (I know that in certain ways that doesn't make sense given that
> hadoop dfs -copyToLocal blocks until it completes...) but then I tried
> other
> time consuming commands such as
> list = `ls -l "../#{localdir}"`
> puts "List: #{list}"
> and running processData(localdir) led to an error:
> 'localdir' not found
>
> Any clues on what could be going on?
>
> Thanks,
> John
>
>
>
> On Sat, Feb 14, 2009 at 6:45 PM, Matei Zaharia  wrote:
>
> > Have you logged the output of the dfs command to see whether it's always
> > succeeded the copy?
> >
> > On Sat, Feb 14, 2009 at 2:46 PM, S D  wrote:
> >
> > > In my Hadoop 0.19.0 program each map function is assigned a directory
> > > (representing a data location in my S3 datastore). The first thing each
> > map
> > > function does is copy the particular S3 data to the local machine that
> > the
> > > map task is running on and then being processing the data; e.g.,
> > >
> > > command = "hadoop dfs -copyToLocal #{s3dir} #{localdir}"
> > > system "#{command}"
> > >
> > > In the above, "s3dir" is a directory that creates "localdir" - my
> > > expectation is that "localdir" is created in the work directory for the
> > > particular task attempt. Following this copy command I then run a
> > function
> > > that processes the data; e.g.,
> > >
> > > processData(localdir)
> > >
> > > In some instances my map/reduce program crashes and when I examine the
> > logs
> > > I get a message saying that "localdir" can not be found. This confuses
> me
> > > since the hadoop shell command above is blocking so that localdir
> should
> > > exist by the time processData() is called. I've found that if I add in
> > some
> > > diagnostic lines prior to processData() such as puts statements to
> print
> > > out
> > > variables, I never run into the problem of the localdir not being
> found.
> > It
> > > is almost as if localdir needs time to be created before the call to
> > > processData().
> > >
> > > Has anyone encountered anything like this? Any suggestions on what
> could
> > be
> > > wrong are appreciated.
> > >
> > > Thanks,
> > > John
> > >
> >
>


Re: HDFS on non-identical nodes

2009-02-15 Thread Brian Bockelman


On Feb 15, 2009, at 3:21 AM, Deepak wrote:


Thanks Brain and Chen!

I finally sorted out why the cluster was being stopped after running out
of space. It's because of a master failure due to disk space.

Regarding the automatic balancer, I guess in our case the rate of copying is
faster than the balancer's rate; we found the balancer does start but couldn't
perform its job.


There are parameters you can set which control how quickly the  
balancer is allowed to copy files about.
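
(The knobs I'm aware of: the per-datanode bandwidth cap in hadoop-site.xml and
the threshold argument to the balancer script - names as of the 0.18/0.19 era,
so double-check them against your version:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- bytes per second each datanode may spend on rebalancing; default ~1 MB/s -->
  <value>10485760</value>
</property>

bin/start-balancer.sh -threshold 5

Raising the bandwidth cap lets the balancer keep up better with a heavy write
load, at the cost of competing with it.)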


Nevertheless, you shouldn't rely on it to work for anything  
performance critical -- you'll probably want to ensure there's enough  
space around to do your work in the short-term.


Brian




Anyways thanks for your help! It helped me sort out somethings.

Cheers,
Deepak

On Thu, Feb 12, 2009 at 5:32 PM, He Chen  wrote:
I think you should confirm your balancer is still running. Do you  
change

the threshold of the HDFS balancer? May be too large?

The balancer will stop working when it meets 5 conditions:

1. Datanodes are balanced (obviously you are not this kind);
2. No more block to be moved (all blocks on unbalanced nodes are  
busy or

recently used)
3. No more block to be moved in 20 minutes and 5 times consecutive  
attempts

4. Another balancer is working
5. I/O exception


The default setting is 10% for each datanode: for 1TB that is 100GB, for 3TB
it is 300GB, and for 60GB it is 6GB.

Hope helpful


On Thu, Feb 12, 2009 at 10:06 AM, Brian Bockelman >wrote:




On Feb 12, 2009, at 2:54 AM, Deepak wrote:

Hi,


We're running Hadoop cluster on 4 nodes, our primary purpose of
running is to provide distributed storage solution for internal
applications here in TellyTopia Inc.

Our cluster consists of non-identical nodes (one with 1TB another  
two

with 3 TB and one more with 60GB) while copying data on HDFS we
noticed that node with 60GB storage ran out of disk-space and even
balancer couldn't balance because cluster was stopped. Now my
questions are

1. Is Hadoop is suitable for non-identical cluster nodes?



Yes.  Our cluster has between 60GB and 40TB on our nodes.  The  
majority

have around 3TB.



2. Is there any way to automatically balancing of nodes?



We have a cron script which automatically starts the Balancer.   
It's dirty,

but it works.



3. Why Hadoop cluster stops when one node ran out of disk?


That's not normal.  Trust me, if that was always true, we'd be  
perpetually

screwed :)

There might be some other underlying error you're missing...

Brian


Any further inputs are appreciated!


Cheers,
Deepak
TellyTopia Inc.







--
Chen He
RCF CSE Dept.
University of Nebraska-Lincoln
US





--
Deepak
TellyTopia Inc.




Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Nope, typically the JobTracker just starts the process, and the tasktracker
talks directly to the namenode to get a pointer to the datanode, and then
directly to the datanode.

On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana  wrote:

> Alright.. Got it.
>
> Now, do the task trackers talk to the namenode and the data node directly
> or
> do they go through the job tracker for it? So, if my code is such that I
> need to access more files from the hdfs, would the job tracker get involved
> or not?
>
>
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia  wrote:
>
> > Normally, HDFS files are accessed through the namenode. If there was a
> > malicious process though, then I imagine it could talk to a datanode
> > directly and request a specific block.
> >
> > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana 
> > wrote:
> >
> > > Ok. Got it.
> > >
> > > Now, when my job needs to access another file, does it go to the
> Namenode
> > > to
> > > get the block ids? How does the java process know where the files are
> and
> > > how to access them?
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia 
> > wrote:
> > >
> > > > I mentioned this case because even jobs written in Java can use the
> > HDFS
> > > > API
> > > > to talk to the NameNode and access the filesystem. People often do
> this
> > > > because their job needs to read a config file, some small data table,
> > etc
> > > > and use this information in its map or reduce functions. In this
> case,
> > > you
> > > > open the second file separately in your mapper's init function and
> read
> > > > whatever you need from it. In general I wanted to point out that you
> > > can't
> > > > know which files a job will access unless you look at its source code
> > or
> > > > monitor the calls it makes; the input file(s) you provide in the job
> > > > description are a hint to the MapReduce framework to place your job
> on
> > > > certain nodes, but it's reasonable for the job to access other files
> as
> > > > well.
> > > >
> > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana 
> > > > wrote:
> > > >
> > > > > Another question that I have here - When the jobs run arbitrary
> code
> > > and
> > > > > access data from the HDFS, do they go to the namenode to get the
> > block
> > > > > information?
> > > > >
> > > > >
> > > > > Amandeep Khurana
> > > > > Computer Science Graduate Student
> > > > > University of California, Santa Cruz
> > > > >
> > > > >
> > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <
> ama...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Assuming that the job is purely in Java and not involving
> streaming
> > > or
> > > > > > pipes, wouldnt the resources (files) required by the job as
> inputs
> > be
> > > > > known
> > > > > > beforehand? So, if the map task is accessing a second file, how
> > does
> > > it
> > > > > make
> > > > > > it different except that there are multiple files. The JobTracker
> > > would
> > > > > know
> > > > > > beforehand that multiple files would be accessed. Right?
> > > > > >
> > > > > > I am slightly confused why you have mentioned this case
> > separately...
> > > > Can
> > > > > > you elaborate on it a little bit?
> > > > > >
> > > > > > Amandeep
> > > > > >
> > > > > >
> > > > > > Amandeep Khurana
> > > > > > Computer Science Graduate Student
> > > > > > University of California, Santa Cruz
> > > > > >
> > > > > >
> > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <
> ma...@cloudera.com
> > >
> > > > > wrote:
> > > > > >
> > > > > >> Typically the data flow is like this:1) Client submits a job
> > > > description
> > > > > >> to
> > > > > >> the JobTracker.
> > > > > >> 2) JobTracker figures out block locations for the input file(s)
> by
> > > > > talking
> > > > > >> to HDFS NameNode.
> > > > > >> 3) JobTracker creates a job description file in HDFS which will
> be
> > > > read
> > > > > by
> > > > > >> the nodes to copy over the job's code etc.
> > > > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with
> > the
> > > > > >> appropriate data blocks.
> > > > > >> 5) After running, maps create intermediate output files on those
> > > > slaves.
> > > > > >> These are not in HDFS, they're in some temporary storage used by
> > > > > >> MapReduce.
> > > > > >> 6) JobTracker starts reduces on a series of slaves, which copy
> > over
> > > > the
> > > > > >> appropriate map outputs, apply the reduce function, and write
> the
> > > > > outputs
> > > > > >> to
> > > > > >> HDFS (one output file per reducer).
> > > > > >> 7) Some logs for the job may also be put into HDFS by the
> > > JobTracker.
> > > > > >>
> > > > > >> However, there is a big caveat, which is that the map and reduce
> > > tasks
> > > > > run
> > > > > >> arbitrary code. It is not unusu

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Alright.. Got it.

Now, do the task trackers talk to the namenode and the data node directly or
do they go through the job tracker for it? So, if my code is such that I
need to access more files from the hdfs, would the job tracker get involved
or not?




Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia  wrote:

> Normally, HDFS files are accessed through the namenode. If there was a
> malicious process though, then I imagine it could talk to a datanode
> directly and request a specific block.
>
> On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana 
> wrote:
>
> > Ok. Got it.
> >
> > Now, when my job needs to access another file, does it go to the Namenode
> > to
> > get the block ids? How does the java process know where the files are and
> > how to access them?
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia 
> wrote:
> >
> > > I mentioned this case because even jobs written in Java can use the
> HDFS
> > > API
> > > to talk to the NameNode and access the filesystem. People often do this
> > > because their job needs to read a config file, some small data table,
> etc
> > > and use this information in its map or reduce functions. In this case,
> > you
> > > open the second file separately in your mapper's init function and read
> > > whatever you need from it. In general I wanted to point out that you
> > can't
> > > know which files a job will access unless you look at its source code
> or
> > > monitor the calls it makes; the input file(s) you provide in the job
> > > description are a hint to the MapReduce framework to place your job on
> > > certain nodes, but it's reasonable for the job to access other files as
> > > well.
> > >
> > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana 
> > > wrote:
> > >
> > > > Another question that I have here - When the jobs run arbitrary code
> > and
> > > > access data from the HDFS, do they go to the namenode to get the
> block
> > > > information?
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana 
> > > > wrote:
> > > >
> > > > > Assuming that the job is purely in Java and not involving streaming
> > or
> > > > > pipes, wouldnt the resources (files) required by the job as inputs
> be
> > > > known
> > > > > beforehand? So, if the map task is accessing a second file, how
> does
> > it
> > > > make
> > > > > it different except that there are multiple files. The JobTracker
> > would
> > > > know
> > > > > beforehand that multiple files would be accessed. Right?
> > > > >
> > > > > I am slightly confused why you have mentioned this case
> separately...
> > > Can
> > > > > you elaborate on it a little bit?
> > > > >
> > > > > Amandeep
> > > > >
> > > > >
> > > > > Amandeep Khurana
> > > > > Computer Science Graduate Student
> > > > > University of California, Santa Cruz
> > > > >
> > > > >
> > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia  >
> > > > wrote:
> > > > >
> > > > >> Typically the data flow is like this:1) Client submits a job
> > > description
> > > > >> to
> > > > >> the JobTracker.
> > > > >> 2) JobTracker figures out block locations for the input file(s) by
> > > > talking
> > > > >> to HDFS NameNode.
> > > > >> 3) JobTracker creates a job description file in HDFS which will be
> > > read
> > > > by
> > > > >> the nodes to copy over the job's code etc.
> > > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with
> the
> > > > >> appropriate data blocks.
> > > > >> 5) After running, maps create intermediate output files on those
> > > slaves.
> > > > >> These are not in HDFS, they're in some temporary storage used by
> > > > >> MapReduce.
> > > > >> 6) JobTracker starts reduces on a series of slaves, which copy
> over
> > > the
> > > > >> appropriate map outputs, apply the reduce function, and write the
> > > > outputs
> > > > >> to
> > > > >> HDFS (one output file per reducer).
> > > > >> 7) Some logs for the job may also be put into HDFS by the
> > JobTracker.
> > > > >>
> > > > >> However, there is a big caveat, which is that the map and reduce
> > tasks
> > > > run
> > > > >> arbitrary code. It is not unusual to have a map that opens a
> second
> > > HDFS
> > > > >> file to read some information (e.g. for doing a join of a small
> > table
> > > > >> against a big file). If you use Hadoop Streaming or Pipes to write
> a
> > > job
> > > > >> in
> > > > >> Python, Ruby, C, etc, then you are launching arbitrary processes
> > which
> > > > may
> > > > >> also access external resources in this manner. Some people also
> > > > read/write
> > > > >> to DBs (e.g. MySQL) from their tasks. A comprehensive security
> > > solution
> > > > >> would ideally deal with these cases too.
> > >

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Normally, HDFS files are accessed through the namenode. If there was a
malicious process though, then I imagine it could talk to a datanode
directly and request a specific block.

On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana  wrote:

> Ok. Got it.
>
> Now, when my job needs to access another file, does it go to the Namenode
> to
> get the block ids? How does the java process know where the files are and
> how to access them?
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia  wrote:
>
> > I mentioned this case because even jobs written in Java can use the HDFS
> > API
> > to talk to the NameNode and access the filesystem. People often do this
> > because their job needs to read a config file, some small data table, etc
> > and use this information in its map or reduce functions. In this case,
> you
> > open the second file separately in your mapper's init function and read
> > whatever you need from it. In general I wanted to point out that you
> can't
> > know which files a job will access unless you look at its source code or
> > monitor the calls it makes; the input file(s) you provide in the job
> > description are a hint to the MapReduce framework to place your job on
> > certain nodes, but it's reasonable for the job to access other files as
> > well.
> >
> > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana 
> > wrote:
> >
> > > Another question that I have here - When the jobs run arbitrary code
> and
> > > access data from the HDFS, do they go to the namenode to get the block
> > > information?
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana 
> > > wrote:
> > >
> > > > Assuming that the job is purely in Java and not involving streaming
> or
> > > > pipes, wouldnt the resources (files) required by the job as inputs be
> > > known
> > > > beforehand? So, if the map task is accessing a second file, how does
> it
> > > make
> > > > it different except that there are multiple files. The JobTracker
> would
> > > know
> > > > beforehand that multiple files would be accessed. Right?
> > > >
> > > > I am slightly confused why you have mentioned this case separately...
> > Can
> > > > you elaborate on it a little bit?
> > > >
> > > > Amandeep
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia 
> > > wrote:
> > > >
> > > >> Typically the data flow is like this:1) Client submits a job
> > description
> > > >> to
> > > >> the JobTracker.
> > > >> 2) JobTracker figures out block locations for the input file(s) by
> > > talking
> > > >> to HDFS NameNode.
> > > >> 3) JobTracker creates a job description file in HDFS which will be
> > read
> > > by
> > > >> the nodes to copy over the job's code etc.
> > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
> > > >> appropriate data blocks.
> > > >> 5) After running, maps create intermediate output files on those
> > slaves.
> > > >> These are not in HDFS, they're in some temporary storage used by
> > > >> MapReduce.
> > > >> 6) JobTracker starts reduces on a series of slaves, which copy over
> > the
> > > >> appropriate map outputs, apply the reduce function, and write the
> > > outputs
> > > >> to
> > > >> HDFS (one output file per reducer).
> > > >> 7) Some logs for the job may also be put into HDFS by the
> JobTracker.
> > > >>
> > > >> However, there is a big caveat, which is that the map and reduce
> tasks
> > > run
> > > >> arbitrary code. It is not unusual to have a map that opens a second
> > HDFS
> > > >> file to read some information (e.g. for doing a join of a small
> table
> > > >> against a big file). If you use Hadoop Streaming or Pipes to write a
> > job
> > > >> in
> > > >> Python, Ruby, C, etc, then you are launching arbitrary processes
> which
> > > may
> > > >> also access external resources in this manner. Some people also
> > > read/write
> > > >> to DBs (e.g. MySQL) from their tasks. A comprehensive security
> > solution
> > > >> would ideally deal with these cases too.
> > > >>
> > > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana  >
> > > >> wrote:
> > > >>
> > > >> > A quick question here. How does a typical hadoop job work at the
> > > system
> > > >> > level? What are the various interactions and how does the data
> flow?
> > > >> >
> > > >> > Amandeep
> > > >> >
> > > >> >
> > > >> > Amandeep Khurana
> > > >> > Computer Science Graduate Student
> > > >> > University of California, Santa Cruz
> > > >> >
> > > >> >
> > > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <
> ama...@gmail.com
> > >
> > > >> > wrote:
> > > >> >
> > > >> > > Thanks Matei. If the basic architecture is similar to the Google
> > > >> stuff, I
> > > 

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread zander1013

okay,

i will heed the tip on the 127 address set. here is the result of ssh
192.168.0.2...
a...@node0:~$ ssh 192.168.0.2
ssh: connect to host 192.168.0.2 port 22: Connection timed out
a...@node0:~$

the boxes are just connected with a cat5 cable.

i have not done this with the hadoop account but af is my normal account and
i figure it should work too.

/etc/init.d/interfaces is empty/does not exist on the machines. (i am using
ubuntu 8.10)

please advise.
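
(one note: on ubuntu the wired interface config lives in /etc/network/interfaces,
not /etc/init.d/interfaces, which would explain the file looking missing. for a
direct cat5 link a minimal static stanza on node0 would be something like the
following, with address 192.168.0.2 on node1 - eth0 is an assumption, use
whatever ifconfig shows:

auto eth0
iface eth0 inet static
    address 192.168.0.1
    netmask 255.255.255.0

after editing, sudo /etc/init.d/networking restart picks it up.)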


Norbert Burger wrote:
> 
> Fwiw, the extra references to 127.0.1.1 in each host file aren't
> necessary.
> 
> From node0, does 'ssh 192.168.0.2' work?  If not, then the issue isn't
> name
> resolution -- take look at the network configs (eg.,
> /etc/init.d/interfaces)
> on each machine.
> 
> Norbert
> 
> On Sun, Feb 15, 2009 at 7:31 PM, zander1013  wrote:
> 
>>
>> okay,
>>
>> i have changed /etc/hosts to look like this for node0...
>>
>> 127.0.0.1   localhost
>> 127.0.1.1   node0
>>
>> # /etc/hosts (for hadoop master and slave)
>> 192.168.0.1 node0
>> 192.168.0.2 node1
>> #end hadoop section
>>
>> # The following lines are desirable for IPv6 capable hosts
>> ::1 ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>> ff02::3 ip6-allhosts
>>
>> ...and this for node1...
>>
>> 127.0.0.1   localhost
>> 127.0.1.1   node1
>>
>> # /etc/hosts (for hadoop master and slave)
>> 192.168.0.1 node0
>> 192.168.0.2 node1
>> #end hadoop section
>>
>> # The following lines are desirable for IPv6 capable hosts
>> ::1 ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>> ff02::3 ip6-allhosts
>>
>> ... the machines are connected by a cat5 cable, they have wifi and are
>> showing that they are connected to my wlan. also i have enabled all the
>> user
>> privleges in the user manager on both machines. here are the results from
>> ssh on node0...
>>
>> had...@node0:~$ ssh node0
>> Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686
>>
>> The programs included with the Ubuntu system are free software;
>> the exact distribution terms for each program are described in the
>> individual files in /usr/share/doc/*/copyright.
>>
>> Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
>> applicable law.
>>
>> To access official Ubuntu documentation, please visit:
>> http://help.ubuntu.com/
>> Last login: Sun Feb 15 16:00:28 2009 from node0
>> had...@node0:~$ exit
>> logout
>> Connection to node0 closed.
>> had...@node0:~$ ssh node1
>> ssh: connect to host node1 port 22: Connection timed out
>> had...@node0:~$
>>
>> i will look into the link that you gave.
>>
>> -zander
>>
>>
>> Norbert Burger wrote:
>> >
>> >>
>> >> i have commented out the 192. addresses and changed 127.0.1.1 for
>> node0
>> >> and
>> >> 127.0.1.2 for node0 (in /etc/hosts). with this done i can ssh from one
>> >> machine to itself and to the other but the prompt does not change when
>> i
>> >> ssh
>> >> to the other machine. i don't know if there is a firewall preventing
>> me
>> >> from
>> >> ssh or not. i have not set any up to prevent ssh and i have not taken
>> >> action
>> >> to specifically allow ssh other than what was prescribed in the
>> tutorial
>> >> for
>> >> a single node cluster for both machines.
>> >
>> >
>> > Why are you using 127.* addresses for your nodes?  These fall into the
>> > block
>> > of IPs reserved for the loopback network (see
>> > http://en.wikipedia.org/wiki/Loopback).
>> >
>> > Try changing both nodes back to 192.168, and re-starting Hadoop.
>> >
>> > Norbert
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22029812.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22031038.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Ok. Got it.

Now, when my job needs to access another file, does it go to the Namenode to
get the block ids? How does the java process know where the files are and
how to access them?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia  wrote:

> I mentioned this case because even jobs written in Java can use the HDFS
> API
> to talk to the NameNode and access the filesystem. People often do this
> because their job needs to read a config file, some small data table, etc
> and use this information in its map or reduce functions. In this case, you
> open the second file separately in your mapper's init function and read
> whatever you need from it. In general I wanted to point out that you can't
> know which files a job will access unless you look at its source code or
> monitor the calls it makes; the input file(s) you provide in the job
> description are a hint to the MapReduce framework to place your job on
> certain nodes, but it's reasonable for the job to access other files as
> well.
>
> On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana 
> wrote:
>
> > Another question that I have here - When the jobs run arbitrary code and
> > access data from the HDFS, do they go to the namenode to get the block
> > information?
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana 
> > wrote:
> >
> > > Assuming that the job is purely in Java and not involving streaming or
> > > pipes, wouldnt the resources (files) required by the job as inputs be
> > known
> > > beforehand? So, if the map task is accessing a second file, how does it
> > make
> > > it different except that there are multiple files. The JobTracker would
> > know
> > > beforehand that multiple files would be accessed. Right?
> > >
> > > I am slightly confused why you have mentioned this case separately...
> Can
> > > you elaborate on it a little bit?
> > >
> > > Amandeep
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia 
> > wrote:
> > >
> > >> Typically the data flow is like this:1) Client submits a job
> description
> > >> to
> > >> the JobTracker.
> > >> 2) JobTracker figures out block locations for the input file(s) by
> > talking
> > >> to HDFS NameNode.
> > >> 3) JobTracker creates a job description file in HDFS which will be
> read
> > by
> > >> the nodes to copy over the job's code etc.
> > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
> > >> appropriate data blocks.
> > >> 5) After running, maps create intermediate output files on those
> slaves.
> > >> These are not in HDFS, they're in some temporary storage used by
> > >> MapReduce.
> > >> 6) JobTracker starts reduces on a series of slaves, which copy over
> the
> > >> appropriate map outputs, apply the reduce function, and write the
> > outputs
> > >> to
> > >> HDFS (one output file per reducer).
> > >> 7) Some logs for the job may also be put into HDFS by the JobTracker.
> > >>
> > >> However, there is a big caveat, which is that the map and reduce tasks
> > run
> > >> arbitrary code. It is not unusual to have a map that opens a second
> HDFS
> > >> file to read some information (e.g. for doing a join of a small table
> > >> against a big file). If you use Hadoop Streaming or Pipes to write a
> job
> > >> in
> > >> Python, Ruby, C, etc, then you are launching arbitrary processes which
> > may
> > >> also access external resources in this manner. Some people also
> > read/write
> > >> to DBs (e.g. MySQL) from their tasks. A comprehensive security
> solution
> > >> would ideally deal with these cases too.
> > >>
> > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana 
> > >> wrote:
> > >>
> > >> > A quick question here. How does a typical hadoop job work at the
> > system
> > >> > level? What are the various interactions and how does the data flow?
> > >> >
> > >> > Amandeep
> > >> >
> > >> >
> > >> > Amandeep Khurana
> > >> > Computer Science Graduate Student
> > >> > University of California, Santa Cruz
> > >> >
> > >> >
> > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana  >
> > >> > wrote:
> > >> >
> > >> > > Thanks Matei. If the basic architecture is similar to the Google
> > >> stuff, I
> > >> > > can safely just work on the project using the information from the
> > >> > papers.
> > >> > >
> > >> > > I am aware of the 4487 jira and the current status of the
> > permissions
> > >> > > mechanism. I had a look at them earlier.
> > >> > >
> > >> > > Cheers
> > >> > > Amandeep
> > >> > >
> > >> > >
> > >> > > Amandeep Khurana
> > >> > > Computer Science Graduate Student
> > >> > > University of California, Santa Cruz
> > >> > >
> > >> > >
> > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <
> ma...@cloudera.com>
> > >>

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
I mentioned this case because even jobs written in Java can use the HDFS API
to talk to the NameNode and access the filesystem. People often do this
because their job needs to read a config file, some small data table, etc
and use this information in its map or reduce functions. In this case, you
open the second file separately in your mapper's init function and read
whatever you need from it. In general I wanted to point out that you can't
know which files a job will access unless you look at its source code or
monitor the calls it makes; the input file(s) you provide in the job
description are a hint to the MapReduce framework to place your job on
certain nodes, but it's reasonable for the job to access other files as
well.
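
For concreteness, a minimal sketch of that pattern against the 0.18/0.19-era
mapred API (the lookup-file path and record layout below are made up for
illustration):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    // Called once per task before any map() calls; this is where the
    // "second" HDFS file is opened, through the normal NameNode/datanode path.
    public void configure(JobConf job) {
      try {
        FileSystem fs = FileSystem.get(job);
        Path side = new Path("/data/lookup/table.txt");  // hypothetical path
        BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(side)));
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\t", 2);
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
        in.close();
      } catch (IOException e) {
        throw new RuntimeException("Could not load side file", e);
      }
    }

    // Join each input record against the small table loaded above.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] fields = value.toString().split("\t", 2);
      String joined = lookup.get(fields[0]);
      if (joined != null) {
        out.collect(new Text(fields[0]), new Text(joined));
      }
    }
  }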

On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana  wrote:

> Another question that I have here - When the jobs run arbitrary code and
> access data from the HDFS, do they go to the namenode to get the block
> information?
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana 
> wrote:
>
> > Assuming that the job is purely in Java and not involving streaming or
> > pipes, wouldnt the resources (files) required by the job as inputs be
> known
> > beforehand? So, if the map task is accessing a second file, how does it
> make
> > it different except that there are multiple files. The JobTracker would
> know
> > beforehand that multiple files would be accessed. Right?
> >
> > I am slightly confused why you have mentioned this case separately... Can
> > you elaborate on it a little bit?
> >
> > Amandeep
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia 
> wrote:
> >
> >> Typically the data flow is like this:
> >> 1) Client submits a job description
> >> to
> >> the JobTracker.
> >> 2) JobTracker figures out block locations for the input file(s) by
> talking
> >> to HDFS NameNode.
> >> 3) JobTracker creates a job description file in HDFS which will be read
> by
> >> the nodes to copy over the job's code etc.
> >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
> >> appropriate data blocks.
> >> 5) After running, maps create intermediate output files on those slaves.
> >> These are not in HDFS, they're in some temporary storage used by
> >> MapReduce.
> >> 6) JobTracker starts reduces on a series of slaves, which copy over the
> >> appropriate map outputs, apply the reduce function, and write the
> outputs
> >> to
> >> HDFS (one output file per reducer).
> >> 7) Some logs for the job may also be put into HDFS by the JobTracker.
> >>
> >> However, there is a big caveat, which is that the map and reduce tasks
> run
> >> arbitrary code. It is not unusual to have a map that opens a second HDFS
> >> file to read some information (e.g. for doing a join of a small table
> >> against a big file). If you use Hadoop Streaming or Pipes to write a job
> >> in
> >> Python, Ruby, C, etc, then you are launching arbitrary processes which
> may
> >> also access external resources in this manner. Some people also
> read/write
> >> to DBs (e.g. MySQL) from their tasks. A comprehensive security solution
> >> would ideally deal with these cases too.
> >>
> >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana 
> >> wrote:
> >>
> >> > A quick question here. How does a typical hadoop job work at the
> system
> >> > level? What are the various interactions and how does the data flow?
> >> >
> >> > Amandeep
> >> >
> >> >
> >> > Amandeep Khurana
> >> > Computer Science Graduate Student
> >> > University of California, Santa Cruz
> >> >
> >> >
> >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana 
> >> > wrote:
> >> >
> >> > > Thanks Matei. If the basic architecture is similar to the Google
> >> stuff, I
> >> > > can safely just work on the project using the information from the
> >> > papers.
> >> > >
> >> > > I am aware of the 4487 jira and the current status of the
> permissions
> >> > > mechanism. I had a look at them earlier.
> >> > >
> >> > > Cheers
> >> > > Amandeep
> >> > >
> >> > >
> >> > > Amandeep Khurana
> >> > > Computer Science Graduate Student
> >> > > University of California, Santa Cruz
> >> > >
> >> > >
> >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia 
> >> > wrote:
> >> > >
> >> > >> Forgot to add, this JIRA details the latest security features that
> >> are
> >> > >> being
> >> > >> worked on in Hadoop trunk:
> >> > >> https://issues.apache.org/jira/browse/HADOOP-4487.
> >> > >> This document describes the current status and limitations of the
> >> > >> permissions mechanism:
> >> > >>
> >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
> >> > >>
> >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia  >
> >> > >> wrote:
> >> > >>
> >> > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS
> at
> >> > the
> >> > >> > le

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread Norbert Burger
Fwiw, the extra references to 127.0.1.1 in each host file aren't necessary.

From node0, does 'ssh 192.168.0.2' work?  If not, then the issue isn't name
resolution -- take a look at the network configs (e.g., /etc/network/interfaces)
on each machine.

Norbert

On Sun, Feb 15, 2009 at 7:31 PM, zander1013  wrote:

>
> okay,
>
> i have changed /etc/hosts to look like this for node0...
>
> 127.0.0.1   localhost
> 127.0.1.1   node0
>
> # /etc/hosts (for hadoop master and slave)
> 192.168.0.1 node0
> 192.168.0.2 node1
> #end hadoop section
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>
> ...and this for node1...
>
> 127.0.0.1   localhost
> 127.0.1.1   node1
>
> # /etc/hosts (for hadoop master and slave)
> 192.168.0.1 node0
> 192.168.0.2 node1
> #end hadoop section
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>
> ... the machines are connected by a cat5 cable, they have wifi and are
> showing that they are connected to my wlan. also i have enabled all the
> user
> privleges in the user manager on both machines. here are the results from
> ssh on node0...
>
> had...@node0:~$ ssh node0
> Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686
>
> The programs included with the Ubuntu system are free software;
> the exact distribution terms for each program are described in the
> individual files in /usr/share/doc/*/copyright.
>
> Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
> applicable law.
>
> To access official Ubuntu documentation, please visit:
> http://help.ubuntu.com/
> Last login: Sun Feb 15 16:00:28 2009 from node0
> had...@node0:~$ exit
> logout
> Connection to node0 closed.
> had...@node0:~$ ssh node1
> ssh: connect to host node1 port 22: Connection timed out
> had...@node0:~$
>
> i will look into the link that you gave.
>
> -zander
>
>
> Norbert Burger wrote:
> >
> >>
> >> i have commented out the 192. addresses and changed 127.0.1.1 for node0
> >> and
> >> 127.0.1.2 for node0 (in /etc/hosts). with this done i can ssh from one
> >> machine to itself and to the other but the prompt does not change when i
> >> ssh
> >> to the other machine. i don't know if there is a firewall preventing me
> >> from
> >> ssh or not. i have not set any up to prevent ssh and i have not taken
> >> action
> >> to specifically allow ssh other than what was prescribed in the tutorial
> >> for
> >> a single node cluster for both machines.
> >
> >
> > Why are you using 127.* addresses for your nodes?  These fall into the
> > block
> > of IPs reserved for the loopback network (see
> > http://en.wikipedia.org/wiki/Loopback).
> >
> > Try changing both nodes back to 192.168, and re-starting Hadoop.
> >
> > Norbert
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22029812.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Another question that I have here - When the jobs run arbitrary code and
access data from the HDFS, do they go to the namenode to get the block
information?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana  wrote:

> Assuming that the job is purely in Java and not involving streaming or
> pipes, wouldnt the resources (files) required by the job as inputs be known
> beforehand? So, if the map task is accessing a second file, how does it make
> it different except that there are multiple files. The JobTracker would know
> beforehand that multiple files would be accessed. Right?
>
> I am slightly confused why you have mentioned this case separately... Can
> you elaborate on it a little bit?
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia  wrote:
>
>> Typically the data flow is like this:
>> 1) Client submits a job description
>> to
>> the JobTracker.
>> 2) JobTracker figures out block locations for the input file(s) by talking
>> to HDFS NameNode.
>> 3) JobTracker creates a job description file in HDFS which will be read by
>> the nodes to copy over the job's code etc.
>> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
>> appropriate data blocks.
>> 5) After running, maps create intermediate output files on those slaves.
>> These are not in HDFS, they're in some temporary storage used by
>> MapReduce.
>> 6) JobTracker starts reduces on a series of slaves, which copy over the
>> appropriate map outputs, apply the reduce function, and write the outputs
>> to
>> HDFS (one output file per reducer).
>> 7) Some logs for the job may also be put into HDFS by the JobTracker.
>>
>> However, there is a big caveat, which is that the map and reduce tasks run
>> arbitrary code. It is not unusual to have a map that opens a second HDFS
>> file to read some information (e.g. for doing a join of a small table
>> against a big file). If you use Hadoop Streaming or Pipes to write a job
>> in
>> Python, Ruby, C, etc, then you are launching arbitrary processes which may
>> also access external resources in this manner. Some people also read/write
>> to DBs (e.g. MySQL) from their tasks. A comprehensive security solution
>> would ideally deal with these cases too.
>>
>> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana 
>> wrote:
>>
>> > A quick question here. How does a typical hadoop job work at the system
>> > level? What are the various interactions and how does the data flow?
>> >
>> > Amandeep
>> >
>> >
>> > Amandeep Khurana
>> > Computer Science Graduate Student
>> > University of California, Santa Cruz
>> >
>> >
>> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana 
>> > wrote:
>> >
>> > > Thanks Matei. If the basic architecture is similar to the Google
>> stuff, I
>> > > can safely just work on the project using the information from the
>> > papers.
>> > >
>> > > I am aware of the 4487 jira and the current status of the permissions
>> > > mechanism. I had a look at them earlier.
>> > >
>> > > Cheers
>> > > Amandeep
>> > >
>> > >
>> > > Amandeep Khurana
>> > > Computer Science Graduate Student
>> > > University of California, Santa Cruz
>> > >
>> > >
>> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia 
>> > wrote:
>> > >
>> > >> Forgot to add, this JIRA details the latest security features that
>> are
>> > >> being
>> > >> worked on in Hadoop trunk:
>> > >> https://issues.apache.org/jira/browse/HADOOP-4487.
>> > >> This document describes the current status and limitations of the
>> > >> permissions mechanism:
>> > >>
>> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
>> > >>
>> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia 
>> > >> wrote:
>> > >>
>> > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at
>> > the
>> > >> > level described in those papers. In particular, in HDFS, there is a
>> > >> master
>> > >> > node containing metadata and a number of slave nodes (datanodes)
>> > >> containing
>> > >> > blocks, as in GFS. Clients start by talking to the master to list
>> > >> > directories, etc. When they want to read a region of some file,
>> they
>> > >> tell
>> > >> > the master the filename and offset, and they receive a list of
>> block
>> > >> > locations (datanodes). They then contact the individual datanodes
>> to
>> > >> read
>> > >> > the blocks. When clients write a file, they first obtain a new
>> block
>> > ID
>> > >> and
>> > >> > list of nodes to write it to from the master, then contact the
>> > datanodes
>> > >> to
>> > >> > write it (actually, the datanodes pipeline the write as in GFS) and
>> > >> report
>> > >> > when the write is complete. HDFS actually has some security
>> mechanisms
>> > >> built
>> > >> > in, authenticating users based on their Unix ID and providing
>> > Unix-like
>> > >> file
>> > >> > permissions. I don't

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Assuming that the job is purely in Java and doesn't involve streaming or
pipes, wouldn't the resources (files) required by the job as inputs be known
beforehand? So, if the map task is accessing a second file, how is that case
different, except that there are multiple files? The JobTracker would know
beforehand that multiple files would be accessed. Right?

I am slightly confused why you have mentioned this case separately... Can
you elaborate on it a little bit?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia  wrote:

> Typically the data flow is like this:
> 1) Client submits a job description to
> the JobTracker.
> 2) JobTracker figures out block locations for the input file(s) by talking
> to HDFS NameNode.
> 3) JobTracker creates a job description file in HDFS which will be read by
> the nodes to copy over the job's code etc.
> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
> appropriate data blocks.
> 5) After running, maps create intermediate output files on those slaves.
> These are not in HDFS, they're in some temporary storage used by MapReduce.
> 6) JobTracker starts reduces on a series of slaves, which copy over the
> appropriate map outputs, apply the reduce function, and write the outputs
> to
> HDFS (one output file per reducer).
> 7) Some logs for the job may also be put into HDFS by the JobTracker.
>
> However, there is a big caveat, which is that the map and reduce tasks run
> arbitrary code. It is not unusual to have a map that opens a second HDFS
> file to read some information (e.g. for doing a join of a small table
> against a big file). If you use Hadoop Streaming or Pipes to write a job in
> Python, Ruby, C, etc, then you are launching arbitrary processes which may
> also access external resources in this manner. Some people also read/write
> to DBs (e.g. MySQL) from their tasks. A comprehensive security solution
> would ideally deal with these cases too.
>
> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana 
> wrote:
>
> > A quick question here. How does a typical hadoop job work at the system
> > level? What are the various interactions and how does the data flow?
> >
> > Amandeep
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana 
> > wrote:
> >
> > > Thanks Matei. If the basic architecture is similar to the Google stuff,
> I
> > > can safely just work on the project using the information from the
> > papers.
> > >
> > > I am aware of the 4487 jira and the current status of the permissions
> > > mechanism. I had a look at them earlier.
> > >
> > > Cheers
> > > Amandeep
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia 
> > wrote:
> > >
> > >> Forgot to add, this JIRA details the latest security features that are
> > >> being
> > >> worked on in Hadoop trunk:
> > >> https://issues.apache.org/jira/browse/HADOOP-4487.
> > >> This document describes the current status and limitations of the
> > >> permissions mechanism:
> > >>
> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
> > >>
> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia 
> > >> wrote:
> > >>
> > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at
> > the
> > >> > level described in those papers. In particular, in HDFS, there is a
> > >> master
> > >> > node containing metadata and a number of slave nodes (datanodes)
> > >> containing
> > >> > blocks, as in GFS. Clients start by talking to the master to list
> > >> > directories, etc. When they want to read a region of some file, they
> > >> tell
> > >> > the master the filename and offset, and they receive a list of block
> > >> > locations (datanodes). They then contact the individual datanodes to
> > >> read
> > >> > the blocks. When clients write a file, they first obtain a new block
> > ID
> > >> and
> > >> > list of nodes to write it to from the master, then contact the
> > datanodes
> > >> to
> > >> > write it (actually, the datanodes pipeline the write as in GFS) and
> > >> report
> > >> > when the write is complete. HDFS actually has some security
> mechanisms
> > >> built
> > >> > in, authenticating users based on their Unix ID and providing
> > Unix-like
> > >> file
> > >> > permissions. I don't know much about how these are implemented, but
> > they
> > >> > would be a good place to start looking.
> > >> >
> > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana  > >> >wrote:
> > >> >
> > >> >> Thanks Matie
> > >> >>
> > >> >> I had gone through the architecture document online. I am currently
> > >> >> working
> > >> >> on a project towards Security in Hadoop. I do know how the data
> moves
> > >> >> around
> > >> >> in the GFS but wasnt sure how

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
This is good information! Thanks a ton. I'll take all this into account.

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia  wrote:

> Typically the data flow is like this:
> 1) Client submits a job description to
> the JobTracker.
> 2) JobTracker figures out block locations for the input file(s) by talking
> to HDFS NameNode.
> 3) JobTracker creates a job description file in HDFS which will be read by
> the nodes to copy over the job's code etc.
> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
> appropriate data blocks.
> 5) After running, maps create intermediate output files on those slaves.
> These are not in HDFS, they're in some temporary storage used by MapReduce.
> 6) JobTracker starts reduces on a series of slaves, which copy over the
> appropriate map outputs, apply the reduce function, and write the outputs
> to
> HDFS (one output file per reducer).
> 7) Some logs for the job may also be put into HDFS by the JobTracker.
>
> However, there is a big caveat, which is that the map and reduce tasks run
> arbitrary code. It is not unusual to have a map that opens a second HDFS
> file to read some information (e.g. for doing a join of a small table
> against a big file). If you use Hadoop Streaming or Pipes to write a job in
> Python, Ruby, C, etc, then you are launching arbitrary processes which may
> also access external resources in this manner. Some people also read/write
> to DBs (e.g. MySQL) from their tasks. A comprehensive security solution
> would ideally deal with these cases too.
>
> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana 
> wrote:
>
> > A quick question here. How does a typical hadoop job work at the system
> > level? What are the various interactions and how does the data flow?
> >
> > Amandeep
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana 
> > wrote:
> >
> > > Thanks Matei. If the basic architecture is similar to the Google stuff,
> I
> > > can safely just work on the project using the information from the
> > papers.
> > >
> > > I am aware of the 4487 jira and the current status of the permissions
> > > mechanism. I had a look at them earlier.
> > >
> > > Cheers
> > > Amandeep
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia 
> > wrote:
> > >
> > >> Forgot to add, this JIRA details the latest security features that are
> > >> being
> > >> worked on in Hadoop trunk:
> > >> https://issues.apache.org/jira/browse/HADOOP-4487.
> > >> This document describes the current status and limitations of the
> > >> permissions mechanism:
> > >>
> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
> > >>
> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia 
> > >> wrote:
> > >>
> > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at
> > the
> > >> > level described in those papers. In particular, in HDFS, there is a
> > >> master
> > >> > node containing metadata and a number of slave nodes (datanodes)
> > >> containing
> > >> > blocks, as in GFS. Clients start by talking to the master to list
> > >> > directories, etc. When they want to read a region of some file, they
> > >> tell
> > >> > the master the filename and offset, and they receive a list of block
> > >> > locations (datanodes). They then contact the individual datanodes to
> > >> read
> > >> > the blocks. When clients write a file, they first obtain a new block
> > ID
> > >> and
> > >> > list of nodes to write it to from the master, then contact the
> > datanodes
> > >> to
> > >> > write it (actually, the datanodes pipeline the write as in GFS) and
> > >> report
> > >> > when the write is complete. HDFS actually has some security
> mechanisms
> > >> built
> > >> > in, authenticating users based on their Unix ID and providing
> > Unix-like
> > >> file
> > >> > permissions. I don't know much about how these are implemented, but
> > they
> > >> > would be a good place to start looking.
> > >> >
> > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana  > >> >wrote:
> > >> >
> > >> >> Thanks Matie
> > >> >>
> > >> >> I had gone through the architecture document online. I am currently
> > >> >> working
> > >> >> on a project towards Security in Hadoop. I do know how the data
> moves
> > >> >> around
> > >> >> in the GFS but wasnt sure how much of that does HDFS follow and how
> > >> >> different it is from GFS. Can you throw some light on that?
> > >> >>
> > >> >> Security would also involve the Map Reduce jobs following the same
> > >> >> protocols. Thats why the question about how does the Hadoop
> framework
> > >> >> integrate with the HDFS, and how different is it from Map Reduce
> and
> > >> GFS.
> > >> >> The GFS and M

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Typically the data flow is like this:
1) Client submits a job description to the JobTracker.
2) JobTracker figures out block locations for the input file(s) by talking
to HDFS NameNode.
3) JobTracker creates a job description file in HDFS which will be read by
the nodes to copy over the job's code etc.
4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
appropriate data blocks.
5) After running, maps create intermediate output files on those slaves.
These are not in HDFS, they're in some temporary storage used by MapReduce.
6) JobTracker starts reduces on a series of slaves, which copy over the
appropriate map outputs, apply the reduce function, and write the outputs to
HDFS (one output file per reducer).
7) Some logs for the job may also be put into HDFS by the JobTracker.

However, there is a big caveat, which is that the map and reduce tasks run
arbitrary code. It is not unusual to have a map that opens a second HDFS
file to read some information (e.g. for doing a join of a small table
against a big file). If you use Hadoop Streaming or Pipes to write a job in
Python, Ruby, C, etc, then you are launching arbitrary processes which may
also access external resources in this manner. Some people also read/write
to DBs (e.g. MySQL) from their tasks. A comprehensive security solution
would ideally deal with these cases too.
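
For reference, here is roughly what step 1 looks like from client code with
the 0.19-era mapred API -- a minimal driver using the identity mapper and
reducer; the input and output paths are placeholders:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class SubmitExample {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SubmitExample.class);
      conf.setJobName("submit-example");

      // The input paths are the placement hint used in steps 2 and 4; the
      // tasks may still open other HDFS files at runtime (the caveat above).
      FileInputFormat.setInputPaths(conf, new Path("/data/input"));    // hypothetical
      FileOutputFormat.setOutputPath(conf, new Path("/data/output"));  // hypothetical

      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);

      // Step 1: hands the job description (plus the job jar) to the
      // JobTracker, which then drives steps 2-7 and blocks until completion.
      JobClient.runJob(conf);
    }
  }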

On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana  wrote:

> A quick question here. How does a typical hadoop job work at the system
> level? What are the various interactions and how does the data flow?
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana 
> wrote:
>
> > Thanks Matei. If the basic architecture is similar to the Google stuff, I
> > can safely just work on the project using the information from the
> papers.
> >
> > I am aware of the 4487 jira and the current status of the permissions
> > mechanism. I had a look at them earlier.
> >
> > Cheers
> > Amandeep
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia 
> wrote:
> >
> >> Forgot to add, this JIRA details the latest security features that are
> >> being
> >> worked on in Hadoop trunk:
> >> https://issues.apache.org/jira/browse/HADOOP-4487.
> >> This document describes the current status and limitations of the
> >> permissions mechanism:
> >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
> >>
> >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia 
> >> wrote:
> >>
> >> > I think it's safe to assume that Hadoop works like MapReduce/GFS at
> the
> >> > level described in those papers. In particular, in HDFS, there is a
> >> master
> >> > node containing metadata and a number of slave nodes (datanodes)
> >> containing
> >> > blocks, as in GFS. Clients start by talking to the master to list
> >> > directories, etc. When they want to read a region of some file, they
> >> tell
> >> > the master the filename and offset, and they receive a list of block
> >> > locations (datanodes). They then contact the individual datanodes to
> >> read
> >> > the blocks. When clients write a file, they first obtain a new block
> ID
> >> and
> >> > list of nodes to write it to from the master, then contact the
> datanodes
> >> to
> >> > write it (actually, the datanodes pipeline the write as in GFS) and
> >> report
> >> > when the write is complete. HDFS actually has some security mechanisms
> >> built
> >> > in, authenticating users based on their Unix ID and providing
> Unix-like
> >> file
> >> > permissions. I don't know much about how these are implemented, but
> they
> >> > would be a good place to start looking.
> >> >
> >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana  >> >wrote:
> >> >
> >> >> Thanks Matie
> >> >>
> >> >> I had gone through the architecture document online. I am currently
> >> >> working
> >> >> on a project towards Security in Hadoop. I do know how the data moves
> >> >> around
> >> >> in the GFS but wasnt sure how much of that does HDFS follow and how
> >> >> different it is from GFS. Can you throw some light on that?
> >> >>
> >> >> Security would also involve the Map Reduce jobs following the same
> >> >> protocols. Thats why the question about how does the Hadoop framework
> >> >> integrate with the HDFS, and how different is it from Map Reduce and
> >> GFS.
> >> >> The GFS and Map Reduce papers give a good information on how those
> >> systems
> >> >> are designed but there is nothing that concrete for Hadoop that I
> have
> >> >> been
> >> >> able to find.
> >> >>
> >> >> Amandeep
> >> >>
> >> >>
> >> >> Amandeep Khurana
> >> >> Computer Science Graduate Student
> >> >> University of California, Santa Cruz
> >> >>
> >> >>
> >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia 
> >> >> wrote:
> >> >>
> >> >> > Hi Amandeep,
> >> >> > Hadoop is d

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread zander1013

okay,

i have changed /etc/hosts to look like this for node0...

127.0.0.1   localhost
127.0.1.1   node0

# /etc/hosts (for hadoop master and slave)
192.168.0.1 node0
192.168.0.2 node1
#end hadoop section

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

...and this for node1...

127.0.0.1   localhost
127.0.1.1   node1

# /etc/hosts (for hadoop master and slave)
192.168.0.1 node0
192.168.0.2 node1
#end hadoop section

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

... the machines are connected by a cat5 cable; they have wifi and are
showing that they are connected to my wlan. also i have enabled all the user
privileges in the user manager on both machines. here are the results from
ssh on node0...

had...@node0:~$ ssh node0
Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Sun Feb 15 16:00:28 2009 from node0
had...@node0:~$ exit
logout
Connection to node0 closed.
had...@node0:~$ ssh node1
ssh: connect to host node1 port 22: Connection timed out
had...@node0:~$ 

i will look into the link that you gave.

-zander


Norbert Burger wrote:
> 
>>
>> i have commented out the 192. addresses and changed 127.0.1.1 for node0
>> and
>> 127.0.1.2 for node0 (in /etc/hosts). with this done i can ssh from one
>> machine to itself and to the other but the prompt does not change when i
>> ssh
>> to the other machine. i don't know if there is a firewall preventing me
>> from
>> ssh or not. i have not set any up to prevent ssh and i have not taken
>> action
>> to specifically allow ssh other than what was prescribed in the tutorial
>> for
>> a single node cluster for both machines.
> 
> 
> Why are you using 127.* addresses for your nodes?  These fall into the
> block
> of IPs reserved for the loopback network (see
> http://en.wikipedia.org/wiki/Loopback).
> 
> Try changing both nodes back to 192.168, and re-starting Hadoop.
> 
> Norbert
> 
> 

-- 
View this message in context: 
http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22029812.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread Norbert Burger
>
> i have commented out the 192. addresses and changed 127.0.1.1 for node0 and
> 127.0.1.2 for node0 (in /etc/hosts). with this done i can ssh from one
> machine to itself and to the other but the prompt does not change when i
> ssh
> to the other machine. i don't know if there is a firewall preventing me
> from
> ssh or not. i have not set any up to prevent ssh and i have not taken
> action
> to specifically allow ssh other than what was prescribed in the tutorial
> for
> a single node cluster for both machines.


Why are you using 127.* addresses for your nodes?  These fall into the block
of IPs reserved for the loopback network (see
http://en.wikipedia.org/wiki/Loopback).

Try changing both nodes back to 192.168, and re-starting Hadoop.

Norbert


Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
A quick question here. How does a typical hadoop job work at the system
level? What are the various interactions and how does the data flow?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana  wrote:

> Thanks Matei. If the basic architecture is similar to the Google stuff, I
> can safely just work on the project using the information from the papers.
>
> I am aware of the 4487 jira and the current status of the permissions
> mechanism. I had a look at them earlier.
>
> Cheers
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia  wrote:
>
>> Forgot to add, this JIRA details the latest security features that are
>> being
>> worked on in Hadoop trunk:
>> https://issues.apache.org/jira/browse/HADOOP-4487.
>> This document describes the current status and limitations of the
>> permissions mechanism:
>> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
>>
>> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia 
>> wrote:
>>
>> > I think it's safe to assume that Hadoop works like MapReduce/GFS at the
>> > level described in those papers. In particular, in HDFS, there is a
>> master
>> > node containing metadata and a number of slave nodes (datanodes)
>> containing
>> > blocks, as in GFS. Clients start by talking to the master to list
>> > directories, etc. When they want to read a region of some file, they
>> tell
>> > the master the filename and offset, and they receive a list of block
>> > locations (datanodes). They then contact the individual datanodes to
>> read
>> > the blocks. When clients write a file, they first obtain a new block ID
>> and
>> > list of nodes to write it to from the master, then contact the datanodes
>> to
>> > write it (actually, the datanodes pipeline the write as in GFS) and
>> report
>> > when the write is complete. HDFS actually has some security mechanisms
>> built
>> > in, authenticating users based on their Unix ID and providing Unix-like
>> file
>> > permissions. I don't know much about how these are implemented, but they
>> > would be a good place to start looking.
>> >
>> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana > >wrote:
>> >
>> >> Thanks Matie
>> >>
>> >> I had gone through the architecture document online. I am currently
>> >> working
>> >> on a project towards Security in Hadoop. I do know how the data moves
>> >> around
>> >> in the GFS but wasnt sure how much of that does HDFS follow and how
>> >> different it is from GFS. Can you throw some light on that?
>> >>
>> >> Security would also involve the Map Reduce jobs following the same
>> >> protocols. Thats why the question about how does the Hadoop framework
>> >> integrate with the HDFS, and how different is it from Map Reduce and
>> GFS.
>> >> The GFS and Map Reduce papers give a good information on how those
>> systems
>> >> are designed but there is nothing that concrete for Hadoop that I have
>> >> been
>> >> able to find.
>> >>
>> >> Amandeep
>> >>
>> >>
>> >> Amandeep Khurana
>> >> Computer Science Graduate Student
>> >> University of California, Santa Cruz
>> >>
>> >>
>> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia 
>> >> wrote:
>> >>
>> >> > Hi Amandeep,
>> >> > Hadoop is definitely inspired by MapReduce/GFS and aims to provide
>> those
>> >> > capabilities as an open-source project. HDFS is similar to GFS (large
>> >> > blocks, replication, etc); some notable things missing are read-write
>> >> > support in the middle of a file (unlikely to be provided because few
>> >> Hadoop
>> >> > applications require it) and multiple appenders (the record append
>> >> > operation). You can read about HDFS architecture at
>> >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html. The
>> >> MapReduce
>> >> > part of Hadoop interacts with HDFS in the same way that Google's
>> >> MapReduce
>> >> > interacts with GFS (shipping computation to the data), although
>> Hadoop
>> >> > MapReduce also supports running over other distributed filesystems.
>> >> >
>> >> > Matei
>> >> >
>> >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana > >
>> >> > wrote:
>> >> >
>> >> > > Hi
>> >> > >
>> >> > > Is the HDFS architecture completely based on the Google Filesystem?
>> If
>> >> it
>> >> > > isnt, what are the differences between the two?
>> >> > >
>> >> > > Secondly, is the coupling between Hadoop and HDFS same as how it is
>> >> > between
>> >> > > the Google's version of Map Reduce and GFS?
>> >> > >
>> >> > > Amandeep
>> >> > >
>> >> > >
>> >> > > Amandeep Khurana
>> >> > > Computer Science Graduate Student
>> >> > > University of California, Santa Cruz
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
>


Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Thanks Matei. If the basic architecture is similar to the Google stuff, I
can safely just work on the project using the information from the papers.

I am aware of the 4487 jira and the current status of the permissions
mechanism. I had a look at them earlier.

Cheers
Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia  wrote:

> Forgot to add, this JIRA details the latest security features that are
> being
> worked on in Hadoop trunk:
> https://issues.apache.org/jira/browse/HADOOP-4487.
> This document describes the current status and limitations of the
> permissions mechanism:
> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
>
> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia  wrote:
>
> > I think it's safe to assume that Hadoop works like MapReduce/GFS at the
> > level described in those papers. In particular, in HDFS, there is a
> master
> > node containing metadata and a number of slave nodes (datanodes)
> containing
> > blocks, as in GFS. Clients start by talking to the master to list
> > directories, etc. When they want to read a region of some file, they tell
> > the master the filename and offset, and they receive a list of block
> > locations (datanodes). They then contact the individual datanodes to read
> > the blocks. When clients write a file, they first obtain a new block ID
> and
> > list of nodes to write it to from the master, then contact the datanodes
> to
> > write it (actually, the datanodes pipeline the write as in GFS) and
> report
> > when the write is complete. HDFS actually has some security mechanisms
> built
> > in, authenticating users based on their Unix ID and providing Unix-like
> file
> > permissions. I don't know much about how these are implemented, but they
> > would be a good place to start looking.
> >
> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana  >wrote:
> >
> >> Thanks Matie
> >>
> >> I had gone through the architecture document online. I am currently
> >> working
> >> on a project towards Security in Hadoop. I do know how the data moves
> >> around
> >> in the GFS but wasnt sure how much of that does HDFS follow and how
> >> different it is from GFS. Can you throw some light on that?
> >>
> >> Security would also involve the Map Reduce jobs following the same
> >> protocols. Thats why the question about how does the Hadoop framework
> >> integrate with the HDFS, and how different is it from Map Reduce and
> GFS.
> >> The GFS and Map Reduce papers give a good information on how those
> systems
> >> are designed but there is nothing that concrete for Hadoop that I have
> >> been
> >> able to find.
> >>
> >> Amandeep
> >>
> >>
> >> Amandeep Khurana
> >> Computer Science Graduate Student
> >> University of California, Santa Cruz
> >>
> >>
> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia 
> >> wrote:
> >>
> >> > Hi Amandeep,
> >> > Hadoop is definitely inspired by MapReduce/GFS and aims to provide
> those
> >> > capabilities as an open-source project. HDFS is similar to GFS (large
> >> > blocks, replication, etc); some notable things missing are read-write
> >> > support in the middle of a file (unlikely to be provided because few
> >> Hadoop
> >> > applications require it) and multiple appenders (the record append
> >> > operation). You can read about HDFS architecture at
> >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html. The
> >> MapReduce
> >> > part of Hadoop interacts with HDFS in the same way that Google's
> >> MapReduce
> >> > interacts with GFS (shipping computation to the data), although Hadoop
> >> > MapReduce also supports running over other distributed filesystems.
> >> >
> >> > Matei
> >> >
> >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana 
> >> > wrote:
> >> >
> >> > > Hi
> >> > >
> >> > > Is the HDFS architecture completely based on the Google Filesystem?
> If
> >> it
> >> > > isnt, what are the differences between the two?
> >> > >
> >> > > Secondly, is the coupling between Hadoop and HDFS same as how it is
> >> > between
> >> > > the Google's version of Map Reduce and GFS?
> >> > >
> >> > > Amandeep
> >> > >
> >> > >
> >> > > Amandeep Khurana
> >> > > Computer Science Graduate Student
> >> > > University of California, Santa Cruz
> >> > >
> >> >
> >>
> >
> >
>


Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread zander1013

hi,

sshd is running on both machines. i am using  the default ubuntu 8.10
workstation install with openssh-server installed via "apt-get install". i
have tried with the machines connected through both a switch and just
plugging the ethernet cable from one into the other. right now i have just
one cat-5 cable running from one machine to the other.

i have commented out the 192. addresses and changed 127.0.1.1 for node0 and
127.0.1.2 for node1 (in /etc/hosts). with this done i can ssh from one
machine to itself and to the other but the prompt does not change when i ssh
to the other machine. i don't know if there is a firewall preventing me from
ssh or not. i have not set any up to prevent ssh and i have not taken action
to specifically allow ssh other than what was prescribed in the tutorial for
a single node cluster for both machines.

here is what i get with 127.0.1.1 for node0 and 127.0.1.2 for node1 (the 192
addresses commented out) in /etc/hosts on both machines. it seems okay but
the prompt does not change to node1 when i ssh to that node. that is why i
am concerned.


had...@node0:~$ ssh node0
Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Sun Feb 15 14:35:46 2009 from node1
had...@node0:~$ exit
logout
Connection to node0 closed.
had...@node0:~$ ssh node1
Linux node0 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Sun Feb 15 14:38:41 2009 from node0
had...@node0:~$ 

i hope you can advise. i don't know if this means there is no firewall to
prevent ssh or not, but it is the best i can do in trying to determine if
there is one without further advice.

-zander

james warren-2 wrote:
> 
> Hi Zander -
> Two simple explanations come to mind:
> 
> * Is sshd is running on your boxes?
> * If so, do you have a firewall preventing ssh access?
> 
> cheers,
> -jw
> 
> On Sat, Feb 14, 2009 at 7:50 PM, zander1013  wrote:
> 
>>
>> hi,
>>
>> am going through the tutorial on multinode cluster setup by m. noll...
>>
>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>>
>> ... i am at the networking and ssh section. i am haveing trouble
>> configuring
>> the /etc/hosts file. my machines are node0 and node1 not master and
>> slave.
>> i
>> am taking node0 as master and node1 as slave.
>>
>> i added the lines...
>> # /etc/hosts (for master AND slave)
>>  192.168.0.1node0
>>  192.168.0.2node1
>> ... to /etc/hosts. but when i try to ssh (either way) i get the error...
>> "ssh: connect to host node1 (node0) port 22: Network is unreachable"
>>
>> /etc/hosts looks like this for node0...
>>
>> # /etc/hosts (for master and slave)
>> 192.168.0.1 node0
>> 192.168.0.2 node1
>> #end hadoop section
>> 127.0.0.1   localhost
>> 127.0.1.1   node0
>>
>> # The following lines are desirable for IPv6 capable hosts
>> ::1 ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>> ff02::3 ip6-allhosts
>>
>>
>> and this for node1...
>>
>> # /etc/hosts (for master and slave)
>> 192.168.0.1 node0
>> 192.168.0.2 node1
>> #end hadoop section
>> 127.0.0.1   localhost
>> 127.0.1.1   node1
>>
>> # The following lines are desirable for IPv6 capable hosts
>> ::1 ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>> ff02::3 ip6-allhosts
>>
>> ... even after rebooting and connecting the machines with a switch using
>> cat
>> 6 cable i get the error i quoted above.
>>
>> please advise.
>>
>>
>> thank you,
>>
>> -zander
>> --
>> View this message in context:
>> http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22019559.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p2202.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Race Condition?

2009-02-15 Thread S D
I was not able to determine the command shell return value for

 hadoop dfs -copyToLocal #{s3dir} #{localdir}

but I did print out several variables after the call and determined that the
call apparently did not go through successfully. In particular, prior to my
processData(localdir) command I use Ruby's puts to print out the contents of
several directories including 'localdir' and '../localdir' - here is the
weird thing: if I execute the following
 list = `ls -l "#{localdir}"`
 puts "List: #{list}"
(where 'localdir' is the directory I need as an arg for processData) the
processData command will execute properly. At first I thought that running
the puts command was allowing enough time to elapse for a race condition to
be avoided, so that 'localdir' was ready by the time the processData command
was called (I know that in certain ways that doesn't make sense, given that
hadoop dfs -copyToLocal blocks until it completes...), but then I tried other
time-consuming commands such as
 list = `ls -l "../#{localdir}"`
 puts "List: #{list}"
and running processData(localdir) led to an error:
 'localdir' not found

Any clues on what could be going on?

Thanks,
John
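
As a rough point of comparison, the same copy done through the Java
FileSystem API raises an IOException when it fails, instead of requiring you
to capture the shell's exit status ($? in Ruby) after the backtick call. A
minimal sketch -- the S3 URI and local path are placeholders, and the
Configuration is assumed to already carry the S3 credentials:

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CopyToLocalExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();

      // Placeholder paths -- substitute the real S3 directory and local target.
      Path src = new Path("s3://my-bucket/jobs/batch-42");  // hypothetical
      Path dst = new Path("/tmp/batch-42");                 // hypothetical

      // Open the filesystem that owns the source path (S3 here, HDFS otherwise).
      FileSystem fs = FileSystem.get(URI.create(src.toString()), conf);

      // Unlike shelling out to "hadoop dfs -copyToLocal", a failed copy here
      // throws, so the caller cannot go on to process a missing directory.
      fs.copyToLocalFile(src, dst);

      // Sanity check before handing the directory to downstream processing.
      FileSystem local = FileSystem.getLocal(conf);
      if (!local.exists(dst)) {
        throw new IllegalStateException("copyToLocal did not produce " + dst);
      }
    }
  }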



On Sat, Feb 14, 2009 at 6:45 PM, Matei Zaharia  wrote:

> Have you logged the output of the dfs command to see whether it's always
> succeeded the copy?
>
> On Sat, Feb 14, 2009 at 2:46 PM, S D  wrote:
>
> > In my Hadoop 0.19.0 program each map function is assigned a directory
> > (representing a data location in my S3 datastore). The first thing each
> map
> > function does is copy the particular S3 data to the local machine that
> the
> > map task is running on and then being processing the data; e.g.,
> >
> > command = "hadoop dfs -copyToLocal #{s3dir} #{localdir}"
> > system "#{command}"
> >
> > In the above, "s3dir" is a directory that creates "localdir" - my
> > expectation is that "localdir" is created in the work directory for the
> > particular task attempt. Following this copy command I then run a
> function
> > that processes the data; e.g.,
> >
> > processData(localdir)
> >
> > In some instances my map/reduce program crashes and when I examine the
> logs
> > I get a message saying that "localdir" can not be found. This confuses me
> > since the hadoop shell command above is blocking so that localdir should
> > exist by the time processData() is called. I've found that if I add in
> some
> > diagnostic lines prior to processData() such as puts statements to print
> > out
> > variables, I never run into the problem of the localdir not being found.
> It
> > is almost as if localdir needs time to be created before the call to
> > processData().
> >
> > Has anyone encountered anything like this? Any suggestions on what could
> be
> > wrong are appreciated.
> >
> > Thanks,
> > John
> >
>


Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Forgot to add, this JIRA details the latest security features that are being
worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487.
This document describes the current status and limitations of the
permissions mechanism:
http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
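
As a rough illustration of what that permissions mechanism exposes to client
code (method names from the FileSystem client API of that era; the path below
is a placeholder):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.permission.FsPermission;

  public class PermissionsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      Path p = new Path("/user/someuser/report.txt");  // hypothetical path

      // Read back the Unix-like owner/group/mode stored by the NameNode.
      FileStatus st = fs.getFileStatus(p);
      System.out.println(st.getOwner() + ":" + st.getGroup()
          + " " + st.getPermission());

      // Tighten the mode to rw-r----- (subject to the limitations described
      // in the permissions guide linked above).
      fs.setPermission(p, new FsPermission((short) 0640));
    }
  }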

On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia  wrote:

> I think it's safe to assume that Hadoop works like MapReduce/GFS at the
> level described in those papers. In particular, in HDFS, there is a master
> node containing metadata and a number of slave nodes (datanodes) containing
> blocks, as in GFS. Clients start by talking to the master to list
> directories, etc. When they want to read a region of some file, they tell
> the master the filename and offset, and they receive a list of block
> locations (datanodes). They then contact the individual datanodes to read
> the blocks. When clients write a file, they first obtain a new block ID and
> list of nodes to write it to from the master, then contact the datanodes to
> write it (actually, the datanodes pipeline the write as in GFS) and report
> when the write is complete. HDFS actually has some security mechanisms built
> in, authenticating users based on their Unix ID and providing Unix-like file
> permissions. I don't know much about how these are implemented, but they
> would be a good place to start looking.
>
> On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana wrote:
>
>> Thanks Matie
>>
>> I had gone through the architecture document online. I am currently
>> working
>> on a project towards Security in Hadoop. I do know how the data moves
>> around
>> in the GFS but wasnt sure how much of that does HDFS follow and how
>> different it is from GFS. Can you throw some light on that?
>>
>> Security would also involve the Map Reduce jobs following the same
>> protocols. Thats why the question about how does the Hadoop framework
>> integrate with the HDFS, and how different is it from Map Reduce and GFS.
>> The GFS and Map Reduce papers give a good information on how those systems
>> are designed but there is nothing that concrete for Hadoop that I have
>> been
>> able to find.
>>
>> Amandeep
>>
>>
>> Amandeep Khurana
>> Computer Science Graduate Student
>> University of California, Santa Cruz
>>
>>
>> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia 
>> wrote:
>>
>> > Hi Amandeep,
>> > Hadoop is definitely inspired by MapReduce/GFS and aims to provide those
>> > capabilities as an open-source project. HDFS is similar to GFS (large
>> > blocks, replication, etc); some notable things missing are read-write
>> > support in the middle of a file (unlikely to be provided because few
>> Hadoop
>> > applications require it) and multiple appenders (the record append
>> > operation). You can read about HDFS architecture at
>> > http://hadoop.apache.org/core/docs/current/hdfs_design.html. The
>> MapReduce
>> > part of Hadoop interacts with HDFS in the same way that Google's
>> MapReduce
>> > interacts with GFS (shipping computation to the data), although Hadoop
>> > MapReduce also supports running over other distributed filesystems.
>> >
>> > Matei
>> >
>> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana 
>> > wrote:
>> >
>> > > Hi
>> > >
>> > > Is the HDFS architecture completely based on the Google Filesystem? If
>> it
>> > > isnt, what are the differences between the two?
>> > >
>> > > Secondly, is the coupling between Hadoop and HDFS same as how it is
>> > between
>> > > the Google's version of Map Reduce and GFS?
>> > >
>> > > Amandeep
>> > >
>> > >
>> > > Amandeep Khurana
>> > > Computer Science Graduate Student
>> > > University of California, Santa Cruz
>> > >
>> >
>>
>
>


Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
I think it's safe to assume that Hadoop works like MapReduce/GFS at the
level described in those papers. In particular, in HDFS, there is a master
node containing metadata and a number of slave nodes (datanodes) containing
blocks, as in GFS. Clients start by talking to the master to list
directories, etc. When they want to read a region of some file, they tell
the master the filename and offset, and they receive a list of block
locations (datanodes). They then contact the individual datanodes to read
the blocks. When clients write a file, they first obtain a new block ID and
list of nodes to write it to from the master, then contact the datanodes to
write it (actually, the datanodes pipeline the write as in GFS) and report
when the write is complete. HDFS actually has some security mechanisms built
in, authenticating users based on their Unix ID and providing Unix-like file
permissions. I don't know much about how these are implemented, but they
would be a good place to start looking.
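
The read half of that flow is visible directly in the client API. A small
sketch against the Java FileSystem API (the path is a placeholder, and exact
method signatures varied a bit across early versions) -- essentially the same
lookup the JobTracker performs when placing map tasks near their blocks:

  import java.util.Arrays;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);     // client handle; talks to the NameNode
      Path p = new Path("/data/big-file.txt");  // hypothetical path

      // Ask the master (NameNode) which datanodes hold which blocks of the file.
      FileStatus st = fs.getFileStatus(p);
      for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
        System.out.println("offset " + loc.getOffset() + ", length "
            + loc.getLength() + " on " + Arrays.toString(loc.getHosts()));
      }

      // A plain read does the same thing under the covers: block locations
      // come from the NameNode, the bytes themselves from the datanodes.
      FSDataInputStream in = fs.open(p);
      byte[] buf = new byte[4096];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes of the first block");
      in.close();
    }
  }
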
On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana  wrote:

> Thanks Matie
>
> I had gone through the architecture document online. I am currently working
> on a project towards Security in Hadoop. I do know how the data moves
> around
> in the GFS but wasnt sure how much of that does HDFS follow and how
> different it is from GFS. Can you throw some light on that?
>
> Security would also involve the Map Reduce jobs following the same
> protocols. Thats why the question about how does the Hadoop framework
> integrate with the HDFS, and how different is it from Map Reduce and GFS.
> The GFS and Map Reduce papers give a good information on how those systems
> are designed but there is nothing that concrete for Hadoop that I have been
> able to find.
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia 
> wrote:
>
> > Hi Amandeep,
> > Hadoop is definitely inspired by MapReduce/GFS and aims to provide those
> > capabilities as an open-source project. HDFS is similar to GFS (large
> > blocks, replication, etc); some notable things missing are read-write
> > support in the middle of a file (unlikely to be provided because few
> Hadoop
> > applications require it) and multiple appenders (the record append
> > operation). You can read about HDFS architecture at
> > http://hadoop.apache.org/core/docs/current/hdfs_design.html. The
> MapReduce
> > part of Hadoop interacts with HDFS in the same way that Google's
> MapReduce
> > interacts with GFS (shipping computation to the data), although Hadoop
> > MapReduce also supports running over other distributed filesystems.
> >
> > Matei
> >
> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana 
> > wrote:
> >
> > > Hi
> > >
> > > Is the HDFS architecture completely based on the Google Filesystem? If
> it
> > > isnt, what are the differences between the two?
> > >
> > > Secondly, is the coupling between Hadoop and HDFS same as how it is
> > between
> > > the Google's version of Map Reduce and GFS?
> > >
> > > Amandeep
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> >
>


Re: datanode not being started

2009-02-15 Thread Sandy
just some more information:
hadoop fsck produces:
Status: HEALTHY
 Total size: 0 B
 Total dirs: 9
 Total files: 0 (Files currently being written: 1)
 Total blocks (validated): 0
 Minimally replicated blocks: 0
 Over-replicated blocks: 0
 Under-replicated blocks: 0
 Mis-replicated blocks: 0
 Default replication factor: 1
 Average block replication: 0.0
 Corrupt blocks: 0
 Missing replicas: 0
 Number of data-nodes: 0
 Number of racks: 0


The filesystem under path '/' is HEALTHY

on the newly formatted hdfs.

jps says:
4723 Jps
4527 NameNode
4653 JobTracker


I can't copy files onto the dfs since I get "NotReplicatedYetException"s,
which I suspect is because there are no datanodes. My
"cluster" is a single Mac Pro with 8 cores. I haven't had to do anything
extra before in order to get datanodes started.

09/02/15 15:56:27 WARN dfs.DFSClient: Error Recovery for block null bad
datanode[0]
copyFromLocal: Could not get block locations. Aborting...


The corresponding error in the logs is:

2009-02-15 15:56:27,123 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 9000, call addBlock(/user/hadoop/input/.DS_Store,
DFSClient_755366230) from 127.0.0.1:49796: error: java.io.IOException: File
/user/hadoop/input/.DS_Store could only be replicated to 0 nodes, instead of
1
java.io.IOException: File /user/hadoop/input/.DS_Store could only be
replicated to 0 nodes, instead of 1
at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1120)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

On Sun, Feb 15, 2009 at 3:26 PM, Sandy  wrote:

> Thanks for your responses.
>
> I checked in the namenode and jobtracker logs and both say:
>
> INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call
> delete(/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system, true) from
> 127.0.0.1:61086: error: org.apache.hadoop.dfs.SafeModeException: Cannot
> delete /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node
> is in safe mode.
> The ratio of reported blocks 0. has not reached the threshold 0.9990.
> Safe mode will be turned off automatically.
> org.apache.hadoop.dfs.SafeModeException: Cannot delete
> /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node is in
> safe mode.
> The ratio of reported blocks 0. has not reached the threshold 0.9990.
> Safe mode will be turned off automatically.
> at
> org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1505)
> at
> org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1477)
> at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:425)
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>
>
> I think this is a continuation of my running problem. The nodes stay in
> safe mode but won't come out, even after several minutes. I believe this is
> because it keeps trying to contact a datanode that does not
> exist. Any suggestions on what I can do?
>
> I have recently tried to reformat the hdfs, using bin/hadoop namenode
> -format. From the output directed to standard out, I thought this completed
> correctly:
>
> Re-format filesystem in /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name
> ? (Y or N) Y
> 09/02/15 15:16:39 INFO fs.FSNamesystem:
> fsOwner=hadoop,staff,_lpadmin,com.apple.sharepoint.group.8,com.apple.sharepoint.group.3,com.apple.sharepoint.group.4,com.apple.sharepoint.group.2,com.apple.sharepoint.group.6,com.apple.sharepoint.group.9,com.apple.sharepoint.group.1,com.apple.sharepoint.group.5
> 09/02/15 15:16:39 INFO fs.FSNamesystem: supergroup=supergroup
> 09/02/15 15:16:39 INFO fs.FSNamesystem: isPermissionEnabled=true
> 09/02/15 15:16:39 INFO dfs.Storage: Image file of size 80 saved in 0
> seconds.
> 09/02/15 15:16:39 INFO dfs.Storage: Storage directory
> /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name has been successfully
> formatted.
> 09/02/15 15:16:39 INFO dfs.NameNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at
> loteria.cs.tamu.edu/128.194.143.170
> /
>
> However, after reformatting, I find that I have the same problems.
>
> Thanks,
> SM
>
> On Fri, Feb 13, 2009 at 5:

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread james warren
Hi Zander -
Two simple explanations come to mind:

* Is sshd running on your boxes?
* If so, do you have a firewall preventing ssh access?
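
A few quick, generic ways to check both (nothing Hadoop-specific here, and
command availability may vary by distro):

ping -c 3 192.168.0.2      # from node0: "Network is unreachable" usually means
                           # no route/interface to that subnet, not an ssh problem
pgrep -l sshd              # on each box: is the ssh daemon actually running?
sudo iptables -L -n        # any firewall rules blocking port 22?
ssh -v node1               # verbose output shows exactly where the connection fails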

cheers,
-jw

On Sat, Feb 14, 2009 at 7:50 PM, zander1013  wrote:

>
> hi,
>
> I am going through the tutorial on multinode cluster setup by M. Noll...
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>
> ... I am at the networking and ssh section. I am having trouble
> configuring the /etc/hosts file. My machines are node0 and node1, not
> master and slave. I am taking node0 as master and node1 as slave.
>
> I added the lines...
> # /etc/hosts (for master AND slave)
> 192.168.0.1    node0
> 192.168.0.2    node1
> ... to /etc/hosts. But when I try to ssh (either way) I get the error...
> "ssh: connect to host node1 (node0) port 22: Network is unreachable"
>
> /etc/hosts looks like this for node0...
>
> # /etc/hosts (for master and slave)
> 192.168.0.1 node0
> 192.168.0.2 node1
> #end hadoop section
> 127.0.0.1   localhost
> 127.0.1.1   node0
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>
>
> and this for node1...
>
> # /etc/hosts (for master and slave)
> 192.168.0.1 node0
> 192.168.0.2 node1
> #end hadoop section
> 127.0.0.1   localhost
> 127.0.1.1   node1
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>
> ... even after rebooting and connecting the machines with a switch using
> Cat 6 cable, I get the error I quoted above.
>
> please advise.
>
>
> thank you,
>
> -zander
> --
> View this message in context:
> http://www.nabble.com/setting-up-networking-and-ssh-on-multnode-cluster...-tp22019559p22019559.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Thanks Matei,

I had gone through the architecture document online. I am currently working
on a project towards Security in Hadoop. I do know how the data moves around
in GFS but wasn't sure how much of that HDFS follows and how
different it is from GFS. Can you throw some light on that?

Security would also involve the MapReduce jobs following the same
protocols. That's why the question about how the Hadoop framework
integrates with HDFS, and how different it is from MapReduce and GFS.
The GFS and MapReduce papers give good information on how those systems
are designed, but there is nothing as concrete for Hadoop that I have been
able to find.

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia  wrote:

> Hi Amandeep,
> Hadoop is definitely inspired by MapReduce/GFS and aims to provide those
> capabilities as an open-source project. HDFS is similar to GFS (large
> blocks, replication, etc); some notable things missing are read-write
> support in the middle of a file (unlikely to be provided because few Hadoop
> applications require it) and multiple appenders (the record append
> operation). You can read about HDFS architecture at
> http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce
> part of Hadoop interacts with HDFS in the same way that Google's MapReduce
> interacts with GFS (shipping computation to the data), although Hadoop
> MapReduce also supports running over other distributed filesystems.
>
> Matei
>
> On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana 
> wrote:
>
> > Hi
> >
> > Is the HDFS architecture completely based on the Google File System? If it
> > isn't, what are the differences between the two?
> >
> > Secondly, is the coupling between Hadoop and HDFS the same as that between
> > Google's version of MapReduce and GFS?
> >
> > Amandeep
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
>


Re: datanode not being started

2009-02-15 Thread Sandy
Thanks for your responses.

I checked in the namenode and jobtracker logs and both say:

INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call
delete(/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system, true) from
127.0.0.1:61086: error: org.apache.hadoop.dfs.SafeModeException: Cannot
delete /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node
is in safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990.
Safe mode will be turned off automatically.
org.apache.hadoop.dfs.SafeModeException: Cannot delete
/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node is in
safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990.
Safe mode will be turned off automatically.
at
org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1505)
at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1477)
at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:425)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)


I think this is a continuation of my running problem. The nodes stay in safe
mode but won't come out, even after several minutes. I believe this is
because it keeps trying to contact a datanode that does not exist.
Any suggestions on what I can do?

I have recently tried to reformat the hdfs, using bin/hadoop namenode
-format. From the output directed to standard out, I thought this completed
correctly:

Re-format filesystem in /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name ?
(Y or N) Y
09/02/15 15:16:39 INFO fs.FSNamesystem:
fsOwner=hadoop,staff,_lpadmin,com.apple.sharepoint.group.8,com.apple.sharepoint.group.3,com.apple.sharepoint.group.4,com.apple.sharepoint.group.2,com.apple.sharepoint.group.6,com.apple.sharepoint.group.9,com.apple.sharepoint.group.1,com.apple.sharepoint.group.5
09/02/15 15:16:39 INFO fs.FSNamesystem: supergroup=supergroup
09/02/15 15:16:39 INFO fs.FSNamesystem: isPermissionEnabled=true
09/02/15 15:16:39 INFO dfs.Storage: Image file of size 80 saved in 0
seconds.
09/02/15 15:16:39 INFO dfs.Storage: Storage directory
/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name has been successfully
formatted.
09/02/15 15:16:39 INFO dfs.NameNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at loteria.cs.tamu.edu/128.194.143.170
/

However, after reformatting, I find that I have the same problems.
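
For reference, two commands that can confirm whether any datanodes have
registered and whether the namenode is still in safe mode (this assumes the
0.18-era dfsadmin syntax; check bin/hadoop dfsadmin -help on your install):

bin/hadoop dfsadmin -report        # lists the datanodes the namenode knows about
bin/hadoop dfsadmin -safemode get  # shows whether safe mode is still on

Safe mode can be forced off with "bin/hadoop dfsadmin -safemode leave", but
that only hides the symptom; with zero datanodes, writes will still fail.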

Thanks,
SM

On Fri, Feb 13, 2009 at 5:39 PM, james warren  wrote:

> Sandy -
>
> I suggest you take a look into your NameNode and DataNode logs.  From the
> information posted, these likely would be at
>
>
> /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-namenode-loteria.cs.tamu.edu.log
>
> /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-jobtracker-loteria.cs.tamu.edu.log
>
> If the cause isn't obvious from what you see there, could you please post
> the last few lines from each log?
>
> -jw
>
> On Fri, Feb 13, 2009 at 3:28 PM, Sandy  wrote:
>
> > Hello,
> >
> > I would really appreciate any help I can get on this! I've suddenly run
> > into a very strange error.
> >
> > when I do:
> > bin/start-all
> > I get:
> > hadoop$ bin/start-all.sh
> > starting namenode, logging to
> >
> >
> /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-namenode-loteria.cs.tamu.edu.out
> > starting jobtracker, logging to
> >
> >
> /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-jobtracker-loteria.cs.tamu.edu.out
> >
> > No datanode, secondary namenode, or tasktracker is being started.
> >
> > When I try to upload anything on the dfs, I get a "node in safemode"
> error
> > (even after waiting 5 minutes), presumably because it's trying to reach a
> > datanode that does not exist.  The same "safemode" error occurs when I
> try
> > to run jobs.
> >
> > I have tried bin/stop-all and then bin/start-all again. I get the same
> > problem!
> >
> > This is incredibly strange, since I was previously able to start and run
> > jobs without any issue using this version on this machine. I am running
> > jobs
> > on a single Mac Pro running OS X 10.5
> >
> > I have tried updating to hadoop-0.19.0, and I get the same problem. I
> have
> > even tried this using previous versions, and I'm getting the same
> problem!
> >
> > Anyone have any idea why this suddenly could be happening? What am I
> doing
> > wrong?
> >
> > For convenience, I'm including portions of both conf/hadoop-env.sh and
> > conf/hadoop-site.xml:
> >
> > --- hadoop-env.sh ---
> >  # Set Hadoop-specific environment variables here.
> >
> > # The only required environment var

Re: question about hadoop and amazon ec2 ?

2009-02-15 Thread nitesh bhatia
1. They are related in that one can use EC2 to serve as the computation
platform for Hadoop.
Refer: http://wiki.apache.org/hadoop/AmazonEC2

2. Yes.
Refer:
http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

3. You can use EC2 to serve as the computation platform for Hadoop.
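
For what it's worth, the Hadoop distribution ships EC2 helper scripts under
src/contrib/ec2; a rough sketch of using them (the cluster name and size are
made up here, and the exact sub-commands may vary by release, so treat the
wiki page above as authoritative):

bin/hadoop-ec2 launch-cluster my-cluster 2    # start a master plus 2 slaves
bin/hadoop-ec2 login my-cluster               # ssh to the master node
bin/hadoop-ec2 terminate-cluster my-cluster   # shut the instances down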

--nitesh

On Sun, Feb 15, 2009 at 2:18 PM, buddha1021  wrote:
>
> hi:
> What is the relationship between Hadoop and Amazon EC2?
> Can Hadoop run on a common PC (not a server) directly?
> Why do some people say Hadoop runs on Amazon EC2?
> thanks!
> --
> View this message in context: 
> http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p22020652.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Hi Amandeep,
Hadoop is definitely inspired by MapReduce/GFS and aims to provide those
capabilities as an open-source project. HDFS is similar to GFS (large
blocks, replication, etc); some notable things missing are read-write
support in the middle of a file (unlikely to be provided because few Hadoop
applications require it) and multiple appenders (the record append
operation). You can read about HDFS architecture at
http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce
part of Hadoop interacts with HDFS in the same way that Google's MapReduce
interacts with GFS (shipping computation to the data), although Hadoop
MapReduce also supports running over other distributed filesystems.
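
As a purely illustrative example (the hostname and port are invented), the
choice of backing filesystem is just configuration; fs.default.name in
hadoop-site.xml decides whether jobs read from HDFS, the local filesystem,
or another supported store such as S3:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:9000</value>
  <!-- e.g. file:/// for the local filesystem, or s3://bucket for the S3 block store -->
</property>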

Matei

On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana  wrote:

> Hi
>
> Is the HDFS architecture completely based on the Google File System? If it
> isn't, what are the differences between the two?
>
> Secondly, is the coupling between Hadoop and HDFS the same as that between
> Google's version of MapReduce and GFS?
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>


HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Hi

Is the HDFS architecture completely based on the Google File System? If it
isn't, what are the differences between the two?

Secondly, is the coupling between Hadoop and HDFS the same as that between
Google's version of MapReduce and GFS?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


Re: can't edit the file that mounted by fuse_dfs by editor

2009-02-15 Thread S D
I followed these instructions

http://wiki.apache.org/hadoop/MountableHDFS

and was able to get things working with 0.19.0 on Fedora. The only problem I
ran into was the AMD64 issue on one of my boxes (see the note on the above
link); I edited the Makefile and set OSARCH as suggested but couldn't get
things working on my AMD box.

Aside from that, are you certain that vi is a supported command with
Fuse-DFS? It is not listed in the commands on the above link. I'd first see
if you can get things working with one of the commands they mention (e.g.,
cp, ls, more, cat, find, less, rm, mkdir, mv, rmdir).
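
For example, a quick sanity check against an already-mounted /mnt/dfs (the
mount point from your message) could stick to those listed commands; the
paths here are just illustrative:

ls /mnt/dfs                          # listing through FUSE
mkdir /mnt/dfs/fusetest              # create a directory
cp /etc/hosts /mnt/dfs/fusetest/     # whole-file sequential copy
cat /mnt/dfs/fusetest/hosts          # read it back
rm /mnt/dfs/fusetest/hosts           # clean up
rmdir /mnt/dfs/fusetest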

John

On Wed, Feb 11, 2009 at 8:25 PM, zhuweimin  wrote:

> Hey all
>
> I was trying to edit a file mounted by fuse_dfs with the vi editor, but
> the contents could not be saved.
> The command is like the following:
> [had...@vm-centos-5-shu-4 src]$ vi /mnt/dfs/test.txt
> The error message from system log (/var/log/messages) is the following:
> Feb 12 09:53:48 VM-CentOS-5-SHU-4 fuse_dfs: ERROR: could not connect open
> file fuse_dfs.c:1340
>
> I am using hadoop 0.19.0 and fuse-dfs version 26 with CentOS 5.2.
> Does anyone have an idea as to what could be wrong?
>
> Thanks!
> zhuweimin
>
>
>


Re: Hostnames on MapReduce Web UI

2009-02-15 Thread Nick Cen
Try commenting out the localhost definition in your /etc/hosts file.
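
For example (hostnames and addresses below are invented, not taken from your
setup), the goal is that each machine's own hostname resolves to its LAN
address rather than to 127.0.0.1 / 127.0.1.1:

# /etc/hosts on a worker -- before
127.0.0.1   localhost
127.0.1.1   tracker1        # this mapping makes the tasktracker report "localhost"

# after: comment out the loopback alias and map the hostname to the real interface
127.0.0.1   localhost
#127.0.1.1  tracker1
192.168.1.11   tracker1     # illustrative LAN address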

2009/2/14 S D 

> I'm reviewing the task trackers on the web interface (
> http://jobtracker-hostname:50030/) for my cluster of 3 machines. The names
> of the task trackers do not list real domain names; e.g., one of the task
> trackers is listed as:
>
> tracker_localhost:localhost/127.0.0.1:48167
>
> I believe that the networking on my machines is set correctly. What do I
> need to configure so that the listing above will show the actual domain
> name? This will help me in diagnosing where problems are occurring in my
> cluster. Note that at the top of the page the hostname (in my case "storm")
> is properly listed; e.g.,
>
> storm Hadoop Machine List
>
> Thanks,
> John
>



-- 
http://daily.appspot.com/food/


Some Storage communication related questions

2009-02-15 Thread Wasim Bari
Hi,
 I have multiple questions:

Does Hadoop use a parallel technique for copyFromLocal and copyToLocal
(like DistCp), or is it a simple single-stream write?

For Amazon S3 to local filesystem communication, does Hadoop use a REST
service interface or SOAP?

Are there new storage systems currently in the pipeline to be interfaced
with Hadoop?


Thanks,

Wasim

Re: HDFS on non-identical nodes

2009-02-15 Thread Deepak
Thanks Brian and Chen!

I finally sorted out why the cluster stopped after running out
of space. It was because of a master failure due to disk space.

Regarding the automatic balancer, I guess in our case the rate of copying is
faster than the balancer's rate; we found the balancer does start but couldn't
perform its job.
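
For reference, a rough sketch of the knobs involved (this assumes the
0.18/0.19-era flag and property names; double-check against your release):

bin/start-balancer.sh -threshold 5    # balance to within 5% instead of the default 10%

# hadoop-site.xml: per-datanode balancing bandwidth in bytes/sec; the default
# is low (around 1 MB/s in that era), so heavy ingest can easily outpace it
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>   <!-- illustrative: 10 MB/s -->
</property>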

Anyway, thanks for your help! It helped me sort out some things.

Cheers,
Deepak

On Thu, Feb 12, 2009 at 5:32 PM, He Chen  wrote:
>  I think you should confirm your balancer is still running. Did you change
> the threshold of the HDFS balancer? Maybe it is too large?
>
> The balancer stops working when it meets any of these 5 conditions:
>
> 1. The datanodes are balanced (obviously not your case);
> 2. No more blocks can be moved (all blocks on the unbalanced nodes are busy
> or recently used);
> 3. No block has been moved in 20 minutes, over 5 consecutive attempts;
> 4. Another balancer is running;
> 5. An I/O exception occurs.
>
>
> The default threshold is 10% of each datanode's capacity: for 1TB that is
> 100GB, for 3TB it is 300GB, and for 60GB it is 6GB.
>
> Hope helpful
>
>
> On Thu, Feb 12, 2009 at 10:06 AM, Brian Bockelman wrote:
>
>>
>> On Feb 12, 2009, at 2:54 AM, Deepak wrote:
>>
>> Hi,
>>>
>>> We're running a Hadoop cluster on 4 nodes; our primary purpose is to
>>> provide a distributed storage solution for internal
>>> applications here at TellyTopia Inc.
>>>
>>> Our cluster consists of non-identical nodes (one with 1TB, another two
>>> with 3TB, and one more with 60GB). While copying data onto HDFS we
>>> noticed that the node with 60GB of storage ran out of disk space, and even
>>> the balancer couldn't balance because the cluster was stopped. Now my
>>> questions are:
>>>
>>> 1. Is Hadoop suitable for non-identical cluster nodes?
>>>
>>
>> Yes.  Our cluster has between 60GB and 40TB on our nodes.  The majority
>> have around 3TB.
>>
>>
>>> 2. Is there any way to automatically balancing of nodes?
>>>
>>
>> We have a cron script which automatically starts the Balancer.  It's dirty,
>> but it works.
>>
>>
>>> 3. Why does the Hadoop cluster stop when one node runs out of disk?
>>>
>>>
>> That's not normal.  Trust me, if that was always true, we'd be perpetually
>> screwed :)
>>
>> There might be some other underlying error you're missing...
>>
>> Brian
>>
>>
>>> Any further inputs are appreciated!
>>>
>>> Cheers,
>>> Deepak
>>> TellyTopia Inc.
>>>
>>
>>
>
>
> --
> Chen He
> RCF CSE Dept.
> University of Nebraska-Lincoln
> US
>



-- 
Deepak
TellyTopia Inc.


question about hadoop and amazon ec2 ?

2009-02-15 Thread buddha1021

hi:
What is the relationship between Hadoop and Amazon EC2?
Can Hadoop run on a common PC (not a server) directly?
Why do some people say Hadoop runs on Amazon EC2?
thanks!
-- 
View this message in context: 
http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p22020652.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.