"If there was a malicious process though, then I imagine it could talk to a datanode directly and request a specific block."
I didn't understand the usage of "malicious" here, but any process using the
HDFS API should first ask the NameNode where the file's replicas are. Then - I
assume - the NameNode returns the IP of the best DataNode (or all of the IPs),
and then the call to that specific DataNode is made. Please correct me if I'm
wrong.

Cheers,
Rasit
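As a rough illustration of that flow (the file path below is made up), a plain
Java client never contacts a DataNode by name; it just opens the file through
the FileSystem API, and the client library does the NameNode lookup and the
DataNode reads underneath:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Connects to the NameNode named in fs.default.name
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file: the client asks the NameNode which DataNodes hold
        // the blocks for this path, then streams the bytes from those DataNodes.
        FSDataInputStream in = fs.open(new Path("/user/rasit/data/part-00000"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
          System.out.write(buf, 0, n);
        }
        in.close();
      }
    }

Whether the NameNode hands back one "best" location or the whole list is a
detail of that lookup; the application code looks the same either way.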
2009/2/16 Matei Zaharia <ma...@cloudera.com>:
> In general, yeah, the scripts can access any resource they want (within the
> permissions of the user that the task runs as). It's also possible to access
> HDFS from scripts because HDFS provides a FUSE interface that can make it
> look like a regular file system on the machine. (The FUSE module in turn
> talks to the namenode as a regular HDFS client.)

On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> I don't know much about Hadoop streaming and have a quick question here.
>
> The snippets of code/programs that you attach to the MapReduce job might want
> to access outside resources (like you mentioned). Now these might not need to
> go to the namenode, right? For example, a Python script. How would it access
> the data? Would it ask the parent Java process (in the tasktracker) to get
> the data, or would it go and do stuff on its own?

On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Nope, typically the JobTracker just starts the process, and the tasktracker
> talks directly to the namenode to get a pointer to the datanode, and then
> directly to the datanode.

On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Alright.. got it.
>
> Now, do the task trackers talk to the namenode and the data node directly,
> or do they go through the job tracker for it? So, if my code is such that I
> need to access more files from HDFS, would the job tracker get involved or
> not?

On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Normally, HDFS files are accessed through the namenode. If there was a
> malicious process though, then I imagine it could talk to a datanode
> directly and request a specific block.

On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Ok, got it.
>
> Now, when my job needs to access another file, does it go to the NameNode to
> get the block IDs? How does the Java process know where the files are and
> how to access them?

On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> I mentioned this case because even jobs written in Java can use the HDFS API
> to talk to the NameNode and access the filesystem. People often do this
> because their job needs to read a config file, some small data table, etc.,
> and use this information in its map or reduce functions. In this case, you
> open the second file separately in your mapper's init function and read
> whatever you need from it. In general, I wanted to point out that you can't
> know which files a job will access unless you look at its source code or
> monitor the calls it makes; the input file(s) you provide in the job
> description are a hint to the MapReduce framework to place your job on
> certain nodes, but it's reasonable for the job to access other files as well.
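In the old org.apache.hadoop.mapred API, that "init function" is
configure(JobConf). A minimal sketch of a mapper that loads a small side table
this way might look like the following (the class name, the join.table.path
property, and the tab-separated format are all made up for illustration):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> smallTable = new HashMap<String, String>();

      // Called once per task before any map() calls; a common place to load a
      // side file from HDFS.
      public void configure(JobConf job) {
        Path side = new Path(job.get("join.table.path", "/tables/small_table.txt"));
        try {
          // Plain HDFS client call: block locations come from the NameNode,
          // bytes from the DataNodes, exactly as for the job's main input.
          FileSystem fs = side.getFileSystem(job);
          BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(side)));
          String line;
          while ((line = r.readLine()) != null) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
              smallTable.put(kv[0], kv[1]);
            }
          }
          r.close();
        } catch (IOException e) {
          throw new RuntimeException("Could not load side table " + side, e);
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] kv = value.toString().split("\t", 2);
        if (kv.length < 2) {
          return;
        }
        String joined = smallTable.get(kv[0]);
        if (joined != null) {
          out.collect(new Text(kv[0]), new Text(kv[1] + "\t" + joined));
        }
      }
    }

Nothing in the job description mentions the side table; the only way to know
this mapper reads it is to look at its code, which is exactly the point made
above.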
On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Another question that I have here - when the jobs run arbitrary code and
> access data from HDFS, do they go to the namenode to get the block
> information?

On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Assuming that the job is purely in Java and not involving streaming or
> pipes, wouldn't the resources (files) required by the job as inputs be known
> beforehand? So, if the map task is accessing a second file, how does that
> make it different, except that there are multiple files? The JobTracker
> would know beforehand that multiple files would be accessed. Right?
>
> I am slightly confused why you have mentioned this case separately... Can
> you elaborate on it a little bit?

On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Typically the data flow is like this:
> 1) Client submits a job description to the JobTracker.
> 2) JobTracker figures out block locations for the input file(s) by talking
>    to the HDFS NameNode.
> 3) JobTracker creates a job description file in HDFS which will be read by
>    the nodes to copy over the job's code etc.
> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
>    appropriate data blocks.
> 5) After running, maps create intermediate output files on those slaves.
>    These are not in HDFS; they're in some temporary storage used by
>    MapReduce.
> 6) JobTracker starts reduces on a series of slaves, which copy over the
>    appropriate map outputs, apply the reduce function, and write the outputs
>    to HDFS (one output file per reducer).
> 7) Some logs for the job may also be put into HDFS by the JobTracker.
>
> However, there is a big caveat, which is that the map and reduce tasks run
> arbitrary code. It is not unusual to have a map that opens a second HDFS
> file to read some information (e.g. for doing a join of a small table
> against a big file). If you use Hadoop Streaming or Pipes to write a job in
> Python, Ruby, C, etc., then you are launching arbitrary processes which may
> also access external resources in this manner. Some people also read/write
> to DBs (e.g. MySQL) from their tasks. A comprehensive security solution
> would ideally deal with these cases too.
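Step 1 of that flow is what JobClient does when a driver program submits a
job. A minimal driver against the old mapred API, reusing the hypothetical
JoinMapper sketched above, might look like this (input and output paths come
from the command line):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitExample {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(SubmitExample.class);
        job.setJobName("join-example");        // part of the job description sent to the JobTracker (step 1)
        job.setMapperClass(JoinMapper.class);  // hypothetical mapper from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // The declared inputs are what the JobTracker resolves to block
        // locations when it places map tasks (steps 2 and 4).
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Reducers each write one output file under this HDFS directory (step 6).
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        JobClient.runJob(job);                 // submits the job and waits for completion
      }
    }

Everything the JobTracker knows up front is what this JobConf declares; the
side table opened in configure() never appears here, which is the caveat above.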
On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> A quick question here. How does a typical Hadoop job work at the system
> level? What are the various interactions, and how does the data flow?

On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Thanks Matei. If the basic architecture is similar to the Google stuff, I
> can safely just work on the project using the information from the papers.
>
> I am aware of the HADOOP-4487 JIRA and the current status of the permissions
> mechanism. I had a look at them earlier.
>
> Cheers,
> Amandeep

On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Forgot to add, this JIRA details the latest security features that are being
> worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487.
> This document describes the current status and limitations of the
> permissions mechanism:
> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.

On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> I think it's safe to assume that Hadoop works like MapReduce/GFS at the
> level described in those papers. In particular, in HDFS, there is a master
> node containing metadata and a number of slave nodes (datanodes) containing
> blocks, as in GFS. Clients start by talking to the master to list
> directories, etc. When they want to read a region of some file, they tell
> the master the filename and offset, and they receive a list of block
> locations (datanodes). They then contact the individual datanodes to read
> the blocks. When clients write a file, they first obtain a new block ID and
> a list of nodes to write it to from the master, then contact the datanodes
> to write it (actually, the datanodes pipeline the write as in GFS) and
> report when the write is complete. HDFS actually has some security
> mechanisms built in, authenticating users based on their Unix ID and
> providing Unix-like file permissions. I don't know much about how these are
> implemented, but they would be a good place to start looking.
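Both the write path and the Unix-like permissions described here are exposed
through the same FileSystem client API. A small sketch (hypothetical path,
owner, and group; it assumes the permissions feature is enabled on the
cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class WriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/amandeep/notes.txt");   // hypothetical path

        // create() obtains new blocks from the master (NameNode); the bytes
        // are then pipelined through the DataNodes it chose.
        FSDataOutputStream out = fs.create(p);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Unix-like permission bits: rw for the owner, r for group and other.
        fs.setPermission(p, new FsPermission((short) 0644));
        // Changing ownership normally requires superuser rights on the cluster.
        fs.setOwner(p, "amandeep", "grad");
      }
    }

The block allocation and the DataNode pipeline are hidden inside create() and
close(); the application only ever holds an output stream.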
On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Thanks Matei.
>
> I had gone through the architecture document online. I am currently working
> on a project towards security in Hadoop. I do know how the data moves around
> in GFS, but I wasn't sure how much of that HDFS follows and how different it
> is from GFS. Can you throw some light on that?
>
> Security would also involve the MapReduce jobs following the same protocols.
> That's why the question about how the Hadoop framework integrates with HDFS,
> and how different it is from MapReduce and GFS. The GFS and MapReduce papers
> give good information on how those systems are designed, but there is
> nothing that concrete for Hadoop that I have been able to find.

On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Hi Amandeep,
>
> Hadoop is definitely inspired by MapReduce/GFS and aims to provide those
> capabilities as an open-source project. HDFS is similar to GFS (large
> blocks, replication, etc.); some notable things missing are read-write
> support in the middle of a file (unlikely to be provided because few Hadoop
> applications require it) and multiple appenders (the record append
> operation). You can read about the HDFS architecture at
> http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce
> part of Hadoop interacts with HDFS in the same way that Google's MapReduce
> interacts with GFS (shipping computation to the data), although Hadoop
> MapReduce also supports running over other distributed filesystems.
>
> Matei
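That last point - MapReduce running over filesystems other than HDFS - falls
out of FileSystem implementations being selected by URI scheme. A tiny sketch
(the NameNode host and port are made up, and the hdfs:// lookup needs a
reachable NameNode at that address):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsSchemes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The scheme in the URI picks the implementation, so the same API can
        // back a job with HDFS, the local filesystem, or another store.
        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode.example.com:9000/"), conf);
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);

        System.out.println(hdfs.getUri()  + " -> " + hdfs.getClass().getName());
        System.out.println(local.getUri() + " -> " + local.getClass().getName());
        System.out.println("/tmp exists locally: " + local.exists(new Path("/tmp")));
      }
    }

A job whose input and output paths carry a different scheme goes through the
corresponding implementation instead of the HDFS client.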
On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <ama...@gmail.com> wrote:
> Hi,
>
> Is the HDFS architecture completely based on the Google File System? If it
> isn't, what are the differences between the two?
>
> Secondly, is the coupling between Hadoop and HDFS the same as it is between
> Google's version of MapReduce and GFS?

--
M. Raşit ÖZDAŞ