"If there was a malicious process though, then I imagine it could talk to a datanode directly and request a specific block."
I didn't understand the usage of "malicious" here, but any process using the
HDFS API should first ask the NameNode where the file's replicas are. Then - I
assume - the NameNode returns the IP of the best DataNode (or all of the IPs),
and then the call to that specific DataNode is made. Please correct me if I'm
wrong.

Cheers,
Rasit
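As a rough illustration of that flow (the file path below is made up), a plain
Java client never contacts a DataNode by name; it just opens the file through
the FileSystem API, and the client library does the NameNode lookup and the
DataNode reads underneath:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Connects to the NameNode named in fs.default.name
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file: the client asks the NameNode which DataNodes hold
        // the blocks for this path, then streams the bytes from those DataNodes.
        FSDataInputStream in = fs.open(new Path("/user/rasit/data/part-00000"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
          System.out.write(buf, 0, n);
        }
        in.close();
      }
    }

Whether the NameNode hands back one "best" location or the whole list is a
detail of that lookup; the application code looks the same either way.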
2009/2/16 Matei Zaharia <ma...@cloudera.com>:
> In general, yeah, the scripts can access any resource they want (within the
> permissions of the user that the task runs as). It's also possible to access
> HDFS from scripts because HDFS provides a FUSE interface that can make it
> look like a regular file system on the machine. (The FUSE module in turn
> talks to the namenode as a regular HDFS client.)

On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> I don't know much about Hadoop streaming and have a quick question here.
>
> The snippets of code/programs that you attach to the MapReduce job might want
> to access outside resources (like you mentioned). Now these might not need to
> go to the namenode, right? For example, a Python script. How would it access
> the data? Would it ask the parent Java process (in the tasktracker) to get
> the data, or would it go and do stuff on its own?

On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Nope, typically the JobTracker just starts the process, and the tasktracker
> talks directly to the namenode to get a pointer to the datanode, and then
> directly to the datanode.

On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Alright.. got it.
>
> Now, do the task trackers talk to the namenode and the data node directly,
> or do they go through the job tracker for it? So, if my code is such that I
> need to access more files from HDFS, would the job tracker get involved or
> not?

On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Normally, HDFS files are accessed through the namenode. If there was a
> malicious process though, then I imagine it could talk to a datanode
> directly and request a specific block.

On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Ok, got it.
>
> Now, when my job needs to access another file, does it go to the NameNode to
> get the block IDs? How does the Java process know where the files are and
> how to access them?

On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> I mentioned this case because even jobs written in Java can use the HDFS API
> to talk to the NameNode and access the filesystem. People often do this
> because their job needs to read a config file, some small data table, etc.,
> and use this information in its map or reduce functions. In this case, you
> open the second file separately in your mapper's init function and read
> whatever you need from it. In general, I wanted to point out that you can't
> know which files a job will access unless you look at its source code or
> monitor the calls it makes; the input file(s) you provide in the job
> description are a hint to the MapReduce framework to place your job on
> certain nodes, but it's reasonable for the job to access other files as well.
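In the old org.apache.hadoop.mapred API, that "init function" is
configure(JobConf). A minimal sketch of a mapper that loads a small side table
this way might look like the following (the class name, the join.table.path
property, and the tab-separated format are all made up for illustration):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> smallTable = new HashMap<String, String>();

      // Called once per task before any map() calls; a common place to load a
      // side file from HDFS.
      public void configure(JobConf job) {
        Path side = new Path(job.get("join.table.path", "/tables/small_table.txt"));
        try {
          // Plain HDFS client call: block locations come from the NameNode,
          // bytes from the DataNodes, exactly as for the job's main input.
          FileSystem fs = side.getFileSystem(job);
          BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(side)));
          String line;
          while ((line = r.readLine()) != null) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
              smallTable.put(kv[0], kv[1]);
            }
          }
          r.close();
        } catch (IOException e) {
          throw new RuntimeException("Could not load side table " + side, e);
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] kv = value.toString().split("\t", 2);
        if (kv.length < 2) {
          return;
        }
        String joined = smallTable.get(kv[0]);
        if (joined != null) {
          out.collect(new Text(kv[0]), new Text(kv[1] + "\t" + joined));
        }
      }
    }

Nothing in the job description mentions the side table; the only way to know
this mapper reads it is to look at its code, which is exactly the point made
above.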
On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Another question that I have here - when the jobs run arbitrary code and
> access data from HDFS, do they go to the namenode to get the block
> information?

On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Assuming that the job is purely in Java and not involving streaming or
> pipes, wouldn't the resources (files) required by the job as inputs be known
> beforehand? So, if the map task is accessing a second file, how does that
> make it different, except that there are multiple files? The JobTracker
> would know beforehand that multiple files would be accessed. Right?
>
> I am slightly confused why you have mentioned this case separately... Can
> you elaborate on it a little bit?

On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Typically the data flow is like this:
> 1) Client submits a job description to the JobTracker.
> 2) JobTracker figures out block locations for the input file(s) by talking
>    to the HDFS NameNode.
> 3) JobTracker creates a job description file in HDFS which will be read by
>    the nodes to copy over the job's code etc.
> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
>    appropriate data blocks.
> 5) After running, maps create intermediate output files on those slaves.
>    These are not in HDFS; they're in some temporary storage used by
>    MapReduce.
> 6) JobTracker starts reduces on a series of slaves, which copy over the
>    appropriate map outputs, apply the reduce function, and write the outputs
>    to HDFS (one output file per reducer).
> 7) Some logs for the job may also be put into HDFS by the JobTracker.
>
> However, there is a big caveat, which is that the map and reduce tasks run
> arbitrary code. It is not unusual to have a map that opens a second HDFS
> file to read some information (e.g. for doing a join of a small table
> against a big file). If you use Hadoop Streaming or Pipes to write a job in
> Python, Ruby, C, etc., then you are launching arbitrary processes which may
> also access external resources in this manner. Some people also read/write
> to DBs (e.g. MySQL) from their tasks. A comprehensive security solution
> would ideally deal with these cases too.
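Step 1 of that flow is what JobClient does when a driver program submits a
job. A minimal driver against the old mapred API, reusing the hypothetical
JoinMapper sketched above, might look like this (input and output paths come
from the command line):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitExample {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(SubmitExample.class);
        job.setJobName("join-example");        // part of the job description sent to the JobTracker (step 1)
        job.setMapperClass(JoinMapper.class);  // hypothetical mapper from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // The declared inputs are what the JobTracker resolves to block
        // locations when it places map tasks (steps 2 and 4).
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Reducers each write one output file under this HDFS directory (step 6).
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        JobClient.runJob(job);                 // submits the job and waits for completion
      }
    }

Everything the JobTracker knows up front is what this JobConf declares; the
side table opened in configure() never appears here, which is the caveat above.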
On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> A quick question here. How does a typical Hadoop job work at the system
> level? What are the various interactions, and how does the data flow?

On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Thanks Matei. If the basic architecture is similar to the Google stuff, I
> can safely just work on the project using the information from the papers.
>
> I am aware of the HADOOP-4487 JIRA and the current status of the permissions
> mechanism. I had a look at them earlier.
>
> Cheers,
> Amandeep

On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Forgot to add, this JIRA details the latest security features that are being
> worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487.
> This document describes the current status and limitations of the
> permissions mechanism:
> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.

On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> I think it's safe to assume that Hadoop works like MapReduce/GFS at the
> level described in those papers. In particular, in HDFS, there is a master
> node containing metadata and a number of slave nodes (datanodes) containing
> blocks, as in GFS. Clients start by talking to the master to list
> directories, etc. When they want to read a region of some file, they tell
> the master the filename and offset, and they receive a list of block
> locations (datanodes). They then contact the individual datanodes to read
> the blocks. When clients write a file, they first obtain a new block ID and
> a list of nodes to write it to from the master, then contact the datanodes
> to write it (actually, the datanodes pipeline the write as in GFS) and
> report when the write is complete. HDFS actually has some security
> mechanisms built in, authenticating users based on their Unix ID and
> providing Unix-like file permissions. I don't know much about how these are
> implemented, but they would be a good place to start looking.
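Both the write path and the Unix-like permissions described here are exposed
through the same FileSystem client API. A small sketch (hypothetical path,
owner, and group; it assumes the permissions feature is enabled on the
cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class WriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/amandeep/notes.txt");   // hypothetical path

        // create() obtains new blocks from the master (NameNode); the bytes
        // are then pipelined through the DataNodes it chose.
        FSDataOutputStream out = fs.create(p);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Unix-like permission bits: rw for the owner, r for group and other.
        fs.setPermission(p, new FsPermission((short) 0644));
        // Changing ownership normally requires superuser rights on the cluster.
        fs.setOwner(p, "amandeep", "grad");
      }
    }

The block allocation and the DataNode pipeline are hidden inside create() and
close(); the application only ever holds an output stream.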
On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> Thanks Matei.
>
> I had gone through the architecture document online. I am currently working
> on a project towards security in Hadoop. I do know how the data moves around
> in GFS, but I wasn't sure how much of that HDFS follows and how different it
> is from GFS. Can you throw some light on that?
>
> Security would also involve the MapReduce jobs following the same protocols.
> That's why the question about how the Hadoop framework integrates with HDFS,
> and how different it is from MapReduce and GFS. The GFS and MapReduce papers
> give good information on how those systems are designed, but there is
> nothing that concrete for Hadoop that I have been able to find.

On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> Hi Amandeep,
>
> Hadoop is definitely inspired by MapReduce/GFS and aims to provide those
> capabilities as an open-source project. HDFS is similar to GFS (large
> blocks, replication, etc.); some notable things missing are read-write
> support in the middle of a file (unlikely to be provided because few Hadoop
> applications require it) and multiple appenders (the record append
> operation). You can read about the HDFS architecture at
> http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce
> part of Hadoop interacts with HDFS in the same way that Google's MapReduce
> interacts with GFS (shipping computation to the data), although Hadoop
> MapReduce also supports running over other distributed filesystems.
>
> Matei
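That last point - MapReduce running over filesystems other than HDFS - falls
out of FileSystem implementations being selected by URI scheme. A tiny sketch
(the NameNode host and port are made up, and the hdfs:// lookup needs a
reachable NameNode at that address):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsSchemes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The scheme in the URI picks the implementation, so the same API can
        // back a job with HDFS, the local filesystem, or another store.
        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode.example.com:9000/"), conf);
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);

        System.out.println(hdfs.getUri()  + " -> " + hdfs.getClass().getName());
        System.out.println(local.getUri() + " -> " + local.getClass().getName());
        System.out.println("/tmp exists locally: " + local.exists(new Path("/tmp")));
      }
    }

A job whose input and output paths carry a different scheme goes through the
corresponding implementation instead of the HDFS client.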
On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <ama...@gmail.com> wrote:
> Hi,
>
> Is the HDFS architecture completely based on the Google File System? If it
> isn't, what are the differences between the two?
>
> Secondly, is the coupling between Hadoop and HDFS the same as it is between
> Google's version of MapReduce and GFS?

--
M. Raşit ÖZDAŞ