Re: HDFS architecture based on GFS?

2009-02-15 Thread Rasit OZDAS
"If there was a malicious process though, then I imagine it could talk to a datanode directly and request a specific block." I didn't understand usage of "malicuous" here, but any process using HDFS api should first ask NameNode where the file replications are. Then - I assume - namenode returns t

Re: HADOOP-2536 supports Oracle too?

2009-02-15 Thread sandhiya
@Amandeep Hi, I'm new to Hadoop and am trying to run a simple database connectivity program on it. Could you please tell me how you went about it? My mail id is "sandys_cr...@yahoo.com". A copy of your code that successfully connected to MySQL would also be helpful. Thanks, Sandhiya Enis Soztutar-

Re: Namenode not listening for remote connections to port 9000

2009-02-15 Thread Michael Lynch
Hmmm - I checked all the /etc/hosts files, and they're all fine. Then I switched conf/hadoop-site.xml to specify IP addresses instead of host names. Then, oddly enough, it starts working... Now the funny thing is this: it's fine ssh-ing to the correct machines to start up datanodes, but when

Re: Hostnames on MapReduce Web UI

2009-02-15 Thread S D
Thanks, this did it. I changed my /etc/hosts file on each node, swapping the order of the two 127.0.0.1 entries so that the node's own entry comes before the 127.0.0.1 localhost localhost.localdomain line. This did the trick! I vaguely recall from somewhere that I

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
In general, yeah, the scripts can access any resource they want (within the permissions of the user that the task runs as). It's also possible to access HDFS from scripts because HDFS provides a FUSE interface that can make it look like a regular file system on the machine. (The FUSE module in turn

Re: Race Condition?

2009-02-15 Thread S D
I'm having difficulty capturing the output of any of the dfs commands (either in Ruby or on the command line). Supposedly the output is being sent to stdout yet just running any of the commands on the command line does not display the output nor does redirecting to a file (e.g., hadoop dfs -copyTo

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
I don't know much about Hadoop streaming and have a quick question here. The snippets of code/programs that you attach to the MapReduce job might want to access outside resources (like you mentioned). Now these might not need to go to the namenode, right? For example a python script. How would it

Re: JvmMetrics

2009-02-15 Thread Brian Bockelman
Hey David -- In case no one has pointed you to this, you can submit this through JIRA. Brian On Feb 14, 2009, at 12:07 AM, David Alves wrote: Hi I ran into a use case where I need to keep two contexts for metrics. One being ganglia and the other being a file context (to do offline

Re: Race Condition?

2009-02-15 Thread Matei Zaharia
I would capture the output of the dfs -copyToLocal command, because I still think that is the most likely cause of the data not making it. I don't know how to capture this output in Ruby but I'm sure it's possible. You want to capture both standard out and standard error. One other slim possibility
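Matei's advice above is to capture the exit status, standard out, and standard error of the `hadoop dfs -copyToLocal` call rather than assume it succeeded. A minimal sketch of how to do that from a script, using Python's `subprocess` (Ruby's `Open3.capture3` offers the same three values); the `echo`/`exit` command here is a stand-in for the real hadoop invocation, which is not run:

```python
import subprocess

def run_and_capture(cmd):
    """Run a shell command and return (exit_code, stdout, stderr)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr

# Stand-in for: hadoop dfs -copyToLocal <s3dir> <localdir>
code, out, err = run_and_capture("echo copied; echo 'copy failed' >&2; exit 3")
if code != 0:
    # A non-zero exit code means the copy did not go through;
    # stderr usually says why.
    print("command failed:", err.strip())
```

Checking `code != 0` before calling anything like `processData(localdir)` would have caught the failure discussed in this thread.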

Re: HDFS on non-identical nodes

2009-02-15 Thread Brian Bockelman
On Feb 15, 2009, at 3:21 AM, Deepak wrote: Thanks Brian and Chen! I finally sorted out why the cluster is being stopped after running out of space. It's because of master failure due to disk space. Regarding the automatic balancer, I guess in our case, the rate of copying is faster than the balancer rate,

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Nope, typically the JobTracker just starts the process, and the tasktracker talks directly to the namenode to get a pointer to the datanode, and then directly to the datanode. On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana wrote: > Alright.. Got it. > > Now, do the task trackers talk to the n

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Alright.. Got it. Now, do the task trackers talk to the namenode and the datanode directly, or do they go through the job tracker for it? So, if my code is such that I need to access more files from HDFS, would the job tracker get involved or not? Amandeep Khurana Computer Science Graduate

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Normally, HDFS files are accessed through the namenode. If there was a malicious process though, then I imagine it could talk to a datanode directly and request a specific block. On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana wrote: > Ok. Got it. > > Now, when my job needs to access another f

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread zander1013
okay, i will heed the tip on the 127 address set. here is the result of ssh 192.168.0.2... a...@node0:~$ ssh 192.168.0.2 ssh: connect to host 192.168.0.2 port 22: Connection timed out a...@node0:~$ the boxes are just connected with a cat5 cable. i have not done this with the hadoop account but

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Ok. Got it. Now, when my job needs to access another file, does it go to the Namenode to get the block ids? How does the java process know where the files are and how to access them? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
I mentioned this case because even jobs written in Java can use the HDFS API to talk to the NameNode and access the filesystem. People often do this because their job needs to read a config file, some small data table, etc and use this information in its map or reduce functions. In this case, you o

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread Norbert Burger
Fwiw, the extra references to 127.0.1.1 in each hosts file aren't necessary. From node0, does 'ssh 192.168.0.2' work? If not, then the issue isn't name resolution -- take a look at the network configs (e.g., /etc/network/interfaces) on each machine. Norbert On Sun, Feb 15, 2009 at 7:31 PM, zander10
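Norbert's test (ssh by raw IP) separates name-resolution problems from network/firewall problems. A quick way to script the same check is a plain TCP connection attempt to port 22; if this fails by IP, no amount of /etc/hosts editing will help. A small sketch (the 192.168.0.2 address is from the thread; any host/port works):

```python
import socket

def can_reach(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False

# e.g. can_reach("192.168.0.2", 22) tells you whether sshd is reachable at all
```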

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Another question that I have here - When the jobs run arbitrary code and access data from the HDFS, do they go to the namenode to get the block information? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Assuming that the job is purely in Java and not involving streaming or pipes, wouldn't the resources (files) required by the job as inputs be known beforehand? So, if the map task is accessing a second file, how is that different, except that there are multiple files? The JobTracker would kno

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
This is good information! Thanks a ton. I'll take all this into account. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia wrote: > Typically the data flow is like this: 1) Client submits a job descr

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Typically the data flow is like this: 1) Client submits a job description to the JobTracker. 2) JobTracker figures out block locations for the input file(s) by talking to the HDFS NameNode. 3) JobTracker creates a job description file in HDFS which will be read by the nodes to copy over the job's code e
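Step 2 above is what makes MapReduce data-local: the JobTracker uses the NameNode's block metadata to place map tasks on nodes that already hold the input blocks. A toy model of that scheduling decision (all names and structures here are hypothetical illustrations, not Hadoop's actual classes):

```python
# "NameNode" metadata: block id -> datanodes holding a replica of that block.
block_locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
}

def assign_map_tasks(block_locations, live_nodes):
    """Toy JobTracker: prefer a data-local node for each input block;
    fall back to any live node when no replica host is alive."""
    assignments = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if n in live_nodes]
        assignments[block] = local[0] if local else sorted(live_nodes)[0]
    return assignments

print(assign_map_tasks(block_locations, {"node1", "node2", "node3"}))
```

With all three nodes alive, every block gets a data-local assignment; with replica hosts down, tasks still run, just with a remote read.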

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread zander1013
okay, i have changed /etc/hosts to look like this for node0... 127.0.0.1 localhost 127.0.1.1 node0 # /etc/hosts (for hadoop master and slave) 192.168.0.1 node0 192.168.0.2 node1 #end hadoop section # The following lines are desirable for IPv6 capable hosts ::1 ip6-local
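The hosts file above maps node0 to both 127.0.1.1 and 192.168.0.1, which is exactly the kind of ambiguity that confuses Hadoop daemon binding and the earlier threads on this list. A small sketch that parses hosts-file text and flags any hostname mapped to more than one address (the sample content mirrors the file above; the function is an illustration, not a Hadoop tool):

```python
def find_conflicts(hosts_text):
    """Return {hostname: [addresses]} for names mapped to >1 address."""
    seen = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            seen.setdefault(name, [])
            if addr not in seen[name]:
                seen[name].append(addr)
    return {n: a for n, a in seen.items() if len(a) > 1}

hosts = """\
127.0.0.1   localhost
127.0.1.1   node0
# hadoop section
192.168.0.1 node0
192.168.0.2 node1
"""
print(find_conflicts(hosts))  # node0 appears under two addresses
```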

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread Norbert Burger
> i have commented out the 192. addresses and changed 127.0.1.1 for node0 and 127.0.1.2 for node0 (in /etc/hosts). with this done i can ssh from one machine to itself and to the other but the prompt does not change when i ssh to the other machine. i don't know if there is a firewall preve

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
A quick question here. How does a typical hadoop job work at the system level? What are the various interactions and how does the data flow? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana wrote

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Thanks Matei. If the basic architecture is similar to the Google stuff, I can safely just work on the project using the information from the papers. I am aware of the 4487 jira and the current status of the permissions mechanism. I had a look at them earlier. Cheers Amandeep Amandeep Khurana Co

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread zander1013
hi, sshd is running on both machines. i am using the default ubuntu 8.10 workstation install with openssh-server installed via "apt-get install". i have tried with the machines connected through both a switch and just plugging the ethernet cable from one into the other. right now i have just one

Re: Race Condition?

2009-02-15 Thread S D
I was not able to determine the command shell return value for hadoop dfs -copyToLocal #{s3dir} #{localdir} but I did print out several variables after the call and determined that the call apparently did not go through successfully. In particular, prior to my processData(localdir) command I

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Forgot to add, this JIRA details the latest security features that are being worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487. This document describes the current status and limitations of the permissions mechanism: http://hadoop.apache.org/core/docs/current/hdfs_permiss

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
I think it's safe to assume that Hadoop works like MapReduce/GFS at the level described in those papers. In particular, in HDFS, there is a master node containing metadata and a number of slave nodes (datanodes) containing blocks, as in GFS. Clients start by talking to the master to list directorie
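The read path Matei describes (metadata from the master, data directly from datanodes) can be sketched as a toy model: the client asks the "namenode" which datanodes hold each block of a file, then streams the blocks from those datanodes without the master in the data path. All class and block names here are hypothetical, purely to illustrate the protocol shape:

```python
class NameNode:
    """Toy master: holds only metadata, never file data."""
    def __init__(self):
        # file path -> ordered list of (block_id, [replica datanodes])
        self.metadata = {"/logs/a.txt": [("blk_1", ["dn1", "dn2"]),
                                         ("blk_2", ["dn2", "dn3"])]}
    def get_block_locations(self, path):
        return self.metadata[path]

class DataNode:
    """Toy slave: holds block contents."""
    def __init__(self, blocks):
        self.blocks = blocks              # block_id -> bytes
    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, path):
    """Client read: resolve blocks via the master, fetch data directly."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        data += datanodes[replicas[0]].read_block(block_id)
    return data

datanodes = {"dn1": DataNode({"blk_1": b"hello "}),
             "dn2": DataNode({"blk_1": b"hello ", "blk_2": b"world"}),
             "dn3": DataNode({"blk_2": b"world"})}
print(read_file(NameNode(), datanodes, "/logs/a.txt"))
```

This also shows why the "malicious process" point earlier in the thread matters: anything that already knows a block id could, in principle, skip the namenode step and go straight to a datanode.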

Re: datanode not being started

2009-02-15 Thread Sandy
just some more information: hadoop fsck produces: Status: HEALTHY Total size: 0 B Total dirs: 9 Total files: 0 (Files currently being written: 1) Total blocks (validated): 0 Minimally replicated blocks: 0 Over-replicated blocks: 0 Under-replicated blocks: 0 Mis-replicated blocks: 0 Default

Re: setting up networking and ssh on multnode cluster...

2009-02-15 Thread james warren
Hi Zander - Two simple explanations come to mind: * Is sshd running on your boxes? * If so, do you have a firewall preventing ssh access? cheers, -jw On Sat, Feb 14, 2009 at 7:50 PM, zander1013 wrote: > > hi, > > am going through the tutorial on multinode cluster setup by m. noll... > > htt

Re: HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Thanks Matei. I had gone through the architecture document online. I am currently working on a project towards security in Hadoop. I do know how the data moves around in GFS but wasn't sure how much of that HDFS follows and how different it is from GFS. Can you throw some light on that? Sec

Re: datanode not being started

2009-02-15 Thread Sandy
Thanks for your responses. I checked in the namenode and jobtracker logs and both say: INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call delete(/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system, true) from 127.0.0.1:61086: error: org.apache.hadoop.dfs.SafeModeException:

Re: question about hadoop and amazon ec2 ?

2009-02-15 Thread nitesh bhatia
1. They are related in that one can use EC2 to serve the computation part for Hadoop. Refer: http://wiki.apache.org/hadoop/AmazonEC2 2. Yes. Refer: http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) 3. You can use EC2 to serve the computation part for Hadoop. --nites

Re: HDFS architecture based on GFS?

2009-02-15 Thread Matei Zaharia
Hi Amandeep, Hadoop is definitely inspired by MapReduce/GFS and aims to provide those capabilities as an open-source project. HDFS is similar to GFS (large blocks, replication, etc); some notable things missing are read-write support in the middle of a file (unlikely to be provided because few Hado

HDFS architecture based on GFS?

2009-02-15 Thread Amandeep Khurana
Hi Is the HDFS architecture completely based on the Google File System? If it isn't, what are the differences between the two? Secondly, is the coupling between Hadoop and HDFS the same as between Google's version of MapReduce and GFS? Amandeep Amandeep Khurana Computer Science Gradua

Re: can't edit the file that mounted by fuse_dfs by editor

2009-02-15 Thread S D
I followed these instructions http://wiki.apache.org/hadoop/MountableHDFS and was able to get things working with 0.19.0 on Fedora. The only problem I ran into was the AMD64 issue on one of my boxes (see the note on the above link); I edited the Makefile and set OSARCH as suggested but couldn't g

Re: Hostnames on MapReduce Web UI

2009-02-15 Thread Nick Cen
Try commenting out the localhost definition in your /etc/hosts file. 2009/2/14 S D > I'm reviewing the task trackers on the web interface ( > http://jobtracker-hostname:50030/) for my cluster of 3 machines. The names > of the task trackers do not list real domain names; e.g., one of the task > track

Some Storage communication related questions

2009-02-15 Thread Wasim Bari
Hi, I have multiple questions: Does Hadoop use some parallel technique for CopyFromLocal and CopyToLocal (like DistCp), or is it simple one-stream writing? For Amazon S3 to local system communication, does Hadoop use a REST service interface or SOAP? Are there some new storage systems currently

Re: HDFS on non-identical nodes

2009-02-15 Thread Deepak
Thanks Brian and Chen! I finally sorted out why the cluster is being stopped after running out of space. It's because of master failure due to disk space. Regarding the automatic balancer, I guess in our case, the rate of copying is faster than the balancer rate; we found the balancer does start but couldn't perform

question about hadoop and amazon ec2 ?

2009-02-15 Thread buddha1021
hi: What is the relationship between Hadoop and Amazon EC2? Can Hadoop run directly on a common PC (not a server)? Why do some people say to run Hadoop on Amazon EC2? thanks! -- View this message in context: http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p2202