Re: walkthrough of developing first hadoop app from scratch

2008-03-21 Thread Arun C Murthy
On Mar 21, 2008, at 6:35 PM, Stephen J. Barr wrote: Hello, I am working on developing my first hadoop app from scratch. It is a Monte-Carlo simulation, and I am using the PiEstimator code from the examples as a reference. I believe I have what I want in a .java file. However, I couldn't

Re: walkthrough of developing first hadoop app from scratch

2008-03-21 Thread Stephen J. Barr
Thank you. That worked (well, it pointed out all the bugs in my code, which is a good start). 朱盛凯 wrote: Hi Stephen, You can look at the word count example, which shows how to create a jar archive of your application code. $ mkdir wordcount_classes $ javac -classpath ${HADOOP_HOME}/hadoop-${HADO

Re: walkthrough of developing first hadoop app from scratch

2008-03-21 Thread 朱盛凯
Hi Stephen, You can look at the word count example, which shows how to create a jar archive of your application code. $ mkdir wordcount_classes $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java $ jar -cvf /usr/joe/wordcount.jar -C wordcount_classe

Re: walkthrough of developing first hadoop app from scratch

2008-03-21 Thread Paco NATHAN
Hi Stephen, Here's a sample Hadoop app which has its build based on Ant: http://code.google.com/p/ceteri-mapred/ Look in the "jyte" directory. A target called "prep.jar" simply uses the jar task in Ant to build a JAR for Hadoop to use. Yeah, I agree that docs and discussions seem to lean more t

walkthrough of developing first hadoop app from scratch

2008-03-21 Thread Stephen J. Barr
Hello, I am working on developing my first hadoop app from scratch. It is a Monte-Carlo simulation, and I am using the PiEstimator code from the examples as a reference. I believe I have what I want in a .java file. However, I couldn't find any documentation on how to make that .java file int
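
The step the message cuts off at — turning the .java file into something Hadoop can run — usually comes down to a small driver class plus the jar packaging shown in the wordcount reply above. Below is a minimal, hedged sketch of such a driver against the 0.16-era org.apache.hadoop.mapred API; the class name is hypothetical, and the identity mapper/reducer are placeholders for the actual simulation code.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class MonteCarloDriver {
      public static void main(String[] args) throws Exception {
        // Passing the driver class tells Hadoop which jar to ship to the nodes.
        JobConf conf = new JobConf(MonteCarloDriver.class);
        conf.setJobName("monte-carlo");

        // Placeholders: swap in the simulation's own mapper/reducer classes.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // 0.16-era path setters (deprecated in later releases).
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));

        JobClient.runJob(conf); // submit and block until the job finishes
      }
    }

Once packaged into a jar as shown above, the job is launched through the bin/hadoop jar command.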

RE: Performance / cluster scaling question

2008-03-21 Thread dhruba Borthakur
The namenode lazily instructs a Datanode to delete blocks. As a response to every heartbeat from a Datanode, the Namenode instructs it to delete a maximum of 100 blocks. Typically, the heartbeat periodicity is 3 seconds. The heartbeat thread in the Datanode deletes the block files synchronously
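
Those two figures give a rough upper bound on how fast space can be reclaimed. A back-of-envelope sketch, assuming the 8 datanodes and ~1.27M blocks mentioned elsewhere in this thread:

    public class ReclaimEstimate {
      public static void main(String[] args) {
        int datanodes = 8;              // from André's cluster description
        int blocksPerHeartbeat = 100;   // max deletions per heartbeat reply
        double heartbeatSeconds = 3.0;  // typical heartbeat period
        long totalBlocks = 1271289L;    // ~3 * 423763, per this thread

        double blocksPerSecond = datanodes * blocksPerHeartbeat / heartbeatSeconds;
        double minutes = totalBlocks / blocksPerSecond / 60.0;
        // Prints roughly 267 blocks/s and ~80 minutes to drain all blocks.
        System.out.printf("%.0f blocks/s -> about %.0f minutes%n",
            blocksPerSecond, minutes);
      }
    }

That order of magnitude is consistent with the hours-scale lag reported in this thread.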

Re: Performance / cluster scaling question

2008-03-21 Thread André Martin
After waiting a few hours (without having any load), the block number and "DFS Used" space seem to go down... My question is: is the hardware simply too weak/slow to send the block deletion request to the datanodes in a timely manner, or do those "crappy" HDDs simply cause the delay, since I no

Re: Performance / cluster scaling question

2008-03-21 Thread Ted Dunning
The delay may be in reporting the deleted blocks as free on the web interface as much as in actually marking them as deleted. On 3/21/08 2:48 PM, "André Martin" <[EMAIL PROTECTED]> wrote: > Right, I totally forgot about the replication factor... However > sometimes I even noticed ratios of 5:1

RE: Performance / cluster scaling question

2008-03-21 Thread Jeff Eastman
I wouldn't call it a design feature so much as a consequence of background processing in the NameNode to clean up recently closed files and reclaim their blocks. Jeff > -Original Message- > From: André Martin [mailto:[EMAIL PROTECTED] > Sent: Friday, March 21, 2008 2:48 PM > To: core-

Re: Performance / cluster scaling question

2008-03-21 Thread André Martin
Right, I totally forgot about the replication factor... However, sometimes I even noticed ratios of 5:1 for block numbers to files... Is the delay for block deletion/reclaiming an intended behavior? Jeff Eastman wrote: That makes the math come out a lot closer (3*423763=1271289). I've also notic

RE: Performance / cluster scaling question

2008-03-21 Thread Jeff Eastman
That makes the math come out a lot closer (3*423763=1271289). I've also noticed there is some delay in reclaiming unused blocks, so what you are seeing in terms of block allocations does not surprise me. > -Original Message- > From: André Martin [mailto:[EMAIL PROTECTED] > Sent: Friday, March

Re: Performance / cluster scaling question

2008-03-21 Thread André Martin
3 - the default one... Jeff Eastman wrote: What's your replication factor? Jeff -Original Message- From: André Martin [mailto:[EMAIL PROTECTED] Sent: Friday, March 21, 2008 2:25 PM To: core-user@hadoop.apache.org Subject: Performance / cluster scaling question Hi everyone, I ran a
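
For reference, that default comes from the dfs.replication property. A small hedged sketch of reading it, and of lowering the replication of an existing file through the FileSystem API (the path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The cluster-wide default; 3 unless overridden in the site config.
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));

        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed per file (hypothetical path):
        fs.setReplication(new Path("/user/andre/crawl/data-0001"), (short) 2);
      }
    }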

RE: Performance / cluster scaling question

2008-03-21 Thread Jeff Eastman
What's your replication factor? Jeff > -Original Message- > From: André Martin [mailto:[EMAIL PROTECTED] > Sent: Friday, March 21, 2008 2:25 PM > To: core-user@hadoop.apache.org > Subject: Performance / cluster scaling question > > Hi everyone, > I ran a distributed system that consists

Re: Performance / cluster scaling question

2008-03-21 Thread André Martin
The attached image can be found here: http://www.andremartin.de/Performance-degradation.png

Performance / cluster scaling question

2008-03-21 Thread André Martin
Hi everyone, I ran a distributed system consisting of 50 spiders/crawlers and 8 server nodes, with a Hadoop DFS cluster of 8 datanodes and a namenode... Each spider has 5 job processing / data crawling threads and puts crawled data as one complete file onto the DFS - additionally there are
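
As a rough sketch of the write pattern described here — each crawler thread storing one crawled item as a complete file in DFS — the following is one hedged way to do it with the FileSystem API; the path and payload are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SpiderUpload {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/crawl/incoming/page-0001.dat");
        FSDataOutputStream stream = fs.create(out);
        try {
          stream.write("crawled content...".getBytes("UTF-8"));
        } finally {
          stream.close(); // the file becomes visible once closed
        }
        // Small files occupy at least one block each, so file count times
        // replication factor roughly tracks the block counts discussed above.
      }
    }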

RE: Master as DataNode

2008-03-21 Thread Jeff Eastman
I don't know the deep answer, but formatting your dfs creates a new namespaceId that needs to be consistent across all slaves. Any data directories containing old version ids will prevent the DataNode from starting on that node. Maybe somebody who really knows the machinery can elaborate on this.
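
The namespaceID Jeff mentions lives in a plain Java properties file named VERSION under each storage directory. A hedged sketch of comparing the two; the paths follow the /var/tmp/hadoop-datastore layout from Colin's log below and must be adjusted to the local dfs.name.dir / dfs.data.dir settings:

    import java.io.FileInputStream;
    import java.util.Properties;

    public class NamespaceIdCheck {
      static String namespaceId(String versionFile) throws Exception {
        Properties p = new Properties();
        FileInputStream in = new FileInputStream(versionFile);
        try {
          p.load(in); // VERSION files are key=value properties
        } finally {
          in.close();
        }
        return p.getProperty("namespaceID");
      }

      public static void main(String[] args) throws Exception {
        String nn = namespaceId("/var/tmp/hadoop-datastore/hadoop/dfs/name/current/VERSION");
        String dn = namespaceId("/var/tmp/hadoop-datastore/hadoop/dfs/data/current/VERSION");
        System.out.println("namenode=" + nn + " datanode=" + dn
            + (nn.equals(dn) ? " (consistent)"
                             : " (INCOMPATIBLE - datanode will not start)"));
      }
    }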

Re: Master as DataNode

2008-03-21 Thread Colin Freas
yup, got it working with that technique. pushed it out to 5 machines, things look good. appreciate the help. what is it that causes this? i know i formatted the dfs more than once. is that what does it? or just adding nodes, or... ? -colin On Fri, Mar 21, 2008 at 2:30 PM, Jeff Eastman <[E

RE: Master as DataNode

2008-03-21 Thread Jeff Eastman
I encountered this when I was starting out too, while moving from a single-node cluster to more nodes. I suggest clearing your hadoop-datastore directory, reformatting the HDFS and restarting again. You are very close :) Jeff > -Original Message- > From: Colin Freas [mailto:[EMAIL PROTECT

Re: Master as DataNode

2008-03-21 Thread Colin Freas
ah: 2008-03-21 14:06:05,526 ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /var/tmp/hadoop-datastore/hadoop/dfs/data: namenode namespaceID = 2121666262; datanode namespaceID = 2058961420 looks like i'm hitting this "Incompatible namespaceID" bug: http://i

Re: [core-user][reduce seems to run only on one machine]

2008-03-21 Thread Jean-Pierre OCALAN
Thank you guys for all those good answers, I appreciate that. Jean-Pierre. On Mar 21, 2008, at 12:47 PM, Ted Dunning wrote: The default number of reducers is 4. It is unlikely that a user who doesn't know how to set the number of reducers has changed that value. This phenomenon of a

RE: Master as DataNode

2008-03-21 Thread Jeff Eastman
Check your logs. That should work out of the box with the configuration steps you described. Jeff > -Original Message- > From: Colin Freas [mailto:[EMAIL PROTECTED] > Sent: Friday, March 21, 2008 10:40 AM > To: core-user@hadoop.apache.org > Subject: Master as DataNode > > setting up a s

Master as DataNode

2008-03-21 Thread Colin Freas
setting up a simple hadoop cluster with two machines, i've gotten to the point where the two machines can see each other, things seem fine, but i'm trying to set up the master as both a master and a slave, just for testing purposes. so, i've put the master into the conf/masters file and the conf/s

Re: NFS mounted home, host RSA keys, localhost, strict sshds and bad mojo.

2008-03-21 Thread Colin Freas
ah, yes. that worked. thanks! On Fri, Mar 21, 2008 at 12:48 PM, Natarajan, Senthil <[EMAIL PROTECTED]> wrote: > I guess the following files might have a localhost entry; change it to the hostname > > /conf/masters > /conf/slaves > > > -Original Message- > From: Colin Freas [mailto:[EMAIL PROTECTE

RE: NFS mounted home, host RSA keys, localhost, strict sshds and bad mojo.

2008-03-21 Thread Natarajan, Senthil
I guess the following files might have a localhost entry; change it to the hostname: /conf/masters /conf/slaves -Original Message- From: Colin Freas [mailto:[EMAIL PROTECTED] Sent: Friday, March 21, 2008 12:25 PM To: core-user@hadoop.apache.org Subject: NFS mounted home, host RSA keys, localhost, s

Re: [core-user][reduce seems to run only on one machine]

2008-03-21 Thread Ted Dunning
The default number of reducers is 4. It is unlikely that a user who doesn't know how to set the number of reducers has changed that value. This phenomenon of apparently having only a single reducer often happens if you have a very skewed distribution of keys for the reduce phase. Imagine
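
The mechanics behind that skew: the default partitioner assigns every record with the same key to the same reducer. A minimal sketch mirroring Hadoop's hash-partitioning rule (the sample keys are hypothetical):

    import org.apache.hadoop.io.Text;

    public class PartitionDemo {
      // This is essentially what Hadoop's HashPartitioner does.
      static int partitionFor(Text key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }

      public static void main(String[] args) {
        int reducers = 6;
        for (String k : new String[] {"GET", "POST", "PUT"}) {
          System.out.println(k + " -> reducer " + partitionFor(new Text(k), reducers));
        }
        // With a skewed log where most records share one hot key, the reducer
        // that owns that key does nearly all the work, even with 6 reducers.
      }
    }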

Re: Input file globbing

2008-03-21 Thread Tom White
Thanks Hairong, I've just created https://issues.apache.org/jira/browse/HADOOP-3064 for this. Tom On 20/03/2008, Hairong Kuang <[EMAIL PROTECTED]> wrote: > Yes, this is a bug. This only occurs when a job's input path contains the > closures. JobConf.getInputPaths interprets mr/input/glob/2008/
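
For context, the failure mode behind HADOOP-3064 is that the input-path string is split on commas, which tears apart a curly-brace closure. A hedged illustration (the exact glob is hypothetical, since the snippet above is truncated):

    public class GlobSplitDemo {
      public static void main(String[] args) {
        String input = "mr/input/glob/2008/{01,02}";
        // Comma-splitting, as the era's JobConf.getInputPaths effectively did:
        for (String p : input.split(",")) {
          System.out.println("interpreted path: " + p);
        }
        // Prints "mr/input/glob/2008/{01" and "02}" - neither is a valid glob.
      }
    }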

Re: Hadoop For Image Analysis/Vectorization

2008-03-21 Thread Ted Dunning
On 3/21/08 8:29 AM, "Dan Tamowski" <[EMAIL PROTECTED]> wrote: > -Does Hadoop/MR offer a clean abstraction for both consuming and producing a > large number of files? (I know it can handily consume a large number of > files, but all examples of output seem to form a single file) Yes. It works v
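
One concrete way to get many output files — not necessarily what Ted had in mind, and only available in Hadoop releases that ship org.apache.hadoop.mapred.lib.MultipleTextOutputFormat — is to derive an output file name per key:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class PerImageOutputFormat extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Write each image's vectorized output to its own file named after the key.
        return key.toString();
      }
    }

Alternatively, each reducer already writes its own part-NNNNN file, so simply raising the number of reducers multiplies the output files.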

NFS mounted home, host RSA keys, localhost, strict sshds and bad mojo.

2008-03-21 Thread Colin Freas
i'm working to set up a cluster across several machines where users' home dirs are on an nfs mount. i setup key authentication for the hadoop user, install all the software on one node, get everything running, and move on to another node. once there, however, my sshd complains because the host ke

Hadoop For Image Analysis/Vectorization

2008-03-21 Thread Dan Tamowski
Hello, Forgive me if I am missing something in the documentation, but nothing is jumping out at me. I am exploring the use of Hadoop for image analysis and/or image vectorization and have a few questions. I anticipate that there will be a large collection of image files as input with an equal num

Re: [core-user][reduce seems to run only on one machine]

2008-03-21 Thread Amar Kamat
On Fri, 21 Mar 2008, Jean-Pierre OCALAN wrote: > Hi, > > I'm currently working on a project that involves massive log parsing. I have > one master and 6 slaves. > By looking at each slave's logs I've noticed that the REDUCE operation runs > on just one machine. > So does that mean that reduce just runs

[core-user][reduce seems to run only on one machine]

2008-03-21 Thread Jean-Pierre OCALAN
Hi, I'm currently working on a project that involves massive log parsing. I have one master and 6 slaves. By looking at each slave's logs I've noticed that the REDUCE operation runs on just one machine. So does that mean that reduce just runs on one machine? And if that is true, how can I specif
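
The question is cut off, but the usual answer (implied by Ted's reply above) is to set the number of reduce tasks on the job. A hedged sketch with the 0.16-era API; the helper class is hypothetical:

    import org.apache.hadoop.mapred.JobConf;

    public class LogParseConfig {
      public static void configure(JobConf conf) {
        // A reduce task (or a small multiple) per slave is a common rule of
        // thumb; with 6 slaves, 6 reducers keeps every machine busy.
        conf.setNumReduceTasks(6);
        // Equivalent config property: mapred.reduce.tasks
      }
    }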

Re: MapFile and MapFileOutputFormat

2008-03-21 Thread Rong-en Fan
On Fri, Mar 21, 2008 at 12:42 AM, Doug Cutting <[EMAIL PROTECTED]> wrote: > Rong-en Fan wrote: > > I have two questions regarding the mapfile in hadoop/hdfs. First, when > using > > MapFileOutputFormat as reducer's output, is there any way to change > > the index interval (i.e., able to call se
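
The snippet is truncated, but MapFile.Writer does expose setIndexInterval, and the interval is also governed by the io.map.index.interval property. A hedged sketch of writing a MapFile directly (the path is hypothetical; keys must be appended in sorted order):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileIndexDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: when a job writes through MapFileOutputFormat, setting
        // this property on the JobConf is one way to influence the interval.
        conf.setInt("io.map.index.interval", 32);

        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, "/tmp/demo.map", Text.class, Text.class);
        writer.setIndexInterval(32); // index every 32nd key (default is 128)
        for (int i = 0; i < 1000; i++) {
          // Zero-padded keys preserve the required sorted append order.
          writer.append(new Text(String.format("key-%04d", i)),
                        new Text("value-" + i));
        }
        writer.close();
      }
    }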