Re: Parallell maps

2009-07-03 Thread Ted Dunning
On Fri, Jul 3, 2009 at 4:36 PM, Marcus Herou wrote: > I understand what you are saying but the theory does not really get into my head... You mean that the latency in the CPU + Disk-IO is something like 1 times less (or perhaps more) than the latency between calling a remote system via s

Re: Parallell maps

2009-07-03 Thread Marcus Herou
Anyway why would it slow things down if it converges let's say 100 times faster (in terms of iterations) and you are able to have a memcached or whatever shared system (Voldemort) which is equal to the number of MR hosts i.e. a memcached server on each one of them ? I understand what you are sayin

Re: Parallell maps

2009-07-03 Thread Marcus Herou
When speaking in terms of Hadoop that is I guess...? But normally running in a single JVM then this is the case right ? /M On Sat, Jul 4, 2009 at 1:17 AM, Ted Dunning wrote: > No. It should not want that. > > On Fri, Jul 3, 2009 at 2:13 PM, Marcus Herou >wrote: > > > Should not N2 be wanting

Re: Parallell maps

2009-07-03 Thread Ted Dunning
No. It should not want that. On Fri, Jul 3, 2009 at 2:13 PM, Marcus Herou wrote: > Should not N2 be wanting to be aware of the freshest possible state of N1 ? >

Re: Parallell maps

2009-07-03 Thread Ted Dunning
That doesn't actually speed things up. Generally, in fact, it slows things down. This is a case of sequential update. Batch update converges more slowly in terms of the total number of operations, but because of the economies available in map-reduce programs (due to sequential reading, merge sor
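A minimal sketch of the batch-versus-sequential distinction drawn above, assuming a hypothetical four-page graph with no dangling pages and a damping factor of 0.85 (none of these values come from the thread). Batch mode reads only the ranks from the previous pass, which is what a map-reduce job sees; sequential mode reuses freshly updated ranks within the same pass. Both modes settle on the same ranks, and sequential update typically needs fewer passes, but the pass count says nothing about the wall-clock cost of each pass on a cluster.

```python
def pagerank_iterations(links, damping=0.85, tol=1e-8, sequential=False, max_iter=1000):
    """Return (ranks, passes) for a small link graph.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outlink (no dangling pages).
    """
    pages = sorted(links)
    n = len(pages)
    inbound = {p: [q for q in pages if p in links[q]] for p in pages}
    ranks = {p: 1.0 / n for p in pages}

    for passes in range(1, max_iter + 1):
        # Batch: read from a frozen snapshot of the previous pass.
        # Sequential: read from the dict that is being updated in place.
        source = ranks if sequential else dict(ranks)
        delta = 0.0
        for p in pages:
            value = (1 - damping) / n + damping * sum(
                source[q] / len(links[q]) for q in inbound[p]
            )
            delta = max(delta, abs(value - ranks[p]))
            ranks[p] = value
        if delta < tol:
            return ranks, passes
    return ranks, max_iter


# Hypothetical example graph, chosen only for illustration.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

batch_ranks, batch_passes = pagerank_iterations(GRAPH, sequential=False)
seq_ranks, seq_passes = pagerank_iterations(GRAPH, sequential=True)
print("batch passes:", batch_passes, "sequential passes:", seq_passes)
```

The argument in the reply is about cost per pass: a batch pass maps onto sequential reads and merge sorts, whereas a sequential pass needs the freshest value of every neighbour on demand.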

Re: Parallell maps

2009-07-03 Thread Marcus Herou
Hi. I think I am confusing you guys with talking about various things at the same time. I am mostly (99.9%) after Sequential throughput but sometimes I need massively fast Random Access. And I would never ever pay for the BIG kahoona of machine(s) that would be needed to be able to give me great R

Re: grahical tool for hadoop mapreduce

2009-07-03 Thread Shevek
On Fri, 2009-06-26 at 10:55 -0500, Mark Kerzner wrote: > Tom, this is so much right on time! Bravo, Karmasphere. > I installed the plugins, and nothing crashed - in fact, I get the same > screens as the manual promises. > > It is worth reading this group - they released the plugin two days ago. A

Re: HDFS and long-running processes

2009-07-03 Thread Todd Lipcon
Hi David, I'm unaware of any issue that would cause memory leaks when a file is open for read for a long time. There are some issues currently with write pipeline recovery when a file is open for writing for a long time and the datanodes to which it's writing fail. So, I would not recommend havin

Re: Parallell maps

2009-07-03 Thread Ted Dunning
Do you want random access for web presentation? What is your required update time? What about search index delay? Or batch sequential access for large scale computation like PageRank? These are very different answers. The first is likely to be a standard sharded profile database with associate

Re: Parallell maps

2009-07-03 Thread Ted Dunning
Not my baby. I designed it out of my system at about the same time you did. With 0.20, however, we are re-evaluating it. I still think you are thinking about random access which is a mistake for batch computations like PageRank. On Fri, Jul 3, 2009 at 12:28 AM, Marcus Herou wrote: > Ted: > Don

Re: Parallell maps

2009-07-03 Thread Ted Dunning
For computing PageRank, however, I bet that memcache would actually slow you down by forcing you to have a smaller cluster. For a batch program, latency is not the issue; aggregate throughput is. If you have a 50 node MR cluster, you should be able to very easily sustain a few GB/s in reading you
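A back-of-the-envelope sketch of the aggregate-throughput point, assuming each node sustains about 60 MB/s of sequential disk reads and that the data set is 1 TB. Both figures are illustrative assumptions, not numbers from the thread; only the 50-node cluster size is taken from the message.

```python
def aggregate_read_gb_per_s(nodes, mb_per_s_per_node):
    """Aggregate sequential read rate of the cluster in GB/s (1 GB = 1000 MB)."""
    return nodes * mb_per_s_per_node / 1000.0


def scan_minutes(dataset_gb, nodes, mb_per_s_per_node):
    """Minutes needed for one full sequential pass over the data set."""
    seconds = dataset_gb / aggregate_read_gb_per_s(nodes, mb_per_s_per_node)
    return seconds / 60.0


# Assumed figures: 60 MB/s per node and a 1000 GB data set.
rate = aggregate_read_gb_per_s(nodes=50, mb_per_s_per_node=60)
print(f"aggregate read rate: {rate:.1f} GB/s")
print(f"one full pass: {scan_minutes(1000, 50, 60):.1f} minutes")
```

Under these assumptions the rate scales linearly with node count, which is why shrinking the cluster to make room for a cache can cost more than the cache saves.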

Re: starting a tasktracker on a specific node in the cluster

2009-07-03 Thread Iman E
Hi Uri, The script start-mapred.sh has two commands: one starts the jobtracker and the other starts the tasktrackers listed in the slaves file. I made a copy of start-mapred.sh and removed the line that starts the jobtracker. I changed the slaves file according to w

HDFS and long-running processes

2009-07-03 Thread David B. Ritch
I have been told that it is not a good idea to keep HDFS files open for a long time. The reason sounded like a memory leak in the name node - that over time, the resources absorbed by an open file will increase. Is this still an issue with Hadoop 0.19.x and 0.20.x? Was it ever an issue? I have

Re: Parallell maps

2009-07-03 Thread Ted Dunning
I don't understand this statement. Basic PageRank in map-reduce is normally a simple undergraduate class assignment: http://www.ics.uci.edu/~abehm/class.../uci/.../Behm-Shah_PageRank.ppt http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/pagerank.html What is it about your problem that m
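For readers who have not seen the class-assignment version, here is a minimal in-memory sketch of one PageRank iteration written as a map step and a reduce step. It is not taken from the linked slides or exercise. The function names, the example graph, and the damping factor of 0.85 are assumptions for illustration, and the sketch assumes every page has at least one outlink. A plain dictionary stands in for the shuffle phase.

```python
from collections import defaultdict


def map_page(page, rank, outlinks):
    """Map step: pass the link list through and give each outlink an equal share of rank."""
    yield page, ("links", outlinks)
    for target in outlinks:
        yield target, ("share", rank / len(outlinks))


def reduce_page(page, values, n_pages, damping=0.85):
    """Reduce step: sum the incoming shares and apply the damping factor."""
    outlinks, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            outlinks = payload
        else:
            total += payload
    new_rank = (1 - damping) / n_pages + damping * total
    return page, new_rank, outlinks


def pagerank_step(graph, ranks, damping=0.85):
    """Run one map, shuffle, reduce round and return the new ranks."""
    grouped = defaultdict(list)  # stands in for the shuffle/sort phase
    for page, outlinks in graph.items():
        for key, value in map_page(page, ranks[page], outlinks):
            grouped[key].append(value)
    return {
        page: reduce_page(page, values, len(graph), damping)[1]
        for page, values in grouped.items()
    }


# Hypothetical example graph, chosen only for illustration.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = {page: 1.0 / len(graph) for page in graph}
ranks = pagerank_step(graph, ranks)
print(ranks)
```

Each call to pagerank_step corresponds to one map-reduce job; a driver would repeat it until the ranks stop changing.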

Announcing: MRUnit - A tool for debugging and unit testing MapReduce programs

2009-07-03 Thread Christophe Bisciglia
Hadoop Fans, I just wanted to drop the community a quick note about a new tool we just released called MRUnit: http://bit.ly/J0AjZ MRUnit helps bridge the gap between MapReduce programs and JUnit by providing a set of interfaces and test harnesses, which allow MapReduce programs to be more easily

Re: Wich is better Namenode and JobTracker run in different server or not?

2009-07-03 Thread Alex Loddengaard
It's unnecessary to run the NN and JT daemons on separate machines in small clusters with more than three nodes. You'll only have performance benefits by putting these daemons on separate machines if you have a large (100s of nodes) cluster. It makes sense to separate the NN and JT daemons in a t

Re: My secondary namenode seem not be running, and may be the reason of my problem!!!

2009-07-03 Thread Alex Loddengaard
Hi, It's unclear exactly what the problem is, so you should try and follow the getting started guide more closely: < http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) > You should get a single-node cluster working before you try and get a multi-node cluster. Goo

Re: Parallell maps

2009-07-03 Thread Steve Loughran
Marcus Herou wrote: Hi. Comments inline Cheers //Marcus On Fri, Jul 3, 2009 at 4:48 PM, Steve Loughran wrote: Marcus Herou wrote: Hi. This is my company so I reveal what I like, even though the board would shoot me but hey do you think they are scanning this mailing list ? :) The PR algo

Re: Parallell maps

2009-07-03 Thread Marcus Herou
Hi. Comments inline Cheers //Marcus On Fri, Jul 3, 2009 at 4:48 PM, Steve Loughran wrote: > Marcus Herou wrote: > >> Hi. >> >> This is my company so I reveal what I like, even though the board would >> shoot me but hey do you think they are scanning this mailing list ? :) >> >> The PR algo is ve

Re: Parallell maps

2009-07-03 Thread Steve Loughran
Marcus Herou wrote: Hi. This is my company so I reveal what I like, even though the board would shoot me but hey do you think they are scanning this mailing list ? :) The PR algo is very simple (but clever) and can be found on wikipedia: http://en.wikipedia.org/wiki/PageRank What is painful is t

Re: Parallell maps

2009-07-03 Thread Steve Loughran
Mark Kerzner wrote: That's awesome information, Marcus. I am working on a project which would require a similar architectural solution (although unlike you I can't broadcast the details), so that was very useful. One thing I can say though is that mine is in no way a competitor, being in a differ

Re: Running Hadoop without bin/hadoop

2009-07-03 Thread Steve Loughran
Michael Basnight wrote: I have a Java app that runs in Tomcat and now needs to talk to my Hadoop infrastructure. Typically, all the testing I've done / examples show starting something that uses Hadoop via the 'bin/hadoop -jar' cmd, but as you can imagine this is no good for an existing Tomcat ap

Wich is better Namenode and JobTracker run in different server or not?

2009-07-03 Thread calikus
Hi, I wonder which is better: running the NameNode and JobTracker on different servers, or on the same one?

My secondary namenode seem not be running, and may be the reason of my problem!!!

2009-07-03 Thread C J
Hello everyone, I have installed Hadoop 0.18.3 on three Linux machines and am trying to run the WordCount v1.0 example on a cluster, but I guess I have a problem somewhere. *Problem* *After formatting the name node:* I am getting several STARTUP_MSG and at the end a "SHUTDOWN_MSG: shutting do

problem in sending you an email

2009-07-03 Thread C J
Hello, every time I try sending you an email explaining my problem in Hadoop, the email does not reach you and I get the following error: *Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other emai

Re: Parallell maps

2009-07-03 Thread Marcus Herou
Man, scanned through the slides, looks very promising. Great work ! //Marcus On Fri, Jul 3, 2009 at 9:28 AM, Marcus Herou wrote: > Hi. > > This is my company so I reveal what I like, even though the board would > shoot me but hey do you think they are scanning this mailing list ? :) > > The PR a

Re: Using addCacheArchive

2009-07-03 Thread Uri Shani
I followed this thread and am happy it finally worked for you. Can you summarise for our benefit what the final working alteration of the code in your initial thread note was? Thanks!! From: akhil1988 To: core-u...@hadoop.apache.org Date: 03/07/2009 06:59 AM Subject: Re: Using addCacheArchive

Re: Parallell maps

2009-07-03 Thread Marcus Herou
Hi. This is my company so I reveal what I like, even though the board would shoot me but hey do you think they are scanning this mailing list ? :) The PR algo is very simple (but clever) and can be found on wikipedia: http://en.wikipedia.org/wiki/PageRank What is painful is to calculate it in a di

Re: starting a tasktracker on a specific node in the cluster

2009-07-03 Thread Uri Shani
So, what exactly did you do? From: Iman E To: common-user@hadoop.apache.org Date: 03/07/2009 04:34 AM Subject: Re: starting a tasktracker on a specific node in the cluster The method I described below is now working! The jobtracker takes some time to update its list of available task tracke