Hey Sean,

Just curious about your AWS comment. I am only in the very early testing phases with AWS EMR. Would you generally recommend manually setting up an EC2 cluster to run Mahout rather than using EMR? I guess the question is: for those of us without the resources to set up an in-house Hadoop cluster, what is the best setup we can hope to achieve?
On Jun 24, 2011, at 11:26 PM, Sean Owen wrote:

> There's that. There's also the fact that a 32-way machine almost certainly
> doesn't have 32 times the I/O bandwidth, let alone 32 times faster seek
> latency. (That is, it doesn't have 32 disks.) For a lot of these kinds of jobs
> you could end up with an I/O bottleneck.
>
> Speaking of AWS and EMR, I find that I/O bottleneck is by far the issue
> there. I spread my jobs there as far across instances and racks as possible
> just to try to steal more little machines' I/O seeks!
>
> On Sat, Jun 25, 2011 at 3:17 AM, edwin <edwintc...@gmail.com> wrote:
>
>> Hi Ted,
>> I'm wondering for "isn't going to work well", you refer to inevitable
>> unnecessary hadoop overhead running on a single machine or there are other
>> implications to run big jobs on a single machine?
>>
>> - edwin
>>
>> On Jun 24, 2011, at 7:11 PM, Ted Dunning wrote:
>>
>>> I have done this with VM's but I would not generally recommend it. Without
>>> VM's you will have a pretty ugly configuration issue because Hadoop usually
>>> assumes it owns the machine.
>>>
>>> Besides, this is a seriously square peg into a round hole kind of problem
>>> here. Hadoop (map-reduce) was designed so that you could use several little
>>> machines instead of one big one. It just isn't going to work well on a
>>> single computer.
>>>
>>> On Fri, Jun 24, 2011 at 6:49 PM, XiaoboGu <guxiaobo1...@gmail.com> wrote:
>>>
>>>> Do you have any experience in running multiple data nodes and task
>>>> trackers on a single SMP server?
>>>>
>>>>> -----Original Message-----
>>>>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>>>>> Sent: Saturday, June 25, 2011 9:26 AM
>>>>> To: user@mahout.apache.org
>>>>> Cc: d...@mahout.apache.org
>>>>> Subject: Re: Can all the algorithms in Mahout be run locally without a Hadoop cluster.
>>>>>
>>>>> Pretty big. Should scream for local classifier learning.
>>>>>
>>>>> Local Hadoop should run pretty fast as well.
>>>>>
>>>>> On Fri, Jun 24, 2011 at 5:54 PM, XiaoboGu <guxiaobo1...@gmail.com> wrote:
>>>>>
>>>>>> 32Core, 256G RAM
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>>>>>>> Sent: Saturday, June 25, 2011 1:37 AM
>>>>>>> To: user@mahout.apache.org
>>>>>>> Cc: d...@mahout.apache.org
>>>>>>> Subject: Re: Can all the algorithms in Mahout be run locally without a Hadoop cluster.
>>>>>>>
>>>>>>> Big iron is fine for some of the classifier stuff, but throughput per $ can
>>>>>>> be higher for other algorithms with a cluster of smaller machines.
>>>>>>>
>>>>>>> How big a machine are you talking about? Even relatively small machines are
>>>>>>> pretty massive any more. 8 core = 16 hyper-thread machines with 48GB seem
>>>>>>> to be not even very impressive any more.
>>>>>>>
>>>>>>> On Fri, Jun 24, 2011 at 1:47 AM, XiaoboGu <guxiaobo1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We will put a big SMP server to deploy Mahout.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Xiaobo Gu