Thanks for your kind explanation :) I'll try with a smaller number of mappers.
On Sun, Mar 24, 2013 at 7:47 PM, Djordje Jevdjic <[email protected]> wrote:

> 1.5GB should be enough to run the benchmark (i.e., the Mahout
> "testclassifier" program) correctly with the given inputs. However, I do
> not remember if it is enough for the model creation (trainclassifier).
> You can try a couple of sizes with a single map job to find the minimum
> heap size you need so that the job doesn't crash.
>
> In general, if your machine does not have enough DRAM, the solution is
> not to reduce the heap size per process, but to reduce the number of
> processes. This might leave your machine underutilized, but that's not a
> problem for the model creation. The actual benchmark is only the last
> step, which your machine should handle well (4 cores and 8GB of DRAM
> seems ok).
>
> As for the reducers, their job in this benchmark is much simpler than
> the mappers', so you can assume that they don't need much heap.
>
> Regards,
> Djordje
> ________________________________________
> From: Jinchun Kim [[email protected]]
> Sent: Monday, March 25, 2013 12:56 AM
> To: Djordje Jevdjic
> Cc: [email protected]
> Subject: Re: Question about data analytic
>
> Thanks Djordje.
>
> The heap size indicated in mapred-site.xml is set to -Xmx2048M and my
> machine has 8GB of DRAM.
> Based on your reply to Fu
> (http://www.mail-archive.com/[email protected]/msg00019.html),
> I'm using 4 mappers and 2 reducers, so I guess my machine is not able to
> run the benchmark with a 2GB heap size.
> In that reply, you said:
>
> > number of maps = number of cores you want to run this on
> > number of reduce jobs = 1, unless the number of mappers is >8
> > amount of memory = number of mappers * heap size
>
> Thus, running 4 mappers will require 8GB of heap in total, which is not
> available on my machine because the OS and other processes also use
> memory.
> I'm going to reduce the heap size to 1.5GB and try again.
>
> What I'm wondering is whether the reducers also use heap space.
> If so, I need to decrease the number of reducers as well as the heap
> size. Does each reducer require its own heap?
>
> On Sun, Mar 24, 2013 at 6:05 PM, Djordje Jevdjic <[email protected]> wrote:
>
> Dear Jinchun,
>
> A timeout of 1200 sec is already too generous. Increasing it will not
> solve the problem. I cannot see your logs, but yes, the problem again
> seems to be the indicated heap size and the DRAM capacity your machine
> has.
>
> Regards,
> Djordje
> ________________________________________
> From: Jinchun Kim [[email protected]]
> Sent: Friday, March 22, 2013 3:04 PM
> To: Djordje Jevdjic
> Cc: [email protected]
> Subject: Re: Question about data analytic
>
> Thanks Djordje :)
> I was able to prepare the input data file and now I'm trying to create
> category-based splits of the Wikipedia dataset (41GB) and the training
> data set (5GB) using Mahout.
>
> I had no problem with the training data set, but Hadoop showed the
> following messages when I tried to do the same job with the Wikipedia
> dataset:
>
> .........
> 13/03/21 22:31:00 INFO mapred.JobClient: map 27% reduce 1%
> 13/03/21 22:40:31 INFO mapred.JobClient: map 27% reduce 2%
> 13/03/21 22:58:49 INFO mapred.JobClient: map 27% reduce 3%
> 13/03/21 23:22:57 INFO mapred.JobClient: map 27% reduce 4%
> 13/03/21 23:46:32 INFO mapred.JobClient: map 27% reduce 5%
> 13/03/22 00:27:14 INFO mapred.JobClient: map 27% reduce 6%
> 13/03/22 01:06:55 INFO mapred.JobClient: map 27% reduce 7%
> 13/03/22 01:14:06 INFO mapred.JobClient: map 27% reduce 3%
> 13/03/22 01:15:35 INFO mapred.JobClient: Task Id :
> attempt_201303211339_0002_r_000000_1, Status : FAILED
> Task attempt_201303211339_0002_r_000000_1 failed to report status for
> 1200 seconds. Killing!
> 13/03/22 01:20:09 INFO mapred.JobClient: map 27% reduce 4%
> 13/03/22 01:33:35 INFO mapred.JobClient: Task Id :
> attempt_201303211339_0002_m_000037_1, Status : FAILED
> Task attempt_201303211339_0002_m_000037_1 failed to report status for
> 1228 seconds. Killing!
> 13/03/22 01:35:12 INFO mapred.JobClient: map 27% reduce 5%
> 13/03/22 01:40:38 INFO mapred.JobClient: map 27% reduce 6%
> 13/03/22 01:52:28 INFO mapred.JobClient: map 27% reduce 7%
> 13/03/22 02:16:27 INFO mapred.JobClient: map 27% reduce 8%
> 13/03/22 02:19:02 INFO mapred.JobClient: Task Id :
> attempt_201303211339_0002_m_000018_1, Status : FAILED
> Task attempt_201303211339_0002_m_000018_1 failed to report status for
> 1204 seconds. Killing!
> 13/03/22 02:49:03 INFO mapred.JobClient: map 27% reduce 9%
> 13/03/22 02:52:04 INFO mapred.JobClient: map 28% reduce 9%
> ........
>
> The reduce progress falls back to an earlier point, and the process ends
> at map 46%, reduce 2% without completing.
> Is this also related to the heap and DRAM size?
> I was also wondering whether increasing the timeout would help.
>
> On Fri, Mar 22, 2013 at 8:46 AM, Djordje Jevdjic <[email protected]> wrote:
>
> Dear Jinchun,
>
> The warning message that you get is irrelevant. The problem seems to be
> the amount of memory that is given to the map-reduce tasks. You need to
> increase the heap size (e.g., -Xmx2048M) and make sure that you have
> enough DRAM for the heap size you indicate. To change the heap size, edit
> the following file
>   $HADOOP_HOME/conf/mapred-site.xml
> and specify the heap size by adding/changing the following parameter
>   mapred.child.java.opts
>
> If your machine doesn't have enough DRAM, the whole process of preparing
> the data and the model is indeed expected to take a couple of hours.
>
> Regards,
> Djordje
> ________________________________________
> From: Jinchun Kim [[email protected]]
> Sent: Friday, March 22, 2013 1:14 PM
> To: [email protected]
> Subject: Question about data analytic
>
> Hi, All.
>
> I'm trying to run the Data Analytics benchmark on my x86 Ubuntu machine.
> I found that when I divided the 30GB Wikipedia input data into small
> chunks of 64MB, CPU usage was really low.
> I checked it with the /usr/bin/time command.
> Most of the execution time was spent idle and waiting; user CPU time was
> only 13% of the total running time.
>
> Is it because I'm running the benchmark on a single node?
> Or does it have something to do with the following warning message?
>
> WARN driver.MahoutDriver: No wikipediaXMLSplitter.props found on
> classpath, will use command-line arguments only
>
> I don't understand why user CPU time is so low while it takes 2.5 hours
> to finish splitting the Wikipedia inputs.
> Thanks!
>
> --
> Jinchun Kim

--
Thanks,
Jinchun Kim
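
For reference, the settings discussed in the thread can be collected into a single mapred-site.xml. The sketch below is a minimal example for a single-node Hadoop 1.x (MRv1) installation; the property names are the standard MRv1 ones mentioned or implied above, but the specific values are only illustrative for a 4-core, 8GB machine and are not prescribed by the thread.

    <?xml version="1.0"?>
    <!-- $HADOOP_HOME/conf/mapred-site.xml (Hadoop 1.x / MRv1); values are illustrative -->
    <configuration>

      <!-- Per-task JVM heap. Keep (concurrent tasks x heap) below physical DRAM
           minus what the OS and other processes need. -->
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx1536m</value>
      </property>

      <!-- Per the advice above: reduce the number of concurrent processes rather
           than the per-process heap, by capping the slots per TaskTracker. -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>1</value>
      </property>

      <!-- Task liveness timeout in milliseconds; the "failed to report status for
           1200 seconds" kills in the log correspond to 1200000. Raising this does
           not fix memory pressure, as noted in the thread. -->
      <property>
        <name>mapred.task.timeout</name>
        <value>1200000</value>
      </property>

    </configuration>

With these example values, two map slots and one reduce slot at 1.5GB each give a worst-case task heap footprint of about 4.5GB, leaving headroom on an 8GB machine for the OS and the Hadoop daemons.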
