Thanks for your kind explanation :)
I'll try with a smaller number of mappers.


On Sun, Mar 24, 2013 at 7:47 PM, Djordje Jevdjic <[email protected]> wrote:

> 1.5GB should be enough to run the benchmark (i.e., the Mahout
> "testclassifier" program) correctly with the given inputs. However, I do
> not remember if it is enough for the model creation (trainclassifier).
> You can try with a couple of sizes and with a single map job to find
> the minimum heap size that you need such that the job doesn't crash.
>
> In general, if your machine does not have enough DRAM, the solution
> is not to reduce the heap size per process, but to reduce the number
> of processes. This might leave your machine underutilized, but
> that's not a problem for the model creation. The actual benchmark is
> only the last step, which your machine should handle well (4 cores and
> 8GB of DRAM seem ok).
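>
> For example, on a single-node setup like yours (assuming the classic,
> pre-YARN Hadoop configuration you are already editing), one way to cap the
> number of concurrent task processes is through the TaskTracker slot limits
> in mapred-site.xml; the values below are only an illustration, not a
> recommendation:
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>2</value>  <!-- illustrative: at most 2 concurrent map tasks -->
> </property>
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>1</value>  <!-- illustrative: at most 1 concurrent reduce task -->
> </property>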
>
> As for the reducers, their job in this benchmark is much simpler
> than the mappers', so you can assume that they don't need
> much heap.
>
> Regards,
> Djordje
> ________________________________________
> From: Jinchun Kim [[email protected]]
> Sent: Monday, March 25, 2013 12:56 AM
> To: Djordje Jevdjic
> Cc: [email protected]
> Subject: Re: Question about data analytic
>
> Thanks Djordje.
>
> The heap size indicated in mapred-site.xml is set to -Xmx2048M, and my
> machine has 8GB of DRAM.
> Based on your reply to Fu
> (http://www.mail-archive.com/[email protected]/msg00019.html),
> I'm using 4 mappers and 2 reducers, so I guess my machine is not able to
> run the benchmark with a 2GB heap size.
> In the reply, you said
>
> number of maps = number of cores you want to run this on
> number of reduce jobs = 1, unless the number of mappers is >8
> amount of memory = number of mappers * heap size
>
> Thus, running 4 mappers will require 8GB of heap in total, which is not
> available on my machine, because the OS and other processes also need
> memory.
>
> I'm going to reduce the heap size to 1.5GB and try it again.
>
> What I'm wondering is whether the reducers also use the heap area...
> If so, I need to decrease the number of reducers and the heap size as well.
> Does each reducer require its own heap area?
>
>
> On Sun, Mar 24, 2013 at 6:05 PM, Djordje Jevdjic <[email protected]> wrote:
> Dear Jinchun,
>
> A timeout of 1200 seconds is already more than generous; increasing it will
> not solve the problem.
> I cannot see your logs, but yes, the problem again seems to be the
> indicated heap size and the DRAM capacity your machine has.
>
> Regards,
> Djordje
> ________________________________________
> From: Jinchun Kim [[email protected]]
> Sent: Friday, March 22, 2013 3:04 PM
> To: Djordje Jevdjic
> Cc: [email protected]<mailto:[email protected]>
> Subject: Re: Question about data analytic
>
> Thanks Djordje :)
> I was able to prepare the input data file and now I'm trying to create
> category-based splits of the Wikipedia dataset (41GB) and the training
> data set (5GB) using Mahout.
>
> I had no problem with the training data set, but Hadoop showed the
> following messages when I tried to do the same job with the Wikipedia
> dataset:
>
> .........
> 13/03/21 22:31:00 INFO mapred.JobClient:  map 27% reduce 1%
> 13/03/21 22:40:31 INFO mapred.JobClient:  map 27% reduce 2%
> 13/03/21 22:58:49 INFO mapred.JobClient:  map 27% reduce 3%
> 13/03/21 23:22:57 INFO mapred.JobClient:  map 27% reduce 4%
> 13/03/21 23:46:32 INFO mapred.JobClient:  map 27% reduce 5%
> 13/03/22 00:27:14 INFO mapred.JobClient:  map 27% reduce 6%
> 13/03/22 01:06:55 INFO mapred.JobClient:  map 27% reduce 7%
> 13/03/22 01:14:06 INFO mapred.JobClient:  map 27% reduce 3%
> 13/03/22 01:15:35 INFO mapred.JobClient: Task Id :
> attempt_201303211339_0002_r_000000_1, Status : FAILED
> Task attempt_201303211339_0002_r_000000_1 failed to report status for 1200
> seconds. Killing!
> 13/03/22 01:20:09 INFO mapred.JobClient:  map 27% reduce 4%
> 13/03/22 01:33:35 INFO mapred.JobClient: Task Id :
> attempt_201303211339_0002_m_000037_1, Status : FAILED
> Task attempt_201303211339_0002_m_000037_1 failed to report status for 1228
> seconds. Killing!
> 13/03/22 01:35:12 INFO mapred.JobClient:  map 27% reduce 5%
> 13/03/22 01:40:38 INFO mapred.JobClient:  map 27% reduce 6%
> 13/03/22 01:52:28 INFO mapred.JobClient:  map 27% reduce 7%
> 13/03/22 02:16:27 INFO mapred.JobClient:  map 27% reduce 8%
> 13/03/22 02:19:02 INFO mapred.JobClient: Task Id :
> attempt_201303211339_0002_m_000018_1, Status : FAILED
> Task attempt_201303211339_0002_m_000018_1 failed to report status for 1204
> seconds. Killing!
> 13/03/22 02:49:03 INFO mapred.JobClient:  map 27% reduce 9%
> 13/03/22 02:52:04 INFO mapred.JobClient:  map 28% reduce 9%
> ........
>
> The reduce progress falls back to an earlier point, and the process ends at
> map 46%, reduce 2% without completing.
> Is this also related to the heap and DRAM size?
> I was wondering whether increasing the timeout would help or not.
>
>
> On Fri, Mar 22, 2013 at 8:46 AM, Djordje Jevdjic <[email protected]> wrote:
> Dear Jinchun,
>
> The warning message that you get is irrelevant. The problem seems to be in
> the amount of memory that is given to the map-reduce tasks. You need to
> increase the heap size (e.g., to -Xmx2048M) and make sure that you have
> enough DRAM for the heap size you indicate. To change the heap size, edit
> the following file:
> $HADOOP_HOME/conf/mapred-site.xml
> and specify the heap size by adding/changing the following parameter:
> mapred.child.java.opts
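>
> For reference, the corresponding block in mapred-site.xml could look roughly
> like this (the 2GB value is only an example; use whatever your DRAM allows):
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx2048m</value>  <!-- per-task JVM heap; adjust to your memory -->
> </property>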
>
> If your machine doesn't have enough DRAM, the whole process of preparing
> the data and the model is indeed expected to take a couple of hours.
>
> Regards,
> Djordje
> ________________________________________
> From: Jinchun Kim [[email protected]]
> Sent: Friday, March 22, 2013 1:14 PM
> To: [email protected]<mailto:[email protected]><mailto:
> [email protected]<mailto:[email protected]>>
> Subject: Question about data analytic
>
> Hi, All.
>
> I'm trying to run Data Analytics on my x86 Ubuntu machine.
> I found that when I divided the 30GB Wikipedia input data into small chunks
> of 64MB, CPU usage was really low.
> I checked this with the /usr/bin/time command.
> Most of the execution time was spent idle or waiting;
> user CPU time was only 13% of the total running time.
>
> Is it because I'm running Data Analytics on a single node?
> Or does it have something to do with the following warning message?
>
> WARN driver.MahoutDriver: No wikipediaXMLSplitter.props found on classpath,
> will use command-line arguments only
>
> I don't understand why the user CPU time is so low when it takes 2.5 hours
> to finish splitting the Wikipedia inputs.
> Thanks!
> Thanks!
>
> --
> Jinchun Kim
>
>
>
> --
> Jinchun Kim
>
>
>
> --
> Thanks,
> Jinchun Kim
>



-- 
Thanks,
Jinchun Kim
