Hello Fu,

I see that your mapred-site.xml file says that you want four map and two reduce 
processes. Each of these is spawned by the framework as
a separate process. You also ask for 2GB of heap space per process, which you 
definitely don't have in your little Xen machine. 
Basically, the minimum memory you need to run this is (number_of_map_processes * 
heap_size) plus the framework overhead (less than 1GB). How many cores does your 
virtual machine have? For the purpose of 
this benchmark, you should have a hardware configuration similar to the 
following:

number of maps = number of cores you want to run this on
number of reduce jobs = 1, unless the number of mappers is >8
amount of memory = number of mappers * heap size
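As a quick sanity check, the sizing rule above can be put into a few lines of code. This is a rough back-of-the-envelope sketch; the 1GB overhead figure is an upper bound taken from the note above, not a measured value:

```python
# Back-of-the-envelope memory sizing for a single-node Hadoop setup.
# The 1 GB framework overhead is an upper bound, not a measured value.

def min_memory_gb(num_mappers: int, heap_gb: float, overhead_gb: float = 1.0) -> float:
    """Minimum RAM: one heap per concurrent map process, plus framework overhead."""
    return num_mappers * heap_gb + overhead_gb

# Your current configuration: 4 mappers with a 2 GB heap each.
print(min_memory_gb(4, 2.0))   # 9.0 -- far beyond a 2 GB VM
# Suggested configuration: 1 mapper with a 1.5 GB heap.
print(min_memory_gb(1, 1.5))   # 2.5
```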

If you can't afford more than 2GB of memory, I suggest that you change the 
number of mappers and reducers to 1 in the config file and set the heap size to 
1.5GB.
The following parameters are affected (mapred-site.xml):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.</description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
  <description>The default number of map tasks per job. Ignored when
  mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1536M</value>
</property>


Also, please check the benchmark documentation tomorrow; I will refresh the 
instructions so that you can run the benchmark with smaller memory 
requirements. 


Regards,
Djordje



________________________________________
From: 付斌章 [[email protected]]
Sent: Thursday, April 12, 2012 3:27 PM
To: Djordje Jevdjic
Subject: Re: RE: A question about "data analytics"

Hello Djordje,

    Thanks for your advice; the problem was indeed caused by the tmp directory. 
I think the reason may be that I didn't reformat the namenode after I changed 
the tmp directory. After I reformatted it, the "class cast" exception 
disappeared. Unfortunately, another problem occurred: the job was killed 
every time I tried. I found in "tasktracker.log" that the error is "FATAL 
org.apache.hadoop.mapred.TaskTracker: Task: 
attempt_201204120549_0001_m_000000_3 - Killed : Java heap space".

    Does this mean that the main memory is not enough? I am running 
data analytics in a Xen virtual machine with 2GB of memory. Is this memory too 
small, or is there something wrong in my configuration? I have attached the 
configuration files and log files to this email. I would be very grateful 
if you could help me check these files.

BTW:
the output is
----------------------------------------
hadoop@debian-98:~$ $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i 
wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/temp/categories.txt
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/hadoop/hadoop-0.20.2
No HADOOP_CONF_DIR set, using /home/hadoop/hadoop-0.20.2/conf
MAHOUT-JOB: 
/home/hadoop/mahout-distribution-0.6/examples/target/mahout-examples-0.6-job.jar
12/04/12 01:40:23 WARN driver.MahoutDriver: No wikipediaDataSetCreator.props 
found on classpath, will use command-line arguments only
12/04/12 01:41:03 INFO bayes.WikipediaDatasetCreatorDriver: Input: 
wikipedia/chunks Out: wikipediainput Categories: 
/home/hadoop/mahout-distribution-0.6//examples/temp/categories.txt
12/04/12 01:41:04 INFO common.HadoopUtil: Deleting wikipediainput
12/04/12 01:41:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
12/04/12 01:41:05 INFO input.FileInputFormat: Total input paths to process : 7
12/04/12 01:41:07 INFO mapred.JobClient: Running job: job_201204120140_0001
12/04/12 01:41:08 INFO mapred.JobClient:  map 0% reduce 0%
Killed

Best Regards,
Fu, Binzhang

> -----Original Message-----
> From: "Djordje Jevdjic" <[email protected]>
> Sent: Thursday, April 12, 2012
> To: "Fu Bin-zhang" <[email protected]>, "[email protected]" 
> <[email protected]>
> Cc:
> Subject: RE: A question about "data analytics"
>
> Hello Fu Bin-zhang,
>
> The error message is very weird because FileSplit is a class derived from 
> InputSplit,
> and the conversion is legal. However, I've seen this message several times. 
> The error
> is most likely related to the location of the Hadoop tmp directory. Could 
> you please compress and
> send me your $HADOOP_HOME/conf folder? No need to broadcast to the list; 
> send it to me directly.
>
> Regards,
> Djordje
> ________________________________________
> From: Fu Bin-zhang [[email protected]]
> Sent: Wednesday, April 11, 2012 4:11 PM
> To: [email protected]
> Subject: A question about "data analytics"
>
> Hi all,
>
>     I am trying to run the data analytics benchmark. I followed the 
> instructions on the CloudSuite website. Everything is OK until the 7th step, 
> "create the category-based split of the Wikipedia dataset". The error is 
> "java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to 
> org.apache.hadoop.mapred.InputSplit". I failed to find the answer with 
> Google. Can anybody give a hint?
>
>     Thanks in advance.
>
> The output is:
> -------------------------------
> hadoop@debian-98:~$ $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i 
> wikipedia/chunks -o wikipediainput -c 
> $MAHOUT_HOME/examples/temp/categories.txt
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using HADOOP_HOME=/home/hadoop/hadoop-0.20.2
> No HADOOP_CONF_DIR set, using /home/hadoop/hadoop-0.20.2/conf
> MAHOUT-JOB: 
> /home/hadoop/mahout-distribution-0.6/examples/target/mahout-examples-0.6-job.jar
> 12/04/11 06:55:10 WARN driver.MahoutDriver: No wikipediaDataSetCreator.props 
> found on classpath, will use command-line arguments only
> 12/04/11 06:55:12 INFO bayes.WikipediaDatasetCreatorDriver: Input: 
> wikipedia/chunks Out: wikipediainput Categories: 
> /home/hadoop/mahout-distribution-0.6/examples/temp/categories.txt
> 12/04/11 06:55:13 INFO common.HadoopUtil: Deleting wikipediainput
> 12/04/11 06:55:13 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 12/04/11 06:55:15 INFO input.FileInputFormat: Total input paths to process : 7
> 12/04/11 06:55:17 INFO mapred.JobClient: Running job: job_201204110624_0002
> 12/04/11 06:55:18 INFO mapred.JobClient:  map 0% reduce 0%
> 12/04/11 06:55:44 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000003_0, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:55:48 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000000_0, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:55:48 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000001_0, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:55:51 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000002_0, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:55:54 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000003_1, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:56:03 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000001_1, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:56:03 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000000_1, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:56:03 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000002_1, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:56:06 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000003_2, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:56:15 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000002_2, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:56:18 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000000_2, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:56:18 INFO mapred.JobClient: Task Id : 
> attempt_201204110624_0002_m_000001_2, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit 
> cannot be cast to org.apache.hadoop.mapred.InputSplit
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:323)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 12/04/11 06:56:24 INFO mapred.JobClient: Job complete: job_201204110624_0002
> 12/04/11 06:56:24 INFO mapred.JobClient: Counters: 3
> 12/04/11 06:56:24 INFO mapred.JobClient:   Job Counters
> 12/04/11 06:56:24 INFO mapred.JobClient:     Launched map tasks=14
> 12/04/11 06:56:24 INFO mapred.JobClient:     Data-local map tasks=14
> 12/04/11 06:56:24 INFO mapred.JobClient:     Failed map tasks=1
> 12/04/11 06:56:24 INFO driver.MahoutDriver: Program took 74439 ms (Minutes: 
> 1.24065)
>
> ----------------
> Fu, Binzhang
> 2012-04-11



