[
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447568#comment-13447568
]
Ming Jin commented on MAPREDUCE-577:
Hi everyone,
I found the exact same issue in Hadoop v1.0.3
(http://fossies.org/dox/hadoop-1.0.3/StreamXmlRecordReader_8java_source.html).
Is there any plan to fix it in v1.0.3?
> Duplicate Mapper input when using StreamXmlRecordReader
> ---
>
> Key: MAPREDUCE-577
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-577
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: contrib/streaming
> Environment: HADOOP 0.17.0, Java 6.0
> Reporter: David Campbell
> Assignee: Ravi Gummadi
> Fix For: 0.22.0
>
> Attachments: 0001-test-to-demonstrate-HADOOP-3484.patch,
> 0002-patch-for-HADOOP-3484.patch, 577.20S.patch, 577.patch, 577.v1.patch,
> 577.v2.patch, 577.v3.patch, 577.v4.patch, HADOOP-3484.combined.patch,
> HADOOP-3484.try3.patch
>
>
> I have an XML file with 93626 rows. A row is marked by
> I've confirmed this with grep and the Grep example program included with
> HADOOP.
> Here is the grep example output. 93626
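The grep-style count described above can be cross-checked without Hadoop. The following is only a minimal sketch: the class name `MarkerCount` is made up for illustration, and `<row>` is a hypothetical placeholder because the actual row marker is not shown in the report.

```java
public class MarkerCount {
    // Counts non-overlapping occurrences of marker in text.
    static int countOccurrences(String text, String marker) {
        if (marker.isEmpty()) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        int count = 0;
        int from = 0;
        // indexOf returns -1 once no further match exists.
        while ((from = text.indexOf(marker, from)) != -1) {
            count++;
            from += marker.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // "<row>" is a hypothetical marker; substitute the real begin tag.
        String sample = "<rows><row>a</row><row>b</row><row>c</row></rows>";
        System.out.println(countOccurrences(sample, "<row>")); // prints 3
    }
}
```

For a real input file, the same method could be applied to the file contents read into a string, assuming the file fits in memory.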
> I've set up my job configuration as follows:
> conf.set("stream.recordreader.class",
> "org.apache.hadoop.streaming.StreamXmlRecordReader");
> conf.set("stream.recordreader.begin", "");
> conf.set("stream.recordreader.end", "");
> conf.setInputFormat(StreamInputFormat.class);
> I have a fairly simple test Mapper.
> Here's the map method.
> public void map(Text key, Text value, OutputCollector
>         output, Reporter reporter) throws IOException {
>     try {
>         output.collect(totalWord, one);
>         if (key != null && key.toString().indexOf("01852") != -1) {
>             output.collect(new Text("01852"), one);
>         }
>     } catch (Exception ex) {
>         Logger.getLogger(TestMapper.class.getName()).log(Level.SEVERE,
>                 null, ex);
>         System.out.println(value);
>     }
> }
> For totalWord ("TOTAL"), I get:
> TOTAL 140850
> and for 01852 I get:
> 01852 86
> There are 43 instances of 01852 in the file.
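To make the size of the discrepancy explicit, here is a small sketch that divides the observed counts by the expected ones, using only the numbers quoted above. The class and method names are made up for illustration.

```java
import java.util.Locale;

public class DuplicationRatio {
    // Ratio of the count the job emitted to the count expected from the input.
    static double ratio(long observed, long expected) {
        if (expected <= 0) {
            throw new IllegalArgumentException("expected must be positive");
        }
        return (double) observed / expected;
    }

    public static void main(String[] args) {
        // TOTAL: 140850 emitted for 93626 rows in the file.
        System.out.println(String.format(Locale.ROOT, "%.2f", ratio(140850, 93626))); // prints 1.50
        // 01852: 86 emitted for 43 instances in the file.
        System.out.println(String.format(Locale.ROOT, "%.2f", ratio(86, 43))); // prints 2.00
    }
}
```

By this arithmetic the 01852 count is exactly doubled, while the TOTAL count is roughly 1.5 times the row count, so the duplication may not be uniform across the input.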
> I have the following setting in my config.
> conf.setNumMapTasks(1);
> I have a total of six machines in my cluster.
> If I run without this, the result is 12x the actual value, not 2x.
> Here's some info from the cluster web page.
> Maps  Reduces  Total Submissions  Nodes  Map Task Capacity  Reduce Task Capacity  Avg. Tasks/Node
> 0     0        1                  6      12                 12                    4.00
> I've also noticed something really strange in the job's output. It looks
> like it's starting over or redoing things.
> This was run using all six nodes and no limitations on map or reduce tasks.
> I haven't seen this behavior in any other case.
> 08/06/03 10:50:35 INFO mapred.FileInputFormat: Total input paths to process : 1
> 08/06/03 10:50:36 INFO mapred.JobClient: Running job: job_200806030916_0018
> 08/06/03 10:50:37 INFO mapred.JobClient: map 0% reduce 0%
> 08/06/03 10:50:42 INFO mapred.JobClient: map 2% reduce 0%
> 08/06/03 10:50:45 INFO mapred.JobClient: map 12% reduce 0%
> 08/06/03 10:50:47 INFO mapred.JobClient: map 31% reduce 0%
> 08/06/03 10:50:48 INFO mapred.JobClient: map 49% reduce 0%
> 08/06/03 10:50:49 INFO mapred.JobClient: map 68% reduce 0%
> 08/06/03 10:50:50 INFO mapred.JobClient: map 100% reduce 0%
> 08/06/03 10:50:54 INFO mapred.JobClient: map 87% reduce 0%
> 08/06/03 10:50:55 INFO mapred.JobClient: map 100% reduce 0%
> 08/06/03 10:50:56 INFO mapred.JobClient: map 0% reduce 0%
> 08/06/03 10:51:00 INFO mapred.JobClient: map 0% reduce 1%
> 08/06/03 10:51:05 INFO mapred.JobClient: map 28% reduce 2%
> 08/06/03 10:51:07 INFO mapred.JobClient: map 80% reduce 4%
> 08/06/03 10:51:08 INFO mapred.JobClient: map 100% reduce 4%
> 08/06/03 10:51:09 INFO mapred.JobClient: map 100% reduce 7%
> 08/06/03 10:51:10 INFO mapred.JobClient: map 90% reduce 9%
> 08/06/03 10:51:11 INFO mapred.JobClient: map 100% reduce 9%
> 08/06/03 10:51:12 INFO mapred.JobClient: map 100% reduce 11%
> 08/06/03 10:51:13 INFO mapred.JobClient: map 90% reduce 11%
> 08/06/03 10:51:14 INFO mapred.JobClient: map 97% reduce 11%
> 08/06/03 10:51:15 INFO mapred.JobClient: map 63% reduce 11%
> 08/06/03 10:51:16 INFO mapred.JobClient: map 48% reduce 11%
> 08/06/03 10:51:17 INFO mapred.JobClient: map 21% reduce 11%
> 08/06/03 10:51:19 INFO mapred.JobClient: map 0% reduce 11%
> 08/06/03 10:51:20 INFO mapred.JobClient: map 15% reduce 12%
> 08/06/03 10:51:21 INFO mapred.JobClient: map 27% reduce 13%
> 08/06/03 10:51:22 INFO mapred.JobClient: map 67% reduce 13%
> 08/06/03 10:51:24 INFO mapred.JobClient: map 22% reduce 16%
> 08/06/03 10:51:25 INFO mapred.JobClient: map 46% reduce 1