[jira] [Updated] (MAPREDUCE-4631) Duplicate Mapper input when using StreamXmlRecordReader

2012-09-04 Thread Ming Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Jin updated MAPREDUCE-4631:


Environment: Hadoop v1.0.3, JDK 6

> Duplicate Mapper input when using StreamXmlRecordReader
> ---
>
>              Key: MAPREDUCE-4631
>              URL: https://issues.apache.org/jira/browse/MAPREDUCE-4631
>          Project: Hadoop Map/Reduce
>       Issue Type: Bug
>       Components: contrib/streaming
> Affects Versions: 1.0.3
>      Environment: Hadoop v1.0.3, JDK 6
>         Reporter: Ming Jin
>
> This is the same defect as 
> https://issues.apache.org/jira/browse/MAPREDUCE-577, which was fixed in 
> v0.22.0.
> So I'm wondering whether there is a plan to fix it in v1.0.3 as well? Or 
> shall I move to v2.0.x?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-4631) Duplicate Mapper input when using StreamXmlRecordReader

2012-09-04 Thread Ming Jin (JIRA)
Ming Jin created MAPREDUCE-4631:
---

         Summary: Duplicate Mapper input when using StreamXmlRecordReader
             Key: MAPREDUCE-4631
             URL: https://issues.apache.org/jira/browse/MAPREDUCE-4631
         Project: Hadoop Map/Reduce
      Issue Type: Bug
      Components: contrib/streaming
Affects Versions: 1.0.3
        Reporter: Ming Jin


This is the same defect as https://issues.apache.org/jira/browse/MAPREDUCE-577, 
which was fixed in v0.22.0.

So I'm wondering whether there is a plan to fix it in v1.0.3 as well? Or shall 
I move to v2.0.x?



[jira] [Commented] (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2012-09-04 Thread Ming Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447568#comment-13447568
 ] 

Ming Jin commented on MAPREDUCE-577:


Hi everyone,

I found exactly the same issue in Hadoop v1.0.3
(http://fossies.org/dox/hadoop-1.0.3/StreamXmlRecordReader_8java_source.html).
 

Is there any plan to fix it in v1.0.3?
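
For readers unfamiliar with this kind of failure, the following is a minimal, self-contained sketch of one way a record reader can hand the same record to more than one map task: it keeps scanning past the end of its own input split. This is an illustration only. It is not the StreamXmlRecordReader source, and it is an assumption, not a confirmed diagnosis, that the 1.0.3 behaviour has this cause. The class and method names (SplitDuplicationSketch, readSplit, totalRecords) and the <row> tag are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not Hadoop code.
class SplitDuplicationSketch {

    // Returns the records emitted for the character range [start, end).
    // When respectSplitEnd is false, the reader does not stop at 'end' and
    // keeps scanning to the end of the input, so it also emits records that
    // belong to later splits.
    static List<String> readSplit(String data, String beginTag, String endTag,
                                  int start, int end, boolean respectSplitEnd) {
        List<String> records = new ArrayList<String>();
        int pos = start;
        while (true) {
            int b = data.indexOf(beginTag, pos);
            if (b < 0) {
                break; // no more records in the input
            }
            if (respectSplitEnd && b >= end) {
                break; // record starts in a later split; leave it to that split
            }
            int e = data.indexOf(endTag, b);
            if (e < 0) {
                break; // unterminated record
            }
            e += endTag.length();
            records.add(data.substring(b, e));
            pos = e;
        }
        return records;
    }

    // Cuts the input into fixed-size splits and sums the records every split emits.
    static int totalRecords(String data, int splitSize, boolean respectSplitEnd) {
        int total = 0;
        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            total += readSplit(data, "<row>", "</row>", start, end, respectSplitEnd).size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Four 12-character records, cut into two 24-character splits.
        String data = "<row>a</row><row>b</row><row>c</row><row>d</row>";
        System.out.println(totalRecords(data, 24, true));  // 4
        System.out.println(totalRecords(data, 24, false)); // 6
    }
}
```

In the sketch, the split-aware reader emits each of the four records once, while the reader that ignores the split end emits six, because the first split also reads the two records that belong to the second.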

> Duplicate Mapper input when using StreamXmlRecordReader
> ---
>
>         Key: MAPREDUCE-577
>         URL: https://issues.apache.org/jira/browse/MAPREDUCE-577
>     Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming
> Environment: HADOOP 0.17.0, Java 6.0
>    Reporter: David Campbell
>    Assignee: Ravi Gummadi
>     Fix For: 0.22.0
>
> Attachments: 0001-test-to-demonstrate-HADOOP-3484.patch, 
> 0002-patch-for-HADOOP-3484.patch, 577.20S.patch, 577.patch, 577.v1.patch, 
> 577.v2.patch, 577.v3.patch, 577.v4.patch, HADOOP-3484.combined.patch, 
> HADOOP-3484.try3.patch
>
>
> I have an XML file with 93626 rows.  A row is marked by 
> I've confirmed this with grep and the Grep example program included with 
> HADOOP.
> Here is the grep example output.  93626   
> I've set up my job configuration as follows:
> conf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader");
> conf.set("stream.recordreader.begin", "");
> conf.set("stream.recordreader.end", "");
> conf.setInputFormat(StreamInputFormat.class);
> I have a fairly simple test Mapper.
> Here's the map method.
>   public void map(Text key, Text value, OutputCollector output,
>       Reporter reporter) throws IOException {
>     try {
>       output.collect(totalWord, one);
>       if (key != null && key.toString().indexOf("01852") != -1) {
>         output.collect(new Text("01852"), one);
>       }
>     } catch (Exception ex) {
>       Logger.getLogger(TestMapper.class.getName()).log(Level.SEVERE, null, ex);
>       System.out.println(value);
>     }
>   }
> For totalWord ("TOTAL"), I get:
> TOTAL 140850
> and for 01852 I get:
> 01852 86
> There are 43 instances of 01852 in the file.
> I have the following setting in my config.  
>conf.setNumMapTasks(1);
> I have a total of six machines in my cluster.
> If I run without this, the result is 12x the actual value, not 2x.
> Here's some info from the cluster web page.
> Maps  Reduces  Total Submissions  Nodes  Map Task Capacity  Reduce Task Capacity  Avg. Tasks/Node
> 0     0        1                  6      12                 12                    4.00
> I've also noticed something really strange in the job's output.  It looks 
> like it's starting over or redoing things.
> This was run using all six nodes and no limitations on map or reduce tasks.  
> I haven't seen this behavior in any other case.
> 08/06/03 10:50:35 INFO mapred.FileInputFormat: Total input paths to process : 1
> 08/06/03 10:50:36 INFO mapred.JobClient: Running job: job_200806030916_0018
> 08/06/03 10:50:37 INFO mapred.JobClient:  map 0% reduce 0%
> 08/06/03 10:50:42 INFO mapred.JobClient:  map 2% reduce 0%
> 08/06/03 10:50:45 INFO mapred.JobClient:  map 12% reduce 0%
> 08/06/03 10:50:47 INFO mapred.JobClient:  map 31% reduce 0%
> 08/06/03 10:50:48 INFO mapred.JobClient:  map 49% reduce 0%
> 08/06/03 10:50:49 INFO mapred.JobClient:  map 68% reduce 0%
> 08/06/03 10:50:50 INFO mapred.JobClient:  map 100% reduce 0%
> 08/06/03 10:50:54 INFO mapred.JobClient:  map 87% reduce 0%
> 08/06/03 10:50:55 INFO mapred.JobClient:  map 100% reduce 0%
> 08/06/03 10:50:56 INFO mapred.JobClient:  map 0% reduce 0%
> 08/06/03 10:51:00 INFO mapred.JobClient:  map 0% reduce 1%
> 08/06/03 10:51:05 INFO mapred.JobClient:  map 28% reduce 2%
> 08/06/03 10:51:07 INFO mapred.JobClient:  map 80% reduce 4%
> 08/06/03 10:51:08 INFO mapred.JobClient:  map 100% reduce 4%
> 08/06/03 10:51:09 INFO mapred.JobClient:  map 100% reduce 7%
> 08/06/03 10:51:10 INFO mapred.JobClient:  map 90% reduce 9%
> 08/06/03 10:51:11 INFO mapred.JobClient:  map 100% reduce 9%
> 08/06/03 10:51:12 INFO mapred.JobClient:  map 100% reduce 11%
> 08/06/03 10:51:13 INFO mapred.JobClient:  map 90% reduce 11%
> 08/06/03 10:51:14 INFO mapred.JobClient:  map 97% reduce 11%
> 08/06/03 10:51:15 INFO mapred.JobClient:  map 63% reduce 11%
> 08/06/03 10:51:16 INFO mapred.JobClient:  map 48% reduce 11%
> 08/06/03 10:51:17 INFO mapred.JobClient:  map 21% reduce 11%
> 08/06/03 10:51:19 INFO mapred.JobClient:  map 0% reduce 11%
> 08/06/03 10:51:20 INFO mapred.JobClient:  map 15% reduce 12%
> 08/06/03 10:51:21 INFO mapred.JobClient:  map 27% reduce 13%
> 08/06/03 10:51:22 INFO mapred.JobClient:  map 67% reduce 13%
> 08/06/03 10:51:24 INFO mapred.JobClient:  map 22% reduce 16%
> 08/06/03 10:51:25 INFO mapred.JobClient:  map 46% reduce 1