[jira] [Commented] (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2013-08-15 Thread Pierre-Francois Laquerre (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741530#comment-13741530
 ] 

Pierre-Francois Laquerre commented on MAPREDUCE-577:


This is still broken in 1.1.2.

 Duplicate Mapper input when using StreamXmlRecordReader
 ---

 Key: MAPREDUCE-577
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-577
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/streaming
 Environment: HADOOP 0.17.0, Java 6.0
Reporter: David Campbell
Assignee: Ravi Gummadi
 Fix For: 0.22.0

 Attachments: 0001-test-to-demonstrate-HADOOP-3484.patch, 
 0002-patch-for-HADOOP-3484.patch, 577.20S.patch, 577.patch, 577.v1.patch, 
 577.v2.patch, 577.v3.patch, 577.v4.patch, HADOOP-3484.combined.patch, 
 HADOOP-3484.try3.patch


 I have an XML file with 93626 rows.  A row is marked by <row>...</row>.
 I've confirmed this with grep and the Grep example program included with 
 HADOOP.
 Here is the grep example output:  93626   <row>
 I've set up my job configuration as follows:
 conf.set("stream.recordreader.class",
     "org.apache.hadoop.streaming.StreamXmlRecordReader");
 conf.set("stream.recordreader.begin", "<row>");
 conf.set("stream.recordreader.end", "</row>");
 conf.setInputFormat(StreamInputFormat.class);
 I have a fairly simple test Mapper.
 Here's the map method.
 public void map(Text key, Text value, OutputCollector<Text, IntWritable>
     output, Reporter reporter) throws IOException {
     try {
         output.collect(totalWord, one);
         if (key != null && key.toString().indexOf("01852") != -1) {
             output.collect(new Text("01852"), one);
         }
     } catch (Exception ex) {
         Logger.getLogger(TestMapper.class.getName()).log(Level.SEVERE,
             null, ex);
         System.out.println(value);
     }
 }
 For totalWord (TOTAL), I get:
 TOTAL 140850
 and for 01852 I get:
 01852 86
 There are 43 instances of 01852 in the file.
 I have the following setting in my config:
 conf.setNumMapTasks(1);
 I have a total of six machines in my cluster.
 If I run without this, the result is 12x the actual value, not 2x.
 Here's some info from the cluster web page.
 Maps  Reduces  Total Submissions  Nodes  Map Task Capacity  Reduce Task Capacity  Avg. Tasks/Node
 0     0        1                  6      12                 12                    4.00
 I've also noticed something really strange in the job's output.  It looks 
 like it's starting over or redoing things.
 This was run using all six nodes and no limitations on map or reduce tasks.  
 I haven't seen this behavior in any other case.
 08/06/03 10:50:35 INFO mapred.FileInputFormat: Total input paths to process : 
 1
 08/06/03 10:50:36 INFO mapred.JobClient: Running job: job_200806030916_0018
 08/06/03 10:50:37 INFO mapred.JobClient:  map 0% reduce 0%
 08/06/03 10:50:42 INFO mapred.JobClient:  map 2% reduce 0%
 08/06/03 10:50:45 INFO mapred.JobClient:  map 12% reduce 0%
 08/06/03 10:50:47 INFO mapred.JobClient:  map 31% reduce 0%
 08/06/03 10:50:48 INFO mapred.JobClient:  map 49% reduce 0%
 08/06/03 10:50:49 INFO mapred.JobClient:  map 68% reduce 0%
 08/06/03 10:50:50 INFO mapred.JobClient:  map 100% reduce 0%
 08/06/03 10:50:54 INFO mapred.JobClient:  map 87% reduce 0%
 08/06/03 10:50:55 INFO mapred.JobClient:  map 100% reduce 0%
 08/06/03 10:50:56 INFO mapred.JobClient:  map 0% reduce 0%
 08/06/03 10:51:00 INFO mapred.JobClient:  map 0% reduce 1%
 08/06/03 10:51:05 INFO mapred.JobClient:  map 28% reduce 2%
 08/06/03 10:51:07 INFO mapred.JobClient:  map 80% reduce 4%
 08/06/03 10:51:08 INFO mapred.JobClient:  map 100% reduce 4%
 08/06/03 10:51:09 INFO mapred.JobClient:  map 100% reduce 7%
 08/06/03 10:51:10 INFO mapred.JobClient:  map 90% reduce 9%
 08/06/03 10:51:11 INFO mapred.JobClient:  map 100% reduce 9%
 08/06/03 10:51:12 INFO mapred.JobClient:  map 100% reduce 11%
 08/06/03 10:51:13 INFO mapred.JobClient:  map 90% reduce 11%
 08/06/03 10:51:14 INFO mapred.JobClient:  map 97% reduce 11%
 08/06/03 10:51:15 INFO mapred.JobClient:  map 63% reduce 11%
 08/06/03 10:51:16 INFO mapred.JobClient:  map 48% reduce 11%
 08/06/03 10:51:17 INFO mapred.JobClient:  map 21% reduce 11%
 08/06/03 10:51:19 INFO mapred.JobClient:  map 0% reduce 11%
 08/06/03 10:51:20 INFO mapred.JobClient:  map 15% reduce 12%
 08/06/03 10:51:21 INFO mapred.JobClient:  map 27% reduce 13%
 08/06/03 10:51:22 INFO mapred.JobClient:  map 67% reduce 13%
 08/06/03 10:51:24 INFO mapred.JobClient:  map 22% reduce 16%
 08/06/03 10:51:25 INFO mapred.JobClient:  map 46% reduce 16%
 08/06/03 10:51:26 INFO mapred.JobClient:  map 70% reduce 16%
 08/06/03 10:51:27 INFO mapred.JobClient:  map 73% reduce 18%
 08/06/03 10:51:28 INFO mapred.JobClient:  map 85% reduce 19%
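The symptoms above (a TOTAL well above the known row count, and 01852 counted exactly twice) are consistent with records being re-read around input-split boundaries. Below is a minimal, self-contained sketch of that class of failure, in which a per-split reader back-seeks to the nearest begin tag before its split start and therefore re-emits a record that straddles the boundary. This is an illustration of the failure mode only, not the actual StreamXmlRecordReader code; the class name, offsets, and sample data are invented for the demo.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitScanDemo {

    // Naive per-split record scan over <row>...</row> markers. Each split
    // first seeks BACK to the nearest begin tag at or before its start
    // offset, so a record straddling a split boundary is emitted both by
    // the split containing its begin tag and by the next split.
    static List<String> scanSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = data.lastIndexOf("<row>", start);  // buggy back-seek
        while (pos >= 0 && pos < end) {
            int close = data.indexOf("</row>", pos);
            if (close < 0) {
                break;  // no complete record left in the data
            }
            records.add(data.substring(pos, close + "</row>".length()));
            pos = data.indexOf("<row>", close + "</row>".length());
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<row>a</row><row>b</row><row>c</row>";
        int boundary = 14;  // falls inside the second record
        List<String> first = scanSplit(data, 0, boundary);
        List<String> second = scanSplit(data, boundary, data.length());
        System.out.println(first);   // [<row>a</row>, <row>b</row>]
        System.out.println(second);  // [<row>b</row>, <row>c</row>]
    }
}
```

The second split re-emits <row>b</row> because the back-seek lands before the split start; readers of this kind avoid the duplicate by skipping any record whose begin tag lies before the split start, since the previous split already owns it.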
 

[jira] [Commented] (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2013-02-20 Thread Clark Mobarry (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582426#comment-13582426
 ] 

Clark Mobarry commented on MAPREDUCE-577:
-

I found the exact same issue in Hadoop v2.0.0 (via Cloudera CDH 4.1.2).



[jira] [Commented] (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2013-02-20 Thread Clark Mobarry (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582457#comment-13582457
 ] 

Clark Mobarry commented on MAPREDUCE-577:
-

I found the exact same issue in Hadoop 0.20.2 via Cloudera CDH 4.1.2 MRv1.  I 
did not attempt this with MRv2.


[jira] [Commented] (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2012-09-04 Thread Ming Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447568#comment-13447568
 ] 

Ming Jin commented on MAPREDUCE-577:


Hi everyone,

I found the exact same issue in Hadoop v1.0.3 
(http://fossies.org/dox/hadoop-1.0.3/StreamXmlRecordReader_8java_source.html).

Is there any plan to fix it in v1.0.3?


[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-10-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926075#action_12926075
 ] 

Hudson commented on MAPREDUCE-577:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See 
[https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/])



[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-07-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885163#action_12885163
 ] 

Hadoop QA commented on MAPREDUCE-577:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448668/577.v4.patch
  against trunk revision 960446.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/285/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/285/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/285/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/285/console

This message is automatically generated.


[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-07-05 Thread Ravi Gummadi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885173#action_12885173
 ] 

Ravi Gummadi commented on MAPREDUCE-577:


The contrib test failure is because of MAPREDUCE-1834.
All other tests passed.


[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-07-05 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885404#action_12885404
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-577:
---

Thanks Bo Alder for the earlier patches.


[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-06-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883874#action_12883874
 ] 

Hadoop QA commented on MAPREDUCE-577:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448380/577.v3.patch
  against trunk revision 959193.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 14 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/275/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/275/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/275/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/275/console

This message is automatically generated.


[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-06-29 Thread Ravi Gummadi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883507#action_12883507
 ] 

Ravi Gummadi commented on MAPREDUCE-577:


After merging, the file system block size is not updated properly, so I am adding a 
FileSystem.closeAll() call at the beginning of each test case. Will upload a patch soon.


[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-06-28 Thread Ravi Gummadi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883114#action_12883114
 ] 

Ravi Gummadi commented on MAPREDUCE-577:


This patch is on top of the MAPREDUCE-1888 patch, because the test cases are 
refactored in MAPREDUCE-1888.


[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-06-22 Thread Ravi Gummadi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881122#action_12881122
 ] 

Ravi Gummadi commented on MAPREDUCE-577:


In trunk, the test cases were not picking up the block size because TestStreaming (the 
base class of the two tests in this patch) creates the input file through a FileSystem 
object. Since the config fs.local.block.size was being set only after that object was 
created, it had no effect on the FileSystem, causing a single split in both tests.
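
The underlying pitfall is that Hadoop caches FileSystem instances, so configuration 
values set after the first lookup are invisible to the cached object until the cache is 
dropped. A self-contained sketch of that pattern in plain Java (Config and CachedFs are 
illustrative stand-ins, not Hadoop's actual classes; closeAll() plays the role of 
FileSystem.closeAll() in the fix described above):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a configuration object.
class Config {
    private final Map<String, Long> values = new HashMap<>();
    void setLong(String key, long v) { values.put(key, v); }
    long getLong(String key, long dflt) { return values.getOrDefault(key, dflt); }
}

// Illustrative stand-in for a cached filesystem handle.
class CachedFs {
    private static CachedFs instance;           // one cached instance per JVM
    final long blockSize;                       // snapshotted at creation time

    private CachedFs(Config conf) {
        this.blockSize = conf.getLong("fs.local.block.size", 32L * 1024 * 1024);
    }

    static CachedFs get(Config conf) {
        if (instance == null) instance = new CachedFs(conf);
        return instance;                        // later config changes are ignored
    }

    static void closeAll() { instance = null; } // analogue of FileSystem.closeAll()
}

public class CachePitfall {
    public static void main(String[] args) {
        Config conf = new Config();
        CachedFs.get(conf);                     // cache created with the default size

        conf.setLong("fs.local.block.size", 1024);  // set too late to matter
        System.out.println(CachedFs.get(conf).blockSize); // still the default

        CachedFs.closeAll();                    // drop the cache, as the fix does
        System.out.println(CachedFs.get(conf).blockSize); // now 1024
    }
}
```

Calling closeAll() at the start of each test forces the next get() to construct a fresh 
instance that sees the updated block size, which is why the tests then produce the 
expected multiple splits.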


[jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

2010-06-18 Thread Ravi Gummadi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880159#action_12880159
 ] 

Ravi Gummadi commented on MAPREDUCE-577:


Looks like the test case TestStreamXmlMultiOuter is failing in trunk but passing in 
0.20. Will investigate.
