[jira] Updated: (MAPREDUCE-1082) Command line UI for queues' information is broken with hierarchical queues.

2009-11-30 Thread V.V.Chaitanya Krishna (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

V.V.Chaitanya Krishna updated MAPREDUCE-1082:
-

Attachment: MAPREDUCE-1082-3.patch

Uploading patch with the above ideas implemented.

> Command line UI for queues' information is broken with hierarchical queues.
> ---
>
> Key: MAPREDUCE-1082
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1082
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client, jobtracker
>Affects Versions: 0.21.0
>Reporter: Vinod K V
>Assignee: V.V.Chaitanya Krishna
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-1082-1.txt, MAPREDUCE-1082-2.patch, 
> MAPREDUCE-1082-3.patch
>
>
> When the command "./bin/mapred --config ~/tmp/conf/ queue -list" is run, it 
> just hangs. I can see the following in the JT logs:
> {code}
> 2009-10-08 13:19:26,762 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 1 on 5 caught: java.lang.NullPointerException
> at org.apache.hadoop.mapreduce.QueueInfo.write(QueueInfo.java:217)
> at org.apache.hadoop.mapreduce.QueueInfo.write(QueueInfo.java:223)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:126)
> at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:1074)
> at org.apache.hadoop.ipc.Server.access$2400(Server.java:77)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:983)
> {code}
> Same is the case with "./bin/mapred --config ~/tmp/conf/ queue -info 
> "

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784103#action_12784103
 ] 

Hadoop QA commented on MAPREDUCE-1249:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426497/patch-1249-1.txt
  against trunk revision 885530.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/155/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/155/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/155/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/155/console

This message is automatically generated.

> mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
> mapred-default.xml
> 
>
> Key: MAPREDUCE-1249
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Amareshwari Sriramadasu
>Assignee: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1249-1.txt, patch-1249.txt
>
>
> mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
> mapred-default.xml, whereas the default value in Fetcher code is 3 minutes.
> It should be 3 minutes by default, as it was pre-MAPREDUCE-353.
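For reference, a mapred-default.xml entry with the proposed 3-minute (180000 ms) default might look like the following; the description text here is an assumption for illustration, not quoted from the patch:

```xml
<property>
  <name>mapreduce.reduce.shuffle.read.timeout</name>
  <value>180000</value>
  <description>Expert: The maximum amount of time (in milliseconds) a
  reduce task spends trying to read map output from a given host.
  </description>
</property>
```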




[jira] Updated: (MAPREDUCE-1082) Command line UI for queues' information is broken with hierarchical queues.

2009-11-30 Thread V.V.Chaitanya Krishna (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

V.V.Chaitanya Krishna updated MAPREDUCE-1082:
-

Status: Open  (was: Patch Available)

> Command line UI for queues' information is broken with hierarchical queues.
> ---
>
> Key: MAPREDUCE-1082
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1082
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client, jobtracker
>Affects Versions: 0.21.0
>Reporter: Vinod K V
>Assignee: V.V.Chaitanya Krishna
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-1082-1.txt, MAPREDUCE-1082-2.patch
>
>
> When the command "./bin/mapred --config ~/tmp/conf/ queue -list" is run, it 
> just hangs. I can see the following in the JT logs:
> {code}
> 2009-10-08 13:19:26,762 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 1 on 5 caught: java.lang.NullPointerException
> at org.apache.hadoop.mapreduce.QueueInfo.write(QueueInfo.java:217)
> at org.apache.hadoop.mapreduce.QueueInfo.write(QueueInfo.java:223)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:126)
> at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:1074)
> at org.apache.hadoop.ipc.Server.access$2400(Server.java:77)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:983)
> {code}
> Same is the case with "./bin/mapred --config ~/tmp/conf/ queue -info 
> "




[jira] Updated: (MAPREDUCE-1248) Redundant memory copying in StreamKeyValUtil

2009-11-30 Thread Ruibang He (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruibang He updated MAPREDUCE-1248:
--

Attachment: MAPREDUCE-1248-v1.0.patch

An early solution

> Redundant memory copying in StreamKeyValUtil
> 
>
> Key: MAPREDUCE-1248
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1248
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: Ruibang He
>Priority: Minor
> Attachments: MAPREDUCE-1248-v1.0.patch
>
>
> I found that when MROutputThread is collecting the output of the Reducer, it 
> calls StreamKeyValUtil.splitKeyVal(), and two local byte arrays are allocated 
> there for each line of output. These two byte arrays are then passed to the 
> variables key and val. Two memory copies happen here: one in the 
> System.arraycopy() call, the other inside key.set() / val.set().
> This doubles the memory copying for the whole output (which may lead to 
> higher CPU consumption) and causes frequent temporary object allocation.
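The single-copy idea can be sketched as follows. This is an illustrative stand-in (plain byte arrays instead of Hadoop's Text, and a simplified splitKeyVal; the names are invented), not the attached patch: the point is copying each half of the line exactly once, directly from the line buffer.

```java
import java.util.Arrays;

public class SplitSketch {
    // Minimal stand-in for Hadoop's Text.set(byte[], int, int), which
    // copies the given range once into the target's own buffer.
    static byte[] setRange(byte[] src, int start, int len) {
        return Arrays.copyOfRange(src, start, start + len);
    }

    // One-copy split: find the separator and copy each half directly from
    // the line buffer, instead of allocating two temporary arrays with
    // System.arraycopy() and then copying them again inside set().
    static byte[][] splitKeyVal(byte[] line, byte sep) {
        int pos = 0;
        while (pos < line.length && line[pos] != sep) pos++;
        byte[] key = setRange(line, 0, pos);
        byte[] val = (pos < line.length)
            ? setRange(line, pos + 1, line.length - pos - 1)
            : new byte[0];
        return new byte[][] { key, val };
    }
}
```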




[jira] Commented: (MAPREDUCE-1248) Redundant memory copying in StreamKeyValUtil

2009-11-30 Thread Ruibang He (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784098#action_12784098
 ] 

Ruibang He commented on MAPREDUCE-1248:
---

Thanks, Guanyin. The latest trunk has fixed the problem in 
KeyValueLineRecordReader.java, but the problem still exists in 
StreamKeyValUtil.java. A patch with an early solution is attached.

> Redundant memory copying in StreamKeyValUtil
> 
>
> Key: MAPREDUCE-1248
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1248
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: Ruibang He
>Priority: Minor
>
> I found that when MROutputThread is collecting the output of the Reducer, it 
> calls StreamKeyValUtil.splitKeyVal(), and two local byte arrays are allocated 
> there for each line of output. These two byte arrays are then passed to the 
> variables key and val. Two memory copies happen here: one in the 
> System.arraycopy() call, the other inside key.set() / val.set().
> This doubles the memory copying for the whole output (which may lead to 
> higher CPU consumption) and causes frequent temporary object allocation.




[jira] Commented: (MAPREDUCE-1256) org.apache.hadoop.mapred.TestFairScheduler.testPoolAssignment (from TestFairScheduler) is failing in trunk

2009-11-30 Thread Jothi Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784084#action_12784084
 ] 

Jothi Padmanabhan commented on MAPREDUCE-1256:
--

Duplicate of MAPREDUCE-1245?

> org.apache.hadoop.mapred.TestFairScheduler.testPoolAssignment (from 
> TestFairScheduler) is failing in trunk
> --
>
> Key: MAPREDUCE-1256
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1256
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Iyappan Srinivasan
> Fix For: 0.22.0
>
>
> Trunk build is failing. The unit test case that fails is:
> org.apache.hadoop.mapred.TestFairScheduler.testPoolAssignment (from 
> TestFairScheduler) 
> http://hudson.zones.apache.org/hudson/view/Hadoop/job/Hadoop-Mapreduce-trunk/160/testReport/org.apache.hadoop.mapred/TestFairScheduler/testPoolAssignment/
> Error Message
> Timeout occurred. Please note the time in the report does not reflect the 
> time until the timeout.
> Stacktrace
> junit.framework.AssertionFailedError: Timeout occurred. Please note the time 
> in the report does not reflect the time until the timeout




[jira] Created: (MAPREDUCE-1256) org.apache.hadoop.mapred.TestFairScheduler.testPoolAssignment (from TestFairScheduler) is failing in trunk

2009-11-30 Thread Iyappan Srinivasan (JIRA)
org.apache.hadoop.mapred.TestFairScheduler.testPoolAssignment (from 
TestFairScheduler) is failing in trunk
--

 Key: MAPREDUCE-1256
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1256
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 0.22.0
Reporter: Iyappan Srinivasan
 Fix For: 0.22.0


Trunk build is failing. The unit test case that fails is:

org.apache.hadoop.mapred.TestFairScheduler.testPoolAssignment (from 
TestFairScheduler) 

http://hudson.zones.apache.org/hudson/view/Hadoop/job/Hadoop-Mapreduce-trunk/160/testReport/org.apache.hadoop.mapred/TestFairScheduler/testPoolAssignment/

Error Message
Timeout occurred. Please note the time in the report does not reflect the time 
until the timeout.
Stacktrace
junit.framework.AssertionFailedError: Timeout occurred. Please note the time in 
the report does not reflect the time until the timeout





[jira] Resolved: (MAPREDUCE-1255) How to write a custom input format and record reader to read multiple lines of text from files

2009-11-30 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved MAPREDUCE-1255.


Resolution: Invalid

Hi Kunal,

JIRA is meant for issue tracking, not questions. Please email the common-user 
or mapreduce-user mailing list with your question.

Thanks.

> How to write a custom input format and record reader to read multiple lines 
> of text from files
> --
>
> Key: MAPREDUCE-1255
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1255
> Project: Hadoop Map/Reduce
>  Issue Type: Task
>Affects Versions: 0.20.1
> Environment: Ubuntu, 32 bit system. Apache hadoop 0.20.1
>Reporter: Kunal Gupta
>Priority: Minor
>
> Can someone explain how to override the "FileInputFormat" and "RecordReader" 
> in order to be able to read multiple lines of text from input files in a 
> single map task?
> Here the key will be the offset of the first line of text and value will be 
> the N lines of text. 
> I have overridden the class FileInputFormat:
> public class MultiLineFileInputFormat
>   extends FileInputFormat{
> ...
> }
> and implemented the abstract method:
> public RecordReader createRecordReader(InputSplit split,
> TaskAttemptContext context)
>  throws IOException, InterruptedException {...}
> I have also overridden the recordreader class:
> public class MultiLineFileRecordReader extends RecordReader<LongWritable, Text>
> {...}
> and in the job configuration, specified this new InputFormat class:
> job.setInputFormatClass(MultiLineFileInputFormat.class);
> When I run this new map/reduce program, I get the following Java error:
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.NoSuchMethodException: 
> CustomRecordReader$MultiLineFileInputFormat.<init>()
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
>   at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>   at CustomRecordReader.main(CustomRecordReader.java:257)
> Caused by: java.lang.NoSuchMethodException: 
> CustomRecordReader$MultiLineFileInputFormat.<init>()
>   at java.lang.Class.getConstructor0(Class.java:2706)
>   at java.lang.Class.getDeclaredConstructor(Class.java:1985)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
>   ... 5 more




[jira] Commented: (MAPREDUCE-1143) runningMapTasks counter is not properly decremented in case of failed Tasks.

2009-11-30 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784077#action_12784077
 ] 

Hemanth Yamijala commented on MAPREDUCE-1143:
-

After a discussion with Arun, I felt I should clarify a little more about what 
I am proposing. Some details:

- In TaskInProgress.java, introduce:
{code}
  boolean isRunning(TaskAttemptID taskId) {
return activeTasks.containsKey(taskId);
  }
{code}

- Modify JobInProgress.failedTask private API to have an additional parameter 
wasAttemptRunning, which would be initialized in JIP.updateTaskStatus to 
tip.isRunning(status.getTaskID())

- Use wasAttemptRunning only to update the running* counters

I originally thought we could modify wasRunning to indicate whether the attempt 
was running (rather than whether the TIP was running). But after speaking with 
Arun, I feel we should localize the changes as much as possible.
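A condensed, hypothetical sketch of the proposal (the real JobInProgress/TaskInProgress code is much larger; names like launchTask and reportFailure are invented here for illustration). The key point is capturing wasAttemptRunning before the status update mutates activeTasks, and using it only for the running* counters:

```java
import java.util.*;

public class CounterSketch {
    private final Map<String, String> activeTasks = new HashMap<>();
    int runningMapTasks = 0;

    void launchTask(String attemptId) {
        activeTasks.put(attemptId, "RUNNING");
        runningMapTasks++;
    }

    boolean isRunning(String attemptId) {
        return activeTasks.containsKey(attemptId);
    }

    // Mirrors the proposal: capture wasAttemptRunning *before* the status
    // update removes the attempt from activeTasks, then pass it down.
    void reportFailure(String attemptId) {
        boolean wasAttemptRunning = isRunning(attemptId);
        activeTasks.remove(attemptId);
        failedTask(attemptId, wasAttemptRunning);
    }

    private void failedTask(String attemptId, boolean wasAttemptRunning) {
        // Use wasAttemptRunning only to update the running* counters,
        // so a duplicate failure report cannot decrement twice.
        if (wasAttemptRunning) {
            runningMapTasks--;
        }
    }
}
```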

> runningMapTasks counter is not properly decremented in case of failed Tasks.
> 
>
> Key: MAPREDUCE-1143
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1143
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: rahul k singh
>Priority: Blocker
> Attachments: MAPRED-1143-1.patch, MAPRED-1143-2.patch, 
> MAPRED-1143-2.patch, MAPRED-1143-3.patch, MAPRED-1143-4.patch, 
> MAPRED-1143-ydist-1.patch, MAPRED-1143-ydist-2.patch, 
> MAPRED-1143-ydist-3.patch, MAPRED-1143-ydist-4.patch, 
> MAPRED-1143-ydist-5.patch, MAPRED-1143-ydist-6.patch
>
>





[jira] Created: (MAPREDUCE-1255) How to write a custom input format and record reader to read multiple lines of text from files

2009-11-30 Thread Kunal Gupta (JIRA)
How to write a custom input format and record reader to read multiple lines of 
text from files
--

 Key: MAPREDUCE-1255
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1255
 Project: Hadoop Map/Reduce
  Issue Type: Task
Affects Versions: 0.20.1
 Environment: Ubuntu, 32 bit system. Apache hadoop 0.20.1
Reporter: Kunal Gupta
Priority: Minor


Can someone explain how to override the "FileInputFormat" and "RecordReader" in 
order to be able to read multiple lines of text from input files in a single 
map task?

Here the key will be the offset of the first line of text and value will be the 
N lines of text. 

I have overridden the class FileInputFormat:

public class MultiLineFileInputFormat
extends FileInputFormat{
...
}

and implemented the abstract method:

public RecordReader createRecordReader(InputSplit split,
TaskAttemptContext context)
 throws IOException, InterruptedException {...}

I have also overridden the recordreader class:

public class MultiLineFileRecordReader extends RecordReader
{...}

and in the job configuration, specified this new InputFormat class:

job.setInputFormatClass(MultiLineFileInputFormat.class);

When I run this new map/reduce program, I get the following Java error:

Exception in thread "main" java.lang.RuntimeException: 
java.lang.NoSuchMethodException: 
CustomRecordReader$MultiLineFileInputFormat.<init>()
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at CustomRecordReader.main(CustomRecordReader.java:257)
Caused by: java.lang.NoSuchMethodException: 
CustomRecordReader$MultiLineFileInputFormat.<init>()
at java.lang.Class.getConstructor0(Class.java:2706)
at java.lang.Class.getDeclaredConstructor(Class.java:1985)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
... 5 more
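The NoSuchMethodException above is the tell-tale sign that MultiLineFileInputFormat was declared as a non-static inner class: its implicit constructor takes the enclosing instance as a hidden parameter, so reflection-based instantiation (as ReflectionUtils performs) finds no no-argument constructor. Declaring the nested class static (or moving it to the top level) fixes this. A minimal sketch, with invented names:

```java
import java.lang.reflect.Constructor;

public class ReflectionSketch {
    // A non-static inner class gets a hidden constructor parameter (the
    // enclosing instance), so getDeclaredConstructor() with no arguments
    // throws NoSuchMethodException -- the error quoted above.
    class Inner {}                 // no usable no-arg constructor
    static class Nested {}        // has a true no-arg constructor

    static boolean hasNoArgCtor(Class<?> c) {
        try {
            Constructor<?> ctor = c.getDeclaredConstructor();
            return ctor != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```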





[jira] Commented: (MAPREDUCE-222) Shuffle should be refactored to a separate task by itself

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784075#action_12784075
 ] 

ZhuGuanyin commented on MAPREDUCE-222:
--

I think it would be better if the shuffle and sort phases were separated from 
the reduce task.

1) In the current implementation, a rescheduled reduce must shuffle and sort 
again if the former reduce task failed. For example, the shuffle and sort 
phases cost a lot of time if a reduce needs to fetch map output from 100k maps.

2) We could shuffle and sort while another job's or task's reducers are 
running, which would maximize resource utilization. In the current 
implementation, reduce slots are consumed while a reduce is shuffling or 
waiting for maps to finish.

3) We could localize the reduce task on the tasktracker where the data has 
already been shuffled.

> Shuffle should be refactored to a separate task by itself
> -
>
> Key: MAPREDUCE-222
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-222
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Devaraj Das
>
> Currently, shuffle phase is part of the reduce task. The idea here is to move 
> out the shuffle as a first-class task. This will improve the usage of the 
> network since we will then be able to schedule shuffle tasks independently, 
> and later on pin reduce tasks to those nodes. This will make most sense for 
> apps where there are multiple waves of reduces (the second wave of reduces 
> can directly start off doing the "reducer" phase).




[jira] Updated: (MAPREDUCE-1075) getQueue(String queue) in JobTracker would return NPE for invalid queue name

2009-11-30 Thread V.V.Chaitanya Krishna (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

V.V.Chaitanya Krishna updated MAPREDUCE-1075:
-

Status: Open  (was: Patch Available)

> getQueue(String queue) in JobTracker would return NPE for invalid queue name
> 
>
> Key: MAPREDUCE-1075
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1075
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: V.V.Chaitanya Krishna
>Assignee: V.V.Chaitanya Krishna
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-1075-1.patch, MAPREDUCE-1075-2.patch, 
> MAPREDUCE-1075-3.patch, MAPREDUCE-1075-4.patch, MAPREDUCE-1075-5.patch, 
> MAPREDUCE-1075-6.patch
>
>





[jira] Updated: (MAPREDUCE-1075) getQueue(String queue) in JobTracker would return NPE for invalid queue name

2009-11-30 Thread V.V.Chaitanya Krishna (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

V.V.Chaitanya Krishna updated MAPREDUCE-1075:
-

Attachment: MAPREDUCE-1075-6.patch

Uploading patch with the above comments implemented. 

> getQueue(String queue) in JobTracker would return NPE for invalid queue name
> 
>
> Key: MAPREDUCE-1075
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1075
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: V.V.Chaitanya Krishna
>Assignee: V.V.Chaitanya Krishna
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-1075-1.patch, MAPREDUCE-1075-2.patch, 
> MAPREDUCE-1075-3.patch, MAPREDUCE-1075-4.patch, MAPREDUCE-1075-5.patch, 
> MAPREDUCE-1075-6.patch
>
>





[jira] Commented: (MAPREDUCE-1099) Setup and cleanup tasks could affect job latency if they are caught running on bad nodes

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784067#action_12784067
 ] 

ZhuGuanyin commented on MAPREDUCE-1099:
---

We have encountered the same problem, so we just removed the setup and cleanup 
tasks (by importing the patch from 
https://issues.apache.org/jira/browse/MAPREDUCE-463).

> Setup and cleanup tasks could affect job latency if they are caught running 
> on bad nodes
> 
>
> Key: MAPREDUCE-1099
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1099
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.1
>Reporter: Hemanth Yamijala
>
> We found cases on our clusters where a setup task got scheduled on a bad node 
> and took upwards of several minutes to run, adversely affecting job runtimes. 
> Speculation did not help here as speculation is not used for setup tasks. I 
> suspect the same could happen for cleanup tasks as well.




[jira] Updated: (MAPREDUCE-353) Allow shuffle read and connection timeouts to be configurable

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-353:
--

Attachment: patch-353-ydist.txt

Patch for Yahoo! distribution

> Allow shuffle read and connection timeouts to be configurable
> -
>
> Key: MAPREDUCE-353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-353
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.21.0
>Reporter: Arun C Murthy
>Assignee: Ravi Gummadi
> Fix For: 0.21.0
>
> Attachments: MR-353.patch, MR-353.v1.patch, patch-353-ydist.txt
>
>
> It would be good for latency-sensitive applications to tune the shuffle 
> read/connection timeouts... in fact this made a huge difference to terasort 
> since we were seeing individual shuffles stuck for upwards of 60s and had to 
> have a very small read timeout.




[jira] Updated: (MAPREDUCE-1136) ConcurrentModificationException when tasktracker updates task status to jobtracker

2009-11-30 Thread Qi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Liu updated MAPREDUCE-1136:
--

Attachment: MAPREDUCE-1136.patch

Patch for trunk.

> ConcurrentModificationException when tasktracker updates task status to 
> jobtracker
> --
>
> Key: MAPREDUCE-1136
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1136
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Qi Liu
> Attachments: MAPREDUCE-1136.0.18.3.patch, MAPREDUCE-1136.patch
>
>
> In Hadoop 0.18.3, the following exception happened during a job execution. It 
> does not happen often.
> Here is the stack trace of the exception.
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: 
> java.util.ConcurrentModificationException
> at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
> at java.util.TreeMap$ValueIterator.next(TreeMap.java:1145)
> at 
> org.apache.hadoop.mapred.JobTracker.getAllJobs(JobTracker.java:2376)
> at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
> at org.apache.hadoop.ipc.Client.call(Client.java:716)
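The stack trace above is TreeMap's fail-fast iterator detecting a structural modification mid-iteration. A small self-contained sketch (not the JobTracker code; in the real bug the modification came from another thread) reproduces the failure and shows one common fix, snapshotting the values before iterating:

```java
import java.util.*;

public class CmeSketch {
    // Iterating a TreeMap's values while the map is structurally modified
    // (here in the same thread for determinism) fails fast with
    // ConcurrentModificationException, as in the stack trace above.
    static boolean iterateWhileModifying(TreeMap<String, String> jobs) {
        try {
            for (String j : jobs.values()) {
                jobs.put("new-job", "x");   // structural modification mid-iteration
            }
            return false;                   // no CME raised
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // One common fix: snapshot the values under the map's lock before
    // iterating, so the iterator never observes a structural change.
    static List<String> snapshotValues(TreeMap<String, String> jobs) {
        synchronized (jobs) {
            return new ArrayList<>(jobs.values());
        }
    }
}
```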




[jira] Commented: (MAPREDUCE-1185) URL to JT webconsole for running job and job history should be the same

2009-11-30 Thread Jothi Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784060#action_12784060
 ] 

Jothi Padmanabhan commented on MAPREDUCE-1185:
--

Regarding the trunk patch: since you are already removing entries from the 
{{jobHistoryFileMap}} while iterating through it to handle the manual-move 
scenario, I think the {{jobHistoryFileMap.remove()}} call while deleting the 
history file is redundant and can be removed.

> URL to JT webconsole for running job and job history should be the same
> ---
>
> Key: MAPREDUCE-1185
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1185
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker
>Reporter: Sharad Agarwal
>Assignee: Sharad Agarwal
> Attachments: 1185_v1.patch, 1185_v2.patch, 1185_v3.patch, 
> 1185_v4.patch, 1185_v5.patch, patch-1185-1-ydist.txt, patch-1185-ydist.txt
>
>
> The tracking URL for running jobs and for retired jobs is different. This 
> creates a problem for clients that cache a running job's URL, because the 
> URL becomes invalid as soon as the job is retired.




[jira] Updated: (MAPREDUCE-1185) URL to JT webconsole for running job and job history should be the same

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-1185:
---

Attachment: patch-1185-1-ydist.txt

The test for redirection was missing in the earlier Y!20 patch. Added the test 
in the attached patch.

> URL to JT webconsole for running job and job history should be the same
> ---
>
> Key: MAPREDUCE-1185
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1185
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker
>Reporter: Sharad Agarwal
>Assignee: Sharad Agarwal
> Attachments: 1185_v1.patch, 1185_v2.patch, 1185_v3.patch, 
> 1185_v4.patch, 1185_v5.patch, patch-1185-1-ydist.txt, patch-1185-ydist.txt
>
>
> The tracking URL for running jobs and for retired jobs is different. This 
> creates a problem for clients that cache a running job's URL, because the 
> URL becomes invalid as soon as the job is retired.




[jira] Updated: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-1249:
---

Assignee: Amareshwari Sriramadasu
  Status: Open  (was: Patch Available)

> mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
> mapred-default.xml
> 
>
> Key: MAPREDUCE-1249
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Amareshwari Sriramadasu
>Assignee: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1249-1.txt, patch-1249.txt
>
>
> mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
> mapred-default.xml, whereas the default value in Fetcher code is 3 minutes.
> It should be 3 minutes by default, as it was pre-MAPREDUCE-353.




[jira] Updated: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-1249:
---

Status: Patch Available  (was: Open)

Test failures are unrelated to the patch. Resubmitting for Hudson.

> mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
> mapred-default.xml
> 
>
> Key: MAPREDUCE-1249
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Amareshwari Sriramadasu
>Assignee: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1249-1.txt, patch-1249.txt
>
>
> mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
> mapred-default.xml, whereas the default value in Fetcher code is 3 minutes.
> It should be 3 minutes by default, as it was pre-MAPREDUCE-353.




[jira] Updated: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-1249:
---

Attachment: patch-1249-1.txt

Patch updating the doc.

> mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
> mapred-default.xml
> 
>
> Key: MAPREDUCE-1249
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Amareshwari Sriramadasu
>Assignee: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1249-1.txt, patch-1249.txt
>
>
> mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
> mapred-default.xml, whereas the default value in Fetcher code is 3 minutes.
> It should be 3 minutes by default, as it was before MAPREDUCE-353.




[jira] Commented: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784049#action_12784049
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1249:


The documentation says that the parameters mapreduce.reduce.shuffle.connect.timeout 
and mapreduce.reduce.shuffle.read.timeout are cluster-wide parameters, but they 
are actually job-level parameters. I think the "Expert" tag conveys the fact 
that users are not supposed to play with them. 

> mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
> mapred-default.xml
> 
>
> Key: MAPREDUCE-1249
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1249.txt
>
>
> mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
> mapred-default.xml, whereas the default value in Fetcher code is 3 minutes.
> It should be 3 minutes by default, as it was before MAPREDUCE-353.




[jira] Commented: (MAPREDUCE-1229) [Mumak] Allow customization of job submission policy

2009-11-30 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784046#action_12784046
 ] 

Hong Tang commented on MAPREDUCE-1229:
--

Attached new patch that addresses the comments by Dick.

bq. 1: Should TestSimulator*JobSubmission check to see whether the total 
"runtime" was reasonable for the Policy?
Currently, each policy is tested as a separate test case. It may be hard to 
combine them and compare the virtual runtime, which is only present as console 
output. I did some basic sanity checks manually after the run.

bq. 2: minor nit: Should SimulatorJobSubmissionPolicy/getPolicy(Configuration) 
use valueOf(policy.toUpper()) instead of looping through the types?
Updated in the patch based on the suggestion.
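Dick's point 2 can be sketched roughly like this; the enum and method names below are illustrative, not Mumak's actual SimulatorJobSubmissionPolicy code:

```java
// Sketch of resolving a submission policy from a config string with
// Enum.valueOf(toUpperCase()) instead of looping through the enum values.
public class PolicyLookup {
    enum SubmissionPolicy { REPLAY, SEQUENTIAL, STRESS }

    // Hypothetical helper: falls back to the default on null or unknown input.
    static SubmissionPolicy getPolicy(String configured, SubmissionPolicy dflt) {
        if (configured == null) {
            return dflt;
        }
        try {
            return SubmissionPolicy.valueOf(configured.toUpperCase());
        } catch (IllegalArgumentException e) {
            return dflt; // unknown policy name: keep the default
        }
    }

    public static void main(String[] args) {
        System.out.println(getPolicy("stress", SubmissionPolicy.REPLAY)); // STRESS
        System.out.println(getPolicy(null, SubmissionPolicy.REPLAY));     // REPLAY
    }
}
```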

bq. 3: medium sized nit: in SimulatorJobClient.isOverloaded() there are two 
literals, 0.9 and 2.0F, that ought to be static private named values.
Added final variables to represent the magic constants, and added comments.

bq. 4: Here is my biggest point. The existing code cannot submit a job more 
often than once every five seconds when the jobs were spaced further apart than 
that and the policy is STRESS .
bq. 
bq. Please consider adding code to call the processLoadProbingEvent core code 
when we processJobCompleteEvent or a processJobSubmitEvent . That includes 
potentially adding a new LoadProbingEvent . This can lead to an accumulation 
because each LoadProbingEvent replaces itself, so we should track the ones that 
are in flight in a PriorityQueue and only add a new LoadProbingEvent whenever 
the new event has a time stamp strictly earlier than the earliest one already 
in flight. This will limit us to two events in flight with the current 
adjustLoadProbingInterval .
bq. 
bq. If you don't do that, then if a real dreadnaught of a job gets dropped into 
the system and the probing interval gets long it could take us a while to 
notice that we're okay to submit jobs, in the case where the job has many tasks 
finishing at about the same time, and we could submit tiny jobs as onsies every 
five seconds when the cluster is clear enough to accommodate lots of jobs. When 
the cluster can handle N jobs in less than 5N seconds for some N, we won't 
overload it with the existing code.
I changed the minimum load probing interval to 1 second (from 5 seconds). Note 
that when a job is submitted, it can take a few seconds before the JT assigns 
the map tasks to TTs with free map slots, so reducing this interval further 
could lead to artificial load spikes.

I also added load checks after each job completion: if the cluster is 
underloaded, we submit another job (and reset the load-checking interval to the 
minimum value). This does bring a potential danger that many jobs happen to 
complete at the same time and we inject a lot of jobs into the system, but I 
think that risk is fairly low, so I would not worry much about it.
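Dick's suggestion in point 4, tracking the in-flight LoadProbingEvents in a priority queue and only adding a new one when it would fire strictly earlier than the earliest pending one, could be sketched as follows. The class and method names are illustrative, not the actual patch:

```java
import java.util.PriorityQueue;

// Minimal sketch: a min-heap of pending probe timestamps. A new probe is
// scheduled only if it fires strictly earlier than every probe in flight,
// which bounds the number of simultaneously pending probes.
public class ProbeTracker {
    private final PriorityQueue<Long> inFlight = new PriorityQueue<>();

    /** Returns true if a probe at timestamp t should actually be enqueued. */
    boolean maybeSchedule(long t) {
        Long earliest = inFlight.peek();
        if (earliest != null && t >= earliest) {
            return false; // a pending probe already fires no later than t
        }
        inFlight.add(t);
        return true;
    }

    /** Called when a probe fires; removes it from the in-flight set. */
    void onFired(long t) {
        inFlight.remove(t);
    }
}
```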

> [Mumak] Allow customization of job submission policy
> 
>
> Key: MAPREDUCE-1229
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1229
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/mumak
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Hong Tang
>Assignee: Hong Tang
> Fix For: 0.21.0, 0.22.0
>
> Attachments: mapreduce-1229-20091121.patch, 
> mapreduce-1229-20091123.patch, mapreduce-1229-20091130.patch
>
>
> Currently, mumak replays job submissions faithfully. To make mumak useful 
> for evaluation purposes, it would be great if we could support other job 
> submission policies, such as sequential or stress job submission.




[jira] Updated: (MAPREDUCE-1229) [Mumak] Allow customization of job submission policy

2009-11-30 Thread Hong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Tang updated MAPREDUCE-1229:
-

Attachment: mapreduce-1229-20091130.patch

> [Mumak] Allow customization of job submission policy
> 
>
> Key: MAPREDUCE-1229
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1229
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/mumak
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Hong Tang
>Assignee: Hong Tang
> Fix For: 0.21.0, 0.22.0
>
> Attachments: mapreduce-1229-20091121.patch, 
> mapreduce-1229-20091123.patch, mapreduce-1229-20091130.patch
>
>
> Currently, mumak replays job submissions faithfully. To make mumak useful 
> for evaluation purposes, it would be great if we could support other job 
> submission policies, such as sequential or stress job submission.




[jira] Updated: (MAPREDUCE-1229) [Mumak] Allow customization of job submission policy

2009-11-30 Thread Hong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Tang updated MAPREDUCE-1229:
-

Status: Open  (was: Patch Available)

> [Mumak] Allow customization of job submission policy
> 
>
> Key: MAPREDUCE-1229
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1229
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/mumak
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Hong Tang
>Assignee: Hong Tang
> Fix For: 0.21.0, 0.22.0
>
> Attachments: mapreduce-1229-20091121.patch, 
> mapreduce-1229-20091123.patch
>
>
> Currently, mumak replays job submissions faithfully. To make mumak useful 
> for evaluation purposes, it would be great if we could support other job 
> submission policies, such as sequential or stress job submission.




[jira] Commented: (MAPREDUCE-1251) c++ utils doesn't compile

2009-11-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784035#action_12784035
 ] 

Hadoop QA commented on MAPREDUCE-1251:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426461/MR-1251.patch
  against trunk revision 885530.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/281/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/281/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/281/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/281/console

This message is automatically generated.

> c++ utils doesn't compile
> -
>
> Key: MAPREDUCE-1251
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1251
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: ubuntu karmic 64-bit
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: HDFS-790-1.patch, HDFS-790.patch, MR-1251.patch
>
>
> c++ utils doesn't compile on ubuntu karmic 64-bit. The latest patch for 
> HADOOP-5611 needs to be applied first.




[jira] Created: (MAPREDUCE-1254) job.xml should add crc check in tasktracker and sub jvm.

2009-11-30 Thread ZhuGuanyin (JIRA)
job.xml should add crc check in tasktracker and sub jvm.


 Key: MAPREDUCE-1254
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1254
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: task, tasktracker
Affects Versions: 0.22.0
Reporter: ZhuGuanyin


Currently job.xml is written to local disk by the tasktracker and the sub-JVM 
through ChecksumFileSystem, so crc checksum information already exists, but 
the file is loaded back without a crc check. This can cause a mapred job to 
finish successfully but with wrong data because of a disk error. Example: the 
tasktracker and the sub task JVM fall back to the default configuration if 
they fail to load job.xml, which may silently replace the mapper with 
IdentityMapper. 
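The missing verification step can be illustrated in miniature with java.util.zip.CRC32: recompute a checksum over the bytes read back and compare it to the stored value, failing loudly instead of silently falling back to defaults. This sketches the principle only; Hadoop's ChecksumFileSystem actually stores per-chunk CRCs in a companion ".file.crc" file, and the names below are illustrative:

```java
import java.util.zip.CRC32;

// Hypothetical sketch of the check the report says is missing when job.xml
// is read back from local disk.
public class CrcCheck {
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Fail loudly on mismatch rather than proceeding with corrupt config. */
    static void verify(byte[] jobXmlBytes, long storedCrc) {
        if (crcOf(jobXmlBytes) != storedCrc) {
            throw new IllegalStateException(
                "job.xml checksum mismatch: possible disk corruption");
        }
    }
}
```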




[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784019#action_12784019
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

I agree: separating the overtime lock method from the heartbeat thread and 
never doing I/O operations while holding locks is the best solution. We tried 
that, but found it is not easy to achieve and will not be resolved soon, so I 
propose a temporary solution. 

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the jobtracker through the 
> heartbeat. Sometimes the tasktracker takes a lock to do some cleanup job, 
> like removing task temp data on disk, and the heartbeat thread hangs for a 
> long time while waiting for the lock, so the jobtracker thinks the tracker 
> is lost and reschedules all its finished maps or unfinished reduces on 
> other tasktrackers. We call this a "fake lost tasktracker", and sometimes it 
> is not acceptable, especially when we run large jobs. So we introduce an 
> out-of-band heartbeat mechanism to send a heartbeat in that case.




[jira] Commented: (MAPREDUCE-1251) c++ utils doesn't compile

2009-11-30 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784017#action_12784017
 ] 

Eli Collins commented on MAPREDUCE-1251:


Thanks Todd. Also tested the patch against Centos 5.4 64-bit.

> c++ utils doesn't compile
> -
>
> Key: MAPREDUCE-1251
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1251
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: ubuntu karmic 64-bit
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: HDFS-790-1.patch, HDFS-790.patch, MR-1251.patch
>
>
> c++ utils doesn't compile on ubuntu karmic 64-bit. The latest patch for 
> HADOOP-5611 needs to be applied first.




[jira] Commented: (MAPREDUCE-1252) Shuffle deadlocks on wrong number of maps

2009-11-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784011#action_12784011
 ] 

Hadoop QA commented on MAPREDUCE-1252:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426479/mr-1252.patch
  against trunk revision 885530.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/154/console

This message is automatically generated.

> Shuffle deadlocks on wrong number of maps
> -
>
> Key: MAPREDUCE-1252
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1252
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: mr-1252.patch
>
>
> The new shuffle assumes that the number of maps is correct. The new 
> JobSubmitter sets the old value. Something misfires in the middle causing:
> 09/12/01 00:00:15 WARN conf.Configuration: mapred.job.split.file is 
> deprecated. Instead, use mapreduce.job.splitfile
> 09/12/01 00:00:15 WARN conf.Configuration: mapred.map.tasks is deprecated. 
> Instead, use mapreduce.job.maps
> But my reduces got stuck at 2 maps / 12 when there were only 2 maps in the 
> job.




[jira] Updated: (MAPREDUCE-1136) ConcurrentModificationException when tasktracker updates task status to jobtracker

2009-11-30 Thread Qi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Liu updated MAPREDUCE-1136:
--

Attachment: MAPREDUCE-1136.0.18.3.patch

Patch for 0.18.3 branch.

> ConcurrentModificationException when tasktracker updates task status to 
> jobtracker
> --
>
> Key: MAPREDUCE-1136
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1136
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Qi Liu
> Attachments: MAPREDUCE-1136.0.18.3.patch
>
>
> In Hadoop 0.18.3, the following exception happened during a job execution. It 
> does not happen often.
> Here is the stack trace of the exception.
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: 
> java.util.ConcurrentModificationException
> at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
> at java.util.TreeMap$ValueIterator.next(TreeMap.java:1145)
> at 
> org.apache.hadoop.mapred.JobTracker.getAllJobs(JobTracker.java:2376)
> at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
>  org.apache.hadoop.ipc.Client.call(Client.java:716)




[jira] Commented: (MAPREDUCE-1251) c++ utils doesn't compile

2009-11-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783990#action_12783990
 ] 

Todd Lipcon commented on MAPREDUCE-1251:


+1 - started a new karmic VM on ec2, verified build failure, applied 
MR-1251.patch, and verified a successful build.

This should be committed to branch-20 as well. From the 0.20.1 tarball, I had 
to apply HADOOP-5612, HADOOP-5611, and MR-1251.patch before pipes would build 
(this same patch applies cleanly to that tarball). Doesn't look like any of 
those are in branch-20, but they all are necessary if we consider build failure 
to be critical.

> c++ utils doesn't compile
> -
>
> Key: MAPREDUCE-1251
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1251
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: ubuntu karmic 64-bit
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: HDFS-790-1.patch, HDFS-790.patch, MR-1251.patch
>
>
> c++ utils doesn't compile on ubuntu karmic 64-bit. The latest patch for 
> HADOOP-5611 needs to be applied first.




[jira] Created: (MAPREDUCE-1253) Making Mumak work with Capacity-Scheduler

2009-11-30 Thread Anirban Dasgupta (JIRA)
Making Mumak work with Capacity-Scheduler
-

 Key: MAPREDUCE-1253
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1253
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: contrib/mumak
Affects Versions: 0.21.0, 0.22.0
Reporter: Anirban Dasgupta
Assignee: Anirban Dasgupta
Priority: Minor


In order to make the capacity-scheduler work in the mumak simulation 
environment, we have to replace the job-initialization threads of the capacity 
scheduler with classes that perform event-based initialization. We propose to 
use aspectj to disable the threads  of the JobInitializationPoller class used 
by the Capacity Scheduler, and then perform the corresponding initialization 
tasks through a simulation job-initialization class that receives periodic 
wake-up calls from the simulator engine.




[jira] Updated: (MAPREDUCE-1252) Shuffle deadlocks on wrong number of maps

2009-11-30 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated MAPREDUCE-1252:
-

Attachment: mr-1252.patch

> Shuffle deadlocks on wrong number of maps
> -
>
> Key: MAPREDUCE-1252
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1252
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: mr-1252.patch
>
>
> The new shuffle assumes that the number of maps is correct. The new 
> JobSubmitter sets the old value. Something misfires in the middle causing:
> 09/12/01 00:00:15 WARN conf.Configuration: mapred.job.split.file is 
> deprecated. Instead, use mapreduce.job.splitfile
> 09/12/01 00:00:15 WARN conf.Configuration: mapred.map.tasks is deprecated. 
> Instead, use mapreduce.job.maps
> But my reduces got stuck at 2 maps / 12 when there were only 2 maps in the 
> job.




[jira] Updated: (MAPREDUCE-1252) Shuffle deadlocks on wrong number of maps

2009-11-30 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated MAPREDUCE-1252:
-

Status: Patch Available  (was: Open)

> Shuffle deadlocks on wrong number of maps
> -
>
> Key: MAPREDUCE-1252
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1252
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: mr-1252.patch
>
>
> The new shuffle assumes that the number of maps is correct. The new 
> JobSubmitter sets the old value. Something misfires in the middle causing:
> 09/12/01 00:00:15 WARN conf.Configuration: mapred.job.split.file is 
> deprecated. Instead, use mapreduce.job.splitfile
> 09/12/01 00:00:15 WARN conf.Configuration: mapred.map.tasks is deprecated. 
> Instead, use mapreduce.job.maps
> But my reduces got stuck at 2 maps / 12 when there were only 2 maps in the 
> job.




[jira] Created: (MAPREDUCE-1252) Shuffle deadlocks on wrong number of maps

2009-11-30 Thread Owen O'Malley (JIRA)
Shuffle deadlocks on wrong number of maps
-

 Key: MAPREDUCE-1252
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1252
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 0.21.0, 0.22.0
Reporter: Owen O'Malley
Assignee: Owen O'Malley
Priority: Blocker
 Fix For: 0.21.0, 0.22.0
 Attachments: mr-1252.patch

The new shuffle assumes that the number of maps is correct. The new 
JobSubmitter sets the old value. Something misfires in the middle causing:

09/12/01 00:00:15 WARN conf.Configuration: mapred.job.split.file is deprecated. 
Instead, use mapreduce.job.splitfile
09/12/01 00:00:15 WARN conf.Configuration: mapred.map.tasks is deprecated. 
Instead, use mapreduce.job.maps

But my reduces got stuck at 2 maps / 12 when there were only 2 maps in the job.





[jira] Commented: (MAPREDUCE-1222) [Mumak] We should not include nodes with numeric ips in cluster topology.

2009-11-30 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783971#action_12783971
 ] 

Hong Tang commented on MAPREDUCE-1222:
--

@dick, thanks for the comments. I am combing through the JDK code too and 
struggling to find any explicit API. I finally found calls that directly 
translate IPv4/6 addresses from string literal form to Inet{4|6}Address 
objects, but they are not in the public JDK; they are in 
sun.net.util.IPAddressUtil, which makes them unusable (unless we mandate that 
everybody use the Sun JVM).

What bothers me about re-implementing RFC 2372 (IPv6 address architecture) is 
that it is a complicated scheme, and we would probably need a suite of unit 
tests to guarantee our implementation is correct - which sounds to me way 
beyond the scope of this jira (which should not even exist if HDFS-778 is 
fixed).



> [Mumak] We should not include nodes with numeric ips in cluster topology.
> -
>
> Key: MAPREDUCE-1222
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1222
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/mumak
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Hong Tang
>Assignee: Hong Tang
> Fix For: 0.21.0, 0.22.0
>
> Attachments: IPv6-predicate.patch, mapreduce-1222-20091119.patch, 
> mapreduce-1222-20091121.patch
>
>
> Rumen infers cluster topology by parsing input split locations from job 
> history logs. Due to HDFS-778, a cluster node may appear both as a numeric 
> IP and as a host name in job history logs. We should exclude nodes that 
> appear as numeric IPs from the cluster topology when we run mumak, until a 
> solution is found so that numeric IPs never appear in input split locations.




[jira] Commented: (MAPREDUCE-1222) [Mumak] We should not include nodes with numeric ips in cluster topology.

2009-11-30 Thread Dick King (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783962#action_12783962
 ] 

Dick King commented on MAPREDUCE-1222:
--

After I wrote my comment of 24/Nov/09 07:59 PM, I looked at the Java API 
because I came to wonder whether unescaping and using the Java API could be 
made to work by itself. I did look for alternatives before I created my big 
regular expression.

The big problem is that Java doesn't really present any API that distinguishes 
numeric IP addresses from symbolic addresses.  Although 
InetAddress.getByName(String) must have some means of parsing an IPV4 and IPV6 
literal numeric address, this functionality is not presented to java.net.* 
users.  InetAddress.getByName(String) will parse either a numeric address or a 
symbolic name and produce indistinguishable results.  That piece of the API 
does not give us a means to distinguish the two.  I was unable to find any 
other API that did make the distinction.

The formats of numeric literal IPV4 and IPV6 internet addresses are fixed in 
RFCs and are extremely unlikely to be changed in the foreseeable future.  We 
are therefore not exposed to any non-future-proofing.  The only exposure we 
have is a possible future IPV8, but the ICANN is doing its best to make that 
unnecessary for a very long time.

Considering that Apache already owns this regular expression we should consider 
using it.

I considered the simpler approach of treating any address that contains a 
colon character as a numeric IPv6 address, but colons are also used as other 
punctuation, e.g., to separate an IP address from a port number. That solution 
felt too brittle and accident-prone, and it doesn't solve the IPV8 problem. 
There is a continuum of IPv6 solutions ranging from "look for a colon" to the 
correct regular expression you see here, and no principled way to decide where 
to stop.
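The IPv4 half of the "recognize numeric literals" problem is small enough to show; the sketch below covers only plain dotted-quad IPv4 literals, not the much harder IPv6 grammar that this thread is actually debating, and it is not the regular expression from the attached patch:

```java
import java.util.regex.Pattern;

// Minimal dotted-quad matcher: four octets 0-255 separated by dots.
// Illustrative only; the patch under review also handles IPv6 forms.
public class IpLiteral {
    private static final Pattern IPV4 = Pattern.compile(
        "((25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.){3}" +
        "(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)");

    static boolean isIpv4Literal(String s) {
        return IPV4.matcher(s).matches();
    }
}
```

This is the kind of check that would let topology-building code drop split locations that are numeric IPs rather than host names.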

> [Mumak] We should not include nodes with numeric ips in cluster topology.
> -
>
> Key: MAPREDUCE-1222
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1222
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/mumak
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Hong Tang
>Assignee: Hong Tang
> Fix For: 0.21.0, 0.22.0
>
> Attachments: IPv6-predicate.patch, mapreduce-1222-20091119.patch, 
> mapreduce-1222-20091121.patch
>
>
> Rumen infers cluster topology by parsing input split locations from job 
> history logs. Due to HDFS-778, a cluster node may appear both as a numeric 
> IP and as a host name in job history logs. We should exclude nodes that 
> appear as numeric IPs from the cluster topology when we run mumak, until a 
> solution is found so that numeric IPs never appear in input split locations.




[jira] Updated: (MAPREDUCE-1251) c++ utils doesn't compile

2009-11-30 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated MAPREDUCE-1251:
---

Status: Patch Available  (was: Open)

> c++ utils doesn't compile
> -
>
> Key: MAPREDUCE-1251
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1251
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: ubuntu karmic 64-bit
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: HDFS-790-1.patch, HDFS-790.patch, MR-1251.patch
>
>
> c++ utils doesn't compile on ubuntu karmic 64-bit. The latest patch for 
> HADOOP-5611 needs to be applied first.




[jira] Updated: (MAPREDUCE-1251) c++ utils doesn't compile

2009-11-30 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated MAPREDUCE-1251:
---

Status: Open  (was: Patch Available)

> c++ utils doesn't compile
> -
>
> Key: MAPREDUCE-1251
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1251
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: ubuntu karmic 64-bit
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: HDFS-790-1.patch, HDFS-790.patch, MR-1251.patch
>
>
> c++ utils doesn't compile on ubuntu karmic 64-bit. The latest patch for 
> HADOOP-5611 needs to be applied first.




[jira] Commented: (MAPREDUCE-1222) [Mumak] We should not include nodes with numeric ips in cluster topology.

2009-11-30 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783948#action_12783948
 ] 

Hong Tang commented on MAPREDUCE-1222:
--

@dick, thanks for the help. The solution is more complex than I would like 
for the trivial problem this jira represents.

I suggest we go for a less efficient but more direct and thus 
easier-to-maintain solution: unescape the dots and rely on Java's IPv4/IPv6 
parsing code.

> [Mumak] We should not include nodes with numeric ips in cluster topology.
> -
>
> Key: MAPREDUCE-1222
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1222
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/mumak
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Hong Tang
>Assignee: Hong Tang
> Fix For: 0.21.0, 0.22.0
>
> Attachments: IPv6-predicate.patch, mapreduce-1222-20091119.patch, 
> mapreduce-1222-20091121.patch
>
>
> Rumen infers cluster topology by parsing input split locations from job 
> history logs. Due to HDFS-778, a cluster node may appear either as a numeric 
> IP or as a host name in job history logs. We should exclude nodes that appear 
> as numeric IPs from the cluster topology when we run Mumak, until a solution 
> is found that keeps numeric IPs out of input split locations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Moved: (MAPREDUCE-1251) c++ utils doesn't compile

2009-11-30 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins moved HDFS-790 to MAPREDUCE-1251:
-

Affects Version/s: (was: 0.22.0)
   (was: 0.20.2)
   (was: 0.20.1)
   (was: 0.21.0)
   0.22.0
   0.21.0
   0.20.2
   0.20.1
  Key: MAPREDUCE-1251  (was: HDFS-790)
  Project: Hadoop Map/Reduce  (was: Hadoop HDFS)

> c++ utils doesn't compile
> -
>
> Key: MAPREDUCE-1251
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1251
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: ubuntu karmic 64-bit
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: HDFS-790-1.patch, HDFS-790.patch, MR-1251.patch
>
>
> c++ utils doesn't compile on ubuntu karmic 64-bit. The latest patch for 
> HADOOP-5611 needs to be applied first.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1251) c++ utils doesn't compile

2009-11-30 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated MAPREDUCE-1251:
---

Attachment: MR-1251.patch

Patch for trunk, against MR where pipes lives.

> c++ utils doesn't compile
> -
>
> Key: MAPREDUCE-1251
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1251
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: ubuntu karmic 64-bit
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: HDFS-790-1.patch, HDFS-790.patch, MR-1251.patch
>
>
> c++ utils doesn't compile on ubuntu karmic 64-bit. The latest patch for 
> HADOOP-5611 needs to be applied first.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783899#action_12783899
 ] 

Arun C Murthy commented on MAPREDUCE-1247:
--

I agree with Todd: we should *never* do any I/O operation while holding 
locks... we added the task-cleanup thread long ago for precisely this reason. 
It is quite possible that we have since violated that rule - we should fix the 
root cause rather than hide it with out-of-band heartbeats.
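The pattern described above can be sketched as follows: enqueue cleanup work under the lock, then perform the slow disk I/O with the lock released. Class and method names are illustrative, not the actual TaskTracker code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: heartbeat-critical locks are held only for O(1)
// bookkeeping, never across disk I/O.
class CleanupQueueSketch {
    private final List<String> pathsToDelete = new ArrayList<>();

    // Called under lock: only enqueue the work, no I/O.
    public synchronized void addPath(String path) {
        pathsToDelete.add(path);
    }

    // Background cleanup thread: drain a snapshot under the lock,
    // then do the slow deletes with the lock released.
    public void deletePending() {
        List<String> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(pathsToDelete);
            pathsToDelete.clear();
        }
        for (String path : snapshot) {
            delete(path);   // slow disk I/O happens outside the lock
        }
    }

    // Stand-in for e.g. FileUtil.fullyDelete.
    protected void delete(String path) { }

    public synchronized int pending() { return pathsToDelete.size(); }
}
```

Because addPath never blocks on disk, the heartbeat thread waiting on the same lock is never stalled by cleanup.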

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the JobTracker through the 
> heartbeat. Sometimes a thread locks the TaskTracker to do cleanup work, such 
> as removing task temp data on disk, and the heartbeat thread hangs for a long 
> time waiting for the lock. The JobTracker then assumes the tracker is lost 
> and reschedules all of its finished maps and unfinished reduces on other 
> TaskTrackers. We call this a "fake lost tasktracker"; sometimes it is not 
> acceptable, especially when running large jobs. We therefore introduce an 
> out-of-band heartbeat mechanism to send a heartbeat in that case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-698) Per-pool task limits for the fair scheduler

2009-11-30 Thread Kevin Peterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Peterson updated MAPREDUCE-698:
-

Attachment: mapreduce-698-trunk-3.patch

changes from previous patch:
- extra newline in fairscheduler.java removed
- removed "single test" changes from build-contrib.xml (they didn't accomplish 
what I wanted -- to run just a single test method)
- Regarding checkAssignment, I made the change you suggested, but I'm not sure 
I'm testing things in the best way. The only thing I'm concerned with is that 
it ends up scheduling the right number of tasks from each pool; the only way I 
was able to get it to actually assign the jobs was to use checkAssignment.
- in the UI, labels are "Max Share"
- Removed Pool.numRunningTasks since it was only used from within 
PoolSchedulable, where this data is already available.
- Moved cap from getDemand() to updateDemand().
- Documentation updated
- Removed tabs.
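The demand-cap change mentioned above (moving the cap from getDemand() to updateDemand()) might look roughly like this; the class and field names are simplified stand-ins for the fair scheduler's actual code.

```java
// Illustrative sketch of capping a pool's demand in updateDemand().
class PoolSchedulableSketch {
    private final int maxTasks;   // per-pool cap ("Max Share" in the UI)
    private int demand;

    PoolSchedulableSketch(int maxTasks) { this.maxTasks = maxTasks; }

    // Sum the runnable tasks of all jobs in the pool, then clamp to the cap,
    // so getDemand() stays a cheap accessor.
    public void updateDemand(int[] runnableTasksPerJob) {
        int d = 0;
        for (int t : runnableTasksPerJob) d += t;
        demand = Math.min(d, maxTasks);
    }

    public int getDemand() { return demand; }
}
```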

> Per-pool task limits for the fair scheduler
> ---
>
> Key: MAPREDUCE-698
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-698
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: contrib/fair-share
>Reporter: Matei Zaharia
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-698-prelim.patch, mapreduce-698-trunk-3.patch, 
> mapreduce-698-trunk.patch, mapreduce-698-trunk.patch
>
>
> The fair scheduler could use a way to cap the share of a given pool similar 
> to MAPREDUCE-532.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1218) Collecting cpu and memory usage for TaskTrackers

2009-11-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783846#action_12783846
 ] 

Hadoop QA commented on MAPREDUCE-1218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12426171/MAPREDUCE-1218-rename.sh
  against trunk revision 885530.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 2 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/280/console

This message is automatically generated.

> Collecting cpu and memory usage for TaskTrackers
> 
>
> Key: MAPREDUCE-1218
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1218
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Affects Versions: 0.22.0
> Environment: linux
>Reporter: Scott Chen
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1218-rename.sh, MAPREDUCE-1218.patch
>
>
> The information can be used for resource aware scheduling.
> Note that this is related to MAPREDUCE-220. There the per task resource 
> information is collected.
> This one collects the per machine information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1218) Collecting cpu and memory usage for TaskTrackers

2009-11-30 Thread Scott Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Chen updated MAPREDUCE-1218:
--

Fix Version/s: 0.22.0
Affects Version/s: 0.22.0
   Status: Patch Available  (was: Open)

> Collecting cpu and memory usage for TaskTrackers
> 
>
> Key: MAPREDUCE-1218
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1218
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Affects Versions: 0.22.0
> Environment: linux
>Reporter: Scott Chen
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1218-rename.sh, MAPREDUCE-1218.patch
>
>
> The information can be used for resource aware scheduling.
> Note that this is related to MAPREDUCE-220. There the per task resource 
> information is collected.
> This one collects the per machine information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-30 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783830#action_12783830
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

* GridMix has been sporadically failing for a while now in Hudson, but not in a 
deterministic fashion. Cannot reproduce locally.
* The fair scheduler test times out on both my trunk and MAPREDUCE-1119 
branches locally. MAPREDUCE-1245?
* TestJobHistory passes locally both on my trunk and MAPREDUCE-1119 branches.


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (MAPREDUCE-1078) Unit test for zero map jobs and killed jobs

2009-11-30 Thread Anirban Dasgupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAPREDUCE-1078 started by Anirban Dasgupta.

> Unit test for zero map jobs and killed jobs
> ---
>
> Key: MAPREDUCE-1078
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1078
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Anirban Dasgupta
>Assignee: Anirban Dasgupta
>
> Adding unit test for zero map jobs and killed jobs

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from AS t" to get meta information is too expensive for big tables

2009-11-30 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783810#action_12783810
 ] 

Aaron Kimball commented on MAPREDUCE-1224:
--

Good to know that this works with SQL Server as well. Thanks for the patch.

> Calling "SELECT t.* from  AS t" to get meta information is too 
> expensive for big tables
> --
>
> Key: MAPREDUCE-1224
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/sqoop
>Affects Versions: 0.20.1
> Environment: all platforms, generic jdbc driver
>Reporter: Spencer Ho
> Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager query "SELECT t.* from  AS t", used to get the table 
> spec, is too expensive for big tables, and it is called twice to generate 
> column names and types.  For tables that are big enough to be map-reduced, 
> this is too expensive to make sqoop useful.
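One common way to make such a metadata query cheap, sketched here as an assumption about how a fix could work rather than as the attached patch, is to add a predicate that matches no rows, so executing the query returns an empty result set whose ResultSetMetaData still describes every column.

```java
// Illustrative sketch; the real patch may differ. The table name is a
// caller-supplied placeholder.
class CheapTableSpec {
    // "WHERE 1 = 0" matches no rows, so the database returns only metadata
    // instead of scanning the table.
    public static String metadataQuery(String tableName) {
        return "SELECT t.* FROM " + tableName + " AS t WHERE 1 = 0";
    }
}
```

Reading the metadata once and reusing it for both column names and types would also remove the duplicate call mentioned in the description.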

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1140) Per cache-file refcount can become negative when tasks release distributed-cache files

2009-11-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783801#action_12783801
 ] 

Hudson commented on MAPREDUCE-1140:
---

Integrated in Hadoop-Mapreduce-trunk-Commit #138 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/138/])


> Per cache-file refcount can become negative when tasks release 
> distributed-cache files
> --
>
> Key: MAPREDUCE-1140
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1140
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Vinod K V
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.22.0
>
> Attachments: patch-1140-1.txt, patch-1140-2-ydist.txt, 
> patch-1140-2.txt, patch-1140-3.txt, patch-1140-ydist.txt, 
> patch-1140-ydist.txt, patch-1140.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1231) Distcp is very slow

2009-11-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783802#action_12783802
 ] 

Hudson commented on MAPREDUCE-1231:
---

Integrated in Hadoop-Mapreduce-trunk-Commit #138 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/138/])
. Added a new DistCp option, -skipcrccheck, so that the CRC check during 
setup can be skipped.  Contributed by Jothi Padmanabhan


> Distcp is very slow
> ---
>
> Key: MAPREDUCE-1231
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1231
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distcp
>Affects Versions: 0.20.1
>Reporter: Jothi Padmanabhan
>Assignee: Jothi Padmanabhan
> Fix For: 0.22.0
>
> Attachments: mapred-1231-v1.patch, mapred-1231-v2.patch, 
> mapred-1231-v3.patch, mapred-1231-v3.patch, mapred-1231-y20-v2.patch, 
> mapred-1231-y20-v3.patch, mapred-1231-y20-v4.patch, mapred-1231-y20.patch, 
> mapred-1231.patch
>
>
> Currently distcp does a checksums check in addition to file length check to 
> decide if a remote file has to be copied. If the number of files is high 
> (thousands), this checksum check is proving to be fairly costly leading to a 
> long time before the copy is started.
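The copy decision with the new -skipcrccheck option can be sketched as follows; the method and parameter names are illustrative, not DistCp's actual code.

```java
// Illustrative sketch: a destination file is considered up to date when the
// lengths match and either the CRC check is skipped or the checksums agree.
class CopyDecision {
    public static boolean needsCopy(long srcLen, long dstLen,
                                    String srcCrc, String dstCrc,
                                    boolean skipCrcCheck) {
        if (srcLen != dstLen) return true;   // size mismatch: always copy
        if (skipCrcCheck) return false;      // lengths match, trust them
        return !srcCrc.equals(dstCrc);       // fall back to checksum compare
    }
}
```

Skipping the checksum fetch avoids one round trip per file during setup, which is where the cost adds up for thousands of files.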

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1250) Refactor job token to use a common token interface

2009-11-30 Thread Kan Zhang (JIRA)
Refactor job token to use a common token interface
--

 Key: MAPREDUCE-1250
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1250
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: security
Reporter: Kan Zhang
Assignee: Kan Zhang


The idea is to use a common token interface for both the job token and the 
delegation token (HADOOP-6373) so that the RPC layer that uses them doesn't 
have to differentiate between them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1231) Distcp is very slow

2009-11-30 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated MAPREDUCE-1231:
--

   Resolution: Fixed
Fix Version/s: 0.22.0
   Status: Resolved  (was: Patch Available)

I have committed this.  Thanks, Jothi!

> Distcp is very slow
> ---
>
> Key: MAPREDUCE-1231
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1231
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distcp
>Affects Versions: 0.20.1
>Reporter: Jothi Padmanabhan
>Assignee: Jothi Padmanabhan
> Fix For: 0.22.0
>
> Attachments: mapred-1231-v1.patch, mapred-1231-v2.patch, 
> mapred-1231-v3.patch, mapred-1231-v3.patch, mapred-1231-y20-v2.patch, 
> mapred-1231-y20-v3.patch, mapred-1231-y20-v4.patch, mapred-1231-y20.patch, 
> mapred-1231.patch
>
>
> Currently distcp does a checksums check in addition to file length check to 
> decide if a remote file has to be copied. If the number of files is high 
> (thousands), this checksum check is proving to be fairly costly leading to a 
> long time before the copy is started.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1224) Calling "SELECT t.* from AS t" to get meta information is too expensive for big tables

2009-11-30 Thread Spencer Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783776#action_12783776
 ] 

Spencer Ho commented on MAPREDUCE-1224:
---

@Aaron,
This particular case that triggered the patch submission is for Microsoft SQL 
Server.  For MySQL, I am using direct mode which works for most of the cases.

> Calling "SELECT t.* from  AS t" to get meta information is too 
> expensive for big tables
> --
>
> Key: MAPREDUCE-1224
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1224
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/sqoop
>Affects Versions: 0.20.1
> Environment: all platforms, generic jdbc driver
>Reporter: Spencer Ho
> Attachments: MAPREDUCE-1224.patch, SqlManager.java
>
>
> The SqlManager query "SELECT t.* from  AS t", used to get the table 
> spec, is too expensive for big tables, and it is called twice to generate 
> column names and types.  For tables that are big enough to be map-reduced, 
> this is too expensive to make sqoop useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-763) Capacity scheduler should clean up reservations if it runs tasks on nodes other than where it has made reservations

2009-11-30 Thread rahul k singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783702#action_12783702
 ] 

rahul k singh commented on MAPREDUCE-763:
-

Approach:

For every job with a higher memory requirement we reserve a TT. Now, if a task 
from the job is assigned to an unreserved tasktracker, we simply remove a 
tasktracker from the existing reserved-TT list. This makes sure that we do not 
reserve more than required. JobInProgress maintains the lists of TTs reserved 
for maps and reduces; we can use the same lists, and the TT removed will just 
be the first element in the list.

The above solution has a small starvation issue in the case of speculative 
tasks, because speculative tasks cannot run on certain kinds of TTs. For 
example, consider a job with 3 tips, namely tip1, tip2, tip3:
 for attempt tip1_1 we have reserved TT1.
 for attempt tip2_1 we have reserved TT2.
 for attempt tip3_1 we have reserved TT3.
In this case, if tip1_1 is assigned to TT4 we simply unreserve TT1, as it is 
at the top of the list. Now if there is a second attempt tip1_2 for tip1, and 
it cannot run on TT2 or TT3, tip1_2 may starve slightly because it has to wait 
until it gets a TT where it can run. But this is acceptable, as it is a 
comparatively remote case and the list of TTs is a dynamic structure. To make 
the above work correctly we would need an attempt-id to reserved-TT mapping 
(any other suggestions?), which would require some significant changes.

The above approach also keeps the code changes simple and straightforward. It 
definitely alleviates the current situation, where the chance of going 
overboard with reservations is relatively high.
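The reservation-cleanup approach described above might be sketched like this; the class is a simplified stand-in for the scheduler's actual bookkeeping.

```java
import java.util.LinkedList;

// Illustrative sketch: when a high-memory job's task is assigned to an
// unreserved tracker, drop the first tracker from the job's reserved list so
// total reservations never exceed what is needed.
class ReservationTrackerSketch {
    private final LinkedList<String> reservedTrackers = new LinkedList<>();

    public void reserve(String tracker) { reservedTrackers.add(tracker); }

    // Called when a task runs on some tracker. If that tracker held a
    // reservation, release it; otherwise release the oldest reservation.
    public void taskAssigned(String tracker) {
        if (!reservedTrackers.remove(tracker) && !reservedTrackers.isEmpty()) {
            reservedTrackers.removeFirst();
        }
    }

    public int reservedCount() { return reservedTrackers.size(); }
}
```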

> Capacity scheduler should clean up reservations if it runs tasks on nodes 
> other than where it has made reservations
> ---
>
> Key: MAPREDUCE-763
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-763
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/capacity-sched
>Affects Versions: 0.21.0
>Reporter: Hemanth Yamijala
>Assignee: rahul k singh
> Fix For: 0.21.0
>
>
> Currently capacity scheduler makes a reservation on nodes for high memory 
> jobs that cannot currently run at the time. It could happen that in the 
> meantime other tasktrackers become free to run the tasks of this job. Ideally 
> in the next heartbeat from the reserved TTs the reservation should be 
> removed. Otherwise it could unnecessarily block capacity for a while (until 
> the TT has enough slots free to run a task of this job).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783650#action_12783650
 ] 

Hadoop QA commented on MAPREDUCE-1249:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426393/patch-1249.txt
  against trunk revision 884832.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/279/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/279/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/279/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/279/console

This message is automatically generated.

> mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
> mapred-default.xml
> 
>
> Key: MAPREDUCE-1249
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1249.txt
>
>
> mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
> mapred-default.xml, whereas the default value in Fetcher code is 3 minutes.
> It should be 3 minutes by default, as it was before MAPREDUCE-353.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1140) Per cache-file refcount can become negative when tasks release distributed-cache files

2009-11-30 Thread Iyappan Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783643#action_12783643
 ] 

Iyappan Srinivasan commented on MAPREDUCE-1140:
---

| Regarding the tests, I spoke offline to Amarsri to understand the scenario 
executed by Karthikeyan in the comment above. It was not very clear why file1 
was being added twice. Some more details on configuration - that it was run on 
a single node, max failures was set to 1 - should be documented for better 
understanding.

- Configuration:
The tests run with one node running the JT/NN and another node running the 
TT/DN. map.max.attempts is set to 1, reduce.max.attempts is also set to 1, 
local.cache.size is set to 4 GB, and mapred.local.dir is set to only one 
spindle, not all the spindles.
This is done to force the TT to localize in the same path and to try deleting 
the local cache files when the size exceeds 4 GB.

The idea of the above testcase is:
Ran Job1 with cache files file1 and file2 - Job succeeded.
Ran Job2 with cache files file3 and file1. When file3 was getting localized, 
removed file3 from DFS - Job2 failed.
   - Here, since file3 is deleted, the reference count of file1 should not be 
decremented twice (once during setup and once during cleanup). That is the 
objective of this scenario.

Ran Job3 with cache files file1, file1 (again) and file4. file4 is huge (say 
5 GB), larger than local.cache.size.
 - To make sure that the decrement happened properly, file1 is added twice. 
When file4 is added, which is more than the local cache size, other files like 
file2 and file3 (which were used in the previous jobs) get deleted, but not 
file1 (because its reference count was correct).

| In order to match the regression tests in trunk, I would suggest we also 
have in Job 2 a file, say file5, which we should verify is not even localized 
(because file3 fails localization). Then we can include file5 in Job 3 and 
make sure localization happens successfully.

This scenario was tested and the localization happens successfully.




> Per cache-file refcount can become negative when tasks release 
> distributed-cache files
> --
>
> Key: MAPREDUCE-1140
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1140
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Vinod K V
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.22.0
>
> Attachments: patch-1140-1.txt, patch-1140-2-ydist.txt, 
> patch-1140-2.txt, patch-1140-3.txt, patch-1140-ydist.txt, 
> patch-1140-ydist.txt, patch-1140.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-896) Users can set non-writable permissions on temporary files for TT and can abuse disk usage.

2009-11-30 Thread Ravi Gummadi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-896:
---

Attachment: MR-896.v1.patch

Updated the patch so that it applies to the current trunk.

Please review and provide your comments.

> Users can set non-writable permissions on temporary files for TT and can 
> abuse disk usage.
> --
>
> Key: MAPREDUCE-896
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-896
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.21.0
>Reporter: Vinod K V
>Assignee: Ravi Gummadi
> Fix For: 0.21.0
>
> Attachments: MR-896.patch, MR-896.v1.patch
>
>
> As of now, irrespective of the TaskController in use, the TT itself does a 
> full delete on local files created by itself or by job tasks. Depending on 
> the TT's umask and the permissions the user sets on files, e.g. in 
> job-work/task-work or child.tmp directories, this step may or may not 
> complete fully. This leaves an opportunity for disk space abuse, either 
> accidental or intentional, by the TT/users.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1248) Redundant memory copying in StreamKeyValUtil

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783614#action_12783614
 ] 

ZhuGuanyin commented on MAPREDUCE-1248:
---

The same thing happens in KeyValueLineRecordReader.java when it calls the 
next() method.

> Redundant memory copying in StreamKeyValUtil
> 
>
> Key: MAPREDUCE-1248
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1248
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: Ruibang He
>Priority: Minor
>
> I found that when MROutputThread collects the output of the Reducer, it calls 
> StreamKeyValUtil.splitKeyVal(), and two local byte arrays are allocated there 
> for each line of output. These two byte arrays are later passed to the key 
> and val variables. There are two memory copies here: one is the 
> System.arraycopy() call, the other is inside key.set() / val.set().
> This doubles the memory copying for the whole output (which may lead to 
> higher CPU consumption) and causes frequent temporary object allocation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1140) Per cache-file refcount can become negative when tasks release distributed-cache files

2009-11-30 Thread Hemanth Yamijala (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemanth Yamijala updated MAPREDUCE-1140:


   Resolution: Fixed
Fix Version/s: 0.22.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

I verified TestJobRetire passes on my machine with the patch as well.

Hence, I committed this to trunk. Thanks, Amareshwari !

> Per cache-file refcount can become negative when tasks release 
> distributed-cache files
> --
>
> Key: MAPREDUCE-1140
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1140
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Vinod K V
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.22.0
>
> Attachments: patch-1140-1.txt, patch-1140-2-ydist.txt, 
> patch-1140-2.txt, patch-1140-3.txt, patch-1140-ydist.txt, 
> patch-1140-ydist.txt, patch-1140.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783604#action_12783604
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

We could make the out-of-band heartbeat thread in the TaskTracker optional 
(the thread is not started by default, controlled through a configurable 
parameter); small clusters running small jobs do not need it. The additional 
thread is very useful for clusters running large jobs. Our production Hadoop 
cluster became more robust and never fake-loses a TaskTracker any more. I will 
attach the patch if someone is interested.
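A config-gated out-of-band heartbeat thread of the kind described could look roughly like this; the configuration key and all names are hypothetical, not the actual patch.

```java
// Illustrative sketch of an optional out-of-band heartbeat sender: a daemon
// thread that keeps pinging the JobTracker even while the main heartbeat
// thread is blocked on the TaskTracker lock.
class OobHeartbeatSketch {
    // Hypothetical configuration key; the real patch may name it differently.
    static final String ENABLE_KEY = "mapred.tasktracker.outofband.heartbeat";

    // Returns the started daemon thread, or null when the feature is off.
    static Thread maybeStart(boolean enabled, Runnable sendHeartbeat,
                             long intervalMs) {
        if (!enabled) return null;   // small clusters can leave this disabled
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                sendHeartbeat.run();           // lock-free status ping
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                    return;                    // shut down cleanly
                }
            }
        }, "oob-heartbeat");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

The ping itself must of course avoid taking the same lock the cleanup work holds, or it would hang just like the main heartbeat thread.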

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the JobTracker through the 
> heartbeat. Sometimes a thread locks the TaskTracker to do cleanup work, such 
> as removing task temp data on disk, and the heartbeat thread hangs for a long 
> time waiting for the lock. The JobTracker then assumes the tracker is lost 
> and reschedules all of its finished maps and unfinished reduces on other 
> TaskTrackers. We call this a "fake lost tasktracker"; sometimes it is not 
> acceptable, especially when running large jobs. We therefore introduce an 
> out-of-band heartbeat mechanism to send a heartbeat in that case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1248) Redundant memory copying in StreamKeyValUtil

2009-11-30 Thread Ruibang He (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783593#action_12783593
 ] 

Ruibang He commented on MAPREDUCE-1248:
---

I suggest removing the two local byte-arrays and replacing the following code:

key.set(keyBytes);
val.set(valBytes);

with:

key.set(utf, start, keyLen);
val.set(utf, splitPos+separatorLength, valLen);

I did a simple test of the above in my cluster. It works, and the memory no 
longer keeps climbing.

Any thoughts?
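
To make the proposal concrete, here is a minimal, self-contained sketch of the 
two approaches. The method shapes are simplified stand-ins for 
StreamKeyValUtil.splitKeyVal() and Text.set(), not the real Hadoop signatures; 
String stands in for org.apache.hadoop.io.Text so the snippet runs without 
Hadoop on the classpath.

```java
import java.nio.charset.StandardCharsets;

public class SplitKeyValSketch {

    // Current approach: two temporary arrays plus the second copy that
    // key.set(byte[]) / val.set(byte[]) perform internally.
    static String[] splitWithTempArrays(byte[] utf, int start, int splitPos,
                                        int sepLen, int end) {
        int keyLen = splitPos - start;
        int valLen = end - splitPos - sepLen;
        byte[] keyBytes = new byte[keyLen];
        System.arraycopy(utf, start, keyBytes, 0, keyLen);             // copy #1
        byte[] valBytes = new byte[valLen];
        System.arraycopy(utf, splitPos + sepLen, valBytes, 0, valLen); // copy #1
        // key.set(keyBytes) / val.set(valBytes) would copy again (copy #2);
        // String construction plays that role here.
        return new String[] {
            new String(keyBytes, StandardCharsets.UTF_8),
            new String(valBytes, StandardCharsets.UTF_8)
        };
    }

    // Proposed approach: hand the shared buffer plus offsets straight to
    // Text.set(byte[], int, int), so only the one unavoidable copy happens.
    static String[] splitWithOffsets(byte[] utf, int start, int splitPos,
                                     int sepLen, int end) {
        int keyLen = splitPos - start;
        int valLen = end - splitPos - sepLen;
        return new String[] {
            new String(utf, start, keyLen, StandardCharsets.UTF_8),
            new String(utf, splitPos + sepLen, valLen, StandardCharsets.UTF_8)
        };
    }
}
```

Both variants produce identical key/val pairs; the second simply skips the 
intermediate arrays.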

> Redundant memory copying in StreamKeyValUtil
> 
>
> Key: MAPREDUCE-1248
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1248
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: Ruibang He
>Priority: Minor
>
> I found that when MROutputThread collects the output of the Reducer, it calls 
> StreamKeyValUtil.splitKeyVal(), where two local byte-arrays are allocated for 
> each line of output. These byte-arrays are later passed to the variables key 
> and val. Memory is copied twice here: once in System.arraycopy() and once 
> inside key.set() / val.set().
> This doubles the memory copying for the whole output (which may lead to 
> higher CPU consumption) and causes frequent temporary object allocation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1248) Redundant memory copying in StreamKeyValUtil

2009-11-30 Thread Ruibang He (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783591#action_12783591
 ] 

Ruibang He commented on MAPREDUCE-1248:
---

I have often observed the memory consumption of Reducers in the reduce phase 
climb to the heap limit and then fall, repeatedly. This pattern is typically 
caused by frequent temporary object allocation. It hurts performance, since 
the GC has to keep working constantly.

> Redundant memory copying in StreamKeyValUtil
> 
>
> Key: MAPREDUCE-1248
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1248
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: Ruibang He
>Priority: Minor
>
> I found that when MROutputThread collects the output of the Reducer, it calls 
> StreamKeyValUtil.splitKeyVal(), where two local byte-arrays are allocated for 
> each line of output. These byte-arrays are later passed to the variables key 
> and val. Memory is copied twice here: once in System.arraycopy() and once 
> inside key.set() / val.set().
> This doubles the memory copying for the whole output (which may lead to 
> higher CPU consumption) and causes frequent temporary object allocation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-1249:
---

Status: Patch Available  (was: Open)

> mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
> mapred-default.xml
> 
>
> Key: MAPREDUCE-1249
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1249.txt
>
>
> mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
> mapred-default.xml, whereas the default value in the Fetcher code is 3 minutes.
> It should be 3 minutes by default, as it was before MAPREDUCE-353.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-1249:
---

Attachment: patch-1249.txt

Patch changing the value in mapred-default.xml
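
For reference, the corrected entry in mapred-default.xml would look roughly 
like this (180,000 ms = 3 minutes); the description text below is illustrative 
wording, not quoted from the patch:

```xml
<property>
  <name>mapreduce.reduce.shuffle.read.timeout</name>
  <value>180000</value>
  <description>Expert: The maximum amount of time (in milliseconds) a reduce
  task spends trying to read map output from a host during the shuffle before
  declaring the fetch failed. Matches the 3-minute default in the Fetcher
  code.</description>
</property>
```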

> mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
> mapred-default.xml
> 
>
> Key: MAPREDUCE-1249
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1249.txt
>
>
> mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
> mapred-default.xml, whereas the default value in the Fetcher code is 3 minutes.
> It should be 3 minutes by default, as it was before MAPREDUCE-353.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1249) mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in mapred-default.xml

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)
mapreduce.reduce.shuffle.read.timeout's default value should be 3 minutes, in 
mapred-default.xml


 Key: MAPREDUCE-1249
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1249
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 0.21.0
Reporter: Amareshwari Sriramadasu
Priority: Blocker
 Fix For: 0.21.0


mapreduce.reduce.shuffle.read.timeout has a value of 30,000 (30 seconds) in 
mapred-default.xml, whereas the default value in the Fetcher code is 3 minutes.
It should be 3 minutes by default, as it was before MAPREDUCE-353.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1248) Redundant memory copying in StreamKeyValUtil

2009-11-30 Thread Ruibang He (JIRA)
Redundant memory copying in StreamKeyValUtil


 Key: MAPREDUCE-1248
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1248
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: contrib/streaming
Reporter: Ruibang He
Priority: Minor


I found that when MROutputThread collects the output of the Reducer, it calls 
StreamKeyValUtil.splitKeyVal(), where two local byte-arrays are allocated for 
each line of output. These byte-arrays are later passed to the variables key 
and val. Memory is copied twice here: once in System.arraycopy() and once 
inside key.set() / val.set().

This doubles the memory copying for the whole output (which may lead to higher 
CPU consumption) and causes frequent temporary object allocation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1082) Command line UI for queues' information is broken with hierarchical queues.

2009-11-30 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783579#action_12783579
 ] 

Hemanth Yamijala commented on MAPREDUCE-1082:
-

Have a few comments on this patch:

- I don't know why readFields in JobQueueInfo needs to be overridden. 
Throughout the API, it is QueueInfo objects that are transferred over IPC, and 
I think it should remain that way.
- I think the test case needs to be an end-to-end test by definition, as the 
fix is both in JobQueueInfo and in the JobTracker, where, when we translate 
JobQueueInfos to QueueInfos, we fix the translation by walking the entire 
hierarchy. I would suggest a test that brings up a MiniMRCluster with 
hierarchical queues, submits a job to one of the queues, calls 
Cluster.getRootQueues, and verifies the returned QueueInfo information. We 
might need a package-private JobTracker.setQueueManager to enable setting up 
hierarchical queues with a MiniMRCluster.

> Command line UI for queues' information is broken with hierarchical queues.
> ---
>
> Key: MAPREDUCE-1082
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1082
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client, jobtracker
>Affects Versions: 0.21.0
>Reporter: Vinod K V
>Assignee: V.V.Chaitanya Krishna
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-1082-1.txt, MAPREDUCE-1082-2.patch
>
>
> When the command "./bin/mapred --config ~/tmp/conf/ queue -list" is run, it 
> just hangs. I can see the following in the JT logs:
> {code}
> 2009-10-08 13:19:26,762 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 1 on 5 caught: java.lang.NullPointerException
> at org.apache.hadoop.mapreduce.QueueInfo.write(QueueInfo.java:217)
> at org.apache.hadoop.mapreduce.QueueInfo.write(QueueInfo.java:223)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
> at 
> org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:126)
> at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:1074)
> at org.apache.hadoop.ipc.Server.access$2400(Server.java:77)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:983)
> {code}
> Same is the case with "./bin/mapred --config ~/tmp/conf/ queue -info 
> "

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (MAPREDUCE-763) Capacity scheduler should clean up reservations if it runs tasks on nodes other than where it has made reservations

2009-11-30 Thread Sreekanth Ramakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sreekanth Ramakrishnan reassigned MAPREDUCE-763:


Assignee: rahul k singh  (was: Sreekanth Ramakrishnan)

> Capacity scheduler should clean up reservations if it runs tasks on nodes 
> other than where it has made reservations
> ---
>
> Key: MAPREDUCE-763
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-763
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/capacity-sched
>Affects Versions: 0.21.0
>Reporter: Hemanth Yamijala
>Assignee: rahul k singh
> Fix For: 0.21.0
>
>
> Currently, the capacity scheduler makes reservations on nodes for high-memory 
> jobs that cannot run at the time. It can happen that in the meantime other 
> tasktrackers become free to run the tasks of this job. Ideally, the 
> reservation should be removed in the next heartbeat from the reserved TTs. 
> Otherwise it can unnecessarily block capacity for a while (until the TT has 
> enough slots free to run a task of this job).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1185) URL to JT webconsole for running job and job history should be the same

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783576#action_12783576
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1185:


test-patch and ant tests passed on the Y!20 patch, except TestHdfsProxy.

> URL to JT webconsole for running job and job history should be the same
> ---
>
> Key: MAPREDUCE-1185
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1185
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker
>Reporter: Sharad Agarwal
>Assignee: Sharad Agarwal
> Attachments: 1185_v1.patch, 1185_v2.patch, 1185_v3.patch, 
> 1185_v4.patch, 1185_v5.patch, patch-1185-ydist.txt
>
>
> The tracking URLs for running jobs and retired jobs are different. This 
> creates a problem for clients that cache the running job's URL, because it 
> becomes invalid as soon as the job is retired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783574#action_12783574
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

The taskCleanup thread locks the TaskTracker when it calls 
MapOutputFile.removeAll() through TaskTracker.purgeTask() to clean up a task, 
or TaskTracker.purgeJob() to clean up a job. If the map output files are 
larger than 50GB and there are other IO operations on the same disk, it can 
hold the TaskTracker lock long enough for the JobTracker to treat this 
tasktracker as dead.

I think the current heartbeat thread has to handle too many things that are 
not its duty. A deadlock in the tasktracker may still happen and go undetected 
in the current implementation, and I don't think it is the heartbeat's duty to 
find deadlocks in the tasktracker.

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the JobTracker through the 
> heartbeat. Sometimes, if a thread locks the TaskTracker to do some cleanup 
> job, like removing task temp data on disk, the heartbeat thread can hang for 
> a long time waiting for the lock, so the JobTracker concludes the tracker is 
> lost and reschedules all of its finished maps and unfinished reduces on 
> other tasktrackers. We call this a "fake lost tasktracker", and it is 
> sometimes unacceptable, especially when running large jobs. So we introduce 
> an out-of-band heartbeat mechanism that sends a heartbeat in that case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783569#action_12783569
 ] 

Todd Lipcon commented on MAPREDUCE-1247:


My worry about a heartbeat thread that's entirely disconnected from the 
operation of the TT is that there are certain cases where the TT is "as good 
as dead" but not actually dead. For example, if the TT is in a deadlocked 
state, your "true heartbeat" would continue to function even though the TT is 
not healthy and should be considered dead.

I agree that the optimal system would separate these things and provide some 
kind of health-check interface to ensure that the service is actually getting 
work done. As a more achievable short-term goal, I think deferring these slow 
operations to other threads is the safer route. Admittedly I don't work on the 
guts of this part of the system much, so I will defer now to those who do.
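
The "defer slow work" idea can be sketched as follows; the class and method 
names are hypothetical stand-ins, not the TaskTracker's real internals. The 
heartbeat path only touches in-memory state under the lock, while disk-heavy 
cleanup runs on a separate single-threaded executor so it never blocks the 
heartbeat.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the TaskTracker; not the real Hadoop class.
public class TrackerSketch {
    private final Object stateLock = new Object();
    private long lastHeartbeatMillis;
    // A single background thread absorbs slow, IO-bound cleanup work.
    private final ExecutorService cleanupPool = Executors.newSingleThreadExecutor();

    // Heartbeat path: only quick in-memory work under the lock.
    public void heartbeat() {
        synchronized (stateLock) {
            lastHeartbeatMillis = System.currentTimeMillis();
        }
    }

    public long lastHeartbeat() {
        synchronized (stateLock) {
            return lastHeartbeatMillis;
        }
    }

    // Purge path: returns immediately; the slow disk deletion happens on
    // the cleanup thread, outside the state lock.
    public void purgeTask(String taskDir) {
        cleanupPool.submit(() -> deleteRecursively(taskDir));
    }

    private void deleteRecursively(String dir) {
        // Placeholder for expensive MapOutputFile.removeAll()-style IO.
    }

    public void shutdown() throws InterruptedException {
        cleanupPool.shutdown();
        cleanupPool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

With this split, a 50GB cleanup can grind away on the executor while 
heartbeat() still acquires the lock promptly.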

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the JobTracker through the 
> heartbeat. Sometimes, if a thread locks the TaskTracker to do some cleanup 
> job, like removing task temp data on disk, the heartbeat thread can hang for 
> a long time waiting for the lock, so the JobTracker concludes the tracker is 
> lost and reschedules all of its finished maps and unfinished reduces on 
> other tasktrackers. We call this a "fake lost tasktracker", and it is 
> sometimes unacceptable, especially when running large jobs. So we introduce 
> an out-of-band heartbeat mechanism that sends a heartbeat in that case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1140) Per cache-file refcount can become negative when tasks release distributed-cache files

2009-11-30 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783568#action_12783568
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1140:


test-patch result for 0.20 patch :
{noformat}
 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] -1 tests included.  The patch doesn't appear to include any new 
or modified tests.
 [exec] Please justify why no tests are needed for 
this patch.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
{noformat}

-1 tests included. It is difficult to port the testcase to 0.20, because the 
code in 0.20 is all static methods.

All unit tests passed on my machine except TestHdfsProxy.

> Per cache-file refcount can become negative when tasks release 
> distributed-cache files
> --
>
> Key: MAPREDUCE-1140
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1140
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Vinod K V
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-1140-1.txt, patch-1140-2-ydist.txt, 
> patch-1140-2.txt, patch-1140-3.txt, patch-1140-ydist.txt, 
> patch-1140-ydist.txt, patch-1140.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783566#action_12783566
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

The out-of-band heartbeat thread (or we could call it the true heartbeat 
thread) only sends the tasktracker's name to the jobtracker, and the 
jobtracker just updates its last-seen time. We could add a new interface to 
InterTrackerProtocol, so this doesn't add a lot of confusion or complexity.
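
A rough sketch of what such a liveness-only ping might look like. The 
interface and class names here are hypothetical; a real patch would extend 
InterTrackerProtocol rather than define a new interface. The key property is 
that the ping thread touches no TaskTracker locks and the jobtracker side does 
nothing but stamp last-seen.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal liveness RPC.
interface LivenessProtocol {
    void ping(String trackerName);
}

// Jobtracker side: only stamps last-seen, no job bookkeeping.
class LivenessTracker implements LivenessProtocol {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    public void ping(String trackerName) {
        lastSeen.put(trackerName, System.currentTimeMillis());
    }

    boolean isLive(String trackerName, long timeoutMillis) {
        Long t = lastSeen.get(trackerName);
        return t != null && System.currentTimeMillis() - t < timeoutMillis;
    }
}

// Tasktracker side: a daemon thread that pings on a fixed interval,
// independent of any TaskTracker locks.
class OutOfBandHeartbeat extends Thread {
    private final LivenessProtocol jobTracker;
    private final String trackerName;
    private final long intervalMillis;

    OutOfBandHeartbeat(LivenessProtocol jt, String trackerName, long intervalMillis) {
        this.jobTracker = jt;
        this.trackerName = trackerName;
        this.intervalMillis = intervalMillis;
        setDaemon(true);
    }

    public void run() {
        while (!isInterrupted()) {
            jobTracker.ping(trackerName);
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                return; // shut down cleanly
            }
        }
    }
}
```

In a real deployment the ping would go over IPC, and the jobtracker would 
consult the last-seen map before declaring a tracker lost.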



> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the JobTracker through the 
> heartbeat. Sometimes, if a thread locks the TaskTracker to do some cleanup 
> job, like removing task temp data on disk, the heartbeat thread can hang for 
> a long time waiting for the lock, so the JobTracker concludes the tracker is 
> lost and reschedules all of its finished maps and unfinished reduces on 
> other tasktrackers. We call this a "fake lost tasktracker", and it is 
> sometimes unacceptable, especially when running large jobs. So we introduce 
> an out-of-band heartbeat mechanism that sends a heartbeat in that case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783567#action_12783567
 ] 

Todd Lipcon commented on MAPREDUCE-1247:


Any chance you have job jars that contain many thousands of classes? 
MAPREDUCE-967 may help with the cleanup taking a long time. Nevertheless, I 
agree that some of these things should be deferred to other threads so that 
the brunt of the work (e.g. IO-bound operations) doesn't hold critical locks.

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the JobTracker through the 
> heartbeat. Sometimes, if a thread locks the TaskTracker to do some cleanup 
> job, like removing task temp data on disk, the heartbeat thread can hang for 
> a long time waiting for the lock, so the JobTracker concludes the tracker is 
> lost and reschedules all of its finished maps and unfinished reduces on 
> other tasktrackers. We call this a "fake lost tasktracker", and it is 
> sometimes unacceptable, especially when running large jobs. So we introduce 
> an out-of-band heartbeat mechanism that sends a heartbeat in that case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783564#action_12783564
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

We printed a Java jstack each time a tasktracker became a fake lost 
tasktracker on Hadoop 0.19, and found:

7 times the heartbeat thread was waiting on the TaskTracker lock (5 times 
because the taskCleanup thread held it for a long time, 2 times because a 
reduce child JVM called TaskTracker.getMapCompletionEvents())

4 times the heartbeat thread was waiting on the TaskTracker.TaskInProgress 
lock (3 times because the taskCleanup thread held it for a long time, 1 time 
because TaskLauncher held it for a long time)

2 times the heartbeat thread was waiting on the AllocatorPerContext lock

The heartbeat thread should only answer for the liveness of the tasktracker, 
but in the current implementation it has too many other things to do; we 
should let the heartbeat thread do only what it has to.

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the JobTracker through the 
> heartbeat. Sometimes, if a thread locks the TaskTracker to do some cleanup 
> job, like removing task temp data on disk, the heartbeat thread can hang for 
> a long time waiting for the lock, so the JobTracker concludes the tracker is 
> lost and reschedules all of its finished maps and unfinished reduces on 
> other tasktrackers. We call this a "fake lost tasktracker", and it is 
> sometimes unacceptable, especially when running large jobs. So we introduce 
> an out-of-band heartbeat mechanism that sends a heartbeat in that case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.