[jira] [Commented] (MAPREDUCE-2669) Some new examples and test cases for them.

2011-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065674#comment-13065674
 ] 

Hadoop QA commented on MAPREDUCE-2669:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12486531/MAPREDUCE-2669.patch
  against trunk revision 1146517.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 9 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 2254 javac compiler warnings (more 
than the trunk's current 2244 warnings).

-1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these core unit tests:
  org.apache.hadoop.cli.TestMRCLI
  org.apache.hadoop.fs.TestFileSystem

-1 contrib tests.  The patch failed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/470//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/470//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/470//console

This message is automatically generated.

> Some new examples and test cases for them.
> --
>
> Key: MAPREDUCE-2669
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2669
> Project: Hadoop Map/Reduce
>  Issue Type: Test
>  Components: examples
>Affects Versions: 0.22.0
>Reporter: Plamen Jeliazkov
>Priority: Minor
> Attachments: MAPREDUCE-2669.patch, mapreduce-new-examples-0.22.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Looking to add some more examples such as Mean, Median, and Standard 
> Deviation to the examples.
> I have some generic JUnit testcases as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2669) Some new examples and test cases for them.

2011-07-14 Thread Plamen Jeliazkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Plamen Jeliazkov updated MAPREDUCE-2669:


Attachment: MAPREDUCE-2669.patch

Making some changes for the QA bot.

> Some new examples and test cases for them.
> --
>
> Key: MAPREDUCE-2669
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2669
> Project: Hadoop Map/Reduce
>  Issue Type: Test
>  Components: examples
>Affects Versions: 0.22.0
>Reporter: Plamen Jeliazkov
>Priority: Minor
> Attachments: MAPREDUCE-2669.patch, mapreduce-new-examples-0.22.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Looking to add some more examples such as Mean, Median, and Standard 
> Deviation to the examples.
> I have some generic JUnit testcases as well.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065540#comment-13065540
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


Looking back, I realize that I probably have not answered Todd's question 
satisfactorily.  Yes, there are out-of-band heartbeats, and in fact not every TT 
heartbeat will make it all the way through to this piece of code, because the 
node may have no slots available by the time it gets to this job.  The intention 
was not to verify that the job has been tried on every TT before giving up; the 
idea was to make a reasonable effort at scheduling the job before giving up.  I 
suspect that the amount of free disk space on a node may vary quite a bit 
between heartbeats, because jobs use disk space that is then freed, HDFS deletes 
a stored file, or several new blocks are added.  So even if we give every node a 
chance at this job before giving up, there is still a possibility that it would 
succeed later on.  We cannot predict the future, but we do need to put an upper 
bound on how long we try to do something; otherwise there will always be corner 
cases where we can get starvation.

It may also make sense to use some statistical heuristics in MR-279 to try to 
give up sooner rather than later when someone asks for something that is really 
outside the norm.  But that is just an optimization.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Attachments: MR-2324-security-v1.txt
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2669) Some new examples and test cases for them.

2011-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065532#comment-13065532
 ] 

Hadoop QA commented on MAPREDUCE-2669:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12486343/mapreduce-new-examples-0.22.patch
  against trunk revision 1146517.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 9 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 2254 javac compiler warnings (more 
than the trunk's current 2244 warnings).

-1 findbugs.  The patch appears to introduce 15 new Findbugs (version 
1.3.9) warnings.

-1 release audit.  The applied patch generated 5 release audit warnings 
(more than the trunk's current 2 warnings).

-1 core tests.  The patch failed these core unit tests:
  org.apache.hadoop.cli.TestMRCLI
  org.apache.hadoop.fs.TestFileSystem

-1 contrib tests.  The patch failed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/468//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/468//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/468//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/468//console

This message is automatically generated.

> Some new examples and test cases for them.
> --
>
> Key: MAPREDUCE-2669
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2669
> Project: Hadoop Map/Reduce
>  Issue Type: Test
>  Components: examples
>Affects Versions: 0.22.0
>Reporter: Plamen Jeliazkov
>Priority: Minor
> Attachments: mapreduce-new-examples-0.22.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Looking to add some more examples such as Mean, Median, and Standard 
> Deviation to the examples.
> I have some generic JUnit testcases as well.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065505#comment-13065505
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


I don't believe that the fix I submitted is incomplete; the issue is that MRv2 
does things so differently that we need to tackle the problem in a different 
way.  I am sure the patch is not perfect, and I am very happy to see any better 
ideas/patches.  Also, I am getting noise from my customers about this, so I 
would like to see a fix in a sustaining release.  It is not a lot of noise, but 
I do have to at least try to get a fix in.

I do agree that having different configuration values is an issue that I would 
like to avoid, but 0.23 has currently dropped mapreduce.reduce.input.limit 
altogether, along with who knows what other configuration values.  I do not see 
any way to maintain mapreduce.reduce.input.limit in MRv2.

I have started looking at the scheduler code in YARN, and this is just 
preliminary, but it looks like what we want to do is extend Resource to include 
disk space, not just RAM.  The NodeManager can then also report back the amount 
of disk space it has free, just like the TaskTracker does.  Then, for reduce 
tasks, the MR ApplicationMaster can request the container based on the 
estimated reduce input size.  We can also put in a more generic 
resource-starvation detection mechanism that would work for both RAM and disk.
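The idea of extending the resource abstraction beyond RAM could be sketched as 
follows.  This is purely illustrative; the actual Resource API in MR-279/YARN 
is different, and the class and field names here are assumptions:

```java
// Illustrative sketch only: a resource descriptor carrying both memory
// and disk, so a container request for a reduce task can state its
// estimated input size. Not the real org.apache.hadoop.yarn.api Resource.
public class DiskAwareResource {
    private final long memoryMb;
    private final long diskMb;

    public DiskAwareResource(long memoryMb, long diskMb) {
        this.memoryMb = memoryMb;
        this.diskMb = diskMb;
    }

    /** A node can host the request only if both dimensions fit. */
    public boolean fits(DiskAwareResource available) {
        return memoryMb <= available.memoryMb && diskMb <= available.diskMb;
    }

    public static void main(String[] args) {
        DiskAwareResource request = new DiskAwareResource(2048, 1_000_000);
        DiskAwareResource node = new DiskAwareResource(8192, 500_000);
        // Enough RAM on the node, but not enough local disk for the input.
        System.out.println(request.fits(node)); // false
    }
}
```

With a second resource dimension like this, the same starvation-detection logic 
could compare requested against reported capacity for disk exactly as it would 
for RAM.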

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Attachments: MR-2324-security-v1.txt
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065485#comment-13065485
 ] 

Allen Wittenauer commented on MAPREDUCE-2324:
-

> I am more concerned about a sustaining release for the 0.20.20X line

If a "real" fix will require a different config param, I'd rather see this 
bumped to 0.23. This has been a known (and annoying) bug for a long time, but 
doesn't really require an immediate, sustaining fix if that fix is going to be 
incomplete and ripped out 6 months later in a newer branch.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Attachments: MR-2324-security-v1.txt
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065468#comment-13065468
 ] 

Hadoop QA commented on MAPREDUCE-2324:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12486489/MR-2324-security-v1.txt
  against trunk revision 1146517.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/469//console

This message is automatically generated.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Attachments: MR-2324-security-v1.txt
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Updated] (MAPREDUCE-2669) Some new examples and test cases for them.

2011-07-14 Thread Plamen Jeliazkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Plamen Jeliazkov updated MAPREDUCE-2669:


Description: 
Looking to add some more examples such as Mean, Median, and Standard Deviation 
to the examples.
I have some generic JUnit testcases as well.

  was:
Looking to add some more examples such as Mean, Median, and Standard Deviation 
to the examples.
I have some generic JUnit testcases as well, though I feel that they can be 
improved.


> Some new examples and test cases for them.
> --
>
> Key: MAPREDUCE-2669
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2669
> Project: Hadoop Map/Reduce
>  Issue Type: Test
>  Components: examples
>Affects Versions: 0.22.0
>Reporter: Plamen Jeliazkov
>Priority: Minor
> Attachments: mapreduce-new-examples-0.22.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Looking to add some more examples such as Mean, Median, and Standard 
> Deviation to the examples.
> I have some generic JUnit testcases as well.





[jira] [Updated] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-2324:
---

Attachment: MR-2324-security-v1.txt

I am uploading this patch based on my initial proposal: limit the maximum 
number of times that we try to schedule a reduce task and it fails because of 
size issues.  This patch is only intended for the security branch, not trunk or 
MR-279.  We still need to have a discussion about how MR-279 will handle these 
issues.

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 1.3.9) warnings.
 [exec] 


> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Attachments: MR-2324-security-v1.txt
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Updated] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-2324:
---

Status: Patch Available  (was: Open)

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Attachments: MR-2324-security-v1.txt
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2489) Jobsplits with random hostnames can make the queue unusable

2011-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065433#comment-13065433
 ] 

Hadoop QA commented on MAPREDUCE-2489:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12486481/MAPREDUCE-2489-mapred-v4.patch
  against trunk revision 1146517.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The patch appears to cause tar ant target to fail.

-1 findbugs.  The patch appears to cause Findbugs (version 1.3.9) to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these core unit tests:


-1 contrib tests.  The patch failed contrib unit tests.

-1 system test framework.  The patch failed system test framework compile.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/467//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/467//console

This message is automatically generated.

> Jobsplits with random hostnames can make the queue unusable
> ---
>
> Key: MAPREDUCE-2489
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2489
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.205.0, 0.23.0
>Reporter: Jeffrey Naisbitt
>Assignee: Jeffrey Naisbitt
> Attachments: MAPREDUCE-2489-0.20s-v2.patch, 
> MAPREDUCE-2489-0.20s-v3.patch, MAPREDUCE-2489-0.20s.patch, 
> MAPREDUCE-2489-mapred-v2.patch, MAPREDUCE-2489-mapred-v3.patch, 
> MAPREDUCE-2489-mapred-v4.patch, MAPREDUCE-2489-mapred.patch
>
>
> We saw an issue where a custom InputSplit was returning invalid hostnames for 
> the splits that were then causing the JobTracker to attempt to excessively 
> resolve host names.  This caused a major slowdown for the JobTracker.  We 
> should prevent invalid InputSplit hostnames from affecting everyone else.
> I propose we implement some verification for the hostnames to try to ensure 
> that we only do DNS lookups on valid hostnames (and fail otherwise).  We 
> could also fail the job after a certain number of failures in the resolve.
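A conservative pre-check along the lines proposed in the description might look 
like the sketch below.  The regular expression and length limit are assumptions 
for illustration, not the patch's actual validation rules:

```java
import java.util.regex.Pattern;

// Sketch: validate a split's hostname syntactically before ever handing
// it to DNS, so a bogus InputSplit cannot stall the JobTracker with
// futile lookups. The pattern below is an assumption for illustration.
public class HostnameValidator {
    // RFC-1123-style labels: alphanumerics and interior hyphens, dot-separated.
    private static final Pattern VALID_HOST =
        Pattern.compile("^([a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?\\.)*" +
                        "[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$");

    public static boolean looksResolvable(String host) {
        return host != null && host.length() <= 255
            && VALID_HOST.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksResolvable("node-7.rack2.example.com")); // true
        System.out.println(looksResolvable("random junk!"));             // false
    }
}
```

Rejecting syntactically invalid names up front avoids the DNS round-trip 
entirely; counting rejections per job would then allow failing the job after a 
configured threshold, as the description suggests.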





[jira] [Updated] (MAPREDUCE-2669) Some new examples and test cases for them.

2011-07-14 Thread Plamen Jeliazkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Plamen Jeliazkov updated MAPREDUCE-2669:


Status: Patch Available  (was: Open)

> Some new examples and test cases for them.
> --
>
> Key: MAPREDUCE-2669
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2669
> Project: Hadoop Map/Reduce
>  Issue Type: Test
>  Components: examples
>Affects Versions: 0.22.0
>Reporter: Plamen Jeliazkov
>Priority: Minor
> Attachments: mapreduce-new-examples-0.22.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Looking to add some more examples such as Mean, Median, and Standard 
> Deviation to the examples.
> I have some generic JUnit testcases as well, though I feel that they can be 
> improved.





[jira] [Updated] (MAPREDUCE-2489) Jobsplits with random hostnames can make the queue unusable

2011-07-14 Thread Jeffrey Naisbitt (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Naisbitt updated MAPREDUCE-2489:


Attachment: MAPREDUCE-2489-mapred-v4.patch
MAPREDUCE-2489-0.20s-v3.patch

Updated patch for 0.20.205 - removing the portion of the code corresponding to 
HADOOP-7314 from this patch and placing it in its own patch on that jira.



> Jobsplits with random hostnames can make the queue unusable
> ---
>
> Key: MAPREDUCE-2489
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2489
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.205.0, 0.23.0
>Reporter: Jeffrey Naisbitt
>Assignee: Jeffrey Naisbitt
> Attachments: MAPREDUCE-2489-0.20s-v2.patch, 
> MAPREDUCE-2489-0.20s-v3.patch, MAPREDUCE-2489-0.20s.patch, 
> MAPREDUCE-2489-mapred-v2.patch, MAPREDUCE-2489-mapred-v3.patch, 
> MAPREDUCE-2489-mapred-v4.patch, MAPREDUCE-2489-mapred.patch
>
>
> We saw an issue where a custom InputSplit was returning invalid hostnames for 
> the splits that were then causing the JobTracker to attempt to excessively 
> resolve host names.  This caused a major slowdown for the JobTracker.  We 
> should prevent invalid InputSplit hostnames from affecting everyone else.
> I propose we implement some verification for the hostnames to try to ensure 
> that we only do DNS lookups on valid hostnames (and fail otherwise).  We 
> could also fail the job after a certain number of failures in the resolve.





[jira] [Created] (MAPREDUCE-2686) NPE while requesting info for a non-existing job

2011-07-14 Thread Ramya Sunil (JIRA)
NPE while requesting info for a non-existing job


 Key: MAPREDUCE-2686
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2686
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Ramya Sunil
 Fix For: 0.23.0


While performing job related operations such as job -kill, -status, -events etc 
for an unknown job, the following NPE is seen:

Exception in thread "main" java.lang.NullPointerException
at 
org.apache.hadoop.mapred.ClientServiceDelegate.refreshProxy(ClientServiceDelegate.java:112)
at 
org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:100)
at 
org.apache.hadoop.mapred.ClientServiceDelegate.getRefreshedProxy(ClientServiceDelegate.java:93)
at 
org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:383)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:515)
at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:154)
at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:254)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1074)
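A defensive fix for such an NPE would typically replace the null dereference 
with an explicit check and a user-facing error.  The sketch below is 
hypothetical; the real ClientServiceDelegate code path and its lookup API 
differ:

```java
// Hypothetical sketch of guarding against an unknown job id instead of
// letting a null report propagate into an NPE. Names are illustrative.
public class JobLookup {

    /** Stand-in for whatever the RM/history lookup returns; null if unknown. */
    static String findApplicationReport(String jobId) {
        return "job_123".equals(jobId) ? "RUNNING" : null;
    }

    public static String getJobStatus(String jobId) {
        String report = findApplicationReport(jobId);
        if (report == null) {
            // Fail with a clear message rather than an NPE deep in the client.
            throw new IllegalArgumentException("Unknown job: " + jobId);
        }
        return report;
    }

    public static void main(String[] args) {
        System.out.println(getJobStatus("job_123")); // RUNNING
        try {
            getJobStatus("job_999");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Unknown job: job_999
        }
    }
}
```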






[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065339#comment-13065339
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


I have not really thought about this yet in reference to MR-279.  I am more 
concerned about a sustaining release for the 0.20.20X line; porting the 
functionality to MR-279 can come after that.

I am not sure that it even applies to MR-279 because of how different the 
scheduling is.  My understanding is that the ApplicationMaster will make a 
request to the ResourceManager for a set of nodes that meet criteria X (I 
believe that disk space is one of the criteria you can request, but it is 
currently ignored).  The ResourceManager looks at all of the nodes available 
and hands back a list of nodes that best match the given criteria.  So the 
ApplicationMaster has no idea which nodes, if any, were considered and 
rejected, or even what all of the nodes in the system are.  If we wanted to 
keep track of individual nodes, it would either have to be in the 
ResourceManager, which does have resource constraints, or in the 
ApplicationMaster, which would then need a list of all nodes in the cluster 
along with which nodes were tried and rejected for which reasons.

In fact mapreduce.reduce.input.limit is not in the MR-279 code base at all, so 
for MR-279 we need to think about resource limits and scheduling more generally.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065323#comment-13065323
 ] 

Todd Lipcon commented on MAPREDUCE-2324:


It seems we need to count not just the number of rejections but the unique TTs.  
Especially with out-of-band heartbeats, it's possible some TT might heartbeat 
many times before another one heartbeats at all.  This certainly uses more 
memory, but in MR-279 the memory usage in the JT isn't as big a deal, right?
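Tracking unique TaskTrackers rather than a raw rejection count could be 
sketched roughly as below.  This is a simplification; the real JobInProgress 
bookkeeping is more involved, and the names are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch: remember the distinct TaskTrackers that have
// rejected a reduce task, so repeated out-of-band heartbeats from one
// fast TT are not mistaken for cluster-wide rejection.
public class UniqueRejectionTracker {
    private final Set<String> rejectingTrackers = new HashSet<>();

    /** Record a rejection; duplicates from the same tracker are ignored. */
    public void recordRejection(String trackerName) {
        rejectingTrackers.add(trackerName);
    }

    /** True once every known tracker has rejected the task at least once. */
    public boolean rejectedEverywhere(int totalTrackers) {
        return rejectingTrackers.size() >= totalTrackers;
    }

    public static void main(String[] args) {
        UniqueRejectionTracker t = new UniqueRejectionTracker();
        t.recordRejection("tt1");
        t.recordRejection("tt1"); // out-of-band heartbeat, same TT
        t.recordRejection("tt2");
        System.out.println(t.rejectedEverywhere(3)); // false: tt3 never seen
    }
}
```

The memory cost is one string per tracker per starved task, which is modest 
next to what the JT already tracks per job.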

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Created] (MAPREDUCE-2685) Hadoop-Mapreduce-trunk build is failing

2011-07-14 Thread Devaraj K (JIRA)
Hadoop-Mapreduce-trunk build is failing
---

 Key: MAPREDUCE-2685
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2685
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 0.23.0
Reporter: Devaraj K


Hadoop-Mapreduce-trunk has been failing for a long time. 

https://builds.apache.org/job/Hadoop-Mapreduce-trunk/737/

org.apache.hadoop.cli.TestMRCLI.testAll is failing since Build #697
{code:xml}
2011-07-14 13:20:44,103 INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(178)) -  Comparator: 
[TokenComparator]
2011-07-14 13:20:44,104 INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(180)) -  Comparision result:   [fail]
2011-07-14 13:20:44,104 INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(182)) - Expected output:   
[Usage: java FsShell [-mv <src> ... <dst>]]
2011-07-14 13:20:44,104 INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(184)) -   Actual output:   [mv: 
Wrong FS: har:/dest/dir0.har/dir0/file0, expected: hdfs://localhost:55672
Usage: hadoop fs [generic options] -mv  ... 
]
{code}

org.apache.hadoop.fs.TestFileSystem.testCommandFormat is failing since Build 
#702

{code:xml}
org.apache.hadoop.fs.shell.CommandFormat$TooManyArgumentsException: Too many 
arguments: expected 2 but got 3
at 
org.apache.hadoop.fs.shell.CommandFormat.parse(CommandFormat.java:113)
at org.apache.hadoop.fs.shell.CommandFormat.parse(CommandFormat.java:77)
at 
org.apache.hadoop.fs.TestFileSystem.__CLR3_0_2b0mwvrw7b(TestFileSystem.java:97)
at 
org.apache.hadoop.fs.TestFileSystem.testCommandFormat(TestFileSystem.java:92)

{code}
org.apache.hadoop.mapred.TestNodeRefresh.testBlacklistedNodeDecommissioning is also failing.


--




[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065318#comment-13065318
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


OK, I have thought about it and talked with some people about 
statistics/scheduling and the like, and the conclusion I have come to is 
the following.

We should add in a new configuration parameter called 
mapreduce.reduce.input.limit.attempt.factor

This value would default to 1.0 and would determine how many times a 
reduce task can be rejected, because its estimated input size will not fit, 
before the job is killed.  So if (#failedAttempts > (#ofActiveNodes * 
attempt.factor)) then kill the job. 
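In pseudo-code, the proposed check would look something like the sketch below. The function name and argument names are illustrative only; nothing here is committed Hadoop code, and the parameter default follows the 1.0 suggested above.

```python
def should_fail_job(failed_attempts, active_nodes, attempt_factor=1.0):
    """Kill the job once a reduce task has been rejected for lack of disk
    space more times than attempt_factor times the number of active nodes
    (the proposed mapreduce.reduce.input.limit.attempt.factor, default 1.0)."""
    return failed_attempts > active_nodes * attempt_factor
```

With the default factor, a reduce task that has been turned away by every active node at least once would fail the job; raising the factor gives it more chances.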

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--




[jira] [Commented] (MAPREDUCE-2365) Add counters for FileInputFormat (BYTES_READ) and FileOutputFormat (BYTES_WRITTEN)

2011-07-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065303#comment-13065303
 ] 

Hudson commented on MAPREDUCE-2365:
---

Integrated in Hadoop-Mapreduce-trunk #737 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/737/])
MAPREDUCE-2365. Adding newer files.
MAPREDUCE-2365. Add counters to track bytes (read,written) via 
File(Input,Output)Format. Contributed by Siddharth Seth.

acmurthy : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146517
Files : 
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormatCounter.java
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormatCounter.properties
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/lib/output/FileOutputFormatCounter.java
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/lib/output/FileOutputFormatCounter.properties

acmurthy : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146515
Files : 
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/ReduceTask.java
* /hadoop/common/trunk/mapreduce/CHANGES.txt
* /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/MapTask.java
* /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/Counters.java
* /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/Task.java
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/TaskCounter.java
* 
/hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestJobCounters.java
* 
/hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapreduce/TestMapReduceLocal.java
* 
/hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestMiniMRDFSSort.java
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/TaskCounter.properties
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
* 
/hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/lib/input/SequenceFileRecordReader.java


> Add counters for FileInputFormat (BYTES_READ) and FileOutputFormat 
> (BYTES_WRITTEN)
> --
>
> Key: MAPREDUCE-2365
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2365
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Siddharth Seth
> Fix For: 0.20.203.0, 0.23.0
>
> Attachments: MR2365.patch
>
>
> MAP_INPUT_BYTES and MAP_OUTPUT_BYTES will be computed using the difference 
> between FileSystem
> counters before and after each next(K,V) and collect/write op.
> In case compression is being used, these counters will represent the 
> compressed data sizes. The uncompressed size will
> not be available.
> This is not a direct back-port of 5710. (Counters will be computed in MapTask 
> instead of in individual RecordReaders).
> 0.20.100 ->
>New API -> MAP_INPUT_BYTES will be computed using this method
>Old API -> MAP_INPUT_BYTES will remain unchanged.
> 0.23 ->
>New API -> MAP_INPUT_BYTES will be computed using this method
>Old API -> MAP_INPUT_BYTES likely to use this method
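The counting scheme described above (snapshotting FileSystem counters around each record operation) can be sketched as follows. This is an editorial illustration, not the patch's actual code: `read_fs_bytes` is a stand-in for the real FileSystem statistics, and with compression the delta reflects compressed sizes, as noted in the description.

```python
class DeltaByteCounter:
    """Attribute bytes read to an input counter by taking the difference
    of a cumulative filesystem byte counter before and after each
    record operation (e.g. RecordReader.next(K, V))."""

    def __init__(self, read_fs_bytes):
        self.read_fs_bytes = read_fs_bytes  # callable: cumulative FS bytes read
        self.input_bytes = 0

    def around_next(self, next_kv):
        before = self.read_fs_bytes()
        record = next_kv()  # the wrapped record-read operation
        self.input_bytes += self.read_fs_bytes() - before
        return record
```

Computing the delta in the task (rather than in each RecordReader, as in the 5710 approach) means any reader is covered without per-reader changes.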

--




[jira] [Commented] (MAPREDUCE-2670) Fixing spelling mistake in FairSchedulerServlet.java

2011-07-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065302#comment-13065302
 ] 

Hudson commented on MAPREDUCE-2670:
---

Integrated in Hadoop-Mapreduce-trunk #737 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/737/])
MAPREDUCE-2670. Fixing spelling mistake in FairSchedulerServlet.java. 
Contributed by Eli Collins

eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146485
Files : 
* /hadoop/common/trunk/mapreduce/CHANGES.txt
* 
/hadoop/common/trunk/mapreduce/src/contrib/fairscheduler/src/java/org/apache/hadoop/mapred/FairSchedulerServlet.java


> Fixing spelling mistake in FairSchedulerServlet.java
> 
>
> Key: MAPREDUCE-2670
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2670
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Eli Collins
>Assignee: Eli Collins
>Priority: Trivial
> Fix For: 0.23.0
>
> Attachments: mapreduce-2670-1.patch
>
>
> "Admininstration" is misspelled.

--




[jira] [Updated] (MAPREDUCE-2494) Make the distributed cache delete entries using LRU priority

2011-07-14 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-2494:
---

Affects Version/s: 0.20.205.0

> Make the distributed cache delete entries using LRU priority
> 
>
> Key: MAPREDUCE-2494
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2494
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distributed-cache
>Affects Versions: 0.20.205.0, 0.21.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-2494-20.20X-V1.patch, MAPREDUCE-2494-V1.patch, 
> MAPREDUCE-2494-V2.patch
>
>
> Currently the distributed cache will wait until a cache directory is above a 
> preconfigured threshold, at which point it will delete all entries that are 
> not currently being used.  It seems like we would get far fewer cache misses 
> if we kept some of them around, even when they are not being used.  We should 
> add a configurable percentage for how much of the cache should remain clear 
> when not in use, and select objects to delete based on how recently they 
> were used, and possibly also on how large they are and how difficult it 
> would be to download them again.
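A toy sketch of the proposed policy (editorial illustration only; the class and names below are not the TaskTracker's actual DistributedCache code): track entries in least-recently-used order, and on overflow evict only until usage drops to a configurable target fraction, rather than deleting everything unused.

```python
from collections import OrderedDict

class LRUCacheDir:
    """Evict least-recently-used cache entries down to a target
    fraction of the size threshold, instead of clearing all
    currently-unused entries at once."""

    def __init__(self, threshold_bytes, target_fraction=0.8):
        self.threshold = threshold_bytes
        self.target = target_fraction * threshold_bytes
        self.entries = OrderedDict()  # path -> size, oldest first

    def touch(self, path, size):
        if path in self.entries:
            self.entries.move_to_end(path)  # mark as recently used
        else:
            self.entries[path] = size
        if self.used() > self.threshold:
            self.evict()

    def used(self):
        return sum(self.entries.values())

    def evict(self):
        # Delete least-recently-used entries until usage is at or
        # below the target, keeping recently-used entries cached.
        while self.entries and self.used() > self.target:
            self.entries.popitem(last=False)
```

A real policy could also weight eviction by entry size or re-download cost, as the description suggests.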

--




[jira] [Updated] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-2324:
---

Affects Version/s: 0.20.205.0

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--




[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065249#comment-13065249
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


I just found the same issue and am looking into the best way to solve it.

If mapreduce.reduce.input.limit is mis-configured, or if a cluster is just 
running low on disk space in general, then reduces with a large input may 
never get scheduled, causing the job to never fail and never succeed, just 
starving until it is killed.

The JobInProgress tries to guess at the size of the input to all reducers in 
a job. If the size is over mapreduce.reduce.input.limit then the job is 
killed. If it is not, then findNewReduceTask() checks whether the estimated 
size is too big to fit on the node currently looking for work; if it is too 
big, it will let some other task have a chance at the slot.

The idea is to keep track of how often a reduce slot is rejected because of 
the lack of space vs. how often it succeeds, and then guess whether the 
reduce tasks will ever be scheduled.

So I would like some feedback on this.

1) How should we guess? Someone who found the bug here suggested P1 + (P2 * S), 
where S is the number of successful assignments. Possibly P1 = 20 and P2 = 2.0. 
I am not really sure.
2) What should we do when we guess that it will never get a slot? Should we 
fail the job, or do we say, even though it might fail, let's just schedule 
it and see if it really will fail?
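The suggested guess can be sketched as follows. P1, P2, and the function names are illustrative assumptions taken from the comment above, not an agreed design:

```python
def starvation_threshold(successes, p1=20, p2=2.0):
    """Suggested allowance: P1 + (P2 * S), where S is the number of
    reduce slots successfully assigned so far."""
    return p1 + p2 * successes

def looks_starved(rejections, successes, p1=20, p2=2.0):
    """Guess that a reduce task will never be scheduled once the number
    of space-based rejections exceeds the allowance."""
    return rejections > starvation_threshold(successes, p1, p2)
```

The P2 term makes the tolerance grow on a busy cluster where other tasks are still being placed, so a healthy job is not failed prematurely.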


> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--




[jira] [Commented] (MAPREDUCE-2684) Job Tracker can starve reduces with very large input.

2011-07-14 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065248#comment-13065248
 ] 

Robert Joseph Evans commented on MAPREDUCE-2684:


You are correct. I will mark this one as a duplicate of that one, because 
that one was filed first.

> Job Tracker can starve reduces with very large input.
> -
>
> Key: MAPREDUCE-2684
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2684
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.204.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>
> If mapreduce.reduce.input.limit is mis-configured, or if a cluster is just 
> running low on disk space in general, then reduces with a large input may 
> never get scheduled, causing the job to never fail and never succeed, just 
> starving until it is killed.
> The JobInProgress tries to guess at the size of the input to all reducers in 
> a job.  If the size is over mapreduce.reduce.input.limit then the job is 
> killed.  If it is not, then findNewReduceTask() checks whether the estimated 
> size is too big to fit on the node currently looking for work; if it is too 
> big, it will let some other task have a chance at the slot.
> The idea is to keep track of how often a reduce slot is rejected because of 
> the lack of space vs. how often it succeeds, and then guess whether the 
> reduce tasks will ever be scheduled.
> So I would like some feedback on this.
> 1) How should we guess?  Someone who found the bug here suggested P1 + (P2 * 
> S), where S is the number of successful assignments.  Possibly P1 = 20 and 
> P2 = 2.0.  I am not really sure.
> 2) What should we do when we guess that it will never get a slot?  Should we 
> fail the job, or do we say, even though it might fail, let's just schedule 
> it and see if it really will fail?

--




[jira] [Resolved] (MAPREDUCE-2684) Job Tracker can starve reduces with very large input.

2011-07-14 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans resolved MAPREDUCE-2684.


Resolution: Duplicate

> Job Tracker can starve reduces with very large input.
> -
>
> Key: MAPREDUCE-2684
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2684
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.204.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>
> If mapreduce.reduce.input.limit is mis-configured, or if a cluster is just 
> running low on disk space in general, then reduces with a large input may 
> never get scheduled, causing the job to never fail and never succeed, just 
> starving until it is killed.
> The JobInProgress tries to guess at the size of the input to all reducers in 
> a job.  If the size is over mapreduce.reduce.input.limit then the job is 
> killed.  If it is not, then findNewReduceTask() checks whether the estimated 
> size is too big to fit on the node currently looking for work; if it is too 
> big, it will let some other task have a chance at the slot.
> The idea is to keep track of how often a reduce slot is rejected because of 
> the lack of space vs. how often it succeeds, and then guess whether the 
> reduce tasks will ever be scheduled.
> So I would like some feedback on this.
> 1) How should we guess?  Someone who found the bug here suggested P1 + (P2 * 
> S), where S is the number of successful assignments.  Possibly P1 = 20 and 
> P2 = 2.0.  I am not really sure.
> 2) What should we do when we guess that it will never get a slot?  Should we 
> fail the job, or do we say, even though it might fail, let's just schedule 
> it and see if it really will fail?

--




[jira] [Assigned] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-07-14 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans reassigned MAPREDUCE-2324:
--

Assignee: Robert Joseph Evans

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--