[jira] [Commented] (MAPREDUCE-2761) New TaskController code doesn't run on Windows

2011-08-01 Thread Ravi Gummadi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076060#comment-13076060
 ] 

Ravi Gummadi commented on MAPREDUCE-2761:
-

The LinuxTaskController seems to be throwing an NPE on Linux while trying to 
kill/signal a task process. As a result, task processes are not getting killed, 
leaving about 100 tasks running on each TT node, after which tasks/jobs start 
failing. This may need a separate JIRA.

Vinay, please provide the exception seen here.

> New TaskController code doesn't run on Windows
> --
>
> Key: MAPREDUCE-2761
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2761
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task-controller, tasktracker
>Affects Versions: 0.20.204.0, 0.23.0
>Reporter: Todd Lipcon
>
> After MAPREDUCE-2178, TaskController assumes that pids are always available. 
> The shell executor object that's used to launch a JVM isn't retained, but 
> rather the pid is set when the task heartbeats. On Windows, there are no 
> pids, and since the ShellCommandExecutor object is no longer around, we can't 
> call process.destroy(). So, the TaskController doesn't work on Cygwin anymore.
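A sketch of the pid-free alternative implied above: if the launcher retains the Process handle it gets back from ProcessBuilder, it can still call destroy() on platforms where pids are unavailable. This is a hypothetical illustration, not the actual TaskController code; the class and method names are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only (not the TaskController API): keep the
// Process handle from launch so a task can be killed without a pid.
public class TaskLaunchRegistry {
    private final Map<String, Process> running = new ConcurrentHashMap<>();

    Process launch(String taskId, String... cmd) throws java.io.IOException {
        Process p = new ProcessBuilder(cmd).start();
        running.put(taskId, p); // retained: no pid needed to signal later
        return p;
    }

    boolean kill(String taskId) {
        Process p = running.remove(taskId);
        if (p == null) return false;
        p.destroy(); // portable JDK termination, works where pids are unavailable
        return true;
    }
}
```

On Unix this amounts to signaling the child; on Windows/Cygwin, where no pid is exposed, retaining the handle is what makes destroy() possible at all.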

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2641) Fix the ExponentiallySmoothedTaskRuntimeEstimator and its unit test

2011-08-01 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated MAPREDUCE-2641:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just pushed this. Thanks Josh!

> Fix the ExponentiallySmoothedTaskRuntimeEstimator and its unit test
> ---
>
> Key: MAPREDUCE-2641
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2641
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: mrv2
>Reporter: Josh Wills
>Assignee: Josh Wills
>Priority: Minor
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-2641.patch
>
>
> Fixed the ExponentiallySmoothedTaskRuntimeEstimator so that it can run and 
> pass the test defined for it in TestRuntimeEstimators.





[jira] [Commented] (MAPREDUCE-2764) Fix renewal of dfs delegation tokens

2011-08-01 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13075991#comment-13075991
 ] 

Daryn Sharp commented on MAPREDUCE-2764:


The renewal problem can be solved far more easily, w/o false coupling, and w/o 
an assumption that hftp uses https.  A DFS issues and renews tokens.  The 
problem is the lack of traceability from a token to its origin DFS.  Setting a 
DFS DT's service field to be the DFS uri, instead of ip:port, will allow a 
trivial {{FileSystem.get}} to obtain the DFS.  All guessing is removed, and the 
hftp fs encapsulates that it's using https.

The RPC layer and the token selectors will require minor modification to use 
the authority of the uri in the service field.  The semantics of other tokens 
should not be affected.
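To illustrate the point, here is a minimal, hypothetical sketch (not Hadoop code) of why a URI-valued service field removes the scheme-guessing: with a bare ip:port there is nothing to dispatch on, while a full URI carries the scheme directly.

```java
// Illustrative only: shows why a URI-valued service field removes
// scheme-guessing during token renewal. Not actual Hadoop code.
public class TokenService {
    // Old style: "10.0.0.1:8020" -> no scheme, renewer must guess hdfs/hftp.
    // New style: "hdfs://nn.example.com:8020" -> scheme is explicit.
    static String schemeOf(String service) {
        int idx = service.indexOf("://");
        if (idx < 0) return null; // bare ip:port pair, scheme unknown
        return service.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("10.0.0.1:8020"));              // null
        System.out.println(schemeOf("hdfs://nn.example.com:8020")); // hdfs
    }
}
```

With the scheme recoverable from the service field, a plain FileSystem.get on that URI can pick the right implementation, and the hftp filesystem keeps its use of https as an internal detail.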

> Fix renewal of dfs delegation tokens
> 
>
> Key: MAPREDUCE-2764
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2764
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Fix For: 0.20.205.0
>
>
> The JT may have issues renewing hftp tokens, which can disrupt long distcp jobs.  
> The problem is that the JT's delegation token renewal code is built on brittle 
> assumptions.  The token's service field contains only the "ip:port" pair.  
> The renewal process assumes that the scheme must be hdfs.  If that fails due 
> to a {{VersionMismatchException}}, it tries https, based on another assumption: 
> that it must be hftp if it's not hdfs.  A number of other exceptions, most 
> commonly {{IOExceptions}}, can be generated, which fouls up the renewal since 
> it won't fall back to https.





[jira] [Created] (MAPREDUCE-2764) Fix renewal of dfs delegation tokens

2011-08-01 Thread Daryn Sharp (JIRA)
Fix renewal of dfs delegation tokens


 Key: MAPREDUCE-2764
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2764
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Daryn Sharp
Assignee: Daryn Sharp
 Fix For: 0.20.205.0


The JT may have issues renewing hftp tokens, which can disrupt long distcp jobs.  
The problem is that the JT's delegation token renewal code is built on brittle 
assumptions.  The token's service field contains only the "ip:port" pair.  The 
renewal process assumes that the scheme must be hdfs.  If that fails due to a 
{{VersionMismatchException}}, it tries https, based on another assumption: that 
it must be hftp if it's not hdfs.  A number of other exceptions, most commonly 
{{IOExceptions}}, can be generated, which fouls up the renewal since it won't 
fall back to https.





[jira] [Updated] (MAPREDUCE-2187) map tasks timeout during sorting

2011-08-01 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated MAPREDUCE-2187:
-

  Resolution: Fixed
Release Note: I just committed this. Thanks Anupam!
  Status: Resolved  (was: Patch Available)

> map tasks timeout during sorting
> 
>
> Key: MAPREDUCE-2187
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2187
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Gianmarco De Francisci Morales
>Assignee: Anupam Seth
> Fix For: 0.20.205.0
>
> Attachments: MAPREDUCE-2187-20-security-v2.patch, 
> MAPREDUCE-2187-20-security.patch, MAPREDUCE-2187-22.patch, 
> MAPREDUCE-2187-MR-279-v2.patch, MAPREDUCE-2187-branch-MR-279.patch, 
> MAPREDUCE-2187-trunk-v2.patch, MAPREDUCE-2187-trunk-v3.patch, 
> MAPREDUCE-2187-trunk.patch
>
>
> During the execution of a large job, the map tasks timeout:
> {code}
> INFO mapred.JobClient: Task Id : attempt_201010290414_60974_m_57_1, 
> Status : FAILED
> Task attempt_201010290414_60974_m_57_1 failed to report status for 609 
> seconds. Killing!
> {code}
> The bug is in the fact that the mapper has already finished, and, according 
> to the logs, the timeout occurs during the merge sort phase.
> The intermediate data generated by the map task is quite large. So I think 
> this is the problem.
> The logs show that the merge-sort was running for 10 minutes when the task 
> was killed.
> I think the mapred.Merger should call Reporter.progress() somewhere.
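As a standalone illustration of the suggested fix (not the actual mapred.Merger code), a k-way merge can report progress every few records so a liveness watchdog never sees a long silent stretch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only (not mapred.Merger): a k-way merge that calls
// a progress callback at a fixed record interval, so a watchdog that
// kills tasks after N seconds of silence sees the merge as alive.
public class ProgressiveMerge {
    interface Progressable { void progress(); }

    static final int REPORT_INTERVAL = 4; // records between progress calls

    static List<Integer> merge(List<int[]> runs, Progressable reporter) {
        // Heap entries: {value, runIndex, positionInRun}
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (runs.get(r).length > 0) heap.add(new int[]{runs.get(r)[0], r, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            if (out.size() % REPORT_INTERVAL == 0) reporter.progress();
            int[] run = runs.get(top[1]);
            int next = top[2] + 1;
            if (next < run.length) heap.add(new int[]{run[next], top[1], next});
        }
        return out;
    }
}
```

In the real task, the callback would be the task's Reporter.progress(), and the interval would be large enough that reporting cost is negligible next to the merge itself.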





[jira] [Updated] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated MAPREDUCE-2324:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this. Thanks Robert!

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-disable-check-v2.patch, MR-2324-security-v1.txt, 
> MR-2324-security-v2.txt, MR-2324-security-v3.patch, 
> MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073881#comment-13073881
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


 [exec] -1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] -1 tests included.  The patch doesn't appear to include any new 
or modified tests.
 [exec] Please justify why no tests are needed for 
this patch.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 1.3.9) warnings.


I did not add any tests because the change disables something that had no tests 
for it to begin with.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-disable-check-v2.patch, MR-2324-security-v1.txt, 
> MR-2324-security-v2.txt, MR-2324-security-v3.patch, 
> MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2705) tasks localized and launched serially by TaskLauncher - causing other tasks to be delayed

2011-08-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073839#comment-13073839
 ] 

Thomas Graves commented on MAPREDUCE-2705:
--

I reconfirmed that those failures aren't from this patch.  All tests that failed 
in the Hudson build 
(https://builds.apache.org/view/G-L/view/Hadoop/job/PreCommit-MAPREDUCE-Build/489)
 either passed when I ran them or also failed on trunk without this patch. 

Thanks for the review/commit!

> tasks localized and launched serially by TaskLauncher - causing other tasks 
> to be delayed
> -
>
> Key: MAPREDUCE-2705
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2705
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.205.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2705-branch20.patch, MAPREDUCE-2705-trunk.patch
>
>
> The current TaskLauncher serially launches new tasks one at a time. During 
> the launch it does the localization and then starts the map/reduce task.  
> This can cause any other tasks to be blocked waiting for the current task to 
> be localized and started. In some instances we have seen a task that has a 
> large file to localize (1.2MB) block another task for about 40 minutes. This 
> particular task being blocked was a cleanup task which caused the job to be 
> delayed finishing for the 40 minutes.





[jira] [Commented] (MAPREDUCE-2489) Jobsplits with random hostnames can make the queue unusable

2011-08-01 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073838#comment-13073838
 ] 

Mahadev konar commented on MAPREDUCE-2489:
--

Jeffrey,
 One minor nit, 

 The method:

{code}

  static void verifyHostnames(String[] names) throws UnknownHostException {
{code}

does not seem appropriate for the JobInProgress class; it should be moved out to 
a helper class. NetUtils seems more appropriate for this helper method. What do 
you think?
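For illustration, a self-contained helper in the spirit of the verifyHostnames method under discussion might look like this; the validation rules and names here are assumptions, not the actual patch:

```java
import java.util.regex.Pattern;

// Illustrative sketch (not the actual patch or NetUtils API): validate
// hostname syntax up front so the JobTracker never issues DNS lookups
// for garbage names coming from a bad custom InputSplit.
public class HostnameCheck {
    // RFC 952/1123-style label: letters, digits, hyphens; no leading/trailing hyphen.
    private static final Pattern LABEL =
        Pattern.compile("[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?");

    static boolean isValidHostname(String name) {
        if (name == null || name.isEmpty() || name.length() > 253) return false;
        for (String label : name.split("\\.", -1)) {
            if (label.length() == 0 || label.length() > 63) return false;
            if (!LABEL.matcher(label).matches()) return false;
        }
        return true;
    }

    // Fail fast with UnknownHostException instead of attempting resolution.
    static void verifyHostnames(String[] names) throws java.net.UnknownHostException {
        for (String n : names) {
            if (!isValidHostname(n)) {
                throw new java.net.UnknownHostException("Invalid hostname: " + n);
            }
        }
    }
}
```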


> Jobsplits with random hostnames can make the queue unusable
> ---
>
> Key: MAPREDUCE-2489
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2489
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.205.0, 0.23.0
>Reporter: Jeffrey Naisbitt
>Assignee: Jeffrey Naisbitt
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2489-0.20s-v2.patch, 
> MAPREDUCE-2489-0.20s-v3.patch, MAPREDUCE-2489-0.20s.patch, 
> MAPREDUCE-2489-mapred-v2.patch, MAPREDUCE-2489-mapred-v3.patch, 
> MAPREDUCE-2489-mapred-v4.patch, MAPREDUCE-2489-mapred.patch
>
>
> We saw an issue where a custom InputSplit was returning invalid hostnames for 
> the splits that were then causing the JobTracker to attempt to excessively 
> resolve host names.  This caused a major slowdown for the JobTracker.  We 
> should prevent invalid InputSplit hostnames from affecting everyone else.
> I propose we implement some verification for the hostnames to try to ensure 
> that we only do DNS lookups on valid hostnames (and fail otherwise).  We 
> could also fail the job after a certain number of failures in the resolve.





[jira] [Updated] (MAPREDUCE-1824) JobTracker should reuse file system handle for delegation token renewal

2011-08-01 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated MAPREDUCE-1824:
---

Fix Version/s: (was: 0.20.205.0)

This will not be fixed in 0.20.205; Daryn is going to create a new JIRA for the 
issue he intended to fix.

> JobTracker should reuse file system handle for delegation token renewal
> ---
>
> Key: MAPREDUCE-1824
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1824
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Jitendra Nath Pandey
>Assignee: Daryn Sharp
> Fix For: 0.23.0
>
> Attachments: MR-1824.1.patch
>
>
> In trunk, the DelegationTokenRenewal obtains the file system handle by 
> creating the URI out of the service in the token, which is ip:port. The 
> intention of this JIRA is to use the hostname of the namenode so that the file 
> system handle in the cache on the JobTracker can be reused. This JIRA was 
> created because such an optimization exists in the 0.20 code, and the attached 
> patch is a direct port of that code.





[jira] [Created] (MAPREDUCE-2763) IllegalArgumentException while using the dist cache

2011-08-01 Thread Ramya Sunil (JIRA)
IllegalArgumentException while using the dist cache
---

 Key: MAPREDUCE-2763
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2763
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Ramya Sunil
 Fix For: 0.23.0


An IllegalArgumentException is seen while using the distributed cache to cache 
some files and custom jars in the classpath.

A simple way to reproduce this error is by using a streaming job:
hadoop jar hadoop-streaming.jar -libjars file:// -input 
 -output out -mapper "cat" -reducer NONE -cacheFile  
hdfs://#linkname

This is a regression; the same command works fine on 0.20.x.





[jira] [Commented] (MAPREDUCE-2763) IllegalArgumentException while using the dist cache

2011-08-01 Thread Ramya Sunil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073835#comment-13073835
 ] 

Ramya Sunil commented on MAPREDUCE-2763:


Below is the complete stacktrace:
{noformat}
Exception in thread "main" java.lang.IllegalArgumentException: Invalid 
specification for distributed-cache artifacts of type FILE : #uris=1 
#timestamps=2 #visibilities=2
at 
org.apache.hadoop.mapred.YARNRunner.parseDistributedCacheArtifacts(YARNRunner.java:411)
at 
org.apache.hadoop.mapred.YARNRunner.setupDistributedCache(YARNRunner.java:392)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:234)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:388)
at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1064)
at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1061)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1061)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:534)
at 
org.apache.hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java:1010)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:133)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
at 
org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
{noformat}
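The message suggests the distributed-cache metadata travels as parallel lists whose lengths must agree. A hypothetical standalone sketch of such a consistency check (the real logic lives in YARNRunner.parseDistributedCacheArtifacts; the names here are illustrative):

```java
// Illustrative sketch of the consistency check behind this exception:
// distributed-cache metadata is carried as parallel arrays, and the
// submitter rejects the job when their lengths disagree.
public class CacheArtifactCheck {
    static void checkArtifacts(String type, String[] uris,
                               long[] timestamps, boolean[] visibilities) {
        if (uris == null) return; // no artifacts of this type
        int nTimes = (timestamps == null) ? 0 : timestamps.length;
        int nVis = (visibilities == null) ? 0 : visibilities.length;
        if (uris.length != nTimes || uris.length != nVis) {
            throw new IllegalArgumentException(
                "Invalid specification for distributed-cache artifacts of type "
                + type + " : #uris=" + uris.length
                + " #timestamps=" + nTimes
                + " #visibilities=" + nVis);
        }
    }
}
```

The stack trace above (#uris=1 vs. #timestamps=2 #visibilities=2) is consistent with one of the submission paths adding the -cacheFile entry to the timestamp/visibility lists but not to the URI list.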

> IllegalArgumentException while using the dist cache
> ---
>
> Key: MAPREDUCE-2763
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2763
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2
>Affects Versions: 0.23.0
>Reporter: Ramya Sunil
> Fix For: 0.23.0
>
>
> An IllegalArgumentException is seen while using the distributed cache to cache 
> some files and custom jars in the classpath.
> A simple way to reproduce this error is by using a streaming job:
> hadoop jar hadoop-streaming.jar -libjars file:// -input 
>  -output out -mapper "cat" -reducer NONE -cacheFile  
> hdfs://#linkname
> This is a regression; the same command works fine on 0.20.x





[jira] [Updated] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-2324:
---

Attachment: MR-2324-disable-check-v2.patch

I should have the test-patch results soon.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-disable-check-v2.patch, MR-2324-security-v1.txt, 
> MR-2324-security-v2.txt, MR-2324-security-v3.patch, 
> MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Updated] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-2324:
---

Status: Patch Available  (was: Open)

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-disable-check-v2.patch, MR-2324-security-v1.txt, 
> MR-2324-security-v2.txt, MR-2324-security-v3.patch, 
> MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Updated] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-2324:
---

Status: Open  (was: Patch Available)

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073827#comment-13073827
 ] 

Koji Noguchi commented on MAPREDUCE-2324:
-

bq. Should we just disable that check?
+1

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Updated] (MAPREDUCE-2187) map tasks timeout during sorting

2011-08-01 Thread Anupam Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anupam Seth updated MAPREDUCE-2187:
---

Attachment: MAPREDUCE-2187-trunk-v3.patch

> map tasks timeout during sorting
> 
>
> Key: MAPREDUCE-2187
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2187
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Gianmarco De Francisci Morales
>Assignee: Anupam Seth
> Fix For: 0.20.205.0
>
> Attachments: MAPREDUCE-2187-20-security-v2.patch, 
> MAPREDUCE-2187-20-security.patch, MAPREDUCE-2187-22.patch, 
> MAPREDUCE-2187-MR-279-v2.patch, MAPREDUCE-2187-branch-MR-279.patch, 
> MAPREDUCE-2187-trunk-v2.patch, MAPREDUCE-2187-trunk-v3.patch, 
> MAPREDUCE-2187-trunk.patch
>
>
> During the execution of a large job, the map tasks timeout:
> {code}
> INFO mapred.JobClient: Task Id : attempt_201010290414_60974_m_57_1, 
> Status : FAILED
> Task attempt_201010290414_60974_m_57_1 failed to report status for 609 
> seconds. Killing!
> {code}
> The bug is in the fact that the mapper has already finished, and, according 
> to the logs, the timeout occurs during the merge sort phase.
> The intermediate data generated by the map task is quite large. So I think 
> this is the problem.
> The logs show that the merge-sort was running for 10 minutes when the task 
> was killed.
> I think the mapred.Merger should call Reporter.progress() somewhere.





[jira] [Commented] (MAPREDUCE-2187) map tasks timeout during sorting

2011-08-01 Thread Anupam Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073826#comment-13073826
 ] 

Anupam Seth commented on MAPREDUCE-2187:


Thanks a lot Arun! 

I am attaching a revised patch for trunk.

> map tasks timeout during sorting
> 
>
> Key: MAPREDUCE-2187
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2187
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Gianmarco De Francisci Morales
>Assignee: Anupam Seth
> Fix For: 0.20.205.0
>
> Attachments: MAPREDUCE-2187-20-security-v2.patch, 
> MAPREDUCE-2187-20-security.patch, MAPREDUCE-2187-22.patch, 
> MAPREDUCE-2187-MR-279-v2.patch, MAPREDUCE-2187-branch-MR-279.patch, 
> MAPREDUCE-2187-trunk-v2.patch, MAPREDUCE-2187-trunk-v3.patch, 
> MAPREDUCE-2187-trunk.patch
>
>
> During the execution of a large job, the map tasks timeout:
> {code}
> INFO mapred.JobClient: Task Id : attempt_201010290414_60974_m_57_1, 
> Status : FAILED
> Task attempt_201010290414_60974_m_57_1 failed to report status for 609 
> seconds. Killing!
> {code}
> The bug is in the fact that the mapper has already finished, and, according 
> to the logs, the timeout occurs during the merge sort phase.
> The intermediate data generated by the map task is quite large. So I think 
> this is the problem.
> The logs show that the merge-sort was running for 10 minutes when the task 
> was killed.
> I think the mapred.Merger should call Reporter.progress() somewhere.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073824#comment-13073824
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


That would also fix the problem.  I should be able to have a patch for that 
very quickly.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073817#comment-13073817
 ] 

Arun C Murthy commented on MAPREDUCE-2324:
--

On second thoughts, since ResourceEstimator.getEstimatedReduceInputSize is 
broken (as of now), one option is to not use it for comparing against available 
space on TT. Should we just disable that check?

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073816#comment-13073816
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


+1 - that should be enough to unblock us on the sustaining release.  And we can 
look at what the correct thing to do in YARN is.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073813#comment-13073813
 ] 

Arun C Murthy commented on MAPREDUCE-2324:
--

Robert - the problem with reduce.input.limit was not the 'right' value for the 
constant, but the fact that 'guessing' the reduce input was broken.

For now, should we commit the logging change while you investigate if we can 
fix the 'guess'? 

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2187) map tasks timeout during sorting

2011-08-01 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073810#comment-13073810
 ] 

Arun C Murthy commented on MAPREDUCE-2187:
--

Anupam - please define the config key in MRJobConfig for trunk.

I've committed the patch to 0.20.205. Thanks.

> map tasks timeout during sorting
> 
>
> Key: MAPREDUCE-2187
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2187
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Gianmarco De Francisci Morales
>Assignee: Anupam Seth
> Fix For: 0.20.205.0
>
> Attachments: MAPREDUCE-2187-20-security-v2.patch, 
> MAPREDUCE-2187-20-security.patch, MAPREDUCE-2187-22.patch, 
> MAPREDUCE-2187-MR-279-v2.patch, MAPREDUCE-2187-branch-MR-279.patch, 
> MAPREDUCE-2187-trunk-v2.patch, MAPREDUCE-2187-trunk.patch
>
>
> During the execution of a large job, the map tasks timeout:
> {code}
> INFO mapred.JobClient: Task Id : attempt_201010290414_60974_m_57_1, 
> Status : FAILED
> Task attempt_201010290414_60974_m_57_1 failed to report status for 609 
> seconds. Killing!
> {code}
> The bug is that the mapper has already finished and, according 
> to the logs, the timeout occurs during the merge-sort phase.
> The intermediate data generated by the map task is quite large, so I think 
> that is the problem.
> The logs show that the merge-sort had been running for 10 minutes when the 
> task was killed.
> I think mapred.Merger should call Reporter.progress() somewhere.
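The fix suggested above can be sketched as follows. The interface name Progressable matches Hadoop's, but the merge loop is a stand-in for mapred.Merger, not the real implementation: the point is that each segment processed triggers a progress() call, so a long merge keeps reporting status instead of timing out.

```java
// Illustrative only: a merge loop that reports progress per segment so the
// TaskTracker does not kill the task during a long merge.
class MergeProgressSketch {
    interface Progressable { void progress(); }

    static long mergeSegments(long[] segmentSizes, Progressable reporter) {
        long total = 0;
        for (long size : segmentSizes) {
            total += size;        // stand-in for merging one segment
            reporter.progress();  // heartbeat so the task is not timed out
        }
        return total;
    }
}
```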

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2729) Reducers are always counted having "pending tasks" even if they can't be scheduled yet because not enough of their mappers have completed

2011-08-01 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073803#comment-13073803
 ] 

Arun C Murthy commented on MAPREDUCE-2729:
--

To clarify: please run it on a cluster of 5-10 nodes, verify the fix manually, 
and let me know. Thanks.

> Reducers are always counted having "pending tasks" even if they can't be 
> scheduled yet because not enough of their mappers have completed
> -
>
> Key: MAPREDUCE-2729
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2729
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.205.0
> Environment: 0.20.1xx-Secondary
>Reporter: Sherry Chen
>Assignee: Sherry Chen
> Fix For: 0.20.205.0
>
> Attachments: MAPREDUCE-2729.patch
>
>
> In the capacity scheduler, the number of users in a queue needing slots is 
> calculated based on whether the users' jobs have any pending tasks.
> This works fine for map tasks. However, for reduce tasks, jobs do not need 
> reduce slots until the minimum number of map tasks has been completed.
> Here, we add a check for whether a reduce is ready to be scheduled (i.e., 
> whether a job has completed enough map tasks) before incrementing the number 
> of users in a queue needing reduce slots.
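The readiness check described above can be sketched like this. Method and parameter names are illustrative, not the scheduler's actual API: a job's reduces should count as pending only once enough maps have finished.

```java
// Hypothetical sketch of the "is the reduce ready to schedule" test: the
// job must have finished at least slowstartFraction of its maps before its
// reduces count toward users needing reduce slots.
class ReduceReadiness {
    static boolean reducesReady(int finishedMaps, int totalMaps,
                                double slowstartFraction) {
        return finishedMaps >= Math.ceil(slowstartFraction * totalMaps);
    }
}
```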

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2705) tasks localized and launched serially by TaskLauncher - causing other tasks to be delayed

2011-08-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073805#comment-13073805
 ] 

Thomas Graves commented on MAPREDUCE-2705:
--

when I first submitted the patch the tests on trunk were already broken. For 
instance see 
https://builds.apache.org/view/G-L/view/Hadoop/job/PreCommit-MAPREDUCE-Build/472/
 which is for a different JIRA.  Let me rerun now just to double-check and I'll 
update you on results.

> tasks localized and launched serially by TaskLauncher - causing other tasks 
> to be delayed
> -
>
> Key: MAPREDUCE-2705
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2705
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.205.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2705-branch20.patch, MAPREDUCE-2705-trunk.patch
>
>
> The current TaskLauncher launches new tasks serially, one at a time. During 
> the launch it does the localization and then starts the map/reduce task.  
> This can cause other tasks to be blocked waiting for the current task to 
> be localized and started. In some instances we have seen a task with a 
> large file to localize (1.2MB) block another task for about 40 minutes. The 
> particular task being blocked was a cleanup task, which delayed the job's 
> completion by those 40 minutes.
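One possible shape of a fix for the serialization described above - hypothetical, not the actual patch - is to hand localization off to a small thread pool so a single slow download cannot block every other task launch behind it:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: localize tasks on a pool instead of the single launcher
// thread; the caller starts the map/reduce task once its future completes.
class ParallelLocalizer {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    /** Run localization asynchronously and return a handle to wait on. */
    Future<?> localizeAsync(Runnable localization) {
        return pool.submit(localization);
    }

    void shutdown() { pool.shutdown(); }
}
```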

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2494) Make the distributed cache delete entries using LRU priority

2011-08-01 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated MAPREDUCE-2494:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Just pushed this to the 0.20-security branch. Thanks, Bobby!

> Make the distributed cache delete entries using LRU priority
> 
>
> Key: MAPREDUCE-2494
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2494
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distributed-cache
>Affects Versions: 0.20.205.0, 0.21.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2494-20.20X-V1.patch, 
> MAPREDUCE-2494-20.20X-V3.patch, MAPREDUCE-2494-V1.patch, 
> MAPREDUCE-2494-V2.patch
>
>
> Currently the distributed cache will wait until a cache directory is above a 
> preconfigured threshold, at which point it will delete all entries that are 
> not currently being used.  It seems like we would get far fewer cache misses 
> if we kept some of them around, even when they are not being used.  We should 
> add a configurable percentage for how much of the cache should 
> remain clear when not in use, and select objects to delete based on how 
> recently they were used, and possibly also how large they are and how 
> difficult they are to download again.
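The LRU selection described above can be sketched with a LinkedHashMap in access-order mode. This is a hedged illustration: size accounting and the "keep a percentage clear" goal are simplified to a fixed entry capacity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: access-order LinkedHashMap gives least-recently-used iteration
// for free, so the eldest entry is always the next eviction candidate.
class DistCacheLru<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    DistCacheLru(int capacity) {
        super(16, 0.75f, true);  // true => iterate in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict the least-recently-used entry
    }
}
```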

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2705) tasks localized and launched serially by TaskLauncher - causing other tasks to be delayed

2011-08-01 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073798#comment-13073798
 ] 

Devaraj Das commented on MAPREDUCE-2705:


Committed the patch in branch-0.20-security. Thomas, there are a bunch of test 
failures in the build for trunk. Can you please confirm that these are harmless?

> tasks localized and launched serially by TaskLauncher - causing other tasks 
> to be delayed
> -
>
> Key: MAPREDUCE-2705
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2705
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.20.205.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2705-branch20.patch, MAPREDUCE-2705-trunk.patch
>
>
> The current TaskLauncher launches new tasks serially, one at a time. During 
> the launch it does the localization and then starts the map/reduce task.  
> This can cause other tasks to be blocked waiting for the current task to 
> be localized and started. In some instances we have seen a task with a 
> large file to localize (1.2MB) block another task for about 40 minutes. The 
> particular task being blocked was a cleanup task, which delayed the job's 
> completion by those 40 minutes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2621) TestCapacityScheduler fails with "Queue "q1" does not exist"

2011-08-01 Thread Matt Foley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated MAPREDUCE-2621:
--

Fix Version/s: 0.20.204.0

> TestCapacityScheduler fails with "Queue "q1" does not exist"
> 
>
> Key: MAPREDUCE-2621
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2621
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.204.0, 0.20.205.0
> Environment: 0.20.1xx-Secondary 
>Reporter: Sherry Chen
>Assignee: Sherry Chen
>Priority: Minor
> Fix For: 0.20.204.0, 0.20.205.0
>
> Attachments: MAPREDUCE-2621.patch, MAPREDUCE-2621_1.patch
>
>
> {quote}
> Error Message
> Queue "q1" does not exist
> Stacktrace
> java.io.IOException: Queue "q1" does not exist
>   at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:354)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler$FakeJobInProgress.<init>(TestCapacityScheduler.java:172)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler.submitJob(TestCapacityScheduler.java:794)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler.submitJob(TestCapacityScheduler.java:818)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler.submitJobAndInit(TestCapacityScheduler.java:825)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler.testMultiTaskAssignmentInMultipleQueues(TestCapacityScheduler.java:1109)
> {quote}
> An exception is now thrown when the queue name is invalid.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2494) Make the distributed cache delete entries using LRU priority

2011-08-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073783#comment-13073783
 ] 

Robert Joseph Evans commented on MAPREDUCE-2494:


I have confirmed it.  The deletion is happening in LRU order.

> Make the distributed cache delete entries using LRU priority
> 
>
> Key: MAPREDUCE-2494
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2494
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distributed-cache
>Affects Versions: 0.20.205.0, 0.21.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2494-20.20X-V1.patch, 
> MAPREDUCE-2494-20.20X-V3.patch, MAPREDUCE-2494-V1.patch, 
> MAPREDUCE-2494-V2.patch
>
>
> Currently the distributed cache will wait until a cache directory is above a 
> preconfigured threshold, at which point it will delete all entries that are 
> not currently being used.  It seems like we would get far fewer cache misses 
> if we kept some of them around, even when they are not being used.  We should 
> add a configurable percentage for how much of the cache should 
> remain clear when not in use, and select objects to delete based on how 
> recently they were used, and possibly also how large they are and how 
> difficult they are to download again.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2621) TestCapacityScheduler fails with "Queue "q1" does not exist"

2011-08-01 Thread Matt Foley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated MAPREDUCE-2621:
--

Affects Version/s: 0.20.204.0

This was merged to 0.20-security-204:

r1150857 | ddas | 2011-07-25 19:28:14 + (Mon, 25 Jul 2011) | 1 line

Merge -r 1150527:1150528 from branch-0.20-security onto branch-0.20-security-204


The merged -r1150527:1150528 is really -c1150528, which, as seen above, is the 
fix for this bug (MAPREDUCE-2621) in the 0.20-security branch.

> TestCapacityScheduler fails with "Queue "q1" does not exist"
> 
>
> Key: MAPREDUCE-2621
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2621
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.204.0, 0.20.205.0
> Environment: 0.20.1xx-Secondary 
>Reporter: Sherry Chen
>Assignee: Sherry Chen
>Priority: Minor
> Fix For: 0.20.205.0
>
> Attachments: MAPREDUCE-2621.patch, MAPREDUCE-2621_1.patch
>
>
> {quote}
> Error Message
> Queue "q1" does not exist
> Stacktrace
> java.io.IOException: Queue "q1" does not exist
>   at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:354)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler$FakeJobInProgress.<init>(TestCapacityScheduler.java:172)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler.submitJob(TestCapacityScheduler.java:794)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler.submitJob(TestCapacityScheduler.java:818)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler.submitJobAndInit(TestCapacityScheduler.java:825)
>   at 
> org.apache.hadoop.mapred.TestCapacityScheduler.testMultiTaskAssignmentInMultipleQueues(TestCapacityScheduler.java:1109)
> {quote}
> An exception is now thrown when the queue name is invalid.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2494) Make the distributed cache delete entries using LRU priority

2011-08-01 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073691#comment-13073691
 ] 

Mahadev konar commented on MAPREDUCE-2494:
--

Thanks, Bobby. I'll push it as soon as you confirm the results!

> Make the distributed cache delete entries using LRU priority
> 
>
> Key: MAPREDUCE-2494
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2494
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distributed-cache
>Affects Versions: 0.20.205.0, 0.21.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2494-20.20X-V1.patch, 
> MAPREDUCE-2494-20.20X-V3.patch, MAPREDUCE-2494-V1.patch, 
> MAPREDUCE-2494-V2.patch
>
>
> Currently the distributed cache will wait until a cache directory is above a 
> preconfigured threshold, at which point it will delete all entries that are 
> not currently being used.  It seems like we would get far fewer cache misses 
> if we kept some of them around, even when they are not being used.  We should 
> add a configurable percentage for how much of the cache should 
> remain clear when not in use, and select objects to delete based on how 
> recently they were used, and possibly also how large they are and how 
> difficult they are to download again.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2494) Make the distributed cache delete entries using LRU priority

2011-08-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073680#comment-13073680
 ] 

Robert Joseph Evans commented on MAPREDUCE-2494:


I have now run it on a 10 node cluster with gridmix and everything looks 
stable.  I still need to verify that the cache deletes things in LRU order, but 
I should have those results shortly.

> Make the distributed cache delete entries using LRU priority
> 
>
> Key: MAPREDUCE-2494
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2494
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distributed-cache
>Affects Versions: 0.20.205.0, 0.21.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2494-20.20X-V1.patch, 
> MAPREDUCE-2494-20.20X-V3.patch, MAPREDUCE-2494-V1.patch, 
> MAPREDUCE-2494-V2.patch
>
>
> Currently the distributed cache will wait until a cache directory is above a 
> preconfigured threshold, at which point it will delete all entries that are 
> not currently being used.  It seems like we would get far fewer cache misses 
> if we kept some of them around, even when they are not being used.  We should 
> add a configurable percentage for how much of the cache should 
> remain clear when not in use, and select objects to delete based on how 
> recently they were used, and possibly also how large they are and how 
> difficult they are to download again.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073677#comment-13073677
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


I have been able to run gridmix on a 10-node cluster, and everything looks 
stable.  I have not been able to run it on anything larger because the 
processes here are really not set up to do that very easily.  In the past, 
gridmix has been run at scale after the branch is in QA, not before, so the 
tools are not set up to deploy from a dev branch.  Plus, I have to get 
approval from lots of people to make that happen.  I am trying to see if I 
can still do it, but I am not very hopeful that it will happen any time soon.

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (MAPREDUCE-2762) [MR-279] - Cleanup staging dir after job completion

2011-08-01 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar reassigned MAPREDUCE-2762:


Assignee: Mahadev konar

> [MR-279] - Cleanup staging dir after job completion
> ---
>
> Key: MAPREDUCE-2762
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2762
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2
>Affects Versions: 0.23.0
>Reporter: Ramya Sunil
>Assignee: Mahadev konar
> Fix For: 0.23.0
>
>
> The files created under the staging dir have to be deleted after job 
> completion. Currently, all job.* files remain forever in the 
> ${yarn.apps.stagingDir}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-2762) [MR-279] - Cleanup staging dir after job completion

2011-08-01 Thread Ramya Sunil (JIRA)
[MR-279] - Cleanup staging dir after job completion
---

 Key: MAPREDUCE-2762
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2762
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Ramya Sunil
 Fix For: 0.23.0


The files created under the staging dir have to be deleted after job 
completion. Currently, all job.* files remain forever in the 
${yarn.apps.stagingDir}
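The cleanup described above amounts to deleting the job.* files once the job finishes. A minimal sketch, using local java.io in place of the Hadoop FileSystem API that the real fix would go through:

```java
import java.io.File;

// Sketch only: remove the job.* files left under the staging directory
// after job completion, returning how many were deleted.
class StagingCleanup {
    static int deleteJobFiles(File stagingDir) {
        File[] files = stagingDir.listFiles(
            (dir, name) -> name.startsWith("job."));
        if (files == null) return 0;  // not a directory, or I/O error
        int deleted = 0;
        for (File f : files) {
            if (f.delete()) deleted++;
        }
        return deleted;
    }
}
```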

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2740) MultipleOutputs in new API creates needless TaskAttemptContexts

2011-08-01 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated MAPREDUCE-2740:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

> MultipleOutputs in new API creates needless TaskAttemptContexts
> ---
>
> Key: MAPREDUCE-2740
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2740
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 0.23.0
>
> Attachments: mr-2740.txt
>
>
> MultipleOutputs.write creates a new TaskAttemptContext, which we've seen 
> take a significant amount of CPU. The TaskAttemptContext constructor creates 
> a JobConf, gets the current UGI, etc. I don't see any reason it needs to do 
> this, instead of just creating a single TaskAttemptContext when the 
> InputFormat is created (or lazily, but cached as a member).
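The "create lazily, cache as a member" idea above can be sketched as follows. Context here is a placeholder for the real TaskAttemptContext, whose construction (JobConf creation, UGI lookup) is what makes repeated creation expensive:

```java
// Sketch: construct the expensive context once on first use and reuse it
// on every subsequent call, instead of building a fresh one per write.
class CachedContextHolder {
    static int constructions;                 // counts expensive creations
    static class Context { Context() { constructions++; } }

    private Context cached;

    Context getContext() {
        if (cached == null) {
            cached = new Context();           // create once, lazily
        }
        return cached;                        // reuse on later calls
    }
}
```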

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2740) MultipleOutputs in new API creates needless TaskAttemptContexts

2011-08-01 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073629#comment-13073629
 ] 

Todd Lipcon commented on MAPREDUCE-2740:


I ran TestMultipleOutputs and TestMRMultipleOutputs and they pass. Those are 
the only tests which reference the changed code. I'll commit this to trunk 
momentarily.

> MultipleOutputs in new API creates needless TaskAttemptContexts
> ---
>
> Key: MAPREDUCE-2740
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2740
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 0.23.0
>
> Attachments: mr-2740.txt
>
>
> MultipleOutputs.write creates a new TaskAttemptContext, which we've seen 
> take a significant amount of CPU. The TaskAttemptContext constructor creates 
> a JobConf, gets the current UGI, etc. I don't see any reason it needs to do 
> this, instead of just creating a single TaskAttemptContext when the 
> InputFormat is created (or lazily, but cached as a member).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-2761) New TaskController code doesn't run on Windows

2011-08-01 Thread Todd Lipcon (JIRA)
New TaskController code doesn't run on Windows
--

 Key: MAPREDUCE-2761
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2761
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task-controller, tasktracker
Affects Versions: 0.20.204.0, 0.23.0
Reporter: Todd Lipcon


After MAPREDUCE-2178, TaskController assumes that pids are always available. 
The shell executor object that's used to launch a JVM isn't retained, but 
rather the pid is set when the task heartbeats. On Windows, there are no pids, 
and since the ShellCommandExecutor object is no longer around, we can't call 
process.destroy(). So, the TaskController doesn't work on Cygwin anymore.
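A portable kill path implied by the report might look like the following sketch. This is hypothetical, not the actual TaskController: prefer the pid when the platform provides one, and fall back to a retained process handle (the ShellCommandExecutor reference the report says is no longer kept around) for process.destroy().

```java
// Hypothetical sketch: choose how to kill a task depending on what the
// platform gives us. With no pid and no retained handle, we hit the
// Windows/Cygwin failure mode described in the report.
class PortableKiller {
    static String chooseKillStrategy(Integer pid, Object retainedProcess) {
        if (pid != null) return "signal-pid";           // Linux path
        if (retainedProcess != null) return "process-destroy"; // Windows path
        return "cannot-kill";                           // the reported bug
    }
}
```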

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2760) mapreduce.jobtracker.split.metainfo.maxsize typoed in mapred-default.xml

2011-08-01 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated MAPREDUCE-2760:
---

Status: Patch Available  (was: Open)

> mapreduce.jobtracker.split.metainfo.maxsize typoed in mapred-default.xml
> 
>
> Key: MAPREDUCE-2760
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2760
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: 0.23.0
>
> Attachments: mr-2760.txt
>
>
> The configuration mapreduce.jobtracker.split.metainfo.maxsize is incorrectly 
> included in mapred-default.xml as mapreduce.*job*.split.metainfo.maxsize. It 
> seems that {{jobtracker}} is correct, since this is a JT-wide property rather 
> than a job property.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2760) mapreduce.jobtracker.split.metainfo.maxsize typoed in mapred-default.xml

2011-08-01 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated MAPREDUCE-2760:
---

Attachment: mr-2760.txt

Trivial patch

> mapreduce.jobtracker.split.metainfo.maxsize typoed in mapred-default.xml
> 
>
> Key: MAPREDUCE-2760
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2760
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: 0.23.0
>
> Attachments: mr-2760.txt
>
>
> The configuration mapreduce.jobtracker.split.metainfo.maxsize is incorrectly 
> included in mapred-default.xml as mapreduce.*job*.split.metainfo.maxsize. It 
> seems that {{jobtracker}} is correct, since this is a JT-wide property rather 
> than a job property.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-2760) mapreduce.jobtracker.split.metainfo.maxsize typoed in mapred-default.xml

2011-08-01 Thread Todd Lipcon (JIRA)
mapreduce.jobtracker.split.metainfo.maxsize typoed in mapred-default.xml


 Key: MAPREDUCE-2760
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2760
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.23.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor
 Fix For: 0.23.0


The configuration mapreduce.jobtracker.split.metainfo.maxsize is incorrectly 
included in mapred-default.xml as mapreduce.*job*.split.metainfo.maxsize. It 
seems that {{jobtracker}} is correct, since this is a JT-wide property rather 
than a job property.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2309) While querying the Job Statistics from the command line, if we give a wrong status name there is no warning or response.

2011-08-01 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated MAPREDUCE-2309:
-

Fix Version/s: 0.23.0
   0.20.4

> While querying the Job Statics from the command-line, if we give wrong status 
> name then there is no warning or response.
> 
>
> Key: MAPREDUCE-2309
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2309
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.23.0
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 0.20.4, 0.23.0
>
> Attachments: MAPREDUCE-2309-0.20.patch, MAPREDUCE-2309-trunk.patch
>
>
> If we try to get the jobs information by giving the wrong status name from 
> the command line interface, it is not giving any warning or response.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2494) Make the distributed cache delete entries using LRU priority

2011-08-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073552#comment-13073552
 ] 

Robert Joseph Evans commented on MAPREDUCE-2494:


I have done some manual verification, but nothing on a cluster of more than one 
node. I will see what I can do on a small cluster.

> Make the distributed cache delete entries using LRU priority
> 
>
> Key: MAPREDUCE-2494
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2494
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distributed-cache
>Affects Versions: 0.20.205.0, 0.21.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: MAPREDUCE-2494-20.20X-V1.patch, 
> MAPREDUCE-2494-20.20X-V3.patch, MAPREDUCE-2494-V1.patch, 
> MAPREDUCE-2494-V2.patch
>
>
> Currently the distributed cache waits until a cache directory is above a 
> preconfigured threshold, at which point it deletes all entries that are not 
> currently being used. It seems we would get far fewer cache misses if we 
> kept some of them around even when they are not in use. We should add a 
> configurable target for how much of the cache should remain clear when not 
> in use, and select objects to delete based on how recently they were used, 
> and possibly also on how large they are and how difficult it would be to 
> download them again.
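The proposed eviction policy can be sketched as below. CacheDir and its methods are hypothetical names for illustration, not Hadoop's DistributedCache code; the point is that eviction walks entries in least-recently-used order and stops once the configured fraction of capacity is free, instead of deleting every unused entry:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU-evicting cache directory (illustrative only). */
class CacheDir {
    private final long capacityBytes;
    private final double targetFreeFraction; // e.g. 0.25 -> keep 25% clear
    private long usedBytes = 0;
    // Access-order LinkedHashMap iterates least-recently-used first.
    private final LinkedHashMap<String, Long> entries =
            new LinkedHashMap<>(16, 0.75f, true);

    CacheDir(long capacityBytes, double targetFreeFraction) {
        this.capacityBytes = capacityBytes;
        this.targetFreeFraction = targetFreeFraction;
    }

    /** Localize (download) an entry, then evict LRU entries if over target. */
    void localize(String name, long sizeBytes) {
        Long previous = entries.put(name, sizeBytes); // now most recently used
        if (previous != null) {
            usedBytes -= previous;
        }
        usedBytes += sizeBytes;
        evictIfNeeded();
    }

    /** Record a cache hit so the entry moves to the most-recently-used end. */
    void access(String name) {
        entries.get(name);
    }

    private void evictIfNeeded() {
        long goal = (long) (capacityBytes * (1.0 - targetFreeFraction));
        Iterator<Map.Entry<String, Long>> it = entries.entrySet().iterator();
        while (usedBytes > goal && it.hasNext()) {
            Map.Entry<String, Long> lru = it.next();
            usedBytes -= lru.getValue();
            it.remove();
        }
    }

    boolean contains(String name) { return entries.containsKey(name); }
    long used() { return usedBytes; }
}
```

A real implementation would also weigh entry size and re-download cost, as the description suggests, rather than recency alone.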

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073550#comment-13073550
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:


I did initially look at trying to fix reduce.input.limit. Currently someone 
has to guess the value manually, and the right value is likely to change as 
data is added to and deleted from the cluster. If it is wrong, then either too 
many jobs fail that would otherwise have succeeded, or some jobs, probably a 
very small number, starve and never finish.

To fix it, Hadoop would have to set reduce.input.limit dynamically, and the 
only way I can think of to do that would be to gather statistics about all the 
nodes in the cluster and try to predict how likely this particular reduce is 
ever to find the space it needs on a node. I believe we can compute the mean 
and an X% confidence interval for disk space on the cluster without too much 
difficulty, but I have my doubts that this will apply to a small cluster. From 
what I have read, anything under 40 samples tends to be suspect, so it might 
not work for a cluster under 40 nodes. I am also not sure how the statistics 
would apply to this particular situation. Would we want to compute this from a 
recent history of the cluster, or just from a snapshot of its current state? 
If history, how far back would we go, and how would we handle some nodes 
heartbeating in more regularly than others? I am not a statistician and could 
not find one to look over my work, so instead I took a bit more of a 
brute-force approach that I know will work.

If you know a statistician who could provide a robust solution, or can at 
least tell me what, if anything, I am doing wrong, I would be very happy to 
implement it.
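The mean/confidence-interval idea mentioned in the comment could be sketched as follows. All names here are hypothetical, and the z = 1.96 cutoff (a ~95% normal approximation) is one arbitrary choice; as the comment notes, the approximation is suspect below roughly 40 samples:

```java
import java.util.Arrays;

/** Hypothetical free-space statistics over TaskTracker heartbeat samples. */
class FreeSpaceStats {
    /** Sample mean of per-node free-space reports (bytes). */
    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    /** Sample standard deviation (n - 1 denominator). */
    static double stdDev(long[] samples) {
        double m = mean(samples);
        double ss = 0.0;
        for (long s : samples) {
            ss += (s - m) * (s - m);
        }
        return Math.sqrt(ss / (samples.length - 1));
    }

    /**
     * Rough test of whether a reduce needing {@code needed} bytes is ever
     * likely to find a node with that much free space: true if the
     * requirement falls within the upper ~95% bound of the fitted normal.
     */
    static boolean likelySchedulable(long[] samples, long needed) {
        return needed <= mean(samples) + 1.96 * stdDev(samples);
    }
}
```

This illustrates why the comment is cautious: the decision quality depends entirely on whether heartbeat samples are representative, which is exactly the history-versus-snapshot question raised above.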

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2243) Close all the file streams properly in a finally block to avoid their leakage.

2011-08-01 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated MAPREDUCE-2243:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

+1 patch looks good

I have committed this. Thanks, Devaraj!

> Close all the file streams properly in a finally block to avoid their leakage.
> -
>
> Key: MAPREDUCE-2243
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2243
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker, tasktracker
>Affects Versions: 0.22.0, 0.23.0
> Environment: NA
>Reporter: Bhallamudi Venkata Siva Kamesh
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-2243-1.patch, MAPREDUCE-2243-2.patch, 
> MAPREDUCE-2243-3.patch, MAPREDUCE-2243-4.patch, MAPREDUCE-2243.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the following classes, streams should be closed in a finally block to 
> avoid leaks in exceptional cases.
> CompletedJobStatusStore.java
> --
>dataOut.writeInt(events.length);
> for (TaskCompletionEvent event : events) {
>   event.write(dataOut);
> }
>dataOut.close() ;
> EventWriter.java
> --
>encoder.flush();
>out.close();
> MapTask.java
> ---
> splitMetaInfo.write(out);
>  out.close();
> TaskLog
> 
>  1) str = fis.readLine();
>   fis.close();
> 2) dos.writeBytes(Long.toString(new File(logLocation, LogName.SYSLOG
>   .toString()).length() - prevLogLength) + "\n");
> dos.close();
> TotalOrderPartitioner.java
> ---
>  while (reader.next(key, value)) {
> parts.add(key);
> key = ReflectionUtils.newInstance(keyClass, conf);
>   }
> reader.close();
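The fix pattern for the snippets quoted above can be sketched as below. StreamCloser and closeQuietly are illustrative helpers, not the actual patch; on Java 7+ try-with-resources is the usual idiom for the same thing:

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustrative helper for closing streams in a finally block. */
class StreamCloser {
    /** Close without masking an in-flight exception from the try block. */
    static void closeQuietly(Closeable c) {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (IOException ignored) {
            // Swallow so a close failure does not hide the original error.
        }
    }
}

// Usage, e.g. for the CompletedJobStatusStore snippet (setup is hypothetical):
//   DataOutputStream dataOut = null;
//   try {
//     dataOut = fileSystem.create(path);
//     dataOut.writeInt(events.length);
//     for (TaskCompletionEvent event : events) {
//       event.write(dataOut);
//     }
//   } finally {
//     StreamCloser.closeQuietly(dataOut);
//   }
```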

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2243) Close all the file streams properly in a finally block to avoid their leakage.

2011-08-01 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073508#comment-13073508
 ] 

Devaraj K commented on MAPREDUCE-2243:
--

I fixed and updated the patch. Thanks Nicholas.

> Close all the file streams properly in a finally block to avoid their leakage.
> -
>
> Key: MAPREDUCE-2243
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2243
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker, tasktracker
>Affects Versions: 0.22.0, 0.23.0
> Environment: NA
>Reporter: Bhallamudi Venkata Siva Kamesh
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-2243-1.patch, MAPREDUCE-2243-2.patch, 
> MAPREDUCE-2243-3.patch, MAPREDUCE-2243-4.patch, MAPREDUCE-2243.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the following classes, streams should be closed in a finally block to 
> avoid leaks in exceptional cases.
> CompletedJobStatusStore.java
> --
>dataOut.writeInt(events.length);
> for (TaskCompletionEvent event : events) {
>   event.write(dataOut);
> }
>dataOut.close() ;
> EventWriter.java
> --
>encoder.flush();
>out.close();
> MapTask.java
> ---
> splitMetaInfo.write(out);
>  out.close();
> TaskLog
> 
>  1) str = fis.readLine();
>   fis.close();
> 2) dos.writeBytes(Long.toString(new File(logLocation, LogName.SYSLOG
>   .toString()).length() - prevLogLength) + "\n");
> dos.close();
> TotalOrderPartitioner.java
> ---
>  while (reader.next(key, value)) {
> parts.add(key);
> key = ReflectionUtils.newInstance(keyClass, conf);
>   }
> reader.close();

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2243) Close all the file streams properly in a finally block to avoid their leakage.

2011-08-01 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated MAPREDUCE-2243:
-

Attachment: MAPREDUCE-2243-4.patch

> Close all the file streams properly in a finally block to avoid their leakage.
> -
>
> Key: MAPREDUCE-2243
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2243
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker, tasktracker
>Affects Versions: 0.22.0, 0.23.0
> Environment: NA
>Reporter: Bhallamudi Venkata Siva Kamesh
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-2243-1.patch, MAPREDUCE-2243-2.patch, 
> MAPREDUCE-2243-3.patch, MAPREDUCE-2243-4.patch, MAPREDUCE-2243.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the following classes, streams should be closed in a finally block to 
> avoid leaks in exceptional cases.
> CompletedJobStatusStore.java
> --
>dataOut.writeInt(events.length);
> for (TaskCompletionEvent event : events) {
>   event.write(dataOut);
> }
>dataOut.close() ;
> EventWriter.java
> --
>encoder.flush();
>out.close();
> MapTask.java
> ---
> splitMetaInfo.write(out);
>  out.close();
> TaskLog
> 
>  1) str = fis.readLine();
>   fis.close();
> 2) dos.writeBytes(Long.toString(new File(logLocation, LogName.SYSLOG
>   .toString()).length() - prevLogLength) + "\n");
> dos.close();
> TotalOrderPartitioner.java
> ---
>  while (reader.next(key, value)) {
> parts.add(key);
> key = ReflectionUtils.newInstance(keyClass, conf);
>   }
> reader.close();

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2243) Close all the file streams properly in a finally block to avoid their leakage.

2011-08-01 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073486#comment-13073486
 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-2243:
---

Hi Devaraj, I was trying to commit the patch but there was an additional 
{{encoder.flush()}} in {{EventWriter.close()}}.  Could you fix it?

> Close all the file streams properly in a finally block to avoid their leakage.
> -
>
> Key: MAPREDUCE-2243
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2243
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker, tasktracker
>Affects Versions: 0.22.0, 0.23.0
> Environment: NA
>Reporter: Bhallamudi Venkata Siva Kamesh
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-2243-1.patch, MAPREDUCE-2243-2.patch, 
> MAPREDUCE-2243-3.patch, MAPREDUCE-2243.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the following classes, streams should be closed in a finally block to 
> avoid leaks in exceptional cases.
> CompletedJobStatusStore.java
> --
>dataOut.writeInt(events.length);
> for (TaskCompletionEvent event : events) {
>   event.write(dataOut);
> }
>dataOut.close() ;
> EventWriter.java
> --
>encoder.flush();
>out.close();
> MapTask.java
> ---
> splitMetaInfo.write(out);
>  out.close();
> TaskLog
> 
>  1) str = fis.readLine();
>   fis.close();
> 2) dos.writeBytes(Long.toString(new File(logLocation, LogName.SYSLOG
>   .toString()).length() - prevLogLength) + "\n");
> dos.close();
> TotalOrderPartitioner.java
> ---
>  while (reader.next(key, value)) {
> parts.add(key);
> key = ReflectionUtils.newInstance(keyClass, conf);
>   }
> reader.close();

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2759) TaskTrackerAction should follow Open Closed Principle

2011-08-01 Thread Rajesh Putta (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073467#comment-13073467
 ] 

Rajesh Putta commented on MAPREDUCE-2759:
-

{code:title=TaskTrackerAction.java|borderStyle=solid}
switch (actionType) {
case LAUNCH_TASK:
  {
action = new LaunchTaskAction();
  }
  break;
case KILL_TASK:
  {
action = new KillTaskAction();
  }
  break;
case KILL_JOB:
  {
action = new KillJobAction();
  }
  break;
case REINIT_TRACKER:
  {
action = new ReinitTrackerAction();
  }
  break;
case COMMIT_TASK:
  {
action = new CommitTaskAction();
  }
  break;
{code} 
In the code above, every action added in the future requires another case, so 
the switch statement keeps growing.
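One way to keep the factory closed for modification is a registry keyed by the action type, so adding an action means registering one constructor reference instead of editing the switch. The action class names mirror the snippet above; the registry itself is a proposed sketch, not existing Hadoop code:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Supplier;

/** Registry-based factory sketch replacing the switch over action types. */
class ActionFactory {
    enum ActionType { LAUNCH_TASK, KILL_TASK, KILL_JOB, REINIT_TRACKER, COMMIT_TASK }

    interface TaskTrackerAction { }
    static class LaunchTaskAction implements TaskTrackerAction { }
    static class KillTaskAction implements TaskTrackerAction { }
    static class KillJobAction implements TaskTrackerAction { }
    static class ReinitTrackerAction implements TaskTrackerAction { }
    static class CommitTaskAction implements TaskTrackerAction { }

    // One entry per action; new actions register here, createAction() is untouched.
    private static final Map<ActionType, Supplier<TaskTrackerAction>> REGISTRY =
            new EnumMap<>(ActionType.class);
    static {
        REGISTRY.put(ActionType.LAUNCH_TASK, LaunchTaskAction::new);
        REGISTRY.put(ActionType.KILL_TASK, KillTaskAction::new);
        REGISTRY.put(ActionType.KILL_JOB, KillJobAction::new);
        REGISTRY.put(ActionType.REINIT_TRACKER, ReinitTrackerAction::new);
        REGISTRY.put(ActionType.COMMIT_TASK, CommitTaskAction::new);
    }

    static TaskTrackerAction createAction(ActionType type) {
        Supplier<TaskTrackerAction> ctor = REGISTRY.get(type);
        if (ctor == null) {
            throw new IllegalArgumentException("Unknown action type: " + type);
        }
        return ctor.get();
    }
}
```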

> TaskTrackerAction should follow Open Closed Principle
> -
>
> Key: MAPREDUCE-2759
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2759
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.23.0
> Environment: NA
>Reporter: Rajesh Putta
>Priority: Minor
>
> The class TaskTrackerAction defines a fixed set of actions, or directions, 
> sent from the JobTracker to the TaskTracker. If more actions are added in 
> the future, the current implementation breaks the Open Closed Principle 
> (open for extension, closed for modification): every new action requires 
> modifying the existing code.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-2759) TaskTrackerAction should follow Open Closed Principle

2011-08-01 Thread Rajesh Putta (JIRA)
TaskTrackerAction should follow Open Closed Principle
-

 Key: MAPREDUCE-2759
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2759
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: tasktracker
Affects Versions: 0.23.0
 Environment: NA
Reporter: Rajesh Putta
Priority: Minor


The class TaskTrackerAction defines a fixed set of actions, or directions, 
sent from the JobTracker to the TaskTracker. If more actions are added in the 
future, the current implementation breaks the Open Closed Principle (open for 
extension, closed for modification): every new action requires modifying the 
existing code.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira