[jira] Updated: (MAPREDUCE-1018) Document changes to the memory management and scheduling model
[ https://issues.apache.org/jira/browse/MAPREDUCE-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

rahul k singh updated MAPREDUCE-1018:

    Attachment: MAPRED-1018-2.patch

Document changes to the memory management and scheduling model

    Key: MAPREDUCE-1018
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1018
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: documentation
    Affects Versions: 0.21.0
    Reporter: Hemanth Yamijala
    Priority: Blocker
    Fix For: 0.21.0
    Attachments: MAPRED-1018-1.patch, MAPRED-1018-2.patch, MAPRED-1018-commons.patch

Changes were made to the configuration, monitoring and scheduling of high-RAM jobs. These must be documented in mapred-defaults.xml and also in the forrest documentation.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-1215) Counter deprecation warnings in jobtracker log are excessive
Counter deprecation warnings in jobtracker log are excessive

    Key: MAPREDUCE-1215
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1215
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Affects Versions: 0.21.0
    Reporter: Chris Douglas

In a recent test, the log message

{noformat}
WARN org.apache.hadoop.mapred.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. \
Use org.apache.hadoop.mapreduce.TaskCounter instead
{noformat}

accounted for nearly a third of a 1.3GB jobtracker log.
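One way to address a flood like this is to gate the warning so it is emitted at most once per deprecated group. A minimal sketch of that guard, with invented class and method names (this is not Hadoop's actual Counters code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical "warn once per group" guard: the caller logs the deprecation
// warning only when shouldWarn() returns true, i.e. the first time a given
// counter group name is seen.
class DeprecationWarner {
    private static final Set<String> warned = ConcurrentHashMap.newKeySet();

    // Thread-safe: Set.add returns true only for the first caller
    // to register a given group name.
    static boolean shouldWarn(String group) {
        return warned.add(group);
    }
}
```

With this in place, the message above would appear once per deprecated group per JVM instead of once per lookup.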
[jira] Commented: (MAPREDUCE-1018) Document changes to the memory management and scheduling model
[ https://issues.apache.org/jira/browse/MAPREDUCE-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778231#action_12778231 ]

Vinod K V commented on MAPREDUCE-1018:

Already many other mapreduce issues have modified only cluster-setup.xml, the one in the mapreduce project. Rahul mentioned offline that forrest documentation is not getting generated in the mapreduce sub-project. Assuming we'll address that in a separate issue, I propose we have only one patch - the mapred one.

- The mapred patch has git prefixes which need to be removed.
- Monitoring/scheduling based on RAM is completely removed, so remove the references too. Just add a note saying that (quoting from HADOOP-5881) there isn't any need for distinguishing vmem from physical memory w.r.t. configuration. Depending on a site's requirements, the configuration items can reflect whether one wants tasks to go beyond physical memory or not.

cluster_setup.html
- All config names should be renamed to the new names. Of course this means a slightly different patch for 0.20, which we will come to after the patch for trunk is done.
- mapred.{map|reduce}.child.ulimit also needs to be renamed.
- What happens when monitoring is enabled, but the job has -1?
- Memory monitoring is no longer defined in terms of a per-task limit and a per-node limit. It is now driven by per-slot size and number of slots. We should use these new terms throughout.
- "Before getting into details, consider the following additional memory-related parameters that can be configured to enable better scheduling:" - this line is no longer needed.

capacity_scheduler.html
- The feature for monitoring RAM is gone. Remove all references.
- Working of scheduling:
-- Point 1: 4 parameters, not three. Parameters are described in cluster_setup. vmem.reserved is no longer used.
-- Point 2: This is changed completely. No more offsets.
   Total = numSlots * PerSlotMemSize
   Used = Sigma(numSlotsPerTask * PerSlotMemSize)
-- Point 3: The JT now rejects the jobs, not the scheduler.
- "See the MapReduce Tutorial for details on how the TT monitors memory usage." See cluster_setup instead?
- Need to update mapred_tutorial.html's memory management section. Also need a reference to this in both cluster_setup.html and capacity_scheduler.html.
- Another point I've already mentioned on the JIRA: along with everything else, we should document that job-setup and job-cleanup tasks of all jobs, whether or not their maps and reduces require high memory, still run on a single slot.

Document changes to the memory management and scheduling model

    Key: MAPREDUCE-1018
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1018
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: documentation
    Affects Versions: 0.21.0
    Reporter: Hemanth Yamijala
    Priority: Blocker
    Fix For: 0.21.0
    Attachments: MAPRED-1018-1.patch, MAPRED-1018-2.patch, MAPRED-1018-commons.patch

Changes were made to the configuration, monitoring and scheduling of high-RAM jobs. These must be documented in mapred-defaults.xml and also in the forrest documentation.
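The slot-based accounting in Point 2 is plain arithmetic, and a tiny sketch makes the two formulas concrete (class and method names are invented, not the scheduler's actual code):

```java
// Illustrative arithmetic only: total memory a tasktracker offers is the
// number of slots times the per-slot size; used memory is the sum over
// running tasks of the slots each task occupies times the per-slot size.
class SlotMemory {
    static long totalMem(int numSlots, long perSlotMemSize) {
        return numSlots * perSlotMemSize;
    }

    static long usedMem(int[] slotsPerTask, long perSlotMemSize) {
        long used = 0;
        for (int s : slotsPerTask) {
            used += s * perSlotMemSize;  // each task holds a whole number of slots
        }
        return used;
    }
}
```

For example, a tracker with 4 map slots of 1024MB each offers 4096MB total; a normal task (1 slot) plus a high-RAM task (2 slots) uses 3072MB, leaving room for one more normal task.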
[jira] Commented: (MAPREDUCE-1143) runningMapTasks counter is not properly decremented in case of failed Tasks.
[ https://issues.apache.org/jira/browse/MAPREDUCE-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778233#action_12778233 ]

Hemanth Yamijala commented on MAPREDUCE-1143:

I spoke to Amarsri and Rahul about my comments and found out some explanations:

bq. For instance, even after this patch, I see that the number of running tasks is decremented under different checks when a task completes and when a task fails. I assume this is for good reason, but still it is difficult to review.

So, the different checks are as follows:

{code}
completedTask() {
  if (this tip is complete) {
    return;
  }
  update counters
}

failedTask() {
  if (any attempt was running for this tip before status update) {
    update counters
  }
}
{code}

It appears completedTask doesn't need the check for the TIP being complete at all, as that can never happen. A TIP is marked complete only if at least one attempt has completed, and it remains so. If another attempt comes in reporting success now, we fail it in the status update and do not follow the completedTask code path at all. So, for all practical purposes, counters are being updated unconditionally in completedTask. Further, in the same code path, the task is removed from the active tasks as well. Hence no further check is necessary.

The check in failedTask is required, though. This is because a task can fail *after* it has been marked as succeeded, for example if there are fetch failures for a map, or if a tracker is lost. In this case, we should not update counters again because they would already have been updated when the task succeeded. However, in this context, I am a little worried that we are checking for any attempt being running before the status update, rather than this specific attempt. At least in theory, it is possible this results in some inconsistency. Consider this sequence of events:

- A task is scheduled.
- It is speculated.
- It completes. Counters are decremented here.
- It fails (lost TT, fetch failures). The current patch will decrement counters here again.
- The speculated attempt succeeds.

In practice, though, this scenario may not be very likely. Apparently fetch failures and lost TTs are the only extreme cases where this is possible, and there is a considerable time lag that can occur between a task completing and it having to be failed. The time lag will in most cases be large enough to kill the speculative attempt as well. With this background, is it worth changing the current patch to:

{code}
failedTask() {
  if (this task was running before status update) {
    update counters
  }
}
{code}

This seems more correct to me, but I was wondering if it was worth the change. Thoughts?

runningMapTasks counter is not properly decremented in case of failed Tasks.

    Key: MAPREDUCE-1143
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1143
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Reporter: rahul k singh
    Priority: Blocker
    Attachments: MAPRED-1143-1.patch, MAPRED-1143-2.patch, MAPRED-1143-2.patch, MAPRED-1143-3.patch, MAPRED-1143-4.patch, MAPRED-1143-ydist-1.patch, MAPRED-1143-ydist-2.patch, MAPRED-1143-ydist-3.patch, MAPRED-1143-ydist-4.patch, MAPRED-1143-ydist-5.patch, MAPRED-1143-ydist-6.patch
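The attempt-specific check proposed above can be illustrated with a toy model (class, field, and method names are invented; the real bookkeeping lives in the jobtracker's task-tracking classes). The point is that a failure arriving after success no longer decrements the counter a second time, because the attempt is no longer in the running set:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of per-attempt counter bookkeeping: both completion and failure
// decrement only if *this* attempt was still recorded as running.
class RunningCounter {
    int runningMapTasks;
    private final Set<String> runningAttempts = new HashSet<>();

    void attemptStarted(String attemptId) {
        runningAttempts.add(attemptId);
        runningMapTasks++;
    }

    void completedTask(String attemptId) {
        if (runningAttempts.remove(attemptId)) {
            runningMapTasks--;
        }
    }

    // Proposed check: decrement only if this specific attempt was running,
    // so a fail-after-success (fetch failures, lost TT) is a no-op.
    void failedTask(String attemptId) {
        if (runningAttempts.remove(attemptId)) {
            runningMapTasks--;
        }
    }
}
```

Replaying the problematic sequence (schedule, speculate, complete, late failure, speculative success) against this model leaves the counter at zero rather than negative.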
[jira] Updated: (MAPREDUCE-896) Users can set non-writable permissions on temporary files for TT and can abuse disk usage.
[ https://issues.apache.org/jira/browse/MAPREDUCE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Gummadi updated MAPREDUCE-896:

    Attachment: MR-896.patch

Attaching a patch with the fix. Please review and provide your comments. The patch does the following:

(1) When deleting $jobId/$attemptId/work or $jobId/$attemptId, the TT uses the task-controller to enable the path for deletion (by changing permissions).
    (a) LinuxTaskController sets 770 as the permissions for all files/directories within this dir recursively, and then the TT deletes the dir.
    (b) DefaultTaskController sets rwx for the user (same as the TT) on this dir recursively, and then the TT deletes the dir.
(2) Deletion of $jobId is done as before, because the user can't create any files in this dir.
(3) Deletion of the work dir in TaskRunner (useful with JVM reuse) is also done by changing the permissions first (no task-controller is needed in this case).
(4) With JVM reuse, after the final task of a JVM is finished, workDir is deleted by the task-controller using the procedure mentioned in (1) above.
(5) Modified TestTaskTrackerLocalization and TestLocalizationWithLinuxTaskController to create directories/files in $attemptId and set non-writable permissions, testing whether (a) DefaultTaskController and (b) LinuxTaskController are able to remove these new files from $attemptId.

Users can set non-writable permissions on temporary files for TT and can abuse disk usage.

    Key: MAPREDUCE-896
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-896
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: tasktracker
    Affects Versions: 0.21.0
    Reporter: Vinod K V
    Assignee: Ravi Gummadi
    Fix For: 0.21.0
    Attachments: MR-896.patch

As of now, irrespective of the TaskController in use, the TT itself does a full delete on local files created by itself or by job tasks. This step, depending upon the TT's umask and the permissions set on files by the user (for example in job-work/task-work or child.tmp directories), may or may not complete successfully. This leaves an opportunity for abusing disk space, either accidentally or intentionally, by the TT/users.
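The "fix permissions, then delete" cleanup the patch describes can be sketched in a few lines (names are invented, and the real TT delegates the permission change to the task-controller rather than doing it in-process as this sketch does):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch: recursively grant the owner rwx so user-set
// non-writable permissions cannot block cleanup, then delete bottom-up.
class ForcedDelete {
    static boolean deleteRecursively(File f) {
        f.setWritable(true);
        f.setExecutable(true);
        f.setReadable(true);
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        return f.delete();
    }

    // Demo: a non-writable "work" dir containing a file, like an attempt dir
    // a task left behind; a plain delete of the child would fail on POSIX,
    // but the forced delete succeeds because it restores write permission
    // on each directory before descending into it.
    static boolean demo() {
        try {
            File attemptDir = Files.createTempDirectory("attempt").toFile();
            File workDir = new File(attemptDir, "work");
            File tmp = new File(workDir, "child.tmp");
            if (!workDir.mkdir() || !tmp.createNewFile() || !workDir.setWritable(false)) {
                return false;
            }
            return deleteRecursively(attemptDir) && !attemptDir.exists();
        } catch (IOException e) {
            return false;
        }
    }
}
```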
[jira] Updated: (MAPREDUCE-1140) Per cache-file refcount can become negative when tasks release distributed-cache files
[ https://issues.apache.org/jira/browse/MAPREDUCE-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated MAPREDUCE-1140:

    Status: Open  (was: Patch Available)

Per cache-file refcount can become negative when tasks release distributed-cache files

    Key: MAPREDUCE-1140
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1140
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: tasktracker
    Affects Versions: 0.20.2, 0.21.0, 0.22.0
    Reporter: Vinod K V
    Assignee: Amareshwari Sriramadasu
    Attachments: patch-1140-1.txt, patch-1140-2.txt, patch-1140-ydist.txt, patch-1140.txt
[jira] Commented: (MAPREDUCE-1147) Map output records counter missing for map-only jobs in new API
[ https://issues.apache.org/jira/browse/MAPREDUCE-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778235#action_12778235 ]

Amar Kamat commented on MAPREDUCE-1147:

For branch 20, all tests except the ones mentioned below have passed:
# hdfs.TestDatanodeBlockScanner FAILED (timeout)
# hdfs.TestDistributedFileSystem FAILED
# hdfs.server.namenode.TestFsck FAILED (timeout)
# mapred.TestReduceFetch FAILED

None of these seems related to this issue.

Map output records counter missing for map-only jobs in new API

    Key: MAPREDUCE-1147
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1147
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Affects Versions: 0.20.1, 0.21.0
    Reporter: Chris Douglas
    Assignee: Amar Kamat
    Priority: Blocker
    Fix For: 0.20.2
    Attachments: mapred-1147-v1.3.patch, mapred-1147-v1.4-y20.patch, mapred-1147-v1.4.patch

In the new API, the counter for map output records is not incremented for map-only jobs.
[jira] Updated: (MAPREDUCE-1140) Per cache-file refcount can become negative when tasks release distributed-cache files
[ https://issues.apache.org/jira/browse/MAPREDUCE-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated MAPREDUCE-1140:

    Attachment: patch-1140-2.txt

Patch with review comments incorporated.

Per cache-file refcount can become negative when tasks release distributed-cache files

    Key: MAPREDUCE-1140
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1140
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: tasktracker
    Affects Versions: 0.20.2, 0.21.0, 0.22.0
    Reporter: Vinod K V
    Assignee: Amareshwari Sriramadasu
    Attachments: patch-1140-1.txt, patch-1140-2.txt, patch-1140-ydist.txt, patch-1140.txt
[jira] Updated: (MAPREDUCE-1140) Per cache-file refcount can become negative when tasks release distributed-cache files
[ https://issues.apache.org/jira/browse/MAPREDUCE-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated MAPREDUCE-1140:

    Status: Patch Available  (was: Open)

Per cache-file refcount can become negative when tasks release distributed-cache files

    Key: MAPREDUCE-1140
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1140
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: tasktracker
    Affects Versions: 0.20.2, 0.21.0, 0.22.0
    Reporter: Vinod K V
    Assignee: Amareshwari Sriramadasu
    Attachments: patch-1140-1.txt, patch-1140-2.txt, patch-1140-ydist.txt, patch-1140.txt
[jira] Commented: (MAPREDUCE-1185) URL to JT webconsole for running job and job history should be the same
[ https://issues.apache.org/jira/browse/MAPREDUCE-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778242#action_12778242 ]

Jothi Padmanabhan commented on MAPREDUCE-1185:

In the test case, does it also make sense to add one more check: do not set {{conn.setInstanceFollowRedirects(false)}}, and then ensure that {{conn.connect()}} is successful? If not the contents, we would at least verify that the redirection works as expected, no?

URL to JT webconsole for running job and job history should be the same

    Key: MAPREDUCE-1185
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-1185
    Project: Hadoop Map/Reduce
    Issue Type: Improvement
    Components: jobtracker
    Reporter: Sharad Agarwal
    Assignee: Sharad Agarwal
    Attachments: 1185_v1.patch, 1185_v2.patch, 1185_v3.patch

The tracking URL for running jobs and for retired jobs is different. This creates a problem for clients that cache the job's running URL, because it soon becomes invalid when the job is retired.
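The suggested extra check can be sketched as a self-contained example against a local redirecting server (illustrative only, not the patch's actual test; paths and names are made up). With follow-redirects left at its default of true, connecting to the old URL should transparently land on the redirect target with a 200:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

// Sketch: a tiny in-process server answers the old-style URL with a 302
// pointing at a "history" page; HttpURLConnection follows redirects by
// default, so the client sees the final 200 without any extra code.
class RedirectCheck {
    static int fetchResponseCode() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
            int port = server.getAddress().getPort();
            server.createContext("/jobdetails", ex -> {
                ex.getResponseHeaders().add("Location",
                        "http://localhost:" + port + "/history");
                ex.sendResponseHeaders(302, -1);  // -1: no response body
                ex.close();
            });
            server.createContext("/history", ex -> {
                ex.sendResponseHeaders(200, -1);
                ex.close();
            });
            server.start();
            try {
                HttpURLConnection conn = (HttpURLConnection)
                        new URL("http://localhost:" + port + "/jobdetails")
                                .openConnection();
                // Follow-redirects deliberately left at its default (true):
                // we assert the connection succeeds, not the page contents.
                conn.connect();
                return conn.getResponseCode();
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            return -1;
        }
    }
}
```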