[
https://issues.apache.org/jira/browse/HADOOP-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vivek Ratan updated HADOOP-1462:
--------------------------------
Attachment: 1462_04.patch
My bad about the misspelt function name. That has been corrected. I have also
added progress reporting to the sort code in SequenceFile.Sorter, in between
calls to sort and flush. Progress reporting can also be added in
SortFile::SeqFileComparator::compare(), but I've kept it in SortPass:run() for
now. See 1462_04.patch.
> Better progress reporting from a Task
> -------------------------------------
>
> Key: HADOOP-1462
> URL: https://issues.apache.org/jira/browse/HADOOP-1462
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Reporter: Vivek Ratan
> Assignee: Vivek Ratan
> Fix For: 0.14.0
>
> Attachments: 1462.patch, 1462_02.patch, 1462_03.patch, 1462_04.patch
>
>
> The Task code that reports progress updates has the following problems:
> 1. Some RPC calls are blocking. For example, in MapRunner::run(), the call to
> RecordReader::next() can result in a blocking RPC call to the Task Tracker
> (TT) to report progress.
> 2. Some RPC calls are unnecessary. The Ping thread pings the TT once every
> second, while we also independently send progress updates every second. We
> don't, for example, need to ping the TT right after we send the progress
> update.
> 3. In some places, we spawn a thread to send progress updates (in
> MapOutputBuffer::collect(), for example). If our code gets stuck, the thread
> will continue sending updates to the TT and we will never be shut down.
> These issues, in some form or another, have been reported in HADOOP-1201 and
> HADOOP-1431.
> I propose we make the following changes:
> 1. In order to make the RPC calls non-blocking, we need a thread that calls
> TT. This thread, to be created early on, will make sure we make the most
> appropriate RPCs. It will have access to two flags: a progress flag that
> indicates that the Task has made progress since the last RPC, and a
> keep_alive flag that indicates that we need to let the TT know that we're
> alive. This thread will also handle pings. It's logic will be something like
> this:
> while (1) {
> if (progress_flag is set) {
> // report progress update
> umbilical::progress(...);
> if (failure), kill task;
> reset progress_flag;
> reset keep_alive_flag; // calling progress() also indicates
> that we're alive
> }
> else if (keep_alive_flag is set) {
> // let TT know we're alive
> umbilical::progress(same params as last time);
> if (failure), kill task;
> reset keep_alive_flag;
> break;
> }
> else {
> // see if TT is alive
> umbilical::ping();
> if (failure), kill task;
> }
> sleep (1 sec);
> }
> 2. progress_flag and keep_alive_flag are set by the MapReduce code.
> Reporter::progress() (in Task.java) sets keep_alive_flag while progress_flag
> is set whenever Task's taskProgress object has any of its fields changed.
> 3. We do away with Task::reportProgress() as this code is now handled in the
> Progress thread. Wherever this method is called in our MapReduce kernel code,
> we should replace it either with Reporter::progress() (if the intent was to
> let TT know that we're alive) or we simply remove that call (if the intent
> was to transmit progress changes to the TT).
> 4. TaskUmbilicalProtocol::progress() should return a boolean, and should
> return the same values that TaskUmbilicalProtocol::ping() does. This will let
> the Task know whether its ID is known to the TT.
> 5. We no longer need to create a ping thread in TaskTracker::Child. However,
> we can perhaps create the Progress thread in the same place the Ping thread
> was created.
> 6. We will need to remove code that creates progress threads. This is in
> MapTask::MapOutputBuffer::collect(), MapTask::MapOutputBuffer::flush(), and
> ReduceTask::ReduceCopier(), at the least. Instead, we will need to add code
> that updates the progress or calls Reporter::progress(). Any of these calls
> simply update flags. so there's not a lot of performance penalty (at worst,
> updating progress_flag or keep_alive_flag may need to be done within a
> synchronized block, but even that may not be necessary since the flags are
> just boolean values). As per HADOOP-1431, these calls can be made through a
> ReportingComparator, or from within the generic BuferSorter, or perhaps from
> some place better.
> I may have missed out on some details, but hopefully the overall idea is
> clear. Comments welcome.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.