[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-04-15 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632626#comment-13632626
 ] 

Karthik Kambatla commented on MAPREDUCE-5110:
-

Thanks for chiming in, Vinod.

My intention was precisely to add an aggressive timeout for task attempt 
launches, and keeping it job-configurable should be good. We can implement it 
on either the JT or the TT. Do you think it is okay to implement it on the TT? 
Please suggest, and I'll upload a patch accordingly.

If interested, the user should be able to configure this timeout to be shorter 
than the tracker-expiry-interval to ensure a single attempt.
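
A minimal sketch of what such a TT-side check could look like, assuming the 
TT records when each attempt was assigned to it. The class and method names 
are hypothetical and are not taken from the attached patches or from the 
TaskTracker code.

```java
// Simplified model of a TT-side launch timeout; not actual TaskTracker code.
final class LaunchTimeoutCheck {
    private LaunchTimeoutCheck() {}

    // True if an attempt that has not started running has exceeded the
    // launch timeout and should be failed locally by the TT.
    static boolean shouldKill(boolean started, long assignedAtMs,
                              long nowMs, long launchTimeoutMs) {
        return !started && (nowMs - assignedAtMs) > launchTimeoutMs;
    }

    // True if the launch timeout fires before the JT-side expiry interval,
    // i.e. the TT gives up on the attempt before the JT would expire it.
    static boolean firesBeforeExpiry(long launchTimeoutMs, long expiryIntervalMs) {
        return launchTimeoutMs < expiryIntervalMs;
    }
}
```

With a launch timeout of 5000 ms and an expiry interval of 10000 ms, 
firesBeforeExpiry returns true, so the slow attempt would be killed on the TT 
before the JT schedules a replacement.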


> Long task launch delays can lead to multiple parallel attempts of the task
> --
>
> Key: MAPREDUCE-5110
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5110
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 1.1.2
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: expose-mr-5110.patch, mr-5110.patch, mr-5110.patch, 
> mr-5110-tt-only.patch
>
>
> If a task takes too long to launch, the JT expires the task and schedules 
> another attempt. The earlier attempt can start after the later attempt 
> leading to two parallel attempts running at the same time. This is 
> particularly an issue if the user turns off speculation and expects a single 
> attempt of a task to run at any point in time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-04-15 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632543#comment-13632543
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-5110:


Trying to understand this, mostly agree with what Arun said. To summarize:
 - Strictly guaranteeing serial execution of task attempts is not possible in 
general and is a non-requirement
 - JT already deals with all kinds of slowness with tasks and, irrespective of 
this patch, clients have to deal with the slowness.

bq. Where possible (i.e., not transient network partitions), run a single task 
attempt for a task when speculation is turned off
Seems an arbitrary non-requirement, don't see what we gain from this.

The JIRA started with the above goal, which isn't worth pursuing from what I see, 
but now it seems to have transformed into something more benign. Looked at the 
patch. It looks like you want quicker failure when tasks are getting 
launched/localized to meet some kind of SLAs? If that is the case, instead of 
calling it a 'TT-side implementation', if we call it an aggressive timeout 
enforced on TTs for tasks, and make it job-configurable, that should do. Right?



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-04-15 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632397#comment-13632397
 ] 

Karthik Kambatla commented on MAPREDUCE-5110:
-

Thanks for your response, Arun.

Let me take a step back and explain in detail:

AIC the issue this JIRA addresses is - "Where possible (i.e., not transient 
network partitions), run a single task attempt for a task when speculation is 
turned off". A JT solution (a.k.a. MAPREDUCE-2217) spawns another task attempt, 
but doesn't kill the currently running task before doing so. Through a TT-side 
solution (patch here), one will be able to kill the currently running attempt 
first before spawning another task attempt.

I see your point of avoid-TT-changes-if-possible. I guess the trade-off is 
between marginal increase in TT code complexity (a timeout check and logging 
changes) and running multiple attempts of the task. Given the low cost of the 
fix, I believe we should address this scenario which seems to be far more 
frequent compared to network partitions.



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-04-15 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632377#comment-13632377
 ] 

Arun C Murthy commented on MAPREDUCE-5110:
--

[~kkambatl] Sorry if I wasn't clear. 

My point is simple: The JT already has code to deal with slow launches i.e. 
ExpireLaunchingTasks. What benefit do we get by re-implementing this on TT? Why 
not reduce the timeout on JT? 
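
For readers unfamiliar with the mechanism, the following is a simplified 
model of a launch-expiry scan of the kind described here. It is a sketch of 
the idea only, not the JobTracker's ExpireLaunchingTasks code; the class and 
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model: track attempts that were handed out but have not yet
// reported as started, and expire those that exceed the interval.
final class LaunchExpiryModel {
    private final long expiryIntervalMs;
    private final Map<String, Long> launching = new LinkedHashMap<>();

    LaunchExpiryModel(long expiryIntervalMs) {
        this.expiryIntervalMs = expiryIntervalMs;
    }

    // Called when an attempt is assigned to a tracker.
    void addLaunching(String attemptId, long nowMs) {
        launching.put(attemptId, nowMs);
    }

    // Called when the tracker reports that the attempt has started.
    void reportedStarted(String attemptId) {
        launching.remove(attemptId);
    }

    // Periodic scan: returns and forgets the attempts that took too long.
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = launching.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > expiryIntervalMs) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

In this model, reducing the timeout on the JT corresponds to constructing the 
scan with a smaller expiryIntervalMs; the scan itself does not stop the 
expired attempt on the node where it was assigned.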



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-04-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623925#comment-13623925
 ] 

Karthik Kambatla commented on MAPREDUCE-5110:
-

Hey Arun, sorry for the delay. I was trying to figure out the root cause behind 
these occasional launch delays; we encounter them once in a while on a highly 
loaded cluster. It looks like a node-specific hardware/OS issue. When this 
happens, the task in question delays the entire job. 

I still believe limiting the task launch time is helpful, particularly in the 
case of node-specific hardware issues - failing disks, slow networks etc. Also, 
I discussed this offline with Alejandro and Tom, and they suggested we might 
not want to introduce a new config for this, but maybe use half of the 
mapred.task.timeout. What do you think of that? 
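
As a rough illustration of that suggestion, the launch timeout could be 
derived from the existing setting instead of a new key. This sketch reads the 
value from a plain java.util.Properties object rather than a Hadoop 
configuration, and the class name is hypothetical; 600000 ms is the stock 
default for mapred.task.timeout.

```java
import java.util.Properties;

// Sketch only: derive the launch timeout from mapred.task.timeout.
final class DerivedLaunchTimeout {
    static final String TASK_TIMEOUT_KEY = "mapred.task.timeout";
    static final long DEFAULT_TASK_TIMEOUT_MS = 600_000L;

    private DerivedLaunchTimeout() {}

    // Launch timeout is half of the job's task timeout, in milliseconds.
    static long launchTimeoutMs(Properties jobConf) {
        String raw = jobConf.getProperty(
            TASK_TIMEOUT_KEY, String.valueOf(DEFAULT_TASK_TIMEOUT_MS));
        return Long.parseLong(raw.trim()) / 2;
    }
}
```

With the default task timeout this gives a launch timeout of 300000 ms (five 
minutes), and a job that lowers mapred.task.timeout lowers the launch timeout 
with it.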



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-04-05 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623893#comment-13623893
 ] 

Arun C Murthy commented on MAPREDUCE-5110:
--

[~kkambatl] I haven't heard back; are you comfortable with my suggestion that 
this is not worth additional complexity? Thanks.



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-31 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13618468#comment-13618468
 ] 

Arun C Murthy commented on MAPREDUCE-5110:
--

bq. All this said, if you are uncomfortable with the JT changes, I can restrict 
the changes to TT.

[~kkambatl] As I said in my comment, I'm more worried about TT-side than JT...

The JT already has code to catch slow launches i.e. ExpireLaunchingTasks 
thread. Why do we need to re-invent it on TT side? IAC, the problem we are 
trying to solve here is fairly obscure i.e. a hardware issue causing slow task 
launches... even that is already tracked by JT, right?



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617577#comment-13617577
 ] 

Hadoop QA commented on MAPREDUCE-5110:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12576121/mr-5110-tt-only.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3482//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-29 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617471#comment-13617471
 ] 

Karthik Kambatla commented on MAPREDUCE-5110:
-

Thanks Arun. Agree that we can't guarantee a single task attempt in the face of 
a transient network partition. That said, I think there is merit to solving 
something we can. For instance, the users could have their own SLAs (time or 
percentile or plain hardware-based) to guard against inconsistencies due to 
network partitions.

bq. I think MAPREDUCE-2217 made an important improvement and we should keep it. 
However, I'm very scared of trying to implement MAPREDUCE-2217 via TT-side 
changes, particularly, when we are adding complexity to already squiggly code 
on the TT.

Agree MAPREDUCE-2217 addresses the hung TT case, but only for UNASSIGNED tasks. 
The RUNNING/COMMIT_PENDING tasks still are addressed by TT. In other words, the 
rationale of monitoring task progress for RUNNING/COMMIT_PENDING in TT instead 
of JT applies to this case too. If anything, the proposed patch only makes it 
consistent.

All this said, if you are uncomfortable with the JT changes, I can restrict the 
changes to TT.



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-29 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617336#comment-13617336
 ] 

Arun C Murthy commented on MAPREDUCE-5110:
--

bq. If we want to strictly guarantee serial execution of task attempts (say, 
when speculative execution is turned off), we want to kill the task first 
before re-scheduling on another node.

[~kkambatl] the premise that we can strictly guarantee the above does not hold; 
it is basically impossible. There are a bunch of other scenarios where we won't 
be able to guarantee this, e.g. you might schedule a task on a TT which then is 
deemed 'lost' 10 mins 
later without a single HB after the schedule; but in reality that TT is just 
having trouble talking to JT. This means that multiple tasks will be running 
simultaneously since the JT will re-schedule all tasks on that TT. In reality, 
this is the more common case (lost TT) and there is, pretty much, nothing we 
can do about it.

However, there are enough checks/balances to ensure there is consistency for 
the job in the system (longer writeup).

As a result, I'm inclined to close this as 'won't fix'. I think MAPREDUCE-2217 
made an important improvement and we should keep it. However, I'm very scared 
of trying to implement MAPREDUCE-2217 via TT-side changes, particularly, when 
we are adding complexity to already squiggly code on the TT.

Makes sense?



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617104#comment-13617104
 ] 

Karthik Kambatla commented on MAPREDUCE-5110:
-

To answer Arun's question in its entirety: Yes, it reverts MAPREDUCE-2217. 
MAPREDUCE-2217 was to address the case of a hung TaskTracker. Till the time of 
this JIRA, we never observed the current issue of the TT taking too long to 
start a task (which only seems to manifest due to hardware issues). The current 
approach handles both scenarios.



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616847#comment-13616847
 ] 

Hadoop QA commented on MAPREDUCE-5110:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575879/mr-5110.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3476//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616779#comment-13616779
 ] 

Karthik Kambatla commented on MAPREDUCE-5110:
-

If we want to strictly guarantee serial execution of task attempts (say, when 
speculative execution is turned off), we want to kill the task first before 
re-scheduling on another node. If we treat these as decoupled events, the JT 
will schedule another attempt. Meanwhile, before the TT kills it, the task 
might make some progress violating the above guarantee.
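
The window in question can be made concrete with a small timeline 
calculation. This is only a sketch: the class name and the timestamps in the 
usage note are made up for illustration and are not measurements from a 
cluster.

```java
// Sketch: how long two attempts of the same task run at the same time.
final class AttemptOverlap {
    private AttemptOverlap() {}

    // Returns the overlap in milliseconds between the old attempt's run
    // interval [oldStartMs, oldKilledMs) and the new attempt's run interval
    // [newStartMs, newEndMs); 0 if the two never run concurrently.
    static long overlapMs(long oldStartMs, long oldKilledMs,
                          long newStartMs, long newEndMs) {
        long start = Math.max(oldStartMs, newStartMs);
        long end = Math.min(oldKilledMs, newEndMs);
        return Math.max(0L, end - start);
    }
}
```

If the events are decoupled and the old attempt starts at 1200 ms but is only 
killed at 3000 ms, while the new attempt runs from 1100 ms to 60000 ms, the 
two overlap for 1800 ms. If the old attempt is killed at 1000 ms, before the 
new one starts at 1100 ms, the overlap is 0.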




[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-28 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616768#comment-13616768
 ] 

Arun C Murthy commented on MAPREDUCE-5110:
--

So, this essentially reverts MAPREDUCE-2217?

Why isn't it sufficient to just let the TT kill the task once JT expires it?



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-28 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616523#comment-13616523
 ] 

Alejandro Abdelnur commented on MAPREDUCE-5110:
---

+1, nice hunting.



[jira] [Commented] (MAPREDUCE-5110) Long task launch delays can lead to multiple parallel attempts of the task

2013-03-27 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616000#comment-13616000
 ] 

Karthik Kambatla commented on MAPREDUCE-5110:
-

The issue can be reproduced by applying the above patch on a 4-node cluster (8 
map/reduce slots) with mapred.tasktracker.expiry.interval set to 1000 and 
running terasort on 20 GB of data.

> Long task launch delays can lead to multiple parallel attempts of the task
> --
>
> Key: MAPREDUCE-5110
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5110
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 1.1.2
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: expose-mr-5110.patch
>
>
> If a task takes too long to launch, the JT expires the task and schedules 
> another attempt. The earlier attempt can start after the later attempt 
> leading to two parallel attempts running at the same time. This is 
> particularly an issue if the user turns off speculation and expects a single 
> attempt of a task to run at any point in time.
