[jira] [Created] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-01-08 Thread Peter D Kirchner (JIRA)
Peter D Kirchner created YARN-3020:
--

 Summary: n similar addContainerRequest()s produce n*(n+1)/2 
containers
 Key: YARN-3020
 URL: https://issues.apache.org/jira/browse/YARN-3020
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.5.2, 2.5.1, 2.6.0, 2.5.0
Reporter: Peter D Kirchner


BUG: If the application master calls addContainerRequest() n times, but with 
the same priority, I get 1+2+3+...+n containers = n*(n+1)/2 .

If the application master calls addContainerRequest() n times, but with a 
unique priority each time, I get n containers (as I intended).

Analysis:
There is a logic problem in AMRMClientImpl.java.
Although allocate() in AMRMClientImpl.java does an ask.clear(), on subsequent calls to 
addContainerRequest(), addResourceRequest() finds the previous matching remoteRequest 
and increments its container count rather than starting anew, and then calls 
addResourceRequestToAsk(), which defeats the ask.clear().

From documentation and code comments, it was hard for me to discern the 
intended behavior of the API, but the inconsistency reported in this issue 
suggests one case or the other is implemented incorrectly.
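
For illustration, here is a much-simplified sketch of the bookkeeping described above; the 
names and structure are my own approximation, not the actual AMRMClientImpl code:

import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of the accumulation: every matching add bumps a running total and
// re-queues that full total for the next heartbeat, so the RM is asked for 1, then 2,
// then 3, ... containers instead of 1 each time.
class AskAccumulationSketch {
    private final Map<String, Integer> remoteRequests = new HashMap<>(); // keyed by priority/resourceName/capability
    private final Set<String> ask = new LinkedHashSet<>();               // what the next allocate() will send

    void addResourceRequest(String key) {
        int total = remoteRequests.merge(key, 1, Integer::sum); // increments the existing count, never starts anew
        ask.add(key);                                            // addResourceRequestToAsk(): re-added after every ask.clear()
        System.out.println("next allocate() will ask for " + total + " containers");
    }

    void allocate() {
        ask.clear(); // clears the ask list, but remoteRequests keeps its cumulative totals
    }

    public static void main(String[] args) {
        AskAccumulationSketch client = new AskAccumulationSketch();
        for (int i = 0; i < 3; i++) {          // one add per heartbeat
            client.addResourceRequest("priority=0,resourceName=*,1024MB");
            client.allocate();                 // RM sees asks of 1, 2, 3 -> 6 containers in total
        }
    }
}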



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-01-09 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271729#comment-14271729
 ] 

Peter D Kirchner commented on YARN-3020:


Hi, thank you for your comment.  I believe your test works because it does all 
addContainerRequest()s between two heartbeats, before any interaction with the 
server on the accumulated requests.  If you insert a Thread.sleep() of more 
than the heartbeat interval in your test loop, allocate() is called in between 
your requests, and you should get O(n^2) containers for your n requests.
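
A minimal sketch of the kind of loop meant here, assuming an AM whose amRMClient field is an 
already-registered AMRMClientAsync with a 1000 ms heartbeat (the usual 
org.apache.hadoop.yarn.api.records and client.api imports are assumed):

// n identical requests at the same priority, spread across heartbeats.
void addRequestsAcrossHeartbeats(int n) throws InterruptedException {
    Priority priority = Priority.newInstance(0);
    Resource capability = Resource.newInstance(1024, 1);
    for (int i = 0; i < n; i++) {
        amRMClient.addContainerRequest(
            new AMRMClient.ContainerRequest(capability, null, null, priority));
        Thread.sleep(1500); // longer than the heartbeat, so allocate() runs between adds
    }
    // Per this report, the RM ends up being asked for 1+2+...+n = n*(n+1)/2 containers.
}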

> n similar addContainerRequest()s produce n*(n+1)/2 containers
> -
>
> Key: YARN-3020
> URL: https://issues.apache.org/jira/browse/YARN-3020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2
>Reporter: Peter D Kirchner
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> BUG: If the application master calls addContainerRequest() n times, but with 
> the same priority, I get 1+2+3+...+n containers = n*(n+1)/2 .
> If the application master calls addContainerRequest() n times, but with a 
> unique priority each time, I get n containers (as I intended).
> Analysis:
> There is a logic problem in AMRMClientImpl.java.
> Although AMRMClientImpl.java, allocate() does an ask.clear() , on subsequent 
> calls to addContainerRequest(), addResourceRequest() finds the previous 
> matching remoteRequest and increments the container count rather than 
> starting anew, and does an addResourceRequestToAsk() which defeats the 
> ask.clear().
> From documentation and code comments, it was hard for me to discern the 
> intended behavior of the API, but the inconsistency reported in this issue 
> suggests one case or the other is implemented incorrectly.





[jira] [Updated] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-01-09 Thread Peter D Kirchner (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter D Kirchner updated YARN-3020:
---
Description: 
BUG: If the application master calls addContainerRequest() n times, but with 
the same priority, I get up to 1+2+3+...+n containers = n*(n+1)/2 .  The most 
containers are requested when the interval between calls to 
addContainerRequest() exceeds the heartbeat interval of calls to allocate() (in 
AMRMClientImpl's run() method).

If the application master calls addContainerRequest() n times, but with a 
unique priority each time, I get n containers (as I intended).

Analysis:
There is a logic problem in AMRMClientImpl.java.
Although allocate() in AMRMClientImpl.java does an ask.clear(), on subsequent calls to 
addContainerRequest(), addResourceRequest() finds the previous matching remoteRequest 
and increments its container count rather than starting anew, and then calls 
addResourceRequestToAsk(), which defeats the ask.clear().

From documentation and code comments, it was hard for me to discern the 
intended behavior of the API, but the inconsistency reported in this issue 
suggests one case or the other is implemented incorrectly.

  was:
BUG: If the application master calls addContainerRequest() n times, but with 
the same priority, I get 1+2+3+...+n containers = n*(n+1)/2 .

If the application master calls addContainerRequest() n times, but with a 
unique priority each time, I get n containers (as I intended).

Analysis:
There is a logic problem in AMRMClientImpl.java.
Although allocate() in AMRMClientImpl.java does an ask.clear(), on subsequent calls to 
addContainerRequest(), addResourceRequest() finds the previous matching remoteRequest 
and increments its container count rather than starting anew, and then calls 
addResourceRequestToAsk(), which defeats the ask.clear().

From documentation and code comments, it was hard for me to discern the 
intended behavior of the API, but the inconsistency reported in this issue 
suggests one case or the other is implemented incorrectly.







[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-01-09 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271971#comment-14271971
 ] 

Peter D Kirchner commented on YARN-3020:


Hold the phone.  Your test in TestAMRMClient.java will ~always succeed, but it does not 
test for the bug.  Your test counts the *client's* accumulation of resource requests; it 
does not test the *number of times* the requests, in various stages of accumulation, get 
sent to the server (Resource Manager).  Please call addContainerRequest() multiple times 
across multiple heartbeats, and then examine the number of containers that the Resource 
Manager allocates; you should see the bug.
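
One hedged way to count what the RM actually allocates (rather than the client-side request 
bookkeeping) is a callback handler along these lines; the wiring into the existing test is 
assumed, and the usual org.apache.hadoop.yarn and java.util imports are omitted:

final AtomicInteger allocated = new AtomicInteger(); // java.util.concurrent.atomic

AMRMClientAsync.CallbackHandler handler = new AMRMClientAsync.CallbackHandler() {
    @Override
    public void onContainersAllocated(List<Container> containers) {
        // This is the number the bug inflates; compare it with the number of add calls.
        System.out.println("RM allocated " + allocated.addAndGet(containers.size()) + " so far");
    }
    @Override public void onContainersCompleted(List<ContainerStatus> statuses) { }
    @Override public void onShutdownRequest() { }
    @Override public void onNodesUpdated(List<NodeReport> updatedNodes) { }
    @Override public void onError(Throwable e) { }
    @Override public float getProgress() { return 0.0f; }
};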






[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-01-12 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273990#comment-14273990
 ] 

Peter D Kirchner commented on YARN-3020:


It looks like the bug may have come in with the code reorganization of r1494017 
on 2013-06-18.  I did not follow the log past this introduction of 
AMRMClient.java in its present form and location.

In my code on my system (and I am supposing also in yours) each 
addContainerRequest() is taking about a second even without a sleep.  The 
heartbeat I set in createAMRMClientAsync() was 1000 milliseconds (1 second), so 
I set it to 10 seconds to rule out that addContainerRequest() was somehow 
synchronous with allocate().  FWIW, for 10 containers requested, I got 17 
containers with a heartbeat of 10 seconds.  One heartbeat call to allocate() 
produced 7 containers, the next call produced 10.  On each heartbeat on which the 
AMRMClient detects a change (in the number of containers the AM has "add"ed) 
that needs to be sent to the RM, it sends the then-current total, not the diff.
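
For reference, the heartbeat mentioned above is the interval passed to createAMRMClientAsync(); 
a minimal sketch, where the handler and conf objects are assumed to already exist:

// 10000 ms heartbeat as in the experiment above; the original setting was 1000 ms.
AMRMClientAsync<AMRMClient.ContainerRequest> amRMClient =
    AMRMClientAsync.createAMRMClientAsync(10000, handler);
amRMClient.init(conf);   // YarnConfiguration
amRMClient.start();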

Limiting the AM to ~1 container request per second is impractical, so the bug 
is potentially helpful at first: the application does not have to wait 
2 minutes to assemble 100 containers; all it needs to do is call 
addContainerRequest() about 15 times, taking about 15 seconds with a 1 second 
heartbeat.  The addContainerRequest() performance will need to be improved, or 
the limitation of 1 container per addContainerRequest() introduced in r1503960 
on 2013-07-16 will need to be reversed.

But by the time one naively requests 100 containers and gets 5,050, the bug is 
probably hurting application and cluster performance.  Maybe a lot.






[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-01-13 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275825#comment-14275825
 ] 

Peter D Kirchner commented on YARN-3020:


I investigated the rates in the third paragraph of my comment immediately 
above, and found that an application is able to make addContainerRequest()s 
much faster than this.  Bear in mind that the elapsed time of the 
client-api call to addContainerRequest() is not a measurement of the 
performance impact of the reported over-requests sent to the server and the 
resulting over-allocation of containers.  It turns out my application has some 
extrinsic delay in issuing addContainerRequest()s, which predominated in 
limiting the rate I measured and reported in that third paragraph.

To follow up, I measured addContainerRequest() timing with System.nanoTime().  
The first call to addContainerRequest() takes around 5 milliseconds.  The rest 
take around half a millisecond on average.  Here are some statistics for 
calling addContainerRequest():  microseconds average=433 count=914 max=11202 
min=223 .  I measure similar times for consecutive calls (without additional 
application delays in between addContainerRequest()s).
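
For what it's worth, here is a rough sketch of how such numbers can be taken; the request 
setup and the amRMClient field are assumptions, not the original measurement code:

long min = Long.MAX_VALUE, max = 0, sum = 0;
int count = 914;
for (int i = 0; i < count; i++) {
    AMRMClient.ContainerRequest ask = new AMRMClient.ContainerRequest(
        Resource.newInstance(1024, 1), null, null, Priority.newInstance(0));
    long t0 = System.nanoTime();
    amRMClient.addContainerRequest(ask);
    long micros = (System.nanoTime() - t0) / 1000L;
    min = Math.min(min, micros);
    max = Math.max(max, micros);
    sum += micros;
}
System.out.println("microseconds average=" + (sum / count)
    + " count=" + count + " max=" + max + " min=" + min);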

When the over-request bug is fixed, I will still think it tedious to call 
addContainerRequest() 1000 times for 1000 identical containers, but many applications 
can probably afford the half second to do so.  Arguably, the bug exists in part because 
of the tediousness of the bookkeeping on the yarn-client-api side for these requests.  
If, in the process of bug-fixing or cleanup, a change re-introduced an integer 
container quantity with the request, that would be welcome.
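
Purely as an illustration of the kind of change meant here; this overload is hypothetical 
and does not exist in the current AMRMClient API:

// Hypothetical count-bearing call, NOT a real AMRMClient method: one call instead of 1000.
amRMClient.addContainerRequest(containerAsk, 1000);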






[jira] [Created] (YARN-3060) AMRMClient getMatchingRequests() returns all matches not "outstanding" requests as specified

2015-01-14 Thread Peter D Kirchner (JIRA)
Peter D Kirchner created YARN-3060:
--

 Summary: AMRMClient getMatchingRequests() returns all matches not 
"outstanding" requests as specified
 Key: YARN-3060
 URL: https://issues.apache.org/jira/browse/YARN-3060
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.5.2, 2.5.1, 2.6.0, 2.5.0
Reporter: Peter D Kirchner


The AMRMClient API documentation specifies that getMatchingRequests() returns 
*outstanding* requests that match the method's arguments.  
getMatchingRequests() currently returns all of the application's container 
requests that match the arguments, which does not seem to be within any 
reasonable meaning of "outstanding" requests.  The API specification and 
implementation need to be reconciled and the ambiguity resolved.
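
A short sketch of the call in question, with assumed priority/capability values and the usual 
imports omitted (amRMClient is the application's AMRMClient):

List<? extends Collection<AMRMClient.ContainerRequest>> matches =
    amRMClient.getMatchingRequests(Priority.newInstance(0), "*",
        Resource.newInstance(1024, 1));
int n = matches.isEmpty() ? 0 : matches.get(0).size();
// Per the javadoc, n should count only *outstanding* matching requests; per this report it
// counts every matching request that was added and not removed, even after allocation.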





[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-01-20 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284301#comment-14284301
 ] 

Peter D Kirchner commented on YARN-3020:


https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ywskycn : Please 
take a look at this snippet modifying distributedShell, and the output, and 
perhaps you will get my point.  Observe that the accounting behind what gets 
sent to the RM on heartbeats following either addContainerRequest() or 
removeContainerRequest() is defective.  100 containers are assigned as the 
result of this code, which ostensibly requests only 10: 10 adds with 
interleaved heartbeats, followed by 10 removes that, with interleaved heartbeats, 
should be no-ops.  55 containers result from the adds (1+2+3+4+5+6+7+8+9+10), 
and 45 additional containers are requested as the result of the 10 calls to remove 
(9+8+7+6+5+4+3+2+1). 

for (int i = 0; i < 20; i++) {
    try {
        ContainerRequest containerAsk = setupContainerAskForRM();

        if (i < 10) {
            amRMClient.addContainerRequest(containerAsk);      // 10 adds
        } else {
            amRMClient.removeContainerRequest(containerAsk);   // 10 matching removes
        }
        Thread.sleep(1500); // longer than the 1000 ms heartbeat, so allocate() runs in between

        List<? extends Collection<ContainerRequest>> list1 =
            amRMClient.getMatchingRequests(containerAsk.getPriority(), "*",
                containerAsk.getCapability());
        LinkedHashSet<ContainerRequest> set1 = (LinkedHashSet<ContainerRequest>) list1.get(0);
        System.out.println("i=" + i + " outstanding=" + set1.size());
    } catch (InterruptedException e1) {
        e1.printStackTrace();
    }
}

DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=1 #asks=1
i=0 outstanding=1
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=2 #asks=1
i=1 outstanding=2
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=3 #asks=1
i=2 outstanding=3
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=4 #asks=1
i=3 outstanding=4
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=5 #asks=1
i=4 outstanding=5
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=6 #asks=1
i=5 outstanding=6
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=7 #asks=1
i=6 outstanding=7
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=8 #asks=1
i=7 outstanding=8
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=9 #asks=1
i=8 outstanding=9
DEBUG [Thread-7] (AMRMClientImpl.java:585) - addResourceRequest: applicationId= priority=0 resourceName=* numContainers=10 #asks=1
i=9 outstanding=10
DEBUG [Thread-7] (AMRMClientImpl.java:619) - BEFORE decResourceRequest: applicationId= priority=0 resourceName=* numContainers=10 #asks=0
 INFO [Thread-7] (AMRMClientImpl.java:652) - AFTER decResourceRequest: applicationId= priority=0 resourceName=* numContainers=9 #asks=1
i=10 outstanding=10
DEBUG [Thread-7] (AMRMClientImpl.java:619) - BEFORE decResourceRequest: applicationId= priority=0 resourceName=* numContainers=9 #asks=0
 INFO [Thread-7] (AMRMClientImpl.java:652) - AFTER decResourceRequest: applicationId= priority=0 resourceName=* numContainers=8 #asks=1
i=11 outstanding=10
DEBUG [Thread-7] (AMRMClientImpl.java:619) - BEFORE decResourceRequest: applicationId= priority=0 resourceName=* numContainers=8 #asks=0
 INFO [Thread-7] (AMRMClientImpl.java:652) - AFTER decResourceRequest: applicationId= priority=0 resourceName=* numContainers=7 #asks=1
i=12 outstanding=10
DEBUG [Thread-7] (AMRMClientImpl.java:619) - BEFORE decResourceRequest: applicationId= priority=0 resourceName=* numContainers=7 #asks=0
 INFO [Thread-7] (AMRMClientImpl.java:652) - AFTER decResourceRequest: applicationId= priority=0 resourceName=* numContainers=6 #asks=1
i=13 outstanding=10
DEBUG [Thread-7] (AMRMClientImpl.java:619) - BEFORE decResourceRequest: applicationId= priority=0 resourceName=* numContainers=6 #asks=0
 INFO [Thread-7] (AMRMClientImpl.java:652) - AFTER decResourceRequest: applicationId= priority=0 resourceName=* numContainers=5 #asks=1
i=14 outstanding=10
DEBUG [Thread-7] (AMRMClientImpl.java:619) - BEFORE decResourceRequest: applicationId= priority=0 resourceName=* numContainers=5 #asks=0
 INFO [Thread-7] (AMRMClientImpl.java:652) - AFTER decResourceRequest: applicationId= priority=0 resourceName=* numContainers=4 #asks=1
i=15 outstanding=10
DEBUG [Thread-7] (AMRMClientImpl.java:619) - BEFORE decResourceRequest: applicationId= priority=0 resourceName=* numContainers=4 #asks=0
 INFO [Thread-7] (AMRMClientImpl.java:652) - AFTER d

[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-01-30 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299184#comment-14299184
 ] 

Peter D Kirchner commented on YARN-3020:


The expected usage you describe, combined with the current implementation, contains a 
basic synchronization problem.
The client application's RPC updates its requests to the RM before it receives the 
containers newly assigned during that heartbeat.
Therefore, if (as is currently the case) the client calculates the total requests, the 
total is too large by at least the number of matching incoming assignments.
Per expected usage and current implementation, both add and remove cause this obsolete, 
too-high total to be sent.
Cause or coincidence, I see applications (including but not limited to distributedShell) 
making matching requests in a short interval and never calling remove.
They receive the behavior they need, or closer to it than the expected usage would produce.

Further, in this API implementation/expected usage the remove API tries to 
serve two purposes that are similar but not identical: to update the 
client-side bookkeeping and to identify the request data to be sent to the 
server.  The problem here is that if there are only removes for allocated 
containers, then the server-side bookkeeping is correct until the client sends 
the total.  The removes called for incoming assigned containers should not be 
forwarded to the RM until there is at least one matching add, or a bona-fide 
removal of a previously add-ed request.

I suppose the current implementation could be defended because its error:
1) is "only" too high by the number of matching incoming assignments,
2) persists "only" for the number of heartbeats it takes to clear the out-of-sync 
condition, and
3) results in spurious allocations "only" once the application's intentional 
matching requests were granted.
I maintain that spurious allocations are a worst case, and especially damaging if 
obtained by preemption.

I want to suggest an alternative that is simpler and accurate, and limited to the 
AMRMClient and RM.  The fact that the scheduler is updated by replacement informs the 
choice of where YARN should calculate the total for a matching request.
The client is in a position to accurately calculate how much its current wants differ 
from what it has asked for over its life.
This suggests fixing the synchronization problem by having the client send the net of 
the add/remove requests it has accumulated over a heartbeat cycle, and having the RM 
update its totals from that difference using synchronized methods.
(Note that this client would not ordinarily call remove when it received a container, 
as the scheduler has already properly accounted for it when it made the allocation.)
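
A minimal client-side sketch of that "send the net delta" idea; all names are illustrative 
and this is not a patch against AMRMClientImpl:

import java.util.HashMap;
import java.util.Map;

class DeltaAskSketch {
    // Net adds minus removes since the last heartbeat, keyed by priority/resourceName/capability.
    private final Map<String, Integer> pendingDelta = new HashMap<>();

    synchronized void add(String key)    { pendingDelta.merge(key, +1, Integer::sum); }
    synchronized void remove(String key) { pendingDelta.merge(key, -1, Integer::sum); }

    // Called once per heartbeat: hand the RM only the change, then start over from zero.
    synchronized Map<String, Integer> drainForAllocate() {
        Map<String, Integer> delta = new HashMap<>(pendingDelta);
        pendingDelta.clear();
        return delta; // the RM would add these deltas to its own totals under synchronization
    }
}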






[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

2015-02-10 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315013#comment-14315013
 ] 

Peter D Kirchner commented on YARN-3020:


Hi Wei Yan,
My point, adjusted to take the "expected usage" into account, is that when 
matching requests and/or allocations are spread over multiple heartbeats, too 
many containers are requested and received.

So, suppose my application calls addContainerRequest() 10 times.

Let's take your example where the AMRMClient sends 1 container request on 
heartbeat 1, and 10 requests at heartbeat 2, overwriting the 1.
Say also that the second RPC returns with 1 container.

The second request is high by one, i.e. 10 instead of 9, because the application does not 
yet know about the incoming allocation.
Subsequent updates are also high by approximately the number of incoming containers.
My application heartbeat is 1 second and the RM is typically allocating 1 
container/node/second, so I'd expect 10 containers coming in on the third heartbeat.
Per expected usage, my AMRMClient would have sent out an updated request for 9 
containers at that time.
My application would zero out the matching request on the fourth heartbeat and 
release the nine extra containers (90% more) that it received but never 
intended to request.  

In the present implementation, with the AMRMClient keeping track of the totals, 
removeContainerRequest() properly decrements the AMRMClient's idea of the 
outstanding count.
But because this information is a heartbeat out of date vs. the scheduler's, a 
partial fix (pending a definitive one) would be for the AMRMClient not to routinely 
update the RM with this matching total whenever the scheduler's tally is likely to 
be more accurate.
The occasions when the RM should be updated are when there is a new matching 
addContainerRequest(), i.e. when the scheduler's target could otherwise be too low, 
or when the AMRMClient's outstanding count is decremented to zero.
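
A sketch of that gating, with illustrative names only: `outstanding` is the client-side count 
per matching key, and `sendOnNextHeartbeat` stands in for queueing a ResourceRequest into the 
ask for the next allocate().

void onAddContainerRequest(String key) {
    int total = outstanding.merge(key, 1, Integer::sum);
    sendOnNextHeartbeat(key, total);  // a new add: the scheduler's target could otherwise be too low
}

void onRemoveContainerRequest(String key) {
    int total = outstanding.merge(key, -1, Integer::sum);
    if (total == 0) {
        sendOnNextHeartbeat(key, 0);  // decremented to zero: tell the RM to stop asking
    }
    // Otherwise do not update the RM; the scheduler's own tally is likely more accurate.
}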


Please see my response to Wangda Tan 30 Jan 2015.
Thank you.



