[ 
https://issues.apache.org/jira/browse/VCL-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386315#comment-15386315
 ] 

Andy Kurth commented on VCL-846:
--------------------------------

The deleted (*reclaim.pm*) process doesn't take very long.  The longest 
scenario would be when it sanitizes the computer instead of inserting a reload 
request.  I'd guess this takes 30 seconds or less.

The process for the _new_ reservation, which is currently failing, needs 
additional logic: it shouldn't immediately fail if a _deleted/*_ request exists 
or a _pending/deleted_ process is running.  Instead, it should wait for that 
process to finish.  After the process has finished, it should also check whether 
the completed _pending/deleted_ process inserted a reload request.
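
A rough sketch of the wait-then-check logic described above, written in Python 
purely for illustration (it is not vcld code).  The function name, the two 
callbacks, the return strings and the 60 second timeout are all assumptions made 
for this example; the callbacks stand in for whatever database or process-table 
lookups the back end would actually use.

```python
import time
from typing import Callable


def wait_for_deleted_process(
    deleted_process_running: Callable[[], bool],  # hypothetical lookup
    reload_request_exists: Callable[[], bool],    # hypothetical lookup
    timeout_seconds: float = 60.0,                # assumed limit, not from the issue
    poll_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Wait for the pending/deleted process, then report what it left behind.

    Returns "timeout", "reload_inserted" or "clear".
    """
    deadline = clock() + timeout_seconds
    # Do not fail immediately: poll until the deleted process has exited.
    while deleted_process_running():
        if clock() >= deadline:
            return "timeout"
        sleep(poll_seconds)
    # The deleted process is gone; check whether it inserted a reload request.
    if reload_request_exists():
        return "reload_inserted"
    return "clear"
```

The caller would fail the reservation only on "timeout", and would treat 
"reload_inserted" as a signal to deal with the reload request before continuing.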

For _pending/deleted_ processes which only sanitize the computer, I think we 
should leave things as is.  The _pending/new_ process will wait for it to 
finish.

For _pending/deleted_ processes which insert a _reload_ request, we could add a 
check immediately before inserting the _reload_ request to see if a reservation 
assigned to the computer exists.  If so, do not insert the _reload_ request.  
Instead, make sure the computer state is (or gets) set to _reload_ and tag 
currentimage.txt as *tainted*.  (This tainted tag is a very new addition to the 
back end code.)  The _deleted_ process then quietly exits.  The _new_ process 
sees the tainted flag and will always reload the computer.
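
The decision described above could look roughly like the following.  This is an 
illustrative Python sketch under stated assumptions, not the actual back end 
code: the Computer class, its fields and both function names are hypothetical, 
and the boolean "tainted" field stands in for the tag in currentimage.txt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Computer:
    """Hypothetical in-memory stand-in for the computer's database row."""
    state: str = "reserved"
    tainted: bool = False  # stands in for the tainted tag in currentimage.txt
    assigned_reservations: List[int] = field(default_factory=list)
    reload_requests: List[str] = field(default_factory=list)


def finish_deleted_process(computer: Computer, image: str) -> str:
    """Final step of the deleted process, just before it would insert a reload."""
    if computer.assigned_reservations:
        # Another reservation already owns the computer: skip the reload
        # request, mark the computer so the new process reloads it, and exit.
        computer.state = "reload"
        computer.tainted = True
        return "exit_quietly"
    computer.reload_requests.append(image)
    return "reload_inserted"


def new_process_must_reload(computer: Computer) -> bool:
    """The new process always reloads a computer that is marked tainted."""
    return computer.tainted
```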

The _pending/new_ process already checks if a _reload_ process is running.  If 
it is for the same image, it waits for this process to finish.  If it is for a 
different image, it forcefully kills the _pending/reload_ process.  There may be 
some timing corner cases related to this.
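
As a compact restatement of that existing rule, here is a small Python sketch.  
The function name, the return strings and the use of None for "no reload process 
is running" are assumptions made for illustration; the real check lives in the 
back end code and is not reproduced here.

```python
from typing import Optional


def handle_running_reload(requested_image: str, reload_image: Optional[str]) -> str:
    """Decide what the pending/new process does about a running reload process.

    reload_image is the image the running reload process is loading, or None
    if no reload process is running (hypothetical convention for this sketch).
    """
    if reload_image is None:
        return "proceed"  # nothing to wait for
    if reload_image == requested_image:
        return "wait"     # same image: let the reload process finish
    return "kill"         # different image: forcefully kill the reload process
```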

We'll need to think through and check any timing issues such as:
* User deletes a reservation in _pending/reserved_, and the computer is assigned 
to another reservation before the _pending/reserved_ process exits
* User deletes a reservation in _pending/reserved_, and the computer is assigned 
to another reservation in the seconds between when the _pending/reserved_ 
process exits and the _reserved/deleted_ process starts
* User deletes a reservation in _pending/reserved_, and the computer is assigned 
to another reservation in the split second between when the _pending/deleted_ 
process checks for another reservation and when it inserts a _reload_ request


> Improve flow of handling nodes for deleted reservations assigned to new 
> reservations
> ------------------------------------------------------------------------------------
>
>                 Key: VCL-846
>                 URL: https://issues.apache.org/jira/browse/VCL-846
>             Project: VCL
>          Issue Type: Bug
>          Components: vcld (backend)
>            Reporter: Aaron Peeler
>             Fix For: 2.5
>
>
> As a user can make a new reservation, the front-end can assign a node that is 
> currently being cleaned up from a deleted reservation. So far the states 
> observed are:
> currentstate = pending
> laststate = deleted
> node state = reserved
> If this node is assigned to a new reservation, the back end checks for an 
> existing process, logs it, and fails the new reservation. It likely fails 
> because the new process on the backend does not know how much time is left to 
> clean up the deleted reservation.
> A decision/implementation on synchronizing the front-end and back-end needs 
> to be made on how to handle this case so as to avoid reservation failures.
> Suggestions are to:
> 1) only select available machines not assigned to reservations on the front 
> end. 
> 2) Or on the backend - wait until the previous deleted process is complete 
> (which could take a while depending on what can be done)
> 3) on the backend - intercept the process and force a reload - no matter 
> what the image is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
