[ https://issues.apache.org/jira/browse/VCL-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386315#comment-15386315 ]
Andy Kurth commented on VCL-846:
--------------------------------
The deleted (*reclaim.pm*) process doesn't take very long. The longest
scenario would be when it sanitizes the computer instead of inserting a reload
request. I'd guess this takes 30 seconds or less.
The process for the _new_ reservation which is currently failing needs
additional logic and shouldn't immediately fail if a _deleted/*_ request exists
or a _pending/deleted_ process is running. It should wait for the process to
finish. After the process is finished, it should also check if the completed
_pending/deleted_ process inserted a reload request.
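The wait-then-check logic described above could look roughly like the following. This is only a sketch in Python, not vcld code: the function name, the two callbacks, the return strings, and the 60 second timeout (chosen as a margin over the 30 second guess) are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_deleted_process(
    is_deleted_process_running: Callable[[], bool],  # hypothetical check
    reload_request_exists: Callable[[], bool],       # hypothetical check
    timeout_seconds: float = 60.0,                   # assumed margin, not a vcld setting
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Wait for the deleted process instead of failing immediately.

    Returns 'timeout' if the process is still running at the deadline,
    'reload_inserted' if it finished and left a reload request behind,
    or 'clear' if it finished without inserting one.
    """
    deadline = clock() + timeout_seconds
    while is_deleted_process_running():
        if clock() >= deadline:
            return "timeout"
        sleep(poll_seconds)
    # The process is finished; now check what it left behind.
    if reload_request_exists():
        return "reload_inserted"
    return "clear"
```

Only the 'timeout' outcome would need to fail the new reservation; the other two let the new process decide how to continue.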
For _pending/deleted_ processes which only sanitize the computer, I think we
should leave things as is. The _pending/new_ process will wait for it to
finish.
For _pending/deleted_ processes which insert a _reload_ request, we could add a
check immediately before inserting the _reload_ request to see if a reservation
exists assigned to the computer. If so, do not insert the _reload_ request.
Instead, make sure the computer state gets/is set to _reload_ and tag
currentimage.txt as *tainted*. (This tainted tag is a very new addition to the
back end code.) The _deleted_ process quietly exits. The _new_ process sees
the tainted flag and will always reload the computer.
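As a rough illustration of the check before inserting the reload request, here is a Python sketch. It is not the back end implementation: the Computer class, its fields, and both function names are hypothetical, and the tags set only stands in for the tainted tag in currentimage.txt.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Computer:
    """Hypothetical in-memory stand-in for a managed node."""
    state: str
    tags: Set[str] = field(default_factory=set)            # stands in for currentimage.txt tags
    reload_requests: List[str] = field(default_factory=list)


def finish_deleted_process(computer: Computer, has_other_reservation: bool, image: str) -> str:
    """Decide what the deleted process does right before it would insert a reload."""
    if has_other_reservation:
        # Another reservation already owns the computer: skip the reload
        # request, mark the computer so the new process reloads it, and exit.
        computer.state = "reload"
        computer.tags.add("tainted")
        return "exited_quietly"
    computer.reload_requests.append(image)
    return "reload_inserted"


def new_process_must_reload(computer: Computer) -> bool:
    """The new process always reloads a computer tagged as tainted."""
    return "tainted" in computer.tags
```

The point of the tainted tag in this sketch is that the new process does not need to know why the computer is in its current condition; seeing the tag is enough to force a reload.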
The _pending/new_ process already checks if a _reload_ process is running. If
it is for the same image, it waits for this process to finish. If for a
different image, it forcefully kills the _pending/reload_ process. There may be
some timing corner cases related to this.
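The existing reload check amounts to a three-way decision, sketched below in Python. The function name and the return strings are made up for illustration and do not come from the vcld code.

```python
from typing import Optional


def handle_running_reload(requested_image: str, reload_image: Optional[str]) -> str:
    """Decide how the new process treats a running reload process.

    reload_image is the image being loaded by the running reload process,
    or None if no reload process is running.
    """
    if reload_image is None:
        return "proceed"  # nothing to wait for
    if reload_image == requested_image:
        return "wait"     # same image: let the reload finish and reuse it
    return "kill"         # different image: forcefully kill the reload process
```

The timing corner cases would sit between evaluating this decision and acting on it, since the reload process can exit or change state in that window.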
We'll need to think through and check any timing issues such as:
* User deletes reservation in _pending/reserved_, computer is assigned to
another reservation before _pending/reserved_ process exits
* User deletes reservation in _pending/reserved_, computer is assigned to
another reservation in the seconds between when _pending/reserved_ process
exits and _reserved/deleted_ process starts
* User deletes reservation in _pending/reserved_, computer is assigned to
another reservation in the split second after _pending/deleted_ process checks
for another reservation and inserts a _reload_ request
> Improve flow of handling nodes for deleted reservations assigned to new
> reservations
> ------------------------------------------------------------------------------------
>
> Key: VCL-846
> URL: https://issues.apache.org/jira/browse/VCL-846
> Project: VCL
> Issue Type: Bug
> Components: vcld (backend)
> Reporter: Aaron Peeler
> Fix For: 2.5
>
>
> When a user makes a new reservation, the front end can assign a node that is
> currently being cleaned up from a deleted reservation. So far the states
> observed are:
> currentstate = pending
> laststate = deleted
> node state = reserved
> If this node is assigned to a new reservation, the back end checks for an
> existing process, logs it, and fails the new reservation. It likely fails
> because the new process on the back end does not know how much time is left
> to clean up the deleted reservation.
> A decision on how to synchronize the front end and back end needs to be made
> and implemented so that this case no longer causes reservation failures.
> Suggestions are to:
> 1) On the front end, only select available machines not assigned to
> reservations.
> 2) On the back end, wait until the previous deleted process is complete
> (which could take a while depending on what can be done).
> 3) On the back end, intercept the process and force a reload, no matter
> what the image is.