Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2618
  
    OK, good, I do understand the problem.
    
    There are a few ways I can see to make the stack trace much less likely to 
come out in the common case.  The following are in my preferred order, but I am 
open to other ideas.
    
    1) We don't delete the blobs on the nimbus side for a while after we kill 
the topology.
    Currently we delete the blobs on a timer that runs every 10 seconds by 
default, and I would have to trace through things, but I think we may do some 
other deletions before that happens.  If instead we kept a separate map (TOPO_X 
can be cleaned up after time Y), then when cleanup runs it could check that 
map: if it does not find the topology it wants to clean up, or if it finds it 
and the time has passed, it cleans it up (see the first sketch after this 
list).
    
    2) We don't output the stack trace until it has failed some number of times 
in a row.  This would mean that we would still output the error if the blob was 
deleted when it should not have been, but it would not look like an error until 
it had been gone for a second or two, hopefully long enough for the workers to 
actually have been killed (see the second sketch after this list).
    
    3) We have the supervisor inform the AsyncLocalizer about topologies that 
are in the process of being killed.
    Right now part of the issue with the race is that killing a worker can take 
a non-trivial amount of time, which makes the window in which the race can 
happen much larger.  If the supervisors told the AsyncLocalizer as soon as they 
knew a topology was being killed, it could then suppress errors for any 
topology that is in the process of being killed (see the third sketch after 
this list).  The issue here is that informing the supervisors happens in a 
background thread and is not guaranteed to happen, so it might not work as 
often as we would like.
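
    To make option 1 a bit more concrete, here is a rough sketch of the kind of 
bookkeeping I have in mind.  The class and method names are made up for 
illustration and do not come from the actual nimbus code.

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Rough sketch of option 1 (hypothetical names, not from Storm):
        // keep a map of topology id -> earliest time its blobs may be deleted,
        // and have the cleanup timer honor it.
        public class DelayedBlobCleanup {
            private final Map<String, Long> cleanupAfter = new ConcurrentHashMap<>();
            private final long graceMs;

            public DelayedBlobCleanup(long graceMs) {
                this.graceMs = graceMs;
            }

            // Called on the nimbus side when the topology is killed.
            public void topologyKilled(String topoId) {
                cleanupAfter.put(topoId, System.currentTimeMillis() + graceMs);
            }

            // Called by the periodic cleanup timer for each candidate topology.
            // Unknown topologies (nothing in the map) or topologies whose grace
            // period has passed are safe to delete, matching the rule above.
            public boolean okToDelete(String topoId) {
                Long notBefore = cleanupAfter.get(topoId);
                if (notBefore == null || System.currentTimeMillis() >= notBefore) {
                    cleanupAfter.remove(topoId);
                    return true;
                }
                return false;
            }
        }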
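
    For option 2, something along these lines would do it.  Again the names are 
just for illustration; the real code would live in the AsyncLocalizer download 
path and use whatever logger is already there.

        import org.slf4j.Logger;
        import org.slf4j.LoggerFactory;

        // Rough sketch of option 2 (hypothetical names): only log the full
        // stack trace once a blob download has failed a few times in a row.
        public class MissingBlobReporter {
            private static final Logger LOG = LoggerFactory.getLogger(MissingBlobReporter.class);
            private final int threshold;
            private int consecutiveFailures = 0;

            public MissingBlobReporter(int threshold) {
                this.threshold = threshold;
            }

            public void onSuccess() {
                consecutiveFailures = 0;
            }

            public void onFailure(String blobKey, Exception e) {
                consecutiveFailures++;
                if (consecutiveFailures >= threshold) {
                    // Missing long enough that it looks like a real problem.
                    LOG.error("Could not download blob {} after {} attempts", blobKey, consecutiveFailures, e);
                } else {
                    // Most likely a topology in the middle of being killed; stay quiet for now.
                    LOG.debug("Blob {} not found (attempt {}); suppressing stack trace", blobKey, consecutiveFailures);
                }
            }
        }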
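
    And for option 3, the supervisor side would just need a way to tell the 
localizer which topologies are going away.  This is only a sketch with made-up 
names, not the actual AsyncLocalizer API.

        import java.util.Set;
        import java.util.concurrent.ConcurrentHashMap;

        // Rough sketch of option 3 (hypothetical names): the supervisor
        // registers topologies that are being killed, and the localizer checks
        // the set before reporting a missing blob as an error.
        public class KillAwareLocalizer {
            private final Set<String> beingKilled = ConcurrentHashMap.newKeySet();

            // Supervisor calls this as soon as it learns the topology is going away.
            public void topologyBeingKilled(String topoId) {
                beingKilled.add(topoId);
            }

            // Supervisor calls this once the workers are dead and local state is cleaned up.
            public void topologyRemoved(String topoId) {
                beingKilled.remove(topoId);
            }

            // The localizer checks this before logging a missing-blob error.
            public boolean shouldReportMissingBlob(String topoId) {
                return !beingKilled.contains(topoId);
            }
        }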

