Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2618 @danny0405 {{updateBlobs}} does not need to be guarded by a lock. This is what I was talking about with the code being complex. {{requestDownloadBaseTopologyBlobs}} is protected by a lock simply because of this non-thread safe code. https://github.com/apache/storm/blob/402a371ccdb39ccd7146fe9743e91ca36fee6d15/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L222-L226 Part of this were written prior to the more to java8 so {{computeIfAbsent}} was not available. Now that it is we could replace it and I believe remove the lock, but I would want to spend some time to be sure it was not accidentally protecting something else in there too. {{requestDownloadTopologyBlobs}} looks like it does not need to be synchronized at all. It must have been a mistake on my part, but it does look like it might be providing some protection to a bug in https://github.com/apache/storm/blob/402a371ccdb39ccd7146fe9743e91ca36fee6d15/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L191-L195 Which is executing outside of a lock, but looks to not be thread safe. Declaring {{updateBlobs}} as synchronized does absolutely noting except make it have conflicts with {{requestDownloadTopologyBlobs}} and {{requestDownloadBaseTopologyBlobs}}. And if we are able to remove the locks there, then it will not be an issue at all. {{updateBlobs}} is scheduled using {{scheduleWithFixedDelay}} https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit) The javadocs clearly states that the next execution only starts a fixed delay after the previous one finished. There will only ever be one copy it running at a time. Additionally everything it does is already asynchronous so would be happening on a separate thread. Making it synchronized would just slow things down. The having a blob disappear at the wrong time is a race that will always be in the system and we cannot fix it with synchronization because it is happening on separate servers. The only thing we can do is to deal with it when it happens. The way the current code deals with it is to try again later. This means that a worker that is trying to come up for the first time will not come up until the blob is fully downloaded, but if we are trying to update the blob and it has disappeared we will simply keep the older version around until we don't need it any more. Yes we may log some exceptions while we do it, but that is the worst thing that will happen.
---