[ https://issues.apache.org/jira/browse/HDFS-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13803740#comment-13803740 ]
Colin Patrick McCabe commented on HDFS-5394:
--------------------------------------------

bq. This state machine logic seems like overkill. It's cool how we can do all this with atomic swaps, but synchronization isn't a big overhead as this isn't a hot path, there aren't that many states, and the transitions aren't that complicated. I think it'd be simpler to just use a couple different maps: replicasMap for cached replicas and pendingCache / pendingUncache for in-progress replicas.

I think the state machine makes things easier to understand, and is more efficient as well. A "big lock" architecture would end up being just as complex, because we can't do long-running operations under the lock. You'd have to carefully re-check the state of everything after re-taking the big lock, which ends up being just as complex as the compare-and-swap here.

Whether to advertise a block is a separate issue from state machine vs. ad hoc. Currently, we only advertise blocks in the {{CACHING}} state, meaning blocks that have been completely cached and not scheduled for uncaching. Originally I had an "uncaching but visible" state, but that seemed like a bad idea, since work might be scheduled on the node in the erroneous belief that the block would stay cached.

bq. I like the idea of UncachingTask, it's good future-proofing. However, I don't think it should be using the same executor as CachingTask, since the executor is currently capped at 4 threads per volume in the interest of disk contention. Uncaching tasks do no I/O, and we wouldn't want stuck uncaching tasks to stop other work. It might be better to instead have a background sweeper thread that goes through pendingUncache (kicked on-demand as necessary). This works well with a separate pendingUncache map too.

It's not really possible for uncaching tasks to "get stuck" currently, since all they do is munmap.
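The compare-and-swap transitions discussed above can be sketched roughly as below. This is an illustrative sketch, not the actual HDFS-5394 patch: the class name, method names, and exact state set are assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicReference;

public class MappableBlockSketch {
    // State names loosely follow the discussion; the real patch's enum may differ.
    public enum State { CACHING, CACHED, CACHING_CANCELLED }

    private final AtomicReference<State> state =
        new AtomicReference<>(State.CACHING);

    // The caching task does its long-running mmap/mlock work outside any lock,
    // then atomically swaps CACHING -> CACHED. A failed swap means the block
    // was cancelled mid-flight, so the task must clean up instead of advertising.
    public boolean finishCaching() {
        return state.compareAndSet(State.CACHING, State.CACHED);
    }

    // Cancellation races with the caching task; exactly one CAS wins, so no
    // "big lock" or post-lock re-validation of state is needed.
    public boolean cancelCaching() {
        return state.compareAndSet(State.CACHING, State.CACHING_CANCELLED);
    }

    public State getState() {
        return state.get();
    }
}
```

The key property is that each transition is a single atomic step, so a task that loses the race finds out immediately from the failed swap rather than by re-validating shared state under a lock.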
Although that's technically a blocking system call, it won't take long, since the mmap is not dirty. We could have a separate executor, I suppose, to prevent caching tasks from blocking uncaching ones. In the long term, we are not going to have threads block waiting for clients. Instead, we'll have tasks that reschedule themselves at some fixed period to poll whether the client has released the mmap. It would be nice to think of a way to edge-trigger this, but that's probably an optimization for later. In any case, I'll create another executor to prevent mmap from blocking munmap.

bq. NativeIO: should move the comment up to be javadoc instead

OK.

bq. A remark, this only sets nextValue to null, which is okay here, but seeing it did make me check that it's not an error:

The Java compiler complains about local variables that are used before initialization, so you don't need to check. See http://developer.nokia.com/Community/Wiki/Initializing_local_variables_in_Java:

bq. In Java, it is a fixed rule that you may not read the value from a local variable unless the compiler can prove that the variable will have been initialized. Otherwise, the code will not compile. This is the "definite assignment" rule.

bq. I don't see a method named "startUncaching", need to update the exception text

OK.

bq. Why is it called "visibleLength" instead of just "length"? We don't support partial caching, reading of blocks open for append, or cached reads of blocks that are being cached, so mention of "visibility" threw me off. That's why I wanted javadoc above.

Yes, but we plan on supporting those things in the future. So we should be clear about what we're measuring, which is not the length of the block file, but the length of the block file that the client is allowed to see.

bq. Could we keep FsDatasetImpl#validToCache method separate rather than inlining it in cacheBlock?
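The answer to this question turns on {{ConcurrentHashMap}}'s atomic insert-if-absent. A minimal sketch of the race it avoids (the class and method names here are hypothetical, not the patch's actual code):

```java
import java.util.concurrent.ConcurrentHashMap;

public class CacheAdmission {
    // Maps block ID -> placeholder state string; the types are illustrative only.
    private final ConcurrentHashMap<Long, String> replicaMap =
        new ConcurrentHashMap<>();

    // Returns true iff this caller won the race to start caching the block.
    // A separate "valid to cache" check (containsKey) followed by put() would
    // leave a window in which two threads both see the block as absent and both
    // proceed to mmap/mlock it; putIfAbsent closes that window by doing the
    // check and the insert as one atomic operation, in a single map lookup.
    public boolean tryStartCaching(long blockId) {
        return replicaMap.putIfAbsent(blockId, "CACHING") == null;
    }
}
```

This is why folding the validity check into the insert (rather than keeping it as a separate method) is not just shorter but necessary for correctness.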
The issue is that {{ConcurrentHashMap}} has a method for atomically adding a value for a given key if one is not already present. If this were a separate method to check the map and then another to add to it, we would have a race condition. Plus, we'd be checking the map twice, which is inefficient. Given that it's only 2 or 3 lines, I don't think a separate method makes sense here.

bq. Don't we need to remove a CACHING_CANCELLED MappableBlock from the replicaMap when the swap check in CachingTask fails?

Fixed, thanks.

> fix race conditions in DN caching and uncaching
> -----------------------------------------------
>
>                 Key: HDFS-5394
>                 URL: https://issues.apache.org/jira/browse/HDFS-5394
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>    Affects Versions: HDFS-4949
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5394-caching.001.patch, HDFS-5394-caching.002.patch, HDFS-5394-caching.003.patch
>
>
> The DN needs to handle situations where it is asked to cache the same replica more than once. (Currently, it can actually do two mmaps and mlocks.) It also needs to handle the situation where caching a replica is cancelled before said caching completes.

--
This message was sent by Atlassian JIRA
(v6.1#6144)