[ https://issues.apache.org/jira/browse/HDFS-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814568#comment-13814568 ]
Colin Patrick McCabe commented on HDFS-5394:
--------------------------------------------

bq. Unused imports in FsDatasetImpl and FsVolumeImpl

removed

bq. Do we still need to rename getExecutor to getCacheExecutor in FsVolumeImpl?

Well, the name of the variable is {{cacheExecutor}}; shouldn't the getter be {{getCacheExecutor}}?

bq. State#isUncaching() is unused

removed

bq. Could use a core pool size of 0 for uncachingExecutor, I don't think it's that latency sensitive

agreed

bq. usedBytes javadoc: "more things to cache that we can't actually do because of" is an awkward turn of phrase, maybe say "assign more blocks than we can actually cache because of" instead

ok

bq. MappableBlock#load javadoc: visibleLeng parameter should be renamed to length. The return value is now also a MappableBlock, not a boolean.

fixed

bq. Key: rename id to blockId for clarity? or add a bit of javadoc

added javadoc

bq. Naming the HashMap replicaMap is confusing since there's already a datanode ReplicaMap class. Maybe mappableBlockMap instead?

ok

bq. Caching can fail if the underlying block is invalidated in between getting the block's filename and running the CacheTask. It'd be nice to distinguish this race from a real error for when we do metrics (and also quash the exception).

I just added a catch block for the {{FileNotFoundException}}, which both {{getBlockInputStream}} and {{getMetaDataInputStream}} can throw. I still think we want to log this exception, but at INFO rather than WARN. We will retry sending the {{DNA_CACHE}} command (once HDFS-5366 is committed), so hitting this narrow race while a block is being moved is just a temporary setback.

bq. If we get a DNA_CACHE for a block that is currently being uncached, shouldn't we try to cancel the uncache and re-cache it? The NN will resend the command, but it'd be better to not have to wait for that.

We don't know how far along the uncaching process is. We can't cancel it if we already called {{munmap}}.
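As a rough illustration of the cancellation semantics under discussion, here is a self-contained sketch. The names ({{State}}, {{mappableBlockMap}}, {{uncacheBlock}}) follow the review comments, but the code below is a reconstruction for this comment, not the patch itself:

```java
import java.util.HashMap;
import java.util.Map;

public class MappableBlockStates {
  // Hypothetical state enum mirroring the discussion: anything in UNCACHING
  // has a Runnable scheduled on the executor; CACHING_CANCELLED means the
  // in-flight CachingTask should simply discard its work when it finishes.
  enum State { CACHING, CACHING_CANCELLED, CACHED, UNCACHING }

  static final Map<Long, State> mappableBlockMap = new HashMap<>();

  // Sketch of a switch/case-style uncacheBlock, per the review suggestion.
  static synchronized void uncacheBlock(long blockId) {
    State prev = mappableBlockMap.get(blockId);
    if (prev == null) {
      return; // never cached, or already fully uncached: nothing to do
    }
    switch (prev) {
      case CACHING:
        // Can't interrupt the mmap/mlock in progress; just mark it so the
        // CachingTask tears its work down when it completes.
        mappableBlockMap.put(blockId, State.CACHING_CANCELLED);
        break;
      case CACHED:
        mappableBlockMap.put(blockId, State.UNCACHING);
        // the real code would schedule the munmap on uncachingExecutor here
        break;
      default:
        // CACHING_CANCELLED or UNCACHING: an uncache is already pending
        break;
    }
  }
}
```

This also shows why cancelled is not the same thing as uncaching: a {{CACHING_CANCELLED}} entry has no scheduled {{Runnable}} at all.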
We could allow cancellation of pending uncaches by splitting {{UNCACHING}} into {{UNCACHING_SCHEDULED}} and {{UNCACHING_IN_PROGRESS}}, and only allowing cancellation in the former state. This might be a good improvement to make as part of HDFS-5182. But for now, the uncaching process is really quick, so let's keep it simple.

bq. Could this be written with value.state == State.CACHING_CANCELLED instead? Would be clearer, and I believe equivalent since uncacheBlock won't set the state to UNCACHING if it's CACHING or CACHING_CANCELLED.

Well, if {{value}} is null, you don't want to be dereferencing it, right?

bq. Even better would be interrupting a CachingTask on uncache since it'll save us I/O and CPU.

That kind of interruption logic gets complex quickly. I'd rather save that for a potential performance-improvement JIRA later down the line. I also think that if we're thrashing (cancelling caching requests right and left), the real fix might be on the NameNode anyway...

bq. Could we combine CACHING_CANCELLED into UNCACHING? It seems like CachingTask could check for UNCACHING in that if statement at the end and uncache, same sort of change for uncacheBlock.

I would rather not do that, since right now we can look at entries in the map and instantly know that anything in state {{UNCACHING}} has an associated {{Runnable}} scheduled on the {{Executor}}. Cancelled is not really the same thing as uncaching, since in the former case there is actually nothing to do!

bq. I think using a switch/case on the prevValue.state in uncacheBlock would be clearer

ok

bq. 6,000,000 milliseconds seem like very long test timeouts. Can we change them to, say, 60,000?

The general idea is to do anything that can time out inside {{GenericTestUtils#waitFor}} blocks. The waitFor blocks actually give useful backtraces and messages when they time out, unlike the generic test timeouts.
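To illustrate the pattern (a minimal, self-contained stand-in for this comment, not the actual {{GenericTestUtils}} implementation):

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class WaitForSketch {
  // Poll a condition until it holds or a deadline passes. On timeout, fail
  // with a message and stack trace pointing at the condition that was being
  // waited for, unlike a blanket test-level timeout, which kills the test
  // with no context.
  static void waitFor(Supplier<Boolean> check, long checkEveryMillis,
      long waitForMillis) throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + waitForMillis;
    while (!check.get()) {
      if (System.currentTimeMillis() > deadline) {
        throw new TimeoutException(
            "Timed out after " + waitForMillis + " ms waiting for condition");
      }
      Thread.sleep(checkEveryMillis);
    }
  }
}
```

Each step that can hang runs inside such a block with a tight per-condition timeout (say, 60,000 ms), so the overall test timeout is only a backstop.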
I wanted to avoid the scenario where the test-level timeouts kick in, but out of paranoia, I set the overall test timeout to 10 minutes in case there was some other unexpected timeout. I wanted to avoid the issues we've had with zombie tests in Jenkins causing heisenfailures.

bq. Are these new log prints for sanity checking? Maybe we can just remove them.

It's more so you can see what's going on in the sea of log messages; otherwise, it becomes hard to debug.

bq. Some of the comments seem to refer to a previous patch version that used a countdown latch.

fixed

bq. It's unclear what this is testing beyond caching and then uncaching a bunch of blocks. Can we check for log prints to see that it's actually cancelling as expected? Any other ideas for definitively hitting cancellation?

We could add callback hooks to more points in the system, and set up a bunch of countdown latches (or similar), but it might be overkill here. I have looked at the logs and I do see cancellation. You'd have to hit a GC or something to avoid it in this test, I think.

> fix race conditions in DN caching and uncaching
> -----------------------------------------------
>
>                 Key: HDFS-5394
>                 URL: https://issues.apache.org/jira/browse/HDFS-5394
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5394-caching.001.patch, HDFS-5394-caching.002.patch, HDFS-5394-caching.003.patch, HDFS-5394-caching.004.patch, HDFS-5394.005.patch, HDFS-5394.006.patch, HDFS-5394.007.patch
>
>
> The DN needs to handle situations where it is asked to cache the same replica more than once. (Currently, it can actually do two mmaps and mlocks.) It also needs to handle the situation where caching a replica is cancelled before said caching completes.

--
This message was sent by Atlassian JIRA
(v6.1#6144)