[ 
https://issues.apache.org/jira/browse/HDFS-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816735#comment-13816735
 ] 

Colin Patrick McCabe commented on HDFS-5394:
--------------------------------------------

bq. CACHING_CANCELLED discussion

Yeah, it does make more sense to check explicitly for the states we expect to 
be in, rather than having a catch-all.  I have changed this to use 
{{Preconditions}} to assert that we are in the correct state, since that seemed 
more appropriate.  It also makes it clearer that we need to be in the 
{{CACHING}} or {{CACHING_CANCELLED}} state there.
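
For illustration, the kind of explicit state check described above might look roughly like the sketch below.  This is not the code from the patch: the class name, the {{shouldKeepMapping}} method, and the {{CACHED}} and {{UNCACHING}} enum values are assumptions made for the example, and the local {{checkState}} helper is only a stand-in for Guava's {{Preconditions.checkState(boolean, Object)}} so that the snippet compiles without extra dependencies.

```java
final class ReplicaCacheStateCheck {
    // Hypothetical state set. Only CACHING and CACHING_CANCELLED are named in
    // the comment above; CACHED and UNCACHING are assumed for illustration.
    enum State { CACHING, CACHING_CANCELLED, CACHED, UNCACHING }

    // Local stand-in for Guava's Preconditions.checkState(boolean, Object):
    // throws IllegalStateException when the expression is false.
    static void checkState(boolean expression, Object errorMessage) {
        if (!expression) {
            throw new IllegalStateException(String.valueOf(errorMessage));
        }
    }

    // Hypothetical hook called when a caching task finishes. Instead of a
    // catch-all branch, it asserts that the replica is in one of the two
    // states that are legal at this point.
    static boolean shouldKeepMapping(State current) {
        checkState(
            current == State.CACHING || current == State.CACHING_CANCELLED,
            "Unexpected state at end of caching: " + current);
        // Keep the result only if nobody cancelled the caching in the meantime.
        return current == State.CACHING;
    }
}
```

With this shape, an unexpected state fails loudly with an {{IllegalStateException}} rather than being silently absorbed by a default branch.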

bq. Makes sense, though I'll note that 6,000,000 is 100 minutes, not ten 
minutes. Overkill.

Noted.  Reduced this to 10 minutes, which should be ample.
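
As a quick check of the arithmetic, assuming the timeout is expressed in milliseconds (which is what the 100-minute figure implies), the conversion can be done with {{java.util.concurrent.TimeUnit}} instead of a hand-counted literal.  The class and method names below are made up for the example and do not come from the patch.

```java
import java.util.concurrent.TimeUnit;

final class CacheTimeoutMath {
    // Hypothetical helper: converts a millisecond timeout to whole minutes.
    static long toMinutes(long millis) {
        return TimeUnit.MILLISECONDS.toMinutes(millis);
    }

    // Hypothetical helper: a ten-minute timeout in milliseconds, derived
    // rather than typed out, to avoid miscounting zeros.
    static long tenMinutesInMillis() {
        return TimeUnit.MINUTES.toMillis(10);
    }

    public static void main(String[] args) {
        System.out.println(toMinutes(6_000_000L)); // prints 100
        System.out.println(tenMinutesInMillis());  // prints 600000
    }
}
```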

bq. Do we need that Preconditions check in setUp? There's already an assumeTrue 
for the same thing right above it, so I don't think it'll do anything.

No, it's a repeat of the previous one.  Removed.

bq. I'd like to see the LogVerificationAppender used in 
testUncachingBlocksBeforeCachingFinishes too. This seems like it might be flaky 
though. What was wrong with the old approach that used a barrier to force 
ordering?

The problem is that we don't have a barrier in all the places we would need 
one.  We'd need to know that the DN had received the DN_CACHE heartbeat 
response and initiated caching during the 3-second window it has to do so, in 
order to know that we would later see a log message about cancellation.  
Checking for the log message would be, as you guessed, flaky, and we don't 
need another flaky test.

I'd like to keep a LogVerificationAppender for this test in mind as a future 
improvement, but still get this fix committed soon since HDFS-5366, HDFS-5320, 
HDFS-5451, and HDFS-5431 all depend on this patch to some extent.  Perhaps we 
can roll a test improvement for this into HDFS-5451, since that JIRA is all 
about debuggability and logging.

> fix race conditions in DN caching and uncaching
> -----------------------------------------------
>
>                 Key: HDFS-5394
>                 URL: https://issues.apache.org/jira/browse/HDFS-5394
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5394-caching.001.patch, 
> HDFS-5394-caching.002.patch, HDFS-5394-caching.003.patch, 
> HDFS-5394-caching.004.patch, HDFS-5394.005.patch, HDFS-5394.006.patch, 
> HDFS-5394.007.patch, HDFS-5394.008.patch
>
>
> The DN needs to handle situations where it is asked to cache the same replica 
> more than once.  (Currently, it can actually do two mmaps and mlocks.)  It 
> also needs to handle the situation where caching a replica is cancelled 
> before said caching completes.


