[ https://issues.apache.org/jira/browse/HADOOP-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hrishikesh Gadre updated HADOOP-14044:
--------------------------------------
    Attachment: HADOOP-14044-002.patch

[~xiaochen] Thanks for the feedback. Here is a patch implementing this approach. I verified this patch manually on a real cluster. To do so, I had to disable the ZooKeeper watch so that the inconsistency between the local cache and the ZK state could be reproduced. Please take a look and let me know your feedback. Because the problem depends on concurrent timing, I think it would be difficult to write a unit test for this scenario.

> Synchronization issue in delegation token cancel functionality
> --------------------------------------------------------------
>
>                 Key: HADOOP-14044
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14044
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Hrishikesh Gadre
>            Assignee: Hrishikesh Gadre
>         Attachments: dt_fail.log, dt_success.log, HADOOP-14044-001.patch,
> HADOOP-14044-002.patch
>
>
> We are using Hadoop delegation token authentication functionality in Apache
> Solr. As part of the integration testing, I found the following issue with the
> delegation token cancelation functionality.
> Consider a setup with 2 Solr servers (S1 and S2) which are configured to use
> delegation token functionality backed by ZooKeeper. Now perform the following
> steps:
> [Step 1] Send a request to S1 to create a delegation token.
> (Delegation token DT is created successfully.)
> [Step 2] Send a request to cancel DT to S2.
> (DT is canceled successfully. The client receives an HTTP 200 response.)
> [Step 3] Send a request to cancel DT to S2 again.
> (DT cancelation fails. The client receives an HTTP 404 response.)
> [Step 4] Send a request to cancel DT to S1.
> At this point we get one of two different responses:
> - DT cancelation fails. The client receives an HTTP 404 response.
> - DT cancelation succeeds. The client receives an HTTP 200 response.
> Also, as per the current implementation, each server maintains an in-memory
> cache of current tokens which is updated using the ZK watch mechanism, e.g.
> the ZK watch on S1 will ensure that the in-memory cache is synchronized after
> step 2.
> After investigation, I found that the root cause of this behavior is a
> race condition between step 4 and the firing of the ZK watch on S1. Whenever the
> watch fires before step 4, we get an HTTP 404 response (as expected). When
> that is not the case, we get an HTTP 200 response along with the following ERROR
> message in the log:
> {noformat}
> Attempted to remove a non-existing znode /ZKDTSMTokensRoot/DT_XYZ
> {noformat}
> From the client's perspective, the server *should* return an HTTP 404 error when the
> cancel request is sent out for an invalid token.
> Ref: Here is the relevant Solr unit test for reference:
> https://github.com/apache/lucene-solr/blob/746786636404cdb8ce505ed0ed02b8d9144ab6c4/solr/core/src/test/org/apache/solr/cloud/TestSolrCloudWithDelegationTokens.java#L285
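
The race described in the issue can be illustrated with a small, self-contained Java model. This is only a sketch under stated assumptions, not the Hadoop implementation and not necessarily what the attached patch does: `SharedTokenStore`, `TokenServer`, `refreshFromStore`, `cancelUsingCacheOnly`, and `cancelAfterSync` are hypothetical names, the shared store stands in for the token znodes in ZooKeeper, and a delayed watch is modeled by simply not refreshing S1's cache before step 4.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Replays steps 1-4 from the issue against the hypothetical model below. */
final class StaleCacheCancelSketch {
    /** Returns the HTTP-like statuses of steps 2, 3 and 4. */
    static int[] run(boolean syncBeforeCancel) {
        SharedTokenStore store = new SharedTokenStore();
        TokenServer s1 = new TokenServer(store);
        TokenServer s2 = new TokenServer(store);

        s1.create("DT");                             // Step 1: S1 creates DT
        s2.refreshFromStore("DT");                   // S2 learns about DT (watch fired)
        int step2 = s2.cancelUsingCacheOnly("DT");   // Step 2: cancel on S2
        int step3 = s2.cancelUsingCacheOnly("DT");   // Step 3: cancel on S2 again
        // Step 4: cancel on S1 while its watch has NOT fired yet (stale cache).
        int step4 = syncBeforeCancel
                ? s1.cancelAfterSync("DT")
                : s1.cancelUsingCacheOnly("DT");
        return new int[] {step2, step3, step4};
    }

    public static void main(String[] args) {
        System.out.println("cache only: " + Arrays.toString(run(false)));
        System.out.println("sync first: " + Arrays.toString(run(true)));
    }
}

/** Hypothetical stand-in for the token znodes kept in ZooKeeper. */
final class SharedTokenStore {
    private final Set<String> tokens = new HashSet<>();

    synchronized void add(String id) { tokens.add(id); }
    synchronized boolean remove(String id) { return tokens.remove(id); }
    synchronized boolean contains(String id) { return tokens.contains(id); }
}

/** Hypothetical server whose local cache changes only when refreshFromStore runs. */
final class TokenServer {
    private final SharedTokenStore store;
    private final Set<String> localCache = new HashSet<>();

    TokenServer(SharedTokenStore store) { this.store = store; }

    void create(String id) {
        store.add(id);
        localCache.add(id);
    }

    /** Models the watch firing: copy the shared state for this token into the cache. */
    void refreshFromStore(String id) {
        if (store.contains(id)) {
            localCache.add(id);
        } else {
            localCache.remove(id);
        }
    }

    /** Cancel that trusts only the local cache; returns an HTTP-like status. */
    int cancelUsingCacheOnly(String id) {
        if (!localCache.remove(id)) {
            return 404; // token is unknown locally
        }
        // If another server already removed the token, this removes nothing,
        // which corresponds to the "non-existing znode" message in the log.
        store.remove(id);
        return 200;
    }

    /** Assumed mitigation: re-read the shared state before deciding. */
    int cancelAfterSync(String id) {
        refreshFromStore(id);
        return cancelUsingCacheOnly(id);
    }
}
```

In this model, the cache-only path reproduces the unexpected HTTP 200 at step 4, while re-reading the shared state first yields the HTTP 404 the client expects. The model is single-threaded on purpose: it fixes the ordering "step 4 before the watch fires" instead of relying on real timing, which is what makes the behavior repeatable here.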