[ https://issues.apache.org/jira/browse/HDFS-14822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922724#comment-16922724 ]
Chen Liang edited comment on HDFS-14822 at 9/4/19 6:14 PM: ----------------------------------------------------------- Effectively there are three cases getLastSeenStateId() gets called currently. a. GlobalStatIdContext.receiveRequestState() b. Handler in Server.run() c. GlobalStateIdContext.updateResponseState() And I think none of these places really has to rely on this lock: (a): The use case here is when server side receives an RPC request, it reads the client state id field and builds the in-memory call object. To do this it gets the server id (which acquires the lock currently). The server id has two use cases here: 1. if current NN is active, and client state is larger than ANN id (which is an invalid case), the code sets the call id to server id overwriting client id. Since this is an invalid case, I don't think it worth protecting with a lock. Also, removing the lock means overwriting client state id with a potentially older server id, which is still fine. 2. checks if server id is too far behind client id, if yes, throws exception. For this case, removing the lock means getting inaccurate server id. But if it's just for the purpose of evaluating if NN is too far behind, the impact of this inaccuracy should be insignificant. (b): The use case here is when server side needs to check if a call should be reinserted to the queue. It needs to check if the call is ahead or behind the state id of the NN itself. If we remove the lock here, we may get a potentially not so current, older NN id. For Observer, this just means there is higher chance client id can be larger than server id, and the call just gets reinserted, which is fine. For ANN, because of the check in (a), if client id is larger than ANN id, it would have been overwritten by ANN state by that time. So by the time here, it could not be larger than ANN id. (c): This is where NN replies the RPC call, it sets its own state id in the RPC response. For ANN, removing the locking here means "There would be no guarantee client will always get the most recent state of ANN" (this is what the comments are trying to say). But thinking further, even with current locking, it could happen that while the response is on the fly, ANN state has already proceeded. There is just no way of getting THE current state of ANN. But, this should not matter anyway. All that matters is that client sees consistent view. Due to the fact that here, get state id is called on replying to client, even without the locking, the txnid of this particular client operation is guaranteed to be updated at this point (also because txnid is a volatile) which further ganrantees that client always sees his/her operation being done with it returned. was (Author: vagarychen): Effectively there are three cases getLastSeenStateId() gets called currently. a. GlobalStatIdContext.receiveRequestState() b. Handler in Server.run() c. GlobalStateIdContext.updateResponseState() (a): The use case here is when server side receives an RPC request, it reads the client state id field and builds the in-memory call object. To do this it gets the server id (which acquires the lock currently). The server id has two use cases here: 1. if current NN is active, and client state is larger than ANN id (which is an invalid case), the code sets the call id to server id overwriting client id. Since this is an invalid case, I don't think it worth protecting with a lock. Also, removing the lock means overwriting client state id with a potentially older server id, which is still fine. 2. checks if server id is too far behind client id, if yes, throws exception. For this case, removing the lock means getting inaccurate server id. But if it's just for the purpose of evaluating if NN is too far behind, the impact of this inaccuracy should be insignificant. (b): The use case here is when server side needs to check if a call should be reinserted to the queue. It needs to check if the call is ahead or behind the state id of the NN itself. If we remove the lock here, we may get a potentially not so current, older NN id. For Observer, this just means there is higher chance client id can be larger than server id, and the call just gets reinserted, which is fine. For ANN, because of the check in (a), if client id is larger than ANN id, it would have been overwritten by ANN state by that time. So by the time here, it could not be larger than ANN id. (c): This is where NN replies the RPC call, it sets its own state id in the RPC response. For ANN, removing the locking here means "There would be no guarantee client will always get the most recent state of ANN" (this is what the comments are trying to say). But thinking further, even with current locking, it could happen that while the response is on the fly, ANN state has already proceeded. There is just no way of getting THE current state of ANN. But, this should not matter anyway. All that matters is that client sees consistent view. Due to the fact that here, get state id is called on replying to client, even without the locking, the txnid of this particular client operation is guaranteed to be updated at this point (also because txnid is a volatile) which further ganrantees that client always sees his/her operation being done with it returned. > [SBN read] Revisit GlobalStateIdContext locking when getting server state id > ---------------------------------------------------------------------------- > > Key: HDFS-14822 > URL: https://issues.apache.org/jira/browse/HDFS-14822 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs > Reporter: Chen Liang > Assignee: Chen Liang > Priority: Major > > As mentioned under HDFS-14277. One potential performance issue of Observer > read is that {{GlobalStateIdContext#getLastSeenStateId}} calls > getCorrectLastAppliedOrWrittenTxId which ends up acquiring lock on txnid. We > internally had some discussion and analysis, we believe this lock can be > avoided, by calling the non-locking version method > {{getLastAppliedOrWrittenTxId.}} -- This message was sent by Atlassian Jira (v8.3.2#803003) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org