[ https://issues.apache.org/jira/browse/HDFS-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905606#comment-13905606 ]
Yongjun Zhang commented on HDFS-5939:
-------------------------------------

Thanks Haohui.

Indeed, the contract of Random.nextInt(int) expects numOfDatanodes to be greater than 0; otherwise it throws IllegalArgumentException("n must be positive"). That is what I listed in the original bug report, and we had not seen this exception thrown from NetworkTopology.chooseRandom(String scope, String excludedScope) until HDFS-5939. Investigation of this bug shows that numOfDatanodes is 0 because no DataNode is running in this case.

Prior to my fix, there were three ways the method NetworkTopology.chooseRandom(String scope, String excludedScope) could finish:

1. return a valid Node
2. return null (at the beginning of the method)
3. throw the above exception when calling Random.nextInt() (at the end of the method)

It seems none of the callers of this method check for case 2. If it happened, the caller would hit a NullPointerException (again, there is no report saying this has ever happened). HDFS-5939 is case 3, where the caller is NamenodeWebHdfs.redirectURI(..).

My submitted fix makes chooseRandom return null before calling Random.nextInt() when numOfDatanodes is 0, and throws NoDatanodeException from the caller side. Basically, my fix replaces the IllegalArgumentException with NoDatanodeException for this case, with an explicit message to help the user.

With my submitted fix, if numOfDatanodes == 0 happens for other callers of chooseRandom in a real case, the fix won't really hide the problem: it will result in a NullPointerException instead of the IllegalArgumentException. This is now covered by HDFS-5970. I hope there is a field report of HDFS-5970 before we fix it, so we can understand why it happened.

An alternative to my fix is to change the exception spec of the NetworkTopology.chooseRandom interface and let it throw NoDatanodeException instead of IllegalArgumentException.
I didn't do this in my submitted fix for two reasons:

- The caller has a better chance of providing a more helpful message.
- The impact of changing the interface is wider.

Would you please let me know what you think? Thanks.

> WebHdfs returns misleading error code and logs nothing if trying to create a file with no DNs in cluster
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5939
>                 URL: https://issues.apache.org/jira/browse/HDFS-5939
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.3.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-5939.001.patch
>
>
> When trying to access hdfs via webhdfs, and when datanode is dead, user will see an exception below without any clue that it's caused by dead datanode:
> $ curl -i -X PUT ".../webhdfs/v1/t1?op=CREATE&user.name=<userName>&overwrite=false"
> ...
> {"RemoteException":{"exception":"IllegalArgumentException","javaClassName":"java.lang.IllegalArgumentException","message":"n must be positive"}}
> Need to fix the report to give user hint about dead datanode.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
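
As a rough illustration of the behavior discussed in the comment, the sketch below contrasts the pre-fix outcome (Random.nextInt(0) throwing IllegalArgumentException) with the approach of returning null and letting the caller throw a more descriptive exception. It is not the actual Hadoop code: the class ChooseRandomSketch, its method names, the List<String> stand-in for the topology, the exception's superclass, and the message text are all assumptions made for the example.

```java
import java.util.List;
import java.util.Random;

// Simplified stand-in for the behavior described in the comment.
// This is NOT org.apache.hadoop.net.NetworkTopology; all names are hypothetical.
class ChooseRandomSketch {

    /** Hypothetical exception, named after the one mentioned in the comment. */
    static class NoDatanodeException extends java.io.IOException {
        NoDatanodeException(String message) {
            super(message);
        }
    }

    private static final Random RANDOM = new Random();

    /** Pre-fix shape: with an empty list, Random.nextInt(0) throws IllegalArgumentException. */
    static String chooseRandomBeforeFix(List<String> datanodes) {
        return datanodes.get(RANDOM.nextInt(datanodes.size()));
    }

    /** Post-fix shape: return null before Random.nextInt() is reached with a bound of 0. */
    static String chooseRandomAfterFix(List<String> datanodes) {
        if (datanodes.isEmpty()) {
            return null;
        }
        return datanodes.get(RANDOM.nextInt(datanodes.size()));
    }

    /** Caller-side check: turn the null result into an exception with an explicit message. */
    static String chooseDatanodeForRedirect(List<String> datanodes) throws NoDatanodeException {
        String node = chooseRandomAfterFix(datanodes);
        if (node == null) {
            // Assumed wording; the caller knows the context and can explain the failure.
            throw new NoDatanodeException("No datanode is available to serve the request");
        }
        return node;
    }
}
```

A caller that skips the null check, as the other callers of chooseRandom are said to do, would instead fail with a NullPointerException when it dereferences the result, which is the situation tracked separately in HDFS-5970.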