[
https://issues.apache.org/jira/browse/HDFS-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905606#comment-13905606
]
Yongjun Zhang commented on HDFS-5939:
-------------------------------------
Thanks Haohui.
Indeed, the contract of Random.nextInt(int n) requires its argument (here
numOfDatanodes) to be greater than 0; otherwise it throws
IllegalArgumentException("n must be positive").
That's what I listed in the original bug report, and we hadn't seen this
exception thrown from
NetworkTopology.chooseRandom(String scope, String excludedScope)
until HDFS-5939.
Investigation of this bug shows that numOfDatanodes is 0 because no dataNode is
running in this case.
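
To illustrate the contract mentioned above, here is a minimal standalone
sketch (the class and method names are made up for illustration and are not
part of HDFS). It only shows that java.util.Random.nextInt(int) rejects a
bound of 0; the exact message text may differ between JDK versions.

```java
import java.util.Random;

public class NextIntBoundDemo {
    // Hypothetical helper: returns true if Random.nextInt(bound) rejects the bound.
    static boolean rejectsBound(int bound) {
        try {
            new Random().nextInt(bound);
            return false;
        } catch (IllegalArgumentException e) {
            // This is the exception that surfaced through WebHDFS in HDFS-5939.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("bound 0 rejected: " + rejectsBound(0));
        System.out.println("bound 3 rejected: " + rejectsBound(3));
    }
}
```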
Prior to my fix, there were three ways the method
NetworkTopology.chooseRandom(String scope, String excludedScope)
could finish:
1. return a valid Node
2. return null (at the beginning of the method)
3. throw the above exception when calling Random.nextInt() (at the end of the
method).
It seems none of the callers of this method check for case 2. If it happened,
the caller would hit a NullPointerException (again, there is no report saying
this ever happened).
HDFS-5939 is case 3, where the caller is NamenodeWebHdfs.redirectURI(..). My
submitted fix makes the chooseRandom method return null before calling
Random.nextInt() when numOfDatanodes is 0, and throws NoDatanodeException from
the caller side. Basically, my fix replaces the IllegalArgumentException with
NoDatanodeException for this case, with an explicit message to help the user.
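
The shape of that fix can be sketched roughly as follows. This is a
simplified, self-contained illustration and not the actual patch: the class
ChooseRandomSketch, the list-based chooseRandom signature, the redirectTarget
method, and the message text are all assumptions made for the example, and
NoDatanodeException is declared locally as a stand-in.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ChooseRandomSketch {
    // Local stand-in for the exception named in the comment (assumed shape).
    static class NoDatanodeException extends IOException {
        NoDatanodeException(String msg) { super(msg); }
    }

    private static final Random RANDOM = new Random();

    // Simplified stand-in for NetworkTopology.chooseRandom: returns null
    // instead of reaching Random.nextInt(0) when there are no datanodes.
    static String chooseRandom(List<String> datanodes) {
        int numOfDatanodes = datanodes.size();
        if (numOfDatanodes == 0) {
            return null;
        }
        return datanodes.get(RANDOM.nextInt(numOfDatanodes));
    }

    // Simplified stand-in for the caller: turns the null result into an
    // exception with a message that points at the real cause.
    static String redirectTarget(List<String> datanodes) throws NoDatanodeException {
        String node = chooseRandom(datanodes);
        if (node == null) {
            throw new NoDatanodeException("No active datanode found in the cluster");
        }
        return node;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(redirectTarget(Arrays.asList("dn1", "dn2")));
        try {
            redirectTarget(Collections.<String>emptyList());
        } catch (NoDatanodeException e) {
            System.out.println("caller reported: " + e.getMessage());
        }
    }
}
```

Any other caller that skips the null check would still fail, only with a
NullPointerException rather than an IllegalArgumentException.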
With my submitted fix, if numOfDatanodes == 0 happens for other callers of the
chooseRandom method in a real deployment, my fix won't really hide the
problem: it will result in a NullPointerException instead of the
IllegalArgumentException. This is now covered by HDFS-5970. I hope there is a
field report of HDFS-5970 before we fix it, so we can understand why it
happened.
An alternative to my fix is to change the exception spec of the
NetworkTopology.chooseRandom interface so that it throws
NoDatanodeException instead of IllegalArgumentException. I didn't do this in my
submitted fix for two reasons:
- the caller has a better chance to provide a more helpful message.
- the impact of changing the interface is wider.
Would you please let me know what you think? Thanks.
> WebHdfs returns misleading error code and logs nothing if trying to create a
> file with no DNs in cluster
> --------------------------------------------------------------------------------------------------------
>
> Key: HDFS-5939
> URL: https://issues.apache.org/jira/browse/HDFS-5939
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 2.3.0
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Attachments: HDFS-5939.001.patch
>
>
> When trying to access hdfs via webhdfs, and when datanode is dead, user will
> see an exception below without any clue that it's caused by dead datanode:
> $ curl -i -X PUT
> ".../webhdfs/v1/t1?op=CREATE&user.name=<userName>&overwrite=false"
> ...
> {"RemoteException":{"exception":"IllegalArgumentException","javaClassName":"java.lang.IllegalArgumentException","message":"n must be positive"}}
> Need to fix the report to give user hint about dead datanode.