[ 
https://issues.apache.org/jira/browse/HDFS-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905606#comment-13905606
 ] 

Yongjun Zhang commented on HDFS-5939:
-------------------------------------

Thanks Haohui.

Indeed, the contract of Random.nextInt(int n) requires n (here numOfDatanodes) 
to be greater than 0; otherwise it throws
   IllegalArgumentException("n must be positive");
That's what I listed in the original bug report, and we hadn't seen this 
exception thrown from 
  NetworkTopology.chooseRandom(String scope, String excludedScope)
until HDFS-5939.

Investigation of this bug shows that numOfDatanodes is 0 because no dataNode is 
running in this case.
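
To illustrate the failure mode described above, here is a minimal sketch (not 
the HDFS code; the class and helper names are hypothetical). It only relies on 
java.util.Random.nextInt(int) rejecting a non-positive bound. The exception 
message text differs between JDK versions, so the sketch checks the exception 
type only.

```java
import java.util.Random;

final class NextIntBoundDemo {
    // Hypothetical helper: returns true if drawing a random index among
    // `count` items throws IllegalArgumentException.
    static boolean throwsOnBound(int count) {
        try {
            new Random().nextInt(count);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("count=0 throws: " + throwsOnBound(0));
        System.out.println("count=3 throws: " + throwsOnBound(3));
    }
}
```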

Prior to my fix, there were three ways the method 
  NetworkTopology.chooseRandom(String scope, String excludedScope)
could finish:
1. return a valid Node
2. return null (at the beginning of the method)
3. throw the above exception when calling Random.nextInt() (at the end of the 
method).
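
The three outcomes can be sketched roughly as follows. This is a simplified 
stand-in, not the real NetworkTopology code: nodes are plain strings, and the 
condition that triggers the early null return is an assumption, modelled here 
by a hypothetical scopeExists flag.

```java
import java.util.List;
import java.util.Random;

final class ChooseRandomBeforeFix {
    private static final Random RANDOM = new Random();

    // Simplified model of the pre-fix behaviour (names are hypothetical).
    static String chooseRandom(List<String> nodesInScope, boolean scopeExists) {
        if (!scopeExists) {
            return null; // case 2: early null return
        }
        int numOfDatanodes = nodesInScope.size();
        // case 3: nextInt throws IllegalArgumentException when numOfDatanodes == 0
        int index = RANDOM.nextInt(numOfDatanodes);
        return nodesInScope.get(index); // case 1: a valid node
    }
}
```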

It seems none of the callers of this method check for case 2. If it happened, 
the caller would hit a NullPointerException (again, there is no report saying 
this ever happened).

HDFS-5939 is case 3, where the caller is NamenodeWebHdfs.redirectURI(..). My 
submitted fix makes chooseRandom return null before calling 
Random.nextInt() when numOfDatanodes is 0, and throws NoDatanodeException from 
the caller side. Basically, my fix replaces the IllegalArgumentException with 
NoDatanodeException for this case, with an explicit message to help the user.
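
The shape of the fix can be sketched as below. This is only an illustration 
under stated assumptions, not the submitted patch: nodes are plain strings, 
chooseDatanodeForRedirect is a hypothetical stand-in for the caller, the 
NoDatanodeException superclass is assumed, and the message text is made up.

```java
import java.io.IOException;
import java.util.List;
import java.util.Random;

final class ChooseRandomAfterFix {
    // Assumed shape of the exception named in this comment.
    static final class NoDatanodeException extends IOException {
        NoDatanodeException(String message) {
            super(message);
        }
    }

    private static final Random RANDOM = new Random();

    // Returns null instead of letting nextInt(0) throw.
    static String chooseRandom(List<String> nodesInScope) {
        int numOfDatanodes = nodesInScope.size();
        if (numOfDatanodes == 0) {
            return null;
        }
        return nodesInScope.get(RANDOM.nextInt(numOfDatanodes));
    }

    // Hypothetical caller: turns the null into an explicit, user-facing error.
    static String chooseDatanodeForRedirect(List<String> nodesInScope)
            throws NoDatanodeException {
        String node = chooseRandom(nodesInScope);
        if (node == null) {
            throw new NoDatanodeException("No datanode is available (illustrative message)");
        }
        return node;
    }
}
```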

With my submitted fix, if numOfDatanodes == 0 happens for other callers of 
chooseRandom in a real deployment, my fix won't hide the problem: it will 
result in a NullPointerException instead of the IllegalArgumentException. This 
is now covered by HDFS-5970. I hope there is a field report for HDFS-5970 
before we fix it, so we can understand why it happened.

An alternative to my fix is to change the exception spec of 
NetworkTopology.chooseRandom and let it throw NoDatanodeException instead of 
IllegalArgumentException. I didn't do this in my submitted fix for two reasons:
- the caller has a better chance to provide a more helpful message.
- the impact of changing the interface is wider.
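
For comparison, the alternative would look roughly like this. Again a sketch 
with hypothetical names and a made-up message; making NoDatanodeException a 
checked exception is an assumption. It shows both trade-offs: the message can 
only be generic at this level, and every caller has to handle the new 
exception in its signature.

```java
import java.util.List;
import java.util.Random;

final class ChooseRandomThrowingVariant {
    // Assumed to be a checked exception for the purpose of this sketch.
    static final class NoDatanodeException extends Exception {
        NoDatanodeException(String message) {
            super(message);
        }
    }

    private static final Random RANDOM = new Random();

    // The method itself reports the empty case; callers must now declare or
    // catch NoDatanodeException.
    static String chooseRandom(List<String> nodesInScope) throws NoDatanodeException {
        if (nodesInScope.isEmpty()) {
            throw new NoDatanodeException("No datanode in scope (illustrative message)");
        }
        return nodesInScope.get(RANDOM.nextInt(nodesInScope.size()));
    }
}
```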

Would you please let me know what you think? Thanks.


> WebHdfs returns misleading error code and logs nothing if trying to create a 
> file with no DNs in cluster
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5939
>                 URL: https://issues.apache.org/jira/browse/HDFS-5939
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.3.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-5939.001.patch
>
>
> When trying to access hdfs via webhdfs, and when datanode is dead, user will 
> see an exception below without any clue that it's caused by dead datanode:
> $ curl -i -X PUT 
> ".../webhdfs/v1/t1?op=CREATE&user.name=<userName>&overwrite=false"
> ...
> {"RemoteException":{"exception":"IllegalArgumentException","javaClassName":"java.lang.IllegalArgumentException","message":"n
>  must be positive"}}
> Need to fix the report to give user hint about dead datanode.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
