[ https://issues.apache.org/jira/browse/HDFS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289219#comment-15289219 ]
James Clampffer commented on HDFS-9890:
---------------------------------------

Has anyone been able to reproduce this error? I've reviewed the patch a couple of times now and can't find anything in it that looks like it could trigger this sort of error. I think it's more likely that the patch is exposing a real bug that should be tracked in another JIRA. I'm going to spend a few more hours debugging on different machines with more/fewer cores and some different architectures, but if nothing shows up I'm inclined to +1 and deal with the underlying library error once someone can find a reproducer. The value of having this landed to help prevent regressions before HA and Kerberos are done outweighs the cost of bugs that emerge occasionally when it's run, IMO.

> libhdfs++: Add test suite to simulate network issues
> ----------------------------------------------------
>
>                 Key: HDFS-9890
>                 URL: https://issues.apache.org/jira/browse/HDFS-9890
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: Xiaowei Zhu
>         Attachments: HDFS-9890.HDFS-8707.000.patch,
> HDFS-9890.HDFS-8707.001.patch, HDFS-9890.HDFS-8707.002.patch,
> HDFS-9890.HDFS-8707.003.patch, HDFS-9890.HDFS-8707.004.patch,
> HDFS-9890.HDFS-8707.005.patch, HDFS-9890.HDFS-8707.006.patch,
> HDFS-9890.HDFS-8707.007.patch, hs_err_pid26832.log, hs_err_pid4944.log
>
>
> I propose adding a test suite to simulate various network issues/failures in
> order to get good test coverage on some of the retry paths that aren't easy
> to hit in mock unit tests.
> At the moment the only things that hit the retry paths are the gmock unit
> tests. The gmock tests are only as good as their mock implementations, which
> do a great job of simulating protocol correctness but not more complex
> interactions. They also can't really simulate the types of lock contention
> and subtle memory stomps that show up while doing hundreds or thousands of
> concurrent reads.
> We should add a new minidfscluster test that focuses on heavy read/seek
> load and then randomly converts the return codes of network functions
> into errors.
> List of things to simulate (while heavily loaded), roughly in order of how
> badly I think they need to be tested at the moment:
> - RPC connection disconnect
> - RPC connection slowed down enough to cause a timeout and trigger retry
> - DN connection disconnect
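
The idea of randomly turning the result of a network call into an error could be sketched roughly as below. This is only an illustration under stated assumptions: `FaultInjector` and `MaybeFail` are hypothetical names, not taken from libhdfs++ or from the attached patches, and the sketch does not show how the real test suite hooks into the network layer.

```cpp
#include <cstdint>
#include <random>
#include <system_error>

// Hypothetical helper, not part of libhdfs++: takes the result of a network
// call and, with a configurable probability, replaces a success with an error.
class FaultInjector {
 public:
  // failure_probability must be in [0, 1]. A fixed seed keeps a failing run
  // reproducible.
  FaultInjector(double failure_probability, std::uint64_t seed)
      : rng_(seed), fail_(failure_probability) {}

  // Returns `actual` unchanged if it already reports an error. Otherwise
  // returns either `actual` or `injected`, chosen at random.
  std::error_code MaybeFail(std::error_code actual, std::errc injected) {
    if (actual) {
      return actual;  // never mask a real failure
    }
    if (fail_(rng_)) {
      ++injected_count_;
      return std::make_error_code(injected);
    }
    return actual;
  }

  // Number of successes that were converted into errors so far.
  std::uint64_t injected_count() const { return injected_count_; }

 private:
  // Not thread-safe: under concurrent reads, use one injector per thread
  // or guard calls with a mutex.
  std::mt19937_64 rng_;
  std::bernoulli_distribution fail_;
  std::uint64_t injected_count_ = 0;
};
```

A test might construct `FaultInjector injector(0.01, seed)` and pass the result of each read or connect call through `injector.MaybeFail(ec, std::errc::connection_reset)` to simulate a disconnect, or use `std::errc::timed_out` to exercise the timeout and retry path.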