[ https://issues.apache.org/jira/browse/HDFS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-9890:
----------------------------------
    Description: 
I propose adding a test suite to simulate various network issues/failures in 
order to get good test coverage on some of the retry paths that aren't easy to 
hit in mock unit tests.

At the moment the only things that hit the retry paths are the gmock unit 
tests.  The gmock tests are only as good as their mock implementations, which 
do a great job of simulating protocol correctness but not more complex 
interactions.  They also can't really simulate the types of lock contention 
and subtle memory stomps that show up while doing hundreds or thousands of 
concurrent reads.  We should add a new minidfscluster test that focuses on 
heavy read/seek load and then randomly converts the return codes of network 
functions into errors.

List of things to simulate (while heavily loaded), roughly in order of how 
badly I think they need to be tested at the moment:
-Rpc connection disconnect
-Rpc connection slowed down enough to cause a timeout and trigger retry
-DN connection disconnect
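
A minimal sketch of the sort of injection wrapper the minidfscluster test 
could install around its network calls; the FaultInjector name, the 
probability knob, and the std::error_code plumbing are all illustrative 
assumptions, not the existing libhdfs++ API:

{code}
// Hypothetical sketch only -- names and API are illustrative, not libhdfs++'s.
// With some probability, replace a successful network return code with an
// injected error so the RPC/DN retry paths get exercised under heavy load.
#include <functional>
#include <random>
#include <system_error>

class FaultInjector {
 public:
  explicit FaultInjector(double failure_probability)
      : inject_(failure_probability), gen_(std::random_device{}()) {}

  // Wraps any network call that reports its status via std::error_code.
  std::error_code Call(const std::function<std::error_code()> &network_op) {
    std::error_code real = network_op();
    if (real) return real;  // keep genuine failures untouched
    if (inject_(gen_))      // randomly fake a dropped connection
      return std::make_error_code(std::errc::connection_reset);
    return real;
  }

 private:
  std::bernoulli_distribution inject_;
  std::mt19937 gen_;
};
{code}

The idea is that something like this sits between the reader and the real 
sockets, so even a small injection rate should hit every retry path many 
times per run once thousands of reads are in flight.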



  was:
I propose adding a test suite to simulate various network issues/failures in 
order to get good test coverage on some of the retry paths that aren't easy to 
hit in unit tests.

At the moment the only things that hit the retry paths are the gmock unit 
tests.  The gmock tests are only as good as their mock implementations, which 
do a great job of simulating protocol correctness but not more complex 
interactions.  They also can't really simulate the types of lock contention 
and subtle memory stomps that show up while doing hundreds or thousands of 
concurrent reads.

I'd like to make a standalone "bring your own cluster" test suite that can do 
things like drop connections, slow connections down, and cause connections to 
hang for short periods of time (a rough sketch of these fault types follows 
the list of reasons below).  I think this should be a standalone test for a 
few reasons:
-The tools for doing this sort of thing inside the TCP/IP stack are platform 
dependent.  On Linux it looks like it could be done with iptables, but I'm 
not sure about Mac or Windows.  I can make the Linux version, but I don't 
have enough Windows and Mac experience (or dev hardware) to be productive 
there.
-This needs to scale as large as possible on machines capable of doing it.  
The CI tests could run a dialed-back version, but the chances of hitting bugs 
are much lower.  There are certain bugs that I've only been able to reproduce 
when running at sufficient scale.  My laptop with 4 physical cores and 1 disk 
can't sustain the loads that start making lock contention and resource 
ownership gaps show up; running the client on a 24 core server against a 
"real" cluster tends to make issues apparent quickly.
-As mentioned above, I think some of these bugs won't show up regardless of 
how long the tests run on low-end hardware, e.g. a typical dev workstation.  
It's just not possible to get enough parts moving at once.  I don't want 
people to waste time waiting for <some large number> operations to run if 
it's only ever going to be running a few dozen concurrently.  I'm not sure 
what sort of hardware the CI tests run on, but I don't think the rest of the 
Hadoop community would appreciate a test that attempts to hog all resources 
for an extended period of time.
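
Purely to illustrate the drop / slow-down / hang fault types above (not the 
iptables mechanism itself), here is a hedged sketch of the kind of per-
connection chaos policy a harness might apply; every name and probability is 
hypothetical:

{code}
// Hypothetical sketch -- illustrates the drop / slow-down / hang faults the
// suite would induce; it is not a proposal for the actual implementation.
#include <chrono>
#include <random>
#include <thread>

enum class NetworkFault { kNone, kSlowdown, kDisconnect };

class ConnectionChaos {
 public:
  ConnectionChaos(double slow_probability, double drop_probability)
      : gen_(std::random_device{}()),
        pick_(0.0, 1.0),
        slow_p_(slow_probability),
        drop_p_(drop_probability) {}

  // Decide what, if anything, happens to the next operation on a connection.
  NetworkFault Next() {
    double r = pick_(gen_);
    if (r < drop_p_) return NetworkFault::kDisconnect;
    if (r < drop_p_ + slow_p_) return NetworkFault::kSlowdown;
    return NetworkFault::kNone;
  }

  // Stall just past the RPC timeout so the client is forced into its retry
  // logic rather than simply waiting a little longer.
  static void Stall(std::chrono::milliseconds rpc_timeout) {
    std::this_thread::sleep_for(rpc_timeout + std::chrono::milliseconds(100));
  }

 private:
  std::mt19937 gen_;
  std::uniform_real_distribution<double> pick_;
  double slow_p_;
  double drop_p_;
};
{code}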

List of things to simulate (while heavily loaded), roughly in order of how 
badly I think they need to be tested at the moment:
-Rpc connection disconnect
-Rpc connection slowed down enough to cause a timeout and trigger retry
-DN connection disconnect

The initial motivation for filing this is that I've hit a bug twice (ever) 
where the RPC engine can't match a call-id with a request it sent out.  I 
have a guess as to what's causing it, but not enough info to post a 
meaningful JIRA (I haven't ruled out something else in the process stomping 
on libhdfs memory).


> libhdfs++: Add test suite to simulate network issues
> ----------------------------------------------------
>
>                 Key: HDFS-9890
>                 URL: https://issues.apache.org/jira/browse/HDFS-9890
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>
> I propose adding a test suite to simulate various network issues/failures in 
> order to get good test coverage on some of the retry paths that aren't easy 
> to hit in mock unit tests.
> At the moment the only things that hit the retry paths are the gmock unit 
> tests.  The gmock tests are only as good as their mock implementations, 
> which do a great job of simulating protocol correctness but not more 
> complex interactions.  They also can't really simulate the types of lock 
> contention and subtle memory stomps that show up while doing hundreds or 
> thousands of concurrent reads.  We should add a new minidfscluster test 
> that focuses on heavy read/seek load and then randomly converts the return 
> codes of network functions into errors.
> List of things to simulate (while heavily loaded), roughly in order of how 
> badly I think they need to be tested at the moment:
> -Rpc connection disconnect
> -Rpc connection slowed down enough to cause a timeout and trigger retry
> -DN connection disconnect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
