Re: [Gluster-devel] Regression tests time

2018-01-27 Thread Xavi Hernandez
Hi Amar,

On 28 Jan 2018 06:50, "Amar Tumballi"  wrote:

Thanks for this experiment, Xavi!!

I see two proposals here in the thread.

1. Remove unnecessary sleep commands.
2. Try to bring explicit checks, so our tests are more consistent.

I am personally in favor of 1. Let's do this.

About 2, as already discussed, we may run into issues because of setups
outside the glusterfs project, which would cause problems that are much
harder to debug. I'm not sure if we should depend on our 'eventing'
framework in such test cases. Would that help?


That would be a good way to detect when something can be done. I haven't
worked along these lines yet, but it's not the only way. For example, in the
kill_brick command there was a sleep after the kill to give glusterd time to
become aware of the change. Instead of sleeping, we can directly query
glusterd for the state of the brick. If it's down, we are done without
waiting unnecessarily. If for some reason it takes more than one second, we
won't fail spuriously because we are checking the state directly. For
extreme cases where something really fails, we can define a bigger timeout,
for example 5 seconds. This way we cover all cases, but in the most common
case it will only take some tens or hundreds of milliseconds.
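
To make it concrete, the check could look something like this in a .t test
(an untested sketch: TEST, EXPECT_WITHIN, EXPECT, kill_brick, $CLI, $V0, $H0
and $B0 already exist in our test framework, while brick_online and the
brick path "brick0" are hypothetical, written just for this example):

    function brick_online {
        # Print the Online column (Y or N) that glusterd reports for the brick.
        $CLI volume status $V0 $H0:$B0/brick0 | awk '/^Brick/ { print $(NF-1) }'
    }

    TEST kill_brick $V0 $H0 $B0/brick0
    # Instead of "sleep 1": succeed as soon as glusterd reports the brick as
    # down, but allow up to 5 seconds before declaring a real failure.
    EXPECT_WITHIN 5 "N" brick_online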

Reducing timeouts has made more evident some races that currently exist in
the code. So far I've identified a bug in AFR and a couple of races in the
RPC code that were causing spurious failures. I still have to identify
another race (probably also in RPC) that is generating unexpected
disconnections (or incorrect reconnections).

Xavi


Regards,
Amar

On Thu, Jan 25, 2018 at 8:07 PM, Xavi Hernandez  wrote:

> On Thu, Jan 25, 2018 at 3:03 PM, Jeff Darcy  wrote:
>
>>
>>
>>
>> On Wed, Jan 24, 2018, at 9:37 AM, Xavi Hernandez wrote:
>>
>> That happens when we use arbitrary delays. If we use an explicit check,
>> it will work on all systems.
>>
>>
>> You're arguing against a position not taken. I'm not expressing
>> opposition to explicit checks. I'm just saying they don't come for free. If
>> you don't believe me, try adding explicit checks in some of the harder
>> cases where we're waiting for something that's subject to OS scheduling
>> delays, or for large numbers of operations to complete. Geo-replication or
>> multiplexing tests should provide some good examples. Adding explicit
>> conditions is the right thing to do in the abstract, but as a practical
>> matter the returns must justify the cost.
>>
>> BTW, some of our longest-running tests are in EC. Do we need all of
>> those, and do they all need to run as long, or could some be
>> eliminated/shortened?
>>
>
> Some tests were already removed some time ago. Anyway, with the changes
> introduced, it takes between 10 and 15 minutes to execute all ec related
> tests from basic/ec and bugs/ec (an average of 16 to 25 seconds per test).
> Before the changes, the same tests were taking between 30 and 60 minutes.
>
> AFR tests have also improved from almost 60 minutes to around 30.
>
>
>> I agree that parallelizing tests is the way to go, but if we reduce the
>> total time to 50%, the parallelized tests will also take 50% less time.
>>
>>
>> Taking 50% less time but failing spuriously 1% of the time, or all of the
>> time in some environments, is not a good thing. If you want to add explicit
>> checks that's great, but you also mentioned shortening timeouts and that's
>> much more risky.
>>
>
> If we have a single test that takes 45 minutes (as we currently have in
> some executions: bugs/nfs/bug-1053579.t), parallelization won't help much.
> We need to make this test run faster.
>
> Some tests that were failing after the changes have revealed errors in the
> tests themselves or even in the code, so I think it's a good thing. Currently
> I'm investigating what seems to be a race in the RPC layer during connections
> that causes some tests to fail. This is a real problem that high delays or
> slow machines were hiding. It seems to cause some gluster requests to fail
> spuriously after reconnecting to a brick or glusterd. I'm not 100% sure
> about this yet, but initial analysis seems to indicate that.
>
> Xavi
>



-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression tests time

2018-01-27 Thread Amar Tumballi
Thanks for this experiment, Xavi!!

I see two proposals here in the thread.

1. Remove unnecessary sleep commands.
2. Try to bring explicit checks, so our tests are more consistent.

I am personally in favor of 1. Let's do this.

About 2, as already discussed, we may run into issues because of setups
outside the glusterfs project, which would cause problems that are much
harder to debug. I'm not sure if we should depend on our 'eventing'
framework in such test cases. Would that help?

Regards,
Amar

On Thu, Jan 25, 2018 at 8:07 PM, Xavi Hernandez  wrote:

> On Thu, Jan 25, 2018 at 3:03 PM, Jeff Darcy  wrote:
>
>>
>>
>>
>> On Wed, Jan 24, 2018, at 9:37 AM, Xavi Hernandez wrote:
>>
>> That happens when we use arbitrary delays. If we use an explicit check,
>> it will work on all systems.
>>
>>
>> You're arguing against a position not taken. I'm not expressing
>> opposition to explicit checks. I'm just saying they don't come for free. If
>> you don't believe me, try adding explicit checks in some of the harder
>> cases where we're waiting for something that's subject to OS scheduling
>> delays, or for large numbers of operations to complete. Geo-replication or
>> multiplexing tests should provide some good examples. Adding explicit
>> conditions is the right thing to do in the abstract, but as a practical
>> matter the returns must justify the cost.
>>
>> BTW, some of our longest-running tests are in EC. Do we need all of
>> those, and do they all need to run as long, or could some be
>> eliminated/shortened?
>>
>
> Some tests were already removed some time ago. Anyway, with the changes
> introduced, it takes between 10 and 15 minutes to execute all ec related
> tests from basic/ec and bugs/ec (an average of 16 to 25 seconds per test).
> Before the changes, the same tests were taking between 30 and 60 minutes.
>
> AFR tests have also improved from almost 60 minutes to around 30.
>
>
>> I agree that parallelizing tests is the way to go, but if we reduce the
>> total time to 50%, the parallelized tests will also take 50% less time.
>>
>>
>> Taking 50% less time but failing spuriously 1% of the time, or all of the
>> time in some environments, is not a good thing. If you want to add explicit
>> checks that's great, but you also mentioned shortening timeouts and that's
>> much more risky.
>>
>
> If we have a single test that takes 45 minutes (as we currently have in
> some executions: bugs/nfs/bug-1053579.t), parallelization won't help much.
> We need to make this test run faster.
>
> Some tests that were failing after the changes have revealed errors in the
> tests themselves or even in the code, so I think it's a good thing. Currently
> I'm investigating what seems to be a race in the RPC layer during connections
> that causes some tests to fail. This is a real problem that high delays or
> slow machines were hiding. It seems to cause some gluster requests to fail
> spuriously after reconnecting to a brick or glusterd. I'm not 100% sure
> about this yet, but initial analysis seems to indicate that.
>
> Xavi
>



-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression tests time

2018-01-25 Thread Xavi Hernandez
On Thu, Jan 25, 2018 at 3:03 PM, Jeff Darcy  wrote:

>
>
>
> On Wed, Jan 24, 2018, at 9:37 AM, Xavi Hernandez wrote:
>
> That happens when we use arbitrary delays. If we use an explicit check, it
> will work on all systems.
>
>
> You're arguing against a position not taken. I'm not expressing opposition
> to explicit checks. I'm just saying they don't come for free. If you don't
> believe me, try adding explicit checks in some of the harder cases where
> we're waiting for something that's subject to OS scheduling delays, or for
> large numbers of operations to complete. Geo-replication or multiplexing
> tests should provide some good examples. Adding explicit conditions is the
> right thing to do in the abstract, but as a practical matter the returns
> must justify the cost.
>
> BTW, some of our longest-running tests are in EC. Do we need all of those,
> and do they all need to run as long, or could some be eliminated/shortened?
>

Some tests were already removed some time ago. Anyway, with the changes
introduced, it takes between 10 and 15 minutes to execute all ec related
tests from basic/ec and bugs/ec (an average of 16 to 25 seconds per test).
Before the changes, the same tests were taking between 30 and 60 minutes.

AFR tests have also improved from almost 60 minutes to around 30.


> I agree that parallelizing tests is the way to go, but if we reduce the
> total time to 50%, the parallelized tests will also take 50% less time.
>
>
> Taking 50% less time but failing spuriously 1% of the time, or all of the
> time in some environments, is not a good thing. If you want to add explicit
> checks that's great, but you also mentioned shortening timeouts and that's
> much more risky.
>

If we have a single test that takes 45 minutes (as we currently have in
some executions: bugs/nfs/bug-1053579.t), parallelization won't help much.
We need to make this test run faster.

Some tests that were failing after the changes have revealed errors in the
tests themselves or even in the code, so I think it's a good thing. Currently
I'm investigating what seems to be a race in the RPC layer during connections
that causes some tests to fail. This is a real problem that high delays or
slow machines were hiding. It seems to cause some gluster requests to fail
spuriously after reconnecting to a brick or glusterd. I'm not 100% sure
about this yet, but initial analysis seems to indicate that.

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression tests time

2018-01-25 Thread Jeff Darcy



On Wed, Jan 24, 2018, at 9:37 AM, Xavi Hernandez wrote:
> That happens when we use arbitrary delays. If we use an explicit
> check, it will work on all systems.
You're arguing against a position not taken. I'm not expressing
opposition to explicit checks. I'm just saying they don't come for free.
If you don't believe me, try adding explicit checks in some of the
harder cases where we're waiting for something that's subject to OS
scheduling delays, or for large numbers of operations to complete. Geo-
replication or multiplexing tests should provide some good examples.
Adding explicit conditions is the right thing to do in the abstract, but
as a practical matter the returns must justify the cost.
BTW, some of our longest-running tests are in EC. Do we need all of
those, and do they all need to run as long, or could some be
eliminated/shortened?
> I agree that parallelizing tests is the way to go, but if we reduce
> the total time to 50%, the parallelized tests will also take 50% less time.
Taking 50% less time but failing spuriously 1% of the time, or all of
the time in some environments, is not a good thing. If you want to add
explicit checks that's great, but you also mentioned shortening timeouts
and that's much more risky.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression tests time

2018-01-24 Thread Xavi Hernandez
On Wed, Jan 24, 2018 at 3:11 PM, Jeff Darcy  wrote:

>
>
>
> On Tue, Jan 23, 2018, at 12:58 PM, Xavi Hernandez wrote:
>
> I've made some experiments [1] with the time that the CentOS regression takes
> to complete. After some changes, the time taken to run a full regression has
> dropped to between 2.5 and 3.5 hours (depending on the run time of 2 tests,
> see below).
>
> Basically the changes are related to delays manually introduced in some
> places (sleeps in test files or even in the code, or delays in timer
> events). I've replaced some sleeps with better ways to detect the condition,
> and I've left the delays in other places but with reduced times. The values
> used are probably not the best ones in all cases, but it highlights that we
> should seriously consider how we detect things instead of simply waiting for
> some amount of time (and hoping it's enough). The total test time is more
> than 2 hours less with these changes, so this means that over 2 hours of the
> whole regression time is spent waiting unnecessarily.
>
>
> We should definitely try to detect specific conditions instead of just
> sleeping for a fixed amount of time. That said, sometimes it would take
> significant additional effort to add a marker for a condition plus code to
> check for it. We need to be *really* careful about changing timeouts in
> these cases. It's easy to come up with something that works on one
> development system and then causes spurious failures for others.
>

That happens when we use arbitrary delays. If we use an explicit check, it
will work on all systems. Additionally, using specific checks makes it
possible to define bigger timeouts to handle corner cases, because in the
normal case we'll continue as soon as the check is satisfied, which will be
almost always. But if it really fails, in those particular cases it will
take some time to detect it, which is fine because this way we allow enough
time for "normal" delays.

> One of the biggest problems I had to deal with when I implemented
> multiplexing was these kinds of timing dependencies in tests, and I had to
> go through it all again when I came to Facebook. While I applaud the effort
> to reduce single-test times, I believe that parallelizing tests will
> long-term be a more effective (and definitely safer) route to reducing
> overall latency.
>

I agree that parallelizing tests is the way to go, but if we reduce the
total time to 50%, the parallelized tests will also take 50% less time.

Additionally, reducing the time it takes to run each test is a good way to
detect corner cases. If we always sleep in some cases, we could be missing
failures that can happen when there's no sleep (and users can make the same
requests as we do, but without sleeping).


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression tests time

2018-01-24 Thread Jeff Darcy



On Tue, Jan 23, 2018, at 12:58 PM, Xavi Hernandez wrote:
> I've made some experiments [1] with the time that the CentOS regression
> takes to complete. After some changes, the time taken to run a full
> regression has dropped to between 2.5 and 3.5 hours (depending on the run
> time of 2 tests, see below).
>
> Basically the changes are related to delays manually introduced in
> some places (sleeps in test files or even in the code, or delays in
> timer events). I've replaced some sleeps with better ways to detect
> the condition, and I've left the delays in other places but with
> reduced times. The values used are probably not the best ones in all
> cases, but it highlights that we should seriously consider how we
> detect things instead of simply waiting for some amount of time (and
> hoping it's enough). The total test time is more than 2 hours less with
> these changes, so this means that over 2 hours of the whole regression
> time is spent waiting unnecessarily.
We should definitely try to detect specific conditions instead of just
sleeping for a fixed amount of time. That said, sometimes it would take
significant additional effort to add a marker for a condition plus code
to check for it. We need to be *really* careful about changing timeouts
in these cases. It's easy to come up with something that works on one
development system and then causes spurious failures for others. One of
the biggest problems I had to deal with when I implemented multiplexing
was these kinds of timing dependencies in tests, and I had to go through
it all again when I came to Facebook. While I applaud the effort to
reduce single-test times, I believe that parallelizing tests will long-
term be a more effective (and definitely safer) route to reducing
overall latency.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Regression tests time

2018-01-23 Thread Xavi Hernandez
Hi,

I've made some experiments [1] with the time that the CentOS regression takes
to complete. After some changes, the time taken to run a full regression has
dropped to between 2.5 and 3.5 hours (depending on the run time of 2 tests,
see below).

Basically the changes are related to delays manually introduced in some
places (sleeps in test files or even in the code, or delays in timer
events). I've replaced some sleeps with better ways to detect the condition,
and I've left the delays in other places but with reduced times. The values
used are probably not the best ones in all cases, but it highlights that we
should seriously consider how we detect things instead of simply waiting for
some amount of time (and hoping it's enough). The total test time is more
than 2 hours less with these changes, so this means that over 2 hours of the
whole regression time is spent waiting unnecessarily.
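
As an illustration only (a made-up example, not an actual change from [1]),
the typical kind of replacement looks like this:

    # Before: wait a fixed amount of time and hope it's enough.
    TEST $CLI volume stop $V0
    sleep 5
    EXPECT 'Stopped' volinfo_field $V0 'Status'

    # After: poll for the condition and continue as soon as it holds,
    # with a generous upper bound for slow machines.
    TEST $CLI volume stop $V0
    EXPECT_WITHIN 10 'Stopped' volinfo_field $V0 'Status'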

There are still some issues that I've been unable to solve. Probably the
most critical is the time taken by a couple of tests:

   - tests/bugs/nfs/bug-1053579.t
   - tests/bugs/fuse/many-groups-for-acl.t

These tests take around a minute when they work fine (~60 and ~45 seconds),
but sometimes they take much longer (~45 and ~30 minutes) without failing.
The difference is in the time it takes to create some system groups and
users.

For example, one of the things the first test does is to create 200 groups.
This takes ~25 seconds in the fast cases and ~15 minutes in the slow cases.
This means that sometimes creating each group takes more than 4 seconds,
while other times it takes around 100 milliseconds. That's more than a 30x
difference.
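
The pattern is essentially this (not the test's exact code, just the shape
of the work and the arithmetic):

    # Creating 200 groups, one groupadd per group (needs root, as the tests run).
    time for i in $(seq 1 200); do
        groupadd "testgroup$i"
    done
    # ~100ms per groupadd -> ~20-25 seconds total (the fast case)
    # ~4s   per groupadd -> ~13-15 minutes total (the slow case)

    # Cleanup.
    for i in $(seq 1 200); do
        groupdel "testgroup$i"
    done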

I'm not sure what the cause of this is. If the slaves are connected to some
external Kerberos or LDAP source, maybe there are network issues (or service
unavailability) at certain times that cause timeouts or delays. On my local
system (Fedora 27) I see high CPU usage by the sssd_be process during group
creation. I'm not sure why, or whether it also happens on the slaves, but it
seems a good candidate. However, on my system it always seems to take about
25 seconds to complete.

Even after the changes, tests are full of sleeps. There's one of 180
seconds (bugs/shard/parallel-truncate-read.t). Not sure if it's really
necessary, but there are many more with smaller delays between 1 and 60
seconds. Assuming that each sleep is only executed once, the total time
spent in sleeps is still 15 minutes.
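
A quick way to estimate that figure (it only counts literal "sleep N"
occurrences, assumes each one runs once, and misses sleeps driven by
variables or loops):

    grep -rhoE 'sleep [0-9]+' tests/ \
        | awk '{ total += $2 } END { print total " seconds" }'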

I still need to fix some tests that seem to be failing often after the
changes.

Xavi

[1] https://review.gluster.org/19254
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel