Re: [Gluster-devel] ./tests/basic/uss.t is timing out in release-6 branch

2019-05-22 Thread FNU Raghavendra Manjunath
More analysis:

It looks like in the 1st iteration, the test case is stuck at the test
(TEST ! stat $M0/.history/snap6/aaa) on line 385 of uss.t.

Earlier, that was the last test executed in uss.t. So the assumption was
that, after that test (i.e. the test on line 385) completed, the cleanup
function was either getting blocked or taking a long time to finish the
cleanup.

Hence patch [1] was submitted to reduce the amount of work done by the
cleanup function: it ensures that the snapshots, the volume etc. created
by the test are deleted before the cleanup function runs, roughly as
sketched below.
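A minimal sketch of that teardown (assuming the usual test-framework
variables $CLI and $V0 and its cleanup function; this is only an
illustration of the idea, the exact commands are in patch [1] itself):

    TEST $CLI snapshot delete all     # remove the snapshots created by the test
    TEST $CLI volume stop $V0         # stop ...
    TEST $CLI volume delete $V0       # ... and delete the test volume
    cleanup                           # the framework cleanup now has far less to undo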
But even with that, we observed uss.t failing sometimes (mainly in
brick-mux regressions). To get more information about the failure, another
patch [2] was sent, and it has provided some additional data:

1) Every time uss.t times out, the script (uss.t) is stuck executing the
test on line 385 (TEST ! stat $M0/.history/snap6/aaa).

   - The purpose of this test is to ensure that looking up a file that
     does not exist in a snapshot fails.

2) The TRACE logs added via [2] indicate that:

   - The stat request sent by the test reaches the snapshot daemon and then
     the gfapi client instance that the snapshot daemon spawns to
     communicate with the snapshot volume.
   - The stat request is served by the md-cache xlator in that gfapi
     client instance (and is hence successful):

  "[2019-05-16 18:31:18.607521]:++ G_LOG:./tests/basic/uss.t:
TEST: 392 ! stat /mnt/glusterfs/0/.history/snap6/aaa ++
   [2019-05-16 18:31:18.617104] T [MSGID: 0]
[syncop.c:2424:syncop_stat] 0-stack-trace: stack-address: 0x7fc63405dba8,
winding from gfapi to meta-autoload
   [2019-05-16 18:31:18.617119] T [MSGID: 0]
[defaults.c:2841:default_stat] 0-stack-trace: stack-address:
0x7fc63405dba8, winding from meta-autoload to
0e69605de2974f1b887deee5b3f63b52
   [2019-05-16 18:31:18.617130] T [MSGID: 0]
[io-stats.c:2709:io_stats_stat] 0-stack-trace: stack-address:
0x7fc63405dba8, winding from 0e69605de2974f1b887deee5b3f63b52 to
0e69605de2974f1b887deee5b3f63b52-io-threads
   [2019-05-16 18:31:18.617142] D [MSGID: 0]
[io-threads.c:376:iot_schedule]
0-0e69605de2974f1b887deee5b3f63b52-io-threads: STAT scheduled as fast
priority fop
   [2019-05-16 18:31:18.617162] T [MSGID: 0]
[defaults.c:2068:default_stat_resume] 0-stack-trace: stack-address:
0x7fc63405dba8, winding from 0e69605de2974f1b887deee5b3f63b52-io-threads to
0e69605de2974f1b887deee5b3f63b52-md-cache
   [2019-05-16 18:31:18.617176] T [MSGID: 0] [md-cache.c:1359:mdc_stat]
0-stack-trace: stack-address: 0x7fc63405dba8,
0e69605de2974f1b887deee5b3f63b52-md-cache returned 0  => SUCCESSFUL
HERE
   [2019-05-16 18:31:18.617186] T [MSGID: 0]
[defaults.c:1406:default_stat_cbk] 0-stack-trace: stack-address:
0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52-io-threads returned 0
   [2019-05-16 18:31:18.617195] T [MSGID: 0]
[io-stats.c:2059:io_stats_stat_cbk] 0-stack-trace: stack-address:
0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52 returned 0
  "
   - The stat response, however, does not reach the snapshot daemon, so the
     snapshot daemon is not able to send any response back to the gluster
     client that initiated the stat request. The client keeps waiting for a
     response, and the run eventually hits the regression test
     infrastructure's per-test timeout (200 seconds).

Suspects:
==

* First of all, the stat request from line 385 (TEST ! stat
  $M0/.history/snap6/aaa) should not be successful: the test deletes the
  snapshot "snap6", removes the file "aaa" from the mount point, takes the
  snapshot "snap6" again and then performs the stat operation on the
  deleted file "aaa". So the stat should fail. (A rough reconstruction of
  this sequence is sketched below.)

* Patch [2] has been sent to collect more information about the failure
  (with more logs added to snapview-server, and with the log level changed
  to TRACE in the .t file).
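The relevant portion of uss.t, as described in the first suspect above, is
roughly the following (paraphrased rather than copied from the test; steps
such as enabling USS and activating the snapshot are omitted, and the exact
commands may differ slightly):

    TEST $CLI snapshot delete snap6                   # drop the existing snapshot
    TEST rm -f $M0/aaa                                # remove the file from the live volume
    TEST $CLI snapshot create snap6 $V0 no-timestamp  # take snap6 again, now without "aaa"
    TEST ! stat $M0/.history/snap6/aaa                # stat of a file absent from the snapshot must fail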

[1] https://review.gluster.org/#/c/glusterfs/+/22649/
[2] https://review.gluster.org/#/c/glusterfs/+/22728/

Regards,
Raghavendra

On Wed, May 1, 2019 at 11:11 AM Sanju Rakonde  wrote:

> Thank you Raghavendra.
>
> On Tue, Apr 30, 2019 at 11:46 PM FNU Raghavendra Manjunath <
> rab...@redhat.com> wrote:
>
>>
>> To make things relatively easy for the cleanup() function in the test
>> framework, I think it would be better to ensure that uss.t itself deletes
>> the snapshots and the volume once the tests are done. Patch [1] has been
>> submitted for review.
>>
>> [1] https://review.gluster.org/#/c/glusterfs/+/22649/
>>
>> Regards,
>> Raghavendra
>>
>> On Tue, Apr 30, 2019 at 10:42 AM FNU Raghavendra Manjunath <
>> rab...@redhat.com> wrote:
>>
>>>
>>> The failure looks similar to the issue I had mentioned in [1]
>>>
>>> In short, for some reason the cleanup (the cleanup function that we call
>>> in our .t files) seems to be taking more time and also not cleaning up
>>> properly. This leads to problems for the 2nd iteration (where basic things
>>> such as volume creation or volume 

[Gluster-devel] More intelligent file distribution across subvols of DHT when file size is known

2019-05-22 Thread Krutika Dhananjay
Hi,

I've proposed a solution to the problem of space running out in some
children of DHT even when its other children have free space available,
here - https://github.com/gluster/glusterfs/issues/675.

The proposal aims to solve a very specific instance of this generic class
of problems, where fortunately the size of the file being created is known
beforehand.
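For context, and purely as my reading of the current behaviour (not part of
the proposal): DHT picks the subvolume for a new file by hashing its name,
and today it can only steer away from nearly full subvolumes via a coarse,
size-unaware threshold such as the existing cluster.min-free-disk option,
e.g.:

    # existing knob: avoid creating new files on subvolumes whose free space
    # has fallen below the threshold (the 10% value here is arbitrary); the
    # size of the incoming file is not taken into account
    gluster volume set <volname> cluster.min-free-disk 10%

Knowing the file size up front allows the placement decision to be made per
file instead of relying on such a global threshold.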

Requesting feedback on the proposal or even alternate solutions, if you
have any.

-Krutika