Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-20 Thread Krutika Dhananjay
Apologies. Pressed 'send' even before I was done.

On Tue, Jun 20, 2017 at 11:39 AM, Krutika Dhananjay 
wrote:

> Some update on this topic:
>
> I ran fio again, this time with Raghavendra's epoll-rearm patch @
> https://review.gluster.org/17391
>
> The IOPs increased to ~50K (from 38K).
> Avg READ latency as seen by the io-stats translator that sits above
> client-io-threads came down to 963us (from 1666us).
> ∆ (2,3) is down to 804us.
> The disk utilization didn't improve.
>

From code reading, it appears there is some serialization between POLLIN,
POLLOUT and POLLERR events for a given socket because of
socket_private->lock which they all contend for.

Discussed the same with Raghavendra G.
(I think he already alluded to the same point in this thread earlier.)
Let me make some quick and dirty changes to see whether fixing this serialization
improves performance further, and I'll update the thread accordingly.
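
To illustrate the pattern I mean (a sketch only -- not the actual rpc/socket code):

#include <pthread.h>

/* Illustrative sketch of the suspected pattern. Every event handler for a
 * connection funnels through the same lock, so POLLIN processing cannot run
 * while POLLOUT (or the error/shutdown path) holds it, and vice versa. */
struct sock_priv {
        pthread_mutex_t lock;        /* the single contended per-socket lock */
        /* read/write buffers, connection state, ... */
};

static void on_pollin(struct sock_priv *priv)
{
        pthread_mutex_lock(&priv->lock);    /* waits if a POLLOUT is being handled */
        /* read and decode messages from the socket */
        pthread_mutex_unlock(&priv->lock);
}

static void on_pollout(struct sock_priv *priv)
{
        pthread_mutex_lock(&priv->lock);    /* waits if a POLLIN is being handled */
        /* flush queued messages to the socket */
        pthread_mutex_unlock(&priv->lock);
}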

-Krutika


>
>
> On Sat, Jun 10, 2017 at 12:47 AM, Manoj Pillai  wrote:
>
>> So comparing the key latency, ∆ (2,3), in the two cases:
>>
>> iodepth=1: 171 us
>> iodepth=8: 1453 us (in the ballpark of 171*8=1368). That's not good! (I
>> wonder if that relation roughly holds up for other values of iodepth).
>>
>> This data doesn't conclusively establish that the problem is in gluster.
>> You'd see similar results if the network were saturated, like Vijay
>> suggested. But from what I remember of this test, the throughput here is
>> far too low for that to be the case.
>>
>> -- Manoj
>>
>>
>> On Thu, Jun 8, 2017 at 6:37 PM, Krutika Dhananjay 
>> wrote:
>>
>>> Indeed the latency on the client side dropped with iodepth=1. :)
>>> I ran the test twice and the results were consistent.
>>>
>>> Here are the exact numbers:
>>>
>>> *Translator Position*   *Avg Latency of READ fop as
>>> seen by this translator*
>>>
>>> 1. parent of client-io-threads437us
>>>
>>> ∆ (1,2) = 69us
>>>
>>> 2. parent of protocol/client-0368us
>>>
>>> ∆ (2,3) = 171us
>>>
>>> - end of client stack -
>>> - beginning of brick stack --
>>>
>>> 3. child of protocol/server   197us
>>>
>>> ∆ (3,4) = 4us
>>>
>>> 4. parent of io-threads193us
>>>
>>> ∆ (4,5) = 32us
>>>
>>> 5. child-of-io-threads  161us
>>>
>>> ∆ (5,6) = 11us
>>>
>>> 6. parent of storage/posix   150us
>>> ...
>>>  end of brick stack 
>>>
>>> Will continue reading code and get back when I find sth concrete.
>>>
>>> -Krutika
>>>
>>>
>>> On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai 
>>> wrote:
>>>
 Thanks. So I was suggesting a repeat of the test, but this time with
 iodepth=1 in the fio job. If reducing the no. of concurrent requests
 drastically reduces the high latency you're seeing from the client side,
 that would strengthen the hypothesis that serialization/contention among
 concurrent requests at the n/w layers is the root cause here.

 -- Manoj


 On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay  wrote:

> Hi,
>
> This is what my job file contains:
>
> [global]
> ioengine=libaio
> #unified_rw_reporting=1
> randrepeat=1
> norandommap=1
> group_reporting
> direct=1
> runtime=60
> thread
> size=16g
>
>
> [workload]
> bs=4k
> rw=randread
> iodepth=8
> numjobs=1
> file_service_type=random
> filename=/perf5/iotest/fio_5
> filename=/perf6/iotest/fio_6
> filename=/perf7/iotest/fio_7
> filename=/perf8/iotest/fio_8
>
> I have 3 vms reading from one mount, and each of these vms is running
> the above job in parallel.
>
> -Krutika
>
> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai 
> wrote:
>
>>
>>
>> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay <
>> kdhan...@redhat.com> wrote:
>>
>>> Hi,
>>>
>>> As part of identifying performance bottlenecks within gluster stack
>>> for VM image store use-case, I loaded io-stats at multiple points on the
>>> client and brick stack and ran randrd test using fio from within the 
>>> hosted
>>> vms in parallel.
>>>
>>> Before I get to the results, a little bit about the configuration ...
>>>
>>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>>> direct-io.
>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>> served from the replica that is local to the client).
>>>
>>> io-stats was loaded at the following places:
>>> On the client stack: Above client-io-threads and above
>>> protocol/client-0 (the first child of AFR).
>>> On the brick stack: Below protocol/server, above and 

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-20 Thread Krutika Dhananjay
Some update on this topic:

I ran fio again, this time with Raghavendra's epoll-rearm patch @
https://review.gluster.org/17391

The IOPS increased to ~50K (from 38K).
Avg READ latency as seen by the io-stats translator that sits above
client-io-threads came down to 963us (from 1666us).
∆ (2,3) is down to 804us.
The disk utilization didn't improve.
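
(On the earlier network-saturation question: at bs=4k, ~50K IOPS works out to roughly
50,000 x 4KB ≈ 200 MB/s ≈ 1.6 Gbit/s of read traffic -- and that assumes the 50K is the
aggregate across all 3 clients -- so a 10G interconnect should be nowhere near
saturated, and with each client reading from its local replica most of that traffic
shouldn't even leave the node.)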



On Sat, Jun 10, 2017 at 12:47 AM, Manoj Pillai  wrote:

> So comparing the key latency, ∆ (2,3), in the two cases:
>
> iodepth=1: 171 us
> iodepth=8: 1453 us (in the ballpark of 171*8=1368). That's not good! (I
> wonder if that relation roughly holds up for other values of iodepth).
>
> This data doesn't conclusively establish that the problem is in gluster.
> You'd see similar results if the network were saturated, like Vijay
> suggested. But from what I remember of this test, the throughput here is
> far too low for that to be the case.
>
> -- Manoj
>
>
> On Thu, Jun 8, 2017 at 6:37 PM, Krutika Dhananjay 
> wrote:
>
>> Indeed the latency on the client side dropped with iodepth=1. :)
>> I ran the test twice and the results were consistent.
>>
>> Here are the exact numbers:
>>
>> *Translator Position*   *Avg Latency of READ fop as
>> seen by this translator*
>>
>> 1. parent of client-io-threads437us
>>
>> ∆ (1,2) = 69us
>>
>> 2. parent of protocol/client-0368us
>>
>> ∆ (2,3) = 171us
>>
>> - end of client stack -
>> - beginning of brick stack --
>>
>> 3. child of protocol/server   197us
>>
>> ∆ (3,4) = 4us
>>
>> 4. parent of io-threads193us
>>
>> ∆ (4,5) = 32us
>>
>> 5. child-of-io-threads  161us
>>
>> ∆ (5,6) = 11us
>>
>> 6. parent of storage/posix   150us
>> ...
>>  end of brick stack 
>>
>> Will continue reading code and get back when I find sth concrete.
>>
>> -Krutika
>>
>>
>> On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai  wrote:
>>
>>> Thanks. So I was suggesting a repeat of the test, but this time with
>>> iodepth=1 in the fio job. If reducing the no. of concurrent requests
>>> drastically reduces the high latency you're seeing from the client side,
>>> that would strengthen the hypothesis that serialization/contention among
>>> concurrent requests at the n/w layers is the root cause here.
>>>
>>> -- Manoj
>>>
>>>
>>> On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay 
>>> wrote:
>>>
 Hi,

 This is what my job file contains:

 [global]
 ioengine=libaio
 #unified_rw_reporting=1
 randrepeat=1
 norandommap=1
 group_reporting
 direct=1
 runtime=60
 thread
 size=16g


 [workload]
 bs=4k
 rw=randread
 iodepth=8
 numjobs=1
 file_service_type=random
 filename=/perf5/iotest/fio_5
 filename=/perf6/iotest/fio_6
 filename=/perf7/iotest/fio_7
 filename=/perf8/iotest/fio_8

 I have 3 vms reading from one mount, and each of these vms is running
 the above job in parallel.

 -Krutika

 On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai 
 wrote:

>
>
> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay  > wrote:
>
>> Hi,
>>
>> As part of identifying performance bottlenecks within gluster stack
>> for VM image store use-case, I loaded io-stats at multiple points on the
>> client and brick stack and ran randrd test using fio from within the 
>> hosted
>> vms in parallel.
>>
>> Before I get to the results, a little bit about the configuration ...
>>
>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>> direct-io.
>> 3 FUSE clients, one per node in the cluster (which implies reads are
>> served from the replica that is local to the client).
>>
>> io-stats was loaded at the following places:
>> On the client stack: Above client-io-threads and above
>> protocol/client-0 (the first child of AFR).
>> On the brick stack: Below protocol/server, above and below io-threads
>> and just above storage/posix.
>>
>> Based on a 60-second run of randrd test and subsequent analysis of
>> the stats dumped by the individual io-stats instances, the following is
>> what I found:
>>
>> *​​Translator Position*   *Avg Latency of READ
>> fop as seen by this translator*
>>
>> 1. parent of client-io-threads1666us
>>
>> ∆ (1,2) = 50us
>>
>> 2. parent of protocol/client-01616us
>>
>> ∆ (2,3) = 1453us
>>
>> - end of client stack -
>> - beginning of brick stack ---
>>
>> 3. child of protocol/server

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-09 Thread Manoj Pillai
So comparing the key latency, ∆ (2,3), in the two cases:

iodepth=1: 171 us
iodepth=8: 1453 us (in the ballpark of 171*8=1368). That's not good! (I
wonder if that relation roughly holds up for other values of iodepth).

This data doesn't conclusively establish that the problem is in gluster.
You'd see similar results if the network were saturated, like Vijay
suggested. But from what I remember of this test, the throughput here is
far too low for that to be the case.

-- Manoj


On Thu, Jun 8, 2017 at 6:37 PM, Krutika Dhananjay 
wrote:

> Indeed the latency on the client side dropped with iodepth=1. :)
> I ran the test twice and the results were consistent.
>
> Here are the exact numbers:
>
> *Translator Position*   *Avg Latency of READ fop as
> seen by this translator*
>
> 1. parent of client-io-threads437us
>
> ∆ (1,2) = 69us
>
> 2. parent of protocol/client-0368us
>
> ∆ (2,3) = 171us
>
> - end of client stack -
> - beginning of brick stack --
>
> 3. child of protocol/server   197us
>
> ∆ (3,4) = 4us
>
> 4. parent of io-threads193us
>
> ∆ (4,5) = 32us
>
> 5. child-of-io-threads  161us
>
> ∆ (5,6) = 11us
>
> 6. parent of storage/posix   150us
> ...
>  end of brick stack 
>
> Will continue reading code and get back when I find sth concrete.
>
> -Krutika
>
>
> On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai  wrote:
>
>> Thanks. So I was suggesting a repeat of the test, but this time with
>> iodepth=1 in the fio job. If reducing the no. of concurrent requests
>> drastically reduces the high latency you're seeing from the client side,
>> that would strengthen the hypothesis that serialization/contention among
>> concurrent requests at the n/w layers is the root cause here.
>>
>> -- Manoj
>>
>>
>> On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay 
>> wrote:
>>
>>> Hi,
>>>
>>> This is what my job file contains:
>>>
>>> [global]
>>> ioengine=libaio
>>> #unified_rw_reporting=1
>>> randrepeat=1
>>> norandommap=1
>>> group_reporting
>>> direct=1
>>> runtime=60
>>> thread
>>> size=16g
>>>
>>>
>>> [workload]
>>> bs=4k
>>> rw=randread
>>> iodepth=8
>>> numjobs=1
>>> file_service_type=random
>>> filename=/perf5/iotest/fio_5
>>> filename=/perf6/iotest/fio_6
>>> filename=/perf7/iotest/fio_7
>>> filename=/perf8/iotest/fio_8
>>>
>>> I have 3 vms reading from one mount, and each of these vms is running
>>> the above job in parallel.
>>>
>>> -Krutika
>>>
>>> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai  wrote:
>>>


 On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay 
 wrote:

> Hi,
>
> As part of identifying performance bottlenecks within gluster stack
> for VM image store use-case, I loaded io-stats at multiple points on the
> client and brick stack and ran randrd test using fio from within the 
> hosted
> vms in parallel.
>
> Before I get to the results, a little bit about the configuration ...
>
> 3 node cluster; 1x3 plain replicate volume with group virt settings,
> direct-io.
> 3 FUSE clients, one per node in the cluster (which implies reads are
> served from the replica that is local to the client).
>
> io-stats was loaded at the following places:
> On the client stack: Above client-io-threads and above
> protocol/client-0 (the first child of AFR).
> On the brick stack: Below protocol/server, above and below io-threads
> and just above storage/posix.
>
> Based on a 60-second run of randrd test and subsequent analysis of the
> stats dumped by the individual io-stats instances, the following is what I
> found:
>
> *​​Translator Position*   *Avg Latency of READ
> fop as seen by this translator*
>
> 1. parent of client-io-threads1666us
>
> ∆ (1,2) = 50us
>
> 2. parent of protocol/client-01616us
>
> ∆ (2,3) = 1453us
>
> - end of client stack -
> - beginning of brick stack ---
>
> 3. child of protocol/server   163us
>
> ∆ (3,4) = 7us
>
> 4. parent of io-threads156us
>
> ∆ (4,5) = 20us
>
> 5. child-of-io-threads  136us
>
> ∆ (5,6) = 11us
>
> 6. parent of storage/posix   125us
> ...
>  end of brick stack 
>
> So it seems like the biggest bottleneck here is a combination of the
> network + epoll, rpc layer?
> I must admit I am no expert with networks, but I'm assuming if the
> client is reading from the local brick, 

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-08 Thread Krutika Dhananjay
Indeed the latency on the client side dropped with iodepth=1. :)
I ran the test twice and the results were consistent.

Here are the exact numbers:

Translator Position                  Avg Latency of READ fop as seen by this translator

1. parent of client-io-threads       437us
   ∆ (1,2) = 69us
2. parent of protocol/client-0       368us
   ∆ (2,3) = 171us
--------------- end of client stack ----------------
------------- beginning of brick stack -------------
3. child of protocol/server          197us
   ∆ (3,4) = 4us
4. parent of io-threads              193us
   ∆ (4,5) = 32us
5. child of io-threads               161us
   ∆ (5,6) = 11us
6. parent of storage/posix           150us
...
--------------- end of brick stack -----------------

Will continue reading code and get back when I find something concrete.

-Krutika


On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai  wrote:

> Thanks. So I was suggesting a repeat of the test, but this time with
> iodepth=1 in the fio job. If reducing the no. of concurrent requests
> drastically reduces the high latency you're seeing from the client side,
> that would strengthen the hypothesis that serialization/contention among
> concurrent requests at the n/w layers is the root cause here.
>
> -- Manoj
>
>
> On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay 
> wrote:
>
>> Hi,
>>
>> This is what my job file contains:
>>
>> [global]
>> ioengine=libaio
>> #unified_rw_reporting=1
>> randrepeat=1
>> norandommap=1
>> group_reporting
>> direct=1
>> runtime=60
>> thread
>> size=16g
>>
>>
>> [workload]
>> bs=4k
>> rw=randread
>> iodepth=8
>> numjobs=1
>> file_service_type=random
>> filename=/perf5/iotest/fio_5
>> filename=/perf6/iotest/fio_6
>> filename=/perf7/iotest/fio_7
>> filename=/perf8/iotest/fio_8
>>
>> I have 3 vms reading from one mount, and each of these vms is running the
>> above job in parallel.
>>
>> -Krutika
>>
>> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai  wrote:
>>
>>>
>>>
>>> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay 
>>> wrote:
>>>
 Hi,

 As part of identifying performance bottlenecks within gluster stack for
 VM image store use-case, I loaded io-stats at multiple points on the client
 and brick stack and ran randrd test using fio from within the hosted vms in
 parallel.

 Before I get to the results, a little bit about the configuration ...

 3 node cluster; 1x3 plain replicate volume with group virt settings,
 direct-io.
 3 FUSE clients, one per node in the cluster (which implies reads are
 served from the replica that is local to the client).

 io-stats was loaded at the following places:
 On the client stack: Above client-io-threads and above
 protocol/client-0 (the first child of AFR).
 On the brick stack: Below protocol/server, above and below io-threads
 and just above storage/posix.

 Based on a 60-second run of randrd test and subsequent analysis of the
 stats dumped by the individual io-stats instances, the following is what I
 found:

 *​​Translator Position*   *Avg Latency of READ fop
 as seen by this translator*

 1. parent of client-io-threads1666us

 ∆ (1,2) = 50us

 2. parent of protocol/client-01616us

 ∆ (2,3) = 1453us

 - end of client stack -
 - beginning of brick stack ---

 3. child of protocol/server   163us

 ∆ (3,4) = 7us

 4. parent of io-threads156us

 ∆ (4,5) = 20us

 5. child-of-io-threads  136us

 ∆ (5,6) = 11us

 6. parent of storage/posix   125us
 ...
  end of brick stack 

 So it seems like the biggest bottleneck here is a combination of the
 network + epoll, rpc layer?
 I must admit I am no expert with networks, but I'm assuming if the
 client is reading from the local brick, then
 even latency contribution from the actual network won't be much, in
 which case bulk of the latency is coming from epoll, rpc layer, etc at both
 client and brick end? Please correct me if I'm wrong.

 I will, of course, do some more runs and confirm if the pattern is
 consistent.

 -Krutika


>>> Really interesting numbers! How many concurrent requests are in flight
>>> in this test? Could you post the fio job? I'm wondering if/how these
>>> latency numbers change if you reduce the number of concurrent requests.
>>>
>>> -- Manoj
>>>
>>>
>>
>

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-08 Thread Ashish Pandey

Please note the fio bug https://github.com/axboe/fio/issues/376, which actually
impacts performance in the case of EC volumes.
I am not sure whether it is relevant in your case, but I thought I would mention it.

Ashish 
- Original Message -

From: "Manoj Pillai" <mpil...@redhat.com> 
To: "Krutika Dhananjay" <kdhan...@redhat.com> 
Cc: "Gluster Devel" <gluster-devel@gluster.org> 
Sent: Thursday, June 8, 2017 12:22:19 PM 
Subject: Re: [Gluster-devel] Performance experiments with io-stats translator 

Thanks. So I was suggesting a repeat of the test, but this time with iodepth=1
in the fio job. If reducing the no. of concurrent requests drastically reduces
the high latency you're seeing from the client side, that would strengthen the
hypothesis that serialization/contention among concurrent requests at the n/w
layers is the root cause here.

-- Manoj 

On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay < kdhan...@redhat.com > 
wrote: 



Hi, 

This is what my job file contains: 

[global] 
ioengine=libaio 
#unified_rw_reporting=1 
randrepeat=1 
norandommap=1 
group_reporting 
direct=1 
runtime=60 
thread 
size=16g 


[workload] 
bs=4k 
rw=randread 
iodepth=8 
numjobs=1 
file_service_type=random 
filename=/perf5/iotest/fio_5 
filename=/perf6/iotest/fio_6 
filename=/perf7/iotest/fio_7 
filename=/perf8/iotest/fio_8 

I have 3 vms reading from one mount, and each of these vms is running the above 
job in parallel. 

-Krutika 

On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai < mpil...@redhat.com > wrote: 





On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay < kdhan...@redhat.com > 
wrote: 



Hi, 

As part of identifying performance bottlenecks within gluster stack for VM 
image store use-case, I loaded io-stats at multiple points on the client and 
brick stack and ran randrd test using fio from within the hosted vms in 
parallel. 

Before I get to the results, a little bit about the configuration ... 

3 node cluster; 1x3 plain replicate volume with group virt settings, direct-io. 
3 FUSE clients, one per node in the cluster (which implies reads are served 
from the replica that is local to the client). 

io-stats was loaded at the following places: 
On the client stack: Above client-io-threads and above protocol/client-0 (the 
first child of AFR). 
On the brick stack: Below protocol/server, above and below io-threads and just 
above storage/posix. 

Based on a 60-second run of randrd test and subsequent analysis of the stats 
dumped by the individual io-stats instances, the following is what I found: 

​​Translator Position Avg Latency of READ fop as seen by this translator 

1. parent of client-io-threads 1666us 

∆ (1,2) = 50us 

2. parent of protocol/client-0 1616us 

∆ (2,3) = 1453us 

- end of client stack - 
- beginning of brick stack --- 

3. child of protocol/server 163us 

∆ (3,4) = 7us 

4. parent of io-threads 156us 

∆ (4,5) = 20us 

5. child-of-io-threads 136us 

∆ (5,6) = 11us 

6. parent of storage/posix 125us 
... 
 end of brick stack  

So it seems like the biggest bottleneck here is a combination of the network + 
epoll, rpc layer? 
I must admit I am no expert with networks, but I'm assuming if the client is 
reading from the local brick, then 
even latency contribution from the actual network won't be much, in which case 
bulk of the latency is coming from epoll, rpc layer, etc at both client and 
brick end? Please correct me if I'm wrong. 

I will, of course, do some more runs and confirm if the pattern is consistent. 

-Krutika 





Really interesting numbers! How many concurrent requests are in flight in this 
test? Could you post the fio job? I'm wondering if/how these latency numbers 
change if you reduce the number of concurrent requests. 

-- Manoj 











Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-08 Thread Manoj Pillai
Thanks. So I was suggesting a repeat of the test, but this time with
iodepth=1 in the fio job. If reducing the number of concurrent requests
drastically reduces the high latency you're seeing from the client side,
that would strengthen the hypothesis that serialization/contention among
concurrent requests at the n/w layers is the root cause here.
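
Concretely, something like this -- same [global] section as in your job file, with
only iodepth changed in the workload section:

[workload]
bs=4k
rw=randread
iodepth=1
numjobs=1
file_service_type=random
filename=/perf5/iotest/fio_5
filename=/perf6/iotest/fio_6
filename=/perf7/iotest/fio_7
filename=/perf8/iotest/fio_8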

-- Manoj

On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay 
wrote:

> Hi,
>
> This is what my job file contains:
>
> [global]
> ioengine=libaio
> #unified_rw_reporting=1
> randrepeat=1
> norandommap=1
> group_reporting
> direct=1
> runtime=60
> thread
> size=16g
>
>
> [workload]
> bs=4k
> rw=randread
> iodepth=8
> numjobs=1
> file_service_type=random
> filename=/perf5/iotest/fio_5
> filename=/perf6/iotest/fio_6
> filename=/perf7/iotest/fio_7
> filename=/perf8/iotest/fio_8
>
> I have 3 vms reading from one mount, and each of these vms is running the
> above job in parallel.
>
> -Krutika
>
> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai  wrote:
>
>>
>>
>> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay 
>> wrote:
>>
>>> Hi,
>>>
>>> As part of identifying performance bottlenecks within gluster stack for
>>> VM image store use-case, I loaded io-stats at multiple points on the client
>>> and brick stack and ran randrd test using fio from within the hosted vms in
>>> parallel.
>>>
>>> Before I get to the results, a little bit about the configuration ...
>>>
>>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>>> direct-io.
>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>> served from the replica that is local to the client).
>>>
>>> io-stats was loaded at the following places:
>>> On the client stack: Above client-io-threads and above protocol/client-0
>>> (the first child of AFR).
>>> On the brick stack: Below protocol/server, above and below io-threads
>>> and just above storage/posix.
>>>
>>> Based on a 60-second run of randrd test and subsequent analysis of the
>>> stats dumped by the individual io-stats instances, the following is what I
>>> found:
>>>
>>> *​​Translator Position*   *Avg Latency of READ fop
>>> as seen by this translator*
>>>
>>> 1. parent of client-io-threads1666us
>>>
>>> ∆ (1,2) = 50us
>>>
>>> 2. parent of protocol/client-01616us
>>>
>>> ∆ (2,3) = 1453us
>>>
>>> - end of client stack -
>>> - beginning of brick stack ---
>>>
>>> 3. child of protocol/server   163us
>>>
>>> ∆ (3,4) = 7us
>>>
>>> 4. parent of io-threads156us
>>>
>>> ∆ (4,5) = 20us
>>>
>>> 5. child-of-io-threads  136us
>>>
>>> ∆ (5,6) = 11us
>>>
>>> 6. parent of storage/posix   125us
>>> ...
>>>  end of brick stack 
>>>
>>> So it seems like the biggest bottleneck here is a combination of the
>>> network + epoll, rpc layer?
>>> I must admit I am no expert with networks, but I'm assuming if the
>>> client is reading from the local brick, then
>>> even latency contribution from the actual network won't be much, in
>>> which case bulk of the latency is coming from epoll, rpc layer, etc at both
>>> client and brick end? Please correct me if I'm wrong.
>>>
>>> I will, of course, do some more runs and confirm if the pattern is
>>> consistent.
>>>
>>> -Krutika
>>>
>>>
>> Really interesting numbers! How many concurrent requests are in flight in
>> this test? Could you post the fio job? I'm wondering if/how these latency
>> numbers change if you reduce the number of concurrent requests.
>>
>> -- Manoj
>>
>>
>

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-08 Thread Krutika Dhananjay
@Xavi/Raghavendra,

Indeed. I also suspect mutex contention at the epoll layer, and I've been
reading the corresponding code (my first time through it) ever since I got
these numbers.
I will get back to you if I have any specific questions for you around this.

-Krutika

On Thu, Jun 8, 2017 at 9:58 AM, Raghavendra G 
wrote:

>
>
> On Wed, Jun 7, 2017 at 11:59 AM, Xavier Hernandez 
> wrote:
>
>> Hi Krutika,
>>
>> On 06/06/17 13:35, Krutika Dhananjay wrote:
>>
>>> Hi,
>>>
>>> As part of identifying performance bottlenecks within gluster stack for
>>> VM image store use-case, I loaded io-stats at multiple points on the
>>> client and brick stack and ran randrd test using fio from within the
>>> hosted vms in parallel.
>>>
>>> Before I get to the results, a little bit about the configuration ...
>>>
>>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>>> direct-io.
>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>> served from the replica that is local to the client).
>>>
>>> io-stats was loaded at the following places:
>>> On the client stack: Above client-io-threads and above protocol/client-0
>>> (the first child of AFR).
>>> On the brick stack: Below protocol/server, above and below io-threads
>>> and just above storage/posix.
>>>
>>> Based on a 60-second run of randrd test and subsequent analysis of the
>>> stats dumped by the individual io-stats instances, the following is what
>>> I found:
>>>
>>> _*​​Translator Position*_*   *_*Avg Latency of READ
>>> fop as seen by this translator*_
>>>
>>> 1. parent of client-io-threads1666us
>>>
>>> ∆ (1,2) = 50us
>>>
>>> 2. parent of protocol/client-01616us
>>>
>>> ∆(2,3) = 1453us
>>>
>>> - end of client stack -
>>> - beginning of brick stack ---
>>>
>>> 3. child of protocol/server   163us
>>>
>>> ∆(3,4) = 7us
>>>
>>> 4. parent of io-threads156us
>>>
>>> ∆(4,5) = 20us
>>>
>>> 5. child-of-io-threads  136us
>>>
>>> ∆ (5,6) = 11us
>>>
>>> 6. parent of storage/posix   125us
>>> ...
>>>  end of brick stack 
>>>
>>> So it seems like the biggest bottleneck here is a combination of the
>>> network + epoll, rpc layer?
>>> I must admit I am no expert with networks, but I'm assuming if the
>>> client is reading from the local brick, then
>>> even latency contribution from the actual network won't be much, in
>>> which case bulk of the latency is coming from epoll, rpc layer, etc at
>>> both client and brick end? Please correct me if I'm wrong.
>>>
>>> I will, of course, do some more runs and confirm if the pattern is
>>> consistent.
>>>
>>
>> very interesting. These results are similar to what I also observed when
>> doing some ec tests.
>>
>
> For EC we've found [1] to increase the performance. Though not sure
> whether it'll have any significant impact for replicated setups.
>
>
> My personal feeling is that there's high serialization and/or contention
>> in the network layer caused by mutexes, but I don't have data to support
>> that.
>>
>
> As to lock contention or lack of concurrency at socket/rpc layers, AFAIK
> we've following suspects in I/O path (as opposed to accepting/listen paths):
>
> * Only one of reading from socket, writing to socket, error handling on
> socket, voluntary shutdown of sockets (through shutdown) can be in progress
> at a time. IOW, these operations are not concurrent as each one of them
> acquires a lock contended by others. My gut feeling is that at least
> reading from socket and writing to socket can be made concurrent and I've
> to spend more time on this to have a definitive answer.
>
> * Till [1], handler also incurred cost of message processing by higher
> layers (not just the cost of reading a msg from socket). Since we've epoll
> configured with EPOLL_ONESHOT and add back socket only after handler
> completes there was a lag after one msg is read before another msg can be
> read from same socket.
>
> * EPOLL_ONESHOT also means processing of one event (say POLLIN) also
> excludes other events (like POLLOUT when lots of msgs waiting to be written
> to socket) till the event is processed. The vice-versa scenario - reads
> blocked when writes are pending on a socket and a POLLOUT is received - is
> also true here. I think this is another area where we can improve.
>
> Will update the thread as and when I can think of a valid suspect.
>
> [1] https://review.gluster.org/17391
>
>
>>
>> Xavi
>>
>>
>>
>>> -Krutika
>>>
>>>

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-08 Thread Krutika Dhananjay
Hi,

I used Sanjay's setup to get these numbers, so I'm guessing it's a 10G
network. I will check again and let you know if that isn't the case.

-Krutika

On Tue, Jun 6, 2017 at 9:38 PM, Vijay Bellur  wrote:

> Nice work!
>
> What is the network interconnect bandwidth? How much of the network
> bandwidth is in use while the test is being run? Wondering if there is
> saturation in the network layer.
>
> -Vijay
>
> On Tue, Jun 6, 2017 at 7:35 AM, Krutika Dhananjay 
> wrote:
>
>> Hi,
>>
>> As part of identifying performance bottlenecks within gluster stack for
>> VM image store use-case, I loaded io-stats at multiple points on the client
>> and brick stack and ran randrd test using fio from within the hosted vms in
>> parallel.
>>
>> Before I get to the results, a little bit about the configuration ...
>>
>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>> direct-io.
>> 3 FUSE clients, one per node in the cluster (which implies reads are
>> served from the replica that is local to the client).
>>
>> io-stats was loaded at the following places:
>> On the client stack: Above client-io-threads and above protocol/client-0
>> (the first child of AFR).
>> On the brick stack: Below protocol/server, above and below io-threads and
>> just above storage/posix.
>>
>> Based on a 60-second run of randrd test and subsequent analysis of the
>> stats dumped by the individual io-stats instances, the following is what I
>> found:
>>
>> *​​Translator Position*   *Avg Latency of READ fop
>> as seen by this translator*
>>
>> 1. parent of client-io-threads1666us
>>
>> ∆ (1,2) = 50us
>>
>> 2. parent of protocol/client-01616us
>>
>> ∆ (2,3) = 1453us
>>
>> - end of client stack -
>> - beginning of brick stack ---
>>
>> 3. child of protocol/server   163us
>>
>> ∆ (3,4) = 7us
>>
>> 4. parent of io-threads156us
>>
>> ∆ (4,5) = 20us
>>
>> 5. child-of-io-threads  136us
>>
>> ∆ (5,6) = 11us
>>
>> 6. parent of storage/posix   125us
>> ...
>>  end of brick stack 
>>
>> So it seems like the biggest bottleneck here is a combination of the
>> network + epoll, rpc layer?
>> I must admit I am no expert with networks, but I'm assuming if the client
>> is reading from the local brick, then
>> even latency contribution from the actual network won't be much, in which
>> case bulk of the latency is coming from epoll, rpc layer, etc at both
>> client and brick end? Please correct me if I'm wrong.
>>
>> I will, of course, do some more runs and confirm if the pattern is
>> consistent.
>>
>> -Krutika
>>

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-08 Thread Krutika Dhananjay
Hi,

This is what my job file contains:

[global]
ioengine=libaio
#unified_rw_reporting=1
randrepeat=1
norandommap=1
group_reporting
direct=1
runtime=60
thread
size=16g


[workload]
bs=4k
rw=randread
iodepth=8
numjobs=1
file_service_type=random
filename=/perf5/iotest/fio_5
filename=/perf6/iotest/fio_6
filename=/perf7/iotest/fio_7
filename=/perf8/iotest/fio_8

I have 3 vms reading from one mount, and each of these vms is running the
above job in parallel.
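
(If I'm counting right, that works out to iodepth=8 x numjobs=1 = 8 requests in
flight per vm, i.e. roughly 24 outstanding requests in aggregate across the 3 vms.)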

-Krutika

On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai  wrote:

>
>
> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay 
> wrote:
>
>> Hi,
>>
>> As part of identifying performance bottlenecks within gluster stack for
>> VM image store use-case, I loaded io-stats at multiple points on the client
>> and brick stack and ran randrd test using fio from within the hosted vms in
>> parallel.
>>
>> Before I get to the results, a little bit about the configuration ...
>>
>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>> direct-io.
>> 3 FUSE clients, one per node in the cluster (which implies reads are
>> served from the replica that is local to the client).
>>
>> io-stats was loaded at the following places:
>> On the client stack: Above client-io-threads and above protocol/client-0
>> (the first child of AFR).
>> On the brick stack: Below protocol/server, above and below io-threads and
>> just above storage/posix.
>>
>> Based on a 60-second run of randrd test and subsequent analysis of the
>> stats dumped by the individual io-stats instances, the following is what I
>> found:
>>
>> *​​Translator Position*   *Avg Latency of READ fop
>> as seen by this translator*
>>
>> 1. parent of client-io-threads1666us
>>
>> ∆ (1,2) = 50us
>>
>> 2. parent of protocol/client-01616us
>>
>> ∆ (2,3) = 1453us
>>
>> - end of client stack -
>> - beginning of brick stack ---
>>
>> 3. child of protocol/server   163us
>>
>> ∆ (3,4) = 7us
>>
>> 4. parent of io-threads156us
>>
>> ∆ (4,5) = 20us
>>
>> 5. child-of-io-threads  136us
>>
>> ∆ (5,6) = 11us
>>
>> 6. parent of storage/posix   125us
>> ...
>>  end of brick stack 
>>
>> So it seems like the biggest bottleneck here is a combination of the
>> network + epoll, rpc layer?
>> I must admit I am no expert with networks, but I'm assuming if the client
>> is reading from the local brick, then
>> even latency contribution from the actual network won't be much, in which
>> case bulk of the latency is coming from epoll, rpc layer, etc at both
>> client and brick end? Please correct me if I'm wrong.
>>
>> I will, of course, do some more runs and confirm if the pattern is
>> consistent.
>>
>> -Krutika
>>
>>
> Really interesting numbers! How many concurrent requests are in flight in
> this test? Could you post the fio job? I'm wondering if/how these latency
> numbers change if you reduce the number of concurrent requests.
>
> -- Manoj
>
>

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-07 Thread Raghavendra G
On Wed, Jun 7, 2017 at 11:59 AM, Xavier Hernandez 
wrote:

> Hi Krutika,
>
> On 06/06/17 13:35, Krutika Dhananjay wrote:
>
>> Hi,
>>
>> As part of identifying performance bottlenecks within gluster stack for
>> VM image store use-case, I loaded io-stats at multiple points on the
>> client and brick stack and ran randrd test using fio from within the
>> hosted vms in parallel.
>>
>> Before I get to the results, a little bit about the configuration ...
>>
>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>> direct-io.
>> 3 FUSE clients, one per node in the cluster (which implies reads are
>> served from the replica that is local to the client).
>>
>> io-stats was loaded at the following places:
>> On the client stack: Above client-io-threads and above protocol/client-0
>> (the first child of AFR).
>> On the brick stack: Below protocol/server, above and below io-threads
>> and just above storage/posix.
>>
>> Based on a 60-second run of randrd test and subsequent analysis of the
>> stats dumped by the individual io-stats instances, the following is what
>> I found:
>>
>> _*​​Translator Position*_*   *_*Avg Latency of READ
>> fop as seen by this translator*_
>>
>> 1. parent of client-io-threads1666us
>>
>> ∆ (1,2) = 50us
>>
>> 2. parent of protocol/client-01616us
>>
>> ∆(2,3) = 1453us
>>
>> - end of client stack -
>> - beginning of brick stack ---
>>
>> 3. child of protocol/server   163us
>>
>> ∆(3,4) = 7us
>>
>> 4. parent of io-threads156us
>>
>> ∆(4,5) = 20us
>>
>> 5. child-of-io-threads  136us
>>
>> ∆ (5,6) = 11us
>>
>> 6. parent of storage/posix   125us
>> ...
>>  end of brick stack 
>>
>> So it seems like the biggest bottleneck here is a combination of the
>> network + epoll, rpc layer?
>> I must admit I am no expert with networks, but I'm assuming if the
>> client is reading from the local brick, then
>> even latency contribution from the actual network won't be much, in
>> which case bulk of the latency is coming from epoll, rpc layer, etc at
>> both client and brick end? Please correct me if I'm wrong.
>>
>> I will, of course, do some more runs and confirm if the pattern is
>> consistent.
>>
>
> very interesting. These results are similar to what I also observed when
> doing some ec tests.
>

For EC we've found [1] to increase performance, though I'm not sure whether
it'll have any significant impact for replicated setups.


> My personal feeling is that there's high serialization and/or contention in
> the network layer caused by mutexes, but I don't have data to support that.
>

As to lock contention or lack of concurrency at socket/rpc layers, AFAIK
we've following suspects in I/O path (as opposed to accepting/listen paths):

* Only one of reading from socket, writing to socket, error handling on
socket, voluntary shutdown of sockets (through shutdown) can be in progress
at a time. IOW, these operations are not concurrent as each one of them
acquires a lock contended by the others. My gut feeling is that at least
reading from the socket and writing to the socket can be made concurrent, but
I'll have to spend more time on this to have a definitive answer.

* Till [1], handler also incurred cost of message processing by higher
layers (not just the cost of reading a msg from socket). Since we've epoll
configured with EPOLL_ONESHOT and add back socket only after handler
completes there was a lag after one msg is read before another msg can be
read from same socket.

* EPOLL_ONESHOT also means processing of one event (say POLLIN) also
excludes other events (like POLLOUT when lots of msgs waiting to be written
to socket) till the event is processed. The vice-versa scenario - reads
blocked when writes are pending on a socket and a POLLOUT is received - is
also true here. I think this is another area where we can improve.
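
To make the last two points concrete, the oneshot add-back pattern is roughly the
following (a generic epoll sketch, not our actual event-epoll code):

#include <stdint.h>
#include <sys/epoll.h>

/* With EPOLLONESHOT the fd is disarmed as soon as an event is delivered, so
 * further POLLINs -- or a pending POLLOUT -- on the same socket cannot be
 * delivered until the handler finishes and re-arms the fd. */
static void handle_event(int epfd, int sockfd, uint32_t events)
{
        struct epoll_event ev;

        if (events & EPOLLIN) {
                /* read and process message(s) from sockfd ... */
        }
        if (events & EPOLLOUT) {
                /* flush queued writes to sockfd ... */
        }

        /* add the socket back only after all of the processing above is done */
        ev.events = EPOLLIN | EPOLLOUT | EPOLLONESHOT;
        ev.data.fd = sockfd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, &ev);
}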

Will update the thread as and when I can think of a valid suspect.

[1] https://review.gluster.org/17391


>
> Xavi
>
>
>
>> -Krutika
>>
>>



-- 
Raghavendra G

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-07 Thread Xavier Hernandez

Hi Krutika,

On 06/06/17 13:35, Krutika Dhananjay wrote:

Hi,

As part of identifying performance bottlenecks within gluster stack for
VM image store use-case, I loaded io-stats at multiple points on the
client and brick stack and ran randrd test using fio from within the
hosted vms in parallel.

Before I get to the results, a little bit about the configuration ...

3 node cluster; 1x3 plain replicate volume with group virt settings,
direct-io.
3 FUSE clients, one per node in the cluster (which implies reads are
served from the replica that is local to the client).

io-stats was loaded at the following places:
On the client stack: Above client-io-threads and above protocol/client-0
(the first child of AFR).
On the brick stack: Below protocol/server, above and below io-threads
and just above storage/posix.

Based on a 60-second run of randrd test and subsequent analysis of the
stats dumped by the individual io-stats instances, the following is what
I found:

Translator Position                  Avg Latency of READ fop as seen by this translator

1. parent of client-io-threads       1666us
   ∆ (1,2) = 50us
2. parent of protocol/client-0       1616us
   ∆ (2,3) = 1453us
--------------- end of client stack ----------------
------------- beginning of brick stack -------------
3. child of protocol/server          163us
   ∆ (3,4) = 7us
4. parent of io-threads              156us
   ∆ (4,5) = 20us
5. child of io-threads               136us
   ∆ (5,6) = 11us
6. parent of storage/posix           125us
...
--------------- end of brick stack -----------------

So it seems like the biggest bottleneck here is a combination of the
network + epoll, rpc layer?
I must admit I am no expert with networks, but I'm assuming if the
client is reading from the local brick, then
even latency contribution from the actual network won't be much, in
which case bulk of the latency is coming from epoll, rpc layer, etc at
both client and brick end? Please correct me if I'm wrong.

I will, of course, do some more runs and confirm if the pattern is
consistent.


very interesting. These results are similar to what I also observed when 
doing some ec tests.


My personal feeling is that there's high serialization and/or contention 
in the network layer caused by mutexes, but I don't have data to support 
that.


Xavi



-Krutika



Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-06 Thread Vijay Bellur
Nice work!

What is the network interconnect bandwidth? How much of the network
bandwidth is in use while the test is being run? Wondering if there is
saturation in the network layer.

-Vijay

On Tue, Jun 6, 2017 at 7:35 AM, Krutika Dhananjay 
wrote:

> Hi,
>
> As part of identifying performance bottlenecks within gluster stack for VM
> image store use-case, I loaded io-stats at multiple points on the client
> and brick stack and ran randrd test using fio from within the hosted vms in
> parallel.
>
> Before I get to the results, a little bit about the configuration ...
>
> 3 node cluster; 1x3 plain replicate volume with group virt settings,
> direct-io.
> 3 FUSE clients, one per node in the cluster (which implies reads are
> served from the replica that is local to the client).
>
> io-stats was loaded at the following places:
> On the client stack: Above client-io-threads and above protocol/client-0
> (the first child of AFR).
> On the brick stack: Below protocol/server, above and below io-threads and
> just above storage/posix.
>
> Based on a 60-second run of randrd test and subsequent analysis of the
> stats dumped by the individual io-stats instances, the following is what I
> found:
>
> *​​Translator Position*   *Avg Latency of READ fop as
> seen by this translator*
>
> 1. parent of client-io-threads1666us
>
> ∆ (1,2) = 50us
>
> 2. parent of protocol/client-01616us
>
> ∆ (2,3) = 1453us
>
> - end of client stack -
> - beginning of brick stack ---
>
> 3. child of protocol/server   163us
>
> ∆ (3,4) = 7us
>
> 4. parent of io-threads156us
>
> ∆ (4,5) = 20us
>
> 5. child-of-io-threads  136us
>
> ∆ (5,6) = 11us
>
> 6. parent of storage/posix   125us
> ...
>  end of brick stack 
>
> So it seems like the biggest bottleneck here is a combination of the
> network + epoll, rpc layer?
> I must admit I am no expert with networks, but I'm assuming if the client
> is reading from the local brick, then
> even latency contribution from the actual network won't be much, in which
> case bulk of the latency is coming from epoll, rpc layer, etc at both
> client and brick end? Please correct me if I'm wrong.
>
> I will, of course, do some more runs and confirm if the pattern is
> consistent.
>
> -Krutika
>

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-06 Thread Manoj Pillai
On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay 
wrote:

> Hi,
>
> As part of identifying performance bottlenecks within gluster stack for VM
> image store use-case, I loaded io-stats at multiple points on the client
> and brick stack and ran randrd test using fio from within the hosted vms in
> parallel.
>
> Before I get to the results, a little bit about the configuration ...
>
> 3 node cluster; 1x3 plain replicate volume with group virt settings,
> direct-io.
> 3 FUSE clients, one per node in the cluster (which implies reads are
> served from the replica that is local to the client).
>
> io-stats was loaded at the following places:
> On the client stack: Above client-io-threads and above protocol/client-0
> (the first child of AFR).
> On the brick stack: Below protocol/server, above and below io-threads and
> just above storage/posix.
>
> Based on a 60-second run of randrd test and subsequent analysis of the
> stats dumped by the individual io-stats instances, the following is what I
> found:
>
> *​​Translator Position*   *Avg Latency of READ fop as
> seen by this translator*
>
> 1. parent of client-io-threads1666us
>
> ∆ (1,2) = 50us
>
> 2. parent of protocol/client-01616us
>
> ∆ (2,3) = 1453us
>
> - end of client stack -
> - beginning of brick stack ---
>
> 3. child of protocol/server   163us
>
> ∆ (3,4) = 7us
>
> 4. parent of io-threads156us
>
> ∆ (4,5) = 20us
>
> 5. child-of-io-threads  136us
>
> ∆ (5,6) = 11us
>
> 6. parent of storage/posix   125us
> ...
>  end of brick stack 
>
> So it seems like the biggest bottleneck here is a combination of the
> network + epoll, rpc layer?
> I must admit I am no expert with networks, but I'm assuming if the client
> is reading from the local brick, then
> even latency contribution from the actual network won't be much, in which
> case bulk of the latency is coming from epoll, rpc layer, etc at both
> client and brick end? Please correct me if I'm wrong.
>
> I will, of course, do some more runs and confirm if the pattern is
> consistent.
>
> -Krutika
>
>
Really interesting numbers! How many concurrent requests are in flight in
this test? Could you post the fio job? I'm wondering if/how these latency
numbers change if you reduce the number of concurrent requests.

-- Manoj

[Gluster-devel] Performance experiments with io-stats translator

2017-06-06 Thread Krutika Dhananjay
Hi,

As part of identifying performance bottlenecks within gluster stack for VM
image store use-case, I loaded io-stats at multiple points on the client
and brick stack and ran randrd test using fio from within the hosted vms in
parallel.

Before I get to the results, a little bit about the configuration ...

3 node cluster; 1x3 plain replicate volume with group virt settings,
direct-io.
3 FUSE clients, one per node in the cluster (which implies reads are served
from the replica that is local to the client).

io-stats was loaded at the following places:
On the client stack: Above client-io-threads and above protocol/client-0
(the first child of AFR).
On the brick stack: Below protocol/server, above and below io-threads and
just above storage/posix.
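
In case anyone wants to reproduce this: each extra io-stats instance is just a
debug/io-stats entry stacked into the volfile at the desired point. For example,
the instance above client-io-threads would look roughly like this (the volume and
subvolume names below are placeholders for whatever your setup generates):

volume testvol-debug-io-stats-top
    type debug/io-stats
    option latency-measurement on
    option count-fop-hits on
    subvolumes testvol-client-io-threads
end-volume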

Based on a 60-second run of randrd test and subsequent analysis of the
stats dumped by the individual io-stats instances, the following is what I
found:

Translator Position                  Avg Latency of READ fop as seen by this translator

1. parent of client-io-threads       1666us
   ∆ (1,2) = 50us
2. parent of protocol/client-0       1616us
   ∆ (2,3) = 1453us
--------------- end of client stack ----------------
------------- beginning of brick stack -------------
3. child of protocol/server          163us
   ∆ (3,4) = 7us
4. parent of io-threads              156us
   ∆ (4,5) = 20us
5. child of io-threads               136us
   ∆ (5,6) = 11us
6. parent of storage/posix           125us
...
--------------- end of brick stack -----------------

So it seems like the biggest bottleneck here is a combination of the
network + epoll, rpc layer?
I must admit I am no expert with networks, but I'm assuming if the client
is reading from the local brick, then
even latency contribution from the actual network won't be much, in which
case bulk of the latency is coming from epoll, rpc layer, etc at both
client and brick end? Please correct me if I'm wrong.
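
(As a rough sanity check on that assumption: a same-host TCP round trip is typically
only a few tens of microseconds, so even if every READ went over the loopback path,
the wire itself should account for only a small fraction of the ~1453us gap between
points 2 and 3.)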

I will, of course, do some more runs and confirm if the pattern is
consistent.

-Krutika