Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-09 Thread sridhar basam
On Thu, Feb 9, 2012 at 3:53 PM, Jim Schutt  wrote:
> On 02/06/2012 11:35 AM, Gregory Farnum wrote:
>>
>> On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt  wrote:
>>>
>>> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:


 On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt    wrote:
>>>
>>>
>>>
>
> The above suggests to me that the slowdown is a result
> of requests not getting submitted at the same rate as
> when things are running well.
>

 Yeah, it really looks like that. My suggestions wouldn't help there.

 I do see that when things go well the number of writes per device is
 capped at ~200 writes per second and the throughput per device is
 ~100MB/sec. Is 100MB/sec the expected device throughput?
>>>
>>>
>>>
>>> Pretty much, at least for the outer tracks on a drive.  I've seen
>>> ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
>>> SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
>>> disk subsystem is any sort of significant bottleneck.
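[ A quick way to sanity-check that per-drive number is a direct streaming
  write with dd.  A sketch only: the device name and transfer size are
  placeholders, and this overwrites the target device, so only run it
  against a disk that holds no data:

    # oflag=direct bypasses the page cache so the rate reflects the drive
    dd if=/dev/zero of=/dev/sdX bs=1M count=4096 oflag=direct
]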
>>
>>
>> Well, you might try changing your throttling settings on the OSDs.
>> ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
>> throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
>> and is used to limit the amount of client data in memory (i.e., messages
>> are included in this throttler for their entire lifetime, not just
>> while waiting for dispatch).
>>
>>
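[ For concreteness, a ceph.conf fragment along these lines might look like
  the sketch below.  The option names are the ones mentioned here; the
  values are simply the quoted defaults (100<<20 and 500<<20) written out
  in bytes, not tuning recommendations:

    [osd]
        # 100 << 20 bytes of message data waiting for dispatch
        ms dispatch throttle bytes = 104857600
        # 500 << 20 bytes of client message data held in memory
        osd max client bytes = 524288000
]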
>
> I've made a little progress isolating this.
>
> "osd client message size cap =  500" makes the stall
> completely reproducible (which also means I can reproduce it
> on two different network types, Ethernet and IPoIB), and I
> am able to generate graphs of throttled/receive/process time
> for each request received by an OSD (see attached SVG plot).
>
> Such plots suggest to me my problem is caused by stalled
> receives.  Using debug ms = 30 on my OSDs turns up instances
> of this:
>
> osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/2500
> osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader got front 128
> osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
> osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
> osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304
> osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
> osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073
>
> which I take to mean that the reader thread sat in poll() for 56 secs, in
> this case.
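[ One way to hunt for such stalls mechanically is to pull out the reader
  events for that connection and print the gap between consecutive log
  lines.  A rough sketch, assuming the usual "date time thread ..." log
  line layout; the thread id 7f6acec77700 comes from the excerpt above,
  the log path is an example, and the 10-second threshold is arbitrary:

    grep '7f6acec77700 .*\.reader ' osd.0.log | awk '{
        split($2, t, ":"); now = t[1]*3600 + t[2]*60 + t[3];
        if (prev != "" && now - prev > 10) printf "gap of %.1f s before: %s\n", now - prev, $0;
        prev = now
    }'
]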
>
> I was able to correlate such stalls with tcpdump output collected on
> clients.  Here's an example from another run:
>
> 15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
> 15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
> 15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
> 15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
> 15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
> 15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
> 15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
> 15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
> 15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.],

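[ For digging through a capture like the one above, Wireshark's tshark can
  list the suspect segments directly.  A sketch, assuming a capture file
  named client.pcap; note that older tshark releases take the display
  filter with -R instead of -Y:

    tshark -r client.pcap -Y 'tcp.analysis.retransmission || tcp.analysis.zero_window'
]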
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-10 Thread sridhar basam
On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt  wrote:
> On 02/09/2012 06:26 PM, sridhar basam wrote:
>>
>> Do you mind capturing to a pcap file and providing that? It makes it
>> easier to analyse things. If not, I understand. If you can do the
>> capture on both ends, do it with a snaplen of 68 so that you get all
>> of the headers and there shouldn't be too much payload information in
>> the file.
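[ A capture along those lines might look like the following sketch.  The
  interface, port, and file name are examples (6808 is one of the OSD ports
  seen in the earlier trace), and -s 68 is the snaplen suggested here:

    # headers-only capture on the client; run the matching command on the
    # OSD side to get the trace from both ends
    tcpdump -i eth0 -s 68 -w client.pcap 'tcp port 6808'
]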
>
>
> I've got a pcap file for this run for this client, with snaplen 128
> (I thought I might need to look for ceph message headers).  It's 13 MB
> compressed.  How shall I get it to you?
>

Can I grab it off some webserver you control? Or you could temporarily
drop it into docs.google.com and add access for my email account?

> In the meantime, I'll try to capture this from both sides.
>
>
>>
>> I will take a look at the additional output and see if anything pops
>> out. I am assuming the below output was immediately after what you
>> posted in your earlier email.

I don't see anything out of the ordinary once things recover; the
sender even starts to do TSO after a short while.

 Sridhar


>
>
> Yes.
>
> Thanks -- Jim
>
>
>>
>>
>>  Sridhar
>>
>
>


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-10 Thread sridhar basam
On Fri, Feb 10, 2012 at 6:09 PM, Jim Schutt  wrote:
>
> [ added Cc:netdev
>  See http://www.spinics.net/lists/ceph-devel/msg04804.html
>  for the start of the thread.
>  -- Jim
> ]
>
>
> On 02/10/2012 10:13 AM, sridhar basam wrote:
>>
>> On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt  wrote:
>>>
>>> On 02/09/2012 06:26 PM, sridhar basam wrote:
>>>>
>>>>
>>>> Do you mind capturing to a pcap file and providing that? It makes it
>>>> easier to analyse things. If not, I understand. If you can do the
>>>> capture on both ends, do it with a snaplen of 68 so that you get all
>>>> of the headers and there shouldn't be too much payload information in
>>>> the file.
>>>
>>>
>>>
>>> I've got a pcap file for this run for this client, with snaplen 128
>>> (I thought I might need to look for ceph message headers).  It's 13 MB
>>> compressed.  How shall I get it to you?
>>>
>>
>> Can I grab it off some webserver you control? Or you could temporarily
>> drop it into docs.google.com and add access for my email account?
>
>
> I tabled this for the moment while I worked on collecting
> packet traces from both ends.  But you'll probably want
> to see the pcap files for what I'm about to show.  Also,
> I think I need to add netdev to this discussion ...
>
>
>>
>>> In the meantime, I'll try to capture this from both sides.
>
>
> Here's another example, captured from both sides, with
> TSO/GSO/GRO all off, snaplen 68.
>
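[ For anyone reproducing this, the offload state can be inspected and
  switched off roughly as follows; the interface name is an example, and
  the exact feature names reported vary between drivers and ethtool
  versions:

    ethtool -k eth0                           # show current offload settings
    ethtool -K eth0 tso off gso off gro off   # disable TSO/GSO/GRO for the test
]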
> This was captured from the client side.  Same pattern, in
> that the client sends many retransmits over a period of
> a couple of minutes.  It's different in that the client
> seems to give up and reconnect ...
>
> 11:57:35.984024 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18109890, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984032 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984038 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984120 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18112786, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984129 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984135 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984143 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18115682, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984148 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984153 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984270 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18118578, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984278 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984283 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984420 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984428 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.184945 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.587936 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:37.393937 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:39.003937 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:42.227933 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:48.675931 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,no
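[ For what it's worth, the retransmits of the 18120026:18121474 segment in
  the trace above are spaced with the classic doubling of the TCP
  retransmission timeout (deltas computed from the timestamps shown):

    11:57:36.18 -> 11:57:36.59   ~0.4 s
    11:57:36.59 -> 11:57:37.39   ~0.8 s
    11:57:37.39 -> 11:57:39.00   ~1.6 s
    11:57:39.00 -> 11:57:42.23   ~3.2 s
    11:57:42.23 -> 11:57:48.68   ~6.4 s

  i.e. the client keeps retransmitting and backing off while nothing is
  acknowledged, which is consistent with the receiver-side reader stalls
  seen in the OSD logs earlier in the thread. ]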