Re: [ceph-users] Replication lag in block storage

2014-03-15 Thread Ирек Фасихов
Which model of hard drives do you have?


2014-03-14 21:59 GMT+04:00 Greg Poirier :

> We are stressing these boxes pretty spectacularly at the moment.
>
> On every box I have one OSD that is pegged for IO almost constantly.
>
> ceph-1:
> Device:  rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdv        0.00    0.00  104.00  160.00   748.00  1000.00    13.24     1.15   4.36    9.46    1.05   3.70  97.60
>
> ceph-2:
> Device:  rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdq        0.00   25.00  109.00  218.00   844.00  1773.50    16.01     1.37   4.20    9.03    1.78   3.01  98.40
>
> ceph-3:
> Device:  rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdm        0.00    0.00  126.00   56.00   996.00   540.00    16.88     1.01   5.58    8.06    0.00   5.43  98.80
>
> These are all disks in my block storage pool.
>
>  osdmap e26698: 102 osds: 102 up, 102 in
>   pgmap v6752413: 4624 pgs, 3 pools, 14151 GB data, 21729 kobjects
> 28517 GB used, 65393 GB / 93911 GB avail
> 4624 active+clean
>   client io 1915 kB/s rd, 59690 kB/s wr, 1464 op/s
>
> I don't see any smart errors, but I'm slowly working my way through all of
> the disks on these machines with smartctl to see if anything stands out.
>
>
> On Fri, Mar 14, 2014 at 9:52 AM, Gregory Farnum  wrote:
>
>> On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier 
>> wrote:
>> > So, on the cluster that I _expect_ to be slow, it appears that we are
>> > waiting on journal commits. I want to make sure that I am reading this
>> > correctly:
>> >
>> >   "received_at": "2014-03-14 12:14:22.659170",
>> >
>> > { "time": "2014-03-14 12:14:22.660191",
>> >   "event": "write_thread_in_journal_buffer"},
>> >
>> > At this point we have received the write and are attempting to write the
>> > transaction to the OSD's journal, yes?
>> >
>> > Then:
>> >
>> > { "time": "2014-03-14 12:14:22.900779",
>> >   "event": "journaled_completion_queued"},
>> >
>> > 240ms later we have successfully written to the journal?
>>
>> Correct. That seems an awfully long time for a 16K write, although I
>> don't know how much data I have on co-located journals. (At least, I'm
>> assuming it's in the 16K range based on the others, although I'm just
>> now realizing that subops aren't providing that information...I've
>> created a ticket to include that diagnostic info in future.)
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> > I expect this particular slowness due to colocation of journal and data on
>> > the same disk (and it's a spinning disk, not an SSD). I expect some of this
>> > could be alleviated by migrating journals to SSDs, but I am looking to
>> > rebuild in the near future--so am willing to hobble in the meantime.
>> >
>> > I am surprised that our all SSD cluster is also underperforming. I am trying
>> > colocating the journal on the same disk with all SSDs at the moment and will
>> > see if the performance degradation is of the same nature.
>> >
>> >
>> >
>> > On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum 
>> wrote:
>> >>
>> >> Right. So which is the interval that's taking all the time? Probably
>> >> it's waiting for the journal commit, but maybe there's something else
>> >> blocking progress. If it is the journal commit, check out how busy the
>> >> disk is (is it just saturated?) and what its normal performance
>> >> characteristics are (is it breaking?).
>> >> -Greg
>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >>
>> >>
>> >> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier > >
>> >> wrote:
>> >> > Many of the sub ops look like this, with significant lag between
>> >> > received_at
>> >> > and commit_sent:
>> >> >
>> >> > { "description": "osd_op(client.6869831.0:1192491
>> >> > rbd_data.67b14a2ae8944a.9105 [write 507904~3686400]
>> >> > 6.556a4db0
>> >> > e660)",
>> >> >   "received_at": "2014-03-13 20:42:05.811936",
>> >> >   "age": "46.088198",
>> >> >   "duration": "0.038328",
>> >> > 
>> >> > { "time": "2014-03-13 20:42:05.850215",
>> >> >   "event": "commit_sent"},
>> >> > { "time": "2014-03-13 20:42:05.850264",
>> >> >   "event": "done"}]]},
>> >> >
>> >> > In this case almost 39ms between received_at and commit_sent.
>> >> >
>> >> > A particularly egregious example of 80+ms lag between received_at and
>> >> > commit_sent:
>> >> >
>> >> >{ "description": "osd_op(client.6869831.0:1190526
>> >> > rbd_data.67b14a2ae8944a.8fac [write 3325952~868352]
>> >> > 6.5255f5fd
>> >> > e660)",
>> >> >   "received_at": "201

Re: [ceph-users] Replication lag in block storage

2014-03-14 Thread Greg Poirier
We are stressing these boxes pretty spectacularly at the moment.

On every box I have one OSD that is pegged for IO almost constantly.

ceph-1:
Device:  rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdv        0.00    0.00  104.00  160.00   748.00  1000.00    13.24     1.15   4.36    9.46    1.05   3.70  97.60

ceph-2:
Device:  rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdq        0.00   25.00  109.00  218.00   844.00  1773.50    16.01     1.37   4.20    9.03    1.78   3.01  98.40

ceph-3:
Device:  rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdm        0.00    0.00  126.00   56.00   996.00   540.00    16.88     1.01   5.58    8.06    0.00   5.43  98.80

These are all disks in my block storage pool.

 osdmap e26698: 102 osds: 102 up, 102 in
  pgmap v6752413: 4624 pgs, 3 pools, 14151 GB data, 21729 kobjects
28517 GB used, 65393 GB / 93911 GB avail
4624 active+clean
  client io 1915 kB/s rd, 59690 kB/s wr, 1464 op/s

I don't see any smart errors, but I'm slowly working my way through all of
the disks on these machines with smartctl to see if anything stands out.
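That per-disk sweep can be scripted; here is a minimal sketch (the `run` hook is only there so the loop can be exercised without real disks, and the parsing of the usual `smartctl -H` "overall-health ... PASSED" line may need adjusting per drive model):

```python
import subprocess

def smart_health(devices, run=None):
    """Map each device to True (healthy) / False based on `smartctl -H` output.

    `run` may be replaced with a stub for testing; by default it shells out.
    """
    if run is None:
        def run(dev):
            return subprocess.run(
                ["smartctl", "-H", dev],
                capture_output=True, text=True, check=False,
            ).stdout

    health = {}
    for dev in devices:
        out = run(dev)
        # ATA disks typically report:
        # "SMART overall-health self-assessment test result: PASSED"
        line = next((l for l in out.splitlines() if "overall-health" in l), "")
        health[dev] = "PASSED" in line
    return health
```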


On Fri, Mar 14, 2014 at 9:52 AM, Gregory Farnum  wrote:

> On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier 
> wrote:
> > So, on the cluster that I _expect_ to be slow, it appears that we are
> > waiting on journal commits. I want to make sure that I am reading this
> > correctly:
> >
> >   "received_at": "2014-03-14 12:14:22.659170",
> >
> > { "time": "2014-03-14 12:14:22.660191",
> >   "event": "write_thread_in_journal_buffer"},
> >
> > At this point we have received the write and are attempting to write the
> > transaction to the OSD's journal, yes?
> >
> > Then:
> >
> > { "time": "2014-03-14 12:14:22.900779",
> >   "event": "journaled_completion_queued"},
> >
> > 240ms later we have successfully written to the journal?
>
> Correct. That seems an awfully long time for a 16K write, although I
> don't know how much data I have on co-located journals. (At least, I'm
> assuming it's in the 16K range based on the others, although I'm just
> now realizing that subops aren't providing that information...I've
> created a ticket to include that diagnostic info in future.)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> > I expect this particular slowness due to colocation of journal and data on
> > the same disk (and it's a spinning disk, not an SSD). I expect some of this
> > could be alleviated by migrating journals to SSDs, but I am looking to
> > rebuild in the near future--so am willing to hobble in the meantime.
> >
> > I am surprised that our all SSD cluster is also underperforming. I am trying
> > colocating the journal on the same disk with all SSDs at the moment and will
> > see if the performance degradation is of the same nature.
> >
> >
> >
> > On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum 
> wrote:
> >>
> >> Right. So which is the interval that's taking all the time? Probably
> >> it's waiting for the journal commit, but maybe there's something else
> >> blocking progress. If it is the journal commit, check out how busy the
> >> disk is (is it just saturated?) and what its normal performance
> >> characteristics are (is it breaking?).
> >> -Greg
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>
> >>
> >> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier 
> >> wrote:
> >> > Many of the sub ops look like this, with significant lag between
> >> > received_at
> >> > and commit_sent:
> >> >
> >> > { "description": "osd_op(client.6869831.0:1192491
> >> > rbd_data.67b14a2ae8944a.9105 [write 507904~3686400]
> >> > 6.556a4db0
> >> > e660)",
> >> >   "received_at": "2014-03-13 20:42:05.811936",
> >> >   "age": "46.088198",
> >> >   "duration": "0.038328",
> >> > 
> >> > { "time": "2014-03-13 20:42:05.850215",
> >> >   "event": "commit_sent"},
> >> > { "time": "2014-03-13 20:42:05.850264",
> >> >   "event": "done"}]]},
> >> >
> >> > In this case almost 39ms between received_at and commit_sent.
> >> >
> >> > A particularly egregious example of 80+ms lag between received_at and
> >> > commit_sent:
> >> >
> >> >{ "description": "osd_op(client.6869831.0:1190526
> >> > rbd_data.67b14a2ae8944a.8fac [write 3325952~868352]
> >> > 6.5255f5fd
> >> > e660)",
> >> >   "received_at": "2014-03-13 20:41:40.227813",
> >> >   "age": "320.017087",
> >> >   "duration": "0.086852",
> >> > 
> >> > { "time": "2014-03-13 20:41:40.314633",
> >> >   "event": "commit_sent

Re: [ceph-users] Replication lag in block storage

2014-03-14 Thread Gregory Farnum
On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier  wrote:
> So, on the cluster that I _expect_ to be slow, it appears that we are
> waiting on journal commits. I want to make sure that I am reading this
> correctly:
>
>   "received_at": "2014-03-14 12:14:22.659170",
>
> { "time": "2014-03-14 12:14:22.660191",
>   "event": "write_thread_in_journal_buffer"},
>
> At this point we have received the write and are attempting to write the
> transaction to the OSD's journal, yes?
>
> Then:
>
> { "time": "2014-03-14 12:14:22.900779",
>   "event": "journaled_completion_queued"},
>
> 240ms later we have successfully written to the journal?

Correct. That seems an awfully long time for a 16K write, although I
don't know how much data I have on co-located journals. (At least, I'm
assuming it's in the 16K range based on the others, although I'm just
now realizing that subops aren't providing that information...I've
created a ticket to include that diagnostic info in future.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


> I expect this particular slowness due to colocation of journal and data on
> the same disk (and it's a spinning disk, not an SSD). I expect some of this
> could be alleviated by migrating journals to SSDs, but I am looking to
> rebuild in the near future--so am willing to hobble in the meantime.
>
> I am surprised that our all SSD cluster is also underperforming. I am trying
> colocating the journal on the same disk with all SSDs at the moment and will
> see if the performance degradation is of the same nature.
>
>
>
> On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum  wrote:
>>
>> Right. So which is the interval that's taking all the time? Probably
>> it's waiting for the journal commit, but maybe there's something else
>> blocking progress. If it is the journal commit, check out how busy the
>> disk is (is it just saturated?) and what its normal performance
>> characteristics are (is it breaking?).
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier 
>> wrote:
>> > Many of the sub ops look like this, with significant lag between
>> > received_at
>> > and commit_sent:
>> >
>> > { "description": "osd_op(client.6869831.0:1192491
>> > rbd_data.67b14a2ae8944a.9105 [write 507904~3686400]
>> > 6.556a4db0
>> > e660)",
>> >   "received_at": "2014-03-13 20:42:05.811936",
>> >   "age": "46.088198",
>> >   "duration": "0.038328",
>> > 
>> > { "time": "2014-03-13 20:42:05.850215",
>> >   "event": "commit_sent"},
>> > { "time": "2014-03-13 20:42:05.850264",
>> >   "event": "done"}]]},
>> >
>> > In this case almost 39ms between received_at and commit_sent.
>> >
>> > A particularly egregious example of 80+ms lag between received_at and
>> > commit_sent:
>> >
>> >{ "description": "osd_op(client.6869831.0:1190526
>> > rbd_data.67b14a2ae8944a.8fac [write 3325952~868352]
>> > 6.5255f5fd
>> > e660)",
>> >   "received_at": "2014-03-13 20:41:40.227813",
>> >   "age": "320.017087",
>> >   "duration": "0.086852",
>> > 
>> > { "time": "2014-03-13 20:41:40.314633",
>> >   "event": "commit_sent"},
>> > { "time": "2014-03-13 20:41:40.314665",
>> >   "event": "done"}]]},
>> >
>> >
>> >
>> > On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum 
>> > wrote:
>> >>
>> >> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier 
>> >> wrote:
>> >> > We've been seeing this issue on all of our dumpling clusters, and I'm
>> >> > wondering what might be the cause of it.
>> >> >
>> >> > In dump_historic_ops, the time between op_applied and
>> >> > sub_op_commit_rec
>> >> > or
>> >> > the time between commit_sent and sub_op_applied is extremely high.
>> >> > Some
>> >> > of
>> >> > the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is
>> >> > included at the bottom.
>> >>
>> >> It's important to understand what each of those timestamps are
>> >> reporting.
>> >>
>> >> op_applied: the point at which an OSD has applied an operation to its
>> >> readable backing filesystem in-memory (which for xfs or ext4 will be
>> >> after it's committed to the journal)
>> >> sub_op_commit_rec: the point at which an OSD has gotten commits from
>> >> the replica OSDs
>> >> commit_sent: the point at which a replica OSD has sent a commit back
>> >> to its primary
>> >> sub_op_applied: the point at which a replica OSD has applied a
>> >> particular operation to its backing filesystem in-memory (again, after
>> >> the journal if using xfs)
>> >>
>> >> Reads are never served from replicas, so a long time between
>> >> commit_sent and sub_op_applied should not in itself be an issue. A lag
>> >> time between op_applied and sub_op_commit_rec means

Re: [ceph-users] Replication lag in block storage

2014-03-14 Thread Greg Poirier
So, on the cluster that I _expect_ to be slow, it appears that we are
waiting on journal commits. I want to make sure that I am reading this
correctly:

  "received_at": "2014-03-14 12:14:22.659170",

{ "time": "2014-03-14 12:14:22.660191",
  "event": "write_thread_in_journal_buffer"},

At this point we have received the write and are attempting to write the
transaction to the OSD's journal, yes?

Then:

{ "time": "2014-03-14 12:14:22.900779",
  "event": "journaled_completion_queued"},

240ms later we have successfully written to the journal?
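The arithmetic on those two timestamps can be checked directly; a minimal sketch using plain datetime parsing (nothing Ceph-specific, timestamps copied from the events above):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def lag_ms(start, end):
    """Milliseconds elapsed between two dump_historic_ops timestamps."""
    return (datetime.strptime(end, FMT)
            - datetime.strptime(start, FMT)).total_seconds() * 1000.0

# received_at -> journaled_completion_queued, from the op above
print(round(lag_ms("2014-03-14 12:14:22.659170",
                   "2014-03-14 12:14:22.900779"), 1))  # -> 241.6
```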

I expect this particular slowness is due to the colocation of journal and data
on the same disk (and it's a spinning disk, not an SSD). I expect some of this
could be alleviated by migrating journals to SSDs, but I am looking to rebuild
in the near future, so I am willing to hobble along in the meantime.

I am surprised that our all-SSD cluster is also underperforming. At the moment
I am trying the journal colocated on the same disk with all SSDs, and I will
see whether the performance degradation is of the same nature.



On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum  wrote:

> Right. So which is the interval that's taking all the time? Probably
> it's waiting for the journal commit, but maybe there's something else
> blocking progress. If it is the journal commit, check out how busy the
> disk is (is it just saturated?) and what its normal performance
> characteristics are (is it breaking?).
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier 
> wrote:
> > Many of the sub ops look like this, with significant lag between
> received_at
> > and commit_sent:
> >
> > { "description": "osd_op(client.6869831.0:1192491
> > rbd_data.67b14a2ae8944a.9105 [write 507904~3686400]
> 6.556a4db0
> > e660)",
> >   "received_at": "2014-03-13 20:42:05.811936",
> >   "age": "46.088198",
> >   "duration": "0.038328",
> > 
> > { "time": "2014-03-13 20:42:05.850215",
> >   "event": "commit_sent"},
> > { "time": "2014-03-13 20:42:05.850264",
> >   "event": "done"}]]},
> >
> > In this case almost 39ms between received_at and commit_sent.
> >
> > A particularly egregious example of 80+ms lag between received_at and
> > commit_sent:
> >
> >{ "description": "osd_op(client.6869831.0:1190526
> > rbd_data.67b14a2ae8944a.8fac [write 3325952~868352]
> 6.5255f5fd
> > e660)",
> >   "received_at": "2014-03-13 20:41:40.227813",
> >   "age": "320.017087",
> >   "duration": "0.086852",
> > 
> > { "time": "2014-03-13 20:41:40.314633",
> >   "event": "commit_sent"},
> > { "time": "2014-03-13 20:41:40.314665",
> >   "event": "done"}]]},
> >
> >
> >
> > On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum 
> wrote:
> >>
> >> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier 
> >> wrote:
> >> > We've been seeing this issue on all of our dumpling clusters, and I'm
> >> > wondering what might be the cause of it.
> >> >
> >> > In dump_historic_ops, the time between op_applied and sub_op_commit_rec
> >> > or the time between commit_sent and sub_op_applied is extremely high.
> >> > Some of the osd_sub_ops are as long as 100 ms. A sample
> >> > dump_historic_ops is included at the bottom.
> >>
> >> It's important to understand what each of those timestamps are
> reporting.
> >>
> >> op_applied: the point at which an OSD has applied an operation to its
> >> readable backing filesystem in-memory (which for xfs or ext4 will be
> >> after it's committed to the journal)
> >> sub_op_commit_rec: the point at which an OSD has gotten commits from
> >> the replica OSDs
> >> commit_sent: the point at which a replica OSD has sent a commit back
> >> to its primary
> >> sub_op_applied: the point at which a replica OSD has applied a
> >> particular operation to its backing filesystem in-memory (again, after
> >> the journal if using xfs)
> >>
> >> Reads are never served from replicas, so a long time between
> >> commit_sent and sub_op_applied should not in itself be an issue. A lag
> >> time between op_applied and sub_op_commit_rec means that the OSD is
> >> waiting on its replicas. A long time there indicates either that the
> >> replica is processing slowly, or that there's some issue in the
> >> communications stack (all the way from the raw ethernet up to the
> >> message handling in the OSD itself).
> >> So the first thing to look for are sub ops which have a lag time
> >> between the received_at and commit_sent timestamps. If none of those
> >> ever turn up, but unusually long waits for sub_op_commit_rec are still
> >> present, then it'll take more effort to correlate particular subops on
> >> replicas with the op on the primary they c

Re: [ceph-users] Replication lag in block storage

2014-03-13 Thread Gregory Farnum
Right. So which is the interval that's taking all the time? Probably
it's waiting for the journal commit, but maybe there's something else
blocking progress. If it is the journal commit, check out how busy the
disk is (is it just saturated?) and what its normal performance
characteristics are (is it breaking?).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier  wrote:
> Many of the sub ops look like this, with significant lag between received_at
> and commit_sent:
>
> { "description": "osd_op(client.6869831.0:1192491
> rbd_data.67b14a2ae8944a.9105 [write 507904~3686400] 6.556a4db0
> e660)",
>   "received_at": "2014-03-13 20:42:05.811936",
>   "age": "46.088198",
>   "duration": "0.038328",
> 
> { "time": "2014-03-13 20:42:05.850215",
>   "event": "commit_sent"},
> { "time": "2014-03-13 20:42:05.850264",
>   "event": "done"}]]},
>
> In this case almost 39ms between received_at and commit_sent.
>
> A particularly egregious example of 80+ms lag between received_at and
> commit_sent:
>
>{ "description": "osd_op(client.6869831.0:1190526
> rbd_data.67b14a2ae8944a.8fac [write 3325952~868352] 6.5255f5fd
> e660)",
>   "received_at": "2014-03-13 20:41:40.227813",
>   "age": "320.017087",
>   "duration": "0.086852",
> 
> { "time": "2014-03-13 20:41:40.314633",
>   "event": "commit_sent"},
> { "time": "2014-03-13 20:41:40.314665",
>   "event": "done"}]]},
>
>
>
> On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum  wrote:
>>
>> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier 
>> wrote:
>> > We've been seeing this issue on all of our dumpling clusters, and I'm
>> > wondering what might be the cause of it.
>> >
>> > In dump_historic_ops, the time between op_applied and sub_op_commit_rec
>> > or
>> > the time between commit_sent and sub_op_applied is extremely high. Some
>> > of
>> > the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is
>> > included at the bottom.
>>
>> It's important to understand what each of those timestamps are reporting.
>>
>> op_applied: the point at which an OSD has applied an operation to its
>> readable backing filesystem in-memory (which for xfs or ext4 will be
>> after it's committed to the journal)
>> sub_op_commit_rec: the point at which an OSD has gotten commits from
>> the replica OSDs
>> commit_sent: the point at which a replica OSD has sent a commit back
>> to its primary
>> sub_op_applied: the point at which a replica OSD has applied a
>> particular operation to its backing filesystem in-memory (again, after
>> the journal if using xfs)
>>
>> Reads are never served from replicas, so a long time between
>> commit_sent and sub_op_applied should not in itself be an issue. A lag
>> time between op_applied and sub_op_commit_rec means that the OSD is
>> waiting on its replicas. A long time there indicates either that the
>> replica is processing slowly, or that there's some issue in the
>> communications stack (all the way from the raw ethernet up to the
>> message handling in the OSD itself).
>> So the first thing to look for are sub ops which have a lag time
>> between the received_at and commit_sent timestamps. If none of those
>> ever turn up, but unusually long waits for sub_op_commit_rec are still
>> present, then it'll take more effort to correlate particular subops on
>> replicas with the op on the primary they correspond to, and see where
>> the time lag is coming into it.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
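The question above ("which is the interval that's taking all the time?") can be answered mechanically from `dump_historic_ops` output. A minimal sketch, assuming the flat field names shown in the snippets in this thread (`received_at` plus a list of `{time, event}` entries; the real dumpling JSON nests the events under more structure, so adjust the extraction accordingly):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def interval_breakdown(op):
    """Return [(event, ms_since_previous_event)] for one historic op."""
    prev = datetime.strptime(op["received_at"], FMT)
    out = []
    for step in op["events"]:
        t = datetime.strptime(step["time"], FMT)
        out.append((step["event"], (t - prev).total_seconds() * 1000.0))
        prev = t
    return out

# Events copied from the op quoted earlier in this thread.
op = {
    "received_at": "2014-03-14 12:14:22.659170",
    "events": [
        {"time": "2014-03-14 12:14:22.660191",
         "event": "write_thread_in_journal_buffer"},
        {"time": "2014-03-14 12:14:22.900779",
         "event": "journaled_completion_queued"},
    ],
}
for event, ms in interval_breakdown(op):
    print("%-32s %8.3f ms" % (event, ms))
```

Run against a full dump, the largest per-event delta points at the stage (journal write, replica wait, etc.) eating the latency.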


Re: [ceph-users] Replication lag in block storage

2014-03-13 Thread Greg Poirier
Many of the sub ops look like this, with significant lag between
received_at and commit_sent:

{ "description": "osd_op(client.6869831.0:1192491
rbd_data.67b14a2ae8944a.9105 [write 507904~3686400] 6.556a4db0
e660)",
  "received_at": "2014-03-13 20:42:05.811936",
  "age": "46.088198",
  "duration": "0.038328",

{ "time": "2014-03-13 20:42:05.850215",
  "event": "commit_sent"},
{ "time": "2014-03-13 20:42:05.850264",
  "event": "done"}]]},

In this case almost 39ms between received_at and commit_sent.

A particularly egregious example of 80+ms lag between received_at and
commit_sent:

   { "description": "osd_op(client.6869831.0:1190526
rbd_data.67b14a2ae8944a.8fac [write 3325952~868352] 6.5255f5fd
e660)",
  "received_at": "2014-03-13 20:41:40.227813",
  "age": "320.017087",
  "duration": "0.086852",

{ "time": "2014-03-13 20:41:40.314633",
  "event": "commit_sent"},
{ "time": "2014-03-13 20:41:40.314665",
  "event": "done"}]]},



On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum  wrote:

> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier 
> wrote:
> > We've been seeing this issue on all of our dumpling clusters, and I'm
> > wondering what might be the cause of it.
> >
> > In dump_historic_ops, the time between op_applied and sub_op_commit_rec
> or
> > the time between commit_sent and sub_op_applied is extremely high. Some
> of
> > the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is
> > included at the bottom.
>
> It's important to understand what each of those timestamps are reporting.
>
> op_applied: the point at which an OSD has applied an operation to its
> readable backing filesystem in-memory (which for xfs or ext4 will be
> after it's committed to the journal)
> sub_op_commit_rec: the point at which an OSD has gotten commits from
> the replica OSDs
> commit_sent: the point at which a replica OSD has sent a commit back
> to its primary
> sub_op_applied: the point at which a replica OSD has applied a
> particular operation to its backing filesystem in-memory (again, after
> the journal if using xfs)
>
> Reads are never served from replicas, so a long time between
> commit_sent and sub_op_applied should not in itself be an issue. A lag
> time between op_applied and sub_op_commit_rec means that the OSD is
> waiting on its replicas. A long time there indicates either that the
> replica is processing slowly, or that there's some issue in the
> communications stack (all the way from the raw ethernet up to the
> message handling in the OSD itself).
> So the first thing to look for are sub ops which have a lag time
> between the received_at and commit_sent timestamps. If none of those
> ever turn up, but unusually long waits for sub_op_commit_rec are still
> present, then it'll take more effort to correlate particular subops on
> replicas with the op on the primary they correspond to, and see where
> the time lag is coming into it.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>


Re: [ceph-users] Replication lag in block storage

2014-03-13 Thread Gregory Farnum
On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier  wrote:
> We've been seeing this issue on all of our dumpling clusters, and I'm
> wondering what might be the cause of it.
>
> In dump_historic_ops, the time between op_applied and sub_op_commit_rec or
> the time between commit_sent and sub_op_applied is extremely high. Some of
> the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is
> included at the bottom.

It's important to understand what each of those timestamps are reporting.

op_applied: the point at which an OSD has applied an operation to its
readable backing filesystem in-memory (which for xfs or ext4 will be
after it's committed to the journal)
sub_op_commit_rec: the point at which an OSD has gotten commits from
the replica OSDs
commit_sent: the point at which a replica OSD has sent a commit back
to its primary
sub_op_applied: the point at which a replica OSD has applied a
particular operation to its backing filesystem in-memory (again, after
the journal if using xfs)

Reads are never served from replicas, so a long time between
commit_sent and sub_op_applied should not in itself be an issue. A lag
time between op_applied and sub_op_commit_rec means that the OSD is
waiting on its replicas. A long time there indicates either that the
replica is processing slowly, or that there's some issue in the
communications stack (all the way from the raw ethernet up to the
message handling in the OSD itself).
So the first thing to look for are sub ops which have a lag time
between the received_at and commit_sent timestamps. If none of those
ever turn up, but unusually long waits for sub_op_commit_rec are still
present, then it'll take more effort to correlate particular subops on
replicas with the op on the primary they correspond to, and see where
the time lag is coming into it.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
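The "first thing to look for" described above, sub ops with a long gap between received_at and commit_sent, can be automated across a dump. A sketch under the same assumption about field names as the snippets quoted in this thread:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def slow_subops(ops, threshold_ms=30.0):
    """Yield (description, lag_ms) for ops whose received_at -> commit_sent
    gap exceeds threshold_ms."""
    for op in ops:
        recv = datetime.strptime(op["received_at"], FMT)
        for step in op.get("events", []):
            if step["event"] == "commit_sent":
                lag = (datetime.strptime(step["time"], FMT)
                       - recv).total_seconds() * 1000.0
                if lag > threshold_ms:
                    yield op["description"], lag

# Sample built from the "egregious" op quoted earlier in the thread.
ops = [{
    "description": "osd_op(client.6869831.0:1190526 ...)",
    "received_at": "2014-03-13 20:41:40.227813",
    "events": [{"time": "2014-03-13 20:41:40.314633",
                "event": "commit_sent"}],
}]
for desc, lag in slow_subops(ops):
    print("%.1f ms  %s" % (lag, desc))
```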