Re: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped

2019-08-22 Thread Lars Täuber
There are 30 osds.

Thu, 22 Aug 2019 14:38:10 +0700
wahyu.muqs...@gmail.com ==> ceph-users@lists.ceph.com, Lars Täuber:
> how many osd do you use ?


Re: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped

2019-08-22 Thread Lars Täuber
All OSDs are up.

I manually marked one of the 30 OSDs "out", not "down".
The primary OSDs of the stuck PGs are marked neither out nor down.

Thanks
Lars

Thu, 22 Aug 2019 15:01:12 +0700
wahyu.muqs...@gmail.com ==> wahyu.muqs...@gmail.com, Lars Täuber:
> I think you use too few osd. when you use erasure code, the probability of 
> primary pg being on the down osd will increase
> On 22 Aug 2019 14.51 +0700, Lars Täuber , wrote:
> > There are 30 osds.
> >
> > Thu, 22 Aug 2019 14:38:10 +0700
> > wahyu.muqs...@gmail.com ==> ceph-users@lists.ceph.com, Lars Täuber 
> >  :  
> > > how many osd do you use ?  


Re: [ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space

2019-08-22 Thread Lars Täuber
Hi there!

We are also seeing this behaviour on our cluster while it is moving PGs.

# ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; Reduced data availability: 2 pgs inactive; Degraded data redundancy (low space): 1 pg backfill_toofull
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsmds1(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 359 secs
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
    pg 21.231 is stuck inactive for 878.224182, current state remapped, last acting [20,2147483647,13,2147483647,15,10]
    pg 21.240 is stuck inactive for 878.123932, current state remapped, last acting [26,17,21,20,2147483647,2147483647]
PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
    pg 21.376 is active+remapped+backfill_wait+backfill_toofull, acting [6,11,29,2,10,15]
# ceph pg map 21.376
osdmap e68016 pg 21.376 (21.376) -> up [6,5,23,21,10,11] acting [6,11,29,2,10,15]

# ceph osd dump | fgrep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

This happens while the cluster is rebalancing the PGs after I manually
marked a single OSD out.
See here:
 Subject: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036634.html


Usually the cluster heals itself at least back into state HEALTH_WARN:


# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 2 pgs inactive
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsmds1(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 1155 secs
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
    pg 21.231 is stuck inactive for 1677.312219, current state remapped, last acting [20,2147483647,13,2147483647,15,10]
    pg 21.240 is stuck inactive for 1677.211969, current state remapped, last acting [26,17,21,20,2147483647,2147483647]



Cheers,
Lars


Wed, 21 Aug 2019 17:28:05 -0500
Reed Dier ==> Vladimir Brik:
> Just chiming in to say that I too had some issues with backfill_toofull PGs, 
> despite no OSD's being in a backfill_full state, albeit, there were some 
> nearfull OSDs.
> 
> I was able to get through it by reweighting down the OSD that was the target 
> reported by ceph pg dump | grep 'backfill_toofull'.
> 
> This was on 14.2.2.
> 
> Reed
> 
> > On Aug 21, 2019, at 2:50 PM, Vladimir Brik  
> > wrote:
> > 
> > Hello
> > 
> > After increasing number of PGs in a pool, ceph status is reporting 
> > "Degraded data redundancy (low space): 1 pg backfill_toofull", but I don't 
> > understand why, because all OSDs seem to have enough space.
> > 
> > ceph health detail says:
> > pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]
> > 
> > $ ceph pg map 40.155
> > osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]
> > 
> > So I guess Ceph wants to move 40.155 from 66 to 79 (or other way around?). 
> > According to "osd df", OSD 66's utilization is 71.90%, OSD 79's utilization 
> > is 58.45%. The OSD with least free space in the cluster is 81.23% full, and 
> > it's not any of the ones above.
> > 
> > OSD backfillfull_ratio is 90% (is there a better way to determine this?):
> > $ ceph osd dump | grep ratio
> > full_ratio 0.95
> > backfillfull_ratio 0.9
> > nearfull_ratio 0.7
> > 
> > Does anybody know why a PG could be in the backfill_toofull state if no OSD 
> > is in the backfillfull state?
> > 
> > 
> > Vlad
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de




Re: [ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space

2019-08-22 Thread Brad Hubbard
https://tracker.ceph.com/issues/41255 is probably reporting the same issue.

On Thu, Aug 22, 2019 at 6:31 PM Lars Täuber  wrote:
>
> Hi there!
>
> We also experience this behaviour of our cluster while it is moving pgs.
>
> # ceph health detail
> HEALTH_ERR 1 MDSs report slow metadata IOs; Reduced data availability: 2 pgs 
> inactive; Degraded data redundancy (low space): 1 pg backfill_toofull
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
> mdsmds1(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked 
> for 359 secs
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive
> pg 21.231 is stuck inactive for 878.224182, current state remapped, last 
> acting [20,2147483647,13,2147483647,15,10]
> pg 21.240 is stuck inactive for 878.123932, current state remapped, last 
> acting [26,17,21,20,2147483647,2147483647]
> PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
> pg 21.376 is active+remapped+backfill_wait+backfill_toofull, acting 
> [6,11,29,2,10,15]
> # ceph pg map 21.376
> osdmap e68016 pg 21.376 (21.376) -> up [6,5,23,21,10,11] acting 
> [6,11,29,2,10,15]
>
> # ceph osd dump | fgrep ratio
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.85
>
> This happens while the cluster is rebalancing the pgs after I manually mark a 
> single osd out.
> see here:
>  Subject: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820,  current 
> state remapped
>  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036634.html
>
>
> Mostly the cluster heals itself at least into state HEALTH_WARN:
>
>
> # ceph health detail
> HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 2 pgs 
> inactive
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
> mdsmds1(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked 
> for 1155 secs
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive
> pg 21.231 is stuck inactive for 1677.312219, current state remapped, last 
> acting [20,2147483647,13,2147483647,15,10]
> pg 21.240 is stuck inactive for 1677.211969, current state remapped, last 
> acting [26,17,21,20,2147483647,2147483647]
>
>
>
> Cheers,
> Lars
>
>
> Wed, 21 Aug 2019 17:28:05 -0500
> Reed Dier  ==> Vladimir Brik 
>  :
> > Just chiming in to say that I too had some issues with backfill_toofull 
> > PGs, despite no OSD's being in a backfill_full state, albeit, there were 
> > some nearfull OSDs.
> >
> > I was able to get through it by reweighting down the OSD that was the 
> > target reported by ceph pg dump | grep 'backfill_toofull'.
> >
> > This was on 14.2.2.
> >
> > Reed
> >
> > > On Aug 21, 2019, at 2:50 PM, Vladimir Brik 
> > >  wrote:
> > >
> > > Hello
> > >
> > > After increasing number of PGs in a pool, ceph status is reporting 
> > > "Degraded data redundancy (low space): 1 pg backfill_toofull", but I 
> > > don't understand why, because all OSDs seem to have enough space.
> > >
> > > ceph health detail says:
> > > pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]
> > >
> > > $ ceph pg map 40.155
> > > osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]
> > >
> > > So I guess Ceph wants to move 40.155 from 66 to 79 (or other way 
> > > around?). According to "osd df", OSD 66's utilization is 71.90%, OSD 79's 
> > > utilization is 58.45%. The OSD with least free space in the cluster is 
> > > 81.23% full, and it's not any of the ones above.
> > >
> > > OSD backfillfull_ratio is 90% (is there a better way to determine this?):
> > > $ ceph osd dump | grep ratio
> > > full_ratio 0.95
> > > backfillfull_ratio 0.9
> > > nearfull_ratio 0.7
> > >
> > > Does anybody know why a PG could be in the backfill_toofull state if no 
> > > OSD is in the backfillfull state?
> > >
> > >
> > > Vlad
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Informationstechnologie
> Berlin-Brandenburgische Akademie der Wissenschaften
> Jägerstraße 22-23  10117 Berlin
> Tel.: +49 30 20370-352   http://www.bbaw.de
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad


[ceph-users] Theory: High I/O-wait inside VM with RBD due to CPU throttling

2019-08-22 Thread Wido den Hollander
Hi,

In a couple of situations I have encountered Virtual Machines running on
RBD that had a high I/O-wait, nearly 100%, on their vdX (VirtIO) or sdX
(Virtio-SCSI) devices while they were performing CPU-intensive tasks.

These servers would be running a very CPU-intensive application while
*not* doing much disk I/O.

I nevertheless noticed that the I/O-wait of the disk(s) in the VM went up
to 100%.

The VM is CPU-limited by libvirt, which puts the KVM process in its own
cgroup with a CPU limitation.

Now, my theory is:

KVM (qemu-kvm) is completely userspace and librbd runs inside qemu-kvm
as a library. All threads for disk I/O are part of the same PID and thus
part of that cgroup.

If a process inside the Virtual Machine now starts to consume all CPU
time, there is nothing left for librbd, which slows it down.

This then causes increased I/O-wait inside the Virtual Machine, even
though the VM is not performing a lot of disk I/O.


Is my theory sane? Can somebody confirm this?

Thanks,

Wido


[ceph-users] Tunables client support

2019-08-22 Thread Lukáš Kubín
Hello,
I am considering enabling optimal crush tunables in our Jewel cluster (4
nodes, 52 OSD, used as OpenStack Cinder+Nova backend = RBD images). I've
got two questions:

1. Do I understand right that having the optimal tunables on can be
considered best practice and should be applied in most scenarios? Or is
there something I should be warned about?

2. There's a minimal kernel version requirement for KRBD clients. Does a
similar restriction apply to librbd (libvirt-qemu) clients too? Basically,
I just need to ensure we're not going to harm our clients (OpenStack
instances) after setting the tunables. We're running 4.4 Linux kernels on
the compute nodes, which is not supported by KRBD with the Jewel set of
tunables.
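
For reference, the commands I have in mind are (assuming these are still
the right ones for Jewel, and knowing that switching tunables will trigger
a lot of data movement):

# ceph osd crush show-tunables
# ceph osd crush tunables optimal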

Thanks and greetings,

Lukas


Re: [ceph-users] Theory: High I/O-wait inside VM with RBD due to CPU throttling

2019-08-22 Thread Jason Dillaman
On Thu, Aug 22, 2019 at 9:23 AM Wido den Hollander  wrote:
>
> Hi,
>
> In a couple of situations I have encountered that Virtual Machines
> running on RBD had a high I/O-wait, nearly 100%, on their vdX (VirtIO)
> or sdX (Virtio-SCSI) devices while they were performing CPU intensive tasks.
>
> These servers would be running a very CPU intensive application while
> *not* doing that many disk I/O.
>
> I however noticed that the I/O-wait of the disk(s) in the VM went up to
> 100%.
>
> This VM is CPU limited by Libvirt by putting that KVM process in it's
> own cgroup with a CPU limitation.
>
> Now, my theory is:
>
> KVM (qemu-kvm) is completely userspace and librbd runs inside qemu-kvm
> as a library. All threads for disk I/O are part of the same PID and thus
> part of that cgroup.
>
> If a process inside the Virtual Machine now starts to consume all CPU
> time there is nothing left for librbd which slows it down.
>
> This then causes a increased I/O-wait inside the Virtual Machine. Even
> though the VM is not performing a lot of disk I/O. The wait of the I/O
> goes up due to this.
>
>
> Is my theory sane?

Yes, I would say that your theory is sane. Have you looked into
libvirt's cgroup controls for limiting the emulator portion vs the
vCPUs [1]? I'd hope the librbd code and threads should be running in
the emulator cgroup (in a perfect world).
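
A rough sketch of what that could look like, using the parameter names
that virsh schedinfo reports (exact syntax may vary with the libvirt
version): cap only the vCPUs and leave the emulator unrestricted, e.g.

# virsh schedinfo <domain> --live vcpu_period=100000 vcpu_quota=50000 emulator_quota=-1

so the guest's CPU hogs get throttled while the emulator threads (which
should include librbd) keep their own CPU budget.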

> Can somebody confirm this?
>
> Thanks,
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[1] https://libvirt.org/cgroups.html

-- 
Jason


[ceph-users] hsbench 0.2 released

2019-08-22 Thread Mark Nelson

Hi Folks,

I've updated hsbench (new S3 benchmark) to 0.2.

Notable changes since 0.1:

- Can now output CSV results
- Can now output JSON results
- Fix for poor read performance with low thread counts
- New bucket listing benchmark with a new "mk" flag that lets you control the number of keys to fetch at once

You can get it here:

https://github.com/markhpc/hsbench

I've been doing quite a bit of testing of the RadosGW with hsbench this
week and so far it's doing exactly what I hoped it would do!

Mark



Re: [ceph-users] MDSs report damaged metadata

2019-08-22 Thread Robert LeBlanc
We just had metadata damage show up on our Jewel cluster. I tried a few
things like renaming directories and scanning, but the damage would just
show up again in less than 24 hours. I finally copied the directories with
the damage to a tmp location on CephFS, then swapped them with the damaged
ones. When I deleted the directories with the damage, the active MDS
crashed, but the replay took over just fine. I haven't seen the messages
now for almost a week.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Aug 19, 2019 at 10:30 PM Lars Täuber  wrote:

> Hi there!
>
> Does anyone else have an idea what I could do to get rid of this error?
>
> BTW: it is the third time that the pg 20.0 is gone inconsistent.
> This is a pg from the metadata pool (cephfs).
> May this be related anyhow?
>
> # ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data
> damage: 1 pg inconsistent
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsmds3(mds.0): Metadata damage detected
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 20.0 is active+clean+inconsistent, acting [9,27,15]
>
>
> Best regards,
> Lars
>
>
> Mon, 19 Aug 2019 13:51:59 +0200
> Lars Täuber  ==> Paul Emmerich  :
> > Hi Paul,
> >
> > thanks for the hint.
> >
> > I did a recursive scrub from "/". The log says there where some inodes
> with bad backtraces repaired. But the error remains.
> > May this have something to do with a deleted file? Or a file within a
> snapshot?
> >
> > The path told by
> >
> > # ceph tell mds.mds3 damage ls
> > 2019-08-19 13:43:04.608 7f563f7f6700  0 client.894552 ms_handle_reset on
> v2:192.168.16.23:6800/176704036
> > 2019-08-19 13:43:04.624 7f56407f8700  0 client.894558 ms_handle_reset on
> v2:192.168.16.23:6800/176704036
> > [
> > {
> > "damage_type": "backtrace",
> > "id": 3760765989,
> > "ino": 1099518115802,
> > "path": "~mds0/stray7/15161f7/dovecot.index.backup"
> > }
> > ]
> >
> > starts a bit strange to me.
> >
> > Are the snapshots also repaired with a recursive repair operation?
> >
> > Thanks
> > Lars
> >
> >
> > Mon, 19 Aug 2019 13:30:53 +0200
> > Paul Emmerich  ==> Lars Täuber 
> :
> > > Hi,
> > >
> > > that error just says that the path is wrong. I unfortunately don't
> > > know the correct way to instruct it to scrub a stray path off the top
> > > of my head; you can always run a recursive scrub on / to go over
> > > everything, though
> > >
> > >
> > > Paul
> > >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] Increase pg_num while backfilling

2019-08-22 Thread Lukáš Kubín
Hello,
yesterday I added a 4th OSD node (going from 39 to 52 OSDs) to our Jewel
cluster. Backfilling of the remapped PGs is still running and it seems it
will take another day to complete.

I know the pg_num of the largest pool is undersized and I should increase
it from 512 to 2048.

The question is - should I wait for backfilling to complete or can I still
increase pg_num (and pgp_num) while backfilling is running?
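
(For concreteness, the change I have in mind is the usual pair of
commands, with the pool name as a placeholder:

# ceph osd pool set <pool> pg_num 2048
# ceph osd pool set <pool> pgp_num 2048

so the question is only about the timing, not about how to do it.)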

Thanks and greetings,

Lukas


Re: [ceph-users] Theory: High I/O-wait inside VM with RBD due to CPU throttling

2019-08-22 Thread Wido den Hollander



On 8/22/19 3:59 PM, Jason Dillaman wrote:
> On Thu, Aug 22, 2019 at 9:23 AM Wido den Hollander  wrote:
>>
>> Hi,
>>
>> In a couple of situations I have encountered that Virtual Machines
>> running on RBD had a high I/O-wait, nearly 100%, on their vdX (VirtIO)
>> or sdX (Virtio-SCSI) devices while they were performing CPU intensive tasks.
>>
>> These servers would be running a very CPU intensive application while
>> *not* doing that many disk I/O.
>>
>> I however noticed that the I/O-wait of the disk(s) in the VM went up to
>> 100%.
>>
>> This VM is CPU limited by Libvirt by putting that KVM process in it's
>> own cgroup with a CPU limitation.
>>
>> Now, my theory is:
>>
>> KVM (qemu-kvm) is completely userspace and librbd runs inside qemu-kvm
>> as a library. All threads for disk I/O are part of the same PID and thus
>> part of that cgroup.
>>
>> If a process inside the Virtual Machine now starts to consume all CPU
>> time there is nothing left for librbd which slows it down.
>>
>> This then causes a increased I/O-wait inside the Virtual Machine. Even
>> though the VM is not performing a lot of disk I/O. The wait of the I/O
>> goes up due to this.
>>
>>
>> Is my theory sane?
> 
> Yes, I would say that your theory is sane. Have you looked into
> libvirt's cgroup controls for limiting the emulator portion vs the
> vCPUs [1]? I'd hope the librbd code and threads should be running in
> the emulator cgroup (in a perfect world).
> 

I checked with 'virsh schedinfo X' and this is the output I got:

Scheduler  : posix
cpu_shares : 1000
vcpu_period: 10
vcpu_quota : -1
emulator_period: 10
emulator_quota : -1
global_period  : 10
global_quota   : -1
iothread_period: 10
iothread_quota : -1


How can we confirm if the librbd code runs inside the Emulator part?

Wido

>> Can somebody confirm this?
>>
>> Thanks,
>>
>> Wido
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> [1] https://libvirt.org/cgroups.html
> 


Re: [ceph-users] Theory: High I/O-wait inside VM with RBD due to CPU throttling

2019-08-22 Thread Jason Dillaman
On Thu, Aug 22, 2019 at 11:29 AM Wido den Hollander  wrote:
>
>
>
> On 8/22/19 3:59 PM, Jason Dillaman wrote:
> > On Thu, Aug 22, 2019 at 9:23 AM Wido den Hollander  wrote:
> >>
> >> Hi,
> >>
> >> In a couple of situations I have encountered that Virtual Machines
> >> running on RBD had a high I/O-wait, nearly 100%, on their vdX (VirtIO)
> >> or sdX (Virtio-SCSI) devices while they were performing CPU intensive 
> >> tasks.
> >>
> >> These servers would be running a very CPU intensive application while
> >> *not* doing that many disk I/O.
> >>
> >> I however noticed that the I/O-wait of the disk(s) in the VM went up to
> >> 100%.
> >>
> >> This VM is CPU limited by Libvirt by putting that KVM process in it's
> >> own cgroup with a CPU limitation.
> >>
> >> Now, my theory is:
> >>
> >> KVM (qemu-kvm) is completely userspace and librbd runs inside qemu-kvm
> >> as a library. All threads for disk I/O are part of the same PID and thus
> >> part of that cgroup.
> >>
> >> If a process inside the Virtual Machine now starts to consume all CPU
> >> time there is nothing left for librbd which slows it down.
> >>
> >> This then causes a increased I/O-wait inside the Virtual Machine. Even
> >> though the VM is not performing a lot of disk I/O. The wait of the I/O
> >> goes up due to this.
> >>
> >>
> >> Is my theory sane?
> >
> > Yes, I would say that your theory is sane. Have you looked into
> > libvirt's cgroup controls for limiting the emulator portion vs the
> > vCPUs [1]? I'd hope the librbd code and threads should be running in
> > the emulator cgroup (in a perfect world).
> >
>
> I checked with 'virsh schedinfo X' and this is the output I got:
>
> Scheduler  : posix
> cpu_shares : 1000
> vcpu_period: 10
> vcpu_quota : -1
> emulator_period: 10
> emulator_quota : -1
> global_period  : 10
> global_quota   : -1
> iothread_period: 10
> iothread_quota : -1
>
>
> How can we confirm if the librbd code runs inside the Emulator part?

You can look under the /proc/<pid>/task/<tid>/ directories. The "comm"
file has the thread-friendly name. If it's a librbd / librados thread you
will see things like the following (taken from an 'rbd bench-write'
process):

$ cat */comm
rbd
log
service
admin_socket
msgr-worker-0
msgr-worker-1
msgr-worker-2
rbd
ms_dispatch
ms_local
safe_timer
fn_anonymous
safe_timer
safe_timer
fn-radosclient
tp_librbd
safe_timer
safe_timer
taskfin_librbd
signal_handler

Those directories also have "cgroup" files which will indicate which
cgroup the thread is currently living under. For example, the
"tp_librbd" thread is running under the following cgroups in my
environment:

11:blkio:/
10:hugetlb:/
9:freezer:/
8:net_cls,net_prio:/
7:memory:/user.slice/user-1000.slice/user@1000.service
6:cpu,cpuacct:/
5:devices:/user.slice
4:perf_event:/
3:cpuset:/
2:pids:/user.slice/user-1000.slice/user@1000.service
1:name=systemd:/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
0::/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
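
If it helps, a quick way to dump this for every thread of the qemu-kvm
process in one go (substitute the real PID; this is plain grep, nothing
Ceph-specific) is:

grep -H . /proc/<pid>/task/*/comm
grep -H . /proc/<pid>/task/*/cgroup

which makes it easy to match each named thread (tp_librbd, msgr-worker-*,
the vCPU threads, ...) with the cgroup it is currently running in.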


> Wido
>
> >> Can somebody confirm this?
> >>
> >> Thanks,
> >>
> >> Wido
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > [1] https://libvirt.org/cgroups.html
> >



-- 
Jason


[ceph-users] Watch a RADOS object for changes, specifically iscsi gateway.conf object

2019-08-22 Thread Wesley Dillingham
I am interested in keeping a revision history of ceph-iscsi's gateway.conf
object for any and all changes. It seems to me this may come in handy to
revert the environment to a previous state. My question is: are there any
existing tools which do something similar, or could someone suggest
libraries or code examples (ideally Python) which would further me in this
goal?

The RADOS man page mentions a "listwatchers" operation, so I believe this
is achievable. Thanks in advance.
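
(A quick way to see the current watchers from the CLI appears to be
"rados -p rbd listwatchers gateway.conf" -- the pool name is an
assumption, adjust it to wherever your gateway.conf object lives.)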

Respectfully,

Wes Dillingham
wdilling...@godaddy.com
Site Reliability Engineer IV - Platform Storage / Ceph



Re: [ceph-users] Watch a RADOS object for changes, specifically iscsi gateway.conf object

2019-08-22 Thread Lenz Grimmer
On 8/22/19 9:38 PM, Wesley Dillingham wrote:

> I am interested in keeping a revision history of ceph-iscsi's
> gateway.conf object for any and all changes. It seems to me this may
> come in handy to revert the environment to a previous state. My question
> is are there any existing tools which do similar or could someone please
> suggest, if they exist, libraries or code examples (ideally python)
> which may further me in this goal. 
> 
> The RADOS man page has the ability to "*listwatchers" so I believe this
> is **achievable**. Thanks in advance. *

This is how ceph-iscsi seems to be doing it, by polling the config
object's epoch:

https://github.com/ceph/ceph-iscsi/blob/master/rbd-target-api.py#L2673

NFS Ganesha (C code) seems to be using a watch:

https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/config_parsing/conf_url_rados.c#L375

Not sure if that interface is available via the Python bindings as well
- this might be more effective than polling...
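
In case it is useful, below is a minimal polling sketch in Python along
those lines. It only uses python-rados calls that exist today (Rados,
open_ioctx, stat, read); the pool name "rbd", the object name and the
30-second interval are assumptions to adjust for your environment, and a
watch/notify based version would avoid the polling if the bindings
support it.

#!/usr/bin/env python3
# Polling sketch: save a local revision of the ceph-iscsi gateway.conf
# object every time its content changes. Pool/object names are assumptions.
import hashlib
import time

import rados

POOL = 'rbd'              # pool holding gateway.conf (adjust if needed)
OBJ = 'gateway.conf'
INTERVAL = 30             # seconds between polls

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

last_digest = None
try:
    while True:
        size, _mtime = ioctx.stat(OBJ)    # raises ObjectNotFound if absent
        data = ioctx.read(OBJ, size)      # read the whole object
        digest = hashlib.sha256(data).hexdigest()
        if digest != last_digest:
            fname = '%s.%d' % (OBJ, int(time.time()))
            with open(fname, 'wb') as f:  # keep a local revision copy
                f.write(data)
            print('saved revision %s (%s)' % (fname, digest[:12]))
            last_digest = digest
        time.sleep(INTERVAL)
finally:
    ioctx.close()
    cluster.shutdown()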

Lenz

-- 
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)





[ceph-users] Fwd: radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-22 Thread Eric Ivancich
Thank you for providing the profiling data, Vladimir. There are 5078 threads 
and most of them are waiting. Here is a list of the deepest call of each thread 
with duplicates removed.

+ 100.00% epoll_wait
  + 100.00% get_obj_data::flush(rgw::OwningList&&)
+ 100.00% poll
+ 100.00% poll
  + 100.00% poll
+ 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
  + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
+ 100.00% pthread_cond_wait@@GLIBC_2.3.2
  + 100.00% pthread_cond_wait@@GLIBC_2.3.2
  + 100.00% read
+ 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

The only interesting ones are the second and last:

* get_obj_data::flush(rgw::OwningList&&)
* _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

They are essentially part of the same call stack that results from processing
a GetObj request, and five threads are in this call stack (the only difference
is whether or not they include the call into the boost intrusive list). Here's
the full call stack of those threads:

+ 100.00% clone
  + 100.00% start_thread
    + 100.00% worker_thread
      + 100.00% process_new_connection
        + 100.00% handle_request
          + 100.00% RGWCivetWebFrontend::process(mg_connection*)
            + 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
              + 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
                + 100.00% RGWGetObj::execute()
                  + 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
                    + 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
                      + 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
                        + 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
                          + 100.00% get_obj_data::flush(rgw::OwningList&&)
                            + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

So this isn’t background processing but request processing. I’m not clear why 
these requests are consuming so much CPU for so long.

From your initial message:
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
> radosgw process on those machines starts consuming 100% of 5 CPU cores for 
> days at a time, even though the machine is not being used for data transfers 
> (nothing in radosgw logs, couple of KB/s of network).
> 
> This situation can affect any number of our rados gateways, lasts from few 
> hours to few days and stops if radosgw process is restarted or on its own.


I’m going to check with others who’re more familiar with this code path.

> Begin forwarded message:
> 
> From: Vladimir Brik 
> Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is 
> being transferred
> Date: August 21, 2019 at 4:47:01 PM EDT
> To: "J. Eric Ivancich" , Mark Nelson 
> , ceph-users@lists.ceph.com
> 
> > Are you running multisite?
> No
> 
> > Do you have dynamic bucket resharding turned on?
> Yes. "radosgw-admin reshard list" prints "[]"
> 
> > Are you using lifecycle?
> I am not sure. How can I check? "radosgw-admin lc list" says "[]"
> 
> > And just to be clear -- sometimes all 3 of your rados gateways are
> > simultaneously in this state?
> Multiple, but I have not seen all 3 being in this state simultaneously. 
> Currently one gateway has 1 thread using 100% of CPU, and another has 5 
> threads each using 100% CPU.
> 
> Here are the fruits of my attempts to capture the call graph using perf and 
> gdbpmp:
> https://icecube.wisc.edu/~vbrik/perf.data
> https://icecube.wisc.edu/~vbrik/gdbpmp.data
> 
> These are the commands that I ran and their outputs (note I couldn't get perf 
> not to generate the warning):
> rgw-3 gdbpmp # ./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
> Attaching to process 73688...Done.
> Gathering 
> Samples
> Profiling complete with 100 samples.
> 
> rgw-3 ~ # perf record --call-graph fp -p 73688 -- sleep 10
> [ perf 

[ceph-users] Balancer dont work with state pgs backfill_toofull

2019-08-22 Thread EDH - Manuel Rios Fernandez
The affected root has more than 70 TB free. The only solution is to
manually reweight the OSD. But in this situation the balancer in upmap
mode should be moving data so that everything becomes HEALTHY again.

I hope some fix comes in the next 14.2.x release to address this issue.
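
For reference, we check the balancer with "ceph balancer status", and the
manual workaround boils down to something like "ceph osd reweight osd.86
0.95" for the fullest OSDs (the id and weight here are just an example).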

 

Ceph 14.2.2 Centos 7.6

  cluster:
    id:     e1ee8086-7cce-43fd-a252-3d677af22428
    health: HEALTH_ERR
            2 nearfull osd(s)
            2 pool(s) nearfull
            Degraded data redundancy (low space): 9 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum CEPH-MON01,CEPH002,CEPH003 (age 19m)
    mgr: CEPH001(active, since 12h), standbys: CEPH-MON01
    osd: 90 osds: 90 up (since 4h), 90 in (since 34h); 9 remapped pgs
    rgw: 1 daemon active (ceph-rgw03)

  data:
    pools:   18 pools, 8252 pgs
    objects: 105.59M objects, 294 TiB
    usage:   340 TiB used, 84 TiB / 424 TiB avail
    pgs:     197930/116420259 objects misplaced (0.170%)
             8243 active+clean
             9    active+remapped+backfill_toofull

 

ID  CLASS    WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-41          392.89362        - 393 TiB 317 TiB 317 TiB  19 MiB 561 GiB  75 TiB     0    0   -        root archive
-44          130.95793        - 131 TiB 106 TiB 106 TiB 608 KiB 188 GiB  25 TiB 81.29 1.01   -        host CEPH005
 83 archive   10.91309  1.0  11 TiB 9.2 TiB 9.1 TiB  56 KiB  16 GiB 1.7 TiB 83.97 1.05 131     up osd.83
 84 archive   10.91309  1.0  11 TiB 8.5 TiB 8.5 TiB  56 KiB  15 GiB 2.4 TiB 78.33 0.98 121     up osd.84
 85 archive   10.91309  1.0  11 TiB 9.0 TiB 9.0 TiB  92 KiB  16 GiB 1.9 TiB 82.24 1.03 129     up osd.85
 86 archive   10.91309  1.0  11 TiB  10 TiB  10 TiB  48 KiB  18 GiB 597 GiB 94.66 1.18 148     up osd.86
 87 archive   10.91399  1.0  11 TiB  10 TiB  10 TiB  88 KiB  18 GiB 596 GiB 94.66 1.18 149     up osd.87
 88 archive   10.91309  1.0  11 TiB 8.3 TiB 8.3 TiB  56 KiB  14 GiB 2.6 TiB 76.02 0.95 119     up osd.88
 97 archive   10.91309  1.0  11 TiB 7.7 TiB 7.7 TiB  44 KiB  14 GiB 3.2 TiB 70.44 0.88 109     up osd.97
 98 archive   10.91309  1.0  11 TiB 8.4 TiB 8.4 TiB  72 KiB  15 GiB 2.5 TiB 77.18 0.96 126     up osd.98
 99 archive   10.91309  1.0  11 TiB 8.0 TiB 8.0 TiB  24 KiB  15 GiB 2.9 TiB 73.29 0.91 116     up osd.99
100 archive   10.91309  1.0  11 TiB 9.5 TiB 9.4 TiB  32 KiB  17 GiB 1.4 TiB 86.72 1.08 132     up osd.100
101 archive   10.91309  1.0  11 TiB 8.4 TiB 8.4 TiB  12 KiB  15 GiB 2.5 TiB 76.71 0.96 125     up osd.101
102 archive   10.91309  1.0  11 TiB 8.9 TiB 8.8 TiB  28 KiB  15 GiB 2.1 TiB 81.20 1.01 125     up osd.102
-17                  0        - 0 B     0 B     0 B     0 B     0 B     0 B         0    0   -        host CEPH006
-26          130.96783        - 131 TiB 104 TiB 104 TiB 6.7 MiB 184 GiB  27 TiB 79.21 0.99   -        host CEPH007
 14 archive   10.91399  1.0  11 TiB 8.3 TiB 8.3 TiB  28 KiB  15 GiB 2.6 TiB 76.06 0.95 126     up osd.14
 15 archive   10.91399  1.0  11 TiB 8.9 TiB 8.9 TiB  84 KiB  16 GiB 2.0 TiB 81.72 1.02 130     up osd.15
 16 archive   10.91399  1.0  11 TiB 8.7 TiB 8.7 TiB  80 KiB  15 GiB 2.2 TiB 79.98 1.00 127     up osd.16
 39 archive   10.91399  1.0  11 TiB 8.1 TiB 8.1 TiB 3.4 MiB  14 GiB 2.8 TiB 74.26 0.93 118     up osd.39
 40 archive   10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB  53 KiB  16 GiB 1.7 TiB 84.53 1.05 132     up osd.40
 44 archive   10.91399  1.0  11 TiB 8.1 TiB 8.1 TiB 2.6 MiB  15 GiB 2.8 TiB 74.40 0.93 117     up osd.44
 48 archive   10.91399  1.0  11 TiB 9.7 TiB 9.7 TiB  44 KiB  17 GiB 1.2 TiB 89.02 1.11 135     up osd.48
 49 archive   10.91399  1.0  11 TiB 8.6 TiB 8.6 TiB 132 KiB  15 GiB 2.3 TiB 78.90 0.98 126     up osd.49
 52 archive   10.91399  1.0  11 TiB 9.3 TiB 9.3 TiB  28 KiB  17 GiB 1.6 TiB 85.28 1.06 134     up osd.52
 77 archive   10.91399  1.0  11 TiB 8.1 TiB 8.1 TiB  73 KiB  15 GiB 2.8 TiB 74.44 0.93 118     up osd.77
 89 archive   10.91399  1.0  11 TiB 7.2 TiB 7.2 TiB  60 KiB  13 GiB 3.7 TiB 66.22 0.83 106     up osd.89
 90 archive   10.91399  1.0  11 TiB 9.4 TiB 9.3 TiB  48 KiB  16 GiB 1.6 TiB 85.68 1.07 137     up osd.90
-31          130.96783        - 131 TiB 107 TiB 107 TiB  12 MiB 189 GiB  24 TiB 81.86 1.02   -        host CEPH008
  5 archive   10.91399  1.0  11 TiB 9.6 TiB 9.6 TiB 2.7 MiB  17 GiB 1.3 TiB 87.81 1.10 135     up osd.5
  6 archive   10.91399  1.0  11 TiB 8.4 TiB 8.4 TiB 3.9 MiB  16 GiB 2.5 TiB 77.19 0.96 124     up osd.6
 11 archive   10.91399  1.0  11 TiB 8.9 TiB 8.8 TiB  48 KiB  16 GiB 2.1 TiB 81.11 1.01 128     up osd.11
 45 archive   10.91399  1.0  11 TiB 9.5 TiB 9.4 TiB  48 KiB  17 GiB 1.5 TiB 86.66 1.08 138     up osd.45

46 archive  10.91399  1.0  11