[ceph-users] rbd map hangs

2018-06-06 Thread Tracy Reed

Hello all! I'm running luminous with old-style non-bluestore OSDs. The
clients are still ceph 10.2.9, though; I haven't been able to upgrade those yet.

Occasionally, access to rbds hangs on the client, such as right now:
I tried to dd a VM image into a mapped rbd and it just hung.

Then I tried to map a new rbd and that hangs also.

How would I troubleshoot this? /var/log/ceph is empty, nothing in
/var/log/messages or dmesg etc.

I just discovered:

find /sys/kernel/debug/ceph -type f -print -exec cat {} \;

which produces (among other seemingly innocuous things, let me know if
anyone wants to see the rest):

osd2(unknown sockaddr family 0) 0%(doesn't exist) 100%

which seems suspicious.
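For what it's worth, the per-mount debugfs files are usually the quickest place to see where a request is stuck. A rough sketch (the exact directory name under /sys/kernel/debug/ceph/ varies per client instance):

    cat /sys/kernel/debug/ceph/*/osdc     # in-flight OSD requests; entries that never clear are the hung ops
    cat /sys/kernel/debug/ceph/*/monc     # monitor session state
    cat /sys/kernel/debug/ceph/*/osdmap   # the client's copy of the osdmap; the odd osd2 line above looks like it comes from here

If osdc shows old requests aimed at that osd2 entry with the bogus address, the kernel client probably has a stale osdmap or a dead session to that OSD.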

rbd ls works reliably. As does create.  Cluster is healthy. 

But the processes which hung trying to access that mapped rbd appear to
be completely unkillable. What else should I check?

Thanks!


-- 
Tracy Reed
http://tracyreed.org
Digital signature attached for your safety.




Re: [ceph-users] How to throttle operations like "rbd rm"

2018-06-06 Thread Yao Guotao
Hi Jason,


Thank you very much for your reply.
I think the RBD trash is a good approach, but QoS in Ceph would be a better solution.

I am looking forward to backend QoS in Ceph.



Thanks.
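For anyone who wants the manual version of that deferred-deletion idea today, the Luminous rbd CLI already has the trash commands. A sketch (pool and image names are made up):

    rbd trash mv volumes/volume-1234      # detach the image now, delete it later
    rbd trash ls volumes                  # list deferred images and their ids
    rbd trash rm volumes/<image-id>       # purge when the cluster is quiet

Note this only defers the expensive object deletes; it does not throttle them once the purge runs.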


At 2018-06-06 21:53:23, "Jason Dillaman"  wrote:
>The 'rbd_concurrent_management_ops' setting controls how many
>concurrent, in-flight RADOS object delete operations are possible per
>image removal. The default is only 10, so given ten images being
>deleted concurrently, I am actually surprised that it blocked all IO from
>your VMs.
>
>Adding support for limiting the maximum number of concurrent image
>deletions would definitely be an OpenStack enhancement. There is an
>open blueprint for optionally utilizing the RBD trash instead of
>having Cinder delete the images [1], which would allow you to defer
>deletions to whenever is convenient. Additionally, once Ceph adds
>support for backend QoS (fingers crossed in Nautilus), we can change
>librbd to flag all IO for maintenance activities to background (best
>effort) priority, which might be the best long-term solution.
>
>[1] https://blueprints.launchpad.net/cinder/+spec/rbd-deferred-volume-deletion
>
>On Wed, Jun 6, 2018 at 6:40 AM, Yao Guotao  wrote:
>> Hi Cephers,
>>
>> We use Ceph with Openstack by librbd library.
>>
>> Last week, my colleague deleted 10 volumes from the Openstack dashboard at the
>> same time; each volume had about 1T used.
>> During this time, the disks of the OSDs were busy, and there was no I/O for
>> normal VMs.
>>
>> So, I want to know if there are any parameters that can be set to throttle this?
>>
>> I find a parameter about rbd ops: 'rbd_concurrent_management_ops'.
>> I am trying to figure out how it works in the code, and I find the parameter can
>> only control the asynchronous deletion of all objects of an image.
>>
>> Besides, Should it be controlled at Openstack Nova or Cinder layer?
>>
>> Thanks,
>> Yao Guotao
>>
>>
>>
>>
>>
>
>
>
>-- 
>Jason


[ceph-users] mimic cephfs snapshot in active/standby mds env

2018-06-06 Thread Brady Deetz
I've seen several mentions of stable snapshots in Mimic for cephfs in
multi-active mds environments. I'm currently running active/standby in
12.2.5 with no snapshots. If I upgrade to Mimic, is there any concern with
snapshots in an active/standby MDS environment? It seems like a silly
question since it is considered stable for multi-mds, but better safe than
sorry.
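For what it's worth, snapshots stay disabled after the upgrade until they are explicitly enabled per filesystem, e.g. something like the following (the filesystem name 'cephfs' is just an example):

    ceph fs set cephfs allow_new_snaps true

so the upgrade itself should not change snapshot behaviour until that switch is flipped.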


Re: [ceph-users] QEMU maps RBD but can't read them

2018-06-06 Thread Jason Dillaman
On Wed, Jun 6, 2018 at 4:48 PM, Wladimir Mutel  wrote:
> Jason Dillaman wrote:
>
 The caps for those users looks correct for Luminous and later
 clusters. Any chance you are using data pools with the images? It's
 just odd that you have enough permissions to open the RBD image but
 cannot read its data objects.
>
>
>>>  Yes, I use erasure-pool as data-pool for these images
>>>  (to save on replication overhead).
>>>  Should I add it to the [osd] profile list ?
>
>
>> Indeed, that's the problem since the libvirt and/or iso user doesn't
>> have access to the data-pool.
>
>
> This really helped, thanks !
>
> client.iso
> key: AQBp...gA==
> caps: [mon] profile rbd
> caps: [osd] profile rbd pool=iso, profile rbd pool=jerasure21
> client.libvirt
> key: AQBt...IA==
> caps: [mon] profile rbd
> caps: [osd] profile rbd pool=libvirt, profile rbd pool=jerasure21
>
> Now I can boot the VM from the .iso image and install Windows.
>
> One more question, how should I set profile 'rbd-read-only' properly?
> I tried to set it for 'client.iso' on both 'iso' and 'jerasure21' pools,
> and this did not work. Setting the profile on both pools to 'rbd' worked. But I
> don't want my iso images to be accidentally modified by virtual guests. Can
> this be solved with Ceph auth, or in some other way? (In fact, I'm looking for
> the Ceph equivalent of 'chattr +i'.)
>

QEMU doesn't currently handle the case for opening RBD images in
read-only mode, so if you attempt to use 'profile rbd-read-only', I
suspect attempting to open the image will fail. You could perhaps take
a middle ground and just apply 'profile rbd-read-only pool=jerasure21'
to protect the contents of the image.
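A sketch of what that middle-ground cap could look like, using the pool names from this thread (untested here):

    ceph auth caps client.iso \
        mon 'profile rbd' \
        osd 'profile rbd pool=iso, profile rbd-read-only pool=jerasure21'

i.e. keep normal rbd access on the pool holding the image headers so the image can still be opened, but restrict the data pool to read-only.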

-- 
Jason


Re: [ceph-users] pg inconsistent, scrub stat mismatch on bytes

2018-06-06 Thread Adrian
Update to this.

The affected pg didn't seem inconsistent:

[root@admin-ceph1-qh2 ~]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
   pg 6.20 is active+clean+inconsistent, acting [114,26,44]
[root@admin-ceph1-qh2 ~]# rados list-inconsistent-obj 6.20
--format=json-pretty
{
   "epoch": 210034,
   "inconsistents": []
}

Although pg query showed the primary info.stats.stat_sum.num_bytes differed
from the peers

A pg repair on 6.20 seems to have resolved the issue for now but the
info.stats.stat_sum.num_bytes still differs so presumably will become
inconsistent again next time it scrubs.
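For the record, the check and the repair were roughly the following; the jq paths are from memory, so treat this as a sketch:

    ceph pg 6.20 query | jq '.info.stats.stat_sum.num_bytes, .peer_info[].stats.stat_sum.num_bytes'
    ceph pg repair 6.20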

Adrian.

On Tue, Jun 5, 2018 at 12:09 PM, Adrian  wrote:

> Hi Cephers,
>
> We recently upgraded one of our clusters from hammer to jewel and then to
> luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 osd's). After some
> deep-scubs we have an inconsistent pg with a log message we've not seen
> before:
>
> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 6.20 is active+clean+inconsistent, acting [114,26,44]
>
>
> Ceph log shows
>
> 2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395 : 
> cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87 clones, 
> 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
> 25952454144/25952462336 bytes, 0/0 hit_set_archive bytes.
> 2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396 : 
> cluster [ERR] 6.20 scrub 1 errors
> 2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41298 
> : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
> 2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41299 
> : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent 
> (PG_DAMAGED)
> 2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41345 
> : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg 
> inconsistent
>
> There are no EC pools - looks like it may be the same as
> https://tracker.ceph.com/issues/22656 although as in #7 this is not a
> cache pool.
>
> Wondering if this is ok to issue a pg repair on 6.20 or if there's
> something else we should be looking at first ?
>
> Thanks in advance,
> Adrian.
>
> ---
> Adrian : aussie...@gmail.com
> If violence doesn't solve your problem, you're not using enough of it.
>



-- 
---
Adrian : aussie...@gmail.com
If violence doesn't solve your problem, you're not using enough of it.


[ceph-users] Openstack VMs with Ceph EC pools

2018-06-06 Thread Pardhiv Karri
Hi,

Is anyone using Openstack with Ceph Erasure Coding pools, now that RBD
supports them in Luminous? If so, how's the performance?
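For context, the Luminous way to put RBD data on EC is a replicated pool for the image metadata plus an EC data-pool. A rough sketch with made-up pool names (allow_ec_overwrites also requires BlueStore OSDs):

    ceph osd pool create rbd_ec 128 128 erasure
    ceph osd pool set rbd_ec allow_ec_overwrites true
    rbd create rbd_meta/myimage --size 100G --data-pool rbd_ec

so only the data objects land on the EC pool while the image headers stay replicated.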

Thanks,
Pardhiv Karri


Re: [ceph-users] Ceph Developer Monthly - June 2018

2018-06-06 Thread Leonardo Vaz
On Fri, Jun 1, 2018 at 4:56 PM, Leonardo Vaz  wrote:
> Hey Cephers,
>
> This is just a friendly reminder that the next Ceph Developer Monthly
> meeting is coming up:
>
>  http://wiki.ceph.com/Planning
>
> If you have work that you're doing that is a feature, significant
> backports, or anything you would like to discuss with the core team,
> please add it to the following page:
>
>  http://wiki.ceph.com/CDM_06-JUN-2018
>
> This edition happens on NA/EMEAC friendly hours (12:30 EST) and we
> will use the following Bluejeans URL for the video conference:
>
>  https://redhat.bluejeans.com/376400604

Hey Cephers,

In case you missed the Ceph Developer Monthly meeting earlier today,
the video recording has been uploaded to our YouTube Channel:

  https://youtu.be/ghxzvJ51nFQ

Kindest regards,

Leo

> The meeting details are also available on Ceph Community Calendar:
>
>  
> https://calendar.google.com/calendar/b/1?cid=OXRzOWM3bHQ3dTF2aWMyaWp2dnFxbGZwbzBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
>
> If you have questions or comments, please let us know.
>
> Kindest regards,
>
> Leo
>
> --
> Leonardo Vaz
> Ceph Community Manager
> Open Source and Standards Team



-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team


Re: [ceph-users] QEMU maps RBD but can't read them

2018-06-06 Thread Wladimir Mutel

Jason Dillaman wrote:


The caps for those users looks correct for Luminous and later
clusters. Any chance you are using data pools with the images? It's
just odd that you have enough permissions to open the RBD image but
cannot read its data objects.



 Yes, I use erasure-pool as data-pool for these images
 (to save on replication overhead).
 Should I add it to the [osd] profile list ?



Indeed, that's the problem since the libvirt and/or iso user doesn't
have access to the data-pool.


This really helped, thanks !

client.iso
key: AQBp...gA==
caps: [mon] profile rbd
caps: [osd] profile rbd pool=iso, profile rbd pool=jerasure21
client.libvirt
key: AQBt...IA==
caps: [mon] profile rbd
caps: [osd] profile rbd pool=libvirt, profile rbd pool=jerasure21

Now I can boot the VM from the .iso image and install Windows.

	One more question, how should I set profile 'rbd-read-only' properly?
I tried to set it for 'client.iso' on both 'iso' and 'jerasure21' pools,
and this did not work. Setting the profile on both pools to 'rbd' worked.
But I don't want my iso images to be accidentally modified by virtual
guests. Can this be solved with Ceph auth, or in some other way? (In
fact, I'm looking for the Ceph equivalent of 'chattr +i'.)



Re: [ceph-users] CephFS/ceph-fuse performance

2018-06-06 Thread Gregory Farnum
>
> On 06/06/2018 12:22 PM, Andras Pataki wrote:
> > Hi Greg,
> >
> > The docs say that client_cache_size is the number of inodes that are
> > cached, not bytes of data.  Is that incorrect?
>

Oh whoops, you're correct of course. Sorry about that!

On Wed, Jun 6, 2018 at 12:33 PM Andras Pataki 
wrote:

> Staring at the logs a bit more it seems like the following lines might
> be the clue:
>
> 2018-06-06 08:14:17.615359 7fffefa45700 10 objectcacher trim  start:
> bytes: max 2147483640  clean 2145935360, objects: max 8192 current 8192
> 2018-06-06 08:14:17.615361 7fffefa45700 10 objectcacher trim finish:
> max 2147483640  clean 2145935360, objects: max 8192 current 8192
>
> The object caching could not free objects up to cache new ones perhaps
> (it was caching 8192 objects which is the maximum in the config)?  Not
> sure why that would be though.  Unfortunately the job since then
> terminated so I can't look at the caches any longer of the client.
>

Yeah, that's got to be why. I don't *think* there's any reason to set a
reachable limit on number of objects. It may not be able to free them if
they're still dirty and haven't been flushed; that ought to be the only
reason. Or maybe you've discovered some bug in the caching code,
but...well, it's not super likely.
-Greg


Re: [ceph-users] CephFS/ceph-fuse performance

2018-06-06 Thread Andras Pataki
Staring at the logs a bit more it seems like the following lines might 
be the clue:


2018-06-06 08:14:17.615359 7fffefa45700 10 objectcacher trim  start: 
bytes: max 2147483640  clean 2145935360, objects: max 8192 current 8192
2018-06-06 08:14:17.615361 7fffefa45700 10 objectcacher trim finish:  
max 2147483640  clean 2145935360, objects: max 8192 current 8192


The object caching could not free objects up to cache new ones perhaps 
(it was caching 8192 objects which is the maximum in the config)?  Not 
sure why that would be though.  Unfortunately the job since then 
terminated so I can't look at the caches any longer of the client.
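For reference, the ceph-fuse knobs involved here seem to be the following; the values are illustrative, not recommendations:

    [client]
    client_cache_size     = 16384         # inodes cached (metadata), not bytes; 16384 is the default as far as I know
    client_oc_size        = 2147483640    # object-cacher data cache in bytes (~2 GB per the trim lines above)
    client_oc_max_objects = 16384         # the trim loop above is pinned at the 8192 cap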


Andras


On 06/06/2018 12:22 PM, Andras Pataki wrote:

Hi Greg,

The docs say that client_cache_size is the number of inodes that are 
cached, not bytes of data.  Is that incorrect?


Andras


On 06/06/2018 11:25 AM, Gregory Farnum wrote:

On Wed, Jun 6, 2018 at 5:52 AM, Andras Pataki
 wrote:
We're using CephFS with Luminous 12.2.5 and the fuse client (on 
CentOS 7.4,

kernel 3.10.0-693.5.2.el7.x86_64).  Performance has been very good
generally, but we're currently running into some strange performance 
issues

with one of our applications.  The client in this case is on a higher
latency link - it is about 2.5ms away from all the ceph server nodes 
(all
ceph server nodes are near each other on 10/40Ggbps local ethernet, 
only the

client is "away").

The application is reading contiguous data at 64k chunks, the strace 
(-tt -T

flags) looks something like:

06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 
65536) =

65536 <0.024052>
06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 
65536) =

65536 <0.023990>
06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 
65536) =

65536 <0.024053>
06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"..., 
65536) =

65536 <0.024574>
06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"..., 
65536) =

65536 <0.023795>
06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 
65536) =

65536 <0.023614>
06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 
65536) =

65536 <0.024037>


so each 64k read takes about 23-24ms.  The client has the file open for
read, the machine is not busy (load of 0.2), neither are the ceph 
nodes.

The fuse client seems pretty idle also.

Increasing the log level to 20 for 'client' and 'objectcacher' on 
ceph-fuse,
it looks like ceph-fuse gets ll_read requests of 4k in size, and it 
looks
like it does an async read from the OSDs in 4k chunks (if I'm 
interpreting

the logs right).  Here is a trace of one ll_read:

2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read
0x5556dadfc1a0 0x1000d092e5f  238173646848~4096
2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 
objects

6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) have
pAsLsXsFsxcrwb need Fr want Fc revoking -
2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661 _read_async
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 
objects
6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) 
238173646848~4096

2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
extent(1000d092e5f.ddd1 (56785) in @6 94208~4096 -> [0,4096])
2018-06-06 08:14:17.609532 7fffe7a35700 10
objectcacher.object(1000d092e5f.ddd1/head) map_read 
1000d092e5f.ddd1

94208~4096
2018-06-06 08:14:17.609535 7fffe7a35700 20
objectcacher.object(1000d092e5f.ddd1/head) map_read miss 4096 
left, bh[

0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on bh[
0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
outstanding reads 0
2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx missed,
waiting on bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] 
waiters

= {} off 94208
2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
0x55570211ec00
2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661 
get_cap_ref got

first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 
objects

6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680)
2018-06-06 

[ceph-users] Prioritize recovery over backfilling

2018-06-06 Thread Caspar Smit
Hi all,

We have a Luminous 12.2.2 cluster with 3 nodes and I recently added a node
to it.

osd-max-backfills is at the default 1 so backfilling didn't go very fast
but that doesn't matter.

Once it started backfilling everything looked ok:

~300 pgs in backfill_wait
~10 pgs backfilling (~number of new osd's)

But I noticed the degraded objects increasing a lot. I presume a pg that is
in backfill_wait state doesn't accept any new writes anymore? Hence the
increasing degraded objects?

So far so good, but once in a while I noticed a random OSD flapping (they come
back up automatically). This isn't because the disk is saturated, but because of a
driver/controller/kernel incompatibility which 'hangs' the disk for a short
time (scsi abort_task error in syslog). Investigating further, I noticed
this was already the case before the node expansion.

These flapping OSDs result in lots of pg states which are a bit worrying:

 109 active+remapped+backfill_wait
 80  active+undersized+degraded+remapped+backfill_wait
 51  active+recovery_wait+degraded+remapped
 41  active+recovery_wait+degraded
 27  active+recovery_wait+undersized+degraded+remapped
 14  active+undersized+remapped+backfill_wait
 4   active+undersized+degraded+remapped+backfilling

I think the recovery_wait is more important than the backfill_wait, so I'd
like to prioritize these, because the recovery_wait was triggered by the
flapping OSDs.

Furthermore, the undersized ones should get absolute priority, or is that
already the case?

I was thinking about setting "nobackfill" to prioritize recovery instead of
backfilling.
Would that help in this situation? Or am I making it even worse, then?

ps. I tried increasing the heartbeat values for the OSDs to no avail; they
still get flagged as down once in a while after a hiccup of the driver.

I've injected the following settings into all OSDs and MONs:

osd heartbeat interval  18 (default = 6)
osd heartbeat grace 60 (default = 20)
osd mon heartbeat interval 60 (default = 30)

Am I adjusting the right settings or are there any other settings to
increase the heartbeat grace?

Do these settings need a restart of the daemons or is injecting sufficient?
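For what it's worth, a rough sketch of both approaches; option and command names are as I remember them for Luminous, so double-check before use:

    ceph osd set nobackfill        # pause backfill so the recovery_wait PGs get the I/O budget
    ceph osd unset nobackfill      # once recovery has drained

    ceph tell osd.* injectargs '--osd_heartbeat_grace 60 --osd_heartbeat_interval 18'
    # (mon.* can be targeted the same way, as you already did)

injectargs applies at runtime, though it may warn that some options are not observed until restart. Luminous also has 'ceph pg force-recovery <pgid>' to push individual PGs ahead of backfill.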

ps2. The drives which are flapping are Seagate Enterprise Capacity 10TB
SATA 7k2 disks with model number ST1NM0086. Are these drives
notorious for this behaviour? Does anyone have experience with these drives in a
CEPH environment?

Kind regards,
Caspar Smit


Re: [ceph-users] QEMU maps RBD but can't read them

2018-06-06 Thread Jason Dillaman
On Wed, Jun 6, 2018 at 3:02 PM, Wladimir Mutel  wrote:
> Jason Dillaman wrote:
>>
>> The caps for those users looks correct for Luminous and later
>> clusters. Any chance you are using data pools with the images? It's
>> just odd that you have enough permissions to open the RBD image but
>> cannot read its data objects.
>
>
> Yes, I use erasure-pool as data-pool for these images
> (to save on replication overhead).
> Should I add it to the [osd] profile list ?

Indeed, that's the problem since the libvirt and/or iso user doesn't
have access to the data-pool.
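In command form, adding the data pool to the caps would look something like this (a sketch, using the pool names from this thread):

    ceph auth caps client.libvirt \
        mon 'profile rbd' \
        osd 'profile rbd pool=libvirt, profile rbd pool=jerasure21'

and the same for client.iso with pool=iso.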




-- 
Jason


Re: [ceph-users] QEMU maps RBD but can't read them

2018-06-06 Thread Wladimir Mutel

Jason Dillaman wrote:

The caps for those users looks correct for Luminous and later
clusters. Any chance you are using data pools with the images? It's
just odd that you have enough permissions to open the RBD image but
cannot read its data objects.


Yes, I use erasure-pool as data-pool for these images
(to save on replication overhead).
Should I add it to the [osd] profile list ?


Re: [ceph-users] QEMU maps RBD but can't read them

2018-06-06 Thread Wladimir Mutel

Jason Dillaman wrote:

Can you run "rbd --id libvirt --pool libvirt win206-test-3tb " w/o error? It sounds like your CephX caps for
client.libvirt are not permitting read access to the image data
objects.


I tried to run 'rbd export' with these params,
but it said it was unable to find a keyring.
Is a keyring file mandatory for every client?

'ceph auth ls' shows these accounts with seemingly-proper
permissions :

client.iso
key: AQBp...gA==
caps: [mon] profile rbd
caps: [osd] profile rbd pool=iso
client.libvirt
key: AQBt...IA==
caps: [mon] profile rbd
caps: [osd] profile rbd pool=libvirt

And these same keys are listed in /etc/libvirt/secrets :

/etc/libvirt/secrets# ls | while read a ; do echo $a : $(cat $a) ; done
ac1d8d7b-d243-4474-841d-91c26fd93a14.base64 : AQBt...IA==

ac1d8d7b-d243-4474-841d-91c26fd93a14.xml : <secret ephemeral='no' private='yes'> <uuid>ac1d8d7b-d243-4474-841d-91c26fd93a14</uuid> <description>CEPH passphrase example</description> <usage type='ceph'> <name>ceph_example</name> </usage> </secret>

cf00c7e4-740a-4935-9d7c-223d3c81871f.base64 : AQBp...gA==

cf00c7e4-740a-4935-9d7c-223d3c81871f.xml : <secret ephemeral='no' private='yes'> <uuid>cf00c7e4-740a-4935-9d7c-223d3c81871f</uuid> <description>CEPH ISO pool</description> <usage type='ceph'> <name>ceph_iso</name> </usage> </secret>


I just thought this should be enough. no ?


Re: [ceph-users] Stop scrubbing

2018-06-06 Thread Joe Comeau
When I am upgrading from filestore to bluestore,
or doing any other server maintenance for a short time
(i.e. high I/O while rebuilding):
 
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
 
when finished
 

ceph osd unset noscrub
ceph osd unset nodeep-scrub

ceph osd unset noout
 
again only while working on a server/cluster for a short time


>>> Alexandru Cucu  6/6/2018 1:51 AM >>>
Hi,

The only way I know is pretty brutal: list all the PGs with a
scrubbing process, get the primary OSD and mark it as down. The
scrubbing process will stop.
Make sure you set the noout, norebalance and norecovery flags so you
don't add even more load to your cluster.
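Roughly, and treating this as a sketch rather than a recipe:

    ceph osd set noout; ceph osd set norebalance; ceph osd set norecovery
    ceph pg dump pgs_brief 2>/dev/null | grep scrubbing   # find the PGs still scrubbing and their primary OSD
    ceph osd down 114                                     # OSD id is just an example; its scrubs abort as it rejoins

and unset the three flags once you are done.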

On Tue, Jun 5, 2018 at 11:41 PM Marc Roos  wrote:
>
>
> Is it possible to stop the current running scrubs/deep-scrubs?
>
> http://tracker.ceph.com/issues/11202
>
>
>
>
>
>
>


Re: [ceph-users] QEMU maps RBD but can't read them

2018-06-06 Thread Jason Dillaman
Can you run "rbd --id libvirt --pool libvirt win206-test-3tb " w/o error? It sounds like your CephX caps for
client.libvirt are not permitting read access to the image data
objects.

On Wed, Jun 6, 2018 at 2:18 PM, Wladimir Mutel  wrote:
>
> Dear all,
>
> I installed QEMU, libvirtd and its RBD plugins and now trying
> to make QEMU use my Ceph storage. I created 'iso' pool
> and imported Windows installation image there (rbd import).
> Also I created 'libvirt' pool and there, created 2.7-TB image
> for Windows installation. I created client.iso and
> client.libvirt accounts for Ceph authentication,
> and configured their secrets for pool access in virsh
> (as told in http://docs.ceph.com/docs/master/rbd/libvirt/ ).
> Then I started pools and checked that I can list their contents
> from virsh. Then I created a VM with dummy HDD and optical
> drive, and edited them using 'virsh edit' :
>
> [The <disk> XML was stripped by the list archive; it defined two
> network/RBD disks: the 2.7-TB system disk from the 'libvirt' pool, and an
> RBD cdrom with source
> name='iso/SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO'.]
>
> Now I see this in the systemd journalctl :
>
> чер 06 16:24:12 p10s qemu-system-x86_64[4907]: 2018-06-06 16:24:12.147
> 7f40f37fe700 -1 librbd::io::ObjectRequest: 0x7f40d4010500
> handle_read_object: failed to read from object: (1) Operation not permitted
>
> What should I check and where ?
> I can map the same RBD using rbd-nbd and read sectors
> from the mapped device. If I map using kernel RBD driver
> (I know this is not recommended to do on the same host),
> I get :
>
> чер 06 16:27:54 p10s kernel: rbd: image
> SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO:
> image uses unsupported features: 0x38
>
> and
>
> RBD image feature set mismatch. You can disable features unsupported by the
> kernel with "rbd feature disable
> iso/SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO
> object-map fast-diff deep-flatten".
>
> Probably I need to change some attributes for the RBD
> to be usable with QEMU. Please give some hints.
> Thank you in advance.
>



-- 
Jason


[ceph-users] QEMU maps RBD but can't read them

2018-06-06 Thread Wladimir Mutel


Dear all,

I installed QEMU, libvirtd and its RBD plugins and now trying
to make QEMU use my Ceph storage. I created 'iso' pool
and imported Windows installation image there (rbd import).
Also I created 'libvirt' pool and there, created 2.7-TB image
for Windows installation. I created client.iso and
client.libvirt accounts for Ceph authentication,
and configured their secrets for pool access in virsh
(as told in http://docs.ceph.com/docs/master/rbd/libvirt/ ).
Then I started pools and checked that I can list their contents
from virsh. Then I created a VM with dummy HDD and optical
drive, and edited them using 'virsh edit' :

[The <disk> XML was stripped by the list archive; it defined two network/RBD
disks: the 2.7-TB system disk from the 'libvirt' pool, and an RBD cdrom with
source name='iso/SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO'.]

Now I see this in the systemd journalctl :

чер 06 16:24:12 p10s qemu-system-x86_64[4907]: 2018-06-06 16:24:12.147 
7f40f37fe700 -1 librbd::io::ObjectRequest: 0x7f40d4010500 
handle_read_object: failed to read from object: (1) Operation not permitted


What should I check and where ?
I can map the same RBD using rbd-nbd and read sectors
from the mapped device. If I map using kernel RBD driver
(I know this is not recommended to do on the same host),
I get :

чер 06 16:27:54 p10s kernel: rbd: image 
SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO: 
image uses unsupported features: 0x38


and

RBD image feature set mismatch. You can disable features unsupported by 
the kernel with "rbd feature disable 
iso/SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO 
object-map fast-diff deep-flatten".


Probably I need to change some attributes for the RBD
to be usable with QEMU. Please give some hints.
Thank you in advance.



[ceph-users] Reinstall everything

2018-06-06 Thread Max Cuttins

Hi everybody,

I would like to start from zero.
However, last time I ran the command to purge everything I ran into an issue.

I had a completely cleaned-up system as expected, but the disks were still
marked as OSDs and the new installation refused to overwrite disks in use.
The only way to make it work was to manually format the disks with fdisk
and zap them again with ceph later (see the sketch below).
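In command form, that cleanup was roughly the following; the device name is just an example, and this wipes the disk:

    sgdisk --zap-all /dev/sdb
    wipefs -a /dev/sdb
    ceph-volume lvm zap /dev/sdb --destroy   # on Luminous; older tooling used 'ceph-disk zap /dev/sdb'

after which a fresh deployment should accept the disk again.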


Is there something I should do before purging everything in order to
not run into a similar issue?


Thanks,
Max


Re: [ceph-users] CephFS/ceph-fuse performance

2018-06-06 Thread Andras Pataki

Hi Greg,

The docs say that client_cache_size is the number of inodes that are 
cached, not bytes of data.  Is that incorrect?


Andras


On 06/06/2018 11:25 AM, Gregory Farnum wrote:

On Wed, Jun 6, 2018 at 5:52 AM, Andras Pataki
 wrote:

We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 7.4,
kernel 3.10.0-693.5.2.el7.x86_64).  Performance has been very good
generally, but we're currently running into some strange performance issues
with one of our applications.  The client in this case is on a higher
latency link - it is about 2.5ms away from all the ceph server nodes (all
ceph server nodes are near each other on 10/40Ggbps local ethernet, only the
client is "away").

The application is reading contiguous data at 64k chunks, the strace (-tt -T
flags) looks something like:

06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 65536) =
65536 <0.024052>
06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 65536) =
65536 <0.023990>
06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 65536) =
65536 <0.024053>
06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"..., 65536) =
65536 <0.024574>
06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"..., 65536) =
65536 <0.023795>
06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 65536) =
65536 <0.023614>
06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 65536) =
65536 <0.024037>


so each 64k read takes about 23-24ms.  The client has the file open for
read, the machine is not busy (load of 0.2), neither are the ceph nodes.
The fuse client seems pretty idle also.

Increasing the log level to 20 for 'client' and 'objectcacher' on ceph-fuse,
it looks like ceph-fuse gets ll_read requests of 4k in size, and it looks
like it does an async read from the OSDs in 4k chunks (if I'm interpreting
the logs right).  Here is a trace of one ll_read:

2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read
0x5556dadfc1a0 0x1000d092e5f  238173646848~4096
2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) have
pAsLsXsFsxcrwb need Fr want Fc revoking -
2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661 _read_async
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) 238173646848~4096
2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
extent(1000d092e5f.ddd1 (56785) in @6 94208~4096 -> [0,4096])
2018-06-06 08:14:17.609532 7fffe7a35700 10
objectcacher.object(1000d092e5f.ddd1/head) map_read 1000d092e5f.ddd1
94208~4096
2018-06-06 08:14:17.609535 7fffe7a35700 20
objectcacher.object(1000d092e5f.ddd1/head) map_read miss 4096 left, bh[
0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on bh[
0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
outstanding reads 0
2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx missed,
waiting on bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters
= {} off 94208
2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
0x55570211ec00
2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661 get_cap_ref got
first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680)
2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on 0x5f138680
0x1000d092e5f.head now 4
2018-06-06 08:14:17.612318 7fffefa45700  7 objectcacher bh_read_finish
1000d092e5f.ddd1/head tid 29067611 94208~4096 (bl is 4096) returned 0
outstanding reads 1
2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh bh[
0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {
94208->[0x5557007383a0, ]}
2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher bh_read_finish read
bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46]
waiters = {}
2018-06-06 08:14:17.612344 7fffefa45700 10
objectcacher.object(1000d092e5f.ddd1/head) try_merge_bh bh[
0x5fecdd40 94208~4096 0x5556226235c0 (4096) v 0 

Re: [ceph-users] Reduced productivity because of slow requests

2018-06-06 Thread Grigory Murashov

Hello Jamie!

Do you think this spike is the cause of the problem, not a consequence?

Grigory Murashov

06.06.2018 16:57, Jamie Fargen wrote:

Is bond0.111 just for RGW traffic?
Is there a load balancer in front of your RGWs?

From the graph you linked to for bond0.111, it seems like you might 
just be having a spike in Rados Gateway traffic. You might want to dig 
into the logs at those times on your Load Balancer/RGWs to see if you 
can correlate what is generating the traffic.


Regards,
-Jamie Fargen

On Wed, Jun 6, 2018 at 7:57 AM, Grigory Murashov
<muras...@voximplant.com> wrote:


Hello cephers!

I have luminous 12.2.5 cluster of 3 nodes 5 OSDs each with S3 RGW.
All OSDs are HDD.

I often (about twice a day) have slow request problem which
reduces cluster efficiency. It can be started both in day peak and
night time. Doesn't matter.

That's what I have in ceph health detail

https://avatars.mds.yandex.net/get-pdb/234183/9ba023d0-4352-4235-8826-76b412016e9f/s1200



Top and iostat results on osd.21's node

https://avatars.mds.yandex.net/get-pdb/51720/52ef79c1-eb1a-450a-8c95-675077045b84/s1200




https://avatars.mds.yandex.net/get-pdb/51720/0d98131c-82d3-4274-a406-743490e1f966/s1200



In fact in reduces cluster's io operations for about an half an
hour twice a day

https://avatars.mds.yandex.net/get-pdb/222681/bed8f638-f259-403e-83cb-c7bfb30f14f1/s1200



That's normal io while status is OK

https://avatars.mds.yandex.net/get-pdb/245485/33ee3a53-083a-4656-b585-8df0007db2e2/s1200



That's how it affects on incoming traffic to RGW

https://avatars.mds.yandex.net/get-pdb/51720/5a486d30-0d44-46f0-8f0f-668a05947bc8/s1200



Since it starts in any time but twice a day and for fixed period
of time I assume it could be some recovery or rebalancing operations.

I tried to find smth out in osd logs but there are nothing about it.

Any thoughts how to avoid it?

Appreciate your help.

-- 
Grigory Murashov







--
Jamie Fargen
Consultant
jfar...@redhat.com 
813-817-4430




Re: [ceph-users] CephFS/ceph-fuse performance

2018-06-06 Thread Gregory Farnum
On Wed, Jun 6, 2018 at 5:52 AM, Andras Pataki
 wrote:
> We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 7.4,
> kernel 3.10.0-693.5.2.el7.x86_64).  Performance has been very good
> generally, but we're currently running into some strange performance issues
> with one of our applications.  The client in this case is on a higher
> latency link - it is about 2.5ms away from all the ceph server nodes (all
> ceph server nodes are near each other on 10/40Ggbps local ethernet, only the
> client is "away").
>
> The application is reading contiguous data at 64k chunks, the strace (-tt -T
> flags) looks something like:
>
> 06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 65536) =
> 65536 <0.024052>
> 06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 65536) =
> 65536 <0.023990>
> 06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 65536) =
> 65536 <0.024053>
> 06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"..., 65536) =
> 65536 <0.024574>
> 06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"..., 65536) =
> 65536 <0.023795>
> 06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 65536) =
> 65536 <0.023614>
> 06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 65536) =
> 65536 <0.024037>
>
>
> so each 64k read takes about 23-24ms.  The client has the file open for
> read, the machine is not busy (load of 0.2), neither are the ceph nodes.
> The fuse client seems pretty idle also.
>
> Increasing the log level to 20 for 'client' and 'objectcacher' on ceph-fuse,
> it looks like ceph-fuse gets ll_read requests of 4k in size, and it looks
> like it does an async read from the OSDs in 4k chunks (if I'm interpreting
> the logs right).  Here is a trace of one ll_read:
>
> 2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read
> 0x5556dadfc1a0 0x1000d092e5f  238173646848~4096
> 2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
> 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
> cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664
> size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
> caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
> 6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) have
> pAsLsXsFsxcrwb need Fr want Fc revoking -
> 2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661 _read_async
> 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
> cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
> size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
> caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
> 6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) 238173646848~4096
> 2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
> min_bytes=4194304 max_bytes=268435456 max_periods=64
> 2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
> extent(1000d092e5f.ddd1 (56785) in @6 94208~4096 -> [0,4096])
> 2018-06-06 08:14:17.609532 7fffe7a35700 10
> objectcacher.object(1000d092e5f.ddd1/head) map_read 1000d092e5f.ddd1
> 94208~4096
> 2018-06-06 08:14:17.609535 7fffe7a35700 20
> objectcacher.object(1000d092e5f.ddd1/head) map_read miss 4096 left, bh[
> 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
> 2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on bh[
> 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
> outstanding reads 0
> 2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx missed,
> waiting on bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters
> = {} off 94208
> 2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
> 0x55570211ec00
> 2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661 get_cap_ref got
> first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
> cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
> size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
> caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
> 6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680)
> 2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on 0x5f138680
> 0x1000d092e5f.head now 4
> 2018-06-06 08:14:17.612318 7fffefa45700  7 objectcacher bh_read_finish
> 1000d092e5f.ddd1/head tid 29067611 94208~4096 (bl is 4096) returned 0
> outstanding reads 1
> 2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh bh[
> 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {
> 94208->[0x5557007383a0, ]}
> 2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher bh_read_finish read
> bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46]
> waiters = {}
> 2018-06-06 08:14:17.612344 7fffefa45700 10
> objectcacher.object(1000d092e5f.ddd1/head) try_merge_bh bh[
> 0x5fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46]

Re: [ceph-users] Reduced productivity because of slow requests

2018-06-06 Thread Jamie Fargen
Is bond0.111 just for RGW traffic?
Is there a load balancer in front of your RGWs?

From the graph you linked to for bond0.111, it seems like you might just be
having a spike in Rados Gateway traffic.  You might want to dig into the
logs at those times on your Load Balancer/RGWs to see if you can correlate
what is generating the traffic.

Regards,
-Jamie Fargen

On Wed, Jun 6, 2018 at 7:57 AM, Grigory Murashov 
wrote:

> Hello cephers!
>
> I have luminous 12.2.5 cluster of 3 nodes 5 OSDs each with S3 RGW. All
> OSDs are HDD.
>
> I often (about twice a day) have slow request problem which reduces
> cluster efficiency. It can be started both in day peak and night time.
> Doesn't matter.
>
> That's what I have in ceph health detail https://avatars.mds.yandex.net
> /get-pdb/234183/9ba023d0-4352-4235-8826-76b412016e9f/s1200
>
> Top and iostat results on osd.21's node
> https://avatars.mds.yandex.net/get-pdb/51720/52ef79c1-eb1a-
> 450a-8c95-675077045b84/s1200
>
> https://avatars.mds.yandex.net/get-pdb/51720/0d98131c-82d3-
> 4274-a406-743490e1f966/s1200
>
> In fact in reduces cluster's io operations for about an half an hour twice
> a day
> https://avatars.mds.yandex.net/get-pdb/222681/bed8f638-f259-
> 403e-83cb-c7bfb30f14f1/s1200
>
> That's normal io while status is OK
> https://avatars.mds.yandex.net/get-pdb/245485/33ee3a53-083a-
> 4656-b585-8df0007db2e2/s1200
>
> That's how it affects on incoming traffic to RGW
> https://avatars.mds.yandex.net/get-pdb/51720/5a486d30-0d44-
> 46f0-8f0f-668a05947bc8/s1200
>
> Since it starts in any time but twice a day and for fixed period of time I
> assume it could be some recovery or rebalancing operations.
>
> I tried to find smth out in osd logs but there are nothing about it.
>
> Any thoughts how to avoid it?
>
> Appreciate your help.
>
> --
> Grigory Murashov
>
>



-- 
Jamie Fargen
Consultant
jfar...@redhat.com
813-817-4430


Re: [ceph-users] Update to Mimic with prior Snapshots leads to MDS damaged metadata

2018-06-06 Thread Yan, Zheng
On Wed, Jun 6, 2018 at 3:25 PM, Tobias Florek  wrote:
> Hi,
>
> I upgraded a ceph cluster to mimic yesterday according to the release
> notes. Specifically I did stop all standby MDS and then restarted the
> only active MDS with the new version.
>
> The cluster was installed with luminous. Its cephfs volume had snapshots
> prior to the update, but only one active MDS.
>
> The post-installation steps failed though:
>  ceph daemon mds.<id> scrub_path /
> returned an error, which I corrected with
>  ceph daemon mds.<id> scrub_path / repair
>
> While
>  ceph daemon mds.<id> scrub_path '~mdsdir'
> did not show any error.
>

The correct commands should be:

ceph daemon mds.<id> scrub_path / force recursive repair
ceph daemon mds.<id> scrub_path '~mdsdir' force recursive repair


>
> After some time, ceph health reported MDS damaged metadata:
>> ceph tell mds.<id> damage ls | jq '.[].damage_type' | sort | uniq -c
> 398 "backtrace"
> 718 "dentry"
>
> Examples of damage:
>
> {
>   "damage_type": "dentry",
>   "id": 118195760,
>   "ino": 1099513350198,
>   "frag": "000100*",
>   "dname":
> "1524578400.M820820P705532.dovecot-15-hgjlx,S=425674,W=431250:2,RS",
>   "snap_id": "head",
>   "path":
> "/path/to/mails/user/Maildir/.Trash/cur/1524578400.M820820P705532.dovecot-15-hgjlx,S=425674,W=431250:2,RS"
> },
> {
>   "damage_type": "backtrace",
>   "id": 121083841,
>   "ino": 1099515215027,
>   "path":
> "/path/to/mails/other_user/Maildir/.Junk/cur/1528189963.M416032P698926.dovecot-15-xmpkh,S=4010,W=4100:2,Sab"
> },
>

'ceph daemon mds.<id> scrub_path / force recursive repair'
should also fix the above errors.

Regards
Yan, Zheng


>
> Directories with damage can still be listed by the kernel cephfs mount
> (4.16.7), but not the fuse mount, which stalls.
>
>
> Can anyone help? That's unfortunately a production cluster.
>
> Regards,
>  Tobias Florek


[ceph-users] CephFS/ceph-fuse performance

2018-06-06 Thread Andras Pataki
We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 
7.4, kernel 3.10.0-693.5.2.el7.x86_64).  Performance has been very good 
generally, but we're currently running into some strange performance 
issues with one of our applications.  The client in this case is on a 
higher latency link - it is about 2.5ms away from all the ceph server 
nodes (all ceph server nodes are near each other on 10/40Ggbps local 
ethernet, only the client is "away").


The application is reading contiguous data at 64k chunks, the strace 
(-tt -T flags) looks something like:


   06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"...,
   65536) = 65536 <0.024052>
   06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"...,
   65536) = 65536 <0.023990>
   06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"...,
   65536) = 65536 <0.024053>
   06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"...,
   65536) = 65536 <0.024574>
   06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"...,
   65536) = 65536 <0.023795>
   06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."...,
   65536) = 65536 <0.023614>
   06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"...,
   65536) = 65536 <0.024037>


so each 64k read takes about 23-24ms.  The client has the file open for 
read, the machine is not busy (load of 0.2), neither are the ceph 
nodes.  The fuse client seems pretty idle also.


Increasing the log level to 20 for 'client' and 'objectcacher' on 
ceph-fuse, it looks like ceph-fuse gets ll_read requests of 4k in size, 
and it looks like it does an async read from the OSDs in 4k chunks (if 
I'm interpreting the logs right).  Here is a trace of one ll_read:


   2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read
   0x5556dadfc1a0 0x1000d092e5f 238173646848~4096
   2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
   0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x5f187680 0x5f138680) have pAsLsXsFsxcrwb need Fr
   want Fc revoking -
   2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661
   _read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x5f187680 0x5f138680) 238173646848~4096
   2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
   min_bytes=4194304 max_bytes=268435456 max_periods=64
   2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
   extent(1000d092e5f.ddd1 (56785) in @6 94208~4096 -> [0,4096])
   2018-06-06 08:14:17.609532 7fffe7a35700 10
   objectcacher.object(1000d092e5f.ddd1/head) map_read
   1000d092e5f.ddd1 94208~4096
   2018-06-06 08:14:17.609535 7fffe7a35700 20
   objectcacher.object(1000d092e5f.ddd1/head) map_read miss 4096
   left, bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing]
   waiters = {}
   2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on
   bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing]
   waiters = {} outstanding reads 0
   2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx
   missed, waiting on bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0)
   v 0 rx] waiters = {} off 94208
   2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
   0x55570211ec00
   2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661
   get_cap_ref got first FILE_CACHE ref on
   0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x5f187680 0x5f138680)
   2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on
   0x5f138680 0x1000d092e5f.head now 4
   2018-06-06 08:14:17.612318 7fffefa45700  7 objectcacher
   bh_read_finish 1000d092e5f.ddd1/head tid 29067611 94208~4096 (bl
   is 4096) returned 0 outstanding reads 1
   2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh
   bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {
   94208->[0x5557007383a0, ]}
   2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher
   bh_read_finish read bh[ 0x5fecdd40 94208~4096 0x5556226235c0
   (4096) v 0 clean firstbyte=46] waiters = {}
   2018-06-06 08:14:17.612344 7fffefa45700 10
   objectcacher.object(1000d092e5f.ddd1/head) try_merge_bh bh[
   0x5fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean
   

Re: [ceph-users] Reduced productivity because of slow requests

2018-06-06 Thread Piotr Dałek

On 18-06-06 01:57 PM, Grigory Murashov wrote:

Hello cephers!

I have luminous 12.2.5 cluster of 3 nodes 5 OSDs each with S3 RGW. All OSDs 
are HDD.


I often (about twice a day) have slow request problem which reduces cluster 
efficiency. It can be started both in day peak and night time. Doesn't matter.


That's what I have in ceph health detail 
https://avatars.mds.yandex.net/get-pdb/234183/9ba023d0-4352-4235-8826-76b412016e9f/s1200 
[..]
Since it starts in any time but twice a day and for fixed period of time I 
assume it could be some recovery or rebalancing operations.


I tried to find smth out in osd logs but there are nothing about it.

Any thoughts how to avoid it?


Have you tried disabling scrub and deep scrub?
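i.e. roughly:

    ceph osd set noscrub
    ceph osd set nodeep-scrub

then watch whether the twice-daily slowdowns stop, and unset both flags afterwards so scrubbing does not stay off for good.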

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com


[ceph-users] Reduced productivity because of slow requests

2018-06-06 Thread Grigory Murashov

Hello cephers!

I have luminous 12.2.5 cluster of 3 nodes 5 OSDs each with S3 RGW. All 
OSDs are HDD.


I often (about twice a day) have a slow request problem which reduces
cluster efficiency. It can start both at day peak and at night time; it
doesn't seem to matter.


That's what I have in ceph health detail 
https://avatars.mds.yandex.net/get-pdb/234183/9ba023d0-4352-4235-8826-76b412016e9f/s1200


Top and iostat results on osd.21's node
https://avatars.mds.yandex.net/get-pdb/51720/52ef79c1-eb1a-450a-8c95-675077045b84/s1200

https://avatars.mds.yandex.net/get-pdb/51720/0d98131c-82d3-4274-a406-743490e1f966/s1200

In fact it reduces the cluster's io operations for about half an hour
twice a day

https://avatars.mds.yandex.net/get-pdb/222681/bed8f638-f259-403e-83cb-c7bfb30f14f1/s1200

That's normal io while status is OK
https://avatars.mds.yandex.net/get-pdb/245485/33ee3a53-083a-4656-b585-8df0007db2e2/s1200

That's how it affects incoming traffic to RGW:
https://avatars.mds.yandex.net/get-pdb/51720/5a486d30-0d44-46f0-8f0f-668a05947bc8/s1200


Since it starts at any time, but twice a day and for a fixed period of time,
I assume it could be some recovery or rebalancing operations.


I tried to find something out in the osd logs but there is nothing about it.

Any thoughts on how to avoid it?

Appreciate your help.

--
Grigory Murashov



[ceph-users] How to throttle operations like "rbd rm"

2018-06-06 Thread Yao Guotao
Hi Cephers,


We use Ceph with Openstack by librbd library.



Last week, my colleague deleted 10 volumes from the Openstack dashboard at the same
time; each volume had about 1T used.
During this time, the disks of the OSDs were busy, and there was no I/O for normal
VMs.


So, I want to know if there are any parameters that can be set to throttle this?


I find a parameter about rbd ops: 'rbd_concurrent_management_ops'.

I am trying to figure out how it works in the code, and I find the parameter can
only control the asynchronous deletion of all objects of an image.
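If it helps, lowering it is just a client-side setting. A sketch, with a made-up image name:

    [client]
    rbd_concurrent_management_ops = 5

    # or per invocation, since the ceph CLIs accept config options as flags:
    rbd rm volumes/volume-1234 --rbd-concurrent-management-ops 5

but as you note it only throttles the object deletes within one image removal, not how many images are removed at once.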



Besides, should it be controlled at the Openstack Nova or Cinder layer?


Thanks,
Yao Guotao


Re: [ceph-users] Stop scrubbing

2018-06-06 Thread Alexandru Cucu
Hi,

The only way I know is pretty brutal: list all the PGs with a
scrubbing process, get the primary OSD and mark it as down. The
scrubbing process will stop.
Make sure you set the noout, norebalance and norecovery flags so you
don't add even more load to your cluster.

On Tue, Jun 5, 2018 at 11:41 PM Marc Roos  wrote:
>
>
> Is it possible to stop the current running scrubs/deep-scrubs?
>
> http://tracker.ceph.com/issues/11202
>
>
>
>
>
>
>


Re: [ceph-users] Jewel/Luminous Filestore/Bluestore for a new cluster

2018-06-06 Thread Simon Ironside

On 05/06/18 01:14, Subhachandra Chandra wrote:

We have not observed any major issues. We have had occasional OSD daemon 
crashes due to an assert which is a known bug but the cluster recovered 
without any intervention each time. All the nodes have been rebooted 2-3 
times due to CoreOS updates and no issues with that either.


Hi Subhachandra,

Thanks for your answer, it's this sort of stuff that worries me. My 
Filestore OSD daemons on Hammer & Jewel don't crash at all so this 
sounds like a regression and I should wait before deploying 
Luminous/Bluestore.


Simon


[ceph-users] Problem with S3 policy (grant RW access)

2018-06-06 Thread Valéry Tschopp
Hello,

We have a problem with a R/W policy on a bucket.

If the bucket owner grants read/write access to another user, the objects
created by the grantee are not accessible by the owner (see below)!?

Why can the owner of a bucket not access objects created by a grantee?

Is it a bug?

## Setup

- radosgw 12.2.5, with OpenStack Keystone integration
- PROJECT_A owner of bucket A
- PROJECT_B with R/W access to bucket A

With the OpenStack Keystone integration the radosgw user ID is the OpenStack 
project ID. Users are only member of a project.

## S3 Policy

The S3 bucket policy `projectB_read-write.json` grant R/W access to PROJECT_B:

{
  "Version": "2012-10-17",
  "Id": "read-write",
  "Statement": [
{
  "Sid": "projectB-read_write",
  "Effect": "Allow",
  "Principal": {
"AWS": [
  "arn:aws:iam::PROJECT_B_ID:root"
]
  },
  "Action": [
"s3:ListBucket",
"s3:PutObject",
"s3:DeleteObject",
"s3:GetObject"
  ],
  "Resource": [
"arn:aws:s3:::*"
  ]
}
  ]
}

## Example of the problem

Owner (PROJECT_A) creates bucket and set policy:

$ s3cmd -c s3cfg-projectA mb s3://test
$ s3cmd -c s3cfg-projectA setpolicy projectB_read-write.json s3://test

Grantee (PROJECT_B) uploads an object into the bucket:

$ s3cmd -c s3cfg-projectB put example.data s3://test
upload: 'example.data' -> 's3://test/example.data'  [part 1 of 2, 15MB] [1 
of 1]
 15728640 of 15728640   100% in1s14.99 MB/s  done
upload: 'example.data' -> 's3://test/example.data'   [part 2 of 2, 479kB] 
[1 of 1]
 491466 of 491466   100% in0s 2.99 MB/s  done

Owner (PROJECT_A) tries to download the object uploaded by grantee (PROJECT_B):

$ s3cmd -c s3cfg-projectA get s3://test/example.data
download: 's3://test/example.data' -> './example.data'  [1 of 1]
ERROR: S3 error: 403 (AccessDenied)

## Possible workaround

If we add the bucket owner (PROJECT_A) in the policy too, then he will be able 
to access objects created by the grantee (PROJECT_B):

"Principal": {
  "AWS": [
"arn:aws:iam::PROJECT_A_ID:root",
"arn:aws:iam::PROJECT_B_ID:root"
  ]
},


-- 
SWITCH
Valéry Tschopp, Software Engineer
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
Email: valery.tsch...@switch.ch Phone: +41 44 268 1544
https://www.switch.ch/ 



[ceph-users] Update to Mimic with prior Snapshots leads to MDS damaged metadata

2018-06-06 Thread Tobias Florek
Hi,

I upgraded a ceph cluster to mimic yesterday according to the release
notes. Specifically I did stop all standby MDS and then restarted the
only active MDS with the new version.

The cluster was installed with luminous. Its cephfs volume had snapshots
prior to the update, but only one active MDS.

The post-installation steps failed though:
 ceph daemon mds.<id> scrub_path /
returned an error, which I corrected with
 ceph daemon mds.<id> scrub_path / repair

While
 ceph daemon mds.<id> scrub_path '~mdsdir'
did not show any error.


After some time, ceph health reported MDS damaged metadata:
> ceph tell mds.<id> damage ls | jq '.[].damage_type' | sort | uniq -c
398 "backtrace"
718 "dentry"

Examples of damage:

{
  "damage_type": "dentry",
  "id": 118195760,
  "ino": 1099513350198,
  "frag": "000100*",
  "dname":
"1524578400.M820820P705532.dovecot-15-hgjlx,S=425674,W=431250:2,RS",
  "snap_id": "head",
  "path":
"/path/to/mails/user/Maildir/.Trash/cur/1524578400.M820820P705532.dovecot-15-hgjlx,S=425674,W=431250:2,RS"
},
{
  "damage_type": "backtrace",
  "id": 121083841,
  "ino": 1099515215027,
  "path":
"/path/to/mails/other_user/Maildir/.Junk/cur/1528189963.M416032P698926.dovecot-15-xmpkh,S=4010,W=4100:2,Sab"
},


Directories with damage can still be listed by the kernel cephfs mount
(4.16.7), but not the fuse mount, which stalls.


Can anyone help? That's unfortunately a production cluster.

Regards,
 Tobias Florek


[ceph-users] Adding additional disks to the production cluster without performance impacts on the existing

2018-06-06 Thread John Molefe
Hi everyone

We have completed all phases and the only remaining part is just adding the
disks to the current cluster, but I am afraid of impacting performance as it is
in production.
Any guidance or advice on how this can be achieved with the least impact on
production?

Thanks in advance
John


Vrywaringsklousule / Disclaimer: 
http://www.nwu.ac.za/it/gov-man/disclaimer.html 