Re: [ceph-users] Group-based permissions issue when using ACLs on CephFS

2018-03-23 Thread Josh Haft
On Fri, Mar 23, 2018 at 8:49 PM, Yan, Zheng  wrote:

> On Fri, Mar 23, 2018 at 9:50 PM, Josh Haft  wrote:
> > On Fri, Mar 23, 2018 at 12:14 AM, Yan, Zheng  wrote:
> >>
> >> On Fri, Mar 23, 2018 at 5:14 AM, Josh Haft  wrote:
> >> > Hello!
> >> >
> >> > I'm running Ceph 12.2.2 with one primary and one standby MDS. Mounting
> >> > CephFS via ceph-fuse (to leverage quotas), and enabled ACLs by adding
> >> > fuse_default_permissions=0 and client_acl_type=posix_acl to the mount
> >> > options. I then export this mount via NFS and the clients mount
> NFS4.1.
> >> >
> >> does fuse_default_permissions=0 work?
> >
> > Yes, ACLs work as expected when I set fuse_default_permissions=0.
> >
> >> > After doing some in-depth testing it seems I'm unable to allow access
> from
> >> > the NFS clients to a directory/file based on group membership when the
> >> > underlying CephFS was mounted with ACL support. This issue appears
> using
> >> > both filesystem permissions (e.g. chgrp) and NFSv4 ACLs. However,
> ACLs do
> >> > work if the principal is a user instead of a group. If I disable ACL
> support
> >> > on the ceph-fuse mount, things work as expected using fs permissions;
> >> > obviously I don't get ACL support.
> >> >
> >> > As an intermediate step I did check whether this works directly on the
> >> > CephFS filesystem - on the NFS server - and it does. So it appears to
> be an
> >> > issue re-exporting it via NFS.
> >> >
> >> > I do not see this issue when mounting CephFS via the kernel,
> exporting via
> >> > NFS, and re-running these tests.
> >> >
> >> > I searched the ML and bug reports but only found this -
> >> > http://tracker.ceph.com/issues/12617 - which seems close to the
> issue I'm
> >> > running into, but was closed as resolved 2+ years ago.
> >> >
> >> > Has anyone else run into this? Am I missing something obvious?
> >> >
> >>
> >> ceph-fuse does its permission check according to the local host's
> >> supplementary group configuration. That's why you see this behavior.
> >
> > You're saying both the NFS client and server (where ceph-fuse is
> > running) need to use the same directory backend? (they are)
> > I should have mentioned I'm using LDAP/AD on client and server, so I
> > don't think that is the problem.
> >
> > Either way, I would not expect the behavior to change simply by
> > enabling ACLs, especially when I'm using filesystem permissions, and
> > ACLs aren't part of the equation.
>
> More specifically, ceph-fuse finds which groups the request initiator is
> in via the function fuse_req_getgroups(). This function does tricks on
> "/proc/%lu/task/%lu/status", so it only works when the NFS client and
> ceph-fuse are running on the same machine.
>
> So why does this work when I'm using ceph-fuse but ACLs are disabled?

>
> >> Yan, Zheng
> >>
> >> > Thanks!
> >> > Josh
> >> >
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enable object map kernel module

2018-03-23 Thread Konstantin Shalygin

How can we deal with that? I have seen some comments that large images without
an object map can be very slow to delete.


The only way for now is to use rbd-nbd or rbd-fuse.
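
For reference, a minimal sketch of the rbd-nbd route (assuming the rbd-nbd
package is installed; rbd-nbd goes through librbd, so object-map and the other
librbd-only features keep working):

  # map the image through librbd instead of the kernel client
  rbd-nbd map rbd/test          # prints the device, e.g. /dev/nbd0
  mkfs.xfs /dev/nbd0            # or mount an existing filesystem
  mount /dev/nbd0 /mnt/test
  ...
  umount /mnt/test
  rbd-nbd unmap /dev/nbd0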



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shell / curl test script for rgw

2018-03-23 Thread Konstantin Shalygin


On 03/24/2018 07:22 AM, Marc Roos wrote:
  
Thanks! I got it working, although I had to change the date to "date -R

-u", because I got the "RequestTimeTooSkewed" error.

I also had to enable buckets=read on the account that was already able
to read and write via Cyberduck; I don't get that.

radosgw-admin caps add --uid='test$test1' --caps "buckets=read"



Please post your version, because I also had to adjust the date for the
same reason ("RequestTimeTooSkewed").



k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Group-based permissions issue when using ACLs on CephFS

2018-03-23 Thread Yan, Zheng
On Fri, Mar 23, 2018 at 9:50 PM, Josh Haft  wrote:
> On Fri, Mar 23, 2018 at 12:14 AM, Yan, Zheng  wrote:
>>
>> On Fri, Mar 23, 2018 at 5:14 AM, Josh Haft  wrote:
>> > Hello!
>> >
>> > I'm running Ceph 12.2.2 with one primary and one standby MDS. Mounting
>> > CephFS via ceph-fuse (to leverage quotas), and enabled ACLs by adding
>> > fuse_default_permissions=0 and client_acl_type=posix_acl to the mount
>> > options. I then export this mount via NFS and the clients mount NFS4.1.
>> >
>> does fuse_default_permissions=0 work?
>
> Yes, ACLs work as expected when I set fuse_default_permissions=0.
>
>> > After doing some in-depth testing it seems I'm unable to allow access from
>> > the NFS clients to a directory/file based on group membership when the
>> > underlying CephFS was mounted with ACL support. This issue appears using
>> > both filesystem permissions (e.g. chgrp) and NFSv4 ACLs. However, ACLs do
>> > work if the principal is a user instead of a group. If I disable ACL 
>> > support
>> > on the ceph-fuse mount, things work as expected using fs permissions;
>> > obviously I don't get ACL support.
>> >
>> > As an intermediate step I did check whether this works directly on the
>> > CephFS filesystem - on the NFS server - and it does. So it appears to be an
>> > issue re-exporting it via NFS.
>> >
>> > I do not see this issue when mounting CephFS via the kernel, exporting via
>> > NFS, and re-running these tests.
>> >
>> > I searched the ML and bug reports but only found this -
>> > http://tracker.ceph.com/issues/12617 - which seems close to the issue I'm
>> > running into, but was closed as resolved 2+ years ago.
>> >
>> > Has anyone else run into this? Am I missing something obvious?
>> >
>>
>> ceph-fuse does its permission check according to the local host's
>> supplementary group configuration. That's why you see this behavior.
>
> You're saying both the NFS client and server (where ceph-fuse is
> running) need to use the same directory backend? (they are)
> I should have mentioned I'm using LDAP/AD on client and server, so I
> don't think that is the problem.
>
> Either way, I would not expect the behavior to change simply by
> enabling ACLs, especially when I'm using filesystem permissions, and
> ACLs aren't part of the equation.

More specifically, ceph-fuse finds which groups the request initiator is
in via the function fuse_req_getgroups(). This function does tricks on
"/proc/%lu/task/%lu/status", so it only works when the NFS client and
ceph-fuse are running on the same machine.
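
Just to illustrate what that /proc lookup can and cannot see (a local shell
example, not Ceph code): the Groups: line only exists for processes running on
this host, so for a request that really originates on a remote NFS client
there is nothing to look up.

  pid=$$                                   # any local PID
  grep '^Groups:' "/proc/${pid}/status"    # supplementary group IDs of that PID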


>> Yan, Zheng
>>
>> > Thanks!
>> > Josh
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to persist configuration about enabled mgr plugins in Luminous 12.2.4

2018-03-23 Thread Gregory Farnum
I believe this popped up recently and is a container bug. It’s forcibly
resetting the modules to run on every start.
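
Until that is fixed, a workaround sketch (module names taken from the mgr.yml
below; already-enabled modules will just report as such):

  ceph mgr module ls                      # see what is currently enabled
  for m in status dashboard prometheus; do
      ceph mgr module enable "$m"
  done
  ceph mgr module ls                      # verify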
On Sat, Mar 24, 2018 at 5:44 AM Subhachandra Chandra 
wrote:

> Hi,
>
>We used ceph-ansible to install/update our Ceph cluster config where
> all the Ceph daemons run as containers. In mgr.yml I have the following
> config
>
> ###
>
> # MODULES #
>
> ###
>
> # Ceph mgr modules to enable, current modules available are:
> status,dashboard,localpool,restful,zabbix,prometheus,influx
>
> ceph_mgr_modules: [status,dashboard,prometheus]
>
> In Luminous.2, when the MGR container restarted, the mgr daemon used to
> reload the plugins. Since I upgraded to Luminous.4, the mgr daemon has
> stopped reloading the plugins and need me to run "ceph mgr module enable
> " to load them. What changed between the two versions in how the
> manager is configured. Is there a config file that can be used to specify
> the plugins to load at mgr start time? Looking at ceph-ansible, it looks
> like during installation it just runs the "module enable" commands and
> somehow that used to work.
>
> Thanks
> Subhachandra
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shell / curl test script for rgw

2018-03-23 Thread Marc Roos
 
Thanks! I got it working, although I had to change the date to "date -R 
-u", because I got the "RequestTimeTooSkewed" error. 

I also had to enable buckets=read on the account that was already able
to read and write via Cyberduck; I don't get that.

radosgw-admin caps add --uid='test$test1' --caps "buckets=read" 






-Original Message-
From: Konstantin Shalygin [mailto:k0...@k0ste.ru] 
Sent: Sunday, 18 March 2018 6:35
To: ceph-users@lists.ceph.com
Cc: Marc Roos
Subject: *SPAM* Re: [ceph-users] Shell / curl test script for 
rgw

Hi Marc


> But is there a simple shell script
> that I can use to test with? I have problems with the signature in 
> this one


This is a 100% working test of the admin API (the uid should have
caps="buckets=read").


> #!/bin/bash
> s3_access_key=""
> s3_secret_key=""
> s3_host="objects-us-west-1.dream.io"
> query="admin/bucket"
> method="GET"
> date=$(for i in $(date -u "+%H") ; do date "+%a, %d %b %Y $(( 10#$i 
> )):%M:%S +" ; done) header="${method}\n\n\n${date}\n/${query}"
> sig=$(echo -en ${header} | openssl sha1 -hmac ${s3_secret_key} -binary
> | base64)
>
> curl -s -H "Date: ${date}" \
> -H "Authorization: AWS ${s3_access_key}:${sig}" \ -H "Host: 
> ${s3_host}" \ -X ${method} \ 
> "https://${s3_host}/${query}?format=json=True;




k


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Enable object map kernel module

2018-03-23 Thread Thiago Gonzaga
Hi All,

I'm starting with ceph and faced a problem while using object-map

root@ceph-mon-1:/home/tgonzaga# rbd create test -s 1024 --image-format 2
--image-feature exclusive-lock
root@ceph-mon-1:/home/tgonzaga# rbd feature enable test object-map
root@ceph-mon-1:/home/tgonzaga# rbd list
test
root@ceph-mon-1:/home/tgonzaga# rbd map test
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the
kernel with "rbd feature disable test object-map".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address

How can we deal with that? I have seen some comments that large images without
an object map can be very slow to delete.

Regards,

*Thiago Gonzaga*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to persist configuration about enabled mgr plugins in Luminous 12.2.4

2018-03-23 Thread Subhachandra Chandra
Hi,

   We used ceph-ansible to install/update our Ceph cluster config where all
the Ceph daemons run as containers. In mgr.yml I have the following config

###

# MODULES #

###

# Ceph mgr modules to enable, current modules available are:
status,dashboard,localpool,restful,zabbix,prometheus,influx

ceph_mgr_modules: [status,dashboard,prometheus]

In Luminous.2, when the MGR container restarted, the mgr daemon used to
reload the plugins. Since I upgraded to Luminous.4, the mgr daemon has
stopped reloading the plugins and need me to run "ceph mgr module enable
" to load them. What changed between the two versions in how the
manager is configured. Is there a config file that can be used to specify
the plugins to load at mgr start time? Looking at ceph-ansible, it looks
like during installation it just runs the "module enable" commands and
somehow that used to work.

Thanks
Subhachandra
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Bug/Problem

2018-03-23 Thread John Spray
On Fri, Mar 23, 2018 at 7:45 PM, Perrin, Christopher (zimkop1)
 wrote:
> Hi,
>
> Last week our MDSs started failing one after another, and could not be
> started anymore. After a lot of tinkering I found out that the MDSs crashed after
> trying to rejoin the cluster. The only solution I found that let them start
> again was resetting the journal via cephfs-journal-tool. Now I have broken
> files all over the cluster.

Can you clarify:
 - is the backtrace below from before you used cephfs-journal-tool
or after?
 - what do you mean by "broken file"?

The backtrace you're seeing is in some damage-handling code -- the
crash is a bug (http://tracker.ceph.com/issues/23452), but you'd only
be hitting it if your metadata was already inconsistent.  One way it
could get inconsistent would be from use of cephfs-journal-tool
(wiping the journal but not the session table, resulting in session
state inconsistent with the rest of the metadata), but if the crash
was from before you did that then something else would be going on.
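
(For reference, a sketch of the usual order of operations from the CephFS
disaster-recovery docs, assuming a single filesystem / rank 0 and that MDS_ID
is your MDS name; the session-table reset is the step that keeps a journal
wipe from leaving stale session state behind:

  cephfs-journal-tool journal export /root/mds-journal-backup.bin
  cephfs-journal-tool event recover_dentries summary
  cephfs-journal-tool journal reset
  cephfs-table-tool all reset session
  # once an MDS is active again, check/repair the metadata:
  ceph daemon mds.${MDS_ID} scrub_path / recursive repair
)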

It would also help if you could be more specific about the original
issue that drove you to using cephfs-journal-tool, if that was
something other than the backtrace you included.

Cheers,
John

> Before the crash the OSDs blocked tens of thousands of slow requests.
>
> Can I somehow restore the broken files (I still have a backup of the journal) 
> and how can I make sure that this doesn't happen again? I am still not sure 
> why this even happened.
>
> This happened on ceph version 12.2.3.
>
> This is the log of one MDS:
>   -224> 2018-03-22 15:52:47.310437 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
> <== mon.0 x.x.1.17:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
> 33+0+0 (3611581813 0 0) 0x555883df2780 con 0x555883eb5000
>   -223> 2018-03-22 15:52:47.310482 7fd5798fd700 10 monclient(hunting): my 
> global_id is 745317
>   -222> 2018-03-22 15:52:47.310634 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
> --> x.x.1.17:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x555883df2f00 
> con 0
>   -221> 2018-03-22 15:52:47.311096 7fd57c09f700  5 -- x.x.1.17:6803/122963511 
> >> x.x.1.17:6789/0 conn(0x555883eb5000 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=793748 cs=1 l=1). rx mon.0 
> seq 3 0x555883df2f00 auth_reply(proto 2 0 (0) Success) v1
>   -220> 2018-03-22 15:52:47.311178 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
> <== mon.0 x.x.1.17:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  
> 222+0+0 (1789869469 0 0) 0x555883df2f00 con 0x555883eb5000
>   -219> 2018-03-22 15:52:47.311319 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
> --> x.x.1.17:6789/0 -- auth(proto 2 181 bytes epoch 0) v1 -- 0x555883df2780 
> con 0
>   -218> 2018-03-22 15:52:47.312122 7fd57c09f700  5 -- x.x.1.17:6803/122963511 
> >> x.x.1.17:6789/0 conn(0x555883eb5000 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=793748 cs=1 l=1). rx mon.0 
> seq 4 0x555883df2780 auth_reply(proto 2 0 (0) Success) v1
>   -217> 2018-03-22 15:52:47.312208 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
> <== mon.0 x.x.1.17:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  
> 799+0+0 (4156877078 0 0) 0x555883df2780 con 0x555883eb5000
>   -216> 2018-03-22 15:52:47.312393 7fd5798fd700  1 monclient: found mon.filer1
>   -215> 2018-03-22 15:52:47.312416 7fd5798fd700 10 monclient: 
> _send_mon_message to mon.filer1 at x.x.1.17:6789/0
>   -214> 2018-03-22 15:52:47.312427 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
> --> x.x.1.17:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x555883c8ed80 con 0
>   -213> 2018-03-22 15:52:47.312461 7fd5798fd700 10 monclient: 
> _check_auth_rotating renewing rotating keys (they expired before 2018-03-22 
> 15:52:17.312460)
>   -212> 2018-03-22 15:52:47.312477 7fd5798fd700 10 monclient: 
> _send_mon_message to mon.filer1 at x.x.1.17:6789/0
>   -211> 2018-03-22 15:52:47.312482 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
> --> x.x.1.17:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x555883df2f00 con > 0
>   -210> 2018-03-22 15:52:47.312552 7fd580637200  5 monclient: authenticate 
> success, global_id 745317
>   -209> 2018-03-22 15:52:47.312570 7fd580637200 10 monclient: 
> wait_auth_rotating waiting (until 2018-03-22 15:53:17.312568)
>   -208> 2018-03-22 15:52:47.312776 7fd57c09f700  5 -- x.x.1.17:6803/122963511 
> >> x.x.1.17:6789/0 conn(0x555883eb5000 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=793748 cs=1 l=1). rx mon.0 
> seq 5 0x555883c8f8c0 mon_map magic: 0 v1
>   -207> 2018-03-22 15:52:47.312841 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
> <== mon.0 x.x.1.17:6789/0 5  mon_map magic: 0 v1  433+0+0 (493202164 
> 0 0) 0x555883c8f8c0 con 0x555883eb5000
>   -206> 2018-03-22 15:52:47.312868 7fd5798fd700 10 monclient: handle_monmap 
> mon_map magic: 0 v1
>   -205> 2018-03-22 15:52:47.312892 7fd5798fd700 10 monclient:  got monmap 7, 
> mon.filer1 is now rank 0
>   -204> 2018-03-22 15:52:47.312901 7fd57c09f700  5 -- x.x.1.17:6803/122963511 
> >> 

Re: [ceph-users] Erasure Coded Pools and OpenStack

2018-03-23 Thread Mike Cave
Thank you for getting back to me so quickly.

Your suggestion of adding the config change in ceph.conf was a great one. That 
helped a lot. I didn't realize that the client would need to be updated and 
thought that it was a cluster side modification only. 

Something else that I missed was giving full permissions to the glance user. 

When I ran your rbd command as suggested I received a permission error, which
prompted me to look at the permissions; I realized that I had not added a
permission rule for the glance user.
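
For anyone following along, a sketch of caps that cover both the metadata pool
and the EC data pool (pool names as above; "profile rbd" is available on
Luminous, adjust for your own cluster):

  ceph auth caps client.glance \
      mon 'profile rbd' \
      osd 'profile rbd pool=images, profile rbd pool=images_data_ec'
  ceph auth get client.glance      # verify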

I can confirm that I have the data going into the EC pool from OpenStack!

Thanks again,

Mike

-Original Message-
From: Jason Dillaman 
Reply-To: "dilla...@redhat.com" 
Date: Thursday, March 22, 2018 at 5:15 PM
To: Cave Mike 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Erasure Coded Pools and OpenStack

On Fri, Mar 23, 2018 at 8:08 AM, Mike Cave  wrote:
> Greetings all!
>
>
>
> I’m currently attempting to create an EC pool for my glance images, however
> when I save an image through the OpenStack command line, the data is not
> ending up in the EC pool.
>
> So a little information on what I’ve done so far.
>
> The way that I understand things to work is that you need a metadata pool to
> front the EC pool; I created an ‘images’ pool and then created an
> ‘images_data_ec’ pool.
>
> Following are the steps I used.
>
>
>
> First, I created my EC profile:
>
>
>
> ceph osd erasure-code-profile set 2-1 k=2 m=1 crush-device-class=hdd
>
>
>
> I used the values of k=2 and m=1 to match my dev cluster config (which is
> only three OSD servers (10x4TB OSDs and 2 SSD (for journal) per server)
> which is set to a failure domain: server)
>
>
>
> Then I created my pools:
>
>
>
> ceph osd pool create images 16
>
> ceph osd pool create images_data_ec 128 erasure 2-1
>
> ceph osd pool application enable images images_data_ec
>
> ceph osd pool application enable images_data_ec rbd
>
> ceph osd pool set images_data_ec allow_ec_overwrites true
>
>
>
> Then I added the following to my ceph.conf to tell ceph to use the
> images_data_ec pool when the glance user in invoked. I then restarted all
> the ceph services on all the nodes.
>
>
>
> [client.glance]
>
> rbd default data pool = images_data_ec
>
>
>
> So with that configured I used the rbd cli to create an image:
>
>
>
> rbd create images/myimage --size 1G --data-pool images_data_ec

With that "rbd default data pool = images_data_ec" configuration
override, you shouldn't need to specify the "--data-pool" optional.

>
> Then checked the image details:
>
>
>
> rbd -p images --image myimage info
>
> rbd image 'myimage':
>
> size 1024 GB in 262144 objects
>
> order 22 (4096 kB objects)
>
> data_pool: images_data_ec
>
> block_name_prefix: rbd_data.162.6fdb4874b0dc51
>
> format: 2
>
> features: layering, exclusive-lock, object-map, fast-diff,
> deep-flatten, data-pool
>
> flags:
>
> create_timestamp: Thu Mar 22 16:29:03 2018
>
>
>
> This looks okay so I continued on and uploaded an image through the
> OpenStack cli:
>
>
>
> openstack image create --disk-format qcow2 --unprotected --public --file
> cirros-0.4.0-x86_64-disk.img cirros-test-image
>
>
>
> However, when I inspect the image I see that it is not using the data pool
> as expected:

Have you ensured that your configuration override is on the glance
controller node? Have you confirmed that glance is configured to use
the "glance" user? If you run "rbd --id glance create images/ --size 1" is the image properly associated w/ the data pool?

>
> rbd -p images --image 91147e95-3e3d-4dc1-934d-bcaad7f645be info
>
> rbd image '91147e95-3e3d-4dc1-934d-bcaad7f645be':
>
> size 12418 kB in 2 objects
>
> order 23 (8192 kB objects)
>
> block_name_prefix: rbd_data.6fdbe73cf7c855
>
> format: 2
>
> features: layering, exclusive-lock, object-map, fast-diff,
> deep-flatten
>
> flags:
>
> create_timestamp: Thu Mar 22 16:29:37 2018
>
>
>
> When I look at the usage of the two pools only the images pool has any data
> in it. Also, when I query the EC pool for a list of images, it returns
> empty, even though there should be something from the rbd cli uploaded image
> in there (or so I thought). Should there be something to query in the EC
> pool to prove data is being written there?

The EC pool would only be used for data, so an "rbd ls images_data_ec"
is expected to return zero images (since the images are registered in
the "images" pool in your case).
>
> So far, I have been unable to get this to work and I’m completely at a loss.
>
> Does anyone have any experience with this configuration and or maybe some
> guidance for getting it to work?
>
> Ideally, I want to 

Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency on osds with more pgs

2018-03-23 Thread David Turner
Luminous addresses it with a mgr plugin that actively changes the weights
of OSDs to balance the distribution.  In addition to having PGs distributed
well so that your OSDs hold an equal amount of data, it also matters which
OSDs are primary.  If you're running into a lot of latency on specific OSDs
during high read volumes, then it might just be that those OSDs are primary
for too many PGs.  All reads happen on the primary OSD even though you have
multiple secondary OSDs for the PG.  Just a thought; it might not be what's
happening.
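
A sketch of the Luminous balancer, for reference (upmap mode requires all
clients to be Luminous-capable; "crush-compat" works with older clients):

  ceph mgr module enable balancer
  ceph osd set-require-min-compat-client luminous   # needed for upmap mode
  ceph balancer mode upmap
  ceph balancer on
  ceph balancer status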

On Thu, Mar 8, 2018 at 11:26 PM shadow_lin  wrote:

> Thanks for your advice.
> I will try to reweight osds of my cluster.
>
> Why is ceph so sensitive to unbalanced pg distribution during high load?
> ceph osd df result is: https://pastebin.com/ur4Q9jsA.  ceph osd perf
> result is: https://pastebin.com/87DitPhV
>
> There is no osd with a very high pg count compared to others. When the write
> test load is low everything seems fine, but during the high write load test,
> some of the osds with higher pg counts can have 3-10 times the fs_apply_latency
> of the others.
>
> My guess is the heavily loaded osds slow the whole cluster (because I
> have only one pool spanning all osds) down to the level they can handle,
> so the other osds see a lower load and have good latency.
>
> Is this expected during high load (indicating the load is too high for the
> current cluster to handle)?
>
> How does luminous solve the uneven pg distribution problem? I read that
> there is a pg-upmap exception table in the osdmap in luminous 12.2.x, and it
> is said that with it a perfect pg distribution among osds can be achieved.
>
> 2018-03-09
> --
> shadow_lin
> --
>
> *From:* David Turner 
> *Sent:* 2018-03-09 06:45
> *Subject:* Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency
> on osds with more pgs
> *To:* "shadow_lin"
> *Cc:* "ceph-users"
>
>
> PGs being unevenly distributed is a common occurrence in Ceph.  Luminous
> started making some steps towards correcting this, but you're in Jewel.
> There are a lot of threads in the ML archives about fixing PG
> distribution.  Generally every method comes down to increasing the weight
> on OSDs with too few PGs and decreasing the weight on the OSDs with too
> many PGs.  There are a lot of schools of thought on the best way to
> implement this in your environment which has everything to do with your
> client IO patterns and workloads.  Looking into `ceph osd reweight-by-pg`
> might be a good place for you to start as you are only looking at 1 pool in
> your cluster.  If you have more pools, you generally need `ceph osd
> reweight-by-utilization`.
>
> On Wed, Mar 7, 2018 at 8:19 AM shadow_lin  wrote:
>
>> Hi list,
>>Ceph version is jewel 10.2.10 and all osd are using filestore.
>> The Cluster has 96 osds and 1 pool with size=2 replication with 4096
>> pg(base on pg calculate method from ceph doc for 100pg/per osd).
>> The osd with the highest pg count has 104 PGs and there are 6 osds with
>> above 100 PGs
>> Most of the osd have around 7x-9x PGs
>> The osd with the least pg count has 58 PGs
>>
>> During the write test some of the osds have very high fs_apply_latency
>> like 1000ms-4000ms while the normal ones are like 100-600ms. The osds with
>> high latency are always the ones with more pg on it.
>>
>> iostat on the high latency osd shows the hdds are having high %util at
>> about 95%-96% while the normal ones are having %util at 40%-60%
>>
>> I think the reason is that the osds with more pgs need
>> to handle more write requests. Is this right?
>> But even though the pg distribution is not even, the variation is not
>> that much. How could the performance be so sensitive to it?
>>
>> Is there anything I can do to improve the performance and reduce the
>> latency?
>>
>> How can I make the pg distribution to be more even?
>>
>> Thanks
>>
>>
>> 2018-03-07
>> --
>> shadowlin
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephalocon slides/videos

2018-03-23 Thread David Turner
A lot of videos from ceph days and such pop up on the [1] Ceph youtube
channel.

[1] https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw

On Fri, Mar 23, 2018 at 5:28 AM Serkan Çoban  wrote:

> Hi,
>
> Where can I find slides/videos of the conference?
> I already tried (1), but cannot view the videos.
>
> Serkan
>
> 1- http://www.itdks.com/eventlist/detail/1962
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost space or expected?

2018-03-23 Thread David Turner
The first thing I looked at was if you had any snapshots/clones in your
pools, but that count is 0 for you.  Second, I would look at seeing if you
have orphaned objects from deleted RBDs.  You could check that by comparing
a list of the rbd 'block_name_prefix' for all of the rbds in the pool with
the prefix of object names in that pool.

rados ls | cut -d . -f1,2 | sort -u | grep ^rbd_data
for rbd in $(rbd ls); do rbd info --pool rbd-replica-ssd $rbd | awk
'/block_name_prefix/ {print $2}'; done | sort

Alternatively you can let bash do the work for you by diff'ing the output
of the commands directly

diff <(rados ls | cut -d . -f1,2 | sort -u | grep ^rbd_data) <(for rbd in
$(rbd ls); do rbd info --pool rbd-replica-ssd $rbd | awk
'/block_name_prefix/ {print $2}'; done | sort) | awk '/>/ {print $2}'

Anything listed are rbd prefixes with objects for rbds that do not exist.
If you do have any that show up here, you would want to triple check that
the RBD doesn't actually exist and then work on finding the objects with
that prefix and delete them with something like `rados ls | grep $prefix |
xargs rados rm` (see the fuller sketch below).

Also to note, rbd_data is not the only thing that uses the rbd prefix,
there is also rbd_header, rbd_object_map, and perhaps other things that
will also need to be cleaned up if you find orphans.  Hopefully you
don't... but hopefully you do so you can get an answer to your question and
a direction to go.
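
A hypothetical cleanup sketch ($POOL and $PREFIX are placeholders; the grep
deliberately also matches rbd_header/rbd_object_map objects for that prefix;
triple-check the prefix really is orphaned before running the rm):

  POOL=rbd-replica-ssd
  PREFIX=0123456789abcd        # example prefix, not a real one
  rados -p "$POOL" ls | grep "$PREFIX" | xargs -r -n 100 rados -p "$POOL" rm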

On Tue, Mar 20, 2018 at 9:54 AM Caspar Smit  wrote:

> Hi all,
>
> Here's the output of 'rados df' for one of our clusters (Luminous 12.2.2):
>
> ec_pool 75563G 19450232 0 116701392 0 0 0 385351922 27322G 800335856 294T
> rbd 42969M 10881 0 32643 0 0 0 615060980 14767G 970301192 207T
> rbdssd 252G 65446 0 196338 0 0 0 29392480 1581G 211205402 2601G
>
> total_objects 19526559
> total_used 148T
> total_avail 111T
> total_space 259T
>
>
> ec_pool (k=4, m=2)
> rbd (size = 3/2)
> rbdssd (size = 3/2)
>
> If i calculate the space i should be using:
>
> ec_pool = 75 TB x 1.5 = 112.5 TB  (4+2 is storage times 1.5 right?)
> rbd = 42 GB x 3 = 150 GB
> rbdssd = 252 GB x 3 = 756 GB
>
> Let's say 114TB in total.
>
> Why is there 148TB used space? (That's a 30TB difference)
> Is this expected behaviour? A bug? (if so, how can i reclaim this space?)
>
> kind regards,
> Caspar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] remove big rbd image is very slow

2018-03-23 Thread David Turner
Just to note the "magic" of object-map... If you had a 50TB RBD with
object-map and 100% of the RBD is in use, the rbd rm will take the same
amount of time to delete the RBD as if you don't have object-map enabled on
a brand new 50TB RBD that has no data in it.  Removing that many objects
just takes a long time.  object-map makes it so that operations like rbd rm
will skip objects not in use and speed up the operation.

I had a dev accidentally create a 1EB (exabyte) RBD that would have
probably taken months to actually delete.  Instead I did what you were
talking about and removed the RADOS objects manually for the RBD and then
removed the RBD much more quickly.

To directly answer your question about whether this speed is expected, I would
say yes.  Ceph is really good at keeping your data intact... not so good
at deleting it. :)
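
For completeness, a sketch of turning object-map on for an existing image and
rebuilding the map (this only helps librbd clients such as rbd-nbd/QEMU, not
krbd, and assumes exclusive-lock is already enabled on the image):

  rbd feature enable rbd/myimage object-map fast-diff
  rbd object-map rebuild rbd/myimage
  rbd info rbd/myimage | grep -E 'features|flags'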

On Sat, Mar 17, 2018 at 12:18 PM Jack  wrote:

> Yes, this is what object-map does, it tracks used objects
>
> For your 50TB new image:
> - Without object-map, rbd rm must iterate over every object, find out
> that the object does not exist, move on to the next object, etc.
> - With object-map, rbd rm gets the used-objects list, finds it empty, and
> the job is done
>
> For rbd export, this may be the same.
> However, rbd export exports a full image (so, in your case, 20TB).
> You may want to use rbd export-diff (which will still be somewhat
> slow without object-map, but will only output useful data).
>
> Another tip: if you can, consider using rbd-nbd
> This allows you to mount a rbd volume using librbd
>
>
> On 03/17/2018 05:11 PM, shadow_lin wrote:
> > Hi list,
> > My ceph version is jewel 10.2.10.
> > I tired to use rbd rm to remove a 50TB image(without object map because
> krbd does't support it).It takes about 30mins to just complete about 3%. Is
> this expected? Is there a way to make it faster?
> > I know there are scripts to delete rados objects of the rbd image to
> make it faster. But is the slowness expected for rbd rm command?
> >
> > PS: I also encounter very slow rbd export for large rbd image(20TB image
> but with only a few GB data).Takes hours to completed the export.I guess
> both are related to object map not enabled, but krbd doesn't support object
> map feature.
> >
> >
> >
> >
> > 2018-03-18
> >
> >
> >
> > shadowlin
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving OSDs between hosts

2018-03-23 Thread David Turner
Just moving the OSD is indeed the right thing to do and the crush map will
update when the OSDs start up on the new host.  The only "gotcha" is if you
do not have your journals/WAL/DBs on the same device as your data.  In that
case, you will need to move both devices to the new server for the OSD to
start.  Without them, the OSD will simply fail to start and you can go back
and move the second device without any problems, just a little more time
that the disk is moved.

Please note that moving the disks will change the crush map, which means
that the algorithm used to place data on OSDs will recalculate where your
data goes.  You will have a lot of data movement after doing this even
though you have the same number of disks.
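
A per-OSD move sketch (assuming systemd plus ceph-disk/udev activation, and
that journal/WAL/DB devices travel together with the data disk; "17" is a
placeholder OSD id):

  ceph osd set noout          # don't mark OSDs out while the disk is in transit
  systemctl stop ceph-osd@17
  # physically move the drive(s); udev/ceph-disk activates the OSD on the new host
  ceph osd unset noout
  ceph -s                     # wait for HEALTH_OK before moving the next one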

On Fri, Mar 16, 2018 at 7:23 PM  wrote:

> Hi jon,
>
> Am 16. März 2018 17:00:09 MEZ schrieb Jon Light :
> >Hi all,
> >
> >I have a very small cluster consisting of 1 overloaded OSD node and a
> >couple MON/MGR/MDS nodes. I will be adding new OSD nodes to the cluster
> >and
> >need to move 36 drives from the existing node to a new one. I'm running
> >Luminous 12.2.2 on Ubuntu 16.04 and everything was created with
> >ceph-deploy.
> >
> >What is the best course of action for moving these drives? I have read
> >some
> >posts that suggest I can simply move the drive and once the new OSD
> >node
> >sees the drive it will update the cluster automatically.
>
> I would give this a try. I tested this scenario at the beginning of my
> cluster (Jewel/ceph-deploy/ceph-disk) and I was able to remove one OSD and
> put it in another node; udev did its magic.
>
> - Mehmet
>
> >
> >Time isn't a problem and I want to minimize risk so I want to move 1
> >OSD at
> >a time. I was planning on stopping the OSD, moving it to the new host,
> >and
> >waiting for the OSD to become up and in and the cluster to be healthy.
> >Are
> >there any other steps I need to take? Should I do anything different?
> >
> >Thanks in advance
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Nicolas Huillard
On Friday, 23 March 2018 at 12:14 +0100, Ilya Dryomov wrote:
> On Fri, Mar 23, 2018 at 11:48 AM,   wrote:
> > The stock kernel from Debian is perfect
> > Spectre / meltdown mitigations are worthless for a Ceph point of
> > view,
> > and should be disabled (again, strictly from a Ceph point of view)

I know that Ceph itself doesn't need this, but the Ceph client machines,
especially those hosting VMs or more diverse code, should have those
mitigations.

> > If you need the luminous features, using the userspace
> > implementations
> > is required (librbd via rbd-nbd or qemu, libcephfs via fuse etc)

I'd rather use the faster kernel cephfs implementation instead of fuse,
especially with the Meltdown PTI mitigation (I guess fuse implies twice
the userland-to-kernel calls, which are costly with PTI).
I don't have an idea yet re. RBD...

> luminous cluster-wide feature bits are supported since kernel 4.13.

This means that there are differences between 4.9 and 4.14 re. Ceph
features. I know that quotas are not supported yet in any kernel, but I
don't use them...
Are there some performance/stability improvements in the kernel that
would justify using 4.14 instead of 4.9 ? I can't find any list
anywhere...
Since I'm building a new cluster, I'd rather choose the latest software
from the start if it's justified.

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why we show removed snaps in ceph osd dump pool info?

2018-03-23 Thread David Turner
The removed_snaps list is also in the osdmap.  It does get truncated over
time into ranges and such, and it is definitely annoying, but it is
needed for some of the internals of Ceph.  I don't remember the details,
but that was the gist of the answer I got back when we were working on some
bugs with the Ceph support team previously.

On Wed, Mar 14, 2018 at 5:38 AM linghucongsong 
wrote:

> What is the purpose of showing the removed snaps? It looks like the
> removed snaps are of no use to the user. We use rbd export and import to back up
> images from one ceph cluster to another ceph cluster. The incremental image
> backup depends on the snap, and we will remove the snap after the backup, so
> it will show a lot of removed snaps like below!
>
> 'volumes' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins
> pg_num 128 pgp_num 128 last_change 5554 flags hashpspool stripe_width 0
> application rbd
> removed_snaps
> [1~4,6~5f,66~149,1b0~660,811~1,813~1,815~1,817~1,819~1,81b~1,81d~1,81f~1,821~1,823~1,825~1,827~1,829~1,82b~1,82d~1,82f~1,831~1,833~1,835~1,837~1,839~1,83b~1,83d~1,83f~1,841~1,843~1,845~1,847~1,849~1,84b~1,84d~1,84f~1,851~1,853~1,855~1,857~1,859~1,85b~1,85d~1,85f~1,861~1,863~1,865~1,867~1,869~1,86b~1,86d~1,86f~1,871~1,873~1,875~1,877~1,879~1,87b~1,87d~1,87f~1,881~1,883~1,885~1,887~1,889~1,88b~1,88d~1,88f~1,891~1,893~1,895~1,897~1,899~1,89b~1,89d~1,89f~1,8a1~1,8a3~1,8a5~1,8a7~1,8a9~1,8ab~1,8ad~1,8af~1,8b1~1,8b3~1,8b5~1,8b7~1,8b9~1,8bb~1,8bd~1,8bf~1,8c1~1,8c3~1,8c5~1,8c7~1,8c9~1,8cb~1,8cd~1,8cf~1,8d1~1,8d3~1,8d5~1,8d7~1,8d9~1,8db~1,8dd~1,8df~1,8e1~1,8e3~1,8e5~1,8e7~1,8e9~1,8eb~1,8ed~1,8ef~1,8f1~1,8f3~1,8f5~1,8f7~1,8f9~1,8fb~1,8fd~1,8ff~1,901~1,903~1,905~1,907~1,909~1,90b~1,90d~1,90f~1,911~1,913~1,915~1,917~1,919~1,91b~1,91d~1,91f~1,921~1]
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CHOOSING THE NUMBER OF PLACEMENT GROUPS

2018-03-23 Thread David Turner
PGs per pool also has a lot to do with how much data each pool will have.
If 1 pool will have 90% of the data, it should have 90% of the PGs.  If it
will be common for you to create and delete pools (not usually the case, and
probably something you can handle more simply), then you can aim to start at
the minimum recommendation and stay between that and the recommended amount,
so something like 40 < PGs/OSD < 100.
Most things you can do in Ceph are CephFS, RBD, and RGW.  There are few
times you need to set up multiple RBD pools as you can create thousands of
RBDs in 1 pool, CephFS is more common, but still you can set up securities
so that each user only has access to a subfolder and not the entire FS
directory tree, etc.  There are generally ways to configure things so that
you don't need new pools every time someone has a storage need.

On Fri, Mar 9, 2018 at 6:31 AM Caspar Smit  wrote:

> Hi Will,
>
> Yes, adding new pools will increase the number of PG's per OSD. But you
> can always decrease the number of pg's per OSD by adding new Hosts/OSD's.
>
> When you design a cluster you have to calculate how many pools you're
> going to use and use that information with PGcalc. (
> https://ceph.com/pgcalc/)
>
> If you add pools later on they were not part of the original design and
> you probably will need additional space (OSD's) too.
>
> Kind regards,
> Caspar
>
> 2018-03-09 11:05 GMT+01:00 Will Zhao :
>
>> Hi Janne:
>> Thanks for your response. Approximately 100 PGs per OSD, yes, I
>> missed out this part.
>> I am still a little confused, because the 100-PGs-per-OSD rule is the
>> result of the sum over all used pools.
>> I know I can create many pools. Assume that I have 5 pools now, and
>> the rule has already been met.
>> So if I create a sixth pool, the total PGs will increase, and then
>> the PGs per OSD will be more than 100.
>> Will this not violate the rule?
>>
>>
>> On Fri, Mar 9, 2018 at 5:40 PM, Janne Johansson 
>> wrote:
>> >
>> >
>> > 2018-03-09 10:27 GMT+01:00 Will Zhao :
>> >>
>> >> Hi all:
>> >>
>> >>  I have a tiny question. I have read the documents, and it
>> >> recommend approximately 100 placement groups for normal usage.
>> >
>> >
>> > Per OSD. Approximately 100 PGs per OSD, when all used pools are summed
>> up.
>> > For things like radosgw, let it use the low defaults (8?) and then
>> expand on
>> > the pools
>> > that actually see a lot of data getting into them, leave the rest as is.
>> >
>> >
>> >>
>> >> Because the pg num can not be decreased, so if in current cluster,
>> >> the pg num have met this rule, and when I try to create a new pool ,
>> >> what pg num I should set ? I think no matter what I do , it  will
>> >> violate the pg-num-rule, add burden to osd.  This means , if I want my
>> >>  cluster be used by many different users, I should bulid a new cluster
>> >> for new user ?
>> >>
>> >
>> > No, one cluster can serve a lot of clients. You can have lots of pools
>> if
>> > you need,
>> > and those pools can have separate OSD hosts serving them if you need
>> strong
>> > separation, but still managed from the same cluster.
>> >
>> > --
>> > May the most significant bit of your life be positive.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous and jemalloc

2018-03-23 Thread Alexandre DERUMIER
Hi,

I think it's no longer a problem since the async messenger became the default.
The difference between jemalloc and tcmalloc is minimal now.
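
If you still want a bigger tcmalloc thread cache, a sketch of the usual knob
(the packaged systemd units source this environment file; the value is just
an example):

  # /etc/sysconfig/ceph on RPM systems, /etc/default/ceph on Debian/Ubuntu
  TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456   # 256 MB
  # then: systemctl restart ceph-osd.target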

Regards,

Alexandre

- Original Message -
From: "Xavier Trilla" 
To: "ceph-users" 
Cc: "Arnau Marcé" 
Sent: Friday, 23 March 2018 13:34:03
Subject: [ceph-users] Luminous and jemalloc



Hi, 



Does anybody have information about using jemalloc with Luminous? For what I’ve 
seen on the mailing list and online, bluestor crashes when using jemalloc. 



We’ve been running ceph with jemalloc since Hammer, as performance with 
tcmalloc was terrible (We run a quite big full SSD cluster) and jemalloc was a 
game changer (CPU usage and latency were extremely reduced when using 
jemalloc). 



But looks like Ceph with a recent TCmalloc library and a high thread cache work 
pretty well, do you have experience with that? Is jemalloc still justified or 
it does not make sense anymore? 



Thanks for your comments! 

Xavier. 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Ilya Dryomov
On Fri, Mar 23, 2018 at 3:01 PM,   wrote:
> Ok ^^
>
> For CephFS, as far as I know, quota support is not supported in kernel space.
> This is not specific to luminous, though.

quota support is coming, hopefully in 4.17.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread ceph
Ok ^^

For CephFS, as far as I know, quota support is not supported in kernel space.
This is not specific to luminous, though.

On 03/23/2018 03:00 PM, Ilya Dryomov wrote:
> On Fri, Mar 23, 2018 at 2:18 PM,   wrote:
>> On 03/23/2018 12:14 PM, Ilya Dryomov wrote:
>>> luminous cluster-wide feature bits are supported since kernel 4.13.
>>
>> ?
>>
>> # uname -a
>> Linux abweb1 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1
>> (2018-01-14) x86_64 GNU/Linux
>> # rbd info truc
>> rbd image 'truc':
>> size 20480 MB in 5120 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.9eca966b8b4567
>> format: 2
>> features: layering, exclusive-lock, object-map, fast-diff, 
>> deep-flatten
>> flags:
>> # rbd map truc
>> rbd: sysfs write failed
>> RBD image feature set mismatch. You can disable features unsupported by
>> the kernel with "rbd feature disable pool/truc object-map fast-diff
>> deep-flatten".
>> In some cases useful info is found in syslog - try "dmesg | tail".
>> rbd: map failed: (6) No such device or address
>> # dmesg | tail -1
>> [1108045.667333] rbd: image truc: image uses unsupported features: 0x38
> 
> Those are rbd image features.  Your email also mentioned "libcephfs via
> fuse", so I assumed you had meant cluster-wide feature bits.
> 
> Thanks,
> 
> Ilya
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Ilya Dryomov
On Fri, Mar 23, 2018 at 2:18 PM,   wrote:
> On 03/23/2018 12:14 PM, Ilya Dryomov wrote:
>> luminous cluster-wide feature bits are supported since kernel 4.13.
>
> ?
>
> # uname -a
> Linux abweb1 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1
> (2018-01-14) x86_64 GNU/Linux
> # rbd info truc
> rbd image 'truc':
> size 20480 MB in 5120 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.9eca966b8b4567
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, 
> deep-flatten
> flags:
> # rbd map truc
> rbd: sysfs write failed
> RBD image feature set mismatch. You can disable features unsupported by
> the kernel with "rbd feature disable pool/truc object-map fast-diff
> deep-flatten".
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (6) No such device or address
> # dmesg | tail -1
> [1108045.667333] rbd: image truc: image uses unsupported features: 0x38

Those are rbd image features.  Your email also mentioned "libcephfs via
fuse", so I assumed you had meant cluster-wide feature bits.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Group-based permissions issue when using ACLs on CephFS

2018-03-23 Thread Josh Haft
On Fri, Mar 23, 2018 at 12:14 AM, Yan, Zheng  wrote:
>
> On Fri, Mar 23, 2018 at 5:14 AM, Josh Haft  wrote:
> > Hello!
> >
> > I'm running Ceph 12.2.2 with one primary and one standby MDS. Mounting
> > CephFS via ceph-fuse (to leverage quotas), and enabled ACLs by adding
> > fuse_default_permissions=0 and client_acl_type=posix_acl to the mount
> > options. I then export this mount via NFS and the clients mount NFS4.1.
> >
> does fuse_default_permissions=0 work?

Yes, ACLs work as expected when I set fuse_default_permissions=0.

> > After doing some in-depth testing it seems I'm unable to allow access from
> > the NFS clients to a directory/file based on group membership when the
> > underlying CephFS was mounted with ACL support. This issue appears using
> > both filesystem permissions (e.g. chgrp) and NFSv4 ACLs. However, ACLs do
> > work if the principal is a user instead of a group. If I disable ACL support
> > on the ceph-fuse mount, things work as expected using fs permissions;
> > obviously I don't get ACL support.
> >
> > As an intermediate step I did check whether this works directly on the
> > CephFS filesystem - on the NFS server - and it does. So it appears to be an
> > issue re-exporting it via NFS.
> >
> > I do not see this issue when mounting CephFS via the kernel, exporting via
> > NFS, and re-running these tests.
> >
> > I searched the ML and bug reports but only found this -
> > http://tracker.ceph.com/issues/12617 - which seems close to the issue I'm
> > running into, but was closed as resolved 2+ years ago.
> >
> > Has anyone else run into this? Am I missing something obvious?
> >
>
> ceph-fuse does its permission check according to the local host's
> supplementary group configuration. That's why you see this behavior.

You're saying both the NFS client and server (where ceph-fuse is
running) need to use the same directory backend? (they are)
I should have mentioned I'm using LDAP/AD on client and server, so I
don't think that is the problem.

Either way, I would not expect the behavior to change simply by
enabling ACLs, especially when I'm using filesystem permissions, and
ACLs aren't part of the equation.

> Regards
> Yan, Zheng
>
> > Thanks!
> > Josh
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread ceph
On 03/23/2018 12:14 PM, Ilya Dryomov wrote:
> luminous cluster-wide feature bits are supported since kernel 4.13.

?

# uname -a
Linux abweb1 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1
(2018-01-14) x86_64 GNU/Linux
# rbd info truc
rbd image 'truc':
size 20480 MB in 5120 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.9eca966b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
# rbd map truc
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by
the kernel with "rbd feature disable pool/truc object-map fast-diff
deep-flatten".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address
# dmesg | tail -1
[1108045.667333] rbd: image truc: image uses unsupported features: 0x38
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous and jemalloc

2018-03-23 Thread Xavier Trilla
Hi,

Does anybody have information about using jemalloc with Luminous? From what
I've seen on the mailing list and online, BlueStore crashes when using jemalloc.

We've been running ceph with jemalloc since Hammer, as performance with 
tcmalloc was terrible (We run a quite big full SSD cluster) and jemalloc was a 
game changer (CPU usage and latency were extremely reduced when using jemalloc).

But it looks like Ceph with a recent tcmalloc library and a high thread cache
works pretty well; do you have experience with that? Is jemalloc still
justified, or does it not make sense anymore?

Thanks for your comments!
Xavier.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore cluster, bad IO perf on blocksize<64k... could it be throttling ?

2018-03-23 Thread Maged Mokhtar
On 2018-03-21 19:50, Frederic BRET wrote:

> Hi all,
> 
> The context :
> - Test cluster aside production one
> - Fresh install on Luminous
> - choice of Bluestore (coming from Filestore)
> - Default config (including wpq queuing)
> - 6 nodes SAS12, 14 OSD, 2 SSD, 2 x 10Gb nodes, far more Gb at each switch 
> uplink...
> - R3 pool, 2 nodes per site
> - separate db (25GB) and wal (600MB) partitions on SSD for each OSD to be 
> able to observe each kind of IO with iostat
> - RBD client fio --ioengine=libaio --iodepth=128 --direct=1 
> - client RDB :  rbd map rbd/test_rbd -o queue_depth=1024
> - Just to point out, this is not a thread on SSD performance or the match
> between SSDs and the number of OSDs. These 12Gb SAS 10DWPD SSDs perform
> perfectly, with plenty of headroom, on the production cluster even with XFS
> filestore and journals on SSDs.
> - This thread is about a possible bottleneck on low size blocks with 
> rocksdb/wal/Bluestore.
> 
> To begin with, Bluestore performance is really breathtaking compared to 
> filestore/XFS : we saturate the 20Gb clients bandwidth on this small test 
> cluster, as soon as IO blocksize=64k, a thing we couldn't achieve with 
> Filestore and journals, even at 256k.
> 
> The downside, all small IO blockizes (4k, 8k, 16k, 32k) are considerably 
> slower and appear somewhat capped.
> 
> Just to compare, here are observed latencies at 2 consecutive values for 
> blocksize 64k and 32k :
> 64k :
> write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
> lat (msec): min=2, max=867, avg=17.29, stdev=32.31
> 
> 32k :
> write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
> lat (msec): min=1, max=5111, avg=78.81, stdev=430.50
> 
> Whereas the 64k run almost fills the 20Gb client connection, the 32k one only
> gets a mere 1/10th of the bandwidth, and IO latencies are multiplied
> by 4.5 (or get a ~60ms pause?...)
> 
> And we see the same constant latency at 16k, 8k and 4k :
> 16k :
> write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
> lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08
> 
> 8k :
> write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
> lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61
> 
> 4k :
> write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
> lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29
> 
> To compare with filestore, on 4k IOs results I have on hand from previous 
> install, we were getting almost 2x the Bluestore perfs on the exact same 
> cluster :
> WRITE: io=1221.4MB, aggrb=41477KB/,s maxt=30152msec
> 
> The thing is, during these small blocksize fio benchmarks, the node CPUs,
> OSDs, SSDs, and of course the network are nowhere near saturated (i.e. I think
> this has nothing to do with write amplification); nevertheless, client IOPS
> starve at low values.
> Shouldn't Bluestore IOPS be far higher than Filestore IOPS on small IOs too?
> 
> To summerize, here is what we can observe :
> 
> Seeking counters, I found in "perf dump" incrementing values with slow IO 
> benchs, here for 1 run of 4k fio :
> "deferred_write_ops": 7631,
> "deferred_write_bytes": 31457280,
> 
> Does this mean throttling or some other QoS mechanism may be the cause, and
> default config values may be artificially limiting small IO performance on
> our architecture? And does anyone have an idea on how to circumvent it?
> 
> OSD Config Reference documentation may be talking about these aspects in the 
> QoS/MClock/Caveats section, but I'm not sure to understand the whole picture. 
> 
> Could someone help ?
> 
> Thanks
> Frederic 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi Fredric, 

I too hope someone from the ceph team will answer this. I believe some
people do see this behavior.

In the meantime i would suggest further data: 

1) What are the raw disk IOPS and disk utilization (%busy) on your HDDs?
You do show the SSDs (2800-4000 IOPS), but it is likely the HDD
IOPS/utilization that could be an issue.

2) Can you try setting
bluestore_prefer_deferred_size_hdd = 0
(in effect we are disabling the deferred writes mechanism) and see if
this helps? (See the sketch after this list.)

3) If you have a controller with a write-back cache, can you enable it?
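
A sketch for 2), assuming "osd.0" as a sample OSD and that you restart it after
changing ceph.conf (not every bluestore option applies at runtime):

  ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd
  # ceph.conf, [osd] section:
  #   bluestore_prefer_deferred_size_hdd = 0
  systemctl restart ceph-osd@0
  # re-run the 4k fio job and watch the deferred-write counters:
  ceph daemon osd.0 perf dump | grep -E '"deferred_write_(ops|bytes)"'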

Again, I hope someone from the ceph team will weigh in on this.

Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS Bug/Problem

2018-03-23 Thread Perrin, Christopher (zimkop1)
Hi,

Last week our MDSs started failing one after another, and could not be started
anymore. After a lot of tinkering I found out that the MDSs crashed after trying to
rejoin the cluster. The only solution I found that let them start again was
resetting the journal via cephfs-journal-tool. Now I have broken files all over
the cluster.

Before the crash the OSDs blocked tens of thousands of slow requests.

Can I somehow restore the broken files (I still have a backup of the journal) 
and how can I make sure that this doesn't happen again? I am still not sure why
this even happened.

This happened on ceph version 12.2.3.

This is the log of one MDS:
  -224> 2018-03-22 15:52:47.310437 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
<== mon.0 x.x.1.17:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
33+0+0 (3611581813 0 0) 0x555883df2780 con 0x555883eb5000
  -223> 2018-03-22 15:52:47.310482 7fd5798fd700 10 monclient(hunting): my 
global_id is 745317
  -222> 2018-03-22 15:52:47.310634 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
--> x.x.1.17:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x555883df2f00 con 0
  -221> 2018-03-22 15:52:47.311096 7fd57c09f700  5 -- x.x.1.17:6803/122963511 
>> x.x.1.17:6789/0 conn(0x555883eb5000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=793748 cs=1 l=1). rx mon.0 
seq 3 0x555883df2f00 auth_reply(proto 2 0 (0) Success) v1
  -220> 2018-03-22 15:52:47.311178 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
<== mon.0 x.x.1.17:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  
222+0+0 (1789869469 0 0) 0x555883df2f00 con 0x555883eb5000
  -219> 2018-03-22 15:52:47.311319 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
--> x.x.1.17:6789/0 -- auth(proto 2 181 bytes epoch 0) v1 -- 0x555883df2780 con 0
  -218> 2018-03-22 15:52:47.312122 7fd57c09f700  5 -- x.x.1.17:6803/122963511 
>> x.x.1.17:6789/0 conn(0x555883eb5000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=793748 cs=1 l=1). rx mon.0 
seq 4 0x555883df2780 auth_reply(proto 2 0 (0) Success) v1
  -217> 2018-03-22 15:52:47.312208 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
<== mon.0 x.x.1.17:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  
799+0+0 (4156877078 0 0) 0x555883df2780 con 0x555883eb5000
  -216> 2018-03-22 15:52:47.312393 7fd5798fd700  1 monclient: found mon.filer1
  -215> 2018-03-22 15:52:47.312416 7fd5798fd700 10 monclient: _send_mon_message 
to mon.filer1 at x.x.1.17:6789/0
  -214> 2018-03-22 15:52:47.312427 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
--> x.x.1.17:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x555883c8ed80 con 0
  -213> 2018-03-22 15:52:47.312461 7fd5798fd700 10 monclient: 
_check_auth_rotating renewing rotating keys (they expired before 2018-03-22 
15:52:17.312460)
  -212> 2018-03-22 15:52:47.312477 7fd5798fd700 10 monclient: _send_mon_message 
to mon.filer1 at x.x.1.17:6789/0
  -211> 2018-03-22 15:52:47.312482 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
--> x.x.1.17:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x555883df2f00 con 0
  -210> 2018-03-22 15:52:47.312552 7fd580637200  5 monclient: authenticate 
success, global_id 745317
  -209> 2018-03-22 15:52:47.312570 7fd580637200 10 monclient: 
wait_auth_rotating waiting (until 2018-03-22 15:53:17.312568)
  -208> 2018-03-22 15:52:47.312776 7fd57c09f700  5 -- x.x.1.17:6803/122963511 
>> x.x.1.17:6789/0 conn(0x555883eb5000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=793748 cs=1 l=1). rx mon.0 
seq 5 0x555883c8f8c0 mon_map magic: 0 v1
  -207> 2018-03-22 15:52:47.312841 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
<== mon.0 x.x.1.17:6789/0 5  mon_map magic: 0 v1  433+0+0 (493202164 0 
0) 0x555883c8f8c0 con 0x555883eb5000
  -206> 2018-03-22 15:52:47.312868 7fd5798fd700 10 monclient: handle_monmap 
mon_map magic: 0 v1
  -205> 2018-03-22 15:52:47.312892 7fd5798fd700 10 monclient:  got monmap 7, 
mon.filer1 is now rank 0
  -204> 2018-03-22 15:52:47.312901 7fd57c09f700  5 -- x.x.1.17:6803/122963511 
>> x.x.1.17:6789/0 conn(0x555883eb5000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=793748 cs=1 l=1). rx mon.0 
seq 6 0x555883df2f00 auth_reply(proto 2 0 (0) Success) v1
  -203> 2018-03-22 15:52:47.312900 7fd5798fd700 10 monclient: dump:
epoch 7
fsid a5473adc-cfb8-4672-883e-40f5f6541a36
last_changed 2017-12-08 10:38:51.267030
created 2017-01-20 17:05:29.092109
0: x.x.1.17:6789/0 mon.filer1
1: x.x.1.18:6789/0 mon.filer2
2: x.x.1.21:6789/0 mon.master1

  -202> 2018-03-22 15:52:47.312950 7fd5798fd700  1 -- x.x.1.17:6803/122963511 
<== mon.0 x.x.1.17:6789/0 6  auth_reply(proto 2 0 (0) Success) v1  
194+0+0 (1424514407 0 0) 0x555883df2f00 con 0x555883eb5000
  -201> 2018-03-22 15:52:47.313072 7fd5798fd700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 2018-03-22 
15:52:17.313071)
  -200> 2018-03-22 15:52:47.313108 7fd580637200 10 monclient: 
wait_auth_rotating done
  -199> 2018-03-22 15:52:47.313172 7fd580637200 10 monclient: _renew_subs
  -198> 2018-03-22 15:52:47.313179 7fd580637200 

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-23 Thread Dietmar Rieder
Hi,


I encountered one more of these segfaults two days ago, and I opened a ticket:

http://tracker.ceph.com/issues/23431

In our case it is more like 1 every two weeks, for now...
And it is affecting different OSDs on different hosts.
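
In case it helps with the ticket, one possible way to turn the raw frame offset from
the backtrace into a symbol name — only a sketch, and it assumes the matching
ceph-debuginfo package for 12.2.4 is installed on that host:

  # resolve frame 1 of the pasted backtrace against the OSD binary
  addr2line -Cfe /usr/bin/ceph-osd 0xa3c611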

Dietmar

On 03/23/2018 11:50 AM, Oliver Freyermuth wrote:
> Hi together,
> 
> I notice exactly the same, also the same addresses, Luminous 12.2.4, CentOS 
> 7. 
> Sadly, logs are equally unhelpful. 
> 
> It happens randomly on an OSD about once per 2-3 days (of the 196 total OSDs 
> we have). It's also not a container environment. 
> 
> Cheers,
>   Oliver
> 
> Am 08.03.2018 um 15:00 schrieb Dietmar Rieder:
>> Hi,
>>
>> I noticed in my client (using cephfs) logs that an osd was unexpectedly
>> going down.
>> While checking the osd logs for the affected OSD I found that the osd
>> was seg faulting:
>>
>> []
>> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
>> (Segmentation fault) **
>>  in thread 7fd9af370700 thread_name:safe_timer
>>
>>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>> luminous (stable)
>>1: (()+0xa3c611) [0x564585904611]
>> 2: (()+0xf5e0) [0x7fd9b66305e0]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>> [...]
>>
>> Should I open a ticket for this? What additional information is needed?
>>
>>
>> I put the relevant log entries for download under [1], so maybe someone
>> with more
>> experience can find some useful information therein.
>>
>> Thanks
>>   Dietmar
>>
>>
>> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
>>
>>
>>
>>
> 
> 
> 
> 
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Ilya Dryomov
On Fri, Mar 23, 2018 at 11:48 AM,   wrote:
> The stock kernel from Debian is perfect.
> Spectre / Meltdown mitigations are worthless from a Ceph point of view,
> and should be disabled (again, strictly from a Ceph point of view)
>
> If you need the luminous features, using the userspace implementations
> is required (librbd via rbd-nbd or qemu, libcephfs via fuse etc)

luminous cluster-wide feature bits are supported since kernel 4.13.
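
For anyone checking their own setup, a quick way to see what the cluster currently
requires and what the client is running (a sketch; nothing beyond standard 12.x
commands is assumed):

  # on a cluster node: the client compat level recorded/required in the OSD map
  ceph osd dump | grep compat_client

  # feature bits reported by currently connected clients and daemons
  ceph features

  # on the client: the running kernel
  uname -r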

Thanks,

Ilya


Re: [ceph-users] Bluestore cluster, bad IO perf on blocksize<64k... could it be throttling ?

2018-03-23 Thread Ilya Dryomov
On Wed, Mar 21, 2018 at 6:50 PM, Frederic BRET  wrote:
> Hi all,
>
> The context :
> - Test cluster aside production one
> - Fresh install on Luminous
> - choice of Bluestore (coming from Filestore)
> - Default config (including wpq queuing)
> - 6 nodes SAS12, 14 OSD, 2 SSD, 2 x 10Gb nodes, far more Gb at each switch
> uplink...
> - R3 pool, 2 nodes per site
> - separate db (25GB) and wal (600MB) partitions on SSD for each OSD to be
> able to observe each kind of IO with iostat
> - RBD client fio --ioengine=libaio --iodepth=128 --direct=1
> - client RDB :  rbd map rbd/test_rbd -o queue_depth=1024
> - Just to point out, this is not a thread about SSD performance or the match
> between SSDs and the number of OSDs. These 12Gb SAS 10DWPD SSDs perform perfectly,
> with lots of headroom, on the production cluster even with XFS filestore and
> journals on SSDs.
> - This thread is about a possible bottleneck on small block sizes with
> rocksdb/wal/Bluestore.
>
> To begin with, Bluestore performance is really breathtaking compared to
> filestore/XFS : we saturate the 20Gb client bandwidth on this small test
> cluster as soon as the IO blocksize reaches 64k, something we couldn't achieve
> with Filestore and journals, even at 256k.
>
> The downside: all small IO blocksizes (4k, 8k, 16k, 32k) are considerably
> slower and appear somewhat capped.
>
> Just to compare, here are the observed latencies at two consecutive blocksize
> values, 64k and 32k :
> 64k :
>   write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
>  lat (msec): min=2, max=867, avg=17.29, stdev=32.31
>
> 32k :
>   write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
>  lat (msec): min=1, max=5111, avg=78.81, stdev=430.50
>
> Whereas the 64k run almost fills the 20Gb client connection, the 32k one gets
> only a mere 1/10th of the bandwidth, and IO latencies are multiplied by 4.5
> (or get a ~60ms pause ? ...)
>
> And we see the same constant latency at 16k, 8k and 4k :
> 16k :
>   write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
>  lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08
>
> 8k :
>   write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
>  lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61
>
> 4k :
>   write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
>  lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29
>
> To compare with Filestore, based on 4k IO results I have on hand from the previous
> install, we were getting almost 2x the Bluestore numbers on the exact same
> cluster :
> WRITE: io=1221.4MB, aggrb=41477KB/s, maxt=30152msec
>
> The thing is, during these small-blocksize fio benchmarks, none of the nodes' CPU,
> OSD, SSD, or of course network are saturated (i.e. I think this has nothing
> to do with write amplification), yet client IOPS starve at low
> values.
> Shouldn't Bluestore IOPS be far higher than Filestore on small IOs too ?
>
> To summarize, here is what we can observe :
>
>
> Seeking counters, I found in "perf dump" values that increment during the slow IO
> benchmarks, here for one run of 4k fio :
> "deferred_write_ops": 7631,
> "deferred_write_bytes": 31457280,

BlueStore data-journals any write smaller than min_alloc_size, because it has to
happen in place, whereas writes equal to or larger than that go directly to their
final location on disk.  IOW anything smaller than min_alloc_size is written twice.

The default min_alloc_size for HDD-backed OSDs is 64k.  That is what those counters
refer to.
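
A quick way to confirm this on one of the OSDs (a sketch — osd.0 is a placeholder,
and note that bluestore_min_alloc_size_* only takes effect when an OSD is created,
so changing it means redeploying the OSD):

  # deferred-write counters for this OSD
  ceph daemon osd.0 perf dump | grep deferred_write

  # the min_alloc_size values the OSD is running with
  ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
  ceph daemon osd.0 config get bluestore_min_alloc_size_ssd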

Thanks,

Ilya


Re: [ceph-users] IO rate-limiting with Ceph RBD (and libvirt)

2018-03-23 Thread Luis Periquito
On Fri, Mar 23, 2018 at 4:05 AM, Anthony D'Atri  wrote:
> FYI: I/O limiting in combination with OpenStack 10/12 + Ceph doesn't work
> properly. Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1476830
>
>
> That's an OpenStack bug, nothing to do with Ceph.  Nothing stops you from
> using virsh to throttle directly:
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-blockio-techniques
>
> https://github.com/cernceph/ceph-scripts/blob/master/tools/virsh-throttle-rbd.py
>

Actually it's not even an OpenStack bug; it's a misunderstanding of how volume
limits work: the flavor sets the limits on instances when booting from an image,
but the reporter is booting from volumes, so the limits have to be set via
Cinder volume types / QoS specs...
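
For reference, a rough sketch of the two usual approaches — the names/IDs below are
placeholders and the exact CLI can vary between OpenStack releases:

  # OpenStack: attach IOPS limits to a volume type via a QoS spec
  cinder qos-create rbd-limited consumer=front-end read_iops_sec=500 write_iops_sec=500
  cinder qos-associate <qos-spec-id> <volume-type-id>

  # or throttle an already-attached disk directly with libvirt
  virsh blkdeviotune <domain> vdb --total-iops-sec 500 --live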

>
>
>


Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-23 Thread Oliver Freyermuth
Hi together,

I notice exactly the same, also the same addresses, Luminous 12.2.4, CentOS 7. 
Sadly, logs are equally unhelpful. 

It happens randomly on an OSD about once per 2-3 days (of the 196 total OSDs we 
have). It's also not a container environment. 

Cheers,
Oliver

Am 08.03.2018 um 15:00 schrieb Dietmar Rieder:
> Hi,
> 
> I noticed in my client (using cephfs) logs that an osd was unexpectedly
> going down.
> While checking the osd logs for the affected OSD I found that the osd
> was seg faulting:
> 
> []
> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
> (Segmentation fault) **
>  in thread 7fd9af370700 thread_name:safe_timer
> 
>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
> luminous (stable)
>1: (()+0xa3c611) [0x564585904611]
> 2: (()+0xf5e0) [0x7fd9b66305e0]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> [...]
> 
> Should I open a ticket for this? What additional information is needed?
> 
> 
> I put the relevant log entries for download under [1], so maybe someone
> with more
> experience can find some useful information therein.
> 
> Thanks
>   Dietmar
> 
> 
> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
> 
> 
> 
> 






Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread ceph
The stock kernel from Debian is perfect.
Spectre / Meltdown mitigations are worthless from a Ceph point of view,
and should be disabled (again, strictly from a Ceph point of view)

If you need the luminous features, using the userspace implementations
is required (librbd via rbd-nbd or qemu, libcephfs via fuse etc)


On 03/23/2018 11:21 AM, Nicolas Huillard wrote:
> Hi all,
> 
> I'm using Luminous 12.2.4 on all servers, with Debian stock kernel.
> 
> I use the kernel cephfs/rbd on the client side, and have a choice of :
> * stock Debian 9 kernel 4.9 : LTS, Spectre/Meltdown mitigations in
> place, field-tested, probably old libceph inside.
> * backports kernel 4.14 : probably better Luminous support, no
> Spectre/Meltdown mitigations yet, much less tested (I may have
> experienced a kernel-related PPPoE problem lately), not long-term.
> 
> Which client kernel would you suggest re. Ceph ?
> Do the cephfs/rbd clients benefit from a significantly newer kernel ?
> I expect that the Ceph server-side kernel doesn't really matter.
> 
> TIA,
> 


[ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Nicolas Huillard
Hi all,

I'm using Luminous 12.2.4 on all servers, with Debian stock kernel.

I use the kernel cephfs/rbd on the client side, and have a choice of :
* stock Debian 9 kernel 4.9 : LTS, Spectre/Meltdown mitigations in
place, field-tested, probably old libceph inside.
* backports kernel 4.14 : probably better Luminous support, no
Spectre/Meltdown mitigations yet, much less tested (I may have
experienced a kernel-related PPPoE problem lately), not long-term.

Which client kernel would you suggest re. Ceph ?
Do the cephfs/rbd clients benefit from a significantly newer kernel ?
I expect that the Ceph server-side kernel doesn't really matter.

TIA,

-- 
Nicolas Huillard


[ceph-users] cephalocon slides/videos

2018-03-23 Thread Serkan Çoban
Hi,

Where can I find slides/videos of the conference?
I already tried (1), but cannot view the videos.

Serkan

1- http://www.itdks.com/eventlist/detail/1962


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-03-23 Thread Alexandre DERUMIER
Hi,

>>Did the fs have lots of mount/umount? 

Not too much; I have around 300 ceph-fuse clients (12.2.2 && 12.2.4) and the ceph 
cluster is 12.2.2.
Maybe when a client reboots, but that doesn't happen very often.


>> We recently found a memory leak
>> bug in that area https://github.com/ceph/ceph/pull/20148

OK, thanks. Do sessions only get opened/closed at mount/umount ?



I have another cluster with 64 fuse clients where MDS memory is around 500 MB
(with the default mds_cache_memory_limit, no tuning, and that ceph cluster is 12.2.4 
instead of 12.2.2).

Clients are also ceph-fuse 12.2.2 && 12.2.4



I'll try to upgrade this buggy MDS to 12.2.4 to see if it helps.
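
In case it's useful for comparing the two clusters, a couple of ways (a sketch;
mds.2 is just the id from the perf dump below) to see where the MDS memory goes:

  # cache usage versus the configured mds_cache_memory_limit
  ceph daemon mds.2 cache status

  # tcmalloc heap statistics for the daemon
  ceph tell mds.2 heap stats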

- Original Message -
From: "Zheng Yan" 
To: "aderumier" 
Cc: "ceph-users" 
Sent: Friday, 23 March 2018 01:08:46
Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

Did the fs have lots of mount/umount? We recently found a memory leak 
bug in that area https://github.com/ceph/ceph/pull/20148 

Regards 
Yan, Zheng 

On Thu, Mar 22, 2018 at 5:29 PM, Alexandre DERUMIER  
wrote: 
> Hi, 
> 
> I'm running cephfs for 2 months now, 
> 
> and my active mds memory usage is around 20G now (still growing). 
> 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> ceph 1521539 10.8 31.2 20929836 20534868 ? Ssl janv.26 8573:34 
> /usr/bin/ceph-mds -f --cluster ceph --id 2 --setuser ceph --setgroup ceph 
> 
> 
> this is on luminous 12.2.2 
> 
> only tuning done is: 
> 
> mds_cache_memory_limit = 5368709120 
> 
> 
> (5GB). I know it's a soft limit, but 20G seems quite huge vs 5GB 
> 
> 
> Is it normal ? 
> 
> 
> 
> 
> # ceph daemon mds.2 perf dump mds 
> { 
> "mds": { 
> "request": 1444009197, 
> "reply": 1443999870, 
> "reply_latency": { 
> "avgcount": 1443999870, 
> "sum": 1657849.656122933, 
> "avgtime": 0.001148095 
> }, 
> "forward": 0, 
> "dir_fetch": 51740910, 
> "dir_commit": 9069568, 
> "dir_split": 64367, 
> "dir_merge": 58016, 
> "inode_max": 2147483647, 
> "inodes": 2042975, 
> "inodes_top": 152783, 
> "inodes_bottom": 138781, 
> "inodes_pin_tail": 1751411, 
> "inodes_pinned": 1824714, 
> "inodes_expired": 7258145573, 
> "inodes_with_caps": 1812018, 
> "caps": 2538233, 
> "subtrees": 2, 
> "traverse": 1591668547, 
> "traverse_hit": 1259482170, 
> "traverse_forward": 0, 
> "traverse_discover": 0, 
> "traverse_dir_fetch": 30827836, 
> "traverse_remote_ino": 7510, 
> "traverse_lock": 86236, 
> "load_cent": 144401980319, 
> "q": 49, 
> "exported": 0, 
> "exported_inodes": 0, 
> "imported": 0, 
> "imported_inodes": 0 
> } 
> } 
