[ceph-users] Rebalancing an Erasure coded pool seems to move far more data than necessary

2018-05-25 Thread Jesus Cea
I have an Erasure Coded 8+2 pool with 8 PGs. Each PG is spread over 10 OSDs using Reed-Solomon (the Erasure Code). When I rebalance the cluster I see two PGs moving: "active+remapped+backfilling". A "pg dump" shows this: """ root@jcea:/srv# ceph --id jcea pg dump|grep backf dumped all 75.5 25
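A minimal sketch of watching the movement described above; the --id mirrors the snippet and the commands are illustrative, not a verified reproduction:
    # which PGs are currently remapped/backfilling, and overall recovery progress
    ceph --id jcea pg dump pgs_brief | grep -E 'remapped|backfilling'
    ceph --id jcea -s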

Re: [ceph-users] PG explosion with erasure codes, power of two and "x pools have many more objects per pg than average"

2018-05-25 Thread Jesus Cea
On 25/05/18 20:26, Paul Emmerich wrote: > Answers inline. > >> 2018-05-25 17:57 GMT+02:00 Jesus Cea: recommendation. Would be nice to know too if >> being "close" to a power of two is better than being far away and if it >> is better to be close but below or close but a little

Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
On 25/05/18 20:21, David Turner wrote: > If you start your pool with 12 PGs, 4 of them will have double the size > of the other 8.  It is 100% based on a power of 2 and has absolutely > nothing to do with the number you start with vs the number you increase > to.  If your PG count is not a power of

Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
OK, I am writing this so you don't waste your time correcting me. I beg your pardon. On 25/05/18 18:28, Jesus Cea wrote: > So, if I understand correctly, ceph tries to do the minimum splits. If > you increase PG from 8 to 12, it will split 4 PGs and leave the other 4 > PGs alone, creating an imba

Re: [ceph-users] Dependencies

2018-05-25 Thread David Turner
Admin nodes have zero impact on a Ceph cluster other than the commands you run on them. I personally like creating a single admin node for all of my clusters and creating tooling to use the proper config file and keyrings from there. Other than any scripts you keep on your admin node, there is nothi

Re: [ceph-users] PG explosion with erasure codes, power of two and "x pools have many more objects per pg than average"

2018-05-25 Thread Paul Emmerich
Answers inline. 2018-05-25 17:57 GMT+02:00 Jesus Cea : > Hi there. > > I have configured a POOL with an 8+2 erasure code. My target, by space > usage and OSD configuration, would be 128 PGs, but since each configured > PG will be using 10 actual "PGs", I have created the pool with only 8 PGs > (80 real

Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread David Turner
If you start your pool with 12 PGs, 4 of them will have double the size of the other 8. It is 100% based on a power of 2 and has absolutely nothing to do with the number you start with vs the number you increase to. If your PG count is not a power of 2 then you will have 2 different sizes of PGs
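A sketch of the practical takeaway, with an illustrative pool name: grow pg_num straight to the next power of two rather than an intermediate value, and keep pgp_num in step:
    # jump straight to a power of two so all PGs end up the same size
    ceph osd pool set mypool pg_num 16
    ceph osd pool set mypool pgp_num 16
    # inspect the resulting per-PG distribution
    ceph pg ls-by-pool mypool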

Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
On 17/05/18 20:36, David Turner wrote: > By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of > your PGs will be the same size and easier to balance and manage.  What > happens when you have a non base 2 number is something like this.  Say > you have 4 PGs that are all 2GB in si

[ceph-users] Dependencies

2018-05-25 Thread Marc-Antoine Desrochers
Hi, I want to know if there are any dependencies between the ceph admin node and the other nodes. Can I delete my ceph admin node and create a new one and link it to my OSD nodes? Or can I take all my existing OSDs in a node from cluster "A" and transfer them to cluster "B"?

[ceph-users] PG explosion with erasure codes, power of two and "x pools have many more objects per pg than average"

2018-05-25 Thread Jesus Cea
Hi there. I have configured a POOL with an 8+2 erasure code. My target, by space usage and OSD configuration, would be 128 PGs, but since each configured PG will be using 10 actual "PGs", I have created the pool with only 8 PGs (80 real PGs). Since I can increase PGs but not decrease them, this decision
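A sketch of the setup being described, with illustrative profile and pool names; each of the 8 PGs of an 8+2 pool maps to k+m = 10 OSDs, hence the "80 real PGs":
    ceph osd erasure-code-profile set ec82 k=8 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 8 8 erasure ec82
    ceph pg ls-by-pool ecpool    # 8 PGs, each with a 10-OSD acting set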

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-25 Thread Brady Deetz
I'm not sure this is a cache issue. To me, this feels like a memory leak. I'm now at 129GB (haven't had a window to upgrade yet) on a configured 80GB cache. [root@mds0 ceph-admin]# ceph daemon mds.mds0 cache status { "pool": { "items": 166753076, "bytes": 71766944952 } }
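A sketch of how the figures above can be cross-checked; the daemon name mds.mds0 is taken from the snippet, the rest is illustrative:
    # cache accounting vs. the configured limit
    ceph daemon mds.mds0 cache status
    ceph daemon mds.mds0 config get mds_cache_memory_limit
    # per-pool memory accounting of the daemon, to compare against its RSS
    ceph daemon mds.mds0 dump_mempools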

Re: [ceph-users] Ceph tech talk on deploy ceph with rook on kubernetes

2018-05-25 Thread Brett Niver
Is the recording available? I wasn't able to attend. Thanks, Brett On Thu, May 24, 2018 at 10:04 AM, Sage Weil wrote: > Starting now! > > https://redhat.bluejeans.com/967991495/ > > It'll be recorded and go up on youtube shortly as well.

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Patrick Donnelly
On Fri, May 25, 2018 at 6:46 AM, Oliver Freyermuth wrote: >> It might be possible to allow rename(2) to proceed in cases where >> nlink==1, but the behavior will probably seem inconsistent (some files get >> EXDEV, some don't). > > I believe even this would be extremely helpful, performance-wise.

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 15:39 schrieb Sage Weil: > On Fri, 25 May 2018, Oliver Freyermuth wrote: >> Dear Ric, >> >> I played around a bit - the common denominator seems to be: Moving it >> within a directory subtree below a directory for which max_bytes / >> max_files quota settings are set, things work

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Sage Weil
On Fri, 25 May 2018, Oliver Freyermuth wrote: > Dear Ric, > > I played around a bit - the common denominator seems to be: Moving it > within a directory subtree below a directory for which max_bytes / > max_files quota settings are set, things work fine. Moving it to another > directory tree wi

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 15:26 schrieb Luis Henriques: > Oliver Freyermuth writes: > >> Mhhhm... that's funny, I checked an mv with an strace now. I get: >> - >> access("/cephfs/some_folder/file", W_OK) = 0 >> rename("foo", "

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Sage, here you go, some_folder in reality is "/cephfs/group": # stat foo File: ‘foo’ Size: 1048576000 Blocks: 2048000 IO Block: 4194304 regular file Device: 27h/39d Inode: 1099515065517 Links: 1 Access: (0644/-rw-r--r--) Uid: (

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Ric, I played around a bit - the common denominator seems to be: Moving it within a directory subtree below a directory for which max_bytes / max_files quota settings are set, things work fine. Moving it to another directory tree without quota settings / with different quota settings, ren
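For reference, CephFS quotas of the kind mentioned here are plain virtual xattrs; a minimal sketch with an illustrative path and value:
    # set / read the quota attributes that define the quota boundary
    setfattr -n ceph.quota.max_bytes -v 100000000000 /cephfs/group
    getfattr -n ceph.quota.max_bytes /cephfs/group
    getfattr -n ceph.quota.max_files /cephfs/group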

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Luis Henriques
Oliver Freyermuth writes: > Mhhhm... that's funny, I checked an mv with an strace now. I get: > - > access("/cephfs/some_folder/file", W_OK) = 0 > rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-de

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Sage Weil
Can you paste the output of 'stat foo' and 'stat /cephfs/some_folder'? (Maybe also the same with 'stat -f'.) Thanks! sage On Fri, 25 May 2018, Ric Wheeler wrote: > That seems to be the issue - we need to understand why rename sees them as > different. > > Ric > > > On Fri, May 25, 2018, 9:1

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Ric Wheeler
That seems to be the issue - we need to understand why rename sees them as different. Ric On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth < freyerm...@physik.uni-bonn.de> wrote: > Mhhhm... that's funny, I checked an mv with an strace now. I get: > > -

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Mhhhm... that's funny, I checked an mv with an strace now. I get: - access("/cephfs/some_folder/file", W_OK) = 0 rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device link) unlink("/cephfs/some_fold
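A sketch of how such a trace can be captured (paths as in the snippet, otherwise illustrative); limiting strace to file-related syscalls keeps the rename/EXDEV result and any copy fallback visible:
    strace -f -e trace=file mv foo /cephfs/some_folder/file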

Re: [ceph-users] Different disk sizes after Luminous upgrade 12.2.2 --> 12.2.5

2018-05-25 Thread Eugen Block
Hi Igor, This difference was introduced by the following PR: https://github.com/ceph/ceph/pull/20487 (commit os/bluestore: do not account DB volume space in total one reported by statfs method). The rationale is to show block device capacity as total only. And don't add DB space to it. Th

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Ric Wheeler
We should look at what mv uses to see if it thinks the directories are on different file systems. If the fstat or whatever it looks at is confused, that might explain it. Ric On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth < freyerm...@physik.uni-bonn.de> wrote: > Am 25.05.2018 um 14:57 schrie

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 14:57 schrieb Ric Wheeler: > Is this move between directories on the same file system? It is, we only have a single CephFS in use. There's also only a single ceph-fuse client running. What's different, though, are different ACLs set for source and target directory, and owner /

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 14:50 schrieb John Spray: > On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth > wrote: >> Dear Cephalopodians, >> >> I was wondering why a simple "mv" is taking extraordinarily long on CephFS >> and must note that, >> at least with the fuse-client (12.2.5) and when moving a file

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Ric Wheeler
Is this move between directories on the same file system? Rename as a system call only works within a file system. The user space mv command falls back to a copy when the source and destination are not on the same file system. Regards, Ric On Fri, May 25, 2018, 8:51 AM John Spray wrote: > On Fri, May 25, 2018 at 1:10 PM, Oliver F

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread John Spray
On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth wrote: > Dear Cephalopodians, > > I was wondering why a simple "mv" is taking extraordinarily long on CephFS > and must note that, > at least with the fuse-client (12.2.5) and when moving a file from one > directory to another, > the file appear

Re: [ceph-users] How high-touch is ceph?

2018-05-25 Thread John Spray
On Fri, May 25, 2018 at 1:17 PM, Rhugga Harper wrote: > > I've been evaluating ceph as a solution for persistent block in our > kubernetes clusters for low-iops requirement applications. It doesn't do too > terribly badly with 32k workloads even though it's object storage under the > hood. > > Howev

[ceph-users] How high-touch is ceph?

2018-05-25 Thread Rhugga Harper
I've been evaluating ceph as a solution for persistent block in our kubernetes clusters for low-iops requirement applications. It doesn't do too terribly badly with 32k workloads even though it's object storage under the hood. However it seems this is a very high maintenance solution requiring you t

[ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Cephalopodians, I was wondering why a simple "mv" is taking extraordinarily long on CephFS and must note that, at least with the fuse-client (12.2.5) and when moving a file from one directory to another, the file appears to be copied first (byte by byte, traffic going through the client?)

Re: [ceph-users] Issues with RBD when rebooting

2018-05-25 Thread Maged Mokhtar
On 2018-05-25 12:11, Josef Zelenka wrote: > Hi, we are running a jewel cluster (54 OSDs, six nodes, Ubuntu 16.04) that > serves as a backend for openstack (newton) VMs. Today we had to reboot one of > the nodes (replicated pool, x2) and some of our VMs oopsed with issues with > their FS (mainly dat

Re: [ceph-users] Different disk sizes after Luminous upgrade 12.2.2 --> 12.2.5

2018-05-25 Thread Igor Fedotov
Hi Eugen, This difference was introduced by the following PR: https://github.com/ceph/ceph/pull/20487 (commit os/bluestore: do not account DB volume space in total one reported by statfs method). The rationale is to show block device capacity as total only. And don't add DB space to it. This

[ceph-users] Different disk sizes after Luminous upgrade 12.2.2 --> 12.2.5

2018-05-25 Thread Eugen Block
Hi list, we have a Luminous bluestore cluster with separate block.db/block.wal on SSDs. We were running version 12.2.2 and upgraded yesterday to 12.2.5. The upgrade went smoothly, but since the restart of the OSDs I noticed that 'ceph osd df' shows a different total disk size: ---cut here
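A minimal sketch of how the difference can be inspected (osd.0 is illustrative); per the replies below, since the referenced PR the DB volume is no longer counted in the total reported by statfs:
    ceph osd df
    # per-OSD BlueFS accounting of the separate DB device
    ceph daemon osd.0 perf dump | grep -A2 db_total_bytes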

Re: [ceph-users] Delete pool nicely

2018-05-25 Thread Paul Emmerich
Also, upgrade to luminous and migrate your OSDs to bluestore before using erasure coding. Luminous + Bluestore performs so much better for erasure coding than any of the old configurations. Also, I've found that deleting a large number of objects is far less stressful on a Bluestore OSD than on a

Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Paul Emmerich
If you are so worried about the storage efficiency: why not use erasure coding? EC performs really well with Luminous in our experience. Yes, you generate more IOPS and somewhat more CPU load and a higher latency. But it's often worth a try. Simple example for everyone considering 2/1 replicas: co
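Purely as an illustration of the two setups being compared (the pool names and the k=4/m=2 profile are assumptions, not from the truncated mail):
    # the risky 2/1 replicated pool
    ceph osd pool create rep2 128 128 replicated
    ceph osd pool set rep2 size 2
    ceph osd pool set rep2 min_size 1
    # an EC alternative that tolerates two simultaneous failures
    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ec42pool 128 128 erasure ec42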

Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Donny Davis
Nobody cares about their data until they don't have it anymore. Using replica 3 is the same logic as RAID6. It's likely that if one drive has crapped out, more will meet their maker soon. If you care about your data, then do what you can to keep it around. If it's a lab like mine, who cares, it's all ephe

[ceph-users] Issues with RBD when rebooting

2018-05-25 Thread Josef Zelenka
Hi, we are running a jewel cluster (54 OSDs, six nodes, Ubuntu 16.04) that serves as a backend for openstack (newton) VMs. Today we had to reboot one of the nodes (replicated pool, x2) and some of our VMs oopsed with issues with their FS (mainly database VMs, postgresql) - is there a reason for thi
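Not taken from the (truncated) replies, but a commonly used precaution for planned node reboots, sketched here with an illustrative pool name:
    ceph osd set noout      # keep CRUSH from rebalancing while the node is down
    # ... reboot the node, wait for its OSDs to come back up ...
    ceph osd unset noout
    # with size 2, min_size decides whether clients keep writing with one copy down
    ceph osd pool get mypool min_size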

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-25 Thread Yan, Zheng
On Fri, May 25, 2018 at 4:28 PM, Yan, Zheng wrote: > I found some memory leak. could you please try > https://github.com/ceph/ceph/pull/22240 > the leak only affects multiple active mds, I think it's unrelated to your issue. > > On Fri, May 25, 2018 at 1:49 PM, Alexandre DERUMIER > wrote: >> H

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-25 Thread Yan, Zheng
I found some memory leak. could you please try https://github.com/ceph/ceph/pull/22240 On Fri, May 25, 2018 at 1:49 PM, Alexandre DERUMIER wrote: > Here the result: > > > root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net flush journal > { > "message": "", > "return_code": 0 > } > root@ce

Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Janne Johansson
On Fri, 25 May 2018 at 00:20, Jack wrote: > On 05/24/2018 11:40 PM, Stefan Kooman wrote: > >> What are your thoughts, would you run 2x replication factor in > >> Production and in what scenarios? > Me neither, mostly because I have yet to read a technical point of view, > from someone who read and

Re: [ceph-users] ceph-disk is getting removed from master

2018-05-25 Thread Konstantin Shalygin
ceph-disk should be considered "frozen" and deprecated for Mimic, in favor of ceph-volume. Will ceph-volume continue to support bare block devices, i.e. without lvm'ish stuff? k
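For context, a sketch of the current ceph-volume invocation; it does accept a bare device, but it carves an LVM logical volume on it, which is exactly the "lvm'ish stuff" the question is about (device and vg/lv names are illustrative):
    ceph-volume lvm create --bluestore --data /dev/sdb
    # or hand it an existing logical volume instead
    ceph-volume lvm create --bluestore --data vg_ceph/lv_osd0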