[ceph-users] Near Perfect PG distribution apart from two OSDs

2020-01-09 Thread Ashley Merrick
Hey,



I have a cluster of 30 OSDs with a near perfect distribution, apart from two OSDs.

I am running ceph version 14.2.6, but the behaviour has been the same on previous 
versions. I have the balancer module enabled in upmap mode and it reports no 
further improvements; I have also tried crush-compat mode.



ceph balancer status

{
    "last_optimize_duration": "0:00:01.123659",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Fri Jan 10 06:11:08 2020"
}



I have read a few email threads on the ML recently about similar cases, but I am 
not sure if I am hitting the same "bug", as only two OSDs are off here and the 
rest are almost perfect.
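For anyone comparing notes, this is roughly what I would try next (a sketch only; 
the plan name is arbitrary and the PG id is a placeholder, the OSD ids are taken 
from the output below, so double-check against your own cluster):

# score the current distribution and see how hard the balancer is trying
ceph balancer eval
# the upmap balancer stops once OSDs are within upmap_max_deviation PGs of
# the mean; lowering it (e.g. to 1) makes it keep optimizing
ceph config set mgr mgr/balancer/upmap_max_deviation 1
ceph balancer optimize myplan && ceph balancer show myplan

# as a last resort, individual PGs can be remapped by hand, e.g. move one
# PG (placeholder id) from the fullest OSD below (osd.7) to an underfull one (osd.29):
ceph osd pg-upmap-items 2.1a 7 29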



ceph osd df

ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS
23   hdd 0.00999  1.0  10 GiB 1.4 GiB 434 MiB 1.4 MiB 1023 MiB 8.6 GiB 14.24 0.21  33 up
24   hdd 0.00999  1.0  10 GiB 1.4 GiB 441 MiB  48 KiB 1024 MiB 8.6 GiB 14.31 0.21  34 up
25   hdd 0.00999  1.0  10 GiB 1.4 GiB 435 MiB  24 KiB 1024 MiB 8.6 GiB 14.26 0.21  34 up
26   hdd 0.00999  1.0  10 GiB 1.4 GiB 436 MiB 1.4 MiB 1023 MiB 8.6 GiB 14.27 0.21  34 up
27   hdd 0.00999  1.0  10 GiB 1.4 GiB 437 MiB  16 KiB 1024 MiB 8.6 GiB 14.27 0.21  33 up
28   hdd 0.00999  1.0  10 GiB 1.4 GiB 436 MiB  36 KiB 1024 MiB 8.6 GiB 14.26 0.21  34 up
 3   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  76 KiB   19 GiB 3.0 TiB 67.26 1.00 170 up
 4   hdd 9.09599  1.0 9.1 TiB 6.2 TiB 6.1 TiB  44 KiB   19 GiB 2.9 TiB 67.77 1.01 172 up
 5   hdd 9.09599  1.0 9.1 TiB 6.3 TiB 6.3 TiB 112 KiB   20 GiB 2.8 TiB 69.50 1.03 176 up
 6   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  17 KiB   19 GiB 2.9 TiB 67.58 1.01 171 up
 7   hdd 9.09599  1.0 9.1 TiB 6.7 TiB 6.7 TiB  88 KiB   21 GiB 2.4 TiB 73.98 1.10 187 up
 8   hdd 9.09599  1.0 9.1 TiB 6.5 TiB 6.5 TiB  76 KiB   20 GiB 2.6 TiB 71.84 1.07 182 up
 9   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB 120 KiB   19 GiB 3.0 TiB 67.24 1.00 170 up
10   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  72 KiB   19 GiB 3.0 TiB 67.19 1.00 170 up
11   hdd 9.09599  1.0 9.1 TiB 6.2 TiB 6.2 TiB  40 KiB   19 GiB 2.9 TiB 68.06 1.01 172 up
12   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  28 KiB   19 GiB 3.0 TiB 67.48 1.00 170 up
13   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  36 KiB   19 GiB 3.0 TiB 67.04 1.00 170 up
14   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB 108 KiB   19 GiB 3.0 TiB 67.30 1.00 170 up
15   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  68 KiB   19 GiB 3.0 TiB 67.41 1.00 170 up
16   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB 152 KiB   19 GiB 2.9 TiB 67.61 1.01 171 up
17   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  36 KiB   19 GiB 3.0 TiB 67.16 1.00 170 up
18   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  41 KiB   19 GiB 3.0 TiB 67.19 1.00 170 up
19   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  64 KiB   19 GiB 3.0 TiB 67.49 1.00 171 up
20   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  12 KiB   19 GiB 3.0 TiB 67.55 1.01 171 up
21   hdd 9.09599  1.0 9.1 TiB 6.2 TiB 6.1 TiB  76 KiB   19 GiB 2.9 TiB 67.76 1.01 171 up
22   hdd 9.09599  1.0 9.1 TiB 6.2 TiB 6.2 TiB  12 KiB   19 GiB 2.9 TiB 68.05 1.01 172 up
29   hdd 9.09599  1.0 9.1 TiB 5.8 TiB 5.8 TiB 108 KiB   17 GiB 3.3 TiB 63.59 0.95 163 up
30   hdd 9.09599  1.0 9.1 TiB 5.9 TiB 5.9 TiB  24 KiB   18 GiB 3.2 TiB 65.18 0.97 167 up
31   hdd 9.09599  1.0 9.1 TiB 6.1 TiB 6.1 TiB  44 KiB   18 GiB 3.0 TiB 66.74 0.99 171 up
32   hdd 9.09599  1.0 9.1 TiB 6.0 TiB 6.0 TiB 220 KiB   18 GiB 3.1 TiB 66.31 0.99 170 up
33   hdd 9.09599  1.0 9.1 TiB 6.0 TiB 5.9 TiB  36 KiB   18 GiB 3.1 TiB 65.54 0.98 168 up
34   hdd 9.09599  1.0 9.1 TiB 6.0 TiB 6.0 TiB  44 KiB   18 GiB 3.1 TiB 66.33 0.99 170 up
35   hdd 9.09599  1.0 9.1 TiB 5.9 TiB 5.9 TiB  68 KiB   18 GiB 3.2 TiB 64.77 0.96 166 up
36   hdd 9.09599  1.0 9.1 TiB 5.8 TiB 5.8 TiB 168 KiB   17 GiB 3.3 TiB 63.60 0.95 163 up
37   hdd 9.09599  1.0 9.1 TiB 6.0 TiB 6.0 TiB  60 KiB   18 GiB 3.1 TiB 65.91 0.98 169 up
38   hdd 9.09599  1.0 9.1 TiB 5.9 TiB 5.9 TiB  68 KiB   18 GiB 3.2 TiB 65.15 0.97 167 up
 0   hdd 0.00999  1.0  10 GiB 1.4 GiB 437 MiB  28 KiB 1024 MiB 8.6 GiB 14.27 0.21  34 up
 1   hdd 0.00999  1.0  10 GiB 1.4 GiB 434 MiB 1.4 MiB 1023 MiB 8.6 GiB 14.24 0.21  34 up
 2   hdd 0.00999  1.0  10 GiB 1.4 GiB 439 MiB  36 KiB 1024 MiB 8.6 GiB 14.29 0.21  33 up
23   hdd 0.00999  1.0  10 GiB 1.4 GiB 434 MiB 1.4 MiB 1023 MiB 8.6 GiB 14.24 0.21  33 up
24   hdd 0.00999  1.0  10 GiB 1.4 GiB 441 MiB  48 KiB 1024 MiB 8.6 GiB 14.31 0.21  34 up
25   hdd 0.00999  1.0  10 

Re: [ceph-users] Looking for experience

2020-01-09 Thread Mainor Daly


 
 
  
Hi Stefan,

before I give some suggestions, can you first describe the use case for which you 
want to use that setup? Also, which aspects are important for you?

Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote on 9 January 2020 at 22:52:

As a starting point the current idea is to use something like:

4-6 nodes with 12x 12tb disks each
128G Memory
AMD EPYC 7302P 3GHz, 16C/32T
128GB RAM

Something to discuss is

- EC or go with 3 replicas. We'll use bluestore with compression.
- Do we need something like Intel Optane for WAL / DB or not?

Since we started using ceph we're mostly subscribed to SSDs - so no
knowledge about HDD in place.

Greets,
Stefan

On 09.01.20 at 16:49, Stefan Priebe - Profihost AG wrote:

On 09.01.2020 at 16:10, Wido den Hollander <w...@42on.com> wrote:

On 1/9/20 2:27 PM, Stefan Priebe - Profihost AG wrote:

Hi Wido,

On 09.01.20 at 14:18, Wido den Hollander wrote:

On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:

On 09.01.20 at 13:39, Janne Johansson wrote:

I'm currently trying to work out a concept for a ceph cluster which can
be used as a target for backups which satisfies the following
requirements:

- approx. write speed of 40.000 IOP/s and 2500 Mbyte/s

You might need to have a large (at least non-1) number of writers to get
to that sum of operations, as opposed to trying to reach it with one
single stream written from one single client.

We are aiming for about 100 writers.

So if I read it correctly the writes will be 64k each.

may be ;-) see below

That should be doable, but you probably want something like NVMe for DB+WAL.

You might want to tune that larger writes also go into the WAL to speed
up the ingress writes. But you mainly want more spindles than less.

I would like to give a little bit more insight about this and most
probably some overhead we currently have in those numbers. Those values
come from our old classic raid storage boxes. Those use btrfs + zlib
compression + subvolumes for those backups and we've collected those
numbers from all of them.

The new system should just replicate snapshots from the live ceph.
Hopefully being able to use Erasure Coding and compression? ;-)

Compression might work, but only if the data is compressible.

EC usually writes very fast, so that's good. I would recommend a lot of
spindles those. More spindles == more OSDs == more performance.

So instead of using 12TB drives you can consider 6TB or 8TB drives.

Currently we have a lot of 5TB 2.5" drives in place so we could use them. We
would like to start with around 4000 IOPS and 250 MB per second while using
24-drive boxes. We could place one or two NVMe PCIe cards in them.

Stefan

Wido

Greets,
Stefan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

   
   
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread JC Lopez
Hi,

You can actually specify the features you want enabled at creation time, so there 
is no need to remove a feature afterwards.

To illustrate Ilya’s message: rbd create rbd/test --size=128M 
--image-feature=layering,striping --stripe-count=8 --stripe-unit=4K

The object size is left at the default here, but it can also be altered with 
--object-size.

Best regards
JC


> On Jan 9, 2020, at 18:32, Kyriazis, George  wrote:
> 
> 
> 
>> On Jan 9, 2020, at 2:16 PM, Ilya Dryomov wrote:
>> 
>> On Thu, Jan 9, 2020 at 2:52 PM Kyriazis, George
>> <george.kyria...@intel.com> wrote:
>>> 
>>> Hello ceph-users!
>>> 
>>> My setup is that I’d like to use RBD images as a replication target of a 
>>> FreeNAS zfs pool.  I have a 2nd FreeNAS (in a VM) to act as a backup target 
>>> in which I mount the RBD image.  All this (except the source FreeNAS 
>>> server) is in Proxmox.
>>> 
>>> Since I am using RBD as a backup target, performance is not really 
>>> critical, but I still don’t want it to take months to complete the backup.  
>>> My source pool size is in the order of ~30TB.
>>> 
>>> I’ve set up an EC RBD pool (and the matching replicated pool) and created 
>>> image with no problems.  However, with the stock 4MB object size, backup 
>>> speed in quite slow.  I tried creating an image with 4K object size, but 
>>> even for a relatively small image size (of 1TB), I get:
>>> 
>>> # rbd -p rbd_backup create vm-118-disk-0 --size 1T --object-size 4K 
>>> --data-pool rbd_ec
>>> 2020-01-09 07:40:27.120 7f3e4aa15f40 -1 librbd::image::CreateRequest: 
>>> validate_layout: image size not compatible with object map
>>> rbd: create error: (22) Invalid argument
>>> #
>> 
>> Yeah, this is an object map limitation.  Given that this is a backup
>> target, you don't really need the object map feature.  Disable it with
>> "rbd feature disable vm-118-disk-0 object-map" and you should be able
>> to create an image of any size.
>> 
> Hmm.. Except I can’t disable a feature on a image that I haven’t created yet. 
> :-). I can start creating a smaller image, and resize after I remove that 
> feature.
> 
>> That said, are you sure that object size is the issue?  If you expect
>> small sequential writes and want them to go to different OSDs, look at
>> using a fancy striping pattern instead of changing the object size:
>> 
>>  https://docs.ceph.com/docs/master/man/8/rbd/#striping 
>> 
>> 
>> E.g. with --stripe-unit 4K --stripe-count 8, the first 4K will go to
>> object 1, the second 4K to object 2, etc.  The ninth 4K will return to
>> object 1, the tenth to object 2, etc.  When objects 1-8 become full, it
>> will move on to objects 9-16, then to 17-24, etc.
>> 
>> This way you get the increased parallelism without the very significant
>> overhead of tons of small objects (if your OSDs are capable enough).
>> 
> Thanks for the suggestions.  After yours and Stefan’s suggestions, I’ll 
> experiment a little bit with various parameters and see what gets me the best 
> performance.
> 
> Thanks all!
> 
> George
> 
>> Thanks,
>> 
>>Ilya
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Kyriazis, George


On Jan 9, 2020, at 2:16 PM, Ilya Dryomov <idryo...@gmail.com> wrote:

On Thu, Jan 9, 2020 at 2:52 PM Kyriazis, George
<george.kyria...@intel.com> wrote:

Hello ceph-users!

My setup is that I’d like to use RBD images as a replication target of a 
FreeNAS zfs pool.  I have a 2nd FreeNAS (in a VM) to act as a backup target in 
which I mount the RBD image.  All this (except the source FreeNAS server) is in 
Proxmox.

Since I am using RBD as a backup target, performance is not really critical, 
but I still don’t want it to take months to complete the backup.  My source 
pool size is in the order of ~30TB.

I’ve set up an EC RBD pool (and the matching replicated pool) and created image 
with no problems.  However, with the stock 4MB object size, backup speed in 
quite slow.  I tried creating an image with 4K object size, but even for a 
relatively small image size (of 1TB), I get:

# rbd -p rbd_backup create vm-118-disk-0 --size 1T --object-size 4K --data-pool 
rbd_ec
2020-01-09 07:40:27.120 7f3e4aa15f40 -1 librbd::image::CreateRequest: 
validate_layout: image size not compatible with object map
rbd: create error: (22) Invalid argument
#

Yeah, this is an object map limitation.  Given that this is a backup
target, you don't really need the object map feature.  Disable it with
"rbd feature disable vm-118-disk-0 object-map" and you should be able
to create an image of any size.

Hmm.. Except I can’t disable a feature on an image that I haven’t created yet 
:-). I can start creating a smaller image, and resize it after I remove that 
feature.

That said, are you sure that object size is the issue?  If you expect
small sequential writes and want them to go to different OSDs, look at
using a fancy striping pattern instead of changing the object size:

 https://docs.ceph.com/docs/master/man/8/rbd/#striping

E.g. with --stripe-unit 4K --stripe-count 8, the first 4K will go to
object 1, the second 4K to object 2, etc.  The ninth 4K will return to
object 1, the tenth to object 2, etc.  When objects 1-8 become full, it
will move on to objects 9-16, then to 17-24, etc.

This way you get the increased parallelism without the very significant
overhead of tons of small objects (if your OSDs are capable enough).

Thanks for the suggestions.  After yours and Stefan’s suggestions, I’ll 
experiment a little bit with various parameters and see what gets me the best 
performance.

Thanks all!

George

Thanks,

   Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Ed Kalk
It sounds like an I/O bottleneck (either max IOPS or max throughput) in 
the making.


If you are looking for cold-storage archival data only, then it may be 
OK (if it doesn't matter how long it takes to write the data).


If this is production data with any sort of IOPs load or data change 
rate, I'd be concerned.


Too-big spinning disks will get killed on seek times, and too many/too-big 
spinners will likely bottleneck the I/O controller. It would be better to use 
more, cheaper nodes to yield far more, smaller disks (2TB max): more disks, 
more I/O controllers, more motherboards = more performance. Think "scale out" 
in the number of nodes, not "scale up" of the individual nodes.
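(To put rough numbers on that, using rule-of-thumb figures rather than 
measurements: the stated 500 TB usable at 3x replication is ~1.5 PB raw, i.e. 
roughly 125 x 12 TB drives. 2500 Mbyte/s over 125 spindles is a comfortable 
~20 MB/s each, but 40.000 IOP/s is ~320 IOPS per spindle, far beyond the 
100-200 random IOPS a 7.2k HDD can deliver -- which is exactly why more, 
smaller disks and/or flash for DB/WAL help here.)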


-Ed

Software Defined Storage Engineer


On 1/9/2020 3:52 PM, Stefan Priebe - Profihost AG wrote:

As a starting point the current idea is to use something like:

4-6 nodes with 12x 12tb disks each
128G Memory
AMD EPYC 7302P 3GHz, 16C/32T
128GB RAM

Something to discuss is

- EC or go with 3 replicas. We'll use bluestore with compression.
- Do we need something like Intel Optane for WAL / DB or not?

Since we started using ceph we're mostly subscribed to SSDs - so no
knowlege about HDD in place.

Greets,
Stefan
Am 09.01.20 um 16:49 schrieb Stefan Priebe - Profihost AG:

Am 09.01.2020 um 16:10 schrieb Wido den Hollander :




On 1/9/20 2:27 PM, Stefan Priebe - Profihost AG wrote:
Hi Wido,

Am 09.01.20 um 14:18 schrieb Wido den Hollander:


On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:

Am 09.01.20 um 13:39 schrieb Janne Johansson:

I'm currently trying to workout a concept for a ceph cluster which can
be used as a target for backups which satisfies the following
requirements:

- approx. write speed of 40.000 IOP/s and 2500 Mbyte/s


You might need to have a large (at least non-1) number of writers to get
to that sum of operations, as opposed to trying to reach it with one
single stream written from one single client.


We are aiming for about 100 writers.

So if I read it correctly the writes will be 64k each.

may be ;-) see below


That should be doable, but you probably want something like NVMe for DB+WAL.

You might want to tune that larger writes also go into the WAL to speed
up the ingress writes. But you mainly want more spindles then less.

I would like to give a little bit more insight about this and most
probobly some overhead we currently have in those numbers. Those values
come from our old classic raid storage boxes. Those use btrfs + zlib
compression + subvolumes for those backups and we've collected those
numbers from all of them.

The new system should just replicate snapshots from the live ceph.
Hopefully being able to use Erase Coding and compression? ;-)


Compression might work, but only if the data is compressable.

EC usually writes very fast, so that's good. I would recommend a lot of
spindles those. More spindles == more OSDs == more performance.

So instead of using 12TB drives you can consider 6TB or 8TB drives.

Currently we have a lot of 5TB 2.5 drives in place so we could use them.we 
would like to start with around 4000 Iops and 250 MB per second while using 24 
Drive boxes. We could please one or two NVMe PCIe cards in them.


Stefan


Wido


Greets,
Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Stefan Priebe - Profihost AG
As a starting point the current idea is to use something like:

4-6 nodes with 12x 12tb disks each
128G Memory
AMD EPYC 7302P 3GHz, 16C/32T
128GB RAM

Something to discuss is

- EC or go with 3 replicas. We'll use bluestore with compression.
- Do we need something like Intel Optane for WAL / DB or not?

Since we started using ceph we're mostly subscribed to SSDs - so no
knowledge about HDD in place.

Greets,
Stefan
On 09.01.20 at 16:49, Stefan Priebe - Profihost AG wrote:
> 
>> Am 09.01.2020 um 16:10 schrieb Wido den Hollander :
>>
>> 
>>
>>> On 1/9/20 2:27 PM, Stefan Priebe - Profihost AG wrote:
>>> Hi Wido,
 Am 09.01.20 um 14:18 schrieb Wido den Hollander:


 On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:
>
> Am 09.01.20 um 13:39 schrieb Janne Johansson:
>>
>>I'm currently trying to workout a concept for a ceph cluster which can
>>be used as a target for backups which satisfies the following
>>requirements:
>>
>>- approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
>>
>>
>> You might need to have a large (at least non-1) number of writers to get
>> to that sum of operations, as opposed to trying to reach it with one
>> single stream written from one single client. 
>
>
> We are aiming for about 100 writers.

 So if I read it correctly the writes will be 64k each.
>>>
>>> may be ;-) see below
>>>
 That should be doable, but you probably want something like NVMe for 
 DB+WAL.

 You might want to tune that larger writes also go into the WAL to speed
 up the ingress writes. But you mainly want more spindles then less.
>>>
>>> I would like to give a little bit more insight about this and most
>>> probobly some overhead we currently have in those numbers. Those values
>>> come from our old classic raid storage boxes. Those use btrfs + zlib
>>> compression + subvolumes for those backups and we've collected those
>>> numbers from all of them.
>>>
>>> The new system should just replicate snapshots from the live ceph.
>>> Hopefully being able to use Erase Coding and compression? ;-)
>>>
>>
>> Compression might work, but only if the data is compressable.
>>
>> EC usually writes very fast, so that's good. I would recommend a lot of
>> spindles those. More spindles == more OSDs == more performance.
>>
>> So instead of using 12TB drives you can consider 6TB or 8TB drives.
> 
> Currently we have a lot of 5TB 2.5 drives in place so we could use them.we 
> would like to start with around 4000 Iops and 250 MB per second while using 
> 24 Drive boxes. We could please one or two NVMe PCIe cards in them.
> 
> 
> Stefan
> 
>>
>> Wido
>>
>>> Greets,
>>> Stefan
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Ilya Dryomov
On Thu, Jan 9, 2020 at 2:52 PM Kyriazis, George
 wrote:
>
> Hello ceph-users!
>
> My setup is that I’d like to use RBD images as a replication target of a 
> FreeNAS zfs pool.  I have a 2nd FreeNAS (in a VM) to act as a backup target 
> in which I mount the RBD image.  All this (except the source FreeNAS server) 
> is in Proxmox.
>
> Since I am using RBD as a backup target, performance is not really critical, 
> but I still don’t want it to take months to complete the backup.  My source 
> pool size is in the order of ~30TB.
>
> I’ve set up an EC RBD pool (and the matching replicated pool) and created 
> image with no problems.  However, with the stock 4MB object size, backup 
> speed in quite slow.  I tried creating an image with 4K object size, but even 
> for a relatively small image size (of 1TB), I get:
>
> # rbd -p rbd_backup create vm-118-disk-0 --size 1T --object-size 4K 
> --data-pool rbd_ec
> 2020-01-09 07:40:27.120 7f3e4aa15f40 -1 librbd::image::CreateRequest: 
> validate_layout: image size not compatible with object map
> rbd: create error: (22) Invalid argument
> #

Yeah, this is an object map limitation.  Given that this is a backup
target, you don't really need the object map feature.  Disable it with
"rbd feature disable vm-118-disk-0 object-map" and you should be able
to create an image of any size.

That said, are you sure that object size is the issue?  If you expect
small sequential writes and want them to go to different OSDs, look at
using a fancy striping pattern instead of changing the object size:

  https://docs.ceph.com/docs/master/man/8/rbd/#striping

E.g. with --stripe-unit 4K --stripe-count 8, the first 4K will go to
object 1, the second 4K to object 2, etc.  The ninth 4K will return to
object 1, the tenth to object 2, etc.  When objects 1-8 become full, it
will move on to objects 9-16, then to 17-24, etc.

This way you get the increased parallelism without the very significant
overhead of tons of small objects (if your OSDs are capable enough).
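For completeness, a create command along these lines would combine the fancy 
striping above with the EC data pool from the original post (a sketch only; the 
40T size, stripe numbers and pool/image names are taken from this thread, so 
adjust to taste):

rbd -p rbd_backup create vm-118-disk-0 --size 40T --data-pool rbd_ec \
    --image-feature layering,striping --stripe-unit 4K --stripe-count 8

With the object size left at the default 4M, the object-map size limit from the 
error above does not come into play either.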

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
Quoting Kyriazis, George (george.kyria...@intel.com):
> 
> Hmm, I meant you can use large block size for the large files and small
> block size for the small files.
> 
> Sure, but how to do that.  As far as I know block size is a property of the 
> pool, not a single file.

recordsize: https://blog.programster.org/zfs-record-size,
https://blogs.oracle.com/roch/tuning-zfs-recordsize
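As a concrete sketch (dataset names are made up; recordsize only affects newly 
written blocks, and 1M records need the large_blocks pool feature, which recent 
FreeBSD/FreeNAS releases enable by default as far as I know):

# recordsize is a per-dataset property, so big and small files can
# live in datasets with different values
zfs set recordsize=1M tank/backups/large
zfs set recordsize=16K tank/backups/small
zfs get recordsize tank/backups/large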

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Kyriazis, George

On Jan 9, 2020, at 9:27 AM, Stefan Kooman mailto:ste...@bit.nl>> 
wrote:

Quoting Kyriazis, George 
(george.kyria...@intel.com):


On Jan 9, 2020, at 8:00 AM, Stefan Kooman mailto:ste...@bit.nl>> 
wrote:

Quoting Kyriazis, George 
(george.kyria...@intel.com):

The source pool has mainly big files, but there are quite a few
smaller (<4KB) files that I’m afraid will create waste if I create the
destination zpool with ashift > 12 (>4K blocks).  I am not sure,
though, if ZFS will actually write big files in consecutive blocks
(through a send/receive), so maybe the blocking factor is not the
actual file size, but rather the zfs block size.  I am planning on
using zfs gzip-9 compression on the destination pool, if it matters.

You might want to consider Zstandard for compression:
https://engineering.fb.com/core-data/smaller-and-faster-data-compression-with-zstandard/

Thanks for the pointer.  Sorry, I am not sure how you are suggesting
to using zstd, since it’s not part of the standard zfs compression
algorithms.

It's in FreeBSD ... and should be in ZOL soon:
https://github.com/zfsonlinux/zfs/pull/9735

FreeNAS is based on FreeBSD, so it will make it there…. eventually.  But 
compression is not my problem, I have enough horsepower to deal with gzip-9.  
It’s not the bottleneck.  Ceph file I/O is.

You can optimize a ZFS fs to use larger blocks for those files that are
small ... and use large block sizes for other fs ... if it's easy to
split them.

From what I understand, zfs uses a single block per file, if files are
<4K, ie. It does not put 2 small files in a single block.  How would
larger blocks help small files?  Also, as far as I know ashift is a
pool property, set only at pool creation.

Hmm, I meant you can use large block size for the large files and small
block size for the small files.

Sure, but how do I do that?  As far as I know block size is a property of the 
pool, not of a single file.

Thanks!

George


I don’t have control over the original files and how they are stored
in the source server.  These are user’s files.

Then you somehow need to find a middle ground.

Gr. Stefan

--
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / 
i...@bit.nl

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Stefan Priebe - Profihost AG

> On 09.01.2020 at 16:10, Wido den Hollander wrote:
> 
> 
> 
>> On 1/9/20 2:27 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Wido,
>>> Am 09.01.20 um 14:18 schrieb Wido den Hollander:
>>> 
>>> 
>>> On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:
 
 Am 09.01.20 um 13:39 schrieb Janne Johansson:
> 
>I'm currently trying to workout a concept for a ceph cluster which can
>be used as a target for backups which satisfies the following
>requirements:
> 
>- approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
> 
> 
> You might need to have a large (at least non-1) number of writers to get
> to that sum of operations, as opposed to trying to reach it with one
> single stream written from one single client. 
 
 
 We are aiming for about 100 writers.
>>> 
>>> So if I read it correctly the writes will be 64k each.
>> 
>> may be ;-) see below
>> 
>>> That should be doable, but you probably want something like NVMe for DB+WAL.
>>> 
>>> You might want to tune that larger writes also go into the WAL to speed
>>> up the ingress writes. But you mainly want more spindles then less.
>> 
>> I would like to give a little bit more insight about this and most
>> probobly some overhead we currently have in those numbers. Those values
>> come from our old classic raid storage boxes. Those use btrfs + zlib
>> compression + subvolumes for those backups and we've collected those
>> numbers from all of them.
>> 
>> The new system should just replicate snapshots from the live ceph.
>> Hopefully being able to use Erase Coding and compression? ;-)
>> 
> 
> Compression might work, but only if the data is compressable.
> 
> EC usually writes very fast, so that's good. I would recommend a lot of
> spindles those. More spindles == more OSDs == more performance.
> 
> So instead of using 12TB drives you can consider 6TB or 8TB drives.

Currently we have a lot of 5TB 2.5" drives in place so we could use them. We 
would like to start with around 4000 IOPS and 250 MB per second while using 
24-drive boxes. We could place one or two NVMe PCIe cards in them.


Stefan

> 
> Wido
> 
>> Greets,
>> Stefan
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
Quoting Kyriazis, George (george.kyria...@intel.com):
> 
> 
> > On Jan 9, 2020, at 8:00 AM, Stefan Kooman  wrote:
> > 
> > Quoting Kyriazis, George (george.kyria...@intel.com):
> > 
> >> The source pool has mainly big files, but there are quite a few
> >> smaller (<4KB) files that I’m afraid will create waste if I create the
> >> destination zpool with ashift > 12 (>4K blocks).  I am not sure,
> >> though, if ZFS will actually write big files in consecutive blocks
> >> (through a send/receive), so maybe the blocking factor is not the
> >> actual file size, but rather the zfs block size.  I am planning on
> >> using zfs gzip-9 compression on the destination pool, if it matters.
> > 
> > You might want to consider Zstandard for compression:
> > https://engineering.fb.com/core-data/smaller-and-faster-data-compression-with-zstandard/
> > 
> Thanks for the pointer.  Sorry, I am not sure how you are suggesting
> to using zstd, since it’s not part of the standard zfs compression
> algorithms.

It's in FreeBSD ... and should be in ZOL soon:
https://github.com/zfsonlinux/zfs/pull/9735

> > You can optimize a ZFS fs to use larger blocks for those files that are
> > small ... and use large block sizes for other fs ... if it's easy to
> > split them.
> > 
> From what I understand, zfs uses a single block per file, if files are
> <4K, ie. It does not put 2 small files in a single block.  How would
> larger blocks help small files?  Also, as far as I know ashift is a
> pool property, set only at pool creation.

Hmm, I meant you can use large block size for the large files and small
block size for the small files.

> 
> I don’t have control over the original files and how they are stored
> in the source server.  These are user’s files.

Then you somehow need to find a middle ground.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Joachim Kraftmayer
I would try to scale horizontally with smaller ceph nodes, so you have 
the advantage of being able to choose an EC profile that does not 
require too much overhead and you can use failure domain host.
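As an illustration of that (a sketch only: the profile/pool names and PG count 
are placeholders, and k=4 m=2 with crush-failure-domain=host needs at least 6 
hosts, so it only fits the 6-node variant of the plan):

ceph osd erasure-code-profile set backup-ec k=4 m=2 crush-failure-domain=host
ceph osd pool create backup-data 1024 1024 erasure backup-ec
# needed if the pool is used as an RBD or CephFS data pool:
ceph osd pool set backup-data allow_ec_overwrites true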


Joachim


On 09.01.2020 at 15:31, Wido den Hollander wrote:


On 1/9/20 2:27 PM, Stefan Priebe - Profihost AG wrote:

Hi Wido,
Am 09.01.20 um 14:18 schrieb Wido den Hollander:


On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:

Am 09.01.20 um 13:39 schrieb Janne Johansson:

 I'm currently trying to workout a concept for a ceph cluster which can
 be used as a target for backups which satisfies the following
 requirements:

 - approx. write speed of 40.000 IOP/s and 2500 Mbyte/s


You might need to have a large (at least non-1) number of writers to get
to that sum of operations, as opposed to trying to reach it with one
single stream written from one single client.


We are aiming for about 100 writers.

So if I read it correctly the writes will be 64k each.

may be ;-) see below


That should be doable, but you probably want something like NVMe for DB+WAL.

You might want to tune that larger writes also go into the WAL to speed
up the ingress writes. But you mainly want more spindles then less.

I would like to give a little bit more insight about this and most
probobly some overhead we currently have in those numbers. Those values
come from our old classic raid storage boxes. Those use btrfs + zlib
compression + subvolumes for those backups and we've collected those
numbers from all of them.

The new system should just replicate snapshots from the live ceph.
Hopefully being able to use Erase Coding and compression? ;-)


Compression might work, but only if the data is compressable.

EC usually writes very fast, so that's good. I would recommend a lot of
spindles those. More spindles == more OSDs == more performance.

So instead of using 12TB drives you can consider 6TB or 8TB drives.

Wido


Greets,
Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Kyriazis, George


> On Jan 9, 2020, at 8:00 AM, Stefan Kooman  wrote:
> 
> Quoting Kyriazis, George (george.kyria...@intel.com):
> 
>> The source pool has mainly big files, but there are quite a few
>> smaller (<4KB) files that I’m afraid will create waste if I create the
>> destination zpool with ashift > 12 (>4K blocks).  I am not sure,
>> though, if ZFS will actually write big files in consecutive blocks
>> (through a send/receive), so maybe the blocking factor is not the
>> actual file size, but rather the zfs block size.  I am planning on
>> using zfs gzip-9 compression on the destination pool, if it matters.
> 
> You might want to consider Zstandard for compression:
> https://engineering.fb.com/core-data/smaller-and-faster-data-compression-with-zstandard/
> 
Thanks for the pointer.  Sorry, I am not sure how you are suggesting to use 
zstd, since it’s not part of the standard zfs compression algorithms.

> You can optimize a ZFS fs to use larger blocks for those files that are
> small ... and use large block sizes for other fs ... if it's easy to
> split them.
> 
From what I understand, zfs uses a single block per file if files are <4K, i.e. 
it does not put 2 small files in a single block.  How would larger blocks help 
small files?  Also, as far as I know ashift is a pool property, set only at 
pool creation.

I don’t have control over the original files and how they are stored in the 
source server.  These are user’s files.

Thank you!

George

> -- 
> | BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Wido den Hollander


On 1/9/20 2:27 PM, Stefan Priebe - Profihost AG wrote:
> Hi Wido,
> Am 09.01.20 um 14:18 schrieb Wido den Hollander:
>>
>>
>> On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:
>>>
>>> Am 09.01.20 um 13:39 schrieb Janne Johansson:

 I'm currently trying to workout a concept for a ceph cluster which can
 be used as a target for backups which satisfies the following
 requirements:

 - approx. write speed of 40.000 IOP/s and 2500 Mbyte/s


 You might need to have a large (at least non-1) number of writers to get
 to that sum of operations, as opposed to trying to reach it with one
 single stream written from one single client. 
>>>
>>>
>>> We are aiming for about 100 writers.
>>
>> So if I read it correctly the writes will be 64k each.
> 
> may be ;-) see below
> 
>> That should be doable, but you probably want something like NVMe for DB+WAL.
>>
>> You might want to tune that larger writes also go into the WAL to speed
>> up the ingress writes. But you mainly want more spindles then less.
> 
> I would like to give a little bit more insight about this and most
> probobly some overhead we currently have in those numbers. Those values
> come from our old classic raid storage boxes. Those use btrfs + zlib
> compression + subvolumes for those backups and we've collected those
> numbers from all of them.
> 
> The new system should just replicate snapshots from the live ceph.
> Hopefully being able to use Erase Coding and compression? ;-)
> 

Compression might work, but only if the data is compressible.

EC usually writes very fast, so that's good. I would recommend a lot of
spindles those. More spindles == more OSDs == more performance.

So instead of using 12TB drives you can consider 6TB or 8TB drives.
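For reference, BlueStore compression can be switched on per pool; a minimal 
sketch (the pool name is a placeholder, and whether it helps still depends 
entirely on how compressible the backup data is):

ceph osd pool set backup-data compression_mode aggressive
ceph osd pool set backup-data compression_algorithm snappy
# or zstd/zlib for a better ratio at a higher CPU cost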

Wido

> Greets,
> Stefan
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
Quoting Kyriazis, George (george.kyria...@intel.com):

> The source pool has mainly big files, but there are quite a few
> smaller (<4KB) files that I’m afraid will create waste if I create the
> destination zpool with ashift > 12 (>4K blocks).  I am not sure,
> though, if ZFS will actually write big files in consecutive blocks
> (through a send/receive), so maybe the blocking factor is not the
> actual file size, but rather the zfs block size.  I am planning on
> using zfs gzip-9 compression on the destination pool, if it matters.

You might want to consider Zstandard for compression:
https://engineering.fb.com/core-data/smaller-and-faster-data-compression-with-zstandard/

You can optimize a ZFS fs to use larger blocks for those files that are
small ... and use large block sizes for other fs ... if it's easy to
split them.

-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Kyriazis, George
Hello ceph-users!

My setup is that I’d like to use RBD images as a replication target of a 
FreeNAS zfs pool.  I have a 2nd FreeNAS (in a VM) to act as a backup target in 
which I mount the RBD image.  All this (except the source FreeNAS server) is in 
Proxmox.

Since I am using RBD as a backup target, performance is not really critical, 
but I still don’t want it to take months to complete the backup.  My source 
pool size is in the order of ~30TB.

I’ve set up an EC RBD pool (and the matching replicated pool) and created an image 
with no problems.  However, with the stock 4MB object size, backup speed is 
quite slow.  I tried creating an image with 4K object size, but even for a 
relatively small image size (of 1TB), I get:

# rbd -p rbd_backup create vm-118-disk-0 --size 1T --object-size 4K --data-pool 
rbd_ec
2020-01-09 07:40:27.120 7f3e4aa15f40 -1 librbd::image::CreateRequest: 
validate_layout: image size not compatible with object map
rbd: create error: (22) Invalid argument
# 

Creating a smaller image (for example 1G) works fine, so I can only imagine 
that with an object size of 4K, there are way too many objects for the create.  
Given that I’d like to start with having a 40TB image, there is a significant 
size gap here.

The source pool has mainly big files, but there are quite a few smaller (<4KB) 
files that I’m afraid will create waste if I create the destination zpool with 
ashift > 12 (>4K blocks).  I am not sure, though, if ZFS will actually write 
big files in consecutive blocks (through a send/receive), so maybe the blocking 
factor is not the actual file size, but rather the zfs block size.  I am 
planning on using zfs gzip-9 compression on the destination pool, if it matters.

Any thoughts from the community on best methods to approach this?

Thank you!

George

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] monitor ghosted

2020-01-09 Thread Peter Eisch
As oddly as it drifted away it came back.  Next time, should there be a next 
time, I will snag logs as suggested by Sascha.

The window for all this was, local time: 9:02 am - disassociated; 11:20 pm - 
associated.  No changes were made, I did reboot the mon02 host at 1 pm.  No 
other network or host issues were observed in the rest of the cluster or at the 
site.

Thank you for your replies and I'll gather better logging next time.

peter




Peter Eisch
Senior Site Reliability Engineer
T1.612.659.3228
virginpulse.com
|virginpulse.com/global-challenge
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
Confidentiality Notice: The information contained in this e-mail, including any 
attachment(s), is intended solely for use by the designated recipient(s). 
Unauthorized use, dissemination, distribution, or reproduction of this message 
by anyone other than the intended recipient(s), or a person designated as 
responsible for delivering such messages to the intended recipient, is strictly 
prohibited and may be unlawful. This e-mail may contain proprietary, 
confidential or privileged information. Any views or opinions expressed are 
solely those of the author and do not necessarily represent those of Virgin 
Pulse, Inc. If you have received this message in error, or are not the named 
recipient(s), please immediately notify the sender and delete this e-mail 
message.
v2.64
From: Brad Hubbard 
Date: Wednesday, January 8, 2020 at 6:21 PM
To: Peter Eisch 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] monitor ghosted


On Thu, Jan 9, 2020 at 5:48 AM Peter Eisch  
wrote:
Hi,

This morning one of my three monitor hosts got booted from the Nautilus 14.2.4 
cluster and it won’t regain. There haven’t been any changes, or events at this 
site at all. The conf file is the [unchanged] and the same as the other two 
monitors. The host is also running the MDS and MGR apps without any issue. The 
ceph-mon log shows this repeating:

2020-01-08 13:33:29.403 7fec1a736700 1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
2020-01-08 13:33:29.433 7fec1a736700 1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
2020-01-08 13:33:29.541 7fec1a736700 1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
...

Try gathering a log with debug_mon 20. That should provide more detail about 
why  AuthMonitor::_assign_global_id() didn't return an ID.
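A minimal sketch of how to do that on the affected monitor (the admin socket 
works even while the mon is out of quorum; "cephmon02" is simply the mon id 
from the log lines above):

ceph daemon mon.cephmon02 config set debug_mon 20/20
# ... reproduce / wait for the probing loop, then turn it back down:
ceph daemon mon.cephmon02 config set debug_mon 1/5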


There is nothing in the logs of the two remaining/healthy monitors. What is my 
best practice to get this host back in the cluster?

peter

___
ceph-users mailing list
ceph-users@lists.ceph.com


--
Cheers,
Brad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD Marked down unable to restart continuously failing

2020-01-09 Thread Radhakrishnan2 S
Hello Everyone, 

One OSD node out of 16 has 12 OSDs with an NVMe bcache. Locally those OSD 
daemons seem to be up and running, while ceph osd tree shows them as down. The 
logs show that the OSDs have had IO stuck for over 4096 sec.

I tried checking iostat, netstat and ceph -w along with the logs. Is there a 
way to identify why this is happening? In addition, when I restart the OSD 
daemons on the respective OSD node, the restart fails. Any quick help please.
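A few commands that usually narrow this down (a sketch; "osd.12" is a 
placeholder id, and the daemon commands must be run on the node hosting that 
OSD, assuming a systemd-based install for the journal query):

ceph health detail
ceph daemon osd.12 dump_blocked_ops      # which ops are stuck, and on what
ceph daemon osd.12 dump_ops_in_flight
journalctl -u ceph-osd@12 --since "2 hours ago"   # why the restart itself fails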

Regards
Radha Krishnan S
TCS Enterprise Cloud Practice
Tata Consultancy Services
Cell:- +1 848 466 4870
Mailto: radhakrishnan...@tcs.com
Website: http://www.tcs.com

Experience certainty.   IT Services
Business Solutions
Consulting



-"ceph-users"  wrote: -
To: d.aber...@profihost.ag, "Janne Johansson" 
From: "Wido den Hollander" 
Sent by: "ceph-users" 
Date: 01/09/2020 08:19AM
Cc: "Ceph Users" , a.bra...@profihost.ag, 
"p.kra...@profihost.ag" , j.kr...@profihost.ag
Subject: Re: [ceph-users] Looking for experience

"External email. Open with Caution"


On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:
> 
> Am 09.01.20 um 13:39 schrieb Janne Johansson:
>>
>> I'm currently trying to workout a concept for a ceph cluster which can
>> be used as a target for backups which satisfies the following
>> requirements:
>>
>> - approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
>>
>>
>> You might need to have a large (at least non-1) number of writers to get
>> to that sum of operations, as opposed to trying to reach it with one
>> single stream written from one single client. 
> 
> 
> We are aiming for about 100 writers.

So if I read it correctly the writes will be 64k each.

That should be doable, but you probably want something like NVMe for DB+WAL.

You might want to tune that larger writes also go into the WAL to speed
up the ingress writes. But you mainly want more spindles then less.

Wido

> 
> Cheers
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Stefan Priebe - Profihost AG
Hi Wido,
On 09.01.20 at 14:18, Wido den Hollander wrote:
> 
> 
> On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:
>>
>> Am 09.01.20 um 13:39 schrieb Janne Johansson:
>>>
>>> I'm currently trying to workout a concept for a ceph cluster which can
>>> be used as a target for backups which satisfies the following
>>> requirements:
>>>
>>> - approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
>>>
>>>
>>> You might need to have a large (at least non-1) number of writers to get
>>> to that sum of operations, as opposed to trying to reach it with one
>>> single stream written from one single client. 
>>
>>
>> We are aiming for about 100 writers.
> 
> So if I read it correctly the writes will be 64k each.

may be ;-) see below

> That should be doable, but you probably want something like NVMe for DB+WAL.
> 
> You might want to tune that larger writes also go into the WAL to speed
> up the ingress writes. But you mainly want more spindles then less.

I would like to give a little bit more insight about this and most
probably some overhead we currently have in those numbers. Those values
come from our old classic raid storage boxes. Those use btrfs + zlib
compression + subvolumes for those backups and we've collected those
numbers from all of them.

The new system should just replicate snapshots from the live ceph.
Hopefully being able to use Erasure Coding and compression? ;-)

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Wido den Hollander


On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:
> 
> Am 09.01.20 um 13:39 schrieb Janne Johansson:
>>
>> I'm currently trying to workout a concept for a ceph cluster which can
>> be used as a target for backups which satisfies the following
>> requirements:
>>
>> - approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
>>
>>
>> You might need to have a large (at least non-1) number of writers to get
>> to that sum of operations, as opposed to trying to reach it with one
>> single stream written from one single client. 
> 
> 
> We are aiming for about 100 writers.

So if I read it correctly the writes will be 64k each.

That should be doable, but you probably want something like NVMe for DB+WAL.

You might want to tune that larger writes also go into the WAL to speed
up the ingress writes. But you mainly want more spindles rather than fewer.
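If I read that suggestion correctly, the knob involved would be the BlueStore 
deferred-write threshold; a sketch only, with the value being an example to 
verify against the docs for your release:

# writes at or below this size are staged in the WAL (deferred) first;
# raising it from the HDD default (32 KiB, if I recall correctly) makes
# larger incoming writes take that path too
ceph config set osd bluestore_prefer_deferred_size_hdd 131072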

Wido

> 
> Cheers
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Daniel Aberger - Profihost AG

On 09.01.20 at 13:39, Janne Johansson wrote:
> 
> I'm currently trying to workout a concept for a ceph cluster which can
> be used as a target for backups which satisfies the following
> requirements:
> 
> - approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
> 
> 
> You might need to have a large (at least non-1) number of writers to get
> to that sum of operations, as opposed to trying to reach it with one
> single stream written from one single client. 


We are aiming for about 100 writers.

Cheers
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Janne Johansson
>
>
> I'm currently trying to workout a concept for a ceph cluster which can
> be used as a target for backups which satisfies the following requirements:
>
> - approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
>

You might need to have a large (at least non-1) number of writers to get to
that sum of operations, as opposed to trying to reach it with one single
stream written from one single client.
-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Looking for experience

2020-01-09 Thread Daniel Aberger - Profihost AG
Hello,

I'm currently trying to work out a concept for a ceph cluster which can
be used as a target for backups which satisfies the following requirements:

- approx. write speed of 40.000 IOP/s and 2500 Mbyte/s
- 500 Tbyte total available space

Does anyone have experience with a ceph cluster of comparable size
and can recommend a working hardware setup?

If so: what hardware did you use in what ceph configuration?

-- 
Kind regards
  Daniel Aberger
Your Profihost Team

---
Profihost AG
Expo Plaza 1
30539 Hannover
Germany

Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: i...@profihost.com

Registered office: Hannover, VAT ID DE813460827
Register court: Amtsgericht Hannover, registration no.: HRB 202350
Executive board: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
Supervisory board: Prof. Dr. iur. Winfried Huck (Chairman)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Install specific version using ansible

2020-01-09 Thread Konstantin Shalygin

Hello all!
I'm trying to install a specific version of luminous (12.2.4). In
group_vars/all.yml I can specify the luminous release, but I didn't find a
place where I can be more specific about the version.

Ansible installs the latest version (12.2.12 at this time).

I'm using ceph ansible stable-3.1

Is it possible, or do I have to downgrade?


Just install the packages before deploying, and don't upgrade them.
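For example (a sketch only; the exact version strings depend on the distro and 
on what the repo still carries for 12.2.4, so check first with
"yum list ceph --showduplicates" or "apt-cache madison ceph"):

# CentOS/RHEL:
yum install ceph-12.2.4 ceph-common-12.2.4 ceph-osd-12.2.4 ceph-mon-12.2.4 ceph-mgr-12.2.4

# Ubuntu/Debian (version string is an example):
apt-get install ceph=12.2.4-1xenial ceph-common=12.2.4-1xenial ceph-osd=12.2.4-1xenial
apt-mark hold ceph ceph-common ceph-osd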



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rebalance all at once or host-by-host?

2020-01-09 Thread Stefan Kooman
Quoting Sean Matheny (s.math...@auckland.ac.nz):
> I tested this out by setting norebalance and norecover, moving the host 
> buckets under the rack buckets (all of them), and then unsetting. Ceph starts 
> melting down with escalating slow requests, even with backfill and recovery 
> parameters set to throttle. I moved the host buckets back to the default root 
> bucket, and things mostly came right, but I still had some inactive / unknown 
> pgs that I had to restart some OSDs to get back to health_ok.
> 
> I’m sure there’s a way you can tune things or fade in crush weights or 
> something, but I’m happy just moving one at a time.

For big changes like this you can use Dan's UPMAP trick:
https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer

Python script:
https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

This way you can pause the process or get in "HEALTH_OK" state when
you want to.
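Roughly, going by the slides and the script's header, the flow looks like this 
(a sketch, not a recipe; read the script before piping anything into a shell):

ceph osd set norebalance
# ... make the big CRUSH change (e.g. move the host buckets under the racks) ...
ceph balancer off
# the script prints "ceph osd pg-upmap-items ..." commands that pin every
# misplaced PG to where its data currently sits, returning to HEALTH_OK:
./upmap-remapped.py | sh
ceph osd unset norebalance
ceph balancer on    # the balancer then removes the upmaps at a controlled pace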

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com