[ceph-users] ceph usage for very small objects

2019-12-26 Thread Adrian Nicolae

Hi all,

I have a ceph cluster with 4+2 EC used as a secondary storage system for 
offloading big files from another storage system. Even though most of the 
files are big (at least 50MB), we also have some small objects, less than 
4MB each. The current storage usage is 358TB of raw data for 237TB of 
'usable' data, i.e. usable is about 66% of raw, roughly the 1.5x overhead 
expected from 4+2 EC.


I was wondering if I could get more storage efficiency by getting rid of 
all the small files, moving them to other storage systems.  My 
understanding is that every file is split into stripe_unit chunks which are 
then mapped to ceph objects with a size of 4MB per object. So if I have a 
file with a size of 1MB, the file will be split into 4 x 256KB data chunks, 
plus another 2 x 256KB chunks of EC overhead, and every chunk will be 
mapped to a ceph object of 4MB. That would mean a 1MB file is stored as 6 
ceph objects, i.e. the storage usage would be 24MB. Not sure if my 
understanding is correct though...
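
To spell out the arithmetic under that assumption (just a sketch of my own 
reasoning; the 4MB-per-chunk padding is exactly the part I'm unsure about):

  1MB file, 4+2 EC, stripe_unit = 256KB
  data chunks:   4 x 256KB = 1MB
  coding chunks: 2 x 256KB = 0.5MB
  -> 1.5MB raw if each chunk is stored at its actual size (the normal 1.5x EC overhead)
  -> 6 x 4MB = 24MB raw if every chunk really gets padded to a full 4MB object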


 Do you have any suggestions on this topic? Is it really worth it to 
move the small files out of ceph? If yes, what is the minimum file size 
which I can safely store in ceph without losing too much storage?


Thanks.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs kernel client io performance decreases extremely

2019-12-26 Thread renjianxinlover
hello,
   Recently, after deleting some fs data in a small-scale ceph cluster, 
some clients' IO performance became bad, especially latency. For example, 
opening a tiny text file with vim can take nearly twenty seconds. I am not 
sure how to diagnose the cause; could anyone give some guidance?


Brs
renjianxinlover
renjianxinlo...@163.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client io performance decreases extremely

2019-12-26 Thread Nathan Fish
I would start by viewing "ceph status", drive IO with "iostat -x 1
/dev/sd{a..z}", and the CPU/RAM usage of the active MDS. If "ceph status"
warns that the MDS cache is oversized, that may be an easy fix.
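
For example, something along these lines (a rough sketch; the MDS name and
the 8GiB value are placeholders, and the config command assumes a Mimic or
later cluster):

ceph status
ceph fs status
iostat -x 1 /dev/sd{a..z}
ceph daemon mds.<name> cache status
# if the cache is reported as oversized, raise the limit, e.g. to 8GiB:
ceph config set mds mds_cache_memory_limit 8589934592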

On Thu, Dec 26, 2019 at 7:33 AM renjianxinlover 
wrote:

> hello,
>    Recently, after deleting some fs data in a small-scale ceph
> cluster, some clients' IO performance became bad, especially latency. For
> example, opening a tiny text file with vim can take nearly twenty
> seconds. I am not sure how to diagnose the cause; could anyone give
> some guidance?
>
> Brs
> renjianxinlover
> renjianxinlo...@163.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow rbd read performance

2019-12-26 Thread Ml Ml
Hello Christian,

Thanks for your reply. How should I benchmark my OSDs?

"dd bs=1M count=2048 if=/dev/sdX of=/dev/null" for each OSD?

Here are my OSD (write) benchmarks:

root@ceph01:~# ceph tell osd.* bench -f plain
osd.0: bench: wrote 1GiB in blocks of 4MiB in 7.80794 sec at 131MiB/sec 32 IOPS
osd.1: bench: wrote 1GiB in blocks of 4MiB in 7.46659 sec at 137MiB/sec 34 IOPS
osd.2: bench: wrote 1GiB in blocks of 4MiB in 7.59962 sec at 135MiB/sec 33 IOPS
osd.3: bench: wrote 1GiB in blocks of 4MiB in 4.58729 sec at 223MiB/sec 55 IOPS
osd.4: bench: wrote 1GiB in blocks of 4MiB in 4.94816 sec at 207MiB/sec 51 IOPS
osd.5: bench: wrote 1GiB in blocks of 4MiB in 11.7797 sec at 86.9MiB/sec 21 IOPS
osd.6: bench: wrote 1GiB in blocks of 4MiB in 11.6019 sec at 88.3MiB/sec 22 IOPS
osd.7: bench: wrote 1GiB in blocks of 4MiB in 8.87174 sec at 115MiB/sec 28 IOPS
osd.8: bench: wrote 1GiB in blocks of 4MiB in 10.6859 sec at 95.8MiB/sec 23 IOPS
osd.10: bench: wrote 1GiB in blocks of 4MiB in 12.1083 sec at 84.6MiB/sec 21 IOPS
osd.11: bench: wrote 1GiB in blocks of 4MiB in 6.26344 sec at 163MiB/sec 40 IOPS
osd.12: bench: wrote 1GiB in blocks of 4MiB in 8.12922 sec at 126MiB/sec 31 IOPS
osd.13: bench: wrote 1GiB in blocks of 4MiB in 5.5416 sec at 185MiB/sec 46 IOPS
osd.14: bench: wrote 1GiB in blocks of 4MiB in 4.99461 sec at 205MiB/sec 51 IOPS
osd.15: bench: wrote 1GiB in blocks of 4MiB in 5.84936 sec at 175MiB/sec 43 IOPS
osd.16: bench: wrote 1GiB in blocks of 4MiB in 6.72942 sec at 152MiB/sec 38 IOPS
osd.17: bench: wrote 1GiB in blocks of 4MiB in 10.3651 sec at 98.8MiB/sec 24 IOPS
osd.18: bench: wrote 1GiB in blocks of 4MiB in 8.33947 sec at 123MiB/sec 30 IOPS
osd.19: bench: wrote 1GiB in blocks of 4MiB in 4.79787 sec at 213MiB/sec 53 IOPS
osd.20: bench: wrote 1GiB in blocks of 4MiB in 8.11134 sec at 126MiB/sec 31 IOPS
osd.21: bench: wrote 1GiB in blocks of 4MiB in 5.70753 sec at 179MiB/sec 44 IOPS
osd.22: bench: wrote 1GiB in blocks of 4MiB in 4.82281 sec at 212MiB/sec 53 IOPS
osd.23: bench: wrote 1GiB in blocks of 4MiB in 8.04044 sec at 127MiB/sec 31 IOPS
osd.24: bench: wrote 1GiB in blocks of 4MiB in 4.64409 sec at 220MiB/sec 55 IOPS
osd.25: bench: wrote 1GiB in blocks of 4MiB in 6.23562 sec at 164MiB/sec 41 IOPS
osd.27: bench: wrote 1GiB in blocks of 4MiB in 7.00978 sec at 146MiB/sec 36 IOPS
osd.32: bench: wrote 1GiB in blocks of 4MiB in 6.38438 sec at 160MiB/sec 40 IOPS

Thanks,
Mario



On Tue, Dec 24, 2019 at 1:46 AM Christian Balzer  wrote:
>
>
> Hello,
>
> On Mon, 23 Dec 2019 22:14:15 +0100 Ml Ml wrote:
>
> > Hohoho Merry Christmas and Hello,
> >
> > i set up a "poor man's" ceph cluster with 3 Nodes, one switch and
> > normal standard HDDs.
> >
> > My problem; with rbd benchmark i get 190MB/sec write, but only
> > 45MB/sec read speed.
> >
> Something is severely off with your testing or cluster if reads are slower
> than writes, especially by this margin.
>
> > Here is the Setup: https://i.ibb.co/QdYkBYG/ceph.jpg
> >
> > I plan to implement a separate switch to separate public from cluster
> > network. But i think this is not my current problem here.
> >
> You don't mention how many HDDs per server, 10Gbs is fine most likely and
> a separate network (either physical or logical) is usually not needed or
> beneficial.
> Your results indicate that the HIGHEST peak used 70% of your bandwidth and
> that your disks can only maintain 20% of it.
>
> Do your tests consistently with the same tool.
> Neither rados nor rbdbench are ideal, but at least they give ballpark
> figures.
> FIO on the actual mount on your backup server would be best.
>
> And testing on a ceph node is also prone to skewed results, test from the
> actual client, your backup server.
>
> Make sure your network does what you want and monitor the ceph nodes with
> ie. atop during the test runs to see where obvious bottlenecks are.
>
> Christian
>
> > I mount the stuff with rbd from the backup server. It seems that i get
> > good write, but slow read speed. More details at the end of the mail.
> >
> > rados bench -p scbench 30 write --no-cleanup:
> > -
> > Total time run: 34.269336
> > ...
> > Bandwidth (MB/sec): 162.945
> > Stddev Bandwidth:   198.818
> > Max bandwidth (MB/sec): 764
> > Min bandwidth (MB/sec): 0
> > Average IOPS:   40
> > Stddev IOPS:49
> > Max IOPS:   191
> > Min IOPS:   0
> > Average Latency(s): 0.387122
> > Stddev Latency(s):  1.24094
> > Max latency(s): 11.883
> > Min latency(s): 0.0161869
> >
> >
> > Here are the rbd benchmarks run on ceph01:
> > --
> > rbd -p rbdbench bench $RBD_IMAGE_NAME --io-type write --io-size 8192
> > --io-threads 256 --io-total 10G --io-pattern seq
> > ...
> > elapsed:56  ops:  1310720  ops/sec: 23295.63  bytes/sec: 190837820.82 (190MB/sec) => OKAY
> >
> >
> > rbd -p rbdbench bench $RBD

[ceph-users] HEALTH_ERR, size and min_size

2019-12-26 Thread Ml Ml
Hello List,
I have size = 3 and min_size = 2 with 3 nodes.
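
For reference, the pool settings can be verified with (pool name is a
placeholder here):

ceph osd pool get <pool> size
ceph osd pool get <pool> min_size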

My OSDs:

ceph osd tree
ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
-1   60.17775 root default
-2   20.21155 host ceph01
 0   hdd  1.71089 osd.0   up  1.0 1.0
 8   hdd  1.71660 osd.8   up  1.0 1.0
 9   hdd  2.67029 osd.9   up  1.0 1.0
11   hdd  1.71649 osd.11  up  1.0 1.0
12   hdd  2.67020 osd.12  up  1.0 1.0
14   hdd  2.67020 osd.14  up  1.0 1.0
18   hdd  1.71649 osd.18  up  1.0 1.0
22   hdd  2.67020 osd.22  up  1.0 1.0
23   hdd  2.67020 osd.23  up  1.0 1.0
-3   19.08154 host ceph02
 2   hdd  2.67029 osd.2   up  1.0 1.0
 3   hdd  2.7 osd.3   up  1.0 1.0
 7   hdd  2.67029 osd.7   up  1.0 1.0
13   hdd  2.67020 osd.13  up  1.0 1.0
16   hdd  1.5 osd.16  up  1.0 1.0
19   hdd  2.38409 osd.19  up  1.0 1.0
24   hdd  2.67020 osd.24  up  1.0 1.0
25   hdd  1.71649 osd.25  up  1.0 1.0
-4   20.88466 host ceph03
 1   hdd  1.71660 osd.1   up  1.0 1.0
 4   hdd  2.67020 osd.4   up  1.0 1.0
 5   hdd  1.71660 osd.5   up  1.0 1.0
 6   hdd  1.71660 osd.6   up  1.0 1.0
15   hdd  2.67020 osd.15  up  1.0 1.0
17   hdd  1.62109 osd.17  up  1.0 1.0
20   hdd  1.71649 osd.20  up  1.0 1.0
21   hdd  2.67020 osd.21  up  1.0 1.0
27   hdd  1.71649 osd.27  up  1.0 1.0
32   hdd  2.67020 osd.32  up  1.0 1.0

I replaced two OSDs on node ceph01 and ran into "HEALTH_ERR".
My problem: does it just have to wait for the backfilling process?
Why did I run into HEALTH_ERR? I thought all data would still be available
on at least one more node, or even two:

HEALTH_ERR 343351/10358292 objects misplaced (3.315%); Reduced data
availability: 19 pgs inactive; Degraded data redundancy:
639455/10358292 objects degraded (6.173%), 208 pgs degraded, 204 pgs
undersized; application not enabled on 1 pool(s); 29 slow requests are
blocked > 32 sec. Implicated osds ; 29 stuck requests are blocked >
4096 sec. Implicated osds 2,19,24
OBJECT_MISPLACED 343351/10358292 objects misplaced (3.315%)
PG_AVAILABILITY Reduced data availability: 19 pgs inactive
pg 0.4 is stuck inactive for 4227.236803, current state
undersized+degraded+remapped+backfilling+peered, last acting [19]
pg 0.12 is stuck inactive for 4227.267137, current state
undersized+degraded+remapped+backfilling+peered, last acting [13]
pg 0.1b is stuck inactive for 4198.153642, current state
undersized+degraded+remapped+backfill_wait+peered, last acting [24]
pg 0.1f is stuck inactive for 4226.574006, current state
undersized+degraded+remapped+backfilling+peered, last acting [19]
pg 0.61 is stuck inactive for 4227.316336, current state
undersized+degraded+remapped+backfilling+peered, last acting [2]
pg 0.85 is stuck inactive for 4227.287134, current state
undersized+degraded+remapped+backfill_wait+peered, last acting [13]
pg 0.88 is stuck inactive for 4197.261935, current state
undersized+degraded+remapped+backfill_wait+peered, last acting [24]
pg 0.bd is stuck inactive for 4226.607646, current state
undersized+degraded+remapped+backfilling+peered, last acting [2]
pg 0.fc is stuck inactive for 4226.642664, current state
undersized+degraded+remapped+backfill_wait+peered, last acting [13]
pg 0.140 is stuck inactive for 4198.277165, current state
undersized+degraded+remapped+backfilling+peered, last acting [2]
pg 0.16c is stuck inactive for 4198.268985, current state
undersized+degraded+remapped+backfilling+peered, last acting [7]
pg 0.21f is stuck inactive for 4198.228206, current state
undersized+degraded+remapped+backfilling+peered, last acting [2]
pg 0.222 is stuck inactive for 4198.241280, current state
undersized+degraded+remapped+backfilling+peered, last acting [2]
pg 0.27f is stuck inactive for 4198.201034, current state
undersized+degraded+remapped+backfill_wait+peered, last acting [19]
pg 0.297 is stuck inactive for 4197.247869, current state
undersized+degraded+remapped+backfilling+peered, last acting [24]
pg 0.298 is stuck inactive for 4226.572652, current state
undersized+degraded+remapped+backfilling+peered, last acting [19]
pg 0.2cd is stuck inactive for 4226.643455, current state
undersized+degraded+remapped+backfilling+peered, last acting [16]
pg 0.314 is stuck inactive for 4227.339749, current state
undersized+degraded+remapped+backfilling+peered, last acting [2]
pg 0.375 is stuck inactive for 4227.260662, current state
undersized+degraded+remapped+backfilling+peered, last acting [19]
PG_DEGRADED Degraded data

Re: [ceph-users] Slow rbd read performance

2019-12-26 Thread Christian Balzer

Hello,

On Thu, 26 Dec 2019 18:11:29 +0100 Ml Ml wrote:

> Hello Christian,
> 
> thanks for your reply. How should i benchmark my OSDs?
>
Benchmarking individual components can be helpful if you suspect
something, but you need to get a grip on what your systems are doing;
re-read my mail and familiarize yourself with atop and other tools like
Prometheus and Grafana to get that insight.

> "dd bs=1M count=2048 if=/dev/sdX of=/dev/null" for each OSD?
> 
You'd be comparing apples with oranges again, as the blocksize with the
benches below is 4MB. Also a "direct" flag would exclude caching effects.
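
Something closer to the benches below would be (a sketch only; adjust the
device, mountpoint and sizes to your setup, and fio will create the test
file if it does not already exist):

dd if=/dev/sdX of=/dev/null bs=4M count=256 iflag=direct
fio --name=seqread --filename=/mnt/backup/fio.test --rw=read --bs=4M --size=4G --direct=1 --ioengine=libaio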


> Here are my OSD (write) benchmarks:
>
The variance is significant here, especially if the cluster was quiescent
at the time.
If the low results (<100MB/s) can be reproduced on the same OSDs, you have
at least one problem spot located.
The slowest component (OSD) involved determines the overall performance.
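
To see whether the slow results reproduce, the same bench can be re-run
against the individual OSDs in question, e.g.:

ceph tell osd.5 bench -f plain
ceph tell osd.6 bench -f plain
ceph tell osd.10 bench -f plain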

Christian

> root@ceph01:~# ceph tell osd.* bench -f plain
> osd.0: bench: wrote 1GiB in blocks of 4MiB in 7.80794 sec at 131MiB/sec 32 IOPS
> osd.1: bench: wrote 1GiB in blocks of 4MiB in 7.46659 sec at 137MiB/sec 34 IOPS
> osd.2: bench: wrote 1GiB in blocks of 4MiB in 7.59962 sec at 135MiB/sec 33 IOPS
> osd.3: bench: wrote 1GiB in blocks of 4MiB in 4.58729 sec at 223MiB/sec 55 IOPS
> osd.4: bench: wrote 1GiB in blocks of 4MiB in 4.94816 sec at 207MiB/sec 51 IOPS
> osd.5: bench: wrote 1GiB in blocks of 4MiB in 11.7797 sec at 86.9MiB/sec 21 IOPS
> osd.6: bench: wrote 1GiB in blocks of 4MiB in 11.6019 sec at 88.3MiB/sec 22 IOPS
> osd.7: bench: wrote 1GiB in blocks of 4MiB in 8.87174 sec at 115MiB/sec 28 IOPS
> osd.8: bench: wrote 1GiB in blocks of 4MiB in 10.6859 sec at 95.8MiB/sec 23 IOPS
> osd.10: bench: wrote 1GiB in blocks of 4MiB in 12.1083 sec at 84.6MiB/sec 21 IOPS
> osd.11: bench: wrote 1GiB in blocks of 4MiB in 6.26344 sec at 163MiB/sec 40 IOPS
> osd.12: bench: wrote 1GiB in blocks of 4MiB in 8.12922 sec at 126MiB/sec 31 IOPS
> osd.13: bench: wrote 1GiB in blocks of 4MiB in 5.5416 sec at 185MiB/sec 46 IOPS
> osd.14: bench: wrote 1GiB in blocks of 4MiB in 4.99461 sec at 205MiB/sec 51 IOPS
> osd.15: bench: wrote 1GiB in blocks of 4MiB in 5.84936 sec at 175MiB/sec 43 IOPS
> osd.16: bench: wrote 1GiB in blocks of 4MiB in 6.72942 sec at 152MiB/sec 38 IOPS
> osd.17: bench: wrote 1GiB in blocks of 4MiB in 10.3651 sec at 98.8MiB/sec 24 IOPS
> osd.18: bench: wrote 1GiB in blocks of 4MiB in 8.33947 sec at 123MiB/sec 30 IOPS
> osd.19: bench: wrote 1GiB in blocks of 4MiB in 4.79787 sec at 213MiB/sec 53 IOPS
> osd.20: bench: wrote 1GiB in blocks of 4MiB in 8.11134 sec at 126MiB/sec 31 IOPS
> osd.21: bench: wrote 1GiB in blocks of 4MiB in 5.70753 sec at 179MiB/sec 44 IOPS
> osd.22: bench: wrote 1GiB in blocks of 4MiB in 4.82281 sec at 212MiB/sec 53 IOPS
> osd.23: bench: wrote 1GiB in blocks of 4MiB in 8.04044 sec at 127MiB/sec 31 IOPS
> osd.24: bench: wrote 1GiB in blocks of 4MiB in 4.64409 sec at 220MiB/sec 55 IOPS
> osd.25: bench: wrote 1GiB in blocks of 4MiB in 6.23562 sec at 164MiB/sec 41 IOPS
> osd.27: bench: wrote 1GiB in blocks of 4MiB in 7.00978 sec at 146MiB/sec 36 IOPS
> osd.32: bench: wrote 1GiB in blocks of 4MiB in 6.38438 sec at 160MiB/sec 40 IOPS
> 
> Thanks,
> Mario
> 
> 
> 
> On Tue, Dec 24, 2019 at 1:46 AM Christian Balzer  wrote:
> >
> >
> > Hello,
> >
> > On Mon, 23 Dec 2019 22:14:15 +0100 Ml Ml wrote:
> >  
> > > Hohoho Merry Christmas and Hello,
> > >
> > > i set up a "poor man's" ceph cluster with 3 Nodes, one switch and
> > > normal standard HDDs.
> > >
> > > My problem; with rbd benchmark i get 190MB/sec write, but only
> > > 45MB/sec read speed.
> > >  
> > Something is severely off with your testing or cluster if reads are slower
> > than writes, especially by this margin.
> >  
> > > Here is the Setup: https://i.ibb.co/QdYkBYG/ceph.jpg
> > >
> > > I plan to implement a separate switch to separate public from cluster
> > > network. But i think this is not my current problem here.
> > >  
> > You don't mention how many HDDs per server, 10Gbs is fine most likely and
> > a separate network (either physical or logical) is usually not needed or
> > beneficial.
> > Your results indicate that the HIGHEST peak used 70% of your bandwidth and
> > that your disks can only maintain 20% of it.
> >
> > Do your tests consistently with the same tool.
> > Neither rados nor rbdbench are ideal, but at least they give ballpark
> > figures.
> > FIO on the actual mount on your backup server would be best.
> >
> > And testing on a ceph node is also prone to skewed results, test from the
> > actual client, your backup server.
> >
> > Make sure your network does what you want and monitor the ceph nodes with
> > ie. atop during the test runs to see where obvious bottlenecks are.
> >
> > Christian
> >  
> > > I mount the stuff with rbd from the backup server. It seems that i get
> > > good write, but slow read speed. More details at the