Re: [ceph-users] ceph df: Raw used vs. used vs. actual bytes in cephfs

2018-02-18 Thread Flemming Frandsen
Each OSD lives on a separate HDD in bluestore with the journals on 2GB 
partitions on a shared SSD.
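For reference, the commands I have been using to try to see where the raw
space goes (just a sketch; I am assuming the bluestore perf counters below
are the right place to look):

  # per-OSD raw usage, to check whether the overhead is spread evenly
  ceph osd df tree

  # per-pool accounting in more detail
  ceph df detail

  # bluestore allocated vs. stored bytes on one OSD (run on its host);
  # with ~28.5 million objects in fs_jenkins_data, per-object allocation
  # overhead (bluestore min_alloc_size, 64k by default on HDD) is one
  # possible place for raw space to go - an assumption, not a finding
  ceph daemon osd.0 perf dump | grep -E '"bluestore_(allocated|stored)"'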



On 16/02/18 21:08, Gregory Farnum wrote:
What does the cluster deployment look like? Usually this happens when 
you’re sharing disks with the OS, or have co-located file journals or 
something.
On Fri, Feb 16, 2018 at 4:02 AM Flemming Frandsen wrote:


I'm trying out cephfs and I'm in the process of copying over some
real-world data to see what happens.

I have created a number of cephfs file systems; the only one I've started
working on is the one named jenkins, which lives in fs_jenkins_data and
fs_jenkins_metadata.

According to ceph df I have about 1387 GB of data in all of the pools,
while the raw used space is 5918 GB, which gives a ratio of about 4.3. I
would have expected a ratio of around 2, since the pool size has been set
to 2.


Can anyone explain where half my space has been squandered?

 > ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    8382G     2463G        5918G         70.61
POOLS:
    NAME                         ID     USED       %USED     MAX AVAIL     OBJECTS
    .rgw.root                    1        1113      0             258G            4
    default.rgw.control          2           0      0             258G            8
    default.rgw.meta             3           0      0             258G            0
    default.rgw.log              4           0      0             258G          207
    fs_docker-nexus_data         5      66120M     11.09          258G        22655
    fs_docker-nexus_metadata     6      39463k      0             258G         2376
    fs_meta_data                 7         330      0             258G            4
    fs_meta_metadata             8        567k      0             258G           22
    fs_jenkins_data              9       1321G     71.84          258G     28576278
    fs_jenkins_metadata          10     52178k      0             258G      2285493
    fs_nexus_data                11          0      0             258G            0
    fs_nexus_metadata            12       4181      0             258G           21


--
  Regards Flemming Frandsen - Stibo Systems - DK - STEP Release Manager
  Please use rele...@stibo.com for all Release Management requests



--
 Regards Flemming Frandsen - Stibo Systems - DK - STEP Release Manager
 Please use rele...@stibo.com for all Release Management requests



Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-18 Thread David Turner
I recently came across this as well. It is an odd requirement.

On Sun, Feb 18, 2018, 4:54 PM F21  wrote:

> I am using the AWS Go SDK v2 (https://github.com/aws/aws-sdk-go-v2) to
> talk to my RGW instance using the s3 interface. I am running ceph in
> docker using the ceph/daemon docker images in demo mode. The RGW is
> started with a zonegroup and zone with their names set to an empty
> string by the scripts in the image.
>
> I have ForcePathStyle for the client set to true, because I want to
> access all my buckets using the path: myrgw.instance:8080/somebucket.
>
> I noticed that if I set the region for the client to anything other than
> us-east-1, I get this error when creating a bucket:
> InvalidLocationConstraint: The specified location-constraint is not valid.
>
> If I set the region in the client to something made up, such as "ceph"
> and the LocationConstraint to "ceph", I still get the same error.
>
> The only way to get my buckets to create successfully is to set the
> client's region to us-east-1. I have grepped the ceph code base and
> cannot find any references to us-east-1. In addition, I looked at the
> AWS docs for calculating v4 signatures and us-east-1 is the default
> region but I can see that the region string is used in the calculation
> (i.e. the region is not ignored when calculating the signature if it is
> set to us-east-1).
>
> Why do my buckets create successfully if I set the region in my s3
> client to us-east-1, but not otherwise? If I do not want to use
> us-east-1 as my default region, for example, if I want us-west-1 as my
> default region, what should I be configuring in ceph?
>
> Thanks,
>
> Francis
>


[ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-18 Thread F21
I am using the AWS Go SDK v2 (https://github.com/aws/aws-sdk-go-v2) to 
talk to my RGW instance using the s3 interface. I am running ceph in 
docker using the ceph/daemon docker images in demo mode. The RGW is 
started with a zonegroup and zone with their names set to an empty 
string by the scripts in the image.


I have ForcePathStyle for the client set to true, because I want to 
access all my buckets using the path: myrgw.instance:8080/somebucket.


I noticed that if I set the region for the client to anything other than 
us-east-1, I get this error when creating a bucket: 
InvalidLocationConstraint: The specified location-constraint is not valid.


If I set the region in the client to something made up, such as "ceph" 
and the LocationConstraint to "ceph", I still get the same error.


The only way to get my buckets to create successfully is to set the 
client's region to us-east-1. I have grepped the ceph code base and 
cannot find any references to us-east-1. In addition, I looked at the 
AWS docs for calculating v4 signatures and us-east-1 is the default 
region but I can see that the region string is used in the calculation 
(i.e. the region is not ignored when calculating the signature if it is 
set to us-east-1).


Why do my buckets create successfully if I set the region in my s3 
client to us-east-1, but not otherwise? If I do not want to use 
us-east-1 as my default region, for example, if I want us-west-1 as my 
default region, what should I be configuring in ceph?
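
For reference, here is roughly how I have been poking at the zonegroup
configuration, on the assumption (which I have not confirmed) that the
LocationConstraint is matched against the zonegroup's api_name, and that
the zonegroup can be addressed as "default" (the demo image reportedly
uses an empty name, so the --rgw-zonegroup value may need adjusting):

  # show the current zonegroup; "api_name" is the field I suspect the
  # LocationConstraint is compared against
  radosgw-admin zonegroup get --rgw-zonegroup=default

  # export, edit "api_name" (e.g. to "us-west-1"), re-import and commit
  radosgw-admin zonegroup get --rgw-zonegroup=default > zg.json
  # ... edit api_name in zg.json ...
  radosgw-admin zonegroup set --rgw-zonegroup=default < zg.json
  radosgw-admin period update --commit

If that assumption holds, the client's region would then need to match the
api_name, while leaving the client's region at us-east-1 keeps matching the
empty default.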


Thanks,

Francis



[ceph-users] Upgrade to ceph 12.2.2, libec_jerasure.so: undefined symbol: _ZN4ceph6buffer3ptrC1ERKS1_

2018-02-18 Thread Sebastian Koch - ilexius GmbH
Hello,

I ran "apt upgrade" on Ubuntu 16.04 on one node, now the two OSDs on the
node are not starting any more.

From the apt log it looks like many Ceph packages were upgraded from
12.2.1-1xenial to 12.2.2-1xenial.

In both OSD logs I can see the following:

2018-02-18 22:24:14.424425 7f5f96a18e00  0 set uid:gid to 64045:64045
(ceph:ceph)
2018-02-18 22:24:14.424443 7f5f96a18e00  0 ceph version 12.2.2
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process
(unknown), pid 6272
2018-02-18 22:24:14.427532 7f5f96a18e00 -1 Public network was set, but
cluster network was not set
2018-02-18 22:24:14.427538 7f5f96a18e00 -1 Using public network also
for cluster network
2018-02-18 22:24:14.430271 7f5f96a18e00  0 pidfile_write: ignore empty
--pid-file
2018-02-18 22:24:14.431859 7f5f96a18e00 -1 load
dlopen(/usr/lib/ceph/erasure-code/libec_jerasure.so):
/usr/lib/ceph/erasure-code/libec_jerasure.so: undefined symbol:
_ZN4ceph6buffer3ptrC1ERKS1_

Any advice on how to get the node running again? Thank you very much!
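
In case it is relevant, here is what I plan to check first, on the
assumption that this is a partially applied upgrade (a mix of 12.2.1 and
12.2.2 packages on the node):

  # all ceph-related packages should report the same version (12.2.2-1xenial)
  dpkg -l | grep -E 'ceph|rados|rbd'

  # pull any stragglers up to 12.2.2 explicitly, then restart the OSDs
  sudo apt-get update
  sudo apt-get install ceph ceph-base ceph-common ceph-osd librados2 librbd1
  sudo systemctl restart ceph-osd.target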

Best regards,
Sebastian

-- 
Dipl.-Wirt.-Inf. Sebastian Koch
Managing Director


ilexius GmbH

Unter den Eichen 5
Haus i
65195 Wiesbaden
E-Mail s.k...@ilexius.de
Tel +49-611 - 1 80 33 49
Fax +49-611 - 236 80 84 29
Web http://www.ilexius.de

Represented by the management:
Sebastian Koch and Thomas Schlüter
Register court: Wiesbaden
Commercial register: HRB 21723
Tax number: 040 236 22640
VAT ID: DE240822836
 



Re: [ceph-users] High Load and High Apply Latency

2018-02-18 Thread Steven Vacaroaia
Hi John,

I am trying to squeeze extra performance from my test cluster too:
Dell R620 with PERC 710, RAID0, 10 GB network.

Would you be willing to share your controller and kernel configuration ?

For example, I am using the BIOS profile "Performance" with the following
added to /etc/default/kernel:

intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0
idle=poll

and the tuned profile throughput-performance.

All disks are configured with nr_requests=1024 and read_ahead_kb=4096.
The SSD uses the noop scheduler while the HDDs use deadline.

Cache policy for the SSD:

megacli -LDSetProp -WT -Immediate -L0 -a0
megacli -LDSetProp -NORA -Immediate -L0 -a0
megacli -LDSetProp -Direct -Immediate -L0 -a0

The HDD cache policy has all caches enabled (WB and ADRA).
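
For the tcmalloc change John describes below, my (unverified) understanding
is that the knob is the thread cache size passed to the daemons via the
packaging's environment file, roughly:

  # /etc/sysconfig/ceph on RHEL/CentOS, /etc/default/ceph on Debian/Ubuntu
  # 134217728 = 128 MB; the gperftools default is 32 MB
  TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

  # then restart the OSDs, one node at a time
  systemctl restart ceph-osd.target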

Many thanks

Steven



On 16 February 2018 at 19:06, John Petrini  wrote:

> I thought I'd follow up on this just in case anyone else experiences
> similar issues. We ended up increasing the tcmalloc thread cache size and
> saw a huge improvement in latency. This got us out of the woods because we
> were finally in a state where performance was good enough that it was no
> longer impacting services.
>
> The tcmalloc issues are pretty well documented on this mailing list and I
> don't believe they impact newer versions of Ceph but I thought I'd at least
> give a data point. After making this change our average apply latency
> dropped to 3.46ms during peak business hours. To give you an idea of how
> significant that is here's a graph of the apply latency prior to the
> change: https://imgur.com/KYUETvD
>
> This however did not resolve all of our issues. We were still seeing high
> iowait (repeated spikes up to 400ms) on three of our OSD nodes on all
> disks. We tried replacing the RAID controller (PERC H730) on these nodes
> and while this resolved the issue on one server the two others remained
> problematic. These two nodes were configured differently than the rest.
> They'd been configured in non-raid mode while the others were configured as
> individual raid-0. This turned out to be the problem. We ended up removing
> the two nodes one at a time and rebuilding them with their disks configured
> in independent raid-0 instead of non-raid. After this change iowait rarely
> spikes above 15ms and averages <1ms.
>
> I was really surprised at the performance impact when using non-raid mode.
> While I realize non-raid bypasses the controller cache I still would have
> never expected such high latency. Dell has a whitepaper that recommends
> using individual raid-0 but their own tests show only a small performance
> advantage over non-raid. Note that we are running SAS disks, they actually
> recommend non-raid mode for SATA but I have not tested this. You can view
> the whitepaper here:
> http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download
>
> I hope this helps someone.
>
> John Petrini
>


Re: [ceph-users] Ceph Bluestore performance question

2018-02-18 Thread Oliver Freyermuth
Hi Stijn, 

> the IPoIB network is not 56gb, it's probably a lot less (20gb or so).
> the ib_write_bw test is verbs/rdma based. do you have iperf tests
> between hosts, and if so, can you share those results?

Wow - indeed, yes, I was completely mistaken about ib_write_bw. 
Good that I asked! 

You are completely right, checking with iperf3 I get:
[ ID] Interval   Transfer Bandwidth   Retr
[  4]   0.00-10.00  sec  18.4 GBytes  15.8 Gbits/sec  14242 sender
[  4]   0.00-10.00  sec  18.4 GBytes  15.8 Gbits/sec  receiver

Taking into account that the OSDs also talk to each other over the very same 
network,
I can totally follow the observed client throughput. 

This leaves me with two questions:
- Is it safe to use RDMA with 12.2.2 already? Reading through this mail archive,
  I gathered it may lead to memory exhaustion and in any case needs some hacks
  to the systemd service files.
- Is it already clear whether RDMA will be part of 12.2.3? 

Also, of course the final question from the last mail:
"Why is data moved in a k=4 m=2 EC-pool with 6 hosts and failure domain "host" 
after failure of one host?"
is still open. 
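
For completeness, these are the commands I am using to sanity-check that
pool for the open question above (a sketch; the rule name is my assumption,
Luminous normally names it after the pool):

  # size should be k+m = 6; min_size decides when PGs stop serving I/O
  ceph osd pool get cephfs_data size
  ceph osd pool get cephfs_data min_size

  # the CRUSH rule behind the EC pool; with failure domain "host" and only
  # 6 hosts I would not have expected remapping after a host failure, so
  # this is what I want to verify (my reading, happy to be corrected)
  ceph osd crush rule dump cephfs_data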

Many thanks already, this helped a lot to understand things better!

Cheers,
Oliver





Re: [ceph-users] Ceph Bluestore performance question

2018-02-18 Thread Stijn De Weirdt
hi oliver,

the IPoIB network is not 56gb, it's probably a lot less (20gb or so).
the ib_write_bw test is verbs/rdma based. do you have iperf tests
between hosts, and if so, can you share those results?

stijn
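
for reference, the kind of minimal test i mean (a sketch, assuming iperf3 is
installed on both hosts; replace the address placeholder):

  # on host A (server side)
  iperf3 -s

  # on host B, against host A's IPoIB address, 4 parallel streams for 30s
  iperf3 -c <ipoib-address-of-hostA> -P 4 -t 30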

> we are just getting started with our first Ceph cluster (Luminous 12.2.2) and 
> doing some basic benchmarking. 
> 
> We have two pools:
> - cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) 
> on 2 hosts (i.e. 2 SSDs each), setup as:
>   - replicated, min size 2, max size 4
>   - 128 PGs
> - cephfs_data, living on 6 hosts each of which has the following setup:
>   - 32 HDD drives (4 TB) each of which is a bluestore OSD, the LSI controller 
> to which they are attached is in JBOD personality
>   - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as 
> block-db by the bluestore OSDs living on the HDDs. 
>   - Created with:
> ceph osd erasure-code-profile set cephfs_data k=4 m=2 
> crush-device-class=hdd crush-failure-domain=host
> ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
>   - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB 
> block-db
> 
> The interconnect (public and cluster network) 
> is made via IP over Infiniband (56 GBit bandwidth), using the software stack 
> that comes with CentOS 7. 
> 
> This leaves us with the possibility that one of the metadata-hosts can fail, 
> and still one of the disks can fail. 
> For the data hosts, up to two machines total can fail. 
> 
> We have 40 clients connected to this cluster. We now run something like:
> dd if=/dev/zero of=some_file bs=1M count=1
> on each CPU core of each of the clients, yielding a total of 1120 writing 
> processes (all 40 clients have 28+28HT cores),
> using the ceph-fuse client. 
> 
> This yields a write throughput of a bit below 1 GB/s (capital B), which is 
> unexpectedly low. 
> Running a BeeGFS on the same cluster before (disks were in RAID 6 in that 
> case) yielded throughputs of about 12 GB/s,
> but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph 
> :-). 
> 
> I performed some basic tests to try to understand the bottleneck for Ceph:
> # rados bench -p cephfs_data 10 write --no-cleanup -t 40
> Bandwidth (MB/sec): 695.952
> Stddev Bandwidth:   295.223
> Max bandwidth (MB/sec): 1088
> Min bandwidth (MB/sec): 76
> Average IOPS:   173
> Stddev IOPS:73
> Max IOPS:   272
> Min IOPS:   19
> Average Latency(s): 0.220967
> Stddev Latency(s):  0.305967
> Max latency(s): 2.88931
> Min latency(s): 0.0741061
> 
> => This agrees mostly with our basic dd benchmark. 
> 
> Reading is a bit faster:
> # rados bench -p cephfs_data 10 rand
> => Bandwidth (MB/sec):   1108.75
> 
> However, the disks are reasonably quick:
> # ceph tell osd.0 bench
> {
> "bytes_written": 1073741824,
> "blocksize": 4194304,
> "bytes_per_sec": 331850403
> }
> 
> I checked and the OSD-hosts peaked at a load average of about 22 (they have 
> 24+24HT cores) in our dd benchmark,
> but stayed well below that (only about 20 % per OSD daemon) in the rados 
> bench test. 
> One idea would be to switch from jerasure to ISA, since the machines are all 
> Intel CPUs only anyways. 
> 
> Already tried: 
> - TCP stack tuning (wmem, rmem), no huge effect. 
> - changing the block sizes used by dd, no effect. 
> - Testing network throughput with ib_write_bw, this revealed something like:
>  #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
>  2       5000         19.73            19.30               10.118121
>  4       5000         52.79            51.70               13.553412
>  8       5000         101.23           96.65               12.668371
>  16      5000         243.66           233.42              15.297583
>  32      5000         350.66           344.73              11.296089
>  64      5000         909.14           324.85              5.322323
>  128     5000         1424.84          1401.29             11.479374
>  256     5000         2865.24          2801.04             11.473055
>  512     5000         5169.98          5095.08             10.434733
>  1024    5000         10022.75         9791.42             10.026410
>  2048    5000         10988.64         10628.83            5.441958
>  4096    5000         11401.40         11399.14            2.918180
> [...]
> 
> So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using 
> RDMA). 
> Other ideas that come to mind:
> - Testing with Ceph-RDMA, but that does not seem production-ready yet, if I 
> read the list correctly. 
> - Increasing osd_pool_erasure_code_stripe_width. 
> - Using ISA as EC plugin. 
> - Reducing the bluestore_cache_size_hdd, it seems when recovery + benchmark 
> is ongoing, swap is used (but not when 

[ceph-users] Ceph Bluestore performance question

2018-02-18 Thread Oliver Freyermuth
Dear Cephalopodians,

we are just getting started with our first Ceph cluster (Luminous 12.2.2) and 
doing some basic benchmarking. 

We have two pools:
- cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) on 
2 hosts (i.e. 2 SSDs each), setup as:
  - replicated, min size 2, max size 4
  - 128 PGs
- cephfs_data, living on 6 hosts each of which has the following setup:
  - 32 HDD drives (4 TB) each of which is a bluestore OSD, the LSI controller 
to which they are attached is in JBOD personality
  - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as 
block-db by the bluestore OSDs living on the HDDs. 
  - Created with:
ceph osd erasure-code-profile set cephfs_data k=4 m=2 
crush-device-class=hdd crush-failure-domain=host
ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
  - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB block-db

The interconnect (public and cluster network) 
is made via IP over Infiniband (56 GBit bandwidth), using the software stack 
that comes with CentOS 7. 

This leaves us with the possibility that one of the metadata-hosts can fail, 
and still one of the disks can fail. 
For the data hosts, up to two machines total can fail. 

We have 40 clients connected to this cluster. We now run something like:
dd if=/dev/zero of=some_file bs=1M count=1
on each CPU core of each of the clients, yielding a total of 1120 writing 
processes (all 40 clients have 28+28HT cores),
using the ceph-fuse client. 

This yields a write throughput of a bit below 1 GB/s (capital B), which is 
unexpectedly low. 
Running a BeeGFS on the same cluster before (disks were in RAID 6 in that case) 
yielded throughputs of about 12 GB/s,
but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph 
:-). 

I performed some basic tests to try to understand the bottleneck for Ceph:
# rados bench -p cephfs_data 10 write --no-cleanup -t 40
Bandwidth (MB/sec): 695.952
Stddev Bandwidth:   295.223
Max bandwidth (MB/sec): 1088
Min bandwidth (MB/sec): 76
Average IOPS:   173
Stddev IOPS:73
Max IOPS:   272
Min IOPS:   19
Average Latency(s): 0.220967
Stddev Latency(s):  0.305967
Max latency(s): 2.88931
Min latency(s): 0.0741061

=> This agrees mostly with our basic dd benchmark. 

Reading is a bit faster:
# rados bench -p cephfs_data 10 rand
=> Bandwidth (MB/sec):   1108.75

However, the disks are reasonably quick:
# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 331850403
}

I checked and the OSD-hosts peaked at a load average of about 22 (they have 
24+24HT cores) in our dd benchmark,
but stayed well below that (only about 20 % per OSD daemon) in the rados bench 
test. 
One idea would be to switch from jerasure to ISA, since the machines are all 
Intel CPUs only anyways. 

Already tried: 
- TCP stack tuning (wmem, rmem), no huge effect. 
- changing the block sizes used by dd, no effect. 
- Testing network throughput with ib_write_bw, this revealed something like:
 #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
 2       5000         19.73            19.30               10.118121
 4       5000         52.79            51.70               13.553412
 8       5000         101.23           96.65               12.668371
 16      5000         243.66           233.42              15.297583
 32      5000         350.66           344.73              11.296089
 64      5000         909.14           324.85              5.322323
 128     5000         1424.84          1401.29             11.479374
 256     5000         2865.24          2801.04             11.473055
 512     5000         5169.98          5095.08             10.434733
 1024    5000         10022.75         9791.42             10.026410
 2048    5000         10988.64         10628.83            5.441958
 4096    5000         11401.40         11399.14            2.918180
[...]

So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using 
RDMA). 
Other ideas that come to mind:
- Testing with Ceph-RDMA, but that does not seem production-ready yet, if I 
read the list correctly. 
- Increasing osd_pool_erasure_code_stripe_width. 
- Using ISA as EC plugin (see the sketch after this list). 
- Reducing the bluestore_cache_size_hdd, it seems when recovery + benchmark is 
ongoing, swap is used (but not when performing benchmarking only,
  so this should not explain the slowdown). 
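
For the ISA point in the list above, the profile I have in mind would look
roughly like this (an assumption on my side: since a pool's erasure-code
profile cannot be changed in place, this would mean a new profile, a new
pool, and migrating the data):

  # ISA-based profile analogous to the jerasure one above
  ceph osd erasure-code-profile set cephfs_data_isa \
      plugin=isa k=4 m=2 crush-device-class=hdd crush-failure-domain=host
  # test pool using it, for comparison benchmarks
  ceph osd pool create cephfs_data_isa 2048 2048 erasure cephfs_data_isa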

However, since we are just beginning with Ceph, it may well be that we are
missing something basic but crucial here.
For example, could it be that the block-db storage is too small? How to find 
out? 
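
For the block-db question, this is what I intend to look at, assuming the
bluefs perf counters are the right indicator (my reading of the docs, not
verified):

  # on an OSD host: db partition usage, and whether bluefs has spilled over
  # onto the slow (HDD) device, which would suggest block-db is too small
  ceph daemon osd.0 perf dump | grep -E '"(db_total_bytes|db_used_bytes|slow_used_bytes)"'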

Do any ideas come to mind? 

A second, hopefully easier question:
If one OSD-host fails in our setup, all PGs are changed to