Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Lindsay Mathieson
On Sat, 27 Dec 2014 09:03:16 PM Mark Kirkwood wrote:
 Yep. If you have 'em plugged into a RAID/HBA card with a battery backup 
 (that also disables their individual caches) then it is safe to use 
 nobarrier, otherwise data corruption will result if the server 
 experiences power loss.


Thanks Mark,

do people consider a UPS + Shutdown procedures a suitable substitute?
-- 
Lindsay



Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Mark Kirkwood

On 27/12/14 20:32, Lindsay Mathieson wrote:

I see a lot of people mount their xfs OSDs with nobarrier for extra
performance; it certainly makes a huge difference on my small system.

However I don't do it, as my understanding is that this runs a risk of data
corruption in the event of power failure. Is this the case even with Ceph?


Side note: how do I tell if my disk cache is battery backed? I have WD Red 3TB
drives (WD30EFRX-68EUZN0) with 64MB cache, but no mention of battery backup in the
docs. I presume that means it isn't? :)


Yep. If you have 'em plugged into a RAID/HBA card with a battery backup 
(that also disables their individual caches) then it is safe to use 
nobarrier, otherwise data corruption will result if the server 
experiences power loss.
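For illustration, the difference is just a mount option on the OSD data filesystem. A minimal /etc/fstab sketch (device and mount point are made up; the default filestore path is /var/lib/ceph/osd/ceph-<id>):

    # behind a BBU-protected RAID/HBA with the drive caches disabled
    /dev/sdb1   /var/lib/ceph/osd/ceph-0   xfs   noatime,nobarrier   0  0

    # plain disks or unprotected caches: keep barriers (the default)
    /dev/sdb1   /var/lib/ceph/osd/ceph-0   xfs   noatime             0  0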


Regards

Mark


Re: [ceph-users] replace osd's disk, can't auto recover data

2014-12-27 Thread 邱尚高

Sorry, the problem is fixed; a modification we made to the code led to this bug.
 
From: 邱尚高
Date: 2014-12-27 12:40
To: ceph-users
Subject: replace osd's disk, can't auto recover data

3 hosts:
1 CPU + 4 disks (3 TB SATA) per host
Ceph version: 0.80.6
OS: Red Hat 6.5
Cluster: 3 hosts, with 3 MONs + 9 OSDs (one OSD per disk)

1. When the cluster status is HEALTH_OK, I write a little data; then I can
find some block files in the PG directory.

[root@rhls-test2 release]# ll data/osd/ceph-0/current/2.106_head/
total 4100
-rw-r--r--. 1 root root 4194304 Dec 17 16:25 
rb.0.1021.6b8b4567.0024__head_753F3906__2

2. Before replacing the OSD disk, we set the cluster NOOUT flag.

3. We stop osd.2, which serves PG 2.106 as a replica, and replace its disk
with an empty disk.

4. We format the new disk with an XFS filesystem and run ceph-osd --mkfs:
ceph-osd -i 2 --mkfs --set-osd-fsid 86828adf-7579-4127-8789-cb5e8266f15c
Note:
To simplify the disk replacement, we modified the ceph-osd code to add a
--set-osd-fsid option so that the new OSD reuses the old fsid. (A sketch of the
stock replacement procedure, without this modification, follows step 6 below.)

5. The OSD starts OK, and all PGs' status is active+clean.
cluster 7c731223-9637-4e21-a6f5-c576a9cf92a4
 health HEALTH_OK
monmap e1: 3 mons at 
{a=192.169.1.84:6789/0,b=192.169.1.85:6789/0,c=192.169.1.86:6789/0}, election 
epoch 78, quorum 0,1,2 a,b,c
 osdmap e808: 9 osds: 9 up, 9 in
  pgmap v36218: 3072 pgs, 3 pools, 7069 MB data, 8254 objects
48063 MB used, 22298 GB / 22345 GB avail
3072 active+clean

6. But I find that the osd.2 disk does not have any data blocks, only the
metadata (omap, superblock, etc.). I can find all the PG directories, but they
are empty.
[root@rhls-test2 release]# ll data/osd/ceph-2/current/2.106_head/
total 0
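For comparison, a rough sketch of the stock replacement sequence for osd.2, without the code change (device path is a placeholder, and exact auth/key handling can differ when the same OSD id is reused):

    ceph osd set noout                       # stop CRUSH from marking OSDs out during the swap
    /etc/init.d/ceph stop osd.2              # stop the OSD, then physically replace the disk
    mkfs.xfs -f /dev/sdX1                    # fresh filesystem on the new disk
    mount /dev/sdX1 /var/lib/ceph/osd/ceph-2
    ceph-osd -i 2 --mkfs --mkkey             # stock mkfs generates a new fsid for the OSD
    ceph auth del osd.2                      # drop the old key before registering the new one
    ceph auth add osd.2 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-2/keyring
    /etc/init.d/ceph start osd.2
    ceph osd unset noout                     # let recovery/backfill repopulate the PGs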










Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Andrey Korolyov
Power supply means bigger capex and less redundancy, as the emergency
procedure in case of power failure is less deterministic than with a
controlled battery-backed cache. A cache battery is smaller and far more
predictable for health measurement than a UPS: if it passes its internal
check, it will *always* be enough to keep the memory powered for a while,
whereas a UPS requires periodic battle-testing if you want to know that it
can still ride out a power failure (two power lanes should be safe enough,
simply because the device itself has a more complex structure than a battery
with a single voltage stabilizer). Anyway, in my experience XFS nobarrier
does not bring enough of a performance boost to be worth enabling.


Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Lindsay Mathieson
On Sat, 27 Dec 2014 04:59:51 PM you wrote:
 Power supply means bigger capex and less redundancy, as the emergency
 procedure in case of power failure is less deterministic than with
 controlled battery-backed cache. 

Yes, the whole auto shut-down procedure is rather more complex and fragile
for a UPS than for a controller cache.

 Anyway XFS nobarrier
 does not bring enough performance boost to be enabled by my
 experience.

It makes a non-trivial difference on my (admittedly slow) setup, with write
bandwidth going from 35 MB/s to 51 MB/s.

-- 
Lindsay



Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Andrey Korolyov
On Sat, Dec 27, 2014 at 4:31 PM, Lindsay Mathieson
lindsay.mathie...@gmail.com wrote:
 On Sat, 27 Dec 2014 04:59:51 PM you wrote:
 Power supply means bigger capex and less redundancy, as the emergency
 procedure in case of power failure is less deterministic than with
 controlled battery-backed cache.

 Yes, the whole  auto shut-down procedure is rather more complex and fragile
 for a UPS than a controller cache

 Anyway XFS nobarrier
 does not bring enough performance boost to be enabled by my
 experience.

 It makes a non-trivial difference on my (admittedly slow) setup, with write
 bandwidth going from 35 MB/s to 51 MB/s

Are you able to separate the log from the data in your setup and check the
difference? If your devices are working strictly under their upper
limits for bandwidth/IOPS, separating the metadata and data bytes may help a
lot, at least for synchronous clients. So, depending on the type of your
benchmark (sync/async/IOPS- or bandwidth-hungry), you may win something just
by crossing journal and data between disks (and increase the failure domain
for a single disk as well :) ).


[ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Jiri Kanicky

Hi,

I just built my Ceph cluster but I am having problems with the health of the
cluster.


Here are a few details:
- I followed the ceph documentation.
- I used the btrfs filesystem for all OSDs.
- I did not set osd pool default size = 2, as I thought that with
2 nodes + 4 OSDs I could leave the default of 3. I am not sure if this was right.
- I noticed that the default pools data and metadata were not created; only
the rbd pool was created.
- As it was complaining that the pg_num was too low, I increased the
pg_num for pool rbd to 133 (400/3) and ended up with 'pool rbd pg_num 133
> pgp_num 64'.


Would you give me a hint as to where I have made a mistake? (I can remove the
OSDs and start over if needed.)



cephadmin@ceph1:/etc/ceph$ sudo ceph health
HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133
> pgp_num 64

cephadmin@ceph1:/etc/ceph$ sudo ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs
stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd
pg_num 133 > pgp_num 64
 monmap e1: 2 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 
8, quorum 0,1 ceph1,ceph2

 osdmap e42: 4 osds: 4 up, 4 in
  pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
11704 kB used, 11154 GB / 11158 GB avail
  29 active+undersized+degraded
 104 active+remapped


cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   up      1
1       2.72                    osd.1   up      1
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1


cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
0 rbd,

cephadmin@ceph1:/etc/ceph$ cat ceph.conf
[global]
fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
public_network = 192.168.30.0/24
cluster_network = 10.1.1.0/24
mon_initial_members = ceph1, ceph2
mon_host = 192.168.30.21,192.168.30.22
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

Thank you
Jiri


[ceph-users] RBD client STRIPINGV2 support

2014-12-27 Thread Florent MONTHEL
Hi,

I’ve just created an image with striping support as below (image format 2, 16
stripes of 64K with 4MB objects):

rbd create sandevices/flaprdweb01_lun010 --size 102400 --stripe-unit 65536 
--stripe-count 16 --order 22  --image-format 2

rbd info sandevices/flaprdweb01_lun010
rbd image 'flaprdweb01_lun010':
size 102400 MB in 25600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.40c52ae8944a
format: 2
features: layering, striping
stripe unit: 65536 bytes
stripe count: 16

But when I try to map the device, I get an unsupported striping alert on my dmesg
console.

rbd map sandevices/flaprdweb01_lun010 --name client.admin
rbd: sysfs write failed
rbd: map failed: (22) Invalid argument

dmesg | tail
[15352.510385] rbd: image flaprdweb01_lun010: unsupported stripe unit (got 
65536 want 4194304)

Do you know if it’s scheduled to support STRIPINGV2 in the rbd kernel client?
How can I mount my device?

Thanks in advance 


Florent Monthel







Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Christian Balzer

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:

 Hi,
 
 I just build my CEPH cluster but having problems with the health of the 
 cluster.
 
You're not telling us the version, but it's clearly 0.87 or beyond.

 Here are few details:
 - I followed the ceph documentation.
Outdated, unfortunately.

 - I used btrfs filesystem for all OSDs
Big mistake number 1; do some research (Google, ML archives).
Though not related to your problems.

 - I did not set osd pool default size = 2  as I thought that if I have 
 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this was right.
Big mistake/assumption number 2: replica placement under the default CRUSH
rule is per host, so with size=3 and only two hosts your PGs can never get a
full set of replicas. That's your main issue here.
Either set the size to 2 or use 3 hosts (example commands below).

 - I noticed that default pools data,metadata were not created. Only 
 rbd pool was created.
See outdated docs above. The majority of use cases are with RBD, so since
Giant the CephFS pools are not created by default.

 - As it was complaining that the pg_num is too low, I increased the 
 pg_num for pool rbd to 133 (400/3) and end up with pool rbd pg_num 133 
   pgp_num 64.
 
Re-read the (in this case correct) documentation.
It clearly states to round up to nearest power of 2, in your case 256.
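Concretely, something along these lines (pool name as above; 256 being the next power of 2 above 133):

    ceph osd pool set rbd size 2        # replica count matching the two hosts
    ceph osd pool set rbd pg_num 256    # next power of 2 above 400/3
    ceph osd pool set rbd pgp_num 256   # keep pgp_num in step with pg_num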

Regards.

Christian

 Would you give me hint where I have made the mistake? (I can remove the 
 OSDs and start over if needed.)
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph health
 HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck 
 unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 
   pgp_num 64
 cephadmin@ceph1:/etc/ceph$ sudo ceph status
  cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
   health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs 
 stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd 
 pg_num 133  pgp_num 64
   monmap e1: 2 mons at 
 {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 
 8, quorum 0,1 ceph1,ceph2
   osdmap e42: 4 osds: 4 up, 4 in
pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
  11704 kB used, 11154 GB / 11158 GB avail
29 active+undersized+degraded
   104 active+remapped
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
 # idweight  type name   up/down reweight
 -1  10.88   root default
 -2  5.44host ceph1
 0   2.72osd.0   up  1
 1   2.72osd.1   up  1
 -3  5.44host ceph2
 2   2.72osd.2   up  1
 3   2.72osd.3   up  1
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
 0 rbd,
 
 cephadmin@ceph1:/etc/ceph$ cat ceph.conf
 [global]
 fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 public_network = 192.168.30.0/24
 cluster_network = 10.1.1.0/24
 mon_initial_members = ceph1, ceph2
 mon_host = 192.168.30.21,192.168.30.22
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true
 
 Thank you
 Jiri


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Nico Schottelius
Hey Jiri,

also raise the pgp_num (pg != pgp; it's easy to miss).

Cheers,

Nico

Jiri Kanicky [Sun, Dec 28, 2014 at 01:52:39AM +1100]:
 Hi,
 
 I just build my CEPH cluster but having problems with the health of
 the cluster.
 
 Here are few details:
 - I followed the ceph documentation.
 - I used btrfs filesystem for all OSDs
 - I did not set osd pool default size = 2  as I thought that if I
 have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
 was right.
 - I noticed that default pools data,metadata were not created.
 Only rbd pool was created.
 - As it was complaining that the pg_num is too low, I increased the
 pg_num for pool rbd to 133 (400/3) and end up with pool rbd pg_num
 133  pgp_num 64.
 
 Would you give me hint where I have made the mistake? (I can remove
 the OSDs and start over if needed.)
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph health
 HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
 unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
 133  pgp_num 64
 cephadmin@ceph1:/etc/ceph$ sudo ceph status
 cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
  health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
 rbd pg_num 133  pgp_num 64
  monmap e1: 2 mons at
 {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
 epoch 8, quorum 0,1 ceph1,ceph2
  osdmap e42: 4 osds: 4 up, 4 in
   pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
 11704 kB used, 11154 GB / 11158 GB avail
   29 active+undersized+degraded
  104 active+remapped
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
 # idweight  type name   up/down reweight
 -1  10.88   root default
 -2  5.44host ceph1
 0   2.72osd.0   up  1
 1   2.72osd.1   up  1
 -3  5.44host ceph2
 2   2.72osd.2   up  1
 3   2.72osd.3   up  1
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
 0 rbd,
 
 cephadmin@ceph1:/etc/ceph$ cat ceph.conf
 [global]
 fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 public_network = 192.168.30.0/24
 cluster_network = 10.1.1.0/24
 mon_initial_members = ceph1, ceph2
 mon_host = 192.168.30.21,192.168.30.22
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true
 
 Thank you
 Jiri



-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24


[ceph-users] Not running multiple services on the same machine?

2014-12-27 Thread Christopher Armstrong
Hi folks,

I've heard several comments on the mailing list warning against running
multiple Ceph services (monitors, daemons, MDS, gateway) on the same
machine. I was wondering if someone could shed more light on the dangers of
this. In Deis[1] we only require clusters to be 3 machines big, and we need
to run monitors, daemons, and MDS servers. Deis runs on CoreOS, so all of
our services are shipped as Docker containers. We run Ceph within
containers as our store[2] component, so on a single CoreOS host we're
running a monitor, daemon, MDS, gateway, and consuming the cluster with a
CephFS mount.

I know it's ill-advised, but my question is - why? What sort of issues are
we looking at? Data loss, performance, etc.? When I implemented this I was
unaware of the recommendation not to do this, and I'd like to address any
potential issues now.

Thanks!

Chris

[1]: https://github.com/deis/deis
[2]: https://github.com/deis/deis/tree/master/store


Re: [ceph-users] RBD client STRIPINGV2 support

2014-12-27 Thread Ilya Dryomov
On Sat, Dec 27, 2014 at 6:46 PM, Florent MONTHEL fmont...@flox-arts.net wrote:
 Hi,

 I’ve just created image with striping support like below (image type 2 - 16
 stripes of 64K with 4MB object) :

 rbd create sandevices/flaprdweb01_lun010 --size 102400 --stripe-unit 65536
 --stripe-count 16 --order 22  --image-format 2

 rbd info sandevices/flaprdweb01_lun010
 rbd image 'flaprdweb01_lun010':
 size 102400 MB in 25600 objects
 order 22 (4096 kB objects)
 block_name_prefix: rbd_data.40c52ae8944a
 format: 2
 features: layering, striping
 stripe unit: 65536 bytes
 stripe count: 16

 But when I try to map device, I’ve unsupported striping alert on my dmesg
 console.

 rbd map sandevices/flaprdweb01_lun010 --name client.admin
 rbd: sysfs write failed
 rbd: map failed: (22) Invalid argument

 dmesg | tail
 [15352.510385] rbd: image flaprdweb01_lun010: unsupported stripe unit (got
 65536 want 4194304)

 Do you know if it’s scheduled to support STRIPINGV2 in the rbd kernel client?
 How can I mount my device?

You can't - krbd doesn't support it yet.  It's planned, in fact it's
the top item on the krbd list.  Currently STRIPINGV2 images can be
mapped only if su=4M and sc=1 (i.e. if striping params match v1 images)
and that's the error you are tripping over.
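Until then, an image meant for kernel mapping has to stick to the default striping, e.g. recreating it without the stripe options (a sketch only, same pool/image name as above):

    # default striping: stripe unit = object size (4 MB), stripe count = 1
    rbd create sandevices/flaprdweb01_lun010 --size 102400 --order 22 --image-format 2
    rbd map sandevices/flaprdweb01_lun010 --name client.admin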

Thanks,

Ilya


Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Lindsay Mathieson
On Sat, 27 Dec 2014 06:02:32 PM you wrote:
 Are you able to separate log with data in your setup and check the
 difference? 

Do you mean putting the OSD journal on a separate disk? I have the journals on
SSD partitions, which has helped a lot; previously I was getting 13 MB/s.

It's not a good SSD (Samsung 840 EVO :(); one of my plans for the new year is to
get SSDs with better sequential write speed and IOPS.

I've been trying to figure out whether adding more OSDs will improve my
performance; I only have 2 OSDs (one per node).
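For what it's worth, a quick way to confirm where a filestore OSD's journal actually lives, assuming the default data path:

    ls -l /var/lib/ceph/osd/ceph-0/journal    # should be a symlink to the SSD partition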

  So, depending on type of your benchmark
 (sync/async/IOPS-/bandwidth-hungry) you may win something just for
 crossing journal and data between disks (and increase failure domain
 for a single disk as well  ).

One does tend to focus on raw sequential reads/writes for benchmarking, but my
actual usage is solely for hosting KVM images, so really random R/W is probably
more important.

-- 
Lindsay



Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Andrey Korolyov
On Sun, Dec 28, 2014 at 1:25 AM, Lindsay Mathieson
lindsay.mathie...@gmail.com wrote:
 On Sat, 27 Dec 2014 06:02:32 PM you wrote:
 Are you able to separate log with data in your setup and check the
 difference?

 Do you mean putting the OSD journal on a separate disk? I have the journals on
 SSD partitions, which has helped a lot, previously I was getting 13 MB/s


No, I meant the XFS journal, as we are speaking about filestore filesystem performance.
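If anyone wants to try that, a minimal sketch of building a filestore XFS with its log on a separate device (device names are placeholders):

    mkfs.xfs -f -l logdev=/dev/ssd-part1,size=128m /dev/sdb1   # data on the HDD, XFS log on an SSD partition
    mount -o noatime,logdev=/dev/ssd-part1 /dev/sdb1 /var/lib/ceph/osd/ceph-0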

 Its not a good SSD - Samsung 840 EVO :( one of my plans for the new year is to
 get SSD's with better seq write speed and IOPS

 I've been trying to figure out if adding more OSD's will improve my
 performance, I only have 2 OSD's (one per node)

Erm, yes. Two OSDs cannot be considered even for a performance-measurement
testbed setup, nor should three or any other small number. This explains the
numbers you are getting and the impact of the nobarrier option.


  So, depending on type of your benchmark
 (sync/async/IOPS-/bandwidth-hungry) you may win something just for
 crossing journal and data between disks (and increase failure domain
 for a single disk as well  ).

 One does tend to focus on raw sequential reads/writes for benchmarking, but my actual
 usage is solely for hosting KVM images, so really random R/W is probably more
 important.

Ok, then my suggestion may not help as much as it can.


 --
 Lindsay


[ceph-users] Improving Performance with more OSD's?

2014-12-27 Thread Lindsay Mathieson
I'm looking to improve the raw performance on my small setup (2 compute nodes,
2 OSDs). It is only used for hosting KVM images.

Raw read/write is roughly 200/35 MB/s. Starting 4+ VMs simultaneously pushes
iowait over 30%, though the system keeps chugging along.

Budget is limited ... :(

I plan to upgrade my SSD journals to something better than the Samsung 840 
EVO's (Intel 520/530?)

One of the things I see mentioned a lot in blogs etc. is how Ceph's performance
improves as you add more OSDs, and that the quality of the disks does not
matter so much as the quantity.

How does this work? Does Ceph stripe reads and writes across the OSDs to
improve performance?

If I add 3 cheap OSDs to each node (500GB - 1TB), each with a 10GB SSD journal
partition, could I expect a big improvement in performance?

What sort of redundancy should I set up? Currently it's min_size=1, size=2.
Capacity is not an issue, we already have 150% more space than we need;
redundancy and performance are more important.

Now that I think on it, we can live with the slow write performance, but
reducing iowait would be *really* good.

thanks,
-- 
Lindsay



Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread ji...@ganomi.com
Thanks for the tip. Will do.

Jiri

- Reply message -
From: Nico Schottelius nico-ceph-us...@schottelius.org
To: ceph-us...@ceph.com
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 
pgs stuck unclean; 29 pgs stuck undersized;
Date: Sun, Dec 28, 2014 03:49

Hey Jiri,

also raise the pgp_num (pg != pgp; it's easy to miss).

Cheers,

Nico

Jiri Kanicky [Sun, Dec 28, 2014 at 01:52:39AM +1100]:
 Hi,
 
 I just build my CEPH cluster but having problems with the health of
 the cluster.
 
 Here are few details:
 - I followed the ceph documentation.
 - I used btrfs filesystem for all OSDs
 - I did not set osd pool default size = 2  as I thought that if I
 have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
 was right.
 - I noticed that default pools data,metadata were not created.
 Only rbd pool was created.
 - As it was complaining that the pg_num is too low, I increased the
 pg_num for pool rbd to 133 (400/3) and end up with pool rbd pg_num
 133  pgp_num 64.
 
 Would you give me hint where I have made the mistake? (I can remove
 the OSDs and start over if needed.)
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph health
 HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
 unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
 133  pgp_num 64
 cephadmin@ceph1:/etc/ceph$ sudo ceph status
 cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
  health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
 rbd pg_num 133  pgp_num 64
  monmap e1: 2 mons at
 {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
 epoch 8, quorum 0,1 ceph1,ceph2
  osdmap e42: 4 osds: 4 up, 4 in
   pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
 11704 kB used, 11154 GB / 11158 GB avail
   29 active+undersized+degraded
  104 active+remapped
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
 # idweight  type name   up/down reweight
 -1  10.88   root default
 -2  5.44host ceph1
 0   2.72osd.0   up  1
 1   2.72osd.1   up  1
 -3  5.44host ceph2
 2   2.72osd.2   up  1
 3   2.72osd.3   up  1
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
 0 rbd,
 
 cephadmin@ceph1:/etc/ceph$ cat ceph.conf
 [global]
 fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 public_network = 192.168.30.0/24
 cluster_network = 10.1.1.0/24
 mon_initial_members = ceph1, ceph2
 mon_host = 192.168.30.21,192.168.30.22
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true
 
 Thank you
 Jiri



-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24


Re: [ceph-users] Weird scrub problem

2014-12-27 Thread Andrey Korolyov
On Sat, Dec 27, 2014 at 4:09 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Tue, Dec 23, 2014 at 4:17 AM, Samuel Just sam.j...@inktank.com wrote:
 Oh, that's a bit less interesting.  The bug might be still around though.
 -Sam

 On Mon, Dec 22, 2014 at 2:50 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
 You'll have to reproduce with logs on all three nodes.  I suggest you
 open a high priority bug and attach the logs.

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 I'll be out for the holidays, but I should be able to look at it when
 I get back.
 -Sam



 Thanks Sam,

 although I am not sure whether it is of more than historical interest (the
 mentioned cluster is running Cuttlefish), I'll try to collect logs for the
 scrub.

 Same stuff:
 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15447.html
 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14918.html

 Looks like the issue is still with us, though it requires meta or file
 structure corruption to show itself. I'll check whether it can be reproduced by
 rsync -X'ing the secondary PG subdir over the primary one, or vice-versa.
 My case shows slightly different pathnames for the same objects with the same
 checksums, which may be the root cause. As every case mentioned, including
 mine, happened in an oh-shit-hardware-is-broken situation, I suggest that the
 incurable corruption happens during primary backfill from the active replica
 at recovery time.

Recovery/backfill from a corrupted primary copy results in a crash
(attached) of the primary OSD; for example, it can be triggered by purging
one of the secondary copies (top of the cuttlefish branch for line numbers).
As the secondaries preserve the same data with the same checksums, it is
possible to destroy both the meta record and the PG directory and refill the
primary. The interesting point is that the corrupted primary was
completely refilled after the hardware failure, but it looks like it survived
long enough after the failure event to spread corruption to the copies;
I simply cannot imagine a better explanation.
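For reference, the debug levels Sam asked for earlier in the thread would go into ceph.conf on each node (followed by an OSD restart), roughly:

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1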
Thread 1 (Thread 0x7f193190d700 (LWP 64087)):
#0  0x7f194a47ab7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00857d59 in reraise_fatal (signum=6)
at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  signal handler called
#4  0x7f1948879405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7f194887cb5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7f194917789d in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7f1949175996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7f19491759c3 in std::terminate() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7f1949175bee in __cxa_throw ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0090436a in ceph::__ceph_assert_fail (
assertion=0x9caf67 r = 0, file=optimized out, line=7115, 
func=0x9d1900 void ReplicatedPG::scan_range(hobject_t, int, int, 
PG::BackfillInterval*)) at common/assert.cc:77
#11 0x0065de69 in ReplicatedPG::scan_range (this=this@entry=0x4df6000, 
begin=..., min=min@entry=32, max=max@entry=64, bi=bi@entry=0x4df6d40)
at osd/ReplicatedPG.cc:7115
#12 0x0066f5c6 in ReplicatedPG::recover_backfill (
this=this@entry=0x4df6000, max=max@entry=1) at osd/ReplicatedPG.cc:6923
#13 0x0067c18d in ReplicatedPG::start_recovery_ops (this=0x4df6000, 
max=1, prctx=optimized out) at osd/ReplicatedPG.cc:6561
#14 0x006f2340 in OSD::do_recovery (this=0x2ba7000, pg=pg@entry=
0x4df6000) at osd/OSD.cc:6104
#15 0x00735361 in OSD::RecoveryWQ::_process (this=optimized out, 
pg=0x4df6000) at osd/OSD.h:1248
#16 0x008faeba in ThreadPool::worker (this=0x2ba75e0, wt=0x7be1540)
at common/WorkQueue.cc:119
#17 0x008fc160 in ThreadPool::WorkThread::entry (this=optimized out)
at common/WorkQueue.h:316
#18 0x7f194a472e9a in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x7f19489353dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#20 0x in ?? ()


Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Kyle Bader
 do people consider a UPS + Shutdown procedures a suitable substitute?

I certainly wouldn't, I've seen utility power fail and the transfer
switch fail to transition to UPS strings. Had this happened to me with
nobarrier it would have been a very sad day.

-- 

Kyle Bader


Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Mark Kirkwood

On 28/12/14 15:51, Kyle Bader wrote:

do people consider a UPS + Shutdown procedures a suitable substitute?


I certainly wouldn't, I've seen utility power fail and the transfer
switch fail to transition to UPS strings. Had this happened to me with
nobarrier it would have been a very sad day.



I'd second that. In addition I've heard of cases where the switchover to
the UPS worked OK but the damn thing had a flat battery! So the
switchover process and UPS reliability need to be well rehearsed +
monitored if you want to rely on this type of solution.


Cheers

Mark


Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Christian Balzer

Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:

 Hi Christian.
 
 Thank you for your suggestions. 
 
 I will set the osd pool default size to 2 as you recommended. As
 mentioned the documentation is talking about OSDs, not nodes, so that
 must have confused me.

Note that changing this will only affect new pools, of course. So to sort
out your current state, either start over with this value set before
creating/starting anything, or reduce the current size of the existing pool
('ceph osd pool set <poolname> size <num>').
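i.e. something along the lines of (rbd being the only pool here):

    ceph osd pool get rbd size      # check the current replica count
    ceph osd pool set rbd size 2    # reduce it to match the two hosts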

Have a look at the crushmap example, or even better your own current one,
and you will see that by default the host is the failure domain.
Which of course makes a lot of sense.
 
 Regarding the BTRFS, i thought that btrfs is better option for the
 future providing more features. I know that XFS might be more stable,
 but again my impression was that btrfs is the focus for future
 development. Is that correct?
 
I'm not a developer, but if you scour the ML archives you will find a
number of threads about BTRFS (and ZFS).
The biggest issues with BTRFS are not just stability but also the fact
that it degrades rather quickly (fragmentation) due to the COW nature of
it and less smarts than ZFS in that area.
So development on the Ceph side is not the issue per se.

IMHO BTRFS looks more and more stillborn and with regard to Ceph ZFS might
become the better choice (in the future), with KV store backends being an
alternative for some use cases (also far from production ready at this
time).

Regards,

Christian
 You are right with the round up. I forgot about that.
 
 Thanks for your help. Much appreciated.
 Jiri
 
 - Reply message -
 From: Christian Balzer ch...@gol.com
 To: ceph-us...@ceph.com
 Cc: Jiri Kanicky ji...@ganomi.com
 Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck
 degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun, Dec
 28, 2014 03:29
 
 Hello,
 
 On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:
 
  Hi,
  
  I just build my CEPH cluster but having problems with the health of
  the cluster.
  
 You're not telling us the version, but it's clearly 0.87 or beyond.
 
  Here are few details:
  - I followed the ceph documentation.
 Outdated, unfortunately.
 
  - I used btrfs filesystem for all OSDs
 Big mistake number 1, do some research (google, ML archives).
 Though not related to to  your problems.
 
  - I did not set osd pool default size = 2  as I thought that if I
  have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
  was right.
 Big mistake, assumption number 2,  replications size by the default CRUSH
 rule is determined by hosts. So that's your main issue here. 
 Either set it to 2 or use 3 hosts.
 
  - I noticed that default pools data,metadata were not created. Only 
  rbd pool was created.
 See outdated docs above. The majority of use cases is with RBD, so since
 Giant the cephfs pools are not created by default.
 
  - As it was complaining that the pg_num is too low, I increased the 
  pg_num for pool rbd to 133 (400/3) and end up with pool rbd pg_num
  133 
pgp_num 64.
  
 Re-read the (in this case correct) documentation.
 It clearly states to round up to nearest power of 2, in your case 256.
 
 Regards.
 
 Christian
 
  Would you give me hint where I have made the mistake? (I can remove
  the OSDs and start over if needed.)
  
  
  cephadmin@ceph1:/etc/ceph$ sudo ceph health
  HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck 
  unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
  133 
pgp_num 64
  cephadmin@ceph1:/etc/ceph$ sudo ceph status
   cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
  pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
  rbd pg_num 133  pgp_num 64
monmap e1: 2 mons at 
  {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
  epoch 8, quorum 0,1 ceph1,ceph2
osdmap e42: 4 osds: 4 up, 4 in
 pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
   11704 kB used, 11154 GB / 11158 GB avail
 29 active+undersized+degraded
104 active+remapped
  
  
  cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
  # idweight  type name   up/down reweight
  -1  10.88   root default
  -2  5.44host ceph1
  0   2.72osd.0   up  1
  1   2.72osd.1   up  1
  -3  5.44host ceph2
  2   2.72osd.2   up  1
  3   2.72osd.3   up  1
  
  
  cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
  0 rbd,
  
  cephadmin@ceph1:/etc/ceph$ cat ceph.conf
  [global]
  fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
  public_network = 192.168.30.0/24
  cluster_network = 10.1.1.0/24
  mon_initial_members = ceph1, ceph2
  mon_host = 192.168.30.21,192.168.30.22
  auth_cluster_required = cephx
  

Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Sage Weil
On Sun, 28 Dec 2014, Mark Kirkwood wrote:
 On 28/12/14 15:51, Kyle Bader wrote:
   do people consider a UPS + Shutdown procedures a suitable substitute?
  
  I certainly wouldn't, I've seen utility power fail and the transfer
  switch fail to transition to UPS strings. Had this happened to me with
  nobarrier it would have been a very sad day.
  
 
 I'd second that. In addition I've heard of cases where the switchover to the
 UPS worked OK but the damn thing had a flat battery! So the switchover process
 and UPS reliability need to be well rehearsed + monitored if you want to
 rely on this type of solution.

Right.

nobarrier is definitely *NOT* recommended under almost any circumstances.  
Yes, there are some situations where it is safe, but there are so many 
things that can go wrong and break it (from buggy kernel to buggy 
controller firmware to storage device to power etc) that it is IMO rarely 
worth the risk.

sage


Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Jiri Kanicky

Hi Christian.

Thank you for your comments again. Very helpful.

I will try to fix the current pool and see how it goes. Its good to 
learn some troubleshooting skills.


Regarding the BTRFS vs XFS, not sure if the documentation is old. My 
decision was based on this:


http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/

Note

We currently recommend XFS for production deployments. We
recommend btrfs for testing, development, and any non-critical
deployments. *We believe that btrfs has the correct feature set
and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the
necessary stability for today’s deployments. btrfs development is
proceeding rapidly: users should be comfortable installing the latest
released upstream kernels and be able to track development activity for
critical bug fixes.




Thanks
Jiri


On 28/12/2014 16:01, Christian Balzer wrote:

Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:


Hi Christian.

Thank you for your suggestions.

I will set the osd pool default size to 2 as you recommended. As
mentioned the documentation is talking about OSDs, not nodes, so that
must have confused me.


Note that changing this will only affect new pools of course. So to sort
out your current state either start over with this value set before
creating/starting anything or reduce the current size (ceph osd pool set
poolname size).

Have a look at the crushmap example or even better your own, current one
and you will see where by default the host is the failure domain.
Which of course makes a lot of sense.
  

Regarding the BTRFS, i thought that btrfs is better option for the
future providing more features. I know that XFS might be more stable,
but again my impression was that btrfs is the focus for future
development. Is that correct?


I'm not a developer, but if you scour the ML archives you will find a
number of threads about BTRFS (and ZFS).
The biggest issues with BTRFS are not just stability but also the fact
that it degrades rather quickly (fragmentation) due to the COW nature of
it and less smarts than ZFS in that area.
So development on the Ceph side is not the issue per se.

IMHO BTRFS looks more and more stillborn and with regard to Ceph ZFS might
become the better choice (in the future), with KV store backends being an
alternative for some use cases (also far from production ready at this
time).

Regards,

Christian

You are right with the round up. I forgot about that.

Thanks for your help. Much appreciated.
Jiri

- Reply message -
From: Christian Balzer ch...@gol.com
To: ceph-us...@ceph.com
Cc: Jiri Kanicky ji...@ganomi.com
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck
degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun, Dec
28, 2014 03:29

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:


Hi,

I just build my CEPH cluster but having problems with the health of
the cluster.


You're not telling us the version, but it's clearly 0.87 or beyond.


Here are few details:
- I followed the ceph documentation.

Outdated, unfortunately.


- I used btrfs filesystem for all OSDs

Big mistake number 1, do some research (google, ML archives).
Though not related to to  your problems.


- I did not set osd pool default size = 2  as I thought that if I
have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
was right.

Big mistake, assumption number 2,  replications size by the default CRUSH
rule is determined by hosts. So that's your main issue here.
Either set it to 2 or use 3 hosts.


- I noticed that default pools data,metadata were not created. Only
rbd pool was created.

See outdated docs above. The majority of use cases is with RBD, so since
Giant the cephfs pools are not created by default.


- As it was complaining that the pg_num is too low, I increased the
pg_num for pool rbd to 133 (400/3) and end up with pool rbd pg_num
133
   pgp_num 64.


Re-read the (in this case correct) documentation.
It clearly states to round up to nearest power of 2, in your case 256.

Regards.

Christian


Would you give me hint where I have made the mistake? (I can remove
the OSDs and start over if needed.)


cephadmin@ceph1:/etc/ceph$ sudo ceph health
HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
133
   pgp_num 64
cephadmin@ceph1:/etc/ceph$ sudo ceph status
  cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
   health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
rbd pg_num 133  pgp_num 64
   monmap e1: 2 mons at
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
epoch 8, quorum 0,1 ceph1,ceph2
   osdmap e42: 4 osds: 4 up, 4 in
pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
  11704 kB used, 11154 GB / 11158 GB avail
29 active+undersized+degraded
   

Re: [ceph-users] Improving Performance with more OSD's?

2014-12-27 Thread Christian Balzer
On Sun, 28 Dec 2014 08:59:33 +1000 Lindsay Mathieson wrote:

 I'm looking to improve the raw performance on my small setup (2 Compute
 Nodes, 2 OSD's). Only used for hosting KVM images.
 
This doesn't really make things clear, do you mean 2 STORAGE nodes with 2
OSDs (HDDs) each?
In either case that's a very small setup (and with a replication of 2 a
risky one, too), so don't expect great performance.

It would help if you'd tell us what these nodes are made of
(CPU, RAM, disks, network) so we can at least guess what that cluster
might be capable of.

 Raw read/write is roughly 200/35 MB/s. Starting 4+ VM's simultaneously
 pushes iowaits over 30%, though the system keeps chugging along.
 
Throughput numbers aren't exactly worthless, but you will find IOPS to be
the killer in most cases. Also without describing how you measured these
numbers (rados bench, fio, bonnie, on the host, inside a VM) they become
even more muddled. 
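For example, results from something like the following are much easier to compare (pool name and runtime arbitrary):

    rados bench -p rbd 60 write --no-cleanup    # cluster-level write bandwidth/latency
    rados bench -p rbd 60 seq                   # sequential reads of the objects just written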

 Budget is limited ... :(
 
 I plan to upgrade my SSD journals to something better than the Samsung
 840 EVO's (Intel 520/530?)
 
Not a big improvement really.
Take a look at the 100GB Intel DC S3700s; while they can write only at
200MB/s, they are priced rather nicely and they will deliver that
performance at ANY time, and for a long time, too.

 One of the things I see mentioned a lot in blogs etc is how ceph's
 performance improves as you add more OSD's and that the quality of the
 disks does not matter so much as the quantity.
 
 How does this work? does ceph stripe reads and writes across the OSD's
 to improve performance?
 
Yes and no. It stripes by default to 4MB objects, so with enough OSDs and
clients I/Os will become distributed, scaling up nicely. However a single
client could be hitting the same object on the same OSD all the time
(small DB file for example), so you won't see much or any improvement in
that case.
There is also the option to stripe things on a much smaller scale, however
that takes some planning and needs to be done at pool creation time. 
See and read the Ceph documentation.
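For RBD images specifically, the striping parameters are set at image creation time (librbd only for now; see the STRIPINGV2 thread above for the kernel-client caveat). A sketch with a made-up image name:

    rbd create rbd/vm-disk-1 --size 20480 --stripe-unit 65536 --stripe-count 16 --order 22 --image-format 2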

 If I add 3 cheap OSD's to each node (500GB - 1TB) with 10GB SSD journal 
 partition each could I expect a big improvement in performance?
 
That depends a lot on the stuff you haven't told us (CPU/RAM/network).
Given sufficient amounts of those, especially CPU, the answer is yes.
A large amount of RAM on the storage nodes will improve reads, as hot
objects become and remain cached.

Of course having decent HDDs will help even with journals on SSDs, for
example the Toshiba DTxx (totally not recommended for ANYTHING) HDDs
cost about the same as their entry level enterprise MG0x drives, which
are nearly twice as fast in the IOPS department.

 What sort of redundancy to setup? currently its min= 1, size=2. Size is
 not an issue, we already have 150% more space than we need, redundancy
 and performance is more important.
 
You really, really want size 3 and a third node for both performance
(reads) and redundancy.

 Now I think on it, we can live with the slow write performance, but
 reducing iowait would be *really* good.
 
Decent SSDs (see above) and more (decent) spindles will help with both.

Regards,

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Christian Balzer

Hello Jiri,

On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote:

 Hi Christian.
 
 Thank you for your comments again. Very helpful.
 
 I will try to fix the current pool and see how it goes. Its good to 
 learn some troubleshooting skills.
 
Indeed, knowing what to do when things break is where it's at.

 Regarding the BTRFS vs XFS, not sure if the documentation is old. My 
 decision was based on this:
 
 http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
 
It's dated for sure and a bit of wishful thinking on behalf of the Ceph
developers. 
Who understandably didn't want to re-invent the wheel inside Ceph when the
underlying file system could provide it (checksums, snapshots, etc).

ZFS has all the features (and much better tested) BTRFS is aspiring to and
if kept below 80% utilization doesn't fragment itself to death.

At the end of that page they mention deduplication, which of course (as I
wrote recently in the use ZFS for OSDs thread) is unlikely to do anything
worthwhile at all.

Simply put, some things _need_ to be done in Ceph to work properly and
can't be delegated to the underlying FS or other storage backend. 

Christian

 Note
 
 We currently recommend XFS for production deployments. We
 recommend btrfs for testing, development, and any non-critical
 deployments. *We believe that btrfs has the correct feature set
 and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the
 necessary stability for today’s deployments. btrfs development is
 proceeding rapidly: users should be comfortable installing the latest
 released upstream kernels and be able to track development activity for
 critical bug fixes.
 
 
 
 Thanks
 Jiri
 
 
 On 28/12/2014 16:01, Christian Balzer wrote:
  Hello,
 
  On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:
 
  Hi Christian.
 
  Thank you for your suggestions.
 
  I will set the osd pool default size to 2 as you recommended. As
  mentioned the documentation is talking about OSDs, not nodes, so that
  must have confused me.
 
  Note that changing this will only affect new pools of course. So to
  sort out your current state either start over with this value set
  before creating/starting anything or reduce the current size (ceph osd
  pool set poolname size).
 
  Have a look at the crushmap example or even better your own, current
  one and you will see where by default the host is the failure domain.
  Which of course makes a lot of sense.

  Regarding the BTRFS, i thought that btrfs is better option for the
  future providing more features. I know that XFS might be more stable,
  but again my impression was that btrfs is the focus for future
  development. Is that correct?
 
  I'm not a developer, but if you scour the ML archives you will find a
  number of threads about BTRFS (and ZFS).
  The biggest issues with BTRFS are not just stability but also the fact
  that it degrades rather quickly (fragmentation) due to the COW nature
  of it and less smarts than ZFS in that area.
  So development on the Ceph side is not the issue per se.
 
  IMHO BTRFS looks more and more stillborn and with regard to Ceph ZFS
  might become the better choice (in the future), with KV store backends
  being an alternative for some use cases (also far from production
  ready at this time).
 
  Regards,
 
  Christian
  You are right with the round up. I forgot about that.
 
  Thanks for your help. Much appreciated.
  Jiri
 
  - Reply message -
  From: Christian Balzer ch...@gol.com
  To: ceph-us...@ceph.com
  Cc: Jiri Kanicky ji...@ganomi.com
  Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck
  degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun,
  Dec 28, 2014 03:29
 
  Hello,
 
  On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:
 
  Hi,
 
  I just build my CEPH cluster but having problems with the health of
  the cluster.
 
  You're not telling us the version, but it's clearly 0.87 or beyond.
 
  Here are few details:
  - I followed the ceph documentation.
  Outdated, unfortunately.
 
  - I used btrfs filesystem for all OSDs
  Big mistake number 1, do some research (google, ML archives).
  Though not related to to  your problems.
 
  - I did not set osd pool default size = 2  as I thought that if I
  have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
  was right.
  Big mistake, assumption number 2,  replications size by the default
  CRUSH rule is determined by hosts. So that's your main issue here.
  Either set it to 2 or use 3 hosts.
 
  - I noticed that default pools data,metadata were not created. Only
  rbd pool was created.
  See outdated docs above. The majority of use cases is with RBD, so
  since Giant the cephfs pools are not created by default.
 
  - As it was complaining that the pg_num is too low, I increased the
  pg_num for pool rbd to 133 (400/3) and end up with pool rbd pg_num
  133
 pgp_num 64.
 
  Re-read the (in this case correct) documentation.
  It 

Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Jiri Kanicky

Hi Christian,

Thank you for the valuable info. As I will use this cluster mainly at
home for my data and for testing (backups in place), I will continue to use
BTRFS. In production I would go with XFS, as recommended. ZFS, perhaps,
when it becomes officially supported.


BTW, I fixed the HEALTH of my cluster:
1. I set ceph osd pool set rbd size 2
2. I set ceph osd pool set rbd pg_num 256 and ceph osd pool set rbd 
pgp_num 256


5 PGs remained stuck unclean (stuck unclean since forever, current state
active, last acting). I fixed this by restarting ceph with -a; I think the
OSD restart fixed it. I guess there might be a more elegant solution,
but I was not able to figure it out. I tried pg repair, but that didn't
do the trick.


Anyway, it seems to be healthy now :).
cephadmin@ceph1:~$ sudo ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_OK
 monmap e1: 2 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 
10, quorum 0,1 ceph1,ceph2

 osdmap e59: 4 osds: 4 up, 4 in
  pgmap v179: 256 pgs, 1 pools, 0 bytes data, 0 objects
16924 kB used, 11154 GB / 11158 GB avail
 256 active+clean

Thanks for the help!
Jiri

On 28/12/2014 16:59, Christian Balzer wrote:

Hello Jiri,

On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote:


Hi Christian.

Thank you for your comments again. Very helpful.

I will try to fix the current pool and see how it goes. Its good to
learn some troubleshooting skills.


Indeed, knowing what to do when things break is where it's at.


Regarding the BTRFS vs XFS, not sure if the documentation is old. My
decision was based on this:

http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/


It's dated for sure and a bit of wishful thinking on behalf of the Ceph
developers.
Who understandably didn't want to re-invent the wheel inside Ceph when the
underlying file system could provide it (checksums, snapshots, etc).

ZFS has all the features (and much better tested) BTRFS is aspiring to and
if kept below 80% utilization doesn't fragment itself to death.

At the end of that page they mention deduplication, which of course (as I
wrote recently in the use ZFS for OSDs thread) is unlikely to do anything
worthwhile at all.

Simply put, some things _need_ to be done in Ceph to work properly and
can't be delegated to the underlying FS or other storage backend.

Christian


Note

We currently recommend XFS for production deployments. We
recommend btrfs for testing, development, and any non-critical
deployments. *We believe that btrfs has the correct feature set
and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the
necessary stability for today’s deployments. btrfs development is
proceeding rapidly: users should be comfortable installing the latest
released upstream kernels and be able to track development activity for
critical bug fixes.



Thanks
Jiri


On 28/12/2014 16:01, Christian Balzer wrote:

Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:


Hi Christian.

Thank you for your suggestions.

I will set the osd pool default size to 2 as you recommended. As
mentioned the documentation is talking about OSDs, not nodes, so that
must have confused me.


Note that changing this will only affect new pools of course. So to
sort out your current state either start over with this value set
before creating/starting anything or reduce the current size (ceph osd
pool set poolname size).

Have a look at the crushmap example or even better your own, current
one and you will see where by default the host is the failure domain.
Which of course makes a lot of sense.
   

Regarding the BTRFS, i thought that btrfs is better option for the
future providing more features. I know that XFS might be more stable,
but again my impression was that btrfs is the focus for future
development. Is that correct?


I'm not a developer, but if you scour the ML archives you will find a
number of threads about BTRFS (and ZFS).
The biggest issues with BTRFS are not just stability but also the fact
that it degrades rather quickly (fragmentation) due to the COW nature
of it and less smarts than ZFS in that area.
So development on the Ceph side is not the issue per se.

IMHO BTRFS looks more and more stillborn and with regard to Ceph ZFS
might become the better choice (in the future), with KV store backends
being an alternative for some use cases (also far from production
ready at this time).

Regards,

Christian

You are right with the round up. I forgot about that.

Thanks for your help. Much appreciated.
Jiri

- Reply message -
From: Christian Balzer ch...@gol.com
To: ceph-us...@ceph.com
Cc: Jiri Kanicky ji...@ganomi.com
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck
degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun,
Dec 28, 2014 03:29

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:


Hi,

I just build my CEPH cluster but having