[ceph-users] shared rbd ?

2014-12-23 Thread Zeeshan Ali Shah
Is it possible to have a shared RBD, i.e. to build a shared-NFS-like system, but
on Ceph?

-- 

Regards

Zeeshan Ali Shah
System Administrator - PDC HPC
PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah


[ceph-users] Running instances on ceph with openstack

2014-12-23 Thread Zeeshan Ali Shah
Has anyone tried running instances over Ceph, i.e. using Ceph as the backend for
VM storage? How would you get live migration in that case, since every
compute host will have its own RBD? The other option is to have a big RBD
pool on the head node and share it with NFS to have a shared file system.

Any ideas?

-- 

Regards

Zeeshan Ali Shah
System Administrator - PDC HPC
PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah


Re: [ceph-users] Running instances on ceph with openstack

2014-12-23 Thread Nico Schottelius
Hello Ali Shah,

we are running VMs using OpenNebula with ceph as the backend, so far
with varying results: from time to time VMs freeze, probably
panicking when the load on the ceph storage is too high due to rebalance
work.

We are experimenting with --osd-max-backfills 1, but it hasn't solved
the problem completely.
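
For reference, backfill throttling along those lines can be applied to all OSDs
at runtime, roughly as follows (a sketch; option names as spelled in the
Firefly/Giant documentation, values to taste):

  # inject lower backfill/recovery limits into every running OSD
  # (takes effect immediately, does not persist across restarts)
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
  # to make it persistent, put the same options into ceph.conf under [osd]:
  #   osd max backfills = 1
  #   osd recovery max active = 1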

Cheers,

Nico

Zeeshan Ali Shah [Tue, Dec 23, 2014 at 09:12:25AM +0100]:
 Has anyone tried running instances over Ceph, i.e. using Ceph as the backend for
 vm storage? How would you get live migration in that case, since every
 compute host will have its own RBD? The other option is to have a big rbd
 pool on the head node and share it with NFS to have a shared file system.
 
 Any ideas?
 
 -- 
 
 Regards
 
 Zeeshan Ali Shah
 System Administrator - PDC HPC
 PhD researcher (IT security)
 Kungliga Tekniska Hogskolan
 +46 8 790 9115
 http://www.pdc.kth.se/members/zashah



-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24


[ceph-users] Cluster unusable

2014-12-23 Thread Francois Petit


Hi,

We use Ceph 0.80.7 for our IceHouse PoC.
3 MONs, 3 OSD nodes (ids 10,11,12) with 2 OSDs each, 1.5TB of storage,
total.
4 pools for RBD, size=2,  512 PGs per pool

Everything was fine until mid of last week, and here's what happened:
- OSD node #12 passed away
- AFAICR, ceph recovered fine
- I installed a fresh new node #12 (which inadvertently erased its 2
attached OSDs), and used ceph-deploy to make the node and its 2 OSDs join
the cluster
- it was looking okay, except that the weight for the 2 OSDs (osd.0 and
osd.4) was a solid -3.052e-05.
- I applied the workaround from http://tracker.ceph.com/issues/9998 : 'ceph
osd crush reweight' on both OSDs
- ceph was then busy redistributing PGs on the 6 OSDs. This was on Friday
evening
- on Monday morning (yesterday), ceph was still busy. Actually the two new
OSDs were flapping ('map eX wrongly marked me down' messages every minute)
- I found the root cause was the firewall on node #12. I opened tcp ports
6789-6900 and this solved the flapping issue
- ceph kept on reorganising PGs and reached this unhealthy state:
--- 900 PGs stuck unclean
--- some 'requests are blocked > 32 sec'
--- the command 'rbd info images/<image_id>' hung
--- all tested VMs hung
- So I tried this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html,
 and removed the 2 new OSDs
- ceph again started rebalancing data, and things were looking better (VMs
responding, although pretty slowly)
- but at the end, which is the current state, the cluster was back to an
unhealthy state, and our PoC is stuck.


Fortunately, the PoC users are out for Christmas. I'm here until Wed 4pm
UTC+1 and then back on Jan 5. So there are around 30 hours left for solving
this PoC sev1  issue. So I hope that the community can help me find a
solution before Christmas.



Here are the details (actual host and DC names not shown in these outputs).

[root@MON ~]# date;for im in $(rbd ls images);do echo $im;time rbd info
images/$im;done
Tue Dec 23 06:53:15 GMT 2014
0dde9837-3e45-414d-a2c5-902adee0cfe9

no reply for 2 hours, still ongoing...

[root@MON ]# rbd ls images | head -5
0dde9837-3e45-414d-a2c5-902adee0cfe9
2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
3917346f-12b4-46b8-a5a1-04296ea0a826
4bde285b-28db-4bef-99d5-47ce07e2463d
7da30b4c-4547-4b4c-a96e-6a3528e03214
[root@MON ]#

[cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
-rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
[cloud-user@francois-vm2 ~]$ rm /tmp/file

no reply for 1 hour, still ongoing. The RBD image used by that VM is
'volume-2e989ca0-b620-42ca-a16f-e218aea32000'


[root@MON ~]# ceph -s
cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
 health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck
unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s)
set
 monmap e6: 3 mons at
{MON01=10.60.9.11:6789/0,MON06=10.60.9.16:6789/0,MON09=10.60.9.19:6789/0},
 election epoch 1338, quorum 0,1,2 MON01,MON06,MON09
 osdmap e42050: 6 osds: 6 up, 6 in
flags noscrub,nodeep-scrub
  pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
600 GB used, 1031 GB / 1632 GB avail
   2 inactive
2045 active+clean
   1 remapped+peering
  client io 818 B/s wr, 0 op/s

[root@MON ~]# ceph health detail
HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103
requests are blocked > 32 sec; 2 osds have slow requests;
noscrub,nodeep-scrub flag(s) set
pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last
acting [2,1]
pg 5.ae is stuck inactive for 54774.738938, current state inactive, last
acting [2,1]
pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering,
last acting [1,0]
pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last
acting [2,1]
pg 5.ae is stuck unclean for 286227.592617, current state inactive, last
acting [2,1]
pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering,
last acting [1,0]
pg 5.b3 is remapped+peering, acting [1,0]
87 ops are blocked > 67108.9 sec
16 ops are blocked > 33554.4 sec
84 ops are blocked > 67108.9 sec on osd.1
16 ops are blocked > 33554.4 sec on osd.1
3 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
noscrub,nodeep-scrub flag(s) set


[root@MON]# ceph osd tree
# id    weight  type name       up/down reweight
-1  1.08root default
-5  0.54datacenter dc_TWO
-2  0.54host node10
1   0.27osd.1   up  1
5   0.27osd.5   up  1
-4  0   host node12
-6  0.54datacenter dc_ONE
-3  0.54host node11
2   0.27osd.2   up  1
3   0.27osd.3   up  1
0   0   osd.0   up  1
4   0   osd.4   up  1

(I'm concerned about the above two ghost osd.0 and osd.4...)




Re: [ceph-users] shared rbd ?

2014-12-23 Thread Wido den Hollander
On 12/23/2014 09:13 AM, Zeeshan Ali Shah wrote:
 Is it possible to have a shared RBD, i.e. to build a shared-NFS-like system, but
 on Ceph?
 

Yes, you can use OCFS2 or GFS on top of RBD.

But you might also want to look at using CephFS with version 0.90 or the
upcoming Hammer.

In my recent tests I found that CephFS is fairly stable when using an
Active/Standby MDS.

I won't say that it's 100% production ready, but I would suggest you try it.
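
If you do try the CephFS route, mounting it on a client looks roughly like the
following (a sketch; the monitor address and secret file path are placeholders):

  # kernel client mount
  mount -t ceph <mon-host>:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
  # or, with the FUSE client
  ceph-fuse /mnt/cephfs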

 
 
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: [ceph-users] Cluster unusable

2014-12-23 Thread Loic Dachary
Hi François,

Could you paste somewhere the output of ceph report to check the pg dump ? 
(it's probably going to be a little too big for the mailing list). You can 
bring back osd.0 and osd.4 into the host to which they belong (instead of being 
at the root of the crush map) with crush set:

http://ceph.com/docs/master/rados/operations/crush-map/#add-move-an-osd

They won't be used by the ruleset 0 because they are not under the default 
bucket. To make sure this happens automagically, you may consider using 
osd_crush_update_on_start=true :

http://ceph.com/docs/master/rados/operations/crush-map/#ceph-crush-location-hook
http://workbench.dachary.org/ceph/ceph/blob/firefly/src/upstart/ceph-osd.conf#L18
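
Concretely, with the names and weights from the 'ceph osd tree' output above,
the crush set commands would look roughly like this (a sketch; the location
arguments simply mirror where node12 already sits in the tree):

  # put the two ghost OSDs back under their host, with the same weight as the others
  ceph osd crush set osd.0 0.27 root=default datacenter=dc_TWO host=node12
  ceph osd crush set osd.4 0.27 root=default datacenter=dc_TWO host=node12

  # and/or let OSDs place themselves at start time, via ceph.conf:
  #   [osd]
  #   osd crush update on start = true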

Cheers

On 23/12/2014 09:56, Francois Petit wrote:
 Hi,
 
 We use Ceph 0.80.7 for our IceHouse PoC.
 3 MONs, 3 OSD nodes (ids 10,11,12) with 2 OSDs each, 1.5TB of storage, total.
 4 pools for RBD, size=2,  512 PGs per pool
 
 Everything was fine until mid of last week, and here's what happened:
 - OSD node #12 passed away
 - AFAICR, ceph recovered fine
 - I installed a fresh new node #12 (which inadvertently erased its 2 attached 
 OSDs), and used ceph-deploy to make the node and its 2 OSDs join the cluster
 - it was looking okay, except that the weight for the 2 OSDs (osd.0 and 
 osd.4) was a solid -3.052e-05.
 - I applied the workaround from http://tracker.ceph.com/issues/9998 : 'ceph 
 osd crush reweight' on both OSDs
 - ceph was then busy redistributing PGs on the 6 OSDs. This was on Friday 
 evening
 - on Monday morning (yesterday), ceph was still busy. Actually the two new 
 OSDs were flapping ('map eX wrongly marked me down' messages every minute)
 - I found the root cause was the firewall on node #12. I opened tcp ports 
 6789-6900 and this solved the flapping issue
 - ceph kept on reorganising PGs and reached this unhealthy state:
 --- 900 PGs stuck unclean
 --- some 'requests are blocked > 32 sec'
 --- the command 'rbd info images/<image_id>' hung
 --- all tested VMs hung
 - So I tried this: 
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html, 
 and removed the 2 new OSDs
 - ceph again started rebalancing data, and things were looking better (VMs 
 responding, although pretty slowly)
 - but at the end, which is the current state, the cluster was back to an 
 unhealthy state, and our PoC is stuck.
 
 
 Fortunately, the PoC users are out for Christmas. I'm here until Wed 4pm 
 UTC+1 and then back on Jan 5. So there are around 30 hours left for solving 
 this PoC sev1  issue. So I hope that the community can help me find a 
 solution before Christmas.
 
 
 
 Here are the details (actual host and DC names not shown in these outputs).
 
 [root@MON ~]# date;for im in $(rbd ls images);do echo $im;time rbd info 
 images/$im;done
 Tue Dec 23 06:53:15 GMT 2014
 0dde9837-3e45-414d-a2c5-902adee0cfe9
 
 no reply for 2 hours, still ongoing...
 
 [root@MON ]# rbd ls images | head -5
 0dde9837-3e45-414d-a2c5-902adee0cfe9
 2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
 3917346f-12b4-46b8-a5a1-04296ea0a826
 4bde285b-28db-4bef-99d5-47ce07e2463d
 7da30b4c-4547-4b4c-a96e-6a3528e03214
 [root@MON ]#
 
 [cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
 -rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
 [cloud-user@francois-vm2 ~]$ rm /tmp/file
 
 no reply for 1 hour, still ongoing. The RBD image used by that VM is 
 'volume-2e989ca0-b620-42ca-a16f-e218aea32000'
 
 
 [root@MON ~]# ceph -s
 cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
  health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck 
 unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s) set
  monmap e6: 3 mons at 
 {MON01=10.60.9.11:6789/0,MON06=10.60.9.16:6789/0,MON09=10.60.9.19:6789/0},
  election epoch 1338, quorum 0,1,2 MON01,MON06,MON09
  osdmap e42050: 6 osds: 6 up, 6 in
 flags noscrub,nodeep-scrub
   pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
 600 GB used, 1031 GB / 1632 GB avail
2 inactive
 2045 active+clean
1 remapped+peering
   client io 818 B/s wr, 0 op/s
 
 [root@MON ~]# ceph health detail
 HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 
 requests are blocked > 32 sec; 2 osds have slow requests; 
 noscrub,nodeep-scrub flag(s) set
 pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last 
 acting [2,1]
 pg 5.ae is stuck inactive for 54774.738938, current state inactive, last 
 acting [2,1]
 pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering, 
 last acting [1,0]
 pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last 
 acting [2,1]
 pg 5.ae is stuck unclean for 286227.592617, current state inactive, last 
 acting [2,1]
 pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering, 
 last acting [1,0]
 pg 5.b3 is remapped+peering, acting [1,0]
 87 ops are blocked > 67108.9 sec
 16 ops are 

Re: [ceph-users] Cluster unusable

2014-12-23 Thread francois.pe...@san-services.com
Hi Loïc,
 
Thanks.
Am trying to find where I can make the report available to you
[root@qvitblhat06 ~]# ceph report > /tmp/ceph_report
report 3298035134
[root@qvitblhat06 ~]# ls -lh /tmp/ceph_report
-rw-r--r--. 1 root root 4.7M Dec 23 10:38 /tmp/ceph_report
[root@qvitblhat06 ~]#

(Sorry guys for the unwanted ad that was sent in my first email...)
 
Francois


Re: [ceph-users] Cluster unusable

2014-12-23 Thread francois.pe...@san-services.com
Here you go: http://www.filedropper.com/cephreport
 
Francois


Re: [ceph-users] v0.90 released

2014-12-23 Thread René Gallati

Hello,

so I upgraded my cluster from 89 to 90 and now I get:

~# ceph health
HEALTH_WARN too many PGs per OSD (864 > max 300)

That is a new one. I had too few but never too many. Is this a problem 
that needs attention, or ignorable? Or is there even a command now to 
shrink PGs?


The message did not appear before. I currently have 32 OSDs over 8 hosts
and 9 pools, each with 1024 PGs, as that was the recommended number according
to the OSD * 100 / replica formula, rounded to the next power of 2. The
cluster was increased by 4 OSDs on an 8th host only days before. That is
to say, it was at 28 OSDs / 7 hosts / 9 pools, but after extending it with
another host, ceph 0.89 did not complain.


Using the formula again I'd actually need to go to 2048 PGs per pool, but
ceph is telling me to reduce the PG count now?


Kind regards

René


Re: [ceph-users] v0.90 released

2014-12-23 Thread Henrik Korkuc

On 12/23/14 12:57, René Gallati wrote:

Hello,

so I upgraded my cluster from 89 to 90 and now I get:

~# ceph health
HEALTH_WARN too many PGs per OSD (864 > max 300)

That is a new one. I had too few but never too many. Is this a problem 
that needs attention, or ignorable? Or is there even a command now to 
shrink PGs?


The message did not appear before, I currently have 32 OSDs over 8 
hosts and 9 pools, each with 1024 PG as was the recommended number 
according to the OSD * 100 / replica formula, then round to next power 
of 2. The cluster has been increased by 4 OSDs, 8th host only days 
before. That is to say, it was at 28 OSD / 7 hosts / 9 pools but after 
extending it with another host, ceph 89 did not complain.


Using the formula again I'd actually need to go to 2048PGs in pools 
but ceph is telling me to reduce the PG count now?


The formula recommends the PG count for all pools combined, not for each pool. So you need 
about 2048 PGs in total, distributed according to expected pool sizes.


from http://ceph.com/docs/master/rados/operations/placement-groups/:
When using multiple data pools for storing objects, you need to ensure 
that you balance the number of placement groups per pool with the number 
of placement groups per OSD so that you arrive at a reasonable total 
number of placement groups that provides reasonably low variance per OSD 
without taxing system resources or making the peering process too slow.




Kind regards

René


Re: [ceph-users] Running instances on ceph with openstack

2014-12-23 Thread Zeeshan Ali Shah
Thanks Gallati, so that means we do not need to have a shared RBD for live
migration?

On Tue, Dec 23, 2014 at 11:47 AM, René Gallati c...@gallati.net wrote:

 Hello,


 On 23.12.2014 09:12, Zeeshan Ali Shah wrote:

  Has anyone tried running instances over Ceph, i.e. using Ceph as the backend
  for vm storage? How would you get live migration in that case, since
  every compute host will have its own RBD? The other option is to have a
  big rbd pool on the head node and share it with NFS to have a shared file system.


 When you use shared block devices, the compute nodes don't need to
  have a shared file system in OpenStack. All their (runtime) information
  comes from either config files or the controller node/APIs.

  They mount RBDs and they contact each other in the case of a live
  migration, so there is a sort of handover protocol, at least when you use
  libvirt+qemu as the hypervisor. How this is set up is described in

 http://ceph.com/docs/next/rbd/rbd-openstack/
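
 For reference, the compute-node side of that setup boils down to a few
 nova.conf settings, roughly like this (a sketch: section/option names as
 documented for Juno-era nova on the page above - they differ slightly between
 releases - and the pool, user and uuid values are placeholders):

   [libvirt]
   images_type = rbd
   images_rbd_pool = vms
   images_rbd_ceph_conf = /etc/ceph/ceph.conf
   rbd_user = cinder
   rbd_secret_uuid = <libvirt-secret-uuid>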

 Kind regards

 René




-- 

Regards

Zeeshan Ali Shah
System Administrator - PDC HPC
PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah


[ceph-users] Behaviour of a cluster with full OSD(s)

2014-12-23 Thread Max Power
I understand that the 'osd full' status should never be reached. As I am new to
ceph I want to be prepared for this case. I tried two different scenarios and
here are my experiences:

The first one is to completely fill the storage (for me: writing files to a
rados block device). I discovered that the writing client (dd for example) gets
completely stuck then, and this prevents me from stopping the process (SIGTERM,
SIGKILL). At the moment I restart the whole computer to stop writes to the
cluster. Then I unmap the rbd device and set the full ratio a bit higher (0.95
to 0.97). I do a mount on my admin node and delete files until everything is okay
again.
Is this the best practice? Is it possible to prevent the system from running into
an 'osd full' state? I could make the block devices smaller than what the cluster
can hold, but it's hard to calculate this exactly.
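
For reference, the recovery sequence described above maps to roughly these
commands (a sketch; the device name and ratios are examples, and set_full_ratio
is the command name used in this Firefly-era release):

  rbd unmap /dev/rbd0            # after stopping (or rebooting) the stuck client
  ceph pg set_full_ratio 0.97    # temporarily raise the full threshold
  # mount the image elsewhere, delete data, wait for the cluster to drop back
  # below the thresholds, then restore the default:
  ceph pg set_full_ratio 0.95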

The next scenario is to change a pool size from say 2 to 3 replicas. While the
cluster copies the objects it gets stuck as an osd reaches its limit. Normally
the osd process quits then and I cannot restart it (even after setting the
replicas back). The only possibility is to manually delete complete PG folders
after exploring them with 'pg dump'. Is this the only way to get it back working
again?

Greetings!


[ceph-users] Best way to simulate SAN masking/mapping with CEPH

2014-12-23 Thread Florent MONTHEL
Hi Users List,

We have a SAN solution with zoning/masking/mapping to segregate LUN allocation
and avoid security access issues (server srv01 being able to access srv02 LUNs).
I think with Ceph we can only put security on the pool side, right? We can't drill
down to the LUN level with a client security entry like the one below:

client.serv01 mon 'allow r' osd 'allow rwx pool=serv01/lununxprd01'

So what is your recommendation for my use case: 1 pool per server / per
cluster? Is there a limit on the number of pools?
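
What Ceph does offer today is per-pool capabilities, so the closest equivalent
to LUN masking is one pool per server plus a cephx key restricted to that pool,
roughly (a sketch with made-up names):

  ceph osd pool create serv01 128
  ceph auth get-or-create client.serv01 mon 'allow r' osd 'allow rwx pool=serv01'
  # deploy the resulting keyring only on srv01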

Thanks


Florent Monthel







Re: [ceph-users] OSD JOURNAL not associated - ceph-disk list ?

2014-12-23 Thread Florent MONTHEL
Hi Loic,

Hum… I will check. However, the journal symlink to the partition is correctly created
without any action on my side:

journal -> /dev/disk/by-partuuid/36741e5b-eee0-4368-9736-a31701a186a1

But there is no journal_uuid file with ceph-deploy.

Florent Monthel





 On 23 Dec 2014, at 00:51, Loic Dachary l...@dachary.org wrote:
 
 Hi Florent,
 
 On 22/12/2014 19:49, Florent MONTHEL wrote:
 Hi Loic, Hi Robert,
 
 Thanks. I’m integrating CEPH OSD with OpenSVC services 
 (http://www.opensvc.com) so I have to generate UUID myself in order to map 
 services
 It’s the reason for that I’m generating sgdisk commands with my own UUID
 
  After activating the OSD, I don't have the osd/journal mapping with the ceph-disk
  command
 
 root@raven:/var/lib/ceph/osd/ceph-5# ceph-disk list
 /dev/sda other, ext4, mounted on /
 /dev/sdb swap, swap
 /dev/sdc :
 /dev/sdc1 ceph journal
 /dev/sdd :
 /dev/sdd1 ceph data, active, cluster ceph, osd.3
 /dev/sde :
 /dev/sde1 ceph journal
 /dev/sdf :
 /dev/sdf1 ceph data, active, cluster ceph, osd.4
 /dev/sdg :
 /dev/sdg1 ceph journal
 /dev/sdh :
 /dev/sdh1 ceph data, active, cluster ceph, osd.5
 
  After the below command (osd 5), ceph-deploy didn't create the journal_uuid file:
 
 ceph-deploy --overwrite-conf osd create 
 raven:/dev/disk/by-partuuid/6356fd8d-0d84-432a-b9f4-3d02f94afdff:/dev/disk/by-partuuid/36741e5b-eee0-4368-9736-a31701a186a1
 
 root@raven:/var/lib/ceph/osd/ceph-5# ls -l
 total 56
 -rw-r--r--   1 root root  192 Dec 21 23:55 activate.monmap
 -rw-r--r--   1 root root3 Dec 21 23:55 active
 -rw-r--r--   1 root root   37 Dec 21 23:55 ceph_fsid
 drwxr-xr-x 184 root root 8192 Dec 22 19:25 current
 -rw-r--r--   1 root root   37 Dec 21 23:55 fsid
  lrwxrwxrwx   1 root root   58 Dec 21 23:55 journal -> 
  /dev/disk/by-partuuid/36741e5b-eee0-4368-9736-a31701a186a1
 -rw---   1 root root   56 Dec 21 23:55 keyring
 -rw-r--r--   1 root root   21 Dec 21 23:55 magic
 -rw-r--r--   1 root root6 Dec 21 23:55 ready
 -rw-r--r--   1 root root4 Dec 21 23:55 store_version
 -rw-r--r--   1 root root   53 Dec 21 23:55 superblock
 -rw-r--r--   1 root root0 Dec 22 19:24 sysvinit
 -rw-r--r--   1 root root2 Dec 21 23:55 whoami
 
 
  So I created the 'journal_uuid' file manually for each osd, and the mapping became 
  OK with ceph-disk :)
 
  root@raven:/var/lib/ceph/osd/ceph-5# echo 36741e5b-eee0-4368-9736-a31701a186a1 > journal_uuid
 
 I think this is an indication that when you ceph-disk prepare the device the 
 journal_uuid was not provided and therefore the journal_uuid creation was 
 skipped:
 
  http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L1235
  called from
  http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L1338
 
 Cheers
 
 It’s ok now :
 
 root@raven:/var/lib/ceph/osd/ceph-5# ceph-disk list
 /dev/sda other, ext4, mounted on /
 /dev/sdb swap, swap
 /dev/sdc :
 /dev/sdc1 ceph journal, for /dev/sdd1
 /dev/sdd :
 /dev/sdd1 ceph data, active, cluster ceph, osd.3, journal /dev/sdc1
 /dev/sde :
 /dev/sde1 ceph journal, for /dev/sdf1
 /dev/sdf :
 /dev/sdf1 ceph data, active, cluster ceph, osd.4, journal /dev/sde1
 /dev/sdg :
 /dev/sdg1 ceph journal, for /dev/sdh1
 /dev/sdh :
 /dev/sdh1 ceph data, active, cluster ceph, osd.5, journal /dev/sdg1
 
 
  Thanks rob...@leblancnet.us for the clue ;)
 
  Florent Monthel
 
 
 
 
 
  On 21 Dec 2014, at 18:08, Loic Dachary l...@dachary.org wrote:
 
 Hi Florent,
 
 It is unusual to manually run the sgdisk. Is there a reason why you need to 
 do this instead of letting ceph-disk prepare do it for you ?
 
 The information about the association between journal and data is only 
 displayed when the OSD has been activated. See 
 http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L2246 
 Cheers
 
 On 21/12/2014 15:11, Florent MONTHEL wrote:
 Hi,
 
 I would like to separate OSD and journal on 2 différent disks so I have :
 
 1 disk /dev/sde (1GB) for journal = type code JOURNAL_UUID = 
 '45b0969e-9b03-4f30-b4c6-b4b80ceff106'
 1 disk /dev/sdd (5GB) for OSD = type code OSD_UUID = 
 '4fbd7e29-9d25-41b8-afd0-062c0ceff05d'
 
 I execute below commands :
 
 FOR JOURNAL :
  sgdisk --new=1:0:1023M --change-name=1:'ceph journal' 
 --partition-guid=1:e89f18cc-ae46-4573-8bca-3e782d45849c 
 --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 -- /dev/sde
 
 FOR OSD:
  sgdisk --new=1:0:5119M --change-name=1:'ceph data' 
 --partition-guid=1:7476f0a8-a6cd-4224-b64b-a4834c32a73e 
 --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/sdd
 
 And I'm preparing OSD :
 ceph-disk prepare --osd-uuid 7476f0a8-a6cd-4224-b64b-a4834c32a73e 
 --journal-uuid 

Re: [ceph-users] v0.90 released

2014-12-23 Thread René Gallati

Hello,

On 23.12.2014 12:14, Henrik Korkuc wrote:

On 12/23/14 12:57, René Gallati wrote:

Hello,

so I upgraded my cluster from 89 to 90 and now I get:

~# ceph health
HEALTH_WARN too many PGs per OSD (864 > max 300)

That is a new one. I had too few but never too many. Is this a problem
that needs attention, or ignorable? Or is there even a command now to
shrink PGs?

The message did not appear before, I currently have 32 OSDs over 8
hosts and 9 pools, each with 1024 PG as was the recommended number
according to the OSD * 100 / replica formula, then round to next power
of 2. The cluster has been increased by 4 OSDs, 8th host only days
before. That is to say, it was at 28 OSD / 7 hosts / 9 pools but after
extending it with another host, ceph 89 did not complain.

Using the formula again I'd actually need to go to 2048PGs in pools
but ceph is telling me to reduce the PG count now?


formula recommends PG count for all pools, not each pool. So you need
about 2048 PGs total distributed by expected pool size.

from http://ceph.com/docs/master/rados/operations/placement-groups/:
When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the number
of placement groups per OSD so that you arrive at a reasonable total
number of placement groups that provides reasonably low variance per OSD
without taxing system resources or making the peering process too slow.


Ah, I seem to have overlooked this. Lucky for me, I had 5 pools 
exclusively for testing purposes and another that was not in use - 
killing those put me under the complaint threshold.


In this case, it appears Giant 0.90 is the first version that actually complains 
about too many PGs per OSD.


What I don't like that much about this soft limitation is the fact 
that PGs are defined per pool, which means that just adding a new pool 
is not as straightforward as I thought it was. If you are already 
somewhere near the limit, all you can do is make a new pool with a low PG 
count, thus potentially making that pool less well distributed than all 
the pools that came before. But perhaps the overhead incurred with 
higher PG numbers isn't that bad anyway - after all it ran well up until 
now.


Kind regards

René
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] erasure coded pool k=7,m=5

2014-12-23 Thread Stéphane DUGRAVOT
Hi all, 

Soon, we should have a 3-datacenter (dc) ceph cluster with 4 hosts in each dc. 
Each host will have 12 OSDs. 

We can accept the loss of one datacenter plus one host on the remaining 2 
datacenters. 
In order to use an erasure coded pool : 


1. Is a strategy of k = 7, m = 5 acceptable ? 
2. Is it the only one that guarantees us our premise ? 
3. And more generally, is there a formula (based on the number of dcs, hosts 
and OSDs) that allows us to calculate the profile ? 

Thanks. 
Stephane. 

-- 
Université de Lorraine 
Stéphane DUGRAVOT - Direction du numérique - Infrastructure 
Jabber : stephane.dugra...@univ-lorraine.fr 
Tél.: +33 3 83 68 20 98 



Re: [ceph-users] erasure coded pool k=7,m=5

2014-12-23 Thread Loic Dachary
Hi Stéphane,

On 23/12/2014 14:34, Stéphane DUGRAVOT wrote:
 Hi all,
 
 Soon, we should have a 3 datacenters (dc) ceph cluster with 4 hosts in each 
 dc. Each host will have 12 OSD.
 
 We can accept the loss of one datacenter and one host on the remaining 2 
 datacenters.
 In order to use erasure coded pool :
 
  1. Is the solution for a strategy k = 7, m = 5 is acceptable ?

If you want to sustain the loss of one datacenter, k=2,m=1 is what you want, 
with a ruleset that requires that no two shards be in the same datacenter. 
It also sustains the loss of one host within a datacenter: the missing chunk on 
the lost host will be reconstructed using the two other chunks from the two 
other datacenters.
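
A minimal sketch of such a setup, using the Firefly-era option names (the
profile name, pool name and PG count here are just placeholders):

  # one chunk per datacenter: k=2 data chunks + m=1 coding chunk
  ceph osd erasure-code-profile set ec-per-dc k=2 m=1 \
      ruleset-failure-domain=datacenter
  ceph osd erasure-code-profile get ec-per-dc      # check what was stored
  ceph osd pool create ecpool 1024 1024 erasure ec-per-dc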

If, in addition, you want to sustain the loss of one machine while a datacenter 
is down, you would need to use the LRC plugin.

  2. Is this is the only one that guarantees us our premise ?
  3. And more generally, is there a formula (based on the number of dc, host 
 and OSD) that allows us to calculate the profile ?

I don't think there is such a formula.

Cheers

 Thanks.
 Stephane.
 
 -- 
 *Université de Lorraine**/
 /*Stéphane DUGRAVOT - Direction du numérique - Infrastructure
 Jabber : /stephane.dugra...@univ-lorraine.fr/
 Tél.: /+33 3 83 68 20 98/
 
 
 
 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [ceph-users] Behaviour of a cluster with full OSD(s)

2014-12-23 Thread Nico Schottelius
Max, List,

Max Power [Tue, Dec 23, 2014 at 12:34:54PM +0100]:
 [...Recovering from full osd ...] 
 
 Normally
 the osd process quits then and I cannot restart it (even after setting the
 replicas back). The only possibility is to manually delete complete PG folders
 after exploring them with 'pg dump'. Is this the only way to get it back 
 working
 again?

I was wondering if ceph-osd crashing when the disk gets full shouldn't
be considered a bug?

Shouldn't ceph osd be able to recover itself? Like if an admin detects
that the disk is full, she can simply reduce the weight of the osd to
free up space. With a dead osd, this is not possible.

To those having deeper ceph knowledge: 

For what reason does ceph-osd exit when the disk is full?
Why can it not start when it is full to get itself out of this 
invidious situation?

Cheers,

Nico

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24


[ceph-users] Balancing erasure crush rule

2014-12-23 Thread Aaron Bassett
I'm trying to set up an erasure coded pool with k=9 m=6 on 13 osd hosts. I'm 
trying to write a crush rule for this which will balance the placement between hosts as 
much as possible. I understand that since 9+6=15 > 13, I will need to parse 
the tree twice in order to find enough osds. So what I'm trying to do is select 
~1 from each host on the first pass, and then select n more osds to fill it 
out, without using any osds from the first pass, and preferably balancing them 
between racks. 

For starters, I don't know if this is even possible or if its the right 
approach to what I'm trying to do, but heres my attempt:

rule .us-phx.rgw.buckets.ec {
ruleset 1
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take default
step chooseleaf indep 0 type host
step emit
step take default
step chooseleaf indep 0 type rack
step emit
}

This gets me pretty close: the first pass works great and the second pass does 
a nice balance between racks, but in my testing ~6 out of 1000 pgs will end up with 
the same osd twice in their set. I'm guessing I need to get down to one pass to make 
sure that doesn't happen, but I'm having a hard time sorting out how to hit the 
requirement of balancing among hosts *and* allowing for more than one osd per 
host. 
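
One way to check a candidate rule for this without touching the cluster is
crushtool's test mode, roughly (a sketch; the rule number and replica count
match the rule above):

  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 1 --num-rep 15 --show-mappings
  # eyeball (or grep) the emitted sets for OSDs that appear twice;
  # --show-bad-mappings additionally flags mappings that come up short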

Thanks, Aaron 


Re: [ceph-users] v0.90 released

2014-12-23 Thread Sage Weil
On Tue, 23 Dec 2014, Ren? Gallati wrote:
 Hello,
 
 so I upgraded my cluster from 89 to 90 and now I get:
 
 ~# ceph health
 HEALTH_WARN too many PGs per OSD (864 > max 300)
 
 That is a new one. I had too few but never too many. Is this a problem that
 needs attention, or ignorable? Or is there even a command now to shrink PGs?

It's a new warning.

You can't reduce the PG count without creating new (smaller) pools 
and migrating data.  You can ignore the message, though, and make it go 
away by adjusting the 'mon pg warn max per osd' (defaults to 300).  Having 
too many PGs increases the memory utilization and can slow things down 
when adapting to a failure, but certainly isn't fatal.

 The message did not appear before, I currently have 32 OSDs over 8 hosts and 9
 pools, each with 1024 PG as was the recommended number according to the OSD *
 100 / replica formula, then round to next power of 2. The cluster has been
 increased by 4 OSDs, 8th host only days before. That is to say, it was at 28
 OSD / 7 hosts / 9 pools but after extending it with another host, ceph 89 did
 not complain.
 
 Using the formula again I'd actually need to go to 2048PGs in pools but ceph
 is telling me to reduce the PG count now?

The guidance in the docs is (was?) a bit confusing.  You need to take the 
*total* number of PGs and see how many of those per OSD there are, 
not create as many equally-sized pools as you want.  There have been 
several attempts to clarify the language to avoid this misunderstanding 
(you're definitely not the first).  If it's still unclear, suggestions 
welcome!
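
To make that concrete with the numbers from this thread (and assuming the
default pool size of 3, which is what makes the arithmetic line up):

  # 9 pools x 1024 PGs x 3 replicas, spread over 32 OSDs:
  #   9 * 1024 * 3 / 32 = 864 PG copies per OSD   -> the '864 > max 300' warning
  # aiming for ~100 PGs per OSD across *all* pools instead:
  #   32 * 100 / 3 ~= 1067 -> on the order of 1024-2048 PGs total, split between pools
  # the threshold itself is 'mon pg warn max per osd' (default 300), e.g. in ceph.conf:
  #   [mon]
  #   mon pg warn max per osd = 500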

sage



Re: [ceph-users] v0.90 released

2014-12-23 Thread Udo Lembke
Hi Sage,

Am 23.12.2014 15:39, schrieb Sage Weil:
...
 
 You can't reduce the PG count without creating new (smaller) pools 
 and migrating data. 
does this also work with the pool metadata, or is this pool essential
for ceph?

Udo


[ceph-users] RBD pool with unfound objects

2014-12-23 Thread Luke Kao
Hi all,
I have some questions about unfound objects in an rbd pool: what's the real impact 
on the rbd image?

Currently our cluster (running on v0.80.5) has 25 unfound objects due to recent 
OSD crashes, and cannot mark as lost yet (Bug #10405 created for this).
So far it seems we can still mount the rbd image (the filesystem is xfs) but I 
would like to know the real impact:
1. My guess is it is like a bad sector on a real hard disk?
2. Is there any way to identify which files on the RBD disk are impacted?
3. What happens if we mark them as lost using 'ceph pg <pgid> mark_unfound_lost revert' 
(or delete)?
4. Is it better to copy the current rbd image to another new one and use the new one 
instead?

Any suggestion for the current situation is also welcome, given that we need to keep the data 
inside this RBD.
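
For questions 2 and 3, a starting point might look like this (a sketch; <pgid>
is a placeholder, and the note about object names assumes the usual rbd
block-device naming):

  ceph health detail | grep unfound        # which PGs report unfound objects
  ceph pg <pgid> list_missing              # names of the unfound/missing objects
  # rbd block names encode the image prefix and an object number; the object
  # number times the object size (4 MB by default) gives the byte offset inside
  # the image, which filesystem tools can then help map back to files.
  ceph pg <pgid> mark_unfound_lost revert  # or delete, once the loss is accepted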

Thanks in advance,



BR,
Luke






Re: [ceph-users] Balancing erasure crush rule

2014-12-23 Thread Aaron Bassett
After some more work I realized that didn't get me closer at all. It was still 
only selecting 13 osds *and* still occasionally re-selecting the same one. I 
think the multiple emit/takes aren't working like I expect. Given:
 step take default
 step chooseleaf indep 0 type host
 step emit
 step take default
 step chooseleaf indep 0 type host
 step emit
In a rule, I would expect it to try to select ~1 osd per host once, and then 
start over again. Instead, what I'm seeing is it selects ~1 osd per host and 
then when it starts again, it re-selects those same osds, resulting in multiple 
placements on 2 or 3 osds per pg.

It turns out what I'm trying to do is described here:
https://www.mail-archive.com/ceph-users%40lists.ceph.com/msg01076.html
But I can't find any other references to anything like this. 

Thanks, Aaron
 
 On Dec 23, 2014, at 9:23 AM, Aaron Bassett aa...@five3genomics.com wrote:
 
 I’m trying to set up an erasure coded pool with k=9 m=6 on 13 osd hosts. I’m 
 trying to write a crush rule for this which will balance this between hosts 
 as much as possible. I understand that having 9+6=15 > 13, I will need to 
 parse the tree twice in order to find enough pgs. So what I’m trying to do is 
 select ~1 from each host on the first pass, and then select n more osds to 
 fill it out, without using any osds from the first pass, and preferably 
 balancing them between racks. 
 
 For starters, I don't know if this is even possible or if its the right 
 approach to what I'm trying to do, but heres my attempt:
 
 rule .us-phx.rgw.buckets.ec {
ruleset 1
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take default
step chooseleaf indep 0 type host
step emit
step take default
step chooseleaf indep 0 type rack
step emit
 }
 
 This gets me pretty close, the first pass works great and the second pass 
 does a nice balance between racks, but in my testing ~ 6 out of 1000 pgs will 
 have two osds in their group. I'm guessing I need to get down to one pass to 
 make sure that doesn't happen, but I'm having a hard time sorting out how to 
 hit the requirement of balancing among hosts *and* allowing for more than one 
 osd per host. 
 
 Thanks, Aaron



[ceph-users] Online converting of pool type

2014-12-23 Thread Erik Logtenberg
Hi,

Every now and then someone asks if it's possible to convert a pool to a
different type (replicated vs erasure / change the number of PGs /
etc), but this is not supported. The advised approach is usually to just
create a new pool and somehow copy all data manually to this new pool,
removing the old pool afterwards. This is both impractical and very time
consuming.
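
For context, the manual route is usually some variant of the following (a
sketch; note that cppool copies objects but is not snapshot-aware, which is
part of what makes it unattractive):

  ceph osd pool create new-pool 512
  rados cppool old-pool new-pool       # bulk object copy
  # repoint clients (or rename the pools) and delete the old pool afterwards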

Recently I saw someone on this list suggest that the cache tiering
feature may actually be used to achieve some form of online converting
of pool types. Today I ran some tests and I would like to share my results.

I started out with a pool test-A, created an rbd image in the pool,
mapped it, created a filesystem in the rbd image, mounted the fs and
placed some test files in the fs. Just to have some objects in the
test-A pool.

I then added a test-B pool and transferred the data using cache tiering
as follows:

Step 0: We have a test-A pool and it contains data, some of which is in use.
# rados -p test-A df
test-A  -   9941   110
  0   0  324 2404   57 4717

Step 1: Create new pool test-B
# ceph osd pool create test-B 32
pool 'test-B' created

Step 2: Make pool test-A a cache pool for test-B.
# ceph osd tier add test-B test-A --force-nonempty
# ceph osd tier cache-mode test-A forward

Step 3: Move data from test-A to test-B (this potentially takes long)
# rados -p test-A cache-flush-evict-all
This step will move all data except the objects that are in active use,
so we are left with some remaining data on test-A pool.

Step 4: Move also the remaining data. This is the only step that doesn't
work online.
Step 4a: Disconnect all clients
# rbd unmap /dev/rbd/test-A/test-rbd   (in my case)
Step 4b: Move remaining objects
# rados -p test-A cache-flush-evict-all
# rados -p test-A ls  (should now be empty)

Step 5: Remove test-A as cache pool
# ceph osd tier remove test-B test-A

Step 6: Clients are allowed to connect with test-B pool (we are back in
online mode)
# rbd map test-B/test-rbd  (in my case)

Step 7: Remove the now empty pool test-A
# ceph osd pool delete test-A test-A --yes-i-really-really-mean-it


This worked smoothly. In my first try I actually used more steps, by creatig


Re: [ceph-users] Online converting of pool type

2014-12-23 Thread Erik Logtenberg
Whoops, I accidentally sent my mail before it was finished. Anyway, I have
some more testing to do, especially with converting between
erasure/replicated pools. But it looks promising.

Thanks,

Erik.


On 23-12-14 16:57, Erik Logtenberg wrote:
 Hi,
 
 Every now and then someone asks if it's possible to convert a pool to a
 different type (replicated vs erasure / change the amount of pg's /
 etc), but this is not supported. The advised approach is usually to just
 create a new pool and somehow copy all data manually to this new pool,
 removing the old pool afterwards. This is both unpractical and very time
 consuming.
 
 Recently I saw someone on this list suggest that the cache tiering
 feature may actually be used to achieve some form of online converting
 of pool types. Today I ran some tests and I would like to share my results.
 
 I started out with a pool test-A, created an rbd image in the pool,
 mapped it, created a filesystem in the rbd image, mounted the fs and
 placed some test files in the fs. Just to have some objects in the
 test-A pool.
 
 I then added a test-B pool and transferred the data using cache tiering
 as follows:
 
 Step 0: We have a test-A pool and it contains data, some of which is in use.
 # rados -p test-A df
 test-A  -   9941   110
   0   0  324 2404   57 4717
 
 Step 1: Create new pool test-B
 # ceph osd pool create test-B 32
 pool 'test-B' created
 
 Step 2: Make pool test-A a cache pool for test-B.
 # ceph osd tier add test-B test-A --force-nonempty
 # ceph osd tier cache-mode test-A forward
 
 Step 3: Move data from test-A to test-B (this potentially takes long)
 # rados -p test-A cache-flush-evict-all
 This step will move all data except the objects that are in active use,
 so we are left with some remaining data on test-A pool.
 
 Step 4: Move also the remaining data. This is the only step that doesn't
 work online.
 Step 4a: Disconnect all clients
 # rbd unmap /dev/rbd/test-A/test-rbd   (in my case)
 Step 4b: Move remaining objects
 # rados -p test-A cache-flush-evict-all
 # rados -p test-A ls  (should now be empty)
 
 Step 5: Remove test-A as cache pool
 # ceph osd tier remove test-B test-A
 
 Step 6: Clients are allowed to connect with test-B pool (we are back in
 online mode)
 # rbd map test-B/test-rbd  (in my case)
 
 Step 7: Remove the now empty pool test-A
 # ceph osd pool delete test-A test-A --yes-i-really-really-mean-it
 
 
 This worked smoothly. In my first try I actually used more steps, by creatig


Re: [ceph-users] Need help from Ceph experts

2014-12-23 Thread Robert LeBlanc
If your intent is to learn Ceph, then I suggest that you set up three or
four VMs to learn how all the components work together. Then you will know
better how to put different components together and you can decide which
combination works better for you. I don't like any of those components in
the same OS because they can interfere with each other pretty badly. Putting
them in VMs gets around some of the possible deadlocks, but then there is
usually not enough disk IO.

That is my $0.02.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Dec 23, 2014 6:12 AM, Debashish Das deba@gmail.com wrote:

 Hi,

 Thanks for the replies, I have some more queries now :-)


 1. I have one 64-bit physical server (4 GB RAM, quad-core, 250 GB HDD) and
 one VM (not a high-end one).

  I want to install ceph-mon, ceph-osd and Ceph RBD (Rados Block Device).
 Can you please tell me if it is possible to install only ceph-mon and Ceph
 RBD in the VM, and ceph-osd on the physical machine?
 Or do you have any other idea how to proceed with my current hardware
 resources?

 Please also let me know any reference links which I can refer for this
 kind of installation.

 I am not sure which component (mon/osd/RBD) should I install in which
 setup ( VM/Physical Server).

 Your expert opinion would be of great help for me.

 Thank You.

 Kind Regards
 Debashish Das




 On Sat, Dec 20, 2014 at 12:00 AM, Craig Lewis cle...@centraldesktop.com
 wrote:

 I've done single nodes.  I have a couple VMs for RadosGW Federation
 testing.  It has a single virtual network, with both clusters on the same
 network.

 Because I'm only using a single OSD on a single host, I had to update the
 crushmap to handle that.  My Chef recipe runs:
 ceph osd getcrushmap -o /tmp/compiled-crushmap.old

 crushtool -d /tmp/compiled-crushmap.old -o /tmp/decompiled-crushmap.old

  sed -e '/step chooseleaf firstn 0 type/s/host/osd/'
  /tmp/decompiled-crushmap.old > /tmp/decompiled-crushmap.new

 crushtool -c /tmp/decompiled-crushmap.new -o /tmp/compiled-crushmap.new

 ceph osd setcrushmap -i /tmp/compiled-crushmap.new


 Those are the only extra commands I run for a single node cluster.
 Otherwise, it looks the same as my production nodes that run mon, osd, and
 rgw.


 Here's my single node's ceph.conf:
 [global]
   fsid = a7798848-1d31-421b-8f3c-5a34d60f6579
   mon initial members = test0-ceph0
   mon host = 172.16.205.143:6789
   auth client required = none
   auth cluster required = none
   auth service required = none
   mon warn on legacy crush tunables = false
   osd crush chooseleaf type = 0
   osd pool default flag hashpspool = true
   osd pool default min size = 1
   osd pool default size = 1
   public network = 172.16.205.0/24

 [osd]
   osd journal size = 1000
   osd mkfs options xfs = -s size=4096
   osd mkfs type = xfs
   osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64
   osd_scrub_sleep = 1.0
   osd_snap_trim_sleep = 1.0



 [client.radosgw.test0-ceph0]
   host = test0-ceph0
   rgw socket path = /var/run/ceph/radosgw.test0-ceph0
   keyring = /etc/ceph/ceph.client.radosgw.test0-ceph0.keyring
   log file = /var/log/ceph/radosgw.log
   admin socket = /var/run/ceph/radosgw.asok
   rgw dns name = test0-ceph
   rgw region = us
   rgw region root pool = .us.rgw.root
   rgw zone = us-west
   rgw zone root pool = .us-west.rgw.root



 On Thu, Dec 18, 2014 at 11:23 PM, Debashish Das deba@gmail.com
 wrote:

 Hi Team,

  Thanks for the insight and the replies, as I understood from the mails -
 running Ceph cluster in a single node is possible but definitely not
 recommended.

 The challenge which i see is there is no clear documentation for single
 node installation.

 So I would request if anyone has installed Ceph in single node, please
 share the link or document which i can refer to install Ceph in my local
 server.

 Again thanks guys !!

 Kind Regards
 Debashish Das

 On Fri, Dec 19, 2014 at 6:08 AM, Robert LeBlanc rob...@leblancnet.us
 wrote:

 Thanks, I'll look into these.

 On Thu, Dec 18, 2014 at 5:12 PM, Craig Lewis cle...@centraldesktop.com
  wrote:

 I think this is it:
 https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939

 You can also check out a presentation on Cern's Ceph cluster:
 http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern


 At large scale, the biggest problem will likely be network I/O on the
 inter-switch links.



 On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us
 wrote:

 I'm interested to know if there is a reference to this reference
 architecture. It would help alleviate some of the fears we have about
 scaling this thing to a massive scale (10,000's OSDs).

 Thanks,
 Robert LeBlanc

 On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis 
 cle...@centraldesktop.com wrote:



 On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry 
 patr...@inktank.com wrote:


  2. What should be the minimum hardware requirement of the server
 (CPU,
  Memory, NIC etc)

 There is no real 

Re: [ceph-users] Cluster unusable

2014-12-23 Thread francois.pe...@san-services.com
Hi,


I got a recommendation from Stephan to restart the OSDs one by one.
So I did it. It helped a bit (some IOs completed), but in the end the state was
the same as before, and new IOs still hung.

Loïc, thanks for the advice on moving back the osd.0 and osd.4 into the game.
 
Actually this was done by simply restarting ceph on that node:
[root@qvitblhat12 ~]# date;service ceph status
Tue Dec 23 14:36:11 UTC 2014
=== osd.0 ===
osd.0: running {version:0.80.7}
=== osd.4 ===
osd.4: running {version:0.80.7}
[root@qvitblhat12 ~]# date;service ceph restart
Tue Dec 23 14:36:17 UTC 2014
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on qvitblhat12...kill 4527...kill 4527...done
=== osd.0 ===
create-or-move updating item name 'osd.0' weight 0.27 at location
{host=qvitblhat12,root=default} to crush map
Starting Ceph osd.0 on qvitblhat12...
Running as unit run-4398.service.
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on qvitblhat12...kill 5375...done
=== osd.4 ===
create-or-move updating item name 'osd.4' weight 0.27 at location
{host=qvitblhat12,root=default} to crush map
Starting Ceph osd.4 on qvitblhat12...
Running as unit run-4720.service.

[root@qvitblhat06 ~]# ceph osd tree
# id    weight  type name               up/down reweight
-1      1.62    root default
-5      1.08            datacenter dc_XAT
-2      0.54                    host qvitblhat10
1       0.27                            osd.1   up      1
5       0.27                            osd.5   up      1
-4      0.54                    host qvitblhat12
0       0.27                            osd.0   up      1
4       0.27                            osd.4   up      1
-6      0.54            datacenter dc_QVI
-3      0.54                    host qvitblhat11
2       0.27                            osd.2   up      1
3       0.27                            osd.3   up      1
[root@qvitblhat06 ~]#

This change made ceph rebalance data, and then the miracle happened: all PGs ended
up as active+clean.

[root@qvitblhat06 ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set
noscrub,nodeep-scrub flag(s) set

Well, apart from being happy that the cluster is now healthy, I find it a little
bit scary to have to shake it in one direction and another
and hope that it will eventually recover, while in the meantime my users' IOs are
stuck...

So is there a way to understand what happened ?

Francois


Re: [ceph-users] Behaviour of a cluster with full OSD(s)

2014-12-23 Thread Craig Lewis
On Tue, Dec 23, 2014 at 3:34 AM, Max Power 
mailli...@ferienwohnung-altenbeken.de wrote:

 I understand that the status osd full should never be reached. As I am
 new to
 ceph I want to be prepared for this case. I tried two different scenarios
 and
 here are my experiences:


For a real cluster, you should be monitoring your cluster, and taking
immediate action once you get an OSD in nearfull state.  Waiting until OSDs
are toofull is too late.

For a test cluster, it's a great learning experience. :-)



 The first one is to completely fill the storage (for me: writing files to a
 rados blockdevice). I discovered that the writing client (dd for example)
 gets
 completly stucked then. And this prevents me from stoping the process
 (SIGTERM,
 SIGKILL). At the moment I restart the whole computer to prevent writing to
 the
 cluster. Then I unmap the rbd device and set the full ratio a bit higher
 (0.95
 to 0.97). I do a mount on my adminnode and delete files till everything is
 okay
 again.
 Is this the best practice?


It is a design feature of Ceph that all cluster reads and writes stop until
the toofull situation is resolved.

The route you took is one of two ways to recover.  The other route you
found in your replica test.



 Is it possible to prevent the system from running in
 a osd full state? I could make the block devices smaller than the
 cluster can
 save. But it's hard to calculate this exactly.


If you continue to add data to the cluster after it's nearfull, then you're
going to hit toofull.
Once you hit nearfull, you need to delete existing data, or add more OSDs.

You've probably noticed that some OSDs are using more space than others.
You can try to even them out with `ceph osd reweight` or `ceph osd crush
reweight`, but that's a delaying tactic.  When I hit nearfull, I place an
order for new hardware, then use `ceph osd reweight` until it arrives.
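
The two reweight commands differ in scope; roughly (the OSD id and weights here
are just examples):

  ceph osd reweight 12 0.85            # 0-1 override: temporary, resets to 1 if
                                       #   the OSD is marked out and back in
  ceph osd crush reweight osd.12 1.2   # changes the CRUSH weight itself, persistent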



 The next scenario is to change a pool size from say 2 to 3 replicas. While
 the
 cluster copies the objects it gets stuck as an osd reaches it limit.
 Normally
 the osd process quits then and I cannot restart it (even after setting the
 replicas back). The only possibility is to manually delete complete PG
 folders
 after exploring them with 'pg dump'. Is this the only way to get it back
 working
 again?


There are some other configs that might have come into play here.  You
might have run into osd_failsafe_nearfull_ratio
or osd_failsafe_full_ratio.  You could try bumping those up a bit, and see
if that lets the process stay up long enough to start reducing replicas.

Since osd_failsafe_full_ratio is already 0.97, I wouldn't take it any
higher than 0.98.  Ceph triggers on greater-than percentages, so 0.99
will let you fill a disk to 100% full.  If you get a disk to 100% full, the
only way to cleanup is to start deleting PG directories.


Re: [ceph-users] Any Good Ceph Web Interfaces?

2014-12-23 Thread Craig Lewis
Are you asking because you want to manage a Ceph cluster point and click?
Or do you need some shiny to show the boss?


I'm using a combination of Chef and Zabbix.  I'm not running RHEL though,
but I would assume those are available in the repos.

It's not as slick as Calamari, and it really doesn't give me a whole
cluster view.  Ganglia did a better job of that, but I went with Zabbix
for the graphing and alerting in a single product.


If you're looking for some shiny for the boss, Zabbix's web interface
should work fine.

If you're looking for a point and click way to build a Ceph cluster, I
think Calamari is your only option.



On Mon, Dec 22, 2014 at 4:11 PM, Tony unix...@gmail.com wrote:

 Please don't mention calamari :-)

 The best web interface for ceph that actually works with RHEL6.6

 Preferable something in repo and controls and monitors all other ceph osd,
 mon, etc.


 Take everything and live for the moment.





Re: [ceph-users] Any Good Ceph Web Interfaces?

2014-12-23 Thread Luis Periquito
Again it depends on what you want to do. I started to evaluate VSM - it's
from Intel, and it's what Fujitsu uses in the Eternus CD1 - but it
didn't work for me.

https://01.org/virtual-storage-manager

It didn't work for me because it wants to completely manage the whole
cluster, starting from scratch - I have puppet; and it's targeted at the
CentOS crowd - I use Ubuntu.


On Tue, Dec 23, 2014 at 8:05 PM, Craig Lewis cle...@centraldesktop.com
wrote:

 Are you asking because you want to manage a Ceph cluster point and click?
 Or do you need some shiny to show the boss?


 I'm using a combination of Chef and Zabbix.  I'm not running RHEL though,
 but I would assume those are available in the repos.

 It's not as slick as Calamari, and it really doesn't give me a whole
 cluster view.  Ganglia did a better job of that, but I went with Zabbix
 for the graphing and alerting in a single product.


 If you're looking for some shiny for the boss, Zabbix's web interface
 should work fine.

 If you're looking for a point and click way to build a Ceph cluster, I
 think Calamari is your only option.



 On Mon, Dec 22, 2014 at 4:11 PM, Tony unix...@gmail.com wrote:

 Please don't mention calamari :-)

 The best web interface for ceph that actually works with RHEL6.6

 Preferable something in repo and controls and monitors all other ceph
 osd, mon, etc.


 Take everything and live for the moment.





[ceph-users] Archives haven't been updated since Dec 8?

2014-12-23 Thread Christopher Armstrong
I was trying to link a colleague to a message on the mailing list, and
noticed the archives haven't been rebuilt since Dec 8:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/

Did something break there?


Re: [ceph-users] Ceph on ArmHF Ubuntu 14.4LTS?

2014-12-23 Thread Philip Williams
Hi Chris,

Would you care to name the vendor and hw config?  I.e. x Arm Cores to y 
Disks/SSDs?

Thanks

--phil

 On 23 Dec 2014, at 07:10, Christopher Kunz chrisl...@de-punkt.de wrote:
 
 Am 22.12.14 um 16:10 schrieb Gregory Farnum:
 On Sun, Dec 21, 2014 at 11:54 PM, Christopher Kunz
 chrisl...@de-punkt.de wrote:
 Hi all,
 
 I'm trying to get a working PoC installation of Ceph done on an armhf
 platform. I'm failing to find working Ceph packages (so does
 ceph-deploy, too) for Ubuntu Trusty LTS. The ceph.com repos don't have
 anything besides ceph-deploy and radosgw-agent, and there are no
 packages in the ubuntu repos, either.
 
 What am I missing here?
 
 I don't believe we build arm packages upstream right now. Debian does,
 but I'm not sure about Ubuntu.
 
 We have done so in the past on a dev level (never official release
 packages), so if this is something you're interested in it should be
 pretty simple to home-brew them. :)
 -Greg
 Hi,
 
 in fact there seem to be packages in some openstack repo - I received a
 repository list from the arm server vendor (who happens to advertise
 Ceph compatibility, so kind of has to deliver :) ) and am now running a
 Giant cluster on 6 ARMv7 nodes.
 
 The performance is... uh, let's say, interesting ;)
 
 Thanks anyway!
 
 Regards,
 
 --ck


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-23 Thread Sean Sullivan
I am trying to understand these drive throttle markers that were
mentioned to get an idea of why these drives are marked as slow.::

here is the iostat of the drive /dev/sdbm
http://paste.ubuntu.com/9607168/
 
an IO wait of .79 doesn't seem bad but a write wait of 21.52 seems
really high.  Looking at the ops in flight::
http://paste.ubuntu.com/9607253/


If we check against all of the osds on this node, this seems strange::
http://paste.ubuntu.com/9607331/

I do not understand why this node has ops in flight while the
remainder seem to be performing without issue. The load on the node is
pretty light as well, with an average CPU at 16 and an average iowait of
.79::

---
/var/run/ceph# iostat -xm /dev/sdbm
Linux 3.13.0-40-generic (kh10-4) 12/23/2014 _x86_64_(40 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   3.940.00   23.300.790.00   71.97

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdbm  0.09 0.255.033.42 0.55 0.63  
288.02 0.09   10.562.55   22.32   2.54   2.15
---

I am still trying to understand the osd throttle perfdump so if anyone
can help shed some light on this that would be rad. From what I can tell
from the perfdump 4 osds (the last one, 228, being the slow one
currently). I ended up pulling .228 from the cluster and I have yet to
see another slow/blocked osd in the output of ceph -s. It is still
rebuilding as I just pulled .228 out but I am still getting at least
200MB/s via bonnie while the rebuild is occurring.

Finally, if this helps anyone: although one 1 GB upload takes around 2.0 - 2.5
minutes, if we split a 10 GB file into 100 x 100 MB parts we get a completion time
of about 1 minute. That would be a 10 GB file in about 1-1.5 minutes, or
166.66 MB/s versus the 8 MB/s I was getting before with sequential
uploads. All of these are coming from a single client via boto. This
leads me to think that this is a radosgw issue specifically.

This again makes me think that this is not a slow disk issue but an
overall radosgw issue. If this were structural in any way I would think
that all of rados/ceph's faculties would be hit, and the 8 MB/s limit per
client would be due to client throttling because a ceiling was being hit. As
it turns out I am not hitting the ceiling, but some other aspect of
radosgw or boto is limiting my throughput. Is this logic not correct? I
feel like I am missing something.

Thanks for the help everyone!




Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-23 Thread Andrew Cowie
On Mon, 2014-12-22 at 15:26 -0800, Craig Lewis wrote:

 My problems were memory pressure plus an XFS bug, so it took a while
 to manifest. 


The following (long, ongoing) thread on linux-mm discusses our [severe]
problems with memory pressure taking out entire OSD servers. The
upstream problems are still unresolved as at Linux 3.18, but anyone
running Ceph on XFS over especially Infiniband or *anything* that does
custom allocation in the kernel should probably be aware of this.
http://marc.info/?l=linux-mm&m=141605213522925&w=2

AfC
Sydney


-- 
Andrew Frederick Cowie
Head of Engineering
Anchor Systems

 afcowie   anchor   hosting


Re: [ceph-users] Any Good Ceph Web Interfaces?

2014-12-23 Thread Udo Lembke
Hi,
for monitoring only I use the Ceph Dashboard
https://github.com/Crapworks/ceph-dash/

For me it's a nice tool for a good overview - for administration I use
the CLI.


Udo

On 23.12.2014 01:11, Tony wrote:
 Please don't mention calamari :-)

 The best web interface for ceph that actually works with RHEL6.6 

 Preferable something in repo and controls and monitors all other ceph
 osd, mon, etc.


 Take everything and live for the moment.



