Re: [ceph-users] Antwort: Re: Replication between 2 datacenter

2013-06-26 Thread Wolfgang Hennerbichler
Also be aware that, due to the way monitors work (you need an odd number of
them), if the datacenter holding the majority of the monitors loses power,
you can't access your backup data either (you can after fiddling with the
monmap, but it doesn't fail over automatically).

Configuration of the crushmap is described very well in the documentation:
http://ceph.com/docs/master/rados/operations/crush-map/
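
For two replicas in each of two datacenters, the rule would look roughly like
this (an untested sketch; it assumes a root bucket named "default", your hosts
grouped under two buckets of type "datacenter", and a pool size of 4, e.g. set
with "ceph osd pool set <pool> size 4"):

rule two_dc {
        ruleset 1
        type replicated
        min_size 4
        max_size 4
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

The usual getcrushmap / crushtool decompile, edit, compile / setcrushmap cycle
applies; it's all described in the docs above.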

wogri

From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com] on
behalf of joachim.t...@gad.de [joachim.t...@gad.de]
Sent: Thursday, 27 June 2013 07:12
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Antwort: Re: Replication between 2 datacenter

Hi,

Yes, exactly. Synchronous replication is OK. The distance between the
datacenters is only 15 km.

How do I configure this in the crushmap?

Best regards

Joachim



From: Sage Weil
To: joachim.t...@gad.de
Cc: ceph-users@lists.ceph.com
Date: 25.06.2013 17:39
Subject: Re: [ceph-users] Replication between 2 datacenter




On Tue, 25 Jun 2013, joachim.t...@gad.de wrote:
> Hi folks,
>
> I have a question concerning data replication using the crushmap.
>
> Is it possible to write a crushmap to achieve a 2-times-2 replication, in the
> way that I have a pool replication in one data center and an overall
> replication of this in the backup datacenter?

Do you mean 2 replicas in datacenter A, and 2 more replicas in datacenter
B?

Short answer: yes, but replication is synchronous, so it will generally
only work well if the latency is low between the two sites.

sage


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Antwort: Re: Replication between 2 datacenter

2013-06-26 Thread Joachim . Tork
Hi,

Yes, exactly. Synchronous replication is OK. The distance between the
datacenters is only 15 km.

How do I configure this in the crushmap?

Best regards

Joachim



From: Sage Weil
To: joachim.t...@gad.de
Cc: ceph-users@lists.ceph.com
Date: 25.06.2013 17:39
Subject: Re: [ceph-users] Replication between 2 datacenter



On Tue, 25 Jun 2013, joachim.t...@gad.de wrote:
> Hi folks,
>
> I have a question concerning data replication using the crushmap.
>
> Is it possible to write a crushmap to achieve a 2-times-2 replication, in the
> way that I have a pool replication in one data center and an overall
> replication of this in the backup datacenter?

Do you mean 2 replicas in datacenter A, and 2 more replicas in datacenter 
B?

Short answer: yes, but replication is synchronous, so it will generally 
only work well if the latency is low between the two sites.

sage


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack Multi-rbd storage backend

2013-06-26 Thread Josh Durgin

On 06/21/2013 09:48 AM, w sun wrote:

Josh & Sebastien,

Does either of you have any comments on this cephx issue with multi-rbd
backend pools?

Thx. --weiguo


From: ws...@hotmail.com
To: ceph-users@lists.ceph.com
Date: Thu, 20 Jun 2013 17:58:34 +
Subject: [ceph-users] Openstack Multi-rbd storage backend

Has anyone seen the same issue as below?

We are trying to test the multi-backend feature with two RBD pools on the
Grizzly release. At this point, it seems that rbd.py does not take
separate cephx users for the two RBD pools for authentication as it
defaults to the single ID defined in /etc/init/cinder-volume.conf, which
is documented here with "env CEPH_ARGS="--id volume"

http://ceph.com/docs/master/rbd/rbd-openstack/#configuring-cinder-nova-volume

It seems to us that rbd.py is ignoring the separate "rbd_user="
configuration for each storage backend section,


In Grizzly, this option is only used to tell nova which user to connect
as. cinder-volume requires CEPH_ARGS="--id user" to set the ceph user
you want it to use. This has changed in Havana, where the rbd_user
option is used by Cinder as well, but for Grizzly you'll need to set
the CEPH_ARGS environment variable differently if you want
different users for each backend.
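
Concretely, that means something like one cinder-volume service per cephx
user on Grizzly, each with its own CEPH_ARGS in its upstart config, e.g.
(the id below is just the backend user taken from your config):

    # upstart conf for the cinder-volume instance serving that backend
    env CEPH_ARGS="--id stack-mgmt-openstack-volumes-2"

On Havana the rbd_user= line in each backend section is enough.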

Josh


[svl-stack-mgmt-openstack-volumes-2]
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=stack-mgmt-openstack-volumes-2
rbd_user=stack-mgmt-openstack-volumes-2
rbd_secret_uuid=e1124cad-55e8-d4ce-6c68-5f40491b15ef
volume_backend_name=RBD_CINDER_VOLUMES_3

Here is the error from cinder-volume.log,

-
   File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py",
line 144, in delete_volume
 volume['name'])
   File "/usr/lib/python2.7/dist-packages/cinder/utils.py", line 190, in
execute
 cmd=' '.join(cmd))
ProcessExecutionError: Unexpected error while running command.
Command: rbd snap ls --pool svl-stack-mgmt-openstack-volumes-2
volume-9f1735ae-b31f-4cd5-a279-f879692839c3
Exit code: 1
Stdout: ''
Stderr: 'rbd: error opening image
volume-9f1735ae-b31f-4cd5-a279-f879692839c3: (1) Operation not
permitted\n2013-06-20 10:41:46.591363 7f68117a9780 -1 librbd::ImageCtx:
error finding header: (1) Operation not permitted\n'
---


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Empty osd and crushmap after mon restart?

2013-06-26 Thread Gregory Farnum
On Wed, Jun 26, 2013 at 1:37 PM, Wido den Hollander  wrote:
> On 06/26/2013 06:54 PM, Gregory Farnum wrote:
>>
>> On Wed, Jun 26, 2013 at 12:24 AM, Wido den Hollander 
>> wrote:
>>>
>>> On 06/26/2013 01:18 AM, Gregory Farnum wrote:


 Some guesses are inline.

 On Tue, Jun 25, 2013 at 4:06 PM, Wido den Hollander 
 wrote:
>
>
> Hi,
>
> I'm not sure what happened, but on a Ceph cluster I noticed that the
> monitors (running 0.61) started filling up the disks, so they were
> restarted
> with:
>
> mon compact on start = true
>
> After a restart the osdmap was empty, it showed:
>
>  osdmap e2: 0 osds: 0 up, 0 in
>   pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB
> data,
> 243
> GB used, 66789 GB / 67032 GB avail
>  mdsmap e1: 0/0/1 up
>
> This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone.
>
> I also checked the crushmap, all 36 OSDs were removed, no trace of
> them.



 As you guess, this is probably because the disks filled up. It
 shouldn't be able to happen but we found an edge case where leveldb
 falls apart; there's a fix for it in the repository now (asserting
 that we get back what we just wrote) that Sage can talk more about.
 Probably both disappeared because the monitor got nothing back when
 reading in the newest OSD Map, and so it's all empty.

>>>
>>> Sounds reasonable and logical.
>>>
>>>
> "ceph auth list" still showed their keys though.
>
> Restarting the OSDs didn't help, since create-or-move complained that
> the
> OSDs didn't exist and didn't do anything. I ran "ceph osd create" to
> get
> the
> 36 OSDs created again, but when the OSDs boot they never start working.
>
> The only thing they log is:
>
> 2013-06-26 01:00:08.852410 7f17f3f16700  0 -- 0.0.0.0:6801/4767 >>
> 10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0
> l=0).fault with nothing to send, going to standby



 Are they going up and just sitting idle? This is probably because none
 of their peers are telling them to be responsible for any placement
 groups on startup.

>>>
>>> No, they never come up. So checking the monitor logs I only see the
>>> create-or-move command changing their crush position, but they never mark
>>> themselves as "up", so all the OSDs stay down.
>>>
>>> netstat however shows a connection with the monitor between the OSD and
>>> the
>>> Mon, but nothing special in the logs at lower debugging.
>>
>>
>> So the process is still running? Can you generate full logs with debug
>> ms = 5, debug osd = 20, debug monc = 20?
>>
>
> I've done so with 4 OSDs and I uploaded the logs of one OSD:
>
> root@data1:~# sftp cephd...@ceph.com
> cephd...@ceph.com's password:
> Connected to ceph.com.
> sftp> put ceph-osd-0-widodh-empty-osdmap.log.gz
> Uploading ceph-osd-0-widodh-empty-osdmap.log.gz to
> /home/cephdrop/ceph-osd-0-widodh-empty-osdmap.log.gz
> ceph-osd-0-widodh-empty-osdmap.log.gz
> 100%   14MB   3.5MB/s   00:04
> sftp>
>
> My internet here is too slow to go through the logs and I haven't checked
> them yet.

Okay, looks like the OSD is sending a boot message and the monitor is
never completing the boot process. This is probably because of the
lost OSDMaps (it also seems to think that it has maps 0-53, and I
admit I'm a little confused about how that could have happened). So
you'll need to unstick the monitors, which I'm afraid will be a bit of
an adventure.

Yes, you should be able to extract the maps from the OSDs and put them
in the monitor, but I'm not sure exactly how that will work — you
probably want to come in to irc and get somebody to walk you through
manipulating the OSD store; I think with the monstore-tool there's a
put but you or somebody might need to add that. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Empty osd and crushmap after mon restart?

2013-06-26 Thread Wido den Hollander

On 06/26/2013 10:37 PM, Wido den Hollander wrote:

On 06/26/2013 06:54 PM, Gregory Farnum wrote:

On Wed, Jun 26, 2013 at 12:24 AM, Wido den Hollander 
wrote:

On 06/26/2013 01:18 AM, Gregory Farnum wrote:


Some guesses are inline.

On Tue, Jun 25, 2013 at 4:06 PM, Wido den Hollander 
wrote:


Hi,

I'm not sure what happened, but on a Ceph cluster I noticed that the
monitors (running 0.61) started filling up the disks, so they were
restarted
with:

mon compact on start = true

After a restart the osdmap was empty, it showed:

 osdmap e2: 0 osds: 0 up, 0 in
  pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB
data,
243
GB used, 66789 GB / 67032 GB avail
 mdsmap e1: 0/0/1 up

This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone.

I also checked the crushmap, all 36 OSDs were removed, no trace of
them.



As you guess, this is probably because the disks filled up. It
shouldn't be able to happen but we found an edge case where leveldb
falls apart; there's a fix for it in the repository now (asserting
that we get back what we just wrote) that Sage can talk more about.
Probably both disappeared because the monitor got nothing back when
reading in the newest OSD Map, and so it's all empty.



Sounds reasonable and logical.



"ceph auth list" still showed their keys though.

Restarting the OSDs didn't help, since create-or-move complained
that the
OSDs didn't exist and didn't do anything. I ran "ceph osd create"
to get
the
36 OSDs created again, but when the OSDs boot they never start
working.

The only thing they log is:

2013-06-26 01:00:08.852410 7f17f3f16700  0 -- 0.0.0.0:6801/4767 >>
10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0
l=0).fault with nothing to send, going to standby



Are they going up and just sitting idle? This is probably because none
of their peers are telling them to be responsible for any placement
groups on startup.



No, they never come up. So checking the monitor logs I only see the
create-or-move command changing their crush position, but they never
mark
themselves as "up", so all the OSDs stay down.

netstat however shows a connection with the monitor between the OSD
and the
Mon, but nothing special in the logs at lower debugging.


So the process is still running? Can you generate full logs with debug
ms = 5, debug osd = 20, debug monc = 20?



I've done so with 4 OSDs and I uploaded the logs of one OSD:

root@data1:~# sftp cephd...@ceph.com
cephd...@ceph.com's password:
Connected to ceph.com.
sftp> put ceph-osd-0-widodh-empty-osdmap.log.gz
Uploading ceph-osd-0-widodh-empty-osdmap.log.gz to
/home/cephdrop/ceph-osd-0-widodh-empty-osdmap.log.gz
ceph-osd-0-widodh-empty-osdmap.log.gz
100%   14MB   3.5MB/s   00:04
sftp>

My internet here is too slow to go through the logs and I haven't checked
them yet.



Couldn't resist going through them and I found this:

2013-06-26 22:21:11.721185 7ffe4b7c9780  7 osd.0 1137 consume_map version 1137
2013-06-26 22:21:11.746395 7ffe4b7c9780 10 osd.0 1137 done with init, starting boot process
2013-06-26 22:21:11.746400 7ffe4b7c9780 10 osd.0 1137 start_boot - have maps 503..1137

2013-06-26 22:21:11.746402 7ffe4b7c9780 10 monclient: get_version osdmap
2013-06-26 22:21:11.746404 7ffe4b7c9780 10 monclient: _send_mon_message to mon.mon1 at 10.23.24.8:6789/0
2013-06-26 22:21:11.746409 7ffe4b7c9780  1 -- 10.23.24.51:6800/27568 --> 10.23.24.8:6789/0 -- mon_get_version(what=osdmap handle=1) v1 -- ?+0 0x2bc5c40 con 0x4a28b00
2013-06-26 22:21:11.767132 7ffe3e43e700  1 -- 10.23.24.51:6800/27568 <== mon.0 10.23.24.8:6789/0 10  mon_check_map_ack(handle=1 version=59) v2  24+0+0 (1392806332 0 0) 0x6b9f8c0 con 0x4a28b00
2013-06-26 22:21:11.771242 7ffe3a436700 10 osd.0 1137 _maybe_boot mon has osdmaps 1..59

2013-06-26 22:21:11.771259 7ffe3a436700 10 osd.0 1137 _send_boot
2013-06-26 22:21:11.771261 7ffe3a436700 10 osd.0 1137  assuming cluster_addr ip matches client_addr
2013-06-26 22:21:11.771262 7ffe3a436700 10 osd.0 1137  assuming hb_addr ip matches cluster_addr
2013-06-26 22:21:11.771265 7ffe3a436700 10 osd.0 1137  client_addr 10.23.24.51:6800/27568, cluster_addr 10.23.24.51:6801/27568, hb addr 10.23.24.51:6802/27568
2013-06-26 22:21:11.771274 7ffe3a436700 10 monclient: _send_mon_message to mon.mon1 at 10.23.24.8:6789/0
2013-06-26 22:21:11.771276 7ffe3a436700  1 -- 10.23.24.51:6800/27568 --> 10.23.24.8:6789/0 -- osd_boot(osd.0 booted 0 v1137) v3 -- ?+0 0x43d3000 con 0x4a28b00

2013-06-26 22:21:14.712300 7ffe3ac37700 10 monclient: tick
2013-06-26 22:21:14.712320 7ffe3ac37700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2013-06-26 22:20:44.712319)
2013-06-26 22:21:14.712327 7ffe3ac37700 10 monclient: renew subs? (now: 2013-06-26 22:21:14.712327; renew after: 2013-06-26 22:23:41.712450) -- no
2013-06-26 22:21:16.413529 7ffe3442a700 20 osd.0 1137 update_osd_stat osd_stat(6673 MB used, 1855 GB avail, 1862 GB total, peers []/[])
2013-06-26 22:21:16.413542

Re: [ceph-users] Empty osd and crushmap after mon restart?

2013-06-26 Thread Wido den Hollander

On 06/26/2013 06:54 PM, Gregory Farnum wrote:

On Wed, Jun 26, 2013 at 12:24 AM, Wido den Hollander  wrote:

On 06/26/2013 01:18 AM, Gregory Farnum wrote:


Some guesses are inline.

On Tue, Jun 25, 2013 at 4:06 PM, Wido den Hollander  wrote:


Hi,

I'm not sure what happened, but on a Ceph cluster I noticed that the
monitors (running 0.61) started filling up the disks, so they were
restarted
with:

mon compact on start = true

After a restart the osdmap was empty, it showed:

 osdmap e2: 0 osds: 0 up, 0 in
  pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB data,
243
GB used, 66789 GB / 67032 GB avail
 mdsmap e1: 0/0/1 up

This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone.

I also checked the crushmap, all 36 OSDs were removed, no trace of them.



As you guess, this is probably because the disks filled up. It
shouldn't be able to happen but we found an edge case where leveldb
falls apart; there's a fix for it in the repository now (asserting
that we get back what we just wrote) that Sage can talk more about.
Probably both disappeared because the monitor got nothing back when
reading in the newest OSD Map, and so it's all empty.



Sounds reasonable and logical.



"ceph auth list" still showed their keys though.

Restarting the OSDs didn't help, since create-or-move complained that the
OSDs didn't exist and didn't do anything. I ran "ceph osd create" to get
the
36 OSDs created again, but when the OSDs boot they never start working.

The only thing they log is:

2013-06-26 01:00:08.852410 7f17f3f16700  0 -- 0.0.0.0:6801/4767 >>
10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0
l=0).fault with nothing to send, going to standby



Are they going up and just sitting idle? This is probably because none
of their peers are telling them to be responsible for any placement
groups on startup.



No, they never come up. So checking the monitor logs I only see the
create-or-move command changing their crush position, but they never mark
themselves as "up", so all the OSDs stay down.

netstat however shows a connection with the monitor between the OSD and the
Mon, but nothing special in the logs at lower debugging.


So the process is still running? Can you generate full logs with debug
ms = 5, debug osd = 20, debug monc = 20?



I've done so with 4 OSDs and I uploaded the logs of one OSD:

root@data1:~# sftp cephd...@ceph.com
cephd...@ceph.com's password:
Connected to ceph.com.
sftp> put ceph-osd-0-widodh-empty-osdmap.log.gz
Uploading ceph-osd-0-widodh-empty-osdmap.log.gz to 
/home/cephdrop/ceph-osd-0-widodh-empty-osdmap.log.gz
ceph-osd-0-widodh-empty-osdmap.log.gz 


100%   14MB   3.5MB/s   00:04
sftp>

My internet here is too slow to go through the logs and I haven't checked
them yet.



The internet connection I'm behind is a 3G connection, so I can't go
skimming through the logs with debugging at very high levels, but I'm
just
wondering what this could be?

It's obvious that the monitors filling up probably triggered the problem,
but I'm now looking at a way to get the OSDs back up again.

In the meantime I upgraded all the nodes to 0.61.4, but that didn't
change
anything.

Any ideas on what this might be and how to resolve it?



At a guess, you can go in and grab the last good version of the OSD
Map and inject that back into the cluster, then restart the OSDs? If
that doesn't work then we'll need to figure out the right way to kick
them into being responsible for their stuff.
(First, make sure that when you turn them on they are actually
connecting to the monitors.)



You mean grabbing the old OSDMap from an OSD or the Monitor store? Both are
using leveldb for their storage now, right? So I'd have to grab the OSD Map
using some leveldb tooling?


There's a ceph-monstore-tool or similar that provides this
functionality, although it's pretty new so you might need to grab an
autobuilt package somewhere instead of the cuttlefish one (not sure)


Ah, cool! I'll give that a try.


-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph journal double writes?

2013-06-26 Thread Gregory Farnum
On Wed, Jun 26, 2013 at 11:57 AM, Oliver Fuckner  wrote:
> How do I debug expander behaviour? I know lsiutil, but is there something
> like iostat for sas lanes/phys? Talking about oversubscription: What I
> really try is 24 streams to one SSD-mirror. So I will probably need more
> ssds, okay...

Just by testing simultaneous writes to all your drives and seeing what
happens, basically.

> > I would test the write behavior of your disks independently of Ceph (but
> > simultaneously!) and see what happens.

> well dd to the ssds also shows 400MByte/sec with 4MByte blocks.

Ah, they're in a RAID-1 mirror with all 24 going to it? Yeah, 24
simultaneous streams is beyond what a lot of controllers and drives
can handle very well. Try running multiple streams to the drives and
see what their average, deviation, and aggregate bandwidth is. :) I
expect that's the issue.
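
Something along these lines (adjust paths and sizes; it writes files on the
journal filesystem rather than the raw md device, so it's non-destructive)
will give you per-stream and aggregate numbers:

    fio --name=journal-streams --directory=/data/journal --rw=write \
        --direct=1 --bs=4M --size=2G --numjobs=12 --group_reporting

Drop --group_reporting if you want the per-job numbers, and compare the
aggregate against a single stream (--numjobs=1).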
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph journal double writes?

2013-06-26 Thread Oliver Fuckner

On 6/26/2013 4:49 PM, Gregory Farnum wrote:

On Wednesday, June 26, 2013, Oliver Fuckner wrote:

Hi,
I am fairly new to ceph and just built my first 4 systems.

I use:
Supermicro X9SCL-Board with E3-1240 (4*3.4GHz) CPU and 32GB RAM
LSI 9211-4i SAS HBA with 24 SATA disks and 2 SSDs (Intel 3700,
100GB), all connected through a 6GBit-SAS expander
CentOS 6.4 with Kernel 2.6.32-358.11.1, 64bit
ceph 0.61.4
Intel 10GigEthernet NICs are used to connect the nodes together
xfs is used on journal and osds


The SSDs are configured in a mdadm raid1 and used for journals.
The SSDs can write 400MBytes/sec each, but the sum of all disks is
exactly half of it, 200MBytes/sec.




So there are 2 journal writes for every write to the osd?


No.

Is this expected behaviour? Why?


No, but at a guess your expanders aren't behaving properly. 
Alternatively, your SSDs don't handle twelve write streams so well -- 
that's quite a lot of oversubscription.


How do I debug expander behaviour? I know lsiutil, but is there 
something like iostat for sas lanes/phys? Talking about 
oversubscription: What I really try is 24 streams to one SSD-mirror. So 
I will probably need more ssds, okay...



I would test the write behavior of your disks independently of Ceph 
(but simultaneously!) and see what happens.


well dd to the ssds also shows 400MByte/sec with 4MByte blocks.

Thanks,
 Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on mixed AMD/Intel architecture

2013-06-26 Thread Gregory Farnum
On Wed, Jun 26, 2013 at 10:57 AM, Greg Chavez  wrote:
> I could have sworn that I read somewhere, very early on in my
> investigation of Ceph, that your OSDs need to run on the same processor
> architecture.  Only it suddenly occurred to me that for the last
> month, I've been running a small 3-node cluster with two Intel
> systems and one AMD system.  I thought they were all AMD!
>
> So... is this a problem?  It seems to be running well.

Well, AMD and Intel processors are all the same architecture (for now,
at least). ;)
However, Ceph *should* be fully processor independent — we don't test
it very often but everything that goes over the network is converted
to the same format regardless of what the local processor is doing. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph on mixed AMD/Intel architecture

2013-06-26 Thread Greg Chavez
I could have sworn that I read somewhere, very early on in my
investigation of Ceph, that your OSDs need to run on the same processor
architecture.  Only it suddenly occurred to me that for the last
month, I've been running a small 3-node cluster with two Intel
systems and one AMD system.  I thought they were all AMD!

So... is this a problem?  It seems to be running well.

Thanks.

--
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Empty osd and crushmap after mon restart?

2013-06-26 Thread Gregory Farnum
On Wed, Jun 26, 2013 at 12:24 AM, Wido den Hollander  wrote:
> On 06/26/2013 01:18 AM, Gregory Farnum wrote:
>>
>> Some guesses are inline.
>>
>> On Tue, Jun 25, 2013 at 4:06 PM, Wido den Hollander  wrote:
>>>
>>> Hi,
>>>
>>> I'm not sure what happened, but on a Ceph cluster I noticed that the
>>> monitors (running 0.61) started filling up the disks, so they were
>>> restarted
>>> with:
>>>
>>> mon compact on start = true
>>>
>>> After a restart the osdmap was empty, it showed:
>>>
>>> osdmap e2: 0 osds: 0 up, 0 in
>>>  pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB data,
>>> 243
>>> GB used, 66789 GB / 67032 GB avail
>>> mdsmap e1: 0/0/1 up
>>>
>>> This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone.
>>>
>>> I also checked the crushmap, all 36 OSDs were removed, no trace of them.
>>
>>
>> As you guess, this is probably because the disks filled up. It
>> shouldn't be able to happen but we found an edge case where leveldb
>> falls apart; there's a fix for it in the repository now (asserting
>> that we get back what we just wrote) that Sage can talk more about.
>> Probably both disappeared because the monitor got nothing back when
>> reading in the newest OSD Map, and so it's all empty.
>>
>
> Sounds reasonable and logical.
>
>
>>> "ceph auth list" still showed their keys though.
>>>
>>> Restarting the OSDs didn't help, since create-or-move complained that the
>>> OSDs didn't exist and didn't do anything. I ran "ceph osd create" to get
>>> the
>>> 36 OSDs created again, but when the OSDs boot they never start working.
>>>
>>> The only thing they log is:
>>>
>>> 2013-06-26 01:00:08.852410 7f17f3f16700  0 -- 0.0.0.0:6801/4767 >>
>>> 10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0
>>> l=0).fault with nothing to send, going to standby
>>
>>
>> Are they going up and just sitting idle? This is probably because none
>> of their peers are telling them to be responsible for any placement
>> groups on startup.
>>
>
> No, they never come up. So checking the monitor logs I only see the
> create-or-move command changing their crush position, but they never mark
> themselves as "up", so all the OSDs stay down.
>
> netstat however shows a connection with the monitor between the OSD and the
> Mon, but nothing special in the logs at lower debugging.

So the process is still running? Can you generate full logs with debug
ms = 5, debug osd = 20, debug monc = 20?

>>> The internet connection I'm behind is a 3G connection, so I can't go
>>> skimming through the logs with debugging at very high levels, but I'm
>>> just
>>> wondering what this could be?
>>>
>>> It's obvious that the monitors filling up probably triggered the problem,
>>> but I'm now looking at a way to get the OSDs back up again.
>>>
>>> In the meantime I upgraded all the nodes to 0.61.4, but that didn't
>>> change
>>> anything.
>>>
>>> Any ideas on what this might be and how to resolve it?
>>
>>
>> At a guess, you can go in and grab the last good version of the OSD
>> Map and inject that back into the cluster, then restart the OSDs? If
>> that doesn't work then we'll need to figure out the right way to kick
>> them into being responsible for their stuff.
>> (First, make sure that when you turn them on they are actually
>> connecting to the monitors.)
>
>
> You mean grabbing the old OSDMap from an OSD or the Monitor store? Both are
> using leveldb for their storage now, right? So I'd have to grab the OSD Map
> using some leveldb tooling?

There's a ceph-monstore-tool or similar that provides this
functionality, although it's pretty new so you might need to grab an
autobuilt package somewhere instead of the cuttlefish one (not sure)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph journal double writes?

2013-06-26 Thread Gregory Farnum
On Wednesday, June 26, 2013, Oliver Fuckner wrote:

> Hi,
> I am fairly new to ceph and just built my first 4 systems.
>
> I use:
> Supermicro X9SCL-Board with E3-1240 (4*3.4GHz) CPU and 32GB RAM
> LSI 9211-4i SAS HBA with 24 SATA disks and 2 SSDs (Intel 3700, 100GB), all
> connected through a 6GBit-SAS expander
> CentOS 6.4 with Kernel 2.6.32-358.11.1, 64bit
> ceph 0.61.4
> Intel 10GigEthernet NICs are used to connect the nodes together
> xfs is used on journal and osds
>
>
> The SSDs are configured in a mdadm raid1 and used for journals.
> The SSDs can write 400MBytes/sec each, but the sum of all disks is exactly
> half of it, 200MBytes/sec.
>
>
>
>
> So there are 2 journal writes for every write to the osd?


No.


> Is this expected behaviour? Why?


No, but at a guess your expanders aren't behaving properly. Alternatively,
your SSDs don't handle twelve write streams so well -- that's quite a lot
of oversubscription.

I would test the write behavior of your disks independently of Ceph (but
simultaneously!) and see what happens.
-Greg

This can be seen with real load and rados bench write -t 64
>
>
> Details:
>
>
> SSDs:
> mdadm creation:
> mdadm --create /dev/md2 --run --raid-devices=2 --level=raid1 --name=ssd
> /dev/sdc /dev/sdd
> mkfs.xfs -f -i size=2048 /dev/md2
>
> mount options in /etc/fstab:
> /dev/md2  /data/journal xfs rw,noatime,discard  1 2
>
> OSDs:
> mkfs.xfs -i size=2048 /dev/$24-disks
> ceph-osd -i $ID --mkfs --mkkey --mkjournal --osd-data /data/osd.slot
> --osd-journal /data/journal/slot
>
>
> Journals are limited to 2GByte per osd via ceph.conf
>
>
> Thanks,
>  Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problem connecting with Cyberduck

2013-06-26 Thread Gary Bruce
Hi All,

I have followed the 2-node install and am trying to connect using Cyberduck to:

https://x...@cephserver1.zion.bt.co.uk/

I get the following message:

I/O Error
Connection failed
Unrecognised SSL message, plaintext  connection?.
GET /HTTP/1.1
Date.
Authorisation: AWS

Host: cephserver1.zion.bt.co.uk:443
Connection: Keep-Alive
User-Agent: Cyberduck...

Can anyone help?

Thanks
Gary
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph journal double writes?

2013-06-26 Thread Oliver Fuckner

Hi,
I am fairly new to ceph and just built my first 4 systems.

I use:
Supermicro X9SCL-Board with E3-1240 (4*3.4GHz) CPU and 32GB RAM
LSI 9211-4i SAS HBA with 24 SATA disks and 2 SSDs (Intel 3700, 100GB), 
all connected through a 6GBit-SAS expander

CentOS 6.4 with Kernel 2.6.32-358.11.1, 64bit
ceph 0.61.4
Intel 10GigEthernet NICs are used to connect the nodes together
xfs is used on journal and osds


The SSDs are configured in a mdadm raid1 and used for journals.
The SSDs can write 400MBytes/sec each, but the sum of all disks is 
exactly half of it, 200MBytes/sec.





So there are 2 journal writes for every write to the osd? Is this 
expected behaviour? Why?

This can be seen with real load and rados bench write -t 64
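
For reference, the full rados bench invocation is something along these lines
(pool name and duration are placeholders):

    rados -p <pool> bench <seconds> write -t 64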


Details:


SSDs:
mdadm creation:
mdadm --create /dev/md2 --run --raid-devices=2 --level=raid1 --name=ssd 
/dev/sdc /dev/sdd

mkfs.xfs -f -i size=2048 /dev/md2

mount options in /etc/fstab:
/dev/md2  /data/journal xfs rw,noatime,discard  1 2

OSDs:
mkfs.xfs -i size=2048 /dev/$24-disks
ceph-osd -i $ID --mkfs --mkkey --mkjournal --osd-data /data/osd.slot 
--osd-journal /data/journal/slot



Journals are limited to 2GByte per osd via ceph.conf
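
That is, something along these lines in the [osd] section (the value is in MB):

    osd journal size = 2048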


Thanks,
 Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Empty osd and crushmap after mon restart?

2013-06-26 Thread Wido den Hollander

On 06/26/2013 01:18 AM, Gregory Farnum wrote:

Some guesses are inline.

On Tue, Jun 25, 2013 at 4:06 PM, Wido den Hollander  wrote:

Hi,

I'm not sure what happened, but on a Ceph cluster I noticed that the
monitors (running 0.61) started filling up the disks, so they were restarted
with:

mon compact on start = true

After a restart the osdmap was empty, it showed:

osdmap e2: 0 osds: 0 up, 0 in
 pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB data, 243
GB used, 66789 GB / 67032 GB avail
mdsmap e1: 0/0/1 up

This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone.

I also checked the crushmap, all 36 OSDs were removed, no trace of them.


As you guess, this is probably because the disks filled up. It
shouldn't be able to happen but we found an edge case where leveldb
falls apart; there's a fix for it in the repository now (asserting
that we get back what we just wrote) that Sage can talk more about.
Probably both disappeared because the monitor got nothing back when
reading in the newest OSD Map, and so it's all empty.



Sounds reasonable and logical.


"ceph auth list" still showed their keys though.

Restarting the OSDs didn't help, since create-or-move complained that the
OSDs didn't exist and didn't do anything. I ran "ceph osd create" to get the
36 OSDs created again, but when the OSDs boot they never start working.

The only thing they log is:

2013-06-26 01:00:08.852410 7f17f3f16700  0 -- 0.0.0.0:6801/4767 >>
10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0
l=0).fault with nothing to send, going to standby


Are they going up and just sitting idle? This is probably because none
of their peers are telling them to be responsible for any placement
groups on startup.



No, they never come up. So checking the monitor logs I only see the 
create-or-move command changing their crush position, but they never 
mark themselves as "up", so all the OSDs stay down.


netstat however shows a connection with the monitor between the OSD and 
the Mon, but nothing special in the logs at lower debugging.



The internet connection I'm behind is a 3G connection, so I can't go
skimming through the logs with debugging at very high levels, but I'm just
wondering what this could be?

It's obvious that the monitors filling up probably triggered the problem,
but I'm now looking at a way to get the OSDs back up again.

In the meantime I upgraded all the nodes to 0.61.4, but that didn't change
anything.

Any ideas on what this might be and how to resolve it?


At a guess, you can go in and grab the last good version of the OSD
Map and inject that back into the cluster, then restart the OSDs? If
that doesn't work then we'll need to figure out the right way to kick
them into being responsible for their stuff.
(First, make sure that when you turn them on they are actually
connecting to the monitors.)


You mean grabbing the old OSDMap from an OSD or the Monitor store? Both 
are using leveldb for their storage now, right? So I'd have to grab the 
OSD Map using some leveldb tooling?


Wido



-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com