optimize librbd for iops

2012-11-12 Thread Stefan Priebe - Profihost AG

Hello list,

are there any plans to optimize librbd for iops? Right now I'm able to 
get 50,000 IOP/s via iSCSI and 100,000 IOP/s using multipathing with iSCSI.


With librbd I'm stuck at around 18,000 IOP/s. Since this scales with more 
hosts but not with more disks in a VM, it must be limited by the rbd 
implementation in kvm / librbd.


Greets,
Stefan


Re: Disabling journal

2012-11-12 Thread Sage Weil
On Sun, 11 Nov 2012, Stefan Priebe wrote:
 Hi Sage,
 
  With btrfs, yes, although this isn't something we have tested in a while.
 I'm not using btrfs as long as the devs claim it is not ready for prod.

In that case, the journal is needed for consistency of the fs; we rely on 
write-ahead journaling.  It can't be turned off.

Putting it on a ramdisk in this case is interesting for performance, but 
it means that a crash/reboot/powerloss event leaves the fs in an 
inconsistent and unusable state.

The only time tmpfs is potentially useful in production is when you're 
using btrfs *and* have independent backup power sources for replicas (and 
can thus avoid worrying about a site-wide power failure and loss of 
journal).  (Or have relaxed requirements for the durability of recent 
writes.)

sage



Re: Disabling journal

2012-11-12 Thread Stefan Priebe - Profihost AG

Am 12.11.2012 15:42, schrieb Sage Weil:

On Sun, 11 Nov 2012, Stefan Priebe wrote:

Hi Sage,


With btrfs, yes, although this isn't something we have tested in a while.

I'm not using btrfs as long as the devs claim it is not ready for prod.


In that case, the journal is needed for consistency of the fs; we rely on
writeahead journaling.  It can't be turned off.

Putting it on a ramdisk in this case is interesting for performance, but
it means that a crash/reboot/powerloss event leaves the fs in an
inconsistent and unusable state.


But only if, with 2 replicas, both nodes crash / have a power loss?


The only time tmpfs is potentially useful in production is when you're
using btrfs *and* have independent backup power sources for replicas (and
can thus avoid worrying about a site-wide power failure and loss of
journal).  (Or have relaxed requirements for the durability of recent
writes.)
What happens with XFS and two replicas when ONE host has a power loss? The 
other replica / journal should still be there.


I have no idea what to put the journal on.

I mean I have 8 SSDs per host, one per OSD, each with a write speed of 
45,000 IOP/s, for a total write speed of 360,000 IOP/s per node.


Which journal device can handle this? And if I put the journal on the 
same disk as the OSD, it has to copy the data around.


Greets,
Stefan


Re: Disabling journal

2012-11-12 Thread Sage Weil
On Mon, 12 Nov 2012, Stefan Priebe - Profihost AG wrote:
 Am 12.11.2012 15:42, schrieb Sage Weil:
  On Sun, 11 Nov 2012, Stefan Priebe wrote:
   Hi Sage,
   
With btrfs, yes, although this isn't something we have tested in a
while.
   I'm not using btrfs as long as the devs claim it is not ready for prod.
  
  In that case, the journal is needed for consistency of the fs; we rely on
  writeahead journaling.  It can't be turned off.
  
  Putting it on a ramdisk in this case is interesting for performance, but
  it means that a crash/reboot/powerloss event leaves the fs in an
  inconsistent and unusable state.
 
 But only if for replicas 2 both nodes crash / have a powerloss?

Then you're okay.. but the one that lost the journal effectively also lost 
the contents of the SSD.  Also, manual intervention is currently needed to 
reinitialize the osd (since this is not a normal failure mode).

  The only time tmpfs is potentially useful in production is when you're
  using btrfs *and* have independent backup power sources for replicas (and
  can thus avoid worrying about a site-wide power failure and loss of
  journal).  (Or have relaxed requirements for the durability of recent
  writes.)
 What happens for XFS and replicas two and ONE host has a power loss? The other
 replica / journal should be still there.
 
 I've no idea where to put the journal on.
 
 I mean i've 8 SSDs per Host one per osd each with a write IOP/s speed of
 45.000 iops to whole IOP/s write speed of 360.000 IOP/s per Node.
 
 Which journal device can handle this? And if i put the journal on the same
 disk as the OSD it has to copy the data around.

I think you have two choices.  Either put the journals on SSDs (perhaps a 
journal on an existing one), or use a higher-end NVRAM-based device.  
There are several of these out there, although I'm blanking on product 
names at the moment.  The best are probably the battery-backed DRAM ones 
with a bit of flash for when the battery gets low.  Lots of RAID 
controllers also have some onboard NVRAM that can often be finagled into 
being useful, at least with spinning disks; I'm not sure how they perform 
with SSDs.
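
For what it's worth, pointing an OSD at an external journal is just per-OSD (or 
per-host) configuration; a minimal sketch with hypothetical paths, where a raw 
partition on the journal device avoids any filesystem overhead:

  [osd.0]
      osd journal = /dev/sdb1              ; raw partition on the journal SSD (example path)
      ; or, as a file on an existing filesystem:
      ; osd journal = /srv/ceph/osd.0/journal
      ; osd journal size = 1024            ; MB, for a file journal (a raw partition is used whole)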

sage


ceph cluster hangs when rebooting one node

2012-11-12 Thread Stefan Priebe - Profihost AG

Hello list,

I was checking what happens if I reboot a Ceph node.

Sadly, if I reboot one node, the whole ceph cluster hangs and no I/O is 
possible.


ceph -w:
Looks like this:
2012-11-12 16:03:58.191106 mon.0 [INF] pgmap v19013: 7032 pgs: 7032 
active+clean; 91615 MB data, 174 GB used, 4294 GB / 4469 GB avail

2012-11-12 16:04:08.365557 mon.0 [INF] mon.a calling new monitor election
2012-11-12 16:04:13.422682 mon.0 [INF] mon.a@0 won leader election with 
quorum 0,2
2012-11-12 16:04:13.708045 mon.0 [INF] pgmap v19014: 7032 pgs: 7032 
active+clean; 91615 MB data, 174 GB used, 4294 GB / 4469 GB avail

2012-11-12 16:04:13.708059 mon.0 [INF] mdsmap e1: 0/0/1 up
2012-11-12 16:04:13.708070 mon.0 [INF] osdmap e4582: 20 osds: 20 up, 20 in
2012-11-12 16:04:08.242688 mon.2 [INF] mon.c calling new monitor election
2012-11-12 16:04:13.708089 mon.0 [INF] monmap e1: 3 mons at 
{a=10.255.0.100:6789/0,b=10.255.0.101:6789/0,c=10.255.0.102:6789/0}
2012-11-12 16:04:14.070593 mon.0 [INF] pgmap v19015: 7032 pgs: 7032 
active+clean; 91615 MB data, 174 GB used, 4294 GB / 4469 GB avail
2012-11-12 16:04:15.283954 mon.0 [INF] pgmap v19016: 7032 pgs: 7032 
active+clean; 91615 MB data, 174 GB used, 4294 GB / 4469 GB avail
2012-11-12 16:04:18.506812 mon.0 [INF] osd.21 10.255.0.101:6800/5049 
failed (3 reports from 3 peers after 20.339769 = grace 20.00)

2012-11-12 16:04:18.890003 mon.0 [INF] osdmap e4583: 20 osds: 19 up, 20 in
2012-11-12 16:04:19.137936 mon.0 [INF] pgmap v19017: 7032 pgs: 6720 
active+clean, 312 stale+active+clean; 91615 MB data, 174 GB used, 4294 
GB / 4469 GB avail

2012-11-12 16:04:20.024595 mon.0 [INF] osdmap e4584: 20 osds: 19 up, 20 in
2012-11-12 16:04:20.330149 mon.0 [INF] pgmap v19018: 7032 pgs: 6720 
active+clean, 312 stale+active+clean; 91615 MB data, 174 GB used, 4294 
GB / 4469 GB avail
2012-11-12 16:04:21.535471 mon.0 [INF] pgmap v19019: 7032 pgs: 6720 
active+clean, 312 stale+active+clean; 91615 MB data, 174 GB used, 4294 
GB / 4469 GB avail
2012-11-12 16:04:24.181292 mon.0 [INF] osd.22 10.255.0.101:6803/5153 
failed (3 reports from 3 peers after 23.013550 = grace 20.00)
2012-11-12 16:04:24.182208 mon.0 [INF] osd.23 10.255.0.101:6806/5276 
failed (3 reports from 3 peers after 21.000834 = grace 20.00)
2012-11-12 16:04:24.671373 mon.0 [INF] pgmap v19020: 7032 pgs: 6637 
active+clean, 208 stale+active+clean, 187 incomplete; 91615 MB data, 174 
GB used, 4295 GB / 4469 GB avail

2012-11-12 16:04:24.829022 mon.0 [INF] osdmap e4585: 20 osds: 17 up, 20 in
2012-11-12 16:04:24.870969 mon.0 [INF] osd.24 10.255.0.101:6809/5397 
failed (3 reports from 3 peers after 20.688672 = grace 20.00)
2012-11-12 16:04:25.522333 mon.0 [INF] pgmap v19021: 7032 pgs: 5912 
active+clean, 933 stale+active+clean, 187 incomplete; 91615 MB data, 174 
GB used, 4295 GB / 4469 GB avail
2012-11-12 16:04:25.596927 mon.0 [INF] osd.24 10.255.0.101:6809/5397 
failed (3 reports from 3 peers after 21.708444 = grace 20.00)

2012-11-12 16:04:26.077545 mon.0 [INF] osdmap e4586: 20 osds: 16 up, 20 in
2012-11-12 16:04:26.606475 mon.0 [INF] pgmap v19022: 7032 pgs: 5394 
active+clean, 1094 stale+active+clean, 544 incomplete; 91615 MB data, 
173 GB used, 4296 GB / 4469 GB avail

2012-11-12 16:04:27.162034 mon.0 [INF] osdmap e4587: 20 osds: 16 up, 20 in
2012-11-12 16:04:27.656974 mon.0 [INF] pgmap v19023: 7032 pgs: 5394 
active+clean, 1094 stale+active+clean, 544 incomplete; 91615 MB data, 
173 GB used, 4296 GB / 4469 GB avail
2012-11-12 16:04:30.229958 mon.0 [INF] pgmap v19024: 7032 pgs: 5394 
active+clean, 1094 stale+active+clean, 544 incomplete; 91615 MB data, 
172 GB used, 4296 GB / 4469 GB avail
2012-11-12 16:04:31.411989 mon.0 [INF] pgmap v19025: 7032 pgs: 5394 
active+clean, 1094 stale+active+clean, 544 incomplete; 91615 MB data, 
172 GB used, 4296 GB / 4469 GB avail
2012-11-12 16:04:32.617576 mon.0 [INF] pgmap v19026: 7032 pgs: 4660 
active+clean, 2372 incomplete; 91615 MB data, 171 GB used, 4298 GB / 
4469 GB avail
2012-11-12 16:04:35.172861 mon.0 [INF] pgmap v19027: 7032 pgs: 4660 
active+clean, 2372 incomplete; 91615 MB data, 171 GB used, 4298 GB / 
4469 GB avail
2012-11-12 16:04:30.505872 osd.53 [WRN] 6 slow requests, 6 included 
below; oldest blocked for  30.247691 secs
2012-11-12 16:04:30.505875 osd.53 [WRN] slow request 30.247691 seconds 
old, received at 2012-11-12 16:04:00.258118: 
osd_op(client.131626.0:771962 rb.0.107a.734602d5.0bce [write 
2478080~4096] 3.562a9efc) v4 currently reached pg
2012-11-12 16:04:30.505879 osd.53 [WRN] slow request 30.238016 seconds 
old, received at 2012-11-12 16:04:00.267793: 
osd_op(client.131626.0:772116 rb.0.107a.734602d5.1608 [write 
262144~4096] 3.a47890e) v4 currently reached pg
2012-11-12 16:04:30.505881 osd.53 [WRN] slow request 30.236572 seconds 
old, received at 2012-11-12 16:04:00.269237: 
osd_op(client.131626.0:772141 rb.0.107a.734602d5.1777 [write 
798720~4096] 3.547bc855) v4 currently reached pg
2012-11-12 16:04:30.505883 osd.53 [WRN] slow request 

Re: [BUG] ceph-mon crashes

2012-11-12 Thread Stefan Priebe - Profihost AG

On 12.11.2012 15:58, Joao Eduardo Luis wrote:

Hi Stefan,


Any chance you can get me a larger chunk of the log from the monitor
that was the leader by the time you issued those commands until the
point the monitor crashed (from the excerpt you provided, that should be
mon.b)?


Sure:
https://www.dropbox.com/s/8e604bihk56m0yd/ceph-mon.b.log.1.gz

Greets,
Stefan


Re: ceph cluster hangs when rebooting one node

2012-11-12 Thread Sage Weil
On Mon, 12 Nov 2012, Stefan Priebe - Profihost AG wrote:
 Hello list,
 
 i was checking what happens if i reboot a ceph node.
 
 Sadly if i reboot one node, the whole ceph cluster hangs and no I/O is
 possible.

If you are using the current master, the new 'min_size' may be biting you; 
run ceph osd dump | grep ^pool and see if you see min_size for your pools.  
You can change that back to the normal behavior with

 ceph osd pool set poolname min_size 1
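
For example (pool name here is just a placeholder), checking and relaxing it
would look something like:

  ceph osd dump | grep ^pool                # does a min_size field show up per pool?
  ceph osd pool set kvmpool1 min_size 1     # relax it for the pool the VMs write to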

sage


 
 ceph -w:
 Looks like this:
 2012-11-12 16:03:58.191106 mon.0 [INF] pgmap v19013: 7032 pgs: 7032
 active+clean; 91615 MB data, 174 GB used, 4294 GB / 4469 GB avail
 2012-11-12 16:04:08.365557 mon.0 [INF] mon.a calling new monitor election
 2012-11-12 16:04:13.422682 mon.0 [INF] mon.a@0 won leader election with quorum
 0,2
 2012-11-12 16:04:13.708045 mon.0 [INF] pgmap v19014: 7032 pgs: 7032
 active+clean; 91615 MB data, 174 GB used, 4294 GB / 4469 GB avail
 2012-11-12 16:04:13.708059 mon.0 [INF] mdsmap e1: 0/0/1 up
 2012-11-12 16:04:13.708070 mon.0 [INF] osdmap e4582: 20 osds: 20 up, 20 in
 2012-11-12 16:04:08.242688 mon.2 [INF] mon.c calling new monitor election
 2012-11-12 16:04:13.708089 mon.0 [INF] monmap e1: 3 mons at
 {a=10.255.0.100:6789/0,b=10.255.0.101:6789/0,c=10.255.0.102:6789/0}
 2012-11-12 16:04:14.070593 mon.0 [INF] pgmap v19015: 7032 pgs: 7032
 active+clean; 91615 MB data, 174 GB used, 4294 GB / 4469 GB avail
 2012-11-12 16:04:15.283954 mon.0 [INF] pgmap v19016: 7032 pgs: 7032
 active+clean; 91615 MB data, 174 GB used, 4294 GB / 4469 GB avail
 2012-11-12 16:04:18.506812 mon.0 [INF] osd.21 10.255.0.101:6800/5049 failed (3
 reports from 3 peers after 20.339769 = grace 20.00)
 2012-11-12 16:04:18.890003 mon.0 [INF] osdmap e4583: 20 osds: 19 up, 20 in
 2012-11-12 16:04:19.137936 mon.0 [INF] pgmap v19017: 7032 pgs: 6720
 active+clean, 312 stale+active+clean; 91615 MB data, 174 GB used, 4294 GB /
 4469 GB avail
 2012-11-12 16:04:20.024595 mon.0 [INF] osdmap e4584: 20 osds: 19 up, 20 in
 2012-11-12 16:04:20.330149 mon.0 [INF] pgmap v19018: 7032 pgs: 6720
 active+clean, 312 stale+active+clean; 91615 MB data, 174 GB used, 4294 GB /
 4469 GB avail
 2012-11-12 16:04:21.535471 mon.0 [INF] pgmap v19019: 7032 pgs: 6720
 active+clean, 312 stale+active+clean; 91615 MB data, 174 GB used, 4294 GB /
 4469 GB avail
 2012-11-12 16:04:24.181292 mon.0 [INF] osd.22 10.255.0.101:6803/5153 failed (3
 reports from 3 peers after 23.013550 = grace 20.00)
 2012-11-12 16:04:24.182208 mon.0 [INF] osd.23 10.255.0.101:6806/5276 failed (3
 reports from 3 peers after 21.000834 = grace 20.00)
 2012-11-12 16:04:24.671373 mon.0 [INF] pgmap v19020: 7032 pgs: 6637
 active+clean, 208 stale+active+clean, 187 incomplete; 91615 MB data, 174 GB
 used, 4295 GB / 4469 GB avail
 2012-11-12 16:04:24.829022 mon.0 [INF] osdmap e4585: 20 osds: 17 up, 20 in
 2012-11-12 16:04:24.870969 mon.0 [INF] osd.24 10.255.0.101:6809/5397 failed (3
 reports from 3 peers after 20.688672 = grace 20.00)
 2012-11-12 16:04:25.522333 mon.0 [INF] pgmap v19021: 7032 pgs: 5912
 active+clean, 933 stale+active+clean, 187 incomplete; 91615 MB data, 174 GB
 used, 4295 GB / 4469 GB avail
 2012-11-12 16:04:25.596927 mon.0 [INF] osd.24 10.255.0.101:6809/5397 failed (3
 reports from 3 peers after 21.708444 = grace 20.00)
 2012-11-12 16:04:26.077545 mon.0 [INF] osdmap e4586: 20 osds: 16 up, 20 in
 2012-11-12 16:04:26.606475 mon.0 [INF] pgmap v19022: 7032 pgs: 5394
 active+clean, 1094 stale+active+clean, 544 incomplete; 91615 MB data, 173 GB
 used, 4296 GB / 4469 GB avail
 2012-11-12 16:04:27.162034 mon.0 [INF] osdmap e4587: 20 osds: 16 up, 20 in
 2012-11-12 16:04:27.656974 mon.0 [INF] pgmap v19023: 7032 pgs: 5394
 active+clean, 1094 stale+active+clean, 544 incomplete; 91615 MB data, 173 GB
 used, 4296 GB / 4469 GB avail
 2012-11-12 16:04:30.229958 mon.0 [INF] pgmap v19024: 7032 pgs: 5394
 active+clean, 1094 stale+active+clean, 544 incomplete; 91615 MB data, 172 GB
 used, 4296 GB / 4469 GB avail
 2012-11-12 16:04:31.411989 mon.0 [INF] pgmap v19025: 7032 pgs: 5394
 active+clean, 1094 stale+active+clean, 544 incomplete; 91615 MB data, 172 GB
 used, 4296 GB / 4469 GB avail
 2012-11-12 16:04:32.617576 mon.0 [INF] pgmap v19026: 7032 pgs: 4660
 active+clean, 2372 incomplete; 91615 MB data, 171 GB used, 4298 GB / 4469 GB
 avail
 2012-11-12 16:04:35.172861 mon.0 [INF] pgmap v19027: 7032 pgs: 4660
 active+clean, 2372 incomplete; 91615 MB data, 171 GB used, 4298 GB / 4469 GB
 avail
 2012-11-12 16:04:30.505872 osd.53 [WRN] 6 slow requests, 6 included below;
 oldest blocked for  30.247691 secs
 2012-11-12 16:04:30.505875 osd.53 [WRN] slow request 30.247691 seconds old,
 received at 2012-11-12 16:04:00.258118: osd_op(client.131626.0:771962
 rb.0.107a.734602d5.0bce [write 2478080~4096] 3.562a9efc) v4 currently
 reached pg
 2012-11-12 16:04:30.505879 osd.53 [WRN] slow request 30.238016 seconds old,
 received at 2012-11-12 16:04:00.267793: osd_op(client.131626.0:772116
 rb.0.107a.734602d5.1608 [write 

Re: ceph cluster hangs when rebooting one node

2012-11-12 Thread Stefan Priebe - Profihost AG

On 12.11.2012 16:11, Sage Weil wrote:

On Mon, 12 Nov 2012, Stefan Priebe - Profihost AG wrote:

Hello list,

i was checking what happens if i reboot a ceph node.

Sadly if i reboot one node, the whole ceph cluster hangs and no I/O is
possible.


If you are using the current master, the new 'min_size' may be biting you;
ceph osd dump | grep ^pool and see if you see min_size for your pools.
You can change that back to the norma behavior with


No, I don't see any min_size:
# ceph osd dump | grep ^pool
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 
1344 pgp_num 1344 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 
1344 pgp_num 1344 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 1344 
pgp_num 1344 last_change 1 owner 0
pool 3 'kvmpool1' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 
3000 pgp_num 3000 last_change 958 owner 0



  ceph osd pool set poolname min_size 1
Yes, this helps! But min_size is still not shown in ceph osd dump. Also, 
when I reboot a node it takes up to 10s-20s until all OSDs from this 
node are marked failed and the I/O starts again. Should I issue a ceph 
osd out command beforehand?


But I already had this set for each rule in my crushmap:
min_size 1
max_size 2
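
Note that the min_size/max_size inside a crush rule only bound which replica
counts the rule is used for; they are unrelated to the per-pool min_size above,
which gates I/O while replicas are down. A rule of that shape typically looks
like this:

  rule data {
      ruleset 0
      type replicated
      min_size 1
      max_size 2
      step take default
      step chooseleaf firstn 0 type host
      step emit
  }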

Stefan


Re: [pve-devel] less cores more iops / speed

2012-11-12 Thread Stefan Priebe - Profihost AG
Adding this to ceph.conf on the KVM host adds another 2,000 iops (20,000 
iop/s with one VM). I'm sure most of these settings are useless on a client 
KVM / rbd host, but I don't know which ones make sense ;-)


[global]
debug ms = 0/0
debug rbd = 0/0
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

[client]
debug ms = 0/0
debug rbd = 0/0
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
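
(Side note: when restarting daemons is inconvenient, the daemon-side values can
also be injected into a running OSD on the fly, e.g. something along these lines:

  ceph osd tell 0 injectargs '--debug-osd 0/0 --debug-ms 0/0 --debug-filestore 0/0'

The [client] side only takes effect when qemu (re)opens the image.)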

Stefan

On 12.11.2012 15:35, Alexandre DERUMIER wrote:

Another idea:

have you tried putting

  debug lockdep = 0/0
  debug context = 0/0
  debug crush = 0/0
  debug buffer = 0/0
  debug timer = 0/0
  debug journaler = 0/0
  debug osd = 0/0
  debug optracker = 0/0
  debug objclass = 0/0
  debug filestore = 0/0
  debug journal = 0/0
  debug ms = 0/0
  debug monc = 0/0
  debug tp = 0/0
  debug auth = 0/0
  debug finisher = 0/0
  debug heartbeatmap = 0/0
  debug perfcounter = 0/0
  debug asok = 0/0
  debug throttle = 0/0


in a ceph.conf on your kvm host?


- Original message -

From: Alexandre DERUMIER aderum...@odiso.com
To: Stefan Priebe - Profihost AG s.pri...@profihost.ag
Cc: pve-de...@pve.proxmox.com
Sent: Monday, 12 November 2012 15:26:36
Subject: Re: [pve-devel] less cores more iops / speed

Maybe some tracing on the kvm process could give us clues to find where the 
CPU is used?

Also, another idea: can you try with auth supported=none? Maybe there is some 
overhead from the Ceph authentication?




- Original message -

From: Alexandre DERUMIER aderum...@odiso.com
To: Stefan Priebe - Profihost AG s.pri...@profihost.ag
Cc: pve-de...@pve.proxmox.com
Sent: Monday, 12 November 2012 15:20:07
Subject: Re: [pve-devel] less cores more iops / speed

OK, thanks.

It seems to use a lot of CPU compared to NFS, iSCSI...

I hope the Ceph devs will work on this soon!


- Original message -

From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com
Cc: eric e...@netwalk.com, pve-de...@pve.proxmox.com
Sent: Monday, 12 November 2012 15:05:08
Subject: Re: [pve-devel] less cores more iops / speed

On 12.11.2012 13:49, Alexandre DERUMIER wrote:

One VM on one Host: 18.000 IOP/s
Two VM on one Host: 2x11.000 IOP/s
Three VM on one Host: 3x7.000 IOP/s


And the host CPU is at 100%?


No. For three VMs, yes; for one and two, no. I think the librbd / rbd
implementation in kvm is the bottleneck here.

Stefan


- Original message -

From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com
Cc: eric e...@netwalk.com, pve-de...@pve.proxmox.com
Sent: Monday, 12 November 2012 12:58:35
Subject: Re: [pve-devel] less cores more iops / speed

On 12.11.2012 08:51, Alexandre DERUMIER wrote:

Right now RBD in KVM is limited by CPU speed.


Good to know; so it seems to be a lack of threading, or maybe some locks (so a 
faster CPU gives more IOPS).

If you launch parallel fio runs on the same host in different guests, do you 
get more total IOPS? (For me it scales.)


One VM on one Host: 18.000 IOP/s
Two VM on one Host: 2x11.000 IOP/s
Three VM on one Host: 3x7.000 IOP/s


If you launch 2 parallel fio runs in the same guest (on different disks), do you 
get more IOPS? (For me it doesn't scale, so raid0 in the guest doesn't help.)

No it doesn't scale.

Stefan


- Original message -

From: Stefan Priebe s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com
Cc: eric e...@netwalk.com, pve-de...@pve.proxmox.com
Sent: Sunday, 11 November 2012 13:07:36
Subject: Re: [pve-devel] less cores more iops / speed

On 11.11.2012 12:12, Alexandre DERUMIER wrote:

If I remember correctly, Stefan can achieve 100,000 IOPS with iSCSI on the same 
KVM host.


Correct, but this was always with scsi-generic and I/O multipathing on the
host. rbd does not support scsi-generic ;-(


I have checked the Ceph mailing list; Stefan seems to have resolved his 
dual-core problem with a BIOS update!

Correct. Speed on the dual Xeon is now 14,000 IOP/s, and 18,000 IOP/s on the
single Xeon. But the difference comes down to CPU speed: 3.6 GHz
single Xeon vs. 

improve speed with auth supported=none

2012-11-12 Thread Stefan Priebe - Profihost AG

Hello list,

I'm still trying to improve Ceph speed. Disabling logging on the host and rbd 
client gives me an additional 5,000 iop/s, which is great.


But I also wanted to try disabling authentication using:
auth supported=none

How does this work? Do I just have to place this line under the global 
section in ceph.conf?


Greets,
Stefan


changed rbd cp behavior in 0.53

2012-11-12 Thread Andrey Korolyov
Hi,

For this version, rbd cp assumes that the destination pool is the same as the
source, not 'rbd', if the pool in the destination path is omitted.

rbd cp install/img testimg
rbd ls install
img testimg


Is this change permanent?

Thanks!


Re: [BUG] ceph-mon crashes

2012-11-12 Thread Joao Eduardo Luis
On 11/12/2012 03:10 PM, Stefan Priebe - Profihost AG wrote:
 Am 12.11.2012 15:58, schrieb Joao Eduardo Luis:
 Hi Stefan,


 Any chance you can get me a larger chunk of the log from the monitor
 that was the leader by the time you issued those commands until the
 point the monitor crashed (from the excerpt you provided, that should be
 mon.b)?
 
 Sure:
 https://www.dropbox.com/s/8e604bihk56m0yd/ceph-mon.b.log.1.gz
 
 Greets,
 Stefan

Hi Stefan,


Thanks for the log.

Can you please confirm for me that, sometime between you issuing the out
command and mon.b failing, you had yet another monitor (maybe mon.a)
that was the leader but for some reason was down by the time that
mon.b failed?

If so, could you provide the log for that monitor as well, given that this
log doesn't have some info I'm looking for?


  -Joao


pull request: ceph-qa-suite branch wip-java

2012-11-12 Thread Joe Buck
This patch adds a yaml file to add the libcephfs-java tests to the 
nightly qa test set.


Best,
-Joe Buck


Re: improve speed with auth supported=none

2012-11-12 Thread Sébastien Han
I guess you can refer to that link on the list:
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/9776
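
In short, for this era it is a single line under [global]; it generally needs to
be consistent on the mons/OSDs as well as in the client's ceph.conf. A minimal
sketch (newer releases split it into three 'auth ... required' options):

  [global]
      auth supported = none
      ; newer versions use instead:
      ; auth cluster required = none
      ; auth service required = none
      ; auth client required = none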

btw, do you get the 5,000 iop/s with the kernel rbd client or on a VM disk?

cheers.


On Mon, Nov 12, 2012 at 4:37 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Hello list,

 i'm still trying to improve ceph speed. Disable logging on host and rbd
 client gives me additional 5000 iop/s which is great.

 But i also wanted to try disabling authentication using:
 auth supported=none

 How does this work? Do i just have to place this line under global section
 in ceph.conf?

 Greets,
 Stefan


Re: [BUG] ceph-mon crashes

2012-11-12 Thread Stefan Priebe

Hi Joao,

On 12.11.2012 18:05, Joao Eduardo Luis wrote:

Can you please confirm me that sometime between you issuing the out
command and mon.b failing, you had yet another monitor (maybe mon.a)
that was the leader but for some reason it was down by the time that
mon.b failed?

If so, could you provide the log for that monitor as well, given this
log doesn't have some infos I'm looking for?

Not sure but here are the logs of the other two mons:
https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz

https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz

Thanks,
Stefan


Re: [pve-devel] less cores more iops / speed

2012-11-12 Thread Josh Durgin

On 11/12/2012 07:33 AM, Stefan Priebe - Profihost AG wrote:

Adding this to ceph.conf on kvm host adds another 2000 iops (20.000
iop/s with one VM). I'm sure most of them are useless on a client kvm /
rbd host but i don't know which one makes sense ;-)

[global]
 debug ms = 0/0
 debug rbd = 0/0
 debug lockdep = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug buffer = 0/0
 debug timer = 0/0
 debug journaler = 0/0
 debug osd = 0/0
 debug optracker = 0/0
 debug objclass = 0/0
 debug filestore = 0/0
 debug journal = 0/0
 debug ms = 0/0
 debug monc = 0/0
 debug tp = 0/0
 debug auth = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug perfcounter = 0/0
 debug asok = 0/0
 debug throttle = 0/0

[client]


For the client side you'd need these settings to disable all debug logging:

[client]
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug filer = 0/0
debug objecter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug objectcacher = 0/0
debug client = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

Josh


 debug ms = 0/0
 debug rbd = 0/0
 debug lockdep = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug buffer = 0/0
 debug timer = 0/0
 debug journaler = 0/0
 debug osd = 0/0
 debug optracker = 0/0
 debug objclass = 0/0
 debug filestore = 0/0
 debug journal = 0/0
 debug ms = 0/0
 debug monc = 0/0
 debug tp = 0/0
 debug auth = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug perfcounter = 0/0
 debug asok = 0/0
 debug throttle = 0/0

Stefan

Am 12.11.2012 15:35, schrieb Alexandre DERUMIER:

Another idea,

do you have tried to put

  debug lockdep = 0/0
  debug context = 0/0
  debug crush = 0/0
  debug buffer = 0/0
  debug timer = 0/0
  debug journaler = 0/0
  debug osd = 0/0
  debug optracker = 0/0
  debug objclass = 0/0
  debug filestore = 0/0
  debug journal = 0/0
  debug ms = 0/0
  debug monc = 0/0
  debug tp = 0/0
  debug auth = 0/0
  debug finisher = 0/0
  debug heartbeatmap = 0/0
  debug perfcounter = 0/0
  debug asok = 0/0
  debug throttle = 0/0


in a ceph.conf on your kvm host ?


- Mail original -

De: Alexandre DERUMIER aderum...@odiso.com
À: Stefan Priebe - Profihost AG s.pri...@profihost.ag
Cc: pve-de...@pve.proxmox.com
Envoyé: Lundi 12 Novembre 2012 15:26:36
Objet: Re: [pve-devel] less cores more iops / speed

Maybe some tracing on kvm process could give us clues to find where
the cpu is used ?

Also another idea, can you try with auth supported=none ? maybe they
are some overhead with ceph authenfication ?




- Mail original -

De: Alexandre DERUMIER aderum...@odiso.com
À: Stefan Priebe - Profihost AG s.pri...@profihost.ag
Cc: pve-de...@pve.proxmox.com
Envoyé: Lundi 12 Novembre 2012 15:20:07
Objet: Re: [pve-devel] less cores more iops / speed

Ok thanks.

Seem to use a lot of cpu vs nfs,iscsi ...

I hope that ceph dev will work on this soon !


- Mail original -

De: Stefan Priebe - Profihost AG s.pri...@profihost.ag
À: Alexandre DERUMIER aderum...@odiso.com
Cc: eric e...@netwalk.com, pve-de...@pve.proxmox.com
Envoyé: Lundi 12 Novembre 2012 15:05:08
Objet: Re: [pve-devel] less cores more iops / speed

Am 12.11.2012 13:49, schrieb Alexandre DERUMIER:

One VM on one Host: 18.000 IOP/s
Two VM on one Host: 2x11.000 IOP/s
Three VM on one Host: 3x7.000 IOP/s


And host cpu is 100% ?


No. For three VMs yes. For one and two no. I think librbd / rbd
implementation in kvm is the bottleneck here.

Stefan


- Mail original -

De: Stefan Priebe - Profihost AG s.pri...@profihost.ag
À: Alexandre DERUMIER aderum...@odiso.com
Cc: eric e...@netwalk.com, pve-de...@pve.proxmox.com
Envoyé: Lundi 12 Novembre 2012 12:58:35
Objet: Re: [pve-devel] less cores more iops / speed

Am 12.11.2012 08:51, schrieb Alexandre DERUMIER:

Right now RBD in KVM is limited by CPU speed.


Good to known, so it's seem lack of threading, or maybe somes locks.
(so faster cpu give more iops).

If you lauch parallel fio on same host on different guest, do you
get more total iops ? (for me it's scale)


One VM on one Host: 18.000 IOP/s
Two VM on one Host: 2x11.000 IOP/s
Three VM on one Host: 3x7.000 IOP/s


if you launch 2 parallel fio, on same guest (on differents disk), do
you get more iops ? (for me, it doesn't scale, so raid0 in guest
doesn't help).

No it doesn't scale.

Stefan


- Mail original -

De: Stefan Priebe s.pri...@profihost.ag
À: Alexandre DERUMIER aderum...@odiso.com
Cc: eric e...@netwalk.com, pve-de...@pve.proxmox.com
Envoyé: Dimanche 11 Novembre 2012 13:07:36
Objet: Re: [pve-devel] 

Re: [pve-devel] less cores more iops / speed

2012-11-12 Thread Stefan Priebe

Hi Josh,


For the client side you'd these settings to disable all debug logging:

...

Thanks!

Stefan


Re: [BUG] ceph-mon crashes

2012-11-12 Thread Joao Eduardo Luis
On 11/12/2012 06:30 PM, Stefan Priebe wrote:
 Hi Joao,
 
 Am 12.11.2012 18:05, schrieb Joao Eduardo Luis:
 Can you please confirm me that sometime between you issuing the out
 command and mon.b failing, you had yet another monitor (maybe mon.a)
 that was the leader but for some reason it was down by the time that
 mon.b failed?

 If so, could you provide the log for that monitor as well, given this
 log doesn't have some infos I'm looking for?
 Not sure but here are the logs of the other two mons:
 https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz
 
 https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz
 
 Thanks,
 Stefan

Thanks Stefan,

I'll be looking into this.

For future reference, I created issue #3477 on the tracker:
http://tracker.newdream.net/issues/3477

  -Joao


Re: Build regressions/improvements in v3.7-rc5

2012-11-12 Thread Geert Uytterhoeven
On Mon, Nov 12, 2012 at 9:58 PM, Geert Uytterhoeven
ge...@linux-m68k.org wrote:
 JFYI, when comparing v3.7-rc5 to v3.7-rc4[3], the summaries are:
   - build errors: +14/-4

14 regressions:
  + drivers/virt/fsl_hypervisor.c: error: 'MSR_GS' undeclared (first
use in this function):  = 799:93
  + error: No rule to make target drivers/scsi/aic7xxx/aicasm/*.[chyl]:  = N/A
  + net/ceph/ceph_common.c: error: dereferencing pointer to incomplete
type:  = 272:13
  + net/ceph/ceph_common.c: error: implicit declaration of function
'request_key' [-Werror=implicit-function-declaration]:  = 249:2
  + net/ceph/crypto.c: error: dereferencing pointer to incomplete
type:  = 463:19, 434:46, 452:5, 448:52, 429:23, 467:36, 447:18
  + net/ceph/crypto.c: error: implicit declaration of function
'key_payload_reserve' [-Werror=implicit-function-declaration]:  =
437:2
  + net/ceph/crypto.c: error: implicit declaration of function
'register_key_type' [-Werror=implicit-function-declaration]:  = 481:2
  + net/ceph/crypto.c: error: implicit declaration of function
'unregister_key_type' [-Werror=implicit-function-declaration]:  =
485:2
  + net/ceph/crypto.c: error: unknown field 'destroy' specified in
initializer:  = 477:2
  + net/ceph/crypto.c: error: unknown field 'instantiate' specified in
initializer:  = 475:2
  + net/ceph/crypto.c: error: unknown field 'match' specified in
initializer:  = 476:2
  + net/ceph/crypto.c: error: unknown field 'name' specified in
initializer:  = 474:2
  + net/ceph/crypto.c: error: variable 'key_type_ceph' has initializer
but incomplete type:  = 473:8

powerpc-randconfig

  + error: relocation truncated to fit: R_PPC64_REL24 against symbol
`._mcount' defined in .text section in arch/powerpc/kernel/entry_64.o:
(.text+0x1ff9eb8) = (.text+0x1ffa274), (.text+0x1ff7840)

powerpc-allyesconfig

 [1] http://kisskb.ellerman.id.au/kisskb/head/5614/ (all 117 configs)
 [3] http://kisskb.ellerman.id.au/kisskb/head/5600/ (all 117 configs)

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say programmer or something like that.
-- Linus Torvalds


Re: [BUG] ceph-mon crashes

2012-11-12 Thread Stefan Priebe

Thanks, I'm subscribed to the tracker now.

Stefan

On 12.11.2012 21:40, Joao Eduardo Luis wrote:

On 11/12/2012 06:30 PM, Stefan Priebe wrote:

Hi Joao,

Am 12.11.2012 18:05, schrieb Joao Eduardo Luis:

Can you please confirm me that sometime between you issuing the out
command and mon.b failing, you had yet another monitor (maybe mon.a)
that was the leader but for some reason it was down by the time that
mon.b failed?

If so, could you provide the log for that monitor as well, given this
log doesn't have some infos I'm looking for?

Not sure but here are the logs of the other two mons:
https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz

https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz

Thanks,
Stefan


Thanks Stefan,

I'll be looking into this.

For future reference, I created issue #3477 on the tracker:
http://tracker.newdream.net/issues/3477

   -Joao


Re: improve speed with auth supported=none

2012-11-12 Thread Stefan Priebe

Thanks,

this gives another boost in IOPS. I'm now at 23,000 iops ;-) So for 
random 4k IOPS, Ceph auth and especially the logging add a lot of overhead.


Greets,
Stefan
On 12.11.2012 19:26, Sébastien Han wrote:

I guess you can refer to that link on the list:
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/9776

btw do you get 5000 iop/s on the rbd kernel or on a vm disk?

cheers.


On Mon, Nov 12, 2012 at 4:37 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:

Hello list,

i'm still trying to improve ceph speed. Disable logging on host and rbd
client gives me additional 5000 iop/s which is great.

But i also wanted to try disabling authentication using:
auth supported=none

How does this work? Do i just have to place this line under global section
in ceph.conf?

Greets,
Stefan


Re: rbd map command hangs for 15 minutes during system start up

2012-11-12 Thread Nick Bartos
After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
seems we no longer have this hang.

On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On 11/08/2012 02:10 PM, Mandell Degerness wrote:

 We are seeing a somewhat random, but frequent hang on our systems
 during startup.  The hang happens at the point where an rbd map
 rbdvol command is run.

 I've attached the ceph logs from the cluster.  The map command happens
 at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
 be seen in the log as 172.18.0.15:0/1143980479.

 It appears as if the TCP socket is opened to the OSD, but then times
 out 15 minutes later, the process gets data when the socket is closed
 on the client server and it retries.

 Please help.

 We are using ceph version 0.48.2argonaut
 (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).

 We are using a 3.5.7 kernel with the following list of patches applied:

 1-libceph-encapsulate-out-message-data-setup.patch
 2-libceph-dont-mark-footer-complete-before-it-is.patch
 3-libceph-move-init-of-bio_iter.patch
 4-libceph-dont-use-bio_iter-as-a-flag.patch
 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
 8-libceph-protect-ceph_con_open-with-mutex.patch
 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
 10-rbd-only-reset-capacity-when-pointing-to-head.patch
 11-rbd-set-image-size-when-header-is-updated.patch
 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
 17-libceph-check-for-invalid-mapping.patch
 18-ceph-propagate-layout-error-on-osd-request-creation.patch
 19-rbd-BUG-on-invalid-layout.patch
 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
 21-ceph-avoid-32-bit-page-index-overflow.patch
 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch

 Any suggestions?


 The log shows your monitors don't have their time synchronized closely enough
 among them to make much progress (including authenticating new connections).
 That's probably the real issue. 0.2s is pretty large clock drift.


 One thought is that the following patch (which we could not apply) is
 what is required:

 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch


 This is certainly useful too, but I don't think it's the cause of
 the delay in this case.

 Josh


Re: changed rbd cp behavior in 0.53

2012-11-12 Thread Josh Durgin

On 11/12/2012 08:30 AM, Andrey Korolyov wrote:

Hi,

For this version, rbd cp assumes that destination pool is the same as
source, not 'rbd', if pool in the destination path is omitted.

rbd cp install/img testimg
rbd ls install
img testimg


Is this change permanent?

Thanks!


This is a regression. The previous behavior will be restored for 0.54.
I added http://tracker.newdream.net/issues/3478 to track it.
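
In the meantime, spelling out the destination pool sidesteps the ambiguity on
either version, e.g.:

  rbd cp install/img rbd/testimg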

Josh


Re: rbd map command hangs for 15 minutes during system start up

2012-11-12 Thread Sage Weil
On Mon, 12 Nov 2012, Nick Bartos wrote:
 After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
 seems we no longer have this hang.

Hmm, that's a bit disconcerting.  Did this series come from our old 3.5 
stable series?  I recently prepared a new one that backports *all* of the 
fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would 
be curious if you see problems with that.
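
(That is, something along these lines, assuming the usual location of
ceph-client.git:

  git clone git://github.com/ceph/ceph-client.git
  cd ceph-client
  git checkout -b wip-3.5 origin/wip-3.5

and then build and boot that kernel as usual.)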

So far, with these fixes in place, we have not seen any unexplained kernel 
crashes in this code.

I take it you're going back to a 3.5 kernel because you weren't able to 
get rid of the sync problem with 3.6?

sage



 
 On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin josh.dur...@inktank.com wrote:
  On 11/08/2012 02:10 PM, Mandell Degerness wrote:
 
  We are seeing a somewhat random, but frequent hang on our systems
  during startup.  The hang happens at the point where an rbd map
  rbdvol command is run.
 
  I've attached the ceph logs from the cluster.  The map command happens
  at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
  be seen in the log as 172.18.0.15:0/1143980479.
 
  It appears as if the TCP socket is opened to the OSD, but then times
  out 15 minutes later, the process gets data when the socket is closed
  on the client server and it retries.
 
  Please help.
 
  We are using ceph version 0.48.2argonaut
  (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
 
  We are using a 3.5.7 kernel with the following list of patches applied:
 
  1-libceph-encapsulate-out-message-data-setup.patch
  2-libceph-dont-mark-footer-complete-before-it-is.patch
  3-libceph-move-init-of-bio_iter.patch
  4-libceph-dont-use-bio_iter-as-a-flag.patch
  5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
  6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
  7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
  8-libceph-protect-ceph_con_open-with-mutex.patch
  9-libceph-reset-connection-retry-on-successfully-negotiation.patch
  10-rbd-only-reset-capacity-when-pointing-to-head.patch
  11-rbd-set-image-size-when-header-is-updated.patch
  12-libceph-fix-crypto-key-null-deref-memory-leak.patch
  13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
  14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
  15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
  16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
  17-libceph-check-for-invalid-mapping.patch
  18-ceph-propagate-layout-error-on-osd-request-creation.patch
  19-rbd-BUG-on-invalid-layout.patch
  20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
  21-ceph-avoid-32-bit-page-index-overflow.patch
  23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
 
  Any suggestions?
 
 
  The log shows your monitors don't have time sychronized enough among
  them to make much progress (including authenticating new connections).
  That's probably the real issue. 0.2s is pretty large clock drift.
 
 
  One thought is that the following patch (which we could not apply) is
  what is required:
 
  22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
 
 
  This is certainly useful too, but I don't think it's the cause of
  the delay in this case.
 
  Josh


ceph osd crush set command under 0.53

2012-11-12 Thread Mandell Degerness
Did the syntax and behavior of the ceph osd crush set ... command
change between 0.48 and 0.53?

When trying out ceph 0.53, I get the following in my log when trying
to add the first OSD to a new cluster (similar behavior for osds 2 and
3).  It appears that the ceph osd crush command fails, but still marks
the OSDs as up and in:

Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 [2012-11-12 23:19:05.759]
908/MainThread savage/INFO: execute(['ceph', 'osd', 'crush', 'set',
'0', 'osd.0', '1.0', 'host=172.20.0.13', 'rack=0', 'pool=default'])
Nov 12 23:19:05 node-172-20-0-14/172.20.0.14 ceph-mon: 2012-11-12
23:19:05.804080 7ffd761fe700  0 mon.1@1(peon) e1 handle_command
mon_command(osd crush set 0 osd.0 1.0 host=172.20.0.13 rack=0
pool=default v 0) v1
Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
23:19:05.772215 7fad40911700  0 mon.0@0(leader) e1 handle_command
mon_command(osd crush set 0 osd.0 1.0 host=172.20.0.13 rack=0
pool=default v 0) v1
Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
23:19:05.772248 7fad40911700  0 mon.0@0(leader).osd e2 adding/updating
crush item id 0 name 'osd.0' weight 1 at location
{host=172.20.0.13,pool=default,rack=0}
Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
23:19:05.772323 7fad40911700  1 error: didn't find anywhere to add
item 0 in {host=172.20.0.13,pool=default,rack=0}
Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 [2012-11-12 23:19:05.783]
908/MainThread savage/CRITICAL: Logging uncaught exception Traceback
(most recent call last):   File /usr/bin/sv-fred.py, line 9, in
module 
load_entry_point('savage==.2101.118c3ebc8c0843f87e82eb047de043c8a70086bd',
'console_scripts', 'sv-fred.py')()   File
/usr/lib64/python2.6/site-packages/savage/services/fred.py, line
811, in main   File
/usr/lib64/python2.6/site-packages/savage/services/fred.py, line
798, in run   File
/usr/lib64/python2.6/site-packages/savage/utils/nfa.py, line 291, in
step   File /usr/lib64/python2.6/site-packages/savage/utils/nfa.py,
line 252, in step   File
/usr/lib64/python2.6/site-packages/savage/utils/nfa.py, line 231, in
_newstate   File
/usr/lib64/python2.6/site-packages/savage/utils/nfa.py, line 219, in
_newstate   File
/usr/lib64/python2.6/site-packages/savage/services/fred.py, line
563, in action_firstboot_full   File
/usr/lib64/python2.6/site-packages/savage/services/fred.py, line
768, in handle_message   File
/usr/lib64/python2.6/site-packages/savage/services/fred.py, line
750, in start_phase   File
/usr/lib64/python2.6/site-packages/savage/services/fred.py, line
164, in start   File
/usr/lib64/python2.6/site-packages/savage/utils/__init__.py, line
275, in _wrap   File
/usr/lib64/python2.6/site-packages/savage/command/commands/ceph.py,
line 50, in crush_myself   File
/usr/lib64/python2.6/site-packages/savage/utils/__init__.py, line
244, in execute   File
/usr/lib64/python2.6/site-packages/savage/utils/__init__.py, line
130, in collect_subprocess ExecutionError: Command failed: ceph osd
crush set 0 osd.0 1.0 host=172.20.0.13 rack=0 pool=default
return_code: 1 stdout: (22) Invalid argument stderr:
Nov 12 23:19:06 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
23:19:06.491514 7fad40911700  1 mon.0@0(leader).osd e3 e3: 3 osds: 1
up, 1 in
Nov 12 23:19:06 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
23:19:06.494461 7fad40911700  0 log [INF] : osdmap e3: 3 osds: 1 up, 1
in
Nov 12 23:19:06 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
23:19:06.494463 mon.0 172.20.0.13:6789/0 16 : [INF] osdmap e3: 3 osds:
1 up, 1 in


Re: ceph osd crush set command under 0.53

2012-11-12 Thread Sage Weil
On Mon, 12 Nov 2012, Mandell Degerness wrote:
 Did the syntax and behavior of the ceph osd crush set ... command
 change between 0.48 and 0.53?
 
 When trying out ceph 0.53, I get the following in my log when trying
 to add the first OSD to a new cluster (similar behavior for osds 2 and
 3).  It appears that the ceph osd crush command fails, but still marks
 the OSDs as up and in:

The 'pool=default' is changed to 'root=default', as in the root of the 
crush hierarchy.  'pool' was confusing because there are also rados pools, 
which are something else entirely.

(You can also omit the first '0' (i.e., just 'osd.123' and not [..., 
'123', 'osd.123', ...]), but both the old and new syntax are supported.)
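
So the equivalent of the failing command above would be something like:

  ceph osd crush set osd.0 1.0 root=default rack=0 host=172.20.0.13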

sage


 
 Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 [2012-11-12 23:19:05.759]
 908/MainThread savage/INFO: execute(['ceph', 'osd', 'crush', 'set',
 '0', 'osd.0', '1.0', 'host=172.20.0.13', 'rack=0', 'pool=default'])
 Nov 12 23:19:05 node-172-20-0-14/172.20.0.14 ceph-mon: 2012-11-12
 23:19:05.804080 7ffd761fe700  0 mon.1@1(peon) e1 handle_command
 mon_command(osd crush set 0 osd.0 1.0 host=172.20.0.13 rack=0
 pool=default v 0) v1
 Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
 23:19:05.772215 7fad40911700  0 mon.0@0(leader) e1 handle_command
 mon_command(osd crush set 0 osd.0 1.0 host=172.20.0.13 rack=0
 pool=default v 0) v1
 Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
 23:19:05.772248 7fad40911700  0 mon.0@0(leader).osd e2 adding/updating
 crush item id 0 name 'osd.0' weight 1 at location
 {host=172.20.0.13,pool=default,rack=0}
 Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
 23:19:05.772323 7fad40911700  1 error: didn't find anywhere to add
 item 0 in {host=172.20.0.13,pool=default,rack=0}
 Nov 12 23:19:05 node-172-20-0-13/172.20.0.13 [2012-11-12 23:19:05.783]
 908/MainThread savage/CRITICAL: Logging uncaught exception Traceback
 (most recent call last):   File /usr/bin/sv-fred.py, line 9, in
 module 
 load_entry_point('savage==.2101.118c3ebc8c0843f87e82eb047de043c8a70086bd',
 'console_scripts', 'sv-fred.py')()   File
 /usr/lib64/python2.6/site-packages/savage/services/fred.py, line
 811, in main   File
 /usr/lib64/python2.6/site-packages/savage/services/fred.py, line
 798, in run   File
 /usr/lib64/python2.6/site-packages/savage/utils/nfa.py, line 291, in
 step   File /usr/lib64/python2.6/site-packages/savage/utils/nfa.py,
 line 252, in step   File
 /usr/lib64/python2.6/site-packages/savage/utils/nfa.py, line 231, in
 _newstate   File
 /usr/lib64/python2.6/site-packages/savage/utils/nfa.py, line 219, in
 _newstate   File
 /usr/lib64/python2.6/site-packages/savage/services/fred.py, line
 563, in action_firstboot_full   File
 /usr/lib64/python2.6/site-packages/savage/services/fred.py, line
 768, in handle_message   File
 /usr/lib64/python2.6/site-packages/savage/services/fred.py, line
 750, in start_phase   File
 /usr/lib64/python2.6/site-packages/savage/services/fred.py, line
 164, in start   File
 /usr/lib64/python2.6/site-packages/savage/utils/__init__.py, line
 275, in _wrap   File
 /usr/lib64/python2.6/site-packages/savage/command/commands/ceph.py,
 line 50, in crush_myself   File
 /usr/lib64/python2.6/site-packages/savage/utils/__init__.py, line
 244, in execute   File
 /usr/lib64/python2.6/site-packages/savage/utils/__init__.py, line
 130, in collect_subprocess ExecutionError: Command failed: ceph osd
 crush set 0 osd.0 1.0 host=172.20.0.13 rack=0 pool=default
 return_code: 1 stdout: (22) Invalid argument stderr:
 Nov 12 23:19:06 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
 23:19:06.491514 7fad40911700  1 mon.0@0(leader).osd e3 e3: 3 osds: 1
 up, 1 in
 Nov 12 23:19:06 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
 23:19:06.494461 7fad40911700  0 log [INF] : osdmap e3: 3 osds: 1 up, 1
 in
 Nov 12 23:19:06 node-172-20-0-13/172.20.0.13 ceph-mon: 2012-11-12
 23:19:06.494463 mon.0 172.20.0.13:6789/0 16 : [INF] osdmap e3: 3 osds:
 1 up, 1 in


[Help] Use Ceph RBD as primary storage in CloudStack 4.0

2012-11-12 Thread Alex Jiang
Hi, All

Has somebody used Ceph RBD in CloudStack as primary storage? I see
that in the new features of CS 4.0, RBD is supported for KVM. So I
tried using RBD as primary storage but met with some problems.

I use a CentOS 6.3 server as the host. First I removed qemu-kvm (0.12.1)
and libvirt (0.9.10) because their versions are too low (qemu on the
hypervisor has to be compiled with RBD enabled, and the libvirt version on
the hypervisor has to be at least 0.10 with RBD enabled). Then I
downloaded the latest qemu (1.2.0) and libvirt (1.0.0) source code and
compiled and installed them. But when compiling the qemu source code,

#wget http://wiki.qemu-project.org/download/qemu-1.2.0.tar.bz2
#tar jxvf qemu-1.2.0.tar.bz2
# cd qemu-1.2.0
# ./configure --enable-rbd

the following errors occur:
ERROR: User requested feature rados block device
ERROR: configure was not able to find it

But on Ubuntu 12.04 I tried compiling the qemu source code and it succeeded.
Now I am very confused. How can I use Ceph RBD as primary storage in
CloudStack on CentOS 6.3? Can anyone help me?

Best Regards,

 Alex
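
The configure error above usually just means that qemu's configure could not
find the librbd/librados development headers on the build host. A sketch of the
likely fix (package names vary by repository; on CentOS of this era they ship in
ceph-devel, later split into librbd1-devel and librados2-devel):

  yum install ceph-devel
  cd qemu-1.2.0
  ./configure --enable-rbd
  make && make install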


Re: improve speed with auth supported=none

2012-11-12 Thread Josh Durgin

On 11/12/2012 01:57 PM, Stefan Priebe wrote:

Thanks,

this gives another burst for iops. I'm now at 23.000 iops ;-) So for
random 4k iops ceph auth and especially the logging is a lot of overhead.


How much difference did disabling auth make vs only disabling logging?

Josh


Greets,
Stefan
Am 12.11.2012 19:26, schrieb Sébastien Han:

I guess you can refer to that link on the list:
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/9776

btw do you get 5000 iop/s on the rbd kernel or on a vm disk?

cheers.


On Mon, Nov 12, 2012 at 4:37 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:

Hello list,

i'm still trying to improve ceph speed. Disable logging on host and rbd
client gives me additional 5000 iop/s which is great.

But i also wanted to try disabling authentication using:
auth supported=none

How does this work? Do i just have to place this line under global
section
in ceph.conf?

Greets,
Stefan




Re: optimize librbd for iops

2012-11-12 Thread Josh Durgin

On 11/12/2012 05:50 AM, Stefan Priebe - Profihost AG wrote:

Hello list,

are there any plans to optimize librbd for iops? Right now i'm able to
get 50.000 iop/s via iscsi and 100.000 iop/s using multipathing with iscsi.

With librbd i'm stuck to around 18.000iops. As this scales with more
hosts but not with more disks in a vm. It must be limited by rbd
implementation in kvm / librbd.


It'd be interesting to see which layers are most limiting in this
case - qemu/kvm, librados, or librbd.

How does rados bench with 4k writes and then 4k reads with many
concurrent IOs do?
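
Something along these lines would do it (pool name is just an example, and the
exact flags vary a bit between versions; see the rados bench usage output):

  rados -p rbd bench 30 write -b 4096 -t 32
  rados -p rbd bench 30 seq -t 32        # reads back objects from a prior write run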

Unfortunately there's no librbd read benchmark yet.

Josh



Re: improve speed with auth supported=none

2012-11-12 Thread Stefan Priebe

On 13.11.2012 08:42, Josh Durgin wrote:

On 11/12/2012 01:57 PM, Stefan Priebe wrote:

Thanks,

this gives another burst for iops. I'm now at 23.000 iops ;-) So for
random 4k iops ceph auth and especially the logging is a lot of overhead.


How much difference did disabling auth make vs only disabling logging?


disabling debug logging: 3,000 iops
disabling auth: 2,000 iops

Is anybody on the Ceph team also interested in a call graph of kvm while the 
VM is doing random 4k write I/O?
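
For reference, a call graph like that can be captured with perf on the host; a
sketch, with the process name just an example:

  perf record -g -p "$(pidof qemu-kvm)" -- sleep 30
  perf report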


Greets
Stefan


Re: optimize librbd for iops

2012-11-12 Thread Stefan Priebe

On 13.11.2012 08:51, Josh Durgin wrote:

On 11/12/2012 05:50 AM, Stefan Priebe - Profihost AG wrote:

Hello list,

are there any plans to optimize librbd for iops? Right now i'm able to
get 50.000 iop/s via iscsi and 100.000 iop/s using multipathing with
iscsi.

With librbd i'm stuck to around 18.000iops. As this scales with more
hosts but not with more disks in a vm. It must be limited by rbd
implementation in kvm / librbd.


It'd be interesting to see which layers are most limiting in this
case - qemu/kvm, librados, or librbd.

How does rados bench with 4k writes and then 4k reads with many
concurrent IOs do?
Right now I'm using qemu-kvm with librbd and fio inside the guest. How does 
rados bench work?


Stefan