[ceph-users] rbd snap ls: how much locking is involved?

2016-01-21 Thread Christian Kauhaus
Hi,

some of our applications (e.g., backy) use 'rbd snap ls' quite often. I see
regular occurrences of blocked requests on a heavily loaded cluster that
correspond to snap_list operations. Log file example:

2016-01-20 11:38:14.389325 osd.13 172.22.4.44:6803/13012 40529 : cluster [WRN]
1 slow requests, 1 included below; oldest blocked for > 15.098679 secs
2016-01-20 11:38:14.389336 osd.13 172.22.4.44:6803/13012 40530 : cluster [WRN]
slow request 15.098679 seconds old, received at 2016-01-20 11:37:59.276665:
osd_op(client.256532559.0:2041
rbd_data.c390a692ae8944a.057b@snapdir [list-snaps] 266.95976dde
ack+read+known_if_redirected e807541) currently no flag points reached

Does anyone know if 'rbd snap ls' takes locks? At which level are these
locks taken (volume, pool, global)? Would it be best to reduce the use of
'rbd snap ls' on a heavily loaded cluster?
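
(For anyone who wants to reproduce the correlation: the blocked ops can be
inspected via the OSD's admin socket -- the socket path below assumes the
default location.)

  ceph --admin-daemon /var/run/ceph/ceph-osd.13.asok dump_ops_in_flight
  ceph --admin-daemon /var/run/ceph/ceph-osd.13.asok dump_historic_ops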

TIA

Christian

-- 
Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd snap ls: how much locking is involved?

2016-01-21 Thread Christian Kauhaus
On 21.01.2016 15:32, Jason Dillaman wrote:
> Are you performing a lot of 'rbd export-diff' or 'rbd diff' operations?  I 
> can't speak to whether or not list-snaps is related to your blocked requests, 
> but I can say that operation is only issued when performing RBD diffs.

Yes, we are also doing 'rbd export-diff' on snapshots. So this could be the
cause, too.
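
For context, such a diff run boils down to something like this (pool, image
and snapshot names are placeholders):

  rbd snap create mypool/myimage@backup-2016-01-21
  rbd export-diff --from-snap backup-2016-01-20 \
      mypool/myimage@backup-2016-01-21 /backup/myimage/2016-01-21.diff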

Regards
  Christian

-- 
Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests after "osd in"

2015-12-11 Thread Christian Kauhaus
On 10.12.2015 06:38, Robert LeBlanc wrote:
> Since I'm very interested in
> reducing this problem, I'm willing to try and submit a fix after I'm
> done with the new OP queue I'm working on. I don't know the best
> course of action at the moment, but I hope I can get some input for
> when I do try and tackle the problem next year.

Is there already a ticket for this issue in the bug tracker? I think this is
an important issue.

Regards

Christian

-- 
Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests after "osd in"

2015-12-10 Thread Christian Kauhaus
On 10.12.2015 06:38, Robert LeBlanc wrote:
> I noticed this a while back and did some tracing. As soon as the PGs
> are read in by the OSD (very limited amount of housekeeping done), the
> OSD is set to the "in" state so that peering with other OSDs can
> happen and the recovery process can begin. The problem is that when
> the OSD is "in", the clients also see that and start sending requests
> to the OSDs before it has had a chance to actually get its bearings
> and is able to even service the requests. After discussion with some
> of the developers, there is no easy way around this other than let the
> PGs recover to other OSDs and then bring in the OSDs after recovery (a
> ton of data movement).

Many thanks for your detailed analysis. It's a bit disappointing that there
seems to be no easy way around it. Any work to improve the situation is much
appreciated.

In the meantime, I'll be experimenting with pre-seeding the VFS cache to speed
things up at least a little bit.
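
(The idea is nothing fancier than walking the OSD's data directory before
marking it "in" -- a rough sketch, with OSD id and mount point as examples:)

  find /srv/ceph/osd/ceph-13 -xdev -print0 | xargs -0 stat > /dev/null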

Regards

Christian

-- 
Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Blocked requests after "osd in"

2015-12-09 Thread Christian Kauhaus
Hi,

I'm getting blocked requests (>30s) every time an OSD is set to "in" in our
clusters. Once this has happened, backfills run smoothly.

I currently have no idea where to start debugging. Does anyone have a hint
about what to examine first in order to narrow down this issue?
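
Would something like the following be a sensible starting point, or is there
a better tool for this (admin socket path assumed to be the default, <id>
being the freshly started OSD)?

  ceph health detail | grep -i 'slow\|blocked'
  ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_ops_in_flight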

TIA

Christian

-- 
Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests after "osd in"

2015-12-09 Thread Christian Kauhaus
313: 1800 pgs: 277 active+remapped+wait_backfill, 881
active+remapped, 4 active+remapped+backfilling, 638 active+clean; 439 GB data,
906 GB used, 7700 GB / 8607 GB avail; 347 kB/s rd, 2551 kB/s wr, 261 op/s;
162079/313904 objects misplaced (51.633%); 218 MB/s, 54 objects/s recovering

I've used Brendan Gregg's opensnoop utility to see what is going on at the
filesystem level (see attached log). AFAICS the OSD reads lots of directories.
The underlying filesystem is XFS, so this should be sufficiently fast. During
the time I see slow requests, the OSD continuously opens omap/*.ldb and
omap/*.log files (starting at timestamp 95927.111837 in the opensnoop log,
which corresponds to 15:06:37 wall clock time).
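
(For reference, opensnoop comes from the perf-tools collection; a capture
like the attached one can be produced roughly as follows, with the PID of the
affected ceph-osd filled in by hand:)

  ./opensnoop -t -d 60 -p <pid of osd.13> > osd-opensnoop.log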

Any idea how to reduce the blockage at least?

> It's unclear to me whether MONs influence this somehow (the peering stage) 
> but I have observed their CPU usage and IO also spikes when OSDs are started, 
> so make sure they are not under load.

I don't think this is an issue here. Our MONs don't use more than 5% CPU
during the operation and don't cause significant amounts of disk I/O.

Regards

Christian

-- 
Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick


osd-opensnoop.log.gz
Description: application/gzip
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] nf_conntrack overflow crashes OSDs

2014-08-08 Thread Christian Kauhaus
Hi,

today I'd like to share a severe problem we've found (and fixed) on our Ceph
cluster. We're running 48 OSDs (8 per host). While restarting all OSDs on a
host, the kernel's nf_conntrack table overflowed. This rendered all OSDs on
that machine unusable.

The symptoms were as follows. In the kernel log, we saw lines like:

| Aug  6 15:23:48 cartman06 kernel: [12713575.554784] nf_conntrack: table
full, dropping packet

This is effectively a DoS against the kernel's IP stack.

In the OSD log files, we saw repeated connection attempts like:

| 2014-08-06 15:22:35.348175 7f92f25a8700 10 -- 172.22.4.42:6802/9560 >>
172.22.4.51:0/2025662 pipe(0x7f9208035440 sd=382 :6802 s=2 pgs=26750 cs=1 l=1
c=0x7f92080021c0).fault on lossy channel, failing
| 2014-08-06 15:22:35.348287 7f8fd69e4700 10 -- 172.22.4.42:6802/9560 >>
172.22.4.39:0/3024957 pipe(0x7f9208007b30 sd=149 :6802 s=2 pgs=245725 cs=1 l=1
c=0x7f9208036630).fault on lossy channel, failing
| 2014-08-06 15:22:35.348293 7f8fe24e4700 20 -- 172.22.4.42:6802/9560 >>
172.22.4.38:0/1013265 pipe(0x7f92080476e0 sd=450 :6802 s=4 pgs=32439 cs=1 l=1
c=0x7f9208018e90).writer finishing
| 2014-08-06 15:22:35.348284 7f8fd4fca700  2 -- 172.22.4.42:6802/9560 >>
172.22.4.5:0/3032136 pipe(0x7f92080686b0 sd=305 :6802 s=2 pgs=306100 cs=1 l=1
c=0x7f920805f340).fault 0: Success
| 2014-08-06 15:22:35.348292 7f8fd108b700 20 -- 172.22.4.42:6802/9560 >>
172.22.4.4:0/1000901 pipe(0x7f920802e7d0 sd=401 :6802 s=4 pgs=73173 cs=1 l=1
c=0x7f920802eda0).writer finishing
| 2014-08-06 15:22:35.344719 7f8fd1d98700  2 -- 172.22.4.42:6802/9560 >>
172.22.4.49:0/3026524 pipe(0x7f9208033a80 sd=492 :6802 s=2 pgs=12845 cs=1 l=1
c=0x7f9208033ce0).reader couldn't read tag, Success

and so on, generating thousands of log lines. The OSDs were spinning at 100%
CPU, trying to re-connect in rapid succession. The repeated connection
attempts kept nf_conntrack from ever getting out of its overflowed state.

Thus, we saw blocked requests for 15 minutes or so, until the MONs banned the
stuck OSDs from the cluster.

As a short-term countermeasure, we stopped all OSDs on the affected hosts and
started them one by one, leaving enough time in between to let the recovery
settle a bit (a 10-second gap between OSDs was enough). During normal
operation, we see only 5000-6000 connections on a host.
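
(The restart loop itself can be as simple as this -- OSD ids and init script
are examples, adjust for your distribution:)

  for id in 16 17 18 19 20 21 22 23; do
      /etc/init.d/ceph start osd.$id
      sleep 10
  done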

As a permanent fix, we have doubled the size of the nf_conntrack table and
reduced some timeouts according to
http://www.pc-freak.net/blog/resolving-nf_conntrack-table-full-dropping-packet-flood-message-in-dmesg-linux-kernel-log/.
Now a restart of all 8 OSDs on a host works without problems.
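
(The exact values are site-specific and the numbers below are only examples,
but the gist of the change was along these lines:)

  sysctl -w net.netfilter.nf_conntrack_max=262144
  sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
  sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30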

Alternatively, we have considered removing nf_conntrack completely. This,
however, is not possible since we use host-based firewalling and nf_conntrack
is wired quite deeply into Linux' firewall code.

I'm just sharing our experience in case someone runs into the same problem.

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] nf_conntrack overflow crashes OSDs

2014-08-08 Thread Christian Kauhaus
On 08.08.2014 14:05, Robert van Leeuwen wrote:
> It is also possible to specifically not conntrack certain connections.
> e.g.
> iptables -t raw -A PREROUTING -p tcp --dport 6789 -j CT --notrack

Thanks Robert. This is really an interesting approach. We will test it.
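
(To cover the OSD ports as well, the same idea presumably extends to
something like the following -- the port range has to match whatever your
OSDs actually bind to, and locally generated traffic needs an OUTPUT rule,
too:)

  iptables -t raw -A PREROUTING -p tcp --dport 6800:7300 -j CT --notrack
  iptables -t raw -A OUTPUT -p tcp --dport 6800:7300 -j CT --notrack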

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD and Backup.

2014-07-03 Thread Christian Kauhaus
On 03.07.2014 07:21, Irek Fasikhov wrote:
> Dear community. How do you make backups of Ceph RBD?

We @ gocept are currently in the process of developing backy, a new-style
backup tool that works directly with block level snapshots / diffs.

The tool is not quite finished, but it is making rapid progress. It would be
great if you'd try it, spot bugs, contribute code etc. Help is appreciated. :-)
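
(The underlying mechanism is the usual snapshot/export-diff dance -- the
names below are made up for illustration; backy adds scheduling, bookkeeping
and restore on top:)

  rbd snap create mypool/vm-image@backy-0
  rbd export-diff mypool/vm-image@backy-0 /srv/backup/vm-image/full-0.diff
  # later, incrementals relative to the previous snapshot:
  rbd snap create mypool/vm-image@backy-1
  rbd export-diff --from-snap backy-0 mypool/vm-image@backy-1 \
      /srv/backup/vm-image/incr-1.diff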

PyPI page: https://pypi.python.org/pypi/backy/

Pull requests go here: https://bitbucket.org/ctheune/backy

Christian Theune c...@gocept.com is the primary contact.

HTH

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve performance of ceph objcect storage cluster

2014-06-27 Thread Christian Kauhaus
On 26.06.2014 20:05, Aronesty, Erik wrote:
> Well, it's the same for rbd, what's your stripe count set to?  For a small
> system, it should be at least the # of nodes in your system.  As systems
> get larger, there's limited returns... I would imagine there would be some
> OSD caching advantage to keeping the number limited (IE: more requests of the
> same device = more likely the device has the next stripe unit prefetched).

I'm trying to make sure I understand this: usually you can't set the stripe
count directly, but you can set the default stripe size of RBD volumes. So, in
consequence, does this mean we should go with a larger RBD object size than
the default (4 MiB)?
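
(To make the question concrete: these are the knobs available at image
creation time. Values are only examples, and as far as I know non-default
striping requires format 2 images.)

  rbd create mypool/testimage --size 102400 --image-format 2 \
      --order 23 --stripe-unit 65536 --stripe-count 8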

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Behaviour of ceph pg repair on different replication levels

2014-06-26 Thread Christian Kauhaus
On 26.06.2014 02:08, Gregory Farnum wrote:
> It's a good idea, and in fact there was a discussion yesterday during
> the Ceph Developer Summit about making scrub repair significantly more
> powerful; they're keeping that use case in mind in addition to very
> fine-grained ones like specifying a particular replica for every
> object.

+1

This would be very cool.

> Yeah, it's got nothing and is relying on the local filesystem to barf
> if that happens. Unfortunately, neither xfs nor ext4 provide that
> checking functionality (which is one of the reasons we continue to
> look to btrfs as our long-term goal).

When thinking at petabyte scale, bit rot is going to happen as a matter of
fact. So I think Ceph should be prepared for it, at least when there are more
than 2 replicas.

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Behaviour of ceph pg repair on different replication levels

2014-06-25 Thread Christian Kauhaus
On 23.06.2014 20:24, Gregory Farnum wrote:
> Well, actually it always takes the primary copy, unless the primary
> has some way of locally telling that its version is corrupt. (This
> might happen if the primary thinks it should have an object, but it
> doesn't exist on disk.) But there's not a voting or anything at this
> time.

Thanks Greg for the clarification. I wonder if some sort of voting during
recovery would be feasible to implement. Having this available would make a 3x
replica scheme immensely more useful.

In my current understanding Ceph has no guards against local bit rot (e.g.,
when a local disk returns incorrect data). Or is there already a voting scheme
in place during deep scrub?
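
(Related commands for poking at a single placement group, in case someone
wants to experiment -- the pg id is just an example:)

  ceph pg deep-scrub 86.37
  ceph pg repair 86.37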

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] trying to interpret lines in osd.log

2014-06-23 Thread Christian Kauhaus
I see several instances of the following log messages in the OSD logs each day:

2014-06-21 02:05:27.740697 7fbc58b78700  0 -- 172.22.8.12:6810/31918 >>
172.22.8.12:6800/28827 pipe(0x7fbe400029f0 sd=764 :6810 s=0 pgs=0 cs=0 l=0
c=0x7fbe40003190).accept connect_seq 30 vs existing 29 state standby

2014-06-21 07:44:29.437810 7fbc452cb700  0 -- 172.22.8.12:6810/31918 >>
172.22.8.16:6802/31292 pipe(0x7fbe40002d90 sd=748 :6810 s=2 pgs=11345 cs=57
l=0 c=0x7fbf68eb2a70).fault with nothing to send, going to standby

What does this mean? Anything to worry about?

TIA

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs

2014-06-12 Thread Christian Kauhaus
On 12.06.2014 14:09, Loic Dachary wrote:
> With the replication factor set to three (which is the default), it can
> tolerate that two OSDs fail at the same time.

I've noticed that a replication factor of 3 is the new default in Firefly.
What rationale led to changing the default? It used to be 2. A replication
factor of 3 incurs significantly more space overhead. Has a replication factor
of 2 been proven to be unsafe?
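
(For reference, the per-pool setting I'm referring to -- the pool name is an
example:)

  ceph osd pool set rbd size 3        # number of replicas
  ceph osd pool set rbd min_size 2    # replicas required to keep serving I/O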

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] error (24) Too many open files

2014-06-12 Thread Christian Kauhaus
Hi,

we have a Ceph cluster with 32 OSDs running on 4 servers (8 OSDs per server,
one for each disk).

From time to time, I see Ceph servers running out of file descriptors. The
OSDs then log lines like:

 2014-06-08 22:15:35.154759 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
write couldn't open
86.37_head/a63e7df7/rbd_data.1933fe2ae8944a.042c/head//86: (24)
Too many open files
 2014-06-08 22:15:35.255955 7f850ac25700 -1 os/FileStore.cc: In function
'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t,
int, ThreadPool::TPHandle*)' thread 7f850ac25700 time
2014-06-08 22:15:35.191181
os/FileStore.cc: 2448: FAILED assert(0 == "unexpected error")

but apparently everything proceeds normally after that.

Is the error considered critical? Should I raise "max open files" in
ceph.conf? Or should I increase the value in /proc/sys/fs/file-max? Does
anyone have a good recommendation?
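
(For context, these are the knobs I'm aware of -- the numbers are only
examples, not a recommendation:)

  # ceph.conf: fd limit the daemons request for themselves at startup
  [global]
      max open files = 131072

  # system-wide limit
  sysctl -w fs.file-max=1000000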

TIA

Christian


Reference:

* we are running Ceph Emperor 0.72.2 on Linux 3.10.7.

* full log follows:

2014-06-08 22:15:34.928660 7f84e6770700  0 <cls> cls/lock/cls_lock.cc:89:
error reading xattr lock.rbd_lock: -24
2014-06-08 22:15:34.934733 7f84e6770700  0 <cls> cls/lock/cls_lock.cc:384:
Could not read lock info: Unknown error -24
2014-06-08 22:15:35.085361 7f84ecf7d700  0 accepter.accepter no incoming
connection?  sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.125393 7f84ecf7d700  0 accepter.accepter no incoming
connection?  sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.125403 7f84ecf7d700  0 accepter.accepter no incoming
connection?  sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.125407 7f84ecf7d700  0 accepter.accepter no incoming
connection?  sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.125410 7f84ecf7d700  0 accepter.accepter no incoming
connection?  sd = -1 errno 24 Too many open files
2014-06-08 22:15:35.154759 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
write couldn't open
86.37_head/a63e7df7/rbd_data.1933fe2ae8944a.042c/head//86: (24)
Too many open files
2014-06-08 22:15:35.159074 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
error (24) Too many open files not handled on operation 10 (488954466.1.0, or
op 0, counting from 0)
2014-06-08 22:15:35.159095 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
unexpected error code
2014-06-08 22:15:35.159098 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
transaction dump:
{ "ops": [
        { "op_num": 0,
          "op_name": "write",
          "collection": "86.37_head",
          "oid": "a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86",
          "length": 4096,
          "offset": 3104768,
          "bufferlist length": 4096},
        { "op_num": 1,
          "op_name": "setattr",
          "collection": "86.37_head",
          "oid": "a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86",
          "name": "_",
          "length": 251},
        { "op_num": 2,
          "op_name": "setattr",
          "collection": "86.37_head",
          "oid": "a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86",
          "name": "snapset",
          "length": 31}]}
2014-06-08 22:15:35.255955 7f850ac25700 -1 os/FileStore.cc: In function
'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t,
int, ThreadPool::TPHandle*)' thread 7f850ac25700 time
2014-06-08 22:15:35.191181
os/FileStore.cc: 2448: FAILED assert(0 == "unexpected error")

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FAILED assert(_size >= 0) during recovery - need to understand what's going on

2014-06-10 Thread Christian Kauhaus
,
(boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Primary>
const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>&)+0x4f) [0x83a3fa]
 5: (boost::statechart::detail::safe_reaction_result
boost::statechart::simple_state<PG::RecoveryState::Peering,
PG::RecoveryState::Primary, PG::RecoveryState::GetInfo,
(boost::statechart::history_mode)0>::transit<PG::RecoveryState::Active>()+0xa4)
[0x83a5c8]
 6: (boost::statechart::simple_state<PG::RecoveryState::Peering,
PG::RecoveryState::Primary, PG::RecoveryState::GetInfo,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x16a) [0x83a8ae]
 7: (boost::statechart::simple_state<PG::RecoveryState::WaitFlushedPeering,
PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x84) [0x837e0a]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_queued_events()+0xf2)
[0x81abe0]
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x1e) [0x81ae24]
 10: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>,
PG::RecoveryCtx*)+0x2fb) [0x7d6303]
 11: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&,
ThreadPool::TPHandle&)+0x320) [0x64cdde]
 12: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&,
ThreadPool::TPHandle&)+0x16) [0x6aa06a]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x569) [0x9cf66f]
 14: (ThreadPool::WorkThread::entry()+0x10) [0x9d10b6]
 15: (()+0x7b77) [0x7f0ee292bb77]
 16: (clone()+0x6d) [0x7f0ee0c6368d]

We finally managed to restart all 3 affected OSDs, but we got corrupted
filesystems inside the VMs as well as scrub errors afterwards.

How can this be? Isn't Ceph designed to handle network failures? Obviously,
running nf_conntrack on Ceph hosts is not a brilliant idea, but it simply was
present here. Still, I don't think that dropping network packets should lead
to corrupted data. Am I right? Any hints on what could be wrong here are
appreciated! I'd rather not run into a similar situation again.

TIA

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] smart replication

2014-02-20 Thread Christian Kauhaus
On 19.02.2014 12:01, Pavel V. Kaygorodov wrote:
> Is it possible to do this with ceph?
> If yes, how to configure this?

I think this can be achieved through multiple CRUSH rulesets. There is an
example in the docs which explains how to differentiate between SSD and
non-SSD storage:

http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds

The principles shown there can probably be adapted to your use case.
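
(Roughly, following the pattern from that page -- the "ssd" root has to exist
in your own CRUSH hierarchy, and the ruleset number and pool name are
examples:)

  rule ssd {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take ssd
          step chooseleaf firstn 0 type host
          step emit
  }

  ceph osd pool set ssd-pool crush_ruleset 1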

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing OSD, double data migration

2014-02-13 Thread Christian Kauhaus
On 12.02.2014 20:27, Michael wrote:
> Have always wondered this, why does data get shuffled twice when you delete an
> OSD? You out an OSD and the data gets moved to other nodes - understandable -
> but then when you remove that OSD from crush it moves data again, aren't outed
> OSD's and OSD's not in crush the same from a data position point of view?
> What data is being moved when a fully outed OSD is then removed from crush?

I second this. When I adhere to the OSD removal how-to[1], I see heavy data
migration taking place twice. This is a nuisance. The last time I had to take
an OSD out of a cluster, I marked it out and removed it from the CRUSH map
at the same time. Don't know if this is the recommended way but it seemed to 
work.
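
(Concretely, something like the following -- the OSD id is made up, and I
make no claim that this is the officially sanctioned order:)

  ceph osd crush remove osd.12   # triggers the one and only rebalance
  ceph osd out 12
  /etc/init.d/ceph stop osd.12
  ceph auth del osd.12
  ceph osd rm 12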

Regards

Christian

[1]
http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filesystem fragmentation on ext4 OSD

2014-02-07 Thread Christian Kauhaus
On 06.02.2014 16:24, Mark Nelson wrote:
> Hi Christian, can you tell me a little bit about how you are using Ceph and
> what kind of IO you are doing?

Just forgot to mention: we're running Ceph 0.72.2 on Linux 3.10 (both storage
servers and inside VMs) and Qemu-KVM 1.5.3.

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filesystem fragmentation on ext4 OSD

2014-02-07 Thread Christian Kauhaus
On 07.02.2014 14:42, Mark Nelson wrote:
> Ok, so the reason I was wondering about the use case is if you were doing RBD
> specifically.  Fragmentation has been something we've periodically kind of
> battled with but still see in some cases.  BTRFS especially can get pretty
> spectacularly fragmented due to COW and overwrites.  There's a thread from a
> couple of weeks ago called "rados io hints" that you may want to look
> at/contribute to.

Thank you for the hint. Sage's proposal on ceph-devel sounds good, so I'll
wait for an implementation.

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] filesystem fragmentation on ext4 OSD

2014-02-06 Thread Christian Kauhaus
Hi,

after running Ceph for a while I see a lot of fragmented files on our OSD
filesystems (all running ext4). For example:

itchy ~ # fsck -f /srv/ceph/osd/ceph-5
fsck von util-linux 2.22.2
e2fsck 1.42 (29-Nov-2011)
[...]
/dev/mapper/vgosd00-ceph--osd00: 461903/418119680 files (33.7%
non-contiguous), 478239460/836229120 blocks

This is an unusually high value for ext4. The normal expectation is something
in the 5% range. I suspect that such a high degree of fragmentation produces
lots of unnecessary seeks on the disks.
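
(For per-file detail, e2fsprogs can report the same thing, e.g.:)

  e4defrag -c /srv/ceph/osd/ceph-5
  filefrag /srv/ceph/osd/ceph-5/current/*/* | sort -t: -k2 -n | tail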

Does anyone have an idea what to do to make Ceph fragment the OSD filesystems less?

TIA

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filesystem fragmentation on ext4 OSD

2014-02-06 Thread Christian Kauhaus
On 06.02.2014 16:24, Mark Nelson wrote:
> Hi Christian, can you tell me a little bit about how you are using Ceph and
> what kind of IO you are doing?

Sure. We're using it almost exclusively for serving VM images that are
accessed from Qemu's built-in RBD client. The VMs themselves perform a very
wide range of I/O types, from servers that write mainly log files to ZEO
database servers with nearly completely random I/O. Many VMs have slowly
increasing storage utilization.

A reason could be that the OSDs issue syncfs() calls and ext4 then allocates
extents covering only what has been written so far. But I'm not sure about the
exact pattern of OSD/filesystem interaction.

HTH

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance

2014-01-09 Thread Christian Kauhaus
On 09.01.2014 10:25, Bradley Kite wrote:
> 3 servers (quad-core CPU, 16GB RAM), each with 4 SATA 7.2K RPM disks (4TB)
> plus a 160GB SSD.
> [...]
> By comparison, a 12-disk RAID5 iSCSI SAN is doing ~4000 read iops and ~2000
> iops write (but with 15K RPM SAS disks).

I think that comparing Ceph on 7.2k rpm SATA disks against iSCSI on 15k rpm
SAS disks is not fair. The random access times of 15k SAS disks are hugely
better compared to 7.2k SATA disks. What would be far more interesting is to
compare Ceph against iSCSI with identical disks.
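
A reasonably fair test would be the identical fio job run once against an
RBD-backed VM disk and once against an iSCSI LUN on comparable spindles, e.g.
(parameters are only an illustration):

  fio --name=randread --filename=/dev/vdb --direct=1 --rw=randread \
      --bs=4k --iodepth=32 --ioengine=libaio --runtime=60 --group_reporting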

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com