Re: osd: new pool flags: noscrub, nodeep-scrub
On Fri, Sep 11, 2015 at 4:24 PM, Mykola Golub wrote:
> On Fri, Sep 11, 2015 at 05:59:56AM -0700, Sage Weil wrote:
>> I wonder if, in addition, we should also allow scrub and deep-scrub
>> intervals to be set on a per-pool basis?
>
> ceph osd pool set [deep-]scrub_interval N ?

BTW it would be absolutely lovely to see copy-aware scrubbing, e.g. parallel (deep-)scrubs on non-intersecting sets of PGs. Currently, as far as I can see, once a scrub starts, max_scrubs is in effect only for the primary OSD, allowing situations where two scrubs, one primary and one non-primary, land on the same OSD.
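For illustration, the per-pool knobs discussed here might look like this on the CLI. This is a sketch of the proposed interface only, following the flag names from the subject and Mykola's suggested syntax; the pool name and values are examples, not a shipped command set:

  ceph osd pool set rbd noscrub 1
  ceph osd pool set rbd nodeep-scrub 1
  ceph osd pool set rbd scrub_interval 86400         # seconds
  ceph osd pool set rbd deep_scrub_interval 604800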
Re: leaking mons on the latest dumpling
On Thu, Apr 16, 2015 at 11:30 AM, Joao Eduardo Luis j...@suse.de wrote:

On 04/15/2015 05:38 PM, Andrey Korolyov wrote:

Hello,

There is a slow leak which is present in all ceph versions, I assume, but it is positively exposed only over large time spans and on large clusters. It looks like the lower a monitor is placed in the quorum hierarchy, the higher the leak is:

{"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05 13:48:54.696784","created":"2015-03-05 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}

ceph heap stats -m 10.0.1.95:6789 | grep Actual
MALLOC: = 427626648 ( 407.8 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.94:6789 | grep Actual
MALLOC: = 289550488 ( 276.1 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.93:6789 | grep Actual
MALLOC: = 230592664 ( 219.9 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.92:6789 | grep Actual
MALLOC: = 253710488 ( 242.0 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.91:6789 | grep Actual
MALLOC: = 97112216 ( 92.6 MiB) Actual memory used (physical + swap)

For almost the same uptime, the data difference is:

rd KB 55365750505
wr KB 82719722467

The leak itself is not very critical, but it of course requires some script work to restart monitors at least once per month on a 300TB cluster to keep monitor processes from reaching 1GB of memory. Given the current status of dumpling, it would probably be possible to identify the leak source and then forward-port the fix to the newer releases, as the freshest version I am running at a large scale is the top of the dumpling branch; otherwise it would require an enormous amount of time to check fix proposals.

There have been numerous reports of a slow leak in the monitors on dumpling and firefly. I'm sure there's a ticket for that but I wasn't able to find it. Many hours were spent chasing down this leak to no avail, despite plugging several leaks throughout the code (especially in firefly; those should have been backported to dumpling at some point or other). This was mostly hard to figure out because it tends to require a long-running cluster to show up, and the bigger the cluster, the larger the probability of triggering it. This behavior has me believing it should be somewhere in the message dispatching workflow and, given it's the leader that suffers the most, should be somewhere in the read-write message dispatching (PaxosService::prepare_update()). But despite code inspections, I don't think we ever found the cause -- or that any fixed leak was ever flagged as the root of the problem.

Anyway, since Giant, most complaints (if not all!) went away. Maybe I missed them, or maybe people suffering from this just stopped complaining. I'm hoping it's the former rather than the latter and, as luck has it, maybe the fix was a fortunate side-effect of some other change.

-Joao

Thanks for the explanation. I accidentally reversed the logical order describing leadership placement above. I'll go through the non-ported commits for firefly and will port the most promising ones when spare time permits, checking whether the leak disappears (it takes about a week to see the difference for my workloads). Could heap dump structures be helpful for developers, to ring a bell for deterministic suggestions?
leaking mons on the latest dumpling
Hello,

There is a slow leak which is present in all ceph versions, I assume, but it is positively exposed only over large time spans and on large clusters. It looks like the lower a monitor is placed in the quorum hierarchy, the higher the leak is:

{"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05 13:48:54.696784","created":"2015-03-05 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}

ceph heap stats -m 10.0.1.95:6789 | grep Actual
MALLOC: = 427626648 ( 407.8 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.94:6789 | grep Actual
MALLOC: = 289550488 ( 276.1 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.93:6789 | grep Actual
MALLOC: = 230592664 ( 219.9 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.92:6789 | grep Actual
MALLOC: = 253710488 ( 242.0 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.91:6789 | grep Actual
MALLOC: = 97112216 ( 92.6 MiB) Actual memory used (physical + swap)

For almost the same uptime, the data difference is:

rd KB 55365750505
wr KB 82719722467

The leak itself is not very critical, but it of course requires some script work to restart monitors at least once per month on a 300TB cluster to keep monitor processes from reaching 1GB of memory. Given the current status of dumpling, it would probably be possible to identify the leak source and then forward-port the fix to the newer releases, as the freshest version I am running at a large scale is the top of the dumpling branch; otherwise it would require an enormous amount of time to check fix proposals.

Thanks!
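The "script work" mentioned above could be as small as the sketch below. This is an illustration only: it assumes the tcmalloc totals-line format shown above, the five monitor addresses from this report, and a sysvinit-style restart command; all three are assumptions, not part of the report:

  #!/bin/sh
  # Illustrative: restart any mon whose actual heap usage exceeds ~1 GB.
  LIMIT=1073741824
  RANK=0
  for ADDR in 10.0.1.91 10.0.1.92 10.0.1.93 10.0.1.94 10.0.1.95; do
      # assumes the "MALLOC: = NNN (...) Actual memory used" line format above
      USED=$(ceph heap stats -m $ADDR:6789 | awk '/Actual memory used/ {print $3}')
      if [ -n "$USED" ] && [ "$USED" -gt "$LIMIT" ]; then
          service ceph restart mon.$RANK    # init invocation differs per distro
      fi
      RANK=$((RANK + 1))
  done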
Re: Preliminary RDMA vs TCP numbers
On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy somnath@sandisk.com wrote:

Hi,

Please find the preliminary performance numbers for the TCP vs RDMA (XIO) implementation (on top of SSDs) at the following link.

http://www.slideshare.net/somnathroy7568/ceph-on-rdma

The attachment didn't go through, it seems, so I had to use slideshare. Mark, if we have time, I can present it in tomorrow's performance meeting.

Thanks & Regards
Somnath

Those numbers are really impressive (for the small-block numbers at least)! What TCP settings are you using? For example, the difference could shrink at scale due to less intensive per-connection acceleration with CUBIC on a larger number of nodes, though I do not believe that was the main reason for the observed TCP catch-up on a relatively flat workload such as fio generates.
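For reference, the client-side TCP knobs being asked about can be inspected like this on Linux. This is a generic illustration, not Somnath's actual configuration:

  sysctl net.ipv4.tcp_congestion_control      # e.g. cubic
  sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem  # per-socket buffer autotuning ranges
  sysctl net.core.rmem_max net.core.wmem_max  # hard caps on socket buffers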
Multiple issues with glibc heap management
Hello,

Since a very long time ago (at least from cuttlefish), many users, including me, have experienced rare but still very disturbing client crashes (#8385, #6480, and a couple of other same-looking traces for different code pieces; I may open corresponding separate bugs if necessary). The main problem is that the issues are very hard to reproduce in a deterministic way, although they *primarily* correlate with disk workload. While fixing everything reported separately is definitely a workable approach, the issue may also be caused by a single similarly-behaving piece of code belonging to one of the shared libraries. As the issue touches more or less all existing deployments on the stable release, it may be worthy (and possible) to fix it other than by cleaning up particular issues one by one.

Thanks!
Re: [Qemu-devel] qemu drive-mirror to rbd storage : no sparse rbd image
On Sat, Oct 11, 2014 at 12:25 PM, Fam Zheng f...@redhat.com wrote:

On Sat, 10/11 10:00, Alexandre DERUMIER wrote:

> What is the source format? If the zero clusters are actually unallocated in the source image, drive-mirror will not write those clusters either. I.e. with drive-mirror sync=top, both source and target should have the same qemu-img map output.

Thanks for your reply. I had tried drive-mirror (sync=full) with:

raw file (sparse) -> rbd (no sparse)
rbd (sparse) -> rbd (no sparse)
raw file (sparse) -> qcow2 on ext4 (sparse)
rbd (sparse) -> raw on ext4 (sparse)

Also I see that I have the same problem with the target file format on xfs:

raw file (sparse) -> qcow2 on xfs (no sparse)
rbd (sparse) -> raw on xfs (no sparse)

These don't tell me much. Maybe it's better to show the actual commands and how you tell sparse from no sparse? Does qcow2 -> qcow2 work for you on xfs?

I only have this problem with drive-mirror; qemu-img convert seems to simply skip zero blocks. Or maybe this is because I'm using sync=full? What is the difference between full and top?

sync: what parts of the disk image should be copied to the destination; possibilities include "full" for all the disk, "top" for only the sectors allocated in the topmost image. (What is the topmost image?)

For sync=top, only the clusters allocated in the image itself are copied; for full, all the clusters allocated in the image itself, its backing image, its backing's backing image, and so on, are copied. The image itself, having a backing image or not, is called the topmost image.

Fam

--

Just a wild guess - Alexandre, did you try the detect-zeroes block option for the mirroring target?
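For reference, detect-zeroes is a generic qemu block option (available in recent qemu); a hedged sketch of enabling it on a drive definition follows. The image name is illustrative, and whether the mirror job honors the option on its target is exactly the open question in the guess above:

  qemu-system-x86_64 ... \
      -drive file=rbd:rbd/target,format=raw,cache=writeback,discard=unmap,detect-zeroes=unmap

With detect-zeroes=unmap (which requires discard=unmap), runs of zeroes are converted into discards instead of allocating writes; detect-zeroes=on would merely turn them into efficient zero-write operations.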
Re: Adding a delay when restarting all OSDs on a host
On Tue, Jul 22, 2014 at 5:19 PM, Wido den Hollander w...@42on.com wrote:

Hi,

Currently on Ubuntu with Upstart, when you invoke a restart like this:

$ sudo restart ceph-osd-all

it will restart all OSDs at once, which can increase the load on the system quite a bit. It's better to restart the OSDs one by one:

$ sudo restart ceph-osd id=X

But you then have to figure out all the IDs by doing a find in /var/lib/ceph/osd, and that's more manual work. I'm thinking of patching the init scripts to allow something like this:

$ sudo restart ceph-osd-all delay=180

It would then wait 180 seconds between each OSD restart, making the process even smoother. I know there are currently sysvinit, upstart and systemd scripts, so it has to be implemented in various places, but how does the general idea sound?

-- Wido den Hollander, Ceph consultant and trainer, 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on

--

Hi, this behaviour obviously has a negative side: increased overall peering time and a larger integral value of out-of-SLA delays. I'd vote for warming up the necessary files, most likely the collections, just before restart. If there is not enough room to hold all of them at once, we can probably combine both methods to achieve a lower impact on restart, although adding a simple delay sounds much more straightforward than pulling the file cache into RAM.
Re: Adding a delay when restarting all OSDs on a host
On Tue, Jul 22, 2014 at 6:28 PM, Wido den Hollander w...@42on.com wrote:

On 07/22/2014 03:48 PM, Andrey Korolyov wrote: [...]

In the case I'm talking about there are 23 OSDs running on a single machine, and restarting all the OSDs causes a lot of peering and reading of PG logs. A warm-up mechanism might work, but that would be a lot of work. When upgrading your cluster you simply want to do this:

$ dsh -g ceph-osd sudo restart ceph-osd-all delay=180

That might take hours to complete, but if it's just an upgrade that doesn't matter. You want as minimal an impact on service as possible.

I may suggest measuring the impact with vmtouch[0]; it decreased OSD startup time greatly in my tests, but I hit the same resource exhaustion as before once the OSD marked itself up (an IOPS ceiling, primarily).

0. http://hoytech.com/vmtouch/
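A minimal sketch of the delay idea done by hand, before any init-script patch lands. It assumes upstart-managed OSDs and the default /var/lib/ceph/osd/ceph-N directory layout mentioned above; the 180-second pause mirrors the proposed delay= parameter:

  #!/bin/sh
  # Restart each OSD on this host one by one, pausing between restarts.
  for DIR in /var/lib/ceph/osd/ceph-*; do
      ID=${DIR##*-}              # extract the numeric OSD id from the dir name
      restart ceph-osd id=$ID
      sleep 180                  # let peering settle before touching the next one
  done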
Re: SMART monitoring
On Fri, Dec 27, 2013 at 9:09 PM, Andrey Korolyov and...@xdel.ru wrote:

On 12/27/2013 08:15 PM, Justin Erenkrantz wrote:

On Thu, Dec 26, 2013 at 9:17 PM, Sage Weil s...@inktank.com wrote:

> I think the question comes down to whether Ceph should take some internal action based on the information, or whether that is better handled by some external monitoring agent. For example, an external agent might collect SMART info into graphite, and every so often do some predictive analysis and mark out disks that are expected to fail soon. I'd love to see some consensus form around what this should look like...

My $.02 from the peanut gallery: at a minimum, set the HEALTH_WARN flag if there is a SMART failure on a physical drive that contains an OSD. Yes, you could build the monitoring into a separate system, but I think it'd be really useful to combine it into the cluster health assessment. -- justin

Hi,

Judging from my personal experience, SMART failures can be dangerous when they are not bad enough to completely tear down an OSD: the OSD will not flap and will not be marked down in time, but cluster performance is greatly affected. I don't think the SMART monitoring task really belongs to Ceph, because separate monitoring of predictive failure counters does that job well, and in the case of sudden errors a SMART query may not work at all, since the system may have issued a lot of bus resets and the disk may be entirely inaccessible. So I propose two strategies: do regular scattered background checks, and monitor OSD responsiveness to work around cases of performance degradation due to read/write errors.

Some necromancy for this thread... Considering a year-long experience with Hitachi 4T disks, there are a lot of failures which cannot be caught by SMART completely: speed degradation and sudden disk death. Although the second case rules itself out by kicking out the stuck OSD, it is not very easy to check which disks are about to die without thorough dmesg monitoring for bus errors and periodic speed calibration. Introducing idle-priority speed measurement for OSDs, without dramatically increasing overall wearout, may be useful enough to implement, coupled with an additional OSD perf metric like SMART's seek_time - though SMART may report a good value for it when performance has already slowed to a crawl, and such a metric would also catch things impacting performance which may not be exposed to the host OS at all, such as correctable bus errors. By the way, although 1T Seagates have a much higher failure rate, they always die with an 'appropriate' set of SMART attributes; Hitachi tends to die without warning :) Hope this will be helpful for someone.
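A rough sketch of the "scattered background check" idea as an external agent, using stock smartmontools. The attribute and threshold are illustrative, and plain SATA devices are assumed (disks behind a RAID controller such as the MegaRAID mentioned elsewhere in this list would need smartctl's -d option):

  #!/bin/sh
  # Illustrative: flag disks whose reallocated-sector raw count is nonzero.
  for DEV in /dev/sd?; do
      REALLOC=$(smartctl -A $DEV | awk '/Reallocated_Sector_Ct/ {print $10}')
      if [ -n "$REALLOC" ] && [ "$REALLOC" -gt 0 ]; then
          echo "$DEV: $REALLOC reallocated sectors"
      fi
  done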
Helper for state replication machine
Hello,

I do not know how many of you are aware of this work by Michael Hines [0], but it looks like it could be extremely useful for critical applications using qemu and, of course, Ceph at the block level. My thought was that if the qemu rbd driver could provide some kind of metadata interface to mark each atomic write, it could easily be used to check and replay machine states on the acceptor side independently. Since Ceph replication is asynchronous, there is no acceptable way to tell when it is time to replay a certain memory state on the acceptor side, even if we push all writes in a synchronous manner. I'd be happy to hear any suggestions on this, because the result would probably be widely adopted by enterprise users whose needs include state replication and who are bound to VMware right now.

Of course, I am assuming the worst case above, when the primary replica shifts during a disaster and there are at least two sites holding the primary and non-primary replica sets, with 100% distinction of the primary role (>= 0.80). Of course there are a lot of points to discuss, like 'fallback' primary affinity and so on, but I'd like to ask first about the possibility of implementing such a mechanism at the driver level.

Thanks!

0. http://wiki.qemu.org/Features/MicroCheckpointing
Re: [librbd] Add interface to get the snapshot size?
On 03/24/2014 05:30 PM, Haomai Wang wrote:

Hi all,

As we know, a snapshot is a lightweight resource in librbd and we don't have any statistics about it. But this causes problems for cloud management: we can't measure the size of a snapshot, and different snapshots occupy different amounts of space, so we have no way to estimate a user's resource usage. Maybe we can keep a counter recording space usage from the moment a volume is created. When a snapshot is created, the counter is frozen and stored as the size of the snapshot, and a new counter starting at zero is assigned to the volume. Any feedback is appreciated!

I believe there is a rough estimate available via 'rados df'. Per-image statistics would be awesome, though precise stats would require either walking the rbd object clones per volume or introducing a new counter mechanism. Dealing with discard in the filestore, it looks even more difficult to calculate the right estimate, as it is with the XFS preallocation feature.
XFS preallocation with lots of small objects
Hello,

Due to the many reports of ENOSPC on xfs-based stores, maybe it is worth introducing an option to, say, ceph-deploy that would pass an allocsize= param to the mount, effectively disabling dynamic preallocation? Of course, not every case is really worth it, because of the related performance impact. If there is a method to calculate the 'real' allocation on such volumes, it could be put into the docs as a measurement suggestion too.
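For illustration, the mount option in question pins XFS speculative preallocation to a fixed size instead of letting it scale up with the file. The device, mountpoint and 64k value below are only examples:

  mount -o noatime,allocsize=64k /dev/sdb1 /var/lib/ceph/osd/ceph-0

A possible integration point, assuming the existing config machinery rather than new code: the 'osd mount options xfs' setting in ceph.conf already carries arbitrary mount options for XFS-backed OSDs, so a deployment tool would only need to populate it.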
Re: xfs Warnings in syslog
Just my two cents: XFS was quite unstable with Ceph, especially along with heavy CPU usage, up to 3.7 (primarily soft lockups). I have run 3.7 on the production system for eight months since upgrading, and it performs just perfectly.

On Tue, Oct 22, 2013 at 1:29 PM, Jeff Liu jeff@oracle.com wrote:

Hello,

It's better to add the XFS mailing list to the CC list. :) I think this issue has been fixed by the upstream commit:

commit ff9a28f6c25d18a635abcab1f49db68108203dfb
From: Jan Kara j...@suse.cz
Date: Thu, 14 Mar 2013 14:30:54 +0100
Subject: [PATCH 1/1] xfs: Fix WARN_ON(delalloc) in xfs_vm_releasepage()

Thanks, -Jeff

On 10/22/2013 07:46 PM, Niklas Goerke wrote:

Hi,

My syslog and dmesg are being filled with the warnings attached. Looking at today's syslog I got up to 1101 of these warnings between 10:50 and 11:13 (and only in that window; otherwise the log was clean). I found them on all four of my OSD hosts, all at about the same time. I'm running kernel 3.2.0-4-amd64 on Debian 7.0. Ceph is at version 0.67.4. I have 15 OSDs per OSD host. Ceph does not really seem to care about this, so I'm not sure what it is all about... Still, they are warnings in syslog, and I hope you guys can tell me what went wrong here and what I can do about it?

Thank you, Niklas

Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388018] [ cut here ]
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388030] WARNING: at /build/linux-s5x2oE/linux-3.2.46/fs/xfs/xfs_aops.c:1091 xfs_vm_releasepage+0x76/0x8e [xfs]()
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388034] Hardware name: X9DR3-F
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388036] Modules linked in: xfs autofs4 nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ext3 jbd loop acpi_cpufreq mperf coretemp crc32c_intel ghash_clmulni_intel snd_pcm aesni_intel snd_page_alloc aes_x86_64 snd_timer aes_generic snd cryptd soundcore pcspkr sb_edac joydev evdev edac_core iTCO_wdt i2c_i801 iTCO_vendor_support i2c_core ioatdma processor thermal_sys container button ext4 crc16 jbd2 mbcache usbhid hid ses enclosure sg sd_mod crc_t10dif megaraid_sas ehci_hcd usbcore isci libsas usb_common libata ixgbe mdio scsi_transport_sas scsi_mod igb dca [last unloaded: scsi_wait_scan]
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388093] Pid: 3459605, comm: ceph-osd Tainted: G W 3.2.0-4-amd64 #1 Debian 3.2.46-1
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388096] Call Trace:
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388102] [81046b75] ? warn_slowpath_common+0x78/0x8c
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388115] [a048b98c] ? xfs_vm_releasepage+0x76/0x8e [xfs]
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388122] [810bedc5] ? invalidate_inode_page+0x5e/0x80
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388129] [810bee5d] ? invalidate_mapping_pages+0x76/0x102
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388135] [810b7b83] ? sys_fadvise64_64+0x19f/0x1e2
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388140] [81353b52] ? system_call_fastpath+0x16/0x1b
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388144] ---[ end trace e9640ed6f82f066d ]---
Large time shift causes OSD to hit suicide timeout and ABRT
Hello,

Not sure if this matches any real-world problem:

step time server 192.168.10.125 offset 30763065.968946 sec

#0 0x7f2d0294d405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7f2d02950b5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7f2d0324b875 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x7f2d03249996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x7f2d032499c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x7f2d03249bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x0090d2fa in ceph::__ceph_assert_fail (assertion=0xa38ab1 "0 == \"hit suicide timeout\"", file=<optimized out>, line=79, func=0xa38c60 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:77
#7 0x0087914b in ceph::HeartbeatMap::_check (this=this@entry=0x35b40e0, h=h@entry=0x36d1050, who=who@entry=0xa38aef "reset_timeout", now=now@entry=1380797379) at common/HeartbeatMap.cc:79
#8 0x0087940e in ceph::HeartbeatMap::reset_timeout (this=0x35b40e0, h=0x36d1050, grace=15, suicide_grace=150) at common/HeartbeatMap.cc:89
#9 0x0070ada7 in OSD::process_peering_events (this=0x375, pgs=..., handle=...) at osd/OSD.cc:6808
#10 0x0074c2e4 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=..., handle=...) at osd/OSD.h:869
#11 0x00903dca in ThreadPool::worker (this=0x3750478, wt=0x4ef6fa80) at common/WorkQueue.cc:119
#12 0x00905070 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:316
#13 0x7f2d046c2e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#14 0x7f2d02a093dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x in ?? ()
Re: Ceph users meetup
If anyone is attending CloudConf Europe, it would be nice to meet in the real world too.

On Wed, Sep 25, 2013 at 2:29 PM, Wido den Hollander w...@42on.com wrote:

On 09/25/2013 10:53 AM, Loic Dachary wrote:

Hi Eric & Patrick,

Yesterday morning Eric suggested that organizing a ceph user meetup would be great and offered his help to make it happen. Although I'd be very happy to attend a France-based meetup, it may make sense to also organize a Europe-wide meetup. For instance, it would be great to have a Ceph room during FOSDEM (http://fosdem.org/, February 2014).

I'm in! NL, BE, DE or FR doesn't matter to me.

Cheers -- Wido den Hollander, 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
Hiding auth key string for the qemu process
Hello,

Since it has been a long time since cephx was enabled by default, and we may assume that everyone is using it, it seems worthwhile to introduce bits of code hiding the key from the cmdline. The first applicable place for such an improvement is most likely OpenStack environments, with their sparse security and their use of the admin key as the default one.
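To illustrate the exposure and one mitigation that already exists: a key passed as key=... in the rbd drive string is visible to every local user via ps, while pointing librbd at on-disk credentials keeps it out of the process arguments. The image name, user and paths below are illustrative:

  # key visible to anyone who can run ps on the hypervisor:
  qemu ... -drive file=rbd:rbd/vm1:id=admin:key=AQB...==

  # versus referencing credentials stored on disk:
  qemu ... -drive file=rbd:rbd/vm1:id=admin:conf=/etc/ceph/ceph.conf
  # where the [client.admin] section of ceph.conf carries: keyring = /etc/ceph/keyring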
Re: Deep-Scrub and High Read Latency with QEMU/RBD
You may want to reduce the number of scrubbing PGs per OSD to 1 using a config option and check the results.

On Fri, Aug 30, 2013 at 8:03 PM, Mike Dawson mike.daw...@cloudapt.com wrote:

We've been struggling with an issue of spikes of high i/o latency with qemu/rbd guests. As we've been chasing this bug, we've greatly improved the methods we use to monitor our infrastructure. It appears that our RBD performance chokes in two situations:

- Deep-Scrub
- Backfill/recovery

In this email, I want to focus on deep-scrub. Graphing '% Util' from 'iostat -x' on my hosts with OSDs, I can see Deep-Scrub take my disks from around 10% utilized to complete saturation during a scrub. RBD writeback cache appears to cover the issue nicely, but occasionally suffers drops in performance (presumably when it flushes). But reads appear to suffer greatly, with multiple seconds of 0B/s of reads accomplished (see log fragment below).

If I make the assumption that deep-scrub isn't intended to create massive spindle contention, this appears to be a problem. What should happen here? Looking at the settings around deep-scrub, I don't see an obvious way to say "don't saturate my drives". Are there any settings in Ceph or otherwise (readahead?) that might lower the burden of deep-scrub? If not, perhaps reads could be remapped to avoid waiting on saturated disks during scrub. Any ideas?

2013-08-30 15:47:20.166149 mon.0 [INF] pgmap v9853931: 20672 pgs: 20665 active+clean, 7 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 5058KB/s wr, 217op/s
2013-08-30 15:47:21.945948 mon.0 [INF] pgmap v9853932: 20672 pgs: 20665 active+clean, 7 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 5553KB/s wr, 229op/s
2013-08-30 15:47:23.205843 mon.0 [INF] pgmap v9853933: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 6580KB/s wr, 246op/s
2013-08-30 15:47:24.843308 mon.0 [INF] pgmap v9853934: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 3795KB/s wr, 224op/s
2013-08-30 15:47:25.862722 mon.0 [INF] pgmap v9853935: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 1414B/s rd, 3799KB/s wr, 181op/s
2013-08-30 15:47:26.887516 mon.0 [INF] pgmap v9853936: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 1541B/s rd, 8138KB/s wr, 160op/s
2013-08-30 15:47:27.933629 mon.0 [INF] pgmap v9853937: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 14458KB/s wr, 304op/s
2013-08-30 15:47:29.127847 mon.0 [INF] pgmap v9853938: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 15300KB/s wr, 345op/s
2013-08-30 15:47:30.344837 mon.0 [INF] pgmap v9853939: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 13128KB/s wr, 218op/s
2013-08-30 15:47:31.380089 mon.0 [INF] pgmap v9853940: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 13299KB/s wr, 241op/s
2013-08-30 15:47:32.388303 mon.0 [INF] pgmap v9853941: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 4951B/s rd, 8147KB/s wr, 192op/s
2013-08-30 15:47:33.858382 mon.0 [INF] pgmap v9853942: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 7029B/s rd, 3254KB/s wr, 190op/s
2013-08-30 15:47:35.279691 mon.0 [INF] pgmap v9853943: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 1651B/s rd, 2476KB/s wr, 207op/s
2013-08-30 15:47:36.309078 mon.0 [INF] pgmap v9853944: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 0B/s rd, 3788KB/s wr, 239op/s
2013-08-30 15:47:38.120343 mon.0 [INF] pgmap v9853945: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 0B/s rd, 4671KB/s wr, 239op/s
2013-08-30 15:47:39.546980 mon.0 [INF] pgmap v9853946: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 0B/s rd, 13487KB/s wr, 444op/s
2013-08-30 15:47:40.561203 mon.0 [INF] pgmap v9853947: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 0B/s rd, 15265KB/s wr, 489op/s
2013-08-30 15:47:41.794355 mon.0 [INF] pgmap v9853948: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111
Re: Deep-Scrub and High Read Latency with QEMU/RBD
On Fri, Aug 30, 2013 at 9:44 PM, Mike Dawson mike.daw...@cloudapt.com wrote:

Andrey,

I use all the defaults:

# ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep scrub
  osd_scrub_thread_timeout: 60,
  osd_scrub_finalize_thread_timeout: 600,
  osd_max_scrubs: 1,

This one. I may suggest increasing max_interval and writing some kind of script doing per-PG scrubs at low intensity, so you'll have at most one scrubbing PG at any time and can wait a while before scrubbing the next; that way they will not all start scrubbing at once when max_interval expires (a sketch of such a script follows below). I discussed some throttling mechanisms for scrubbing some months ago here or on ceph-devel, but there is still no such implementation (it is ultimately a low-priority task since it can be handled by something as simple as the proposal above).

  osd_scrub_load_threshold: 0.5,
  osd_scrub_min_interval: 86400,
  osd_scrub_max_interval: 604800,
  osd_scrub_chunk_min: 5,
  osd_scrub_chunk_max: 25,
  osd_deep_scrub_interval: 604800,
  osd_deep_scrub_stride: 524288,

Which value are you referring to? Does anyone know exactly how osd scrub load threshold works? The manual states "The maximum CPU load. Ceph will not scrub when the CPU load is higher than this number. Default is 50%." So on a system with multiple processors and cores... what happens? Is the threshold .5 load (meaning half a core), or 50% of max load, meaning anything less than 8 if you have 16 cores?

Thanks, Mike Dawson

On 8/30/2013 1:34 PM, Andrey Korolyov wrote: [...]
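A minimal sketch of the low-intensity per-PG scrub driver suggested above. It assumes `ceph pg dump pgs_brief` output where the PG id is the first column of each row; the 300-second pacing and the run-over-everything policy are illustrative:

  #!/bin/sh
  # Illustrative: deep-scrub one PG at a time, pausing between them so that
  # at most ~one PG is scrubbing at any moment.
  ceph pg dump pgs_brief 2>/dev/null |
  awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}' |
  while read PGID; do
      ceph pg deep-scrub $PGID
      sleep 300
  done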
Re: libvirt: Removing RBD volumes with snapshots, auto purge or not?
On Tue, Aug 20, 2013 at 7:36 PM, Wido den Hollander w...@42on.com wrote:

Hi,

The current [0] libvirt storage pool code simply calls rbd_remove and nothing else. As far as I know, rbd_remove will fail if the image still has snapshots; you have to remove those snapshots before you can remove the image. The problem is that libvirt's storage pools do not support listing snapshots, so we can't integrate that.

Libvirt however has a flag you can pass down to say you want the device to be zeroed. The normal procedure is that the device is filled with zeros before actually removing it. I was thinking about abusing this flag to use it as a snap purge for RBD. So a regular volume removal would call only rbd_remove, but when the flag VIR_STORAGE_VOL_DELETE_ZEROED is passed, it would purge all snapshots prior to calling rbd_remove.

Another way would be to always purge snapshots, but I'm afraid that could make somebody very unhappy at some point. Currently virsh doesn't support flags, but that could be fixed in a different patch.

Does my idea sound sane?

[0]: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=e3340f63f412c22d025f615beb7cfed25f00107b;hb=master#l407

-- Wido den Hollander, 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on

Hi Wido,

You mentioned not so long ago the same idea I had about a year and a half ago: placing memory dumps along with the regular snapshot in Ceph using libvirt mechanisms. That sounds pretty nice, since we'd have something other than qcow2 with the same snapshot functionality, but your current proposal does not extend to this. Placing a custom side hook seems much more extensible than hiding snap purge behind a specific flag.
Re: still recovery issues with cuttlefish
Created #5844.

On Thu, Aug 1, 2013 at 10:38 PM, Samuel Just sam.j...@inktank.com wrote:

Is there a bug open for this? I suspect we don't sufficiently throttle the snapshot removal work.

-Sam

On Thu, Aug 1, 2013 at 7:50 AM, Andrey Korolyov and...@xdel.ru wrote: [...]
Re: still recovery issues with cuttlefish
Second this. Also, regarding the long-lasting snapshot problem and related performance issues, I may say that cuttlefish improved things greatly, but creation/deletion of a large snapshot (hundreds of gigabytes of committed data) can still bring the cluster down for minutes, despite the use of every possible optimization.

On Thu, Aug 1, 2013 at 12:22 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:

Hi,

I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs.

What I noticed today is that if I leave the OSD off long enough that ceph starts to backfill, the recovery and re-backfilling go absolutely smoothly, without any issues and no slow request messages at all. Does anybody have an idea why?

Greets, Stefan
Re: Read ahead affect Ceph read performance much
Wow, very glad to hear that. I tried with the regular FS tunable and there was almost no effect on the regular test, so I thought that reads could not be improved at all in this direction.

On Mon, Jul 29, 2013 at 2:24 PM, Li Wang liw...@ubuntukylin.com wrote:

We performed an Iozone read test on a 32-node HPC server. Regarding the hardware of each node: the CPU is very powerful, as is the network, with a bandwidth of 1.5 GB/s; 64GB memory; the IO is relatively slow, with a throughput measured locally by 'dd' of around 70MB/s. We configured a Ceph cluster with 24 OSDs on 24 nodes, one mds, and one to four clients, one client per node. The performance is as follows.

Iozone sequential read throughput (MB/s):

  Number of clients    1          2          4
  Default readahead    180.0954   324.4836   591.5851
  Readahead: 256MB     645.3347   1022.998   1267.631

The complete iozone parameters for one client are: iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output; on each client node, only one thread is started. For two clients it is: iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output.

As the data shows, a larger readahead window can result in a 300% speedup! Besides, since the backend of Ceph is not a traditional hard disk, it is beneficial to capture stride-read prefetching. To prove this, we tested stride reads with the following program. As we know, the generic readahead algorithm of the Linux kernel will not capture stride-read access patterns, so we use fadvise() to manually force prefetching. The record size is 4MB. The result is even more surprising.

Stride read throughput (MB/s):

  Number of records prefetched   0       1        4        16       64       128
  Throughput                     42.82   100.74   217.41   497.73   854.48   950.18

As the data shows, with a readahead size of 128*4MB, the speedup over no readahead can be up to 950/42, roughly 2000%! The core logic of the test program is below:

stride = 17
recordsize = 4MB
for (;;) {
    for (i = 0; i < count; ++i) {
        long long start = pos + (i + 1) * stride * recordsize;
        printf("PRE READ %lld %lld\n", start, start + block);
        posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
    }
    len = read(fd, buf, block);
    total += len;
    printf("READ %lld %lld\n", pos, (pos + len));
    pos += len;
    lseek(fd, (stride - 1) * block, SEEK_CUR);
    pos += (stride - 1) * block;
}

Given the above results and some more, we plan to submit a blueprint to discuss the prefetching optimization of Ceph.

Cheers, Li Wang
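For context, the "regular FS tunable" and the 256MB readahead window above would typically map, on the kernel CephFS client, to the rasize mount option (readahead size in bytes). A hedged illustration, assuming a kernel client recent enough to support rasize; the monitor address, mountpoint and credentials are placeholders:

  mount -t ceph mon1:6789:/ /mnt/ceph \
      -o name=admin,secretfile=/etc/ceph/secret,rasize=268435456   # 256 MB readahead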
OSD crash upon pool creation
Hello,

Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5, I observed some disaster-like behavior after a ``pool create'' command: every osd daemon in the cluster died at least once (some crashed several times in a row after being brought back). Please take a look at the backtraces (almost identical) below. Issue #5637 has been created in the tracker.

Thanks!

http://xdel.ru/downloads/poolcreate.txt.gz
http://xdel.ru/downloads/poolcreate2.txt.gz
Re: poor write performance
On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson mark.nel...@inktank.com wrote:

On 04/18/2013 06:46 AM, James Harper wrote:

I'm doing some basic testing, so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing.

My setup, approximately, is:

Two OSDs:
. 1 x 7200RPM SATA disk each
. 2 x gigabit cluster network interfaces each, in a bonded configuration, directly attached (osd to osd, no switch)
. 1 x gigabit public network
. journal on another spindle

Three MONs:
. 1 each on the OSDs
. 1 on another server, which is also the one used for testing performance

I'm using Debian packages from ceph, version 0.56.4. For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators, which run Xen on top of (C)LVM volumes on top of the iSCSI. Performance is not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...

Hi James!

Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you trouble? If you are using the kernel version of RBD, we don't have any kind of cache implemented there, and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will hit 1 OSD repeatedly and then move on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here are a couple of thoughts:

1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now.

2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS.

Can you point to the related commits, if possible?

3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store may all be helpful.

Thanks

James

Good luck!
Accidental image deletion
Hello,

Is there an existing or planned way to protect an image from such a thing, other than a protected snapshot? While ``rbd snap protect'' is good enough for small or inactive images, large ones may incur significant space or I/O overhead while the 'locking' snapshot is present, so it would be nice to have the same functionality via a flag on the ``rbd lock'' command.

Thanks!
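For reference, the existing lock mechanism that this proposal would extend looks like the following (image and lock id are illustrative). Note that today these locks are advisory, which is exactly the gap: by themselves they do not stop a removal.

  rbd lock add rbd/huge-image deletion-guard
  rbd lock list rbd/huge-image
  # rbd rm rbd/huge-image    # an advisory lock alone will not block this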
Re: Ceph availability test recovering question
Hello,

I'm experiencing the same long-standing problem: during recovery ops, some percentage of read I/O remains in flight for seconds, rendering the upper-level filesystem in the qemu client very slow and almost unusable. Different striping has almost no effect on the visible delays, and the reads may not be intensive at all, yet they are still very slow. Here are some fio results for randread with small blocks, so it is not affected by readahead the way a linear test would be.

Intensive reads during recovery:

lat (msec) : 2=0.01%, 4=0.08%, 10=1.87%, 20=4.17%, 50=8.34%
lat (msec) : 100=13.93%, 250=2.77%, 500=1.19%, 750=25.13%, 1000=0.41%
lat (msec) : 2000=15.45%, >=2000=26.66%

Same on a healthy cluster:

lat (msec) : 20=0.33%, 50=9.17%, 100=23.35%, 250=25.47%, 750=6.53%
lat (msec) : 1000=0.42%, 2000=34.17%, >=2000=0.56%

On Sun, Mar 17, 2013 at 8:18 AM, kelvin_hu...@wiwynn.com wrote:

Hi, all

I have some problems after an availability test.

Setup:
Linux kernel: 3.2.0
OS: Ubuntu 12.04
Storage server: 11 HDD (each storage server has 11 osd, 7200 rpm, 1T) + 10GbE NIC
RAID card: LSI MegaRAID SAS 9260-4i
For every HDD: RAID0, Write Policy: Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct
Storage server number: 2
Ceph version: 0.48.2
Replicas: 2
Monitor number: 3

We have two storage servers as a cluster, and a ceph client that creates a 1T RBD image for testing; the client also has a 10GbE NIC, Linux kernel 3.2.0, Ubuntu 12.04. We use FIO to produce the workload.

fio commands:

[Sequential Read]
fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=read --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10

[Sequential Write]
fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=write --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10

Now I want to observe the ceph state when one storage server crashes, so I turn off one storage server's networking. We expected that data write and read operations could resume quickly, or not even be suspended, during ceph recovery, but the experimental results show that write and read operations pause for about 20~30 seconds during recovery.

My questions are:

1. Is the I/O pause normal while ceph is recovering?
2. Is the I/O pause time unavoidable during recovery?
3. How can the I/O pause time be reduced?

Thanks!!
Re: maintanance on osd host
On Tue, Feb 26, 2013 at 6:56 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:

Hi list,

How can I do short maintenance, like a kernel upgrade, on an osd host? Right now ceph starts to backfill immediately if I say:

ceph osd out 41
...

Without the ceph osd out command, all clients hang for the time during which ceph does not know that the host was rebooted. I tried ceph osd set nodown and ceph osd set noout, but this doesn't make any difference.

Hi Stefan,

In my practice, nodown will freeze all I/O for sure until the OSD returns; killing the osd processes and setting ``mon osd down out interval'' large enough will do the trick - you'll get only two small freezes from the peering process, at the start and at the end. Also, it is very strange that your clients hang for a long time - I have set non-optimal values on purpose and was not able to observe a re-peering process longer than a minute.
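A sketch of that sequence as concrete commands. The 1800-second value is illustrative (it must outlast the reboot), and the exact injectargs invocation varies across releases, so treat this as an outline rather than a recipe:

  # widen the grace period before down OSDs are marked out (per monitor)
  ceph tell mon.a injectargs '--mon-osd-down-out-interval 1800'
  # stop the OSDs cleanly so peers mark them down immediately instead of timing out
  service ceph stop osd
  reboot
  # after the OSDs have rejoined and peered, restore the default
  ceph tell mon.a injectargs '--mon-osd-down-out-interval 300'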
Re: rbd export speed limit
On Wed, Feb 13, 2013 at 12:22 AM, Stefan Priebe s.pri...@profihost.ag wrote:

Hi,

Is there a speed limit option for rbd export? Right now I'm able to produce several SLOW requests for IMPORTANT valid requests while just exporting a snapshot which is not really important. rbd export runs at 2400MB/s and each OSD at 250MB/s, so it seems to block valid normal read/write operations.

Greets, Stefan

--

I can confirm this in one specific case: when 0.56.2 and 0.56.3 coexist for a long time, nodes running the newer version can produce such warnings at the beginning of exporting huge snapshots, though not during the entire export. And there is real impact on clients - for example, I can see messages from the watchdog in the KVM guests. For now, I will do input throttling on the export as a temporary workaround.
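One way to do such throttling externally, assuming an rbd build that accepts '-' as the export destination: pipe the export through a rate limiter. The 100 MB/s cap and paths are illustrative:

  rbd export rbd/image@snap - | pv -L 100m > /backup/image.raw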
Re: Hit suicide timeout after adding new osd
On Thu, Jan 24, 2013 at 10:01 PM, Sage Weil s...@inktank.com wrote:

On Thu, 24 Jan 2013, Andrey Korolyov wrote:

On Thu, Jan 24, 2013 at 8:39 AM, Sage Weil s...@inktank.com wrote:

On Thu, 24 Jan 2013, Andrey Korolyov wrote:

On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:

Hi Sage,

> I think the problem now is just that 'osd target transaction size' is

I set it to 50, and that seems to have solved all my problems. After a day or so my cluster got to a HEALTH_OK state again. It has been running for a few days now without any crashes!

Hmm, one of the OSDs crashed again, sadly. It logs:

-2 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60
-1 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180
0 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

With this stack trace:

ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0x846ecb]
2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae]
3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8]
4: (CephContextServiceThread::entry()+0x55) [0x8e0f45]
5: /lib64/libpthread.so.0() [0x3cbc807d14]
6: (clone()+0x6d) [0x3cbc0f167d]

I have saved the core file, if there's anything in there you need? Or do you think I just need to set the target transaction size even lower than 50?

I was able to catch this too on rejoining a very busy cluster, and it seems I need to lower this value, at least at start time. Also, c5fe0965572c074a2a33660719ce3222d18c1464 has increased the overall time before a restarted or new osd joins the cluster; with 2M objects / 3T of replicated data, a restart of the cluster took almost an hour before it actually began to work. The worst thing is that a single osd, if restarted, will be marked up after a couple of minutes, then after almost half an hour (eating 100 percent of one cpu) marked down, and then the cluster will start to redistribute data after the 300s timeout, with the osd still doing something.

Okay, something is very wrong. Can you reproduce this with a log? Or even a partial log while it is spinning? You can adjust the log level on a running process with:

ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20
ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1

We haven't been able to reproduce this, so I'm very much interested in any light you can shine here.

Unfortunately the cluster finally hit the ``suicide timeout'' on every osd, so there were no logs, only some backtraces [1]. Yesterday, after an osd was not able to join the cluster within an hour, I decided to wait until the data was remapped, then tried to restart the cluster, leaving it overnight; by morning all osd processes were dead, with the same backtraces. Before that, after a silly node crash (related to deadlocks in kernel kvm code), some pgs remained stuck in the peering state without any blocker in the json output, so I decided to restart the osd holding the primary copy, because that had helped before. So the most interesting part is missing, but I'll reformat the cluster soon and will try to catch this again after filling in some data.

[1].
http://xdel.ru/downloads/ceph-log/osd-heartbeat/

Thanks, I believe I see the problem. The peering workqueue is way behind, and it is trying to do it all in one lump, timing out the work queue. The workaround is to increase the timeout. We'll put together a proper fix. sage

Hi Sage, single OSDs are still not able to join the cluster after a restart: the osd process eats one core and reads the disk in long continuous stretches, about hundreds of seconds, then keeps eating 100% of a core, then repeats. On a relatively new cluster it is not reproducible even with almost the same amount of committed data; only a week or two of writes, snapshot creation, etc. exposes it. Please see the log below:

2013-02-17 12:08:17.503992 7fbe8795c780 0 ceph version 0.56.3-2-g290a352 (290a352c3f9e241deac562e980ac8c6a74033ba6), process ceph-osd, pid 29283
starting osd.26 at :/0 osd_data /var/lib/ceph/osd/26 /var/lib/ceph/osd/journal/journal26
2013-02-17 12:08:17.508193 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6803/29283 need_addr=1
2013-02-17 12:08:17.508222 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6804/29283 need_addr=1
2013-02-17 12:08:17.508244 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6805
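For reference, the timeout workaround Sage mentions would be a ceph.conf change. A minimal sketch, assuming the FileStore op thread pool timeouts are the ones being hit (the 60s/180s thresholds in the log above match those options' defaults); the 600s value is an illustrative guess, not a recommendation:

    [osd]
        ; warn threshold stays at the default, suicide threshold is raised
        filestore op thread timeout = 60
        filestore op thread suicide timeout = 600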
Re: [0.48.3] OSD memory leak when scrubbing
Can anyone who hit this bug please confirm that your system contains libc 2.15+?

On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han han.sebast...@gmail.com wrote: oh nice, the pattern also matches paths :D, didn't know that. Thanks Greg -- Regards, Sébastien Han.

On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum g...@inktank.com wrote: Set your /proc/sys/kernel/core_pattern file. :) http://linux.die.net/man/5/core -Greg

On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han han.sebast...@gmail.com wrote: ok, I finally managed to get something on my test cluster; unfortunately, the dump goes to /. Any idea how to change the destination path? My production / won't be big enough... -- Regards, Sébastien Han.

On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick dan.m...@inktank.com wrote: ...and/or do you have the corepath set interestingly, or one of the core-trapping mechanisms turned on?

On 02/04/2013 11:29 AM, Sage Weil wrote: On Mon, 4 Feb 2013, Sébastien Han wrote: Hum, just tried several times on my test cluster and I can't get any core dump. Does Ceph commit suicide or something? Is it expected behavior? SIGSEGV should trigger the usual path that dumps a stack trace and then dumps core. Was your ulimit -c set before the daemon was started? sage

On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han han.sebast...@gmail.com wrote: Hi Loïc, Thanks for bringing our discussion to the ML. I'll check that tomorrow :-). Cheers -- Regards, Sébastien Han.

On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary l...@dachary.org wrote: Hi, As discussed during FOSDEM, the script you wrote to kill the OSD when it grows too much could be amended to core dump instead of just being killed and restarted. The binary + core could probably be used to figure out where the leak is. You should make sure the OSD's current working directory is on a file system with enough free disk space to accommodate the dump, and set ulimit -c unlimited before running it (your system default is probably ulimit -c 0, which inhibits core dumps). When you detect that the OSD has grown too much, kill it with kill -SEGV $pid and upload the core found in the working directory, together with the binary, to a public place. If the osd binary is compiled with -g but without changing the -O settings, you get a larger binary file but no negative impact on performance. Forensic analysis will be made a lot easier with the debugging symbols. My 2cts

On 01/31/2013 08:57 PM, Sage Weil wrote: On Thu, 31 Jan 2013, Sylvain Munaut wrote: Hi, I disabled scrubbing using

ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

and the leak seems to be gone. See the graph at http://i.imgur.com/A0KmVot.png with the OSD memory for the 12 osd processes over the last 3.5 days. Memory was rising every 24h. I made the change yesterday around 13h00 and the OSDs stopped growing. OSD memory even seems to go down slowly in small steps. Of course I assume disabling scrubbing is not a long term solution and I should re-enable it ... (how do I do that btw? what were the default values for those parameters) It depends on the exact commit you're on. You can see the defaults if you do ceph-osd --show-config | grep osd_scrub Thanks for testing this... I have a few other ideas to try to reproduce.
sage

-- Loïc Dachary, Artisan Logiciel Libre
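A minimal sketch of the watchdog Loïc describes, assuming the daemon was already started with ulimit -c unlimited in effect; the single-osd host and the RSS threshold are placeholders:

    #!/bin/sh
    # core-dump a leaking osd instead of quietly restarting it
    PID=$(pidof ceph-osd)                  # assumes one osd per host
    LIMIT_KB=$((4 * 1024 * 1024))          # 4 GB RSS threshold, example value
    RSS_KB=$(awk '/VmRSS/ {print $2}' /proc/"$PID"/status)
    if [ "$RSS_KB" -gt "$LIMIT_KB" ]; then
        kill -SEGV "$PID"                  # triggers the stack-trace + core-dump path
    fi

The core then lands wherever /proc/sys/kernel/core_pattern points, per Greg's note above.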
Re: rbd export speed limit
Hi Stefan, you may be interested in throttle(1) as a side solution, using the stdout export option. By the way, on which interconnect have you managed to get such speeds, if you mean 'committed' bytes (e.g. not an almost empty allocated image)?

On Wed, Feb 13, 2013 at 12:22 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi, is there a speed limit option for rbd export? Right now I'm able to produce several SLOW requests from IMPORTANT valid requests while just exporting a snapshot which is not really important. rbd export runs at 2400MB/s and each OSD at 250MB/s, so it seems to block valid normal read / write operations. Greets, Stefan
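A sketch of that side solution: export to stdout and cap the rate inside the pipe. pv(1) stands in for throttle(1) here since its rate flag is unambiguous; the image name, snapshot, and 100 MB/s cap are placeholders:

    # export a snapshot to stdout and cap the read rate at 100 MB/s
    rbd export data/image1@snap1 - | pv -L 100m > /backup/image1-snap1.raw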
Re: Paxos and long-lasting deleted data
On Thu, Jan 31, 2013 at 11:18 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Jan 31, 2013 at 10:56 PM, Gregory Farnum g...@inktank.com wrote: On Thu, Jan 31, 2013 at 10:50 AM, Andrey Korolyov and...@xdel.ru wrote: http://xdel.ru/downloads/ceph-log/rados-out.txt.gz

On Thu, Jan 31, 2013 at 10:31 PM, Gregory Farnum g...@inktank.com wrote: Can you pastebin the output of rados -p rbd ls?

Well, that sure is a lot of rbd objects. Looks like a tool mismatch or a bug in whatever version you were using. Can you describe how you got into this state, what versions of the servers and client tools you used, etc? -Greg

That's relatively fresh data moved into a bare new cluster a couple of days after the 0.56.1 release, and the tool/daemon versions were kept consistently the same at every moment. All the garbage data belongs to the same pool prefix (3.) on which I have put a bunch of VM images lately; the cluster may have experienced split-brain problems for short times during crash-tests with no workload at all, plus the standard crash tests of osd removal/readdition under moderate workload. Killed osds were returned before, at the time of, and after the process of data rearrangement on the ``osd down'' timeout. Is it possible to do a little cleanup somehow without pool re-creation?

Just an update: this data stayed after pool deletion, so there is probably a way to delete the garbage bytes on a live pool without doing any harm (I hope so), since in theory it can be dissected from the actual pool data placement.
Re: Paxos and long-lasting deleted data
On Mon, Feb 4, 2013 at 1:46 AM, Gregory Farnum g...@inktank.com wrote: On Sunday, February 3, 2013 at 11:45 AM, Andrey Korolyov wrote: Just an update: this data stayed after pool deletion, so there is probably a way to delete the garbage bytes on a live pool without doing any harm (I hope so), since in theory it can be dissected from the actual pool data placement.

What? You mean you deleted the pool and the data in use by the cluster didn't drop? If that's the case, check and see if it's still at the same level - pool deletes are asynchronous and throttled to prevent impacting client operations too much.

Yep, of course, that is exactly what I meant - I waited until the ``ceph -w'' values had stabilized for a long period, then checked that a bunch of files with the same prefix as the deleted pool remained, then I purged them manually. I'm not sure this data was in use at the moment of pool removal; as I mentioned above, it's just garbage produced during periods when the cluster was heavily degraded. -Greg
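A sketch of the manual purge Andrey describes, assuming the leftover objects share a known block-name prefix; the prefix and pool below are placeholders:

    # remove leftover objects by name prefix
    for obj in $(rados -p rbd ls | grep '^rb\.0\.27a8\.'); do
        rados -p rbd rm "$obj"
    done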
Re: Paxos and long-lasting deleted data
http://xdel.ru/downloads/ceph-log/rados-out.txt.gz

On Thu, Jan 31, 2013 at 10:31 PM, Gregory Farnum g...@inktank.com wrote: Can you pastebin the output of rados -p rbd ls?

On Thu, Jan 31, 2013 at 10:17 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, Please take a look, this data has remained for days and seems not to be deleted in the future either:

pool name    category   KB   objects   clones   degraded   unfound   rd   rd KB   wr   wr KB
data         -          000 0 0000 0
install      -          15736833 38560 0 0 163 464648 60970390
metadata     -          000 0 0000 0
prod-rack0   -          364027905888950 0 0 320 267626 689034186
rbd          -          4194305 10270 0 04111269 25165828
total used   690091436893778
total avail  18335469376
total space  25236383744

for pool in $(rados lspools) ; do rbd ls -l $pool ; done | grep -v SIZE | awk '{ sum += $2} END { print sum }'
rbd: pool data doesn't contain rbd images
rbd: pool metadata doesn't contain rbd images
526360

I have seen the same thing before, but not as pronounced as here. The cluster was put under a moderate failure test, dropping one or two osds at once under I/O pressure with replication factor three.

Just wondering if there was something else you wanted to discuss in your email, given the email subject. Did you by any chance want to discuss anything regarding Paxos?

Sorry, please never mind; I was just thinking about paxos-like behavior and spontaneously put that in the title instead of ``osd data placement''.
Re: page allocation failures on osd nodes
On Mon, Jan 28, 2013 at 8:55 PM, Andrey Korolyov and...@xdel.ru wrote: On Mon, Jan 28, 2013 at 5:48 PM, Sam Lang sam.l...@inktank.com wrote: On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov and...@xdel.ru wrote: Ahem, once on an almost empty node the same trace was produced by a qemu process (which was actually pinned to a specific numa node), so it seems that this is generally some scheduler/mm bug, not directly related to the osd processes. In other words, the lower the percentage of memory that is actually RSS, the higher the probability of such an allocation failure.

This might be a known bug in xen for your kernel? The xen users list might be able to help. -sam

It is vanilla 3.4; I really wonder where the paravirt bits in the trace come from. The bug shows up only in 3.4 and is really harmless, at least in the ways I have tested it. Ceph-osd memory allocation behavior is simply more likely to trigger those messages than most other applications under the ``same'' conditions.
Re: page allocation failures on osd nodes
On Sat, Jan 26, 2013 at 12:41 PM, Andrey Korolyov and...@xdel.ru wrote: On Sat, Jan 26, 2013 at 3:40 AM, Sam Lang sam.l...@inktank.com wrote: On Fri, Jan 25, 2013 at 10:07 AM, Andrey Korolyov and...@xdel.ru wrote: Sorry, I wrote too little yesterday because I was sleepy. That's obviously cache pressure, since dropping caches made these errors disappear for a long period. I'm not very familiar with kernel memory mechanisms, but shouldn't the kernel first try to allocate memory on the second node, if this is not prohibited by the process's cpuset, and only then report an allocation failure (as can be seen, only node 0 is involved in the failures)? I really have no idea how NUMA awareness might count in the case of osd daemons.

Hi Andrey, You said that the allocation failure doesn't occur if you flush caches, but the kernel should evict pages from the cache as needed so that the osd can allocate more memory (unless they're dirty, but it doesn't look like you have many dirty pages in this case). It looks like you have plenty of reclaimable pages as well. Does the osd remain running after that error occurs?

Yes, it keeps running flawlessly without even a bit changing in the osdmap, but unfortunately logging wasn't turned on at that moment. As soon as I finish the massive test for the ``suicide timeout'' bug, I'll check your idea with dd and also rerun the test below with ``debug osd = 20''. My thought is that the kernel has ready-to-be-freed memory on node1, yet for some strange reason the osd process tries to reserve pages from node0 (where it obviously allocated memory at start, since node1's memory starts only at high addresses over 32G), and the kernel then refuses to free cache on that specific node (it's quite murky, at least for me, why the kernel does not just invalidate some buffers, even ones more deserving to stay in RAM than the tail-of-LRU ones). Allocation looks like the following on most nodes:

MemTotal:       66081396 kB
MemFree:          278216 kB
Buffers:           15040 kB
Cached:         62422368 kB
SwapCached:            0 kB
Active:          2063908 kB
Inactive:       60876892 kB
Active(anon):     509784 kB
Inactive(anon):       56 kB
Active(file):    1554124 kB
Inactive(file): 60876836 kB

OSD-node free memory, with two osd processes on each node; libvirt prints the ``Free'' field there:

0: 207500 KiB
1: 72332 KiB
Total: 279832 KiB

0: 208528 KiB
1: 80692 KiB
Total: 289220 KiB

Since it is known that the kernel reserves more memory on the node with higher memory pressure, this seems very legit - the osd processes work mostly with node 0's memory, so there is a bigger gap there than on node 1, where almost only fs cache exists.

Ahem, once on an almost empty node the same trace was produced by a qemu process (which was actually pinned to a specific numa node), so it seems that this is generally some scheduler/mm bug, not directly related to the osd processes. In other words, the lower the percentage of memory that is actually RSS, the higher the probability of such an allocation failure. I have printed timestamps of the failure events on selected nodes, just for reference: http://xdel.ru/downloads/ceph-log/allocation-failure/stat.txt

I wonder if you see the same error if you run a long write-intensive workload on the local disk for the osd in question, maybe dd if=/dev/zero of=/data/osd.0/foo -sam

On Fri, Jan 25, 2013 at 2:42 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, Those traces happen only under constant high writes and seem to be very rare. OSD processes do not consume more memory after this event, and the peaks are not distinguishable by monitoring. I was able to catch it after four hours of constant writes on the cluster.
http://xdel.ru/downloads/ceph-log/allocation-failure/
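A sketch of Sam's dd reproduction, with the writer pinned to one NUMA node so the per-node cache pressure can be watched while it runs; the node number, size, and osd data path are assumptions taken from this thread:

    # long write pinned to node 0, with per-node memory watched alongside
    numactl --cpunodebind=0 --membind=0 \
        dd if=/dev/zero of=/data/osd.0/foo bs=1M count=100000 &
    watch -n1 'grep -E "MemFree|FilePages" /sys/devices/system/node/node*/meminfo'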
Re: how to protect rbd from multiple simultaneous mapping
On Fri, Jan 25, 2013 at 7:51 PM, Sage Weil s...@inktank.com wrote: On Fri, 25 Jan 2013, Andrey Korolyov wrote: On Fri, Jan 25, 2013 at 4:52 PM, Ugis ugi...@gmail.com wrote: I mean if you map an rbd and do not use the rbd lock.. command. Can you tell which client has mapped a certain rbd anyway?

Not yet. We need to add the ability to list watchers in librados, which will then let us infer that information.

Assume you have an indistinguishable L3 segment, NAT for example, and access the cluster over it - there is no possibility for the cluster to tell who exactly did something (meaning, the mapping). The lock mechanism is enough to fulfill your request, anyway.

The addrs listed by the lock list are entity_addr_t's, which include an IP, port, and a nonce that uniquely identifies the client. It won't get confused by NAT. Note that you can blacklist either a full IP or an individual entity_addr_t. But, as mentioned above, you can't list users who didn't use the locking (yet).

Yep, I meant the impossibility of mapping a source address to a specific client in this case: it is possible to say that some client mapped the image, but not exactly which one with a specific identity (since clients use the same credentials, in the less-distinguishable case). A client with root privileges could be extended to send the DMI UUID, which is more or less persistent, but this is generally a bad idea since a client may be non-root and still in need of a persistent identity.

sage

2013/1/25 Wido den Hollander w...@widodh.nl: On 01/25/2013 11:47 AM, Ugis wrote: This could work, thanks! P.S. Is there a way to tell which client has mapped a certain rbd if no rbd lock is used?

What you could do is this: $ rbd lock add myimage `hostname` That way you know which client locked the image. Wido

It would be useful to see that info in the output of rbd info image. Probably an attribute for rbd like max_map_count_allowed would be useful in the future - just to make sure an rbd is not mapped from multiple clients if it must not be. I suppose it can actually happen if multiple admins work with the same rbds from multiple clients and no strict rbd lock add.. procedure is followed. Ugis

2013/1/25 Sage Weil s...@inktank.com: On Thu, 24 Jan 2013, Mandell Degerness wrote: The advisory locks are nice, but it would be really nice to have the fencing. If a node is temporarily off the network and a heartbeat monitor attempts to bring up a service on a different node, there is no way to ensure that the first node will not write data to the rbd after the rbd is mounted on the second node. It would be nice if, on seeing that an advisory lock exists, you could tell ceph: Do not accept data from node X until further notice.

Just a reminder: you can use the information from the locks to fence. The basic process is:

- identify old rbd lock holder (rbd lock list img)
- blacklist old owner (ceph osd blacklist add addr)
- break old rbd lock (rbd lock remove img lockid addr)
- lock rbd image on new host (rbd lock add img lockid)
- map rbd image on new host

The oddity here is that the old VM can in theory continue to write up until the OSD hears about the blacklist via the internal gossip. This is okay because the act of the new VM touching any part of the image (and the OSD that stores it) ensures that that OSD gets the blacklist information. So on XFS, for example, the act of replaying the XFS journal ensures that any attempt by the old VM to write to the journal will get EIO.
sage

On Thu, Jan 24, 2013 at 11:50 AM, Josh Durgin josh.dur...@inktank.com wrote: On 01/24/2013 05:30 AM, Ugis wrote: Hi, I have an rbd which contains a non-cluster filesystem. If this rbd is mapped+mounted on one host, it should not be mapped+mounted on another simultaneously. How to protect such an rbd from being mapped on the other host? At ceph level, is the only option to use lock add [image-name] [lock-id] and check for the existence of this lock on the other client, or is it possible to protect the rbd in a way that on other clients the rbd map command would just fail with something like Permission denied, without using arbitrary locks? In other words, can one limit the count of clients that may map a certain rbd?

This is what the lock commands were added for. The lock add command will exit non-zero if the image is already locked, so you can run something like:

rbd lock add [image-name] [lock-id]
rbd map [image-name]

to avoid mapping an image that's in use elsewhere. The lock-id is user-defined, so you could (for example) use the hostname of the machine mapping the image to tell where it's in use. Josh
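A small wrapper sketch around Josh's two commands, relying only on what the thread states (lock add exits non-zero when the image is already locked); the image argument is a placeholder:

    #!/bin/sh
    # map an rbd image only if its advisory lock can be taken first
    IMG=${1:?usage: $0 pool/image}
    LOCK_ID=$(hostname)
    if rbd lock add "$IMG" "$LOCK_ID"; then
        rbd map "$IMG"
    else
        echo "$IMG is locked elsewhere, refusing to map" >&2
        exit 1
    fi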
Re: handling fs errors
On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil s...@inktank.com wrote: We observed an interesting situation over the weekend. The XFS volume under ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed suicide. XFS seemed to unwedge itself a bit after that, as the daemon was able to restart and continue.

The problem is that during those 180s the OSD was claiming to be alive but not able to do any IO. That heartbeat check is meant as a sanity check against a wedged kernel, but waiting so long meant that the ceph-osd wasn't failed by the cluster quickly enough and client IO stalled.

We could simply change that timeout to something close to the heartbeat interval (currently default is 20s). That will make ceph-osd much more sensitive to fs stalls that may be transient (high load, whatever).

Another option would be to make the osd heartbeat replies conditional on whether the internal heartbeat is healthy. Then the heartbeat warnings could start at 10-20s, ping replies would pause, but the suicide could still be 180s out. If the stall is short-lived, pings will continue, the osd will mark itself back up (if it was marked down) and continue.

Having written that out, the last option sounds like the obvious choice. Any other thoughts? sage

It seems possible to run into domino-style failure marks there if the lock is triggered frequently enough, and that depends only on the sheer amount of workload. By the way, was that fs aged, or were you able to catch the lock on a fresh one? And which kernel did you run there? Thanks!
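As an aside on catching such wedges from the kernel side (not something the thread itself proposes), the hung-task detector that appears in the dmesg output elsewhere in this digest can be made to report well before the 180s suicide:

    # report tasks stuck in uninterruptible sleep after 30s instead of 120s
    echo 30 > /proc/sys/kernel/hung_task_timeout_secs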
Re: Single host VM limit when using RBD
Hi Matthew, seems to be a low value in /proc/sys/kernel/threads-max.

On Thu, Jan 17, 2013 at 12:37 PM, Matthew Anderson matth...@base3.com.au wrote: I've run into a limit on the maximum number of RBD-backed VMs that I'm able to run on a single host. I have 20 VMs (21 RBD volumes open) running on a single host, and when booting the 21st machine I get the below error from libvirt/QEMU. I'm able to shut down a VM and start another in its place, so there seems to be a hard limit on the number of volumes I'm able to have open. I did some googling, and the error 11 from pthread_create seems to mean 'resource unavailable', so I'm probably running into a thread limit of some sort. I did try increasing the max_thread kernel option but nothing changed. I moved a few VMs to a different, empty host and they start with no issues at all. This machine has 4 OSDs running on it in addition to the 20 VMs. Kernel 3.7.1, Ceph 0.56.1 and QEMU 1.3.0. There is currently 65GB of 96GB free ram and no swap. Can anyone suggest where the limit might be, or anything I can do to narrow down the problem? Thanks -Matt

Error starting domain: internal error Process exited while reading console log output: char device redirected to /dev/pts/23
Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4eb5a65960 time 2013-01-17 02:32:58.096437
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (()+0x2aaa8f) [0x7f4eb2de8a8f]
2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575]
3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc]
4: (()+0xa0290) [0x7f4eb5b27290]
5: (()+0x879dd) [0x7f4eb5b0e9dd]
6: (()+0x87c1b) [0x7f4eb5b0ec1b]
7: (()+0x87ae1) [0x7f4eb5b0eae1]
8: (()+0x87d50) [0x7f4eb5b0ed50]
9: (()+0xb37b2) [0x7f4eb5b3a7b2]
10: (()+0x1e83eb) [0x7f4eb5c6f3eb]
11: (()+0x1ab54a) [0x7f4eb5c3254a]
12: (main()+0x9da) [0x7f4eb5c72a3a]
13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd]
14: (()+0x710b9) [0x7f4eb5af80b9]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.
terminate called after

Traceback (most recent call last):
  File /usr/share/virt-manager/virtManager/asyncjob.py, line 96, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File /usr/share/virt-manager/virtManager/asyncjob.py, line 117, in tmpcb
    callback(*args, **kwargs)
  File /usr/share/virt-manager/virtManager/domain.py, line 1090, in startup
    self._backend.create()
  File /usr/lib/python2.7/dist-packages/libvirt.py, line 620, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirtError: internal error Process exited while reading console log output: char device redirected to /dev/pts/23 Thread::try_create(): pthread_create failed with error 11 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4eb5a65960 time 2013-01-17 02:32:58.096437 common/Thread.cc: 110: FAILED assert(ret == 0) ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: (()+0x2aaa8f) [0x7f4eb2de8a8f] 2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575] 3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc] 4: (()+0xa0290) [0x7f4eb5b27290] 5: (()+0x879dd) [0x7f4eb5b0e9dd] 6: (()+0x87c1b) [0x7f4eb5b0ec1b] 7: (()+0x87ae1) [0x7f4eb5b0eae1] 8: (()+0x87d50) [0x7f4eb5b0ed50] 9: (()+0xb37b2) [0x7f4eb5b3a7b2] 10: (()+0x1e83eb) [0x7f4eb5c6f3eb] 11: (()+0x1ab54a) [0x7f4eb5c3254a] 12: (main()+0x9da) [0x7f4eb5c72a3a] 13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd] 14: (()+0x710b9) [0x7f4eb5af80b9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.

terminate called after
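Some quick checks for the ceilings that can make pthread_create fail with error 11 (EAGAIN); a diagnostic sketch, with the qemu process name an assumption about the setup above:

    cat /proc/sys/kernel/threads-max     # system-wide thread ceiling
    cat /proc/sys/vm/max_map_count       # each thread stack consumes a mapping
    ulimit -u                            # per-user process/thread limit
    # threads currently held by all qemu processes
    ps -o nlwp= -C qemu-system-x86_64 | awk '{s+=$1} END {print s}'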
Re: flashcache
On Thu, Jan 17, 2013 at 7:00 PM, Atchley, Scott atchle...@ornl.gov wrote: On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: 2013/1/17 Atchley, Scott atchle...@ornl.gov: IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high (or higher if QDR) than 10GbE.

Which kind of tuning? Do you have a paper about this?

No, I followed the Mellanox tuning guide and modified their interrupt affinity scripts.

Did you try binding interrupts only to the core to which the QPI link actually belongs, and measuring the difference against spread-over-all-cores binding?

But, actually, is it possible to use ceph with IPoIB in a stable way, or is this experimental?

IPoIB appears as a traditional Ethernet device to Linux and can be used as such.

Not exactly; this summer the kernel gained an additional driver for fully featured L2 (an ib ethernet driver); before that it was quite painful to do any kind of failover using ipoib.

I don't know if support for rsockets is experimental/untested and IPoIB is the stable workaround, or what else.

IPoIB is much more used and pretty stable, while rsockets is new with limited testing. That said, more people using it will help Sean improve it. Ideally, we would like support for zero-copy and reduced CPU usage (via OS-bypass), and with more interconnects than just InfiniBand. :-)

And is a dual controller needed on each OSD node? Is Ceph able to handle OSD network failures? This is really important to know; it changes the whole network topology.

I will let others answer this. Scott
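A sketch of the affinity experiment suggested above, pinning the HCA interrupts to a single local core instead of spreading them; the mlx4 driver name and the CPU mask are assumptions about the hardware:

    service irqbalance stop                      # keep the kernel from re-spreading
    for irq in $(awk -F: '/mlx4/ {gsub(/ /,"",$1); print $1}' /proc/interrupts); do
        echo 2 > /proc/irq/$irq/smp_affinity     # mask 0x2 = CPU1
    done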
Re: Ceph slow request unstable issue
On Wed, Jan 16, 2013 at 10:35 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jan 16, 2013 at 8:58 PM, Sage Weil s...@inktank.com wrote: Hi, On Wed, 16 Jan 2013, Andrey Korolyov wrote: On Wed, Jan 16, 2013 at 4:58 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote: Hi list, We are suffering from OSDs or the OS going down when there is continuous high pressure on the Ceph rack. Basically we are on Ubuntu 12.04 + Ceph 0.56.1, 6 nodes, each node with 20 spindles + 4 SSDs as journals (120 spindles in total). We create a lot of RBD volumes (say 240), mount them on 16 different client machines (15 RBD volumes per client) and run dd concurrently on top of each RBD. The issues are:

1. Slow requests. From the list archive it seems solved in 0.56.1, but we still notice such warnings.
2. OSD down or even host down, like the message below. It seems some OSDs have been blocking there for quite a long time.

Suggestions are highly appreciated. Thanks, Xiaoxi

Bad news: I have rolled all my Ceph machines' OS back to kernel 3.2.0-23, which Ubuntu 12.04 uses. I ran a dd command (dd if=/dev/zero bs=1M count=6 of=/dev/rbd${i}) on the Ceph clients as a data-preparation test last night. Now I have one machine down (can't be reached by ping), another two machines have all OSD daemons down, while the three left have some daemons down. I have many warnings in the OSD log like this:

no flag points reached
2013-01-15 19:14:22.769898 7f20a2d57700 0 log [WRN] : slow request 52.218106 seconds old, received at 2013-01-15 19:13:30.551718: osd_op(client.10674.1:1002417 rb.0.27a8.6b8b4567.0eba [write 3145728~524288] 2.c61810ee RETRY) currently waiting for sub ops
2013-01-15 19:14:23.770077 7f20a2d57700 0 log [WRN] : 21 slow requests, 6 included below; oldest blocked for 1132.138983 secs
2013-01-15 19:14:23.770086 7f20a2d57700 0 log [WRN] : slow request 53.216404 seconds old, received at 2013-01-15 19:13:30.553616: osd_op(client.10671.1:1066860 rb.0.282c.6b8b4567.1057 [write 2621440~524288] 2.ea7acebc) currently waiting for sub ops
2013-01-15 19:14:23.770096 7f20a2d57700 0 log [WRN] : slow request 51.442032 seconds old, received at 2013-01-15 19:13:32.327988: osd_op(client.10674.1:1002418

Similar info in dmesg we have seen previously:

[21199.036476] INFO: task ceph-osd:7788 blocked for more than 120 seconds.
[21199.037493] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[21199.038841] ceph-osd D 0006 0 7788 1 0x
[21199.038844] 880fefdafcc8 0086 ffe0
[21199.038848] 880fefdaffd8 880fefdaffd8 880fefdaffd8 00013780
[21199.038852] 88081aa58000 880f68f52de0 880f68f52de0 882017556200
[21199.038856] Call Trace:
[21199.038858] [8165a55f] schedule+0x3f/0x60
[21199.038861] [8106b7e5] exit_mm+0x85/0x130
[21199.038864] [8106b9fe] do_exit+0x16e/0x420
[21199.038866] [8109d88f] ? __unqueue_futex+0x3f/0x80
[21199.038869] [8107a19a] ? __dequeue_signal+0x6a/0xb0
[21199.038872] [8106be54] do_group_exit+0x44/0xa0
[21199.038874] [8107ccdc] get_signal_to_deliver+0x21c/0x420
[21199.038877] [81013865] do_signal+0x45/0x130
[21199.038880] [810a091c] ? do_futex+0x7c/0x1b0
[21199.038882] [810a0b5a] ? sys_futex+0x10a/0x1a0
[21199.038885] [81013b15] do_notify_resume+0x65/0x80
[21199.038887] [81664d50] int_signal+0x12/0x17

We have seen this stack trace several times over the past 6 months, but are not sure what the trigger is. In principle, the ceph server-side daemons shouldn't be capable of locking up like this, but clearly something is amiss between what they are doing in userland and how the kernel is tolerating that. Low memory, perhaps?
In each case where we tried to track it down, the problem seemed to go away on its own. Is this easily reproducible in your case?

My 0.02$: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11531.html and kernel panics on two different hosts yesterday during ceph startup (on 3.8-rc3; images from the console are available at http://imgur.com/wIRVn,k0QCS#0) lead to the suggestion that Ceph may have introduced lockup-like behavior not long ago, causing, in my case, an excessive amount of context switches on the host, leading to osd flaps and a panic in the IPoIB stack due to the same issue.

For the stack trace my first guess would be a problem with the IB driver that is triggered by memory pressure. Can you characterize what the system utilization
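One way to put numbers on the excessive-context-switch observation: pidstat's -w flag reports voluntary and involuntary switches per second (the process selection is an assumption):

    # per-second context-switch rates for all osd processes
    pidstat -w -p $(pgrep -d, ceph-osd) 1
    # or watch the system-wide "cs" column
    vmstat 1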
Re: Striped images and cluster misbehavior
After digging a lot, I have found that the IB cards and switch may go into a ``bad'' state after a host load spike, so I have limited all potentially cpu-hungry processes via cgroups. That had no effect at all; the spikes happen almost at the same time the osds on the corresponding host are ``wrongly marked'' down for a couple of seconds. By manual observation, I have confirmed that the osds go crazy first, eating all cores with 100% SY (meaning scheduler or fs issues); then the card, lacking time for its interrupts, starts dropping packets, and so on. This can be reproduced only under heavy workload on the fast cluster; a slow one with similar software versions will crawl but does not produce such locks. Those locks may go away or may hang around for a while, tens of minutes; I am not sure what that depends on. Both nodes with the logs pointed to above contain one monitor and one osd, but the locks happen on two-osd nodes as well. Ceph instances do not share block devices in my setup (except two-osd nodes using the same SSD for a journal, but since it is reproducible on a mon-osd pair with completely separated storage, that seems not to be the exact cause). For the meantime, I may suggest to myself moving away from XFS to see if the locks remain. The issue started with the late 3.6 series and 0.55+, and remains in 3.7.1 and 0.56.1. Should I move to ext4 immediately, or try 3.8-rc with a couple of XFS fixes first?

http://xdel.ru/downloads/ceph-log/osd-lockup-1-14-25-12.875107.log.gz
http://xdel.ru/downloads/ceph-log/osd-lockup-2-14-33-16.741603.log.gz

Timestamps in the filenames were added for easier lookup; the osdmap marked the osds down a couple of beats after those marks.

On Mon, Dec 31, 2012 at 1:16 AM, Andrey Korolyov and...@xdel.ru wrote: On Sun, Dec 30, 2012 at 10:56 PM, Samuel Just sam.j...@inktank.com wrote: Sorry for the delay. A quick look at the log doesn't show anything obvious... Can you elaborate on how you caused the hang? -Sam

I am sorry for all this noise; the issue has almost certainly been triggered by some bug in the Infiniband switch firmware, because a per-port reset was able to solve the ``wrong mark'' problem - at least, it hasn't shown up yet for a week. The problem took almost two days until resolution - all possible connectivity tests displayed no overtimes or drops which could cause wrong marks. Finally, I started playing with TCP settings and found that ipv4.tcp_low_latency raised the probability of the ``wrong mark'' event several times over when enabled - so the area of all possible causes quickly collapsed to a media-only problem, and I fixed it soon after.

On Wed, Dec 19, 2012 at 3:53 AM, Andrey Korolyov and...@xdel.ru wrote: Please take a look at the log below; this is a slightly different bug - both osd processes on the node were stuck eating all available cpu until I killed them. This can be reproduced by doing parallel exports of different images from the same client IP using either ``rbd export'' or API calls - after a couple of wrong ``downs'', osd.19 and osd.27 finally got stuck. What is more interesting, 10.5.0.33 holds the most hungry set of virtual machines, constantly eating four of twenty-four HT cores, and this node fails almost always. The underlying fs is XFS, ceph version gf9d090e. Quite possibly my previous reports are about side effects of this problem.

http://xdel.ru/downloads/ceph-log/osd-19_and_27_stuck.log.gz

and timings for the monmap; the logs are from different hosts, so they may have a time shift of tens of milliseconds:

http://xdel.ru/downloads/ceph-log/timings-crash-osd_19_and_27.txt

Thanks!
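The sysctl toggle used in that bisection; flipping it changed the failure rate, which pointed at the fabric rather than at ceph:

    sysctl -w net.ipv4.tcp_low_latency=1    # made the ``wrong mark'' events several times likelier
    sysctl -w net.ipv4.tcp_low_latency=0    # back to the default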
Re: v0.56.1 released
On Tue, Jan 8, 2013 at 11:30 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, I cannot see any git tag or branch claiming to be 0.56.1? Which commit id is this? Greets, Stefan

Same for me; github simply did not send the new tag on pull to the local tree for some reason. Cloning the repository from scratch resolved this :)

On 08.01.2013 05:53, Sage Weil wrote: We found a few critical problems with v0.56, and fixed a few outstanding problems. v0.56.1 is ready, and we're pretty pleased with it! There are two critical fixes in this update: a fix for possible data loss or corruption if power is lost, and a protocol compatibility problem that was introduced in v0.56 (between v0.56 and any other version of ceph).

* osd: fix commit sequence for XFS, ext4 (or any other non-btrfs) to prevent data loss on power cycle or kernel panic
* osd: fix compatibility for CALL operation
* osd: process old osdmaps prior to joining cluster (fixes slow startup)
* osd: fix a couple of recovery-related crashes
* osd: fix large io requests when journal is in (non-default) aio mode
* log: fix possible deadlock in logging code

This release will kick off the bobtail backport series, and will get a shiny new URL for its home.

* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.56.1.tar.gz
* For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
* For RPMs, see http://ceph.com/docs/master/install/rpm
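A lighter fix than re-cloning, for what it's worth: a plain git pull does not always fetch new tags, but an explicit tag fetch does (tag name follows ceph's v-prefixed convention):

    git fetch --tags origin
    git tag -l 'v0.56*'     # the new tag should now be visible
    git checkout v0.56.1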
Very intensive I/O under mon process
I have just observed that the ceph-mon process, at least the bobtail one, has an extremely high density of writes - several times above the _overall_ cluster amount of writes as measured by the qemu driver (and those measurements are very close to fair). For example, a test cluster of 32 osds shows 7.5 MByte/s of writes on each mon node while the overall amount is about 1.5 MByte/s, and a dev cluster with only three osds shows about 1 MByte/s against an accumulated real write bandwidth of tens of kilobytes per second. I'm afraid that if this is normal, I may hit the limit of spinning storage when growing the test cluster, say, twenty times in osd count and the related ``idle'' write bandwidth.
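A sketch of how such per-process write rates can be measured for comparison against the cluster-wide client I/O that ceph -w reports; the process selection is an assumption:

    # kB written per second by the monitor, sampled each second
    pidstat -d -p $(pgrep -f ceph-mon) 1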
Re: Very intensive I/O under mon process
On Wed, Jan 2, 2013 at 8:00 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: On 01/02/2013 03:40 PM, Andrey Korolyov wrote: I have just observed that the ceph-mon process, at least the bobtail one, has an extremely high density of writes - several times above the _overall_ cluster amount of writes as measured by the qemu driver (and those measurements are very close to fair). For example, a test cluster of 32 osds shows 7.5 MByte/s of writes on each mon node while the overall amount is about 1.5 MByte/s, and a dev cluster with only three osds shows about 1 MByte/s against an accumulated real write bandwidth of tens of kilobytes per second. I'm afraid that if this is normal, I may hit the limit of spinning storage when growing the test cluster, say, twenty times in osd count and the related ``idle'' write bandwidth.

High debugging levels (especially 'debug ms', 'debug mon' or 'debug paxos') should significantly increase IO on the monitors. Might that be the case?

Nope, all debug levels, including the mons', are set to 0/0. I also see that the ``no-client'' cluster shows a very small amount of such writes under the mon, 10-20 kByte/s, and one idle client (writing a couple of bytes without O_SYNC) raises this value up to ~200 kB/s, and so on; so maybe I was wrong before and the writes correlate with the number of clients too (six clients plus three control nodes accessing via API, in the context of the previous message, for both environments).
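For completeness, the mon debug levels can be lowered on a live monitor through its admin socket, mirroring the osd admin-socket commands quoted earlier in this digest; the socket path and mon name are assumptions:

    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_mon 0
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_ms 0
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_paxos 0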
0.55 crashed during upgrade to bobtail
Hi, All osds in the dev cluster died shortly after an upgrade (package-only, i.e. a binary upgrade, without even restarting the running processes); please see the attached file.

Was: 0.55.1-356-g850d1d5
Upgraded to: the 0.56 tag

The only difference is the gcc version and corresponding libstdc++: 4.6 on the buildhost and 4.7 on the cluster. Of course I may do a rollback, and the problem will most probably go away, but it seems there should be some fix. Also I saw something similar in the testing env days ago - a package upgrade within 0.55 killed all _windows_ guests and one of tens of linux guests running on top of rbd. Unfortunately I had no debug sessions at that moment and I have only the tail of the log from qemu:

terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
what(): buffer::end_of_buffer

I'm blaming the ldconfig action from librbd, because nothing else would cause such destruction of running processes - maybe I'm wrong. Thanks!

WBR, Andrey

crashes-2013-01-01.tgz Description: GNU Zip compressed data
Re: 0.55 crashed during upgrade to bobtail
On Tue, Jan 1, 2013 at 9:49 PM, Andrey Korolyov and...@xdel.ru wrote: Hi, All osds in the dev cluster died shortly after an upgrade (package-only, i.e. a binary upgrade, without even restarting the running processes); please see the attached file. Was: 0.55.1-356-g850d1d5. Upgraded to: the 0.56 tag. The only difference is the gcc version and corresponding libstdc++: 4.6 on the buildhost and 4.7 on the cluster. Of course I may do a rollback, and the problem will most probably go away, but it seems there should be some fix. Also I saw something similar in the testing env days ago - a package upgrade within 0.55 killed all _windows_ guests and one of tens of linux guests running on top of rbd. Unfortunately I had no debug sessions at that moment and I have only the tail of the log from qemu: terminate called after throwing an instance of 'ceph::buffer::end_of_buffer' what(): buffer::end_of_buffer I'm blaming the ldconfig action from librbd, because nothing else would cause such destruction of running processes - maybe I'm wrong. Thanks! WBR, Andrey

Sorry, I'm not able to reproduce the crash after a rollback, and the traces were incomplete due to lack of disk space at the specified core location, so please disregard it.
Re: 0.55 crashed during upgrade to bobtail
On Wed, Jan 2, 2013 at 12:16 AM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Jan 1, 2013 at 9:49 PM, Andrey Korolyov and...@xdel.ru wrote: Hi, All osds in the dev cluster died shortly after an upgrade (package-only, i.e. a binary upgrade, without even restarting the running processes); please see the attached file. Was: 0.55.1-356-g850d1d5. Upgraded to: the 0.56 tag. The only difference is the gcc version and corresponding libstdc++: 4.6 on the buildhost and 4.7 on the cluster. Of course I may do a rollback, and the problem will most probably go away, but it seems there should be some fix. Also I saw something similar in the testing env days ago - a package upgrade within 0.55 killed all _windows_ guests and one of tens of linux guests running on top of rbd. Unfortunately I had no debug sessions at that moment and I have only the tail of the log from qemu: terminate called after throwing an instance of 'ceph::buffer::end_of_buffer' what(): buffer::end_of_buffer I'm blaming the ldconfig action from librbd, because nothing else would cause such destruction of running processes - maybe I'm wrong. Thanks! WBR, Andrey

Sorry, I'm not able to reproduce the crash after a rollback, and the traces were incomplete due to lack of disk space at the specified core location, so please disregard it.

Ahem, finally it seems that the osd process is stumbling on something on the fs, because my other environments were also able to reproduce the crash once, but reproduction is not possible once a new osd process has started over the existing filestore (an offline version rollback and another try at the online upgrade went fine). And the backtraces in the first message are complete, at least 1 and 2, despite the lack of space the first time - I have received a couple of coredumps whose traces look exactly the same as 2.
Re: Improving responsiveness of KVM guests on Ceph storage
On Mon, Dec 31, 2012 at 3:12 AM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi Andrey,
>
> Thanks for your reply!
>
>> You may try to play with SCHED_RT ... by adding small RT slices via the ``cpu'' cgroup to the vcpu/emulator threads ...
>
> I'm not quite sure I understand your suggestion. Do you mean that you set the process priority to real-time on each qemu-kvm process, and then use cgroups cpu.rt_runtime_us / cpu.rt_period_us to restrict the amount of CPU time those processes can receive?
>
> I'm not sure how that would apply here, as I have only one qemu-kvm process, and it is not non-responsive because of a lack of allocated CPU time slices - but rather because some I/Os take a long time to complete, and other I/Os apparently have to wait for those to complete.

Yep, I meant exactly that. Of course it will not help with only one VM; RT may help in more concurrent cases :)

>> Of course, some Ceph tuning like writeback cache and a large journal may help you too ...
>
> I have been considering the journal as something where I could improve performance by tweaking the setup. I have set aside 10 GB of space for the journal, but I'm not sure if this is too little - or if the size really doesn't matter that much when it is on the same mdraid as the data itself.
>
> Is there a tool that can tell me how much of my journal space is actually actively being used? I.e. I'm looking for something that could tell me if increasing the size of the journal, or placing it on a separate (SSD) disk, could solve my problem.

If I understood right, you have an md device holding both journal and filestore? What type of raid do you have here? You will certainly need a separate device for the journal (for experimental purposes, a fast disk may be enough), and if you have any type of redundant storage under the filestore partition, you may also change it to simple RAID0, or even separate disks, and create one osd over every disk (mind the journal device's throughput, which must be equal to the sum of the speeds of all filestore devices - for a commodity-type SSD that sums to two 100MB/s disks, for example). I have a ``pure'' disk setup in my dev environment, built on quite old desktop-class machines, and one rsync process may hang a VM for a short time despite a dedicated SATA disk for the journal.

> How do I change the size of the writeback cache when using qemu-kvm like I do? Does setting rbd cache size in ceph.conf have any effect on qemu-kvm, where the drive is defined as:
>
> format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio

What cache_size/max_dirty values do you have inside ceph.conf, and which qemu version do you use? The default values are good enough to prevent pushing I/O spikes down to the physical storage, but for long I/O-intensive tasks, increasing the cache may help the OS to align writes more smoothly. Also, you don't need to set rbd_cache explicitly in the disk config on qemu 1.2 and newer releases; for older ones, http://lists.gnu.org/archive/html/qemu-devel/2012-05/msg02500.html should be applied.
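For reference, the cache knobs discussed above live in the [client] section of ceph.conf on the KVM host; the values below only illustrate raising the defaults (32 MB cache, 24 MB max dirty), they are not a recommendation:

[client]
    rbd cache = true
    rbd cache size = 67108864        # 64 MB; default is 32 MB
    rbd cache max dirty = 50331648   # 48 MB; default is 24 MB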
Re: Improving responsiveness of KVM guests on Ceph storage
On Mon, Dec 31, 2012 at 2:58 PM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi Andrey,
>
>> If I understood right, you have an md device holding both journal and filestore? What type of raid do you have here?
>
> Yes, the same md device holds both journal and filestore. It is a raid5.

Ahem, then of course you need to reassemble it into something faster :)

>> You will certainly need a separate device for the journal (for experimental purposes, a fast disk may be enough)
>
> Is there a way to tell if the journal is the bottleneck without actually adding such an extra device?

In theory, yes - but your setup is already dying under a high number of write seeks, so it may not be necessary. Also, I don't see a right way to measure the bottleneck when one disk device is used for both filestore and journal - with separate devices you can measure the maximum values using fio and compare them to the values calculated from /proc/diskstats, but the ``all-in-one'' case is obviously hard to measure, even if you were able to log writes to the journal file and the filestore files separately without significant overhead.

>> ... you may also change it to simple RAID0, or even separate disks, and create one osd over every disk ...
>
> I have only 3 OSDs with 4 disks each. I was afraid that it would be too brittle as a RAID0, and if I created separate OSDs for each disk, it would stall the file system due to recovery if a server crashes.

No, it isn't too bad in most cases. The recovery process does not affect operations on the rbd storage, except for a small performance degradation, so you may split your raid setup into lightweight R0 sets. It depends: on a plain SATA controller, a software R0 under one OSD will do a better job than 2 separate OSDs having one disk each; on a cache-backed controller, separate OSDs are preferable, unless the controller cannot align writes due to the overall write bandwidth.

>> What cache_size/max_dirty values do you have inside ceph.conf
>
> I haven't set them explicitly, so I imagine the cache_size is 32 MB and the max_dirty is 24 MB.
>
>> and which qemu version do you use?
>
> Using the default 0.15 version in Fedora 16.
>
>> ... for long I/O-intensive tasks, increasing the cache may help the OS to align writes more smoothly. Also, you don't need to set rbd_cache explicitly in the disk config on qemu 1.2 and newer releases; for older ones http://lists.gnu.org/archive/html/qemu-devel/2012-05/msg02500.html should be applied.
>
> I read somewhere that I needed to enable it specifically for older qemu-kvm versions, which I did like this:
>
> format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
>
> However now I read in the docs for qemu-rbd that it needs to be set like this:
>
> format=raw,file=rbd:data/squeeze:rbd_cache=true,cache=writeback
>
> I'm not sure if 1 and true are interpreted the same way? I'll try using true and see if I get any noticeable changes in behaviour.
>
> The link you sent me seems to indicate that I need to compile my own version of qemu-kvm to be able to test this?

No, there are no significant changes from 0.15 to the current version, and your options will work just fine. So the general recommendation is to remove redundancy from your disk backend and then move the journal out to a separate disk or ssd.
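For reference, moving the journal out is an osd-section setting in ceph.conf; a minimal sketch, assuming /dev/sdg1 is a partition on the dedicated disk or SSD (the device name is a placeholder):

[osd]
    osd journal = /dev/sdg1              # dedicated partition
    # or, for a file-based journal:
    # osd journal = /var/lib/ceph/osd/ceph-$id/journal
    # osd journal size = 10240           # MB; used when the journal is a plain file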
Re: Striped images and cluster misbehavior
On Sun, Dec 30, 2012 at 10:56 PM, Samuel Just sam.j...@inktank.com wrote:
> Sorry for the delay. A quick look at the log doesn't show anything obvious... Can you elaborate on how you caused the hang?
> -Sam

I am sorry for all this noise; the issue has almost certainly been triggered by some bug in the Infiniband switch firmware, because a per-port reset was able to solve the ``wrong mark'' problem - at least, it hasn't shown up for a week now. The problem took almost two days to resolve - all possible connectivity tests displayed no timeouts or drops that could cause the wrong marks. Finally, I started playing with TCP settings and found that ipv4.tcp_low_latency raises the probability of a ``wrong mark'' event several times over when enabled - so the set of possible causes quickly collapsed to a media-only problem, and I fixed it soon after.

> On Wed, Dec 19, 2012 at 3:53 AM, Andrey Korolyov and...@xdel.ru wrote:
>> Please take a look at the log below; this is a slightly different bug - both osd processes on the node were stuck eating all available cpu until I killed them. This can be reproduced by doing parallel exports of different images from the same client IP, using either ``rbd export'' or API calls - after a couple of wrong ``downs'', osd.19 and osd.27 finally got stuck. What is more interesting, 10.5.0.33 holds the most hungry set of virtual machines, constantly eating four of twenty-four HT cores, and this node fails almost always. The underlying fs is XFS, ceph version gf9d090e. Quite possibly my previous reports are about side effects of this problem.
>>
>> http://xdel.ru/downloads/ceph-log/osd-19_and_27_stuck.log.gz
>>
>> and timings for the monmap; the logs are from different hosts, so they may have a time shift of tens of milliseconds:
>>
>> http://xdel.ru/downloads/ceph-log/timings-crash-osd_19_and_27.txt
>>
>> Thanks!
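For anyone wanting to reproduce the correlation, the toggle in question is a standard sysctl; checking and flipping it looks like this:

$ sysctl net.ipv4.tcp_low_latency
net.ipv4.tcp_low_latency = 0
$ sysctl -w net.ipv4.tcp_low_latency=1    # enabling it raised the ``wrong mark'' rate in the tests above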
Re: Improving responsiveness of KVM guests on Ceph storage
On Sun, Dec 30, 2012 at 9:05 PM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi guys,
>
> I'm testing Ceph as storage for KVM virtual machine images and found an inconvenience that I am hoping it is possible to find the cause of.
>
> I'm running a single KVM Linux guest on top of Ceph storage. In that guest I run rsync to download files from the internet. When rsync is running, the guest will seemingly stall and run very slowly. For example, if I log in via SSH to the guest and use the command prompt, nothing will happen for a long period (30+ seconds), then it processes a few typed characters, then it blocks for another long period of time, then processes a bit more, etc.
>
> I was hoping to be able to tweak the system so that it runs more like when using conventional storage - i.e. perhaps the rsync won't be super fast, but the machine will be equally responsive all the time. I'm hoping that you can provide some hints on how to best benchmark or test the system to find the cause of this?
>
> The ceph OSDs periodically log these two messages, which I do not fully understand:
>
> 2012-12-30 17:07:12.894920 7fc8f3242700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
> 2012-12-30 17:07:13.599126 7fc8cbfff700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
>
> Is this to be expected when the system is in use, or does it indicate that something is wrong?
>
> Ceph also logs messages such as this:
>
> 2012-12-30 17:07:36.932272 osd.0 10.0.0.1:6800/9157 286340 : [WRN] slow request 30.751940 seconds old, received at 2012-12-30 17:07:06.180236: osd_op(client.4705.0:16074961 rb.0.11b7.4a933baa.000c188f [write 532480~4096] 0.f2a63fe) v4 currently waiting for sub ops
>
> My setup: 3 servers running Fedora 17 with Ceph 0.55.1 from RPM. Each server runs one osd and one mon. One of the servers also runs an mds. The backing file system is btrfs stored on md-raid. The journal is stored on the same SATA disks as the rest of the data. Each server has 3 bonded gigabit/sec NICs. One server running Fedora 16 with qemu-kvm has a gigabit/sec NIC connected to the same network as the Ceph servers, and a gigabit/sec NIC connected to the Internet.
>
> Disk is mounted with:
>
> -drive format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
>
> iostat on the KVM guest gives:
>
> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>            0,00   0,00    0,00  100,00   0,00   0,00
>
> Device: rrqm/s wrqm/s  r/s  w/s rsec/s wsec/s avgrq-sz avgqu-sz   await   svctm %util
> vda       0,00   1,40 0,10 0,30   0,80  13,60    36,00     1,66 2679,25 2499,75 99,99
>
> Top on the KVM host shows 90% CPU idle and 0.0% I/O waiting.
>
> iostat on an OSD gives:
>
> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>            0,13   0,00    1,50   15,79   0,00  82,58
>
> Device: rrqm/s wrqm/s   r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
> sda     240,70 441,20 33,00  42,70 1122,40 1961,80    81,48    14,45 164,42  319,14   44,85  6,63 50,22
> sdb     299,10 393,10 33,90  38,40 1363,60 1720,60    85,32    13,55 171,32  316,21   43,41  6,55 47,39
> sdc     268,50 441,60 28,80  45,40 1191,60 1977,00    85,41    19,08 159,39  345,98   41,02  6,56 48,69
> sdd     255,50 445,50 30,20  45,00 1150,40 1975,80    83,14    18,18 155,97  338,90   33,20  6,95 52,23
> md0       0,00   0,00  1,20 132,70    4,80 4086,40    61,11     0,00   0,00    0,00    0,00  0,00  0,00
>
> The figures are similar on all three OSDs.
>
> I am thinking that one possible cause could be that the journal is stored on the same disks as the rest of the data, but I don't know how to benchmark whether this is actually the case (?)
>
> Thanks for any help or advice you can offer!
Hi Jens,

You may try to play with SCHED_RT. I have found it hard to use myself, but you can achieve your goal by adding small RT slices via the ``cpu'' cgroup to the vcpu/emulator threads; it dramatically increases overall VM responsiveness. I eventually dropped it because the RT scheduler is a very strange thing - it may cause an endless lockup on a disk operation during heavy load, or produce a forever-stuck ``kworker'' on some cores if you kill a VM which has separate RT slices for its vcpu threads. Of course, some Ceph tuning like a writeback cache and a large journal may help you too; I'm speaking primarily of the VM's performance by itself.
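A minimal sketch of the cgroup approach described above, assuming a cgroup v1 ``cpu'' hierarchy mounted at /sys/fs/cgroup/cpu and a hypothetical vcpu thread id 4242 - the slice sizes are illustrative only:

# give the vcpu thread a small guaranteed RT slice: 5ms out of every 100ms period
$ mkdir /sys/fs/cgroup/cpu/vm0
$ echo 100000 > /sys/fs/cgroup/cpu/vm0/cpu.rt_period_us
$ echo 5000   > /sys/fs/cgroup/cpu/vm0/cpu.rt_runtime_us
$ echo 4242   > /sys/fs/cgroup/cpu/vm0/tasks
$ chrt -f -p 10 4242    # switch the thread to SCHED_FIFO, priority 10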
Re: Striped images and cluster misbehavior
On Mon, Dec 17, 2012 at 2:36 AM, Andrey Korolyov and...@xdel.ru wrote:
> [full report and osd failure log snipped - see the original ``Striped images and cluster misbehavior'' message below]
Re: Slow requests
On Sun, Dec 16, 2012 at 5:59 PM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi,
>
> My log is filling up with warnings about a single slow request that has been around for a very long time:
>
> osd.1 10.0.0.2:6800/900 162926 : [WRN] 1 slow requests, 1 included below; oldest blocked for 84446.312051 secs
> osd.1 10.0.0.2:6800/900 162927 : [WRN] slow request 84446.312051 seconds old, received at 2012-12-15 15:27:56.891437: osd_sub_op(client.4528.0:19602219 0.fe 3807b5fe/rb.0.11b7.4a933baa.0008629e/head//0 [] v 53'185888 snapset=0=[]:[] snapc=0=[]) v7 currently started
>
> How can I identify the cause of this, and how can I cancel this request? I'm running Ceph on Fedora 17 using the latest RPMs available from ceph.com (0.52-6).
>
> Thanks in advance,

Hi Jens,

Please take a look at this thread:

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10843

It seems you'll need newer rpms to get rid of this.
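For digging into a wedged request like this, the osd admin socket can be queried directly; the socket path below is the packaged default and may differ:

$ ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok help                  # lists the commands this build supports
$ ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight   # on builds that have it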
Striped images and cluster misbehavior
Hi,

After the recent switch to the default ``--stripe-count 1'' on image upload, I have observed a strange thing - a single import or deletion of a striped image may literally turn off the entire cluster temporarily (see the log below). Of course the next issued osdmap fixes the situation, but all in-flight operations experience a short freeze. This issue appears randomly on some import or delete operations; I have not seen any other operation types causing it. Even if the nature of this bug lies completely in the client-osd interaction, maybe ceph should develop some foolproof measures, even when the complaining client has admin privileges? Almost for sure this can be reproduced within teuthology with rwx rights on both osds and mons at the client. And as far as I can see, there is no problem on either the physical or the protocol layer of the dedicated cluster interface on the client machine.

2012-12-17 02:17:03.691079 mon.0 [INF] pgmap v2403268: 15552 pgs: 15552 active+clean; 931 GB data, 2927 GB used, 26720 GB / 29647 GB avail
2012-12-17 02:17:04.693344 mon.0 [INF] pgmap v2403269: 15552 pgs: 15552 active+clean; 931 GB data, 2927 GB used, 26720 GB / 29647 GB avail
2012-12-17 02:17:05.695742 mon.0 [INF] pgmap v2403270: 15552 pgs: 15552 active+clean; 931 GB data, 2927 GB used, 26720 GB / 29647 GB avail
2012-12-17 02:17:05.991900 mon.0 [INF] osd.0 10.5.0.10:6800/4907 failed (3 reports from 1 peers after 2012-12-17 02:17:29.991859 >= grace 20.00)
2012-12-17 02:17:05.992017 mon.0 [INF] osd.1 10.5.0.11:6800/5011 failed (3 reports from 1 peers after 2012-12-17 02:17:29.991995 >= grace 20.00)
2012-12-17 02:17:05.992139 mon.0 [INF] osd.2 10.5.0.12:6803/5226 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992110 >= grace 20.00)
2012-12-17 02:17:05.992240 mon.0 [INF] osd.3 10.5.0.13:6803/6054 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992224 >= grace 20.00)
2012-12-17 02:17:05.992330 mon.0 [INF] osd.4 10.5.0.14:6803/5792 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992317 >= grace 20.00)
2012-12-17 02:17:05.992420 mon.0 [INF] osd.5 10.5.0.15:6803/5564 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992405 >= grace 20.00)
2012-12-17 02:17:05.992515 mon.0 [INF] osd.7 10.5.0.17:6803/5902 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992501 >= grace 20.00)
2012-12-17 02:17:05.992607 mon.0 [INF] osd.8 10.5.0.10:6803/5338 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992591 >= grace 20.00)
2012-12-17 02:17:05.992702 mon.0 [INF] osd.10 10.5.0.12:6800/5040 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992686 >= grace 20.00)
2012-12-17 02:17:05.992793 mon.0 [INF] osd.11 10.5.0.13:6800/5748 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992778 >= grace 20.00)
2012-12-17 02:17:05.992891 mon.0 [INF] osd.12 10.5.0.14:6800/5459 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992875 >= grace 20.00)
2012-12-17 02:17:05.992980 mon.0 [INF] osd.13 10.5.0.15:6800/5235 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992966 >= grace 20.00)
2012-12-17 02:17:05.993081 mon.0 [INF] osd.16 10.5.0.30:6800/5585 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993065 >= grace 20.00)
2012-12-17 02:17:05.993184 mon.0 [INF] osd.17 10.5.0.31:6800/5578 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993169 >= grace 20.00)
2012-12-17 02:17:05.993274 mon.0 [INF] osd.18 10.5.0.32:6800/5097 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993260 >= grace 20.00)
2012-12-17 02:17:05.993367 mon.0 [INF] osd.19 10.5.0.33:6800/5109 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993352 >= grace 20.00)
2012-12-17 02:17:05.993464 mon.0 [INF] osd.20 10.5.0.34:6800/5125 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993448 >= grace 20.00)
2012-12-17 02:17:05.993554 mon.0 [INF] osd.21 10.5.0.35:6800/5183 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993538 >= grace 20.00)
2012-12-17 02:17:05.993644 mon.0 [INF] osd.22 10.5.0.36:6800/5202 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993628 >= grace 20.00)
2012-12-17 02:17:05.993740 mon.0 [INF] osd.23 10.5.0.37:6800/5252 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993725 >= grace 20.00)
2012-12-17 02:17:05.993831 mon.0 [INF] osd.24 10.5.0.30:6803/5758 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993816 >= grace 20.00)
2012-12-17 02:17:05.993924 mon.0 [INF] osd.25 10.5.0.31:6803/5748 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993908 >= grace 20.00)
2012-12-17 02:17:05.994018 mon.0 [INF] osd.26 10.5.0.32:6803/5275 failed (3 reports from 1 peers after 2012-12-17 02:17:29.994002 >= grace 20.00)
2012-12-17 02:17:06.105315 mon.0 [INF] osdmap e24204: 32 osds: 4 up, 32 in
2012-12-17 02:17:06.051291 osd.6 [WRN] 1 slow requests, 1 included below; oldest blocked for 30.947080 secs
2012-12-17 02:17:06.051299 osd.6 [WRN] slow request 30.947080 seconds old, received at 2012-12-17 02:16:35.042711:
Re: Slow requests
On Mon, Dec 17, 2012 at 2:42 AM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi Andrey,
>
> Thanks for your reply!
>
>> Please take a look at this thread: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10843
>
> I took your advice and restarted each of my three osds individually. The first two restarted within a minute or two. The last one took 20 minutes to restart (?)
>
> Afterwards the slow request had disappeared, so it did seem to work!
>
>> It seems you'll need newer rpms to get rid of this.
>
> Are newer RPMs available for download somewhere, or do I need to compile my own? I have searched the ceph.com site several times in the past, but I only find older versions.

Oh, sorry, I may have misguided you - the real solution is the patch from Sam; restarts help only in the short term, and you won't be able to check some pgs for consistency until the patch has been applied - they'll hang on scrub every time.
misdirected client messages
Hi,

Today, during a planned kernel upgrade, one of the osds (which I had not touched yet) started complaining about a ``misdirected client'':

2012-12-12 21:22:59.107648 osd.20 [WRN] client.2774043 10.5.0.33:0/1013711 misdirected client.2774043.0:114 pg 5.ad140d42 to osd.20 in e23834, client e23834 pg 5.542 features 67108863

(it remained the same, except for the timestamp, for at least ten minutes and then disappeared)

The last two of the three nodes had been rebooted by that time; the primary was still not rebooted, and osd.20 had not been rebooted either at that moment:

5.542 124 0 0 0 511705106 142753 142753 active+clean 2012-12-12 17:07:05.761509 23834'1022923 23409'2958324 [7,11,5] [7,11,5] 23452'839524 2012-12-08 21:08:42.076364 23452'839524 2012-12-08 21:08:42.076365

As far as I can see from the bug tracker, these messages shouldn't appear, at least in recent versions. I'm using a non-default TCP congestion algorithm, so it is clearly not a result of TCP mistuning between kernel versions. I'm running 0.55 gf9d090e, and the kernel was upgraded within the 3.6 branch. Unfortunately, the message disappeared too soon, before I could pin it down and change the logging level on all involved OSD daemons. Since there is absolutely no harm, may I ask for suggestions on repeating / raising the probability of this bug, so I can do the appropriate logging?
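For next time, the log levels can be raised at runtime without a daemon restart via injectargs; a sketch against the involved osd (on these releases the spelling below should work; newer ones also accept ``ceph tell osd.N ...''):

$ ceph osd tell 20 injectargs '--debug-ms 1 --debug-osd 20'
$ ceph osd tell \* injectargs '--debug-ms 1 --debug-osd 20'    # or all osds at once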
Re: Hangup during scrubbing - possible solutions
On Sat, Dec 1, 2012 at 9:07 AM, Samuel Just sam.j...@inktank.com wrote:
> Just pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7. Let me know if it persists. Thanks for the logs!
> -Sam

Very nice, thanks! There is one corner case - the ``on-the-fly'' upgrade works well only if your patch is applied to the ``generic'' 0.54 by cherry-picking; an online upgrade from the tagged 0.54 to next-dccf6ee causes the osd processes on the upgraded nodes to fail shortly after restart, with the backtrace below. An offline upgrade, i.e. over a shutdown of the entire cluster, works fine, so the only problem is preserving the running state of the cluster over the upgrade, which may confuse some users (at least ones who run production suites).

http://xdel.ru/downloads/ceph-log/bt-recovery-sj-patch.out.gz

> On Fri, Nov 30, 2012 at 2:04 PM, Samuel Just sam.j...@inktank.com wrote:
>> Hah! Thanks for the log, it's our handling of active_pushes. I'll have a patch shortly. Thanks!
>> -Sam
>
> [rest of the quoted thread snipped - see the earlier messages below]
Re: Hangup during scrubbing - possible solutions
http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
http://xdel.ru/downloads/ceph-log/cluster-w.log.gz

Here, please. I have initiated a deep-scrub of osd.1, which led to forever-stuck I/O requests in a short time (a plain scrub will do the same). The second log may be useful for proper timestamps, as seeks in the original may take a long time. The osd processes on the specific node were restarted twice - at the beginning, to be sure all config options were applied, and at the end, to do the same plus get rid of the stuck requests.

On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just sam.j...@inktank.com wrote:
> If you can reproduce it again, what we really need are the osd logs from the acting set of a pg stuck in scrub, with debug osd = 20, debug ms = 1, debug filestore = 20. Thanks,
> -Sam
>
> [rest of the quoted thread snipped - see the messages below]
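Until such a scheduler exists, the closest knobs are the scrub interval options in ceph.conf (the deep-scrub interval only appeared around bobtail); pushing the intervals up effectively defers automated scrubbing - the values below are illustrative only:

[osd]
    osd scrub min interval = 86400      # seconds; don't scrub a pg more often than daily
    osd scrub max interval = 604800     # force a scrub at least weekly
    osd deep scrub interval = 604800    # newer releases only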
Re: parsing in the ceph osd subsystem
On Thu, Nov 29, 2012 at 8:34 PM, Sage Weil s...@inktank.com wrote:
> On Thu, 29 Nov 2012, Andrey Korolyov wrote:
>> $ ceph osd down -
>> osd.0 is already down
>> $ ceph osd down ---
>> osd.0 is already down
>>
>> the same for ``+'', ``/'', ``%'' and so on - I think that for the osd subsystem the ceph cli should explicitly work only with positive integers plus zero, refusing all other input.
>
> Which branch is this? This parsing is cleaned up in the latest next/master.

It was produced by the 0.54 tag. I have built dd3a24a647d0b0f1153cf1b102ed1f51d51be2f2 today and the problem has gone (except for parsing ``-0'' as 0, and ``0/001'' as 0 and 1 respectively).
Re: endless flying slow requests
On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
> Also, these clusters aren't mixed argonaut and next, are they? (Not that that shouldn't work, but it would be a useful data point.)
> -Sam
>
> On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
>> Did you observe hung io along with that error? Both sub_op_commit and sub_op_applied have happened, so the sub_op_reply should have been sent back to the primary. This looks more like a leak. If you also observed hung io, then it's possible that the problem is occurring between the sub_op_applied event and the response.
>> -Sam

It is relatively easy to check whether one of the client VMs has locked one or more cores in iowait or simply hangs, so yes, these ops are related to real commit operations and they are hung. I'm using an all-new 0.54 cluster, without mixing of course. Has everyone who hit this bug readjusted the cluster before the bug showed itself (say, within a day)?

> [rest of the quoted thread snipped - see the message below]
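For completeness, enabling the admin socket for librbd-backed VMs as discussed in this thread is a [client] setting on the KVM host; the path template and the client name in the example invocation are illustrative, and the dump command name follows Sage's note below:

[client]
    admin socket = /var/run/ceph/$name.$pid.asok

$ ceph --admin-daemon /var/run/ceph/client.admin.12345.asok help
$ ceph --admin-daemon /var/run/ceph/client.admin.12345.asok objecter_dump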
Re: endless flying slow requests
On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
> Hi Stefan,
>
> On Thu, 15 Nov 2012, Sage Weil wrote:
>> On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
>>> Am 14.11.2012 15:59, schrieb Sage Weil:
>>>> Hi Stefan,
>>>> It would be nice to confirm that no clients are waiting on replies for these requests; currently we suspect that the OSD request tracking is the buggy part. If you query the OSD admin socket you should be able to dump requests and see the client IP, and then query the client. Is it librbd? In that case you likely need to change the config so that it is listening on an admin socket ('admin socket = path').
>>>
>>> Yes it is. So i have to specify admin socket at the KVM host?
>>
>> Right. IIRC the disk line is a ; (or \;) separated list of key/value pairs.
>>
>>> How do i query the admin socket for requests?
>>
>> ceph --admin-daemon /path/to/socket help
>> ceph --admin-daemon /path/to/socket objecter_dump (i think)
>
> Were you able to reproduce this?
>
> Thanks!
> sage

Meanwhile, I did. :) Such requests will always be created if you have restarted, or marked an osd out and then back in, and a scrub didn't happen in the meantime (after such an operation and before the request's arrival). What is more interesting, the hangup happens not exactly at the time of the operation, but tens of minutes later.

{ description: osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[]),
  received_at: 2012-11-28 03:54:43.094151,
  age: 27812.942680,
  duration: 2.676641,
  flag_point: started,
  events: [
    { time: 2012-11-28 03:54:43.094222, event: waiting_for_osdmap},
    { time: 2012-11-28 03:54:43.386890, event: reached_pg},
    { time: 2012-11-28 03:54:43.386894, event: started},
    { time: 2012-11-28 03:54:43.386973, event: commit_queued_for_journal_write},
    { time: 2012-11-28 03:54:45.360049, event: write_thread_in_journal_buffer},
    { time: 2012-11-28 03:54:45.586183, event: journaled_completion_queued},
    { time: 2012-11-28 03:54:45.586262, event: sub_op_commit},
    { time: 2012-11-28 03:54:45.770792, event: sub_op_applied}]}]}

> On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
>> Hello list,
>>
>> i see this several times. Endless flying slow requests. And they never stop until i restart the mentioned osd.
>>
>> 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31789.858457 secs
>> 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>> 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31790.858646 secs
>> 2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>> 2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31791.858827 secs
>> 2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>> 2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31792.858971 secs
>> 2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>> 2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31793.859124 secs
>> 2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>>
>> When i now restart osd 24 they go away and everything is fine again.
>>
>> Stefan
Re: Hangup during scrubbing - possible solutions
On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil s...@inktank.com wrote:
> On Thu, 22 Nov 2012, Andrey Korolyov wrote:
>> [original problem description snipped - see the ``Hangup during scrubbing - possible solutions'' message below]
>>
>> First of all, I'll be happy to help solve this problem by providing logs.
>
> If you can reproduce this behavior with 'debug osd = 20' and 'debug ms = 1' logging on the OSD, that would be wonderful!

I have tested a slightly different recovery flow, please see below. Since there is no real harm like frozen I/O, the placement groups were merely stuck forever in the active+clean+scrubbing state, until I restarted all osds (end of the log):

http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz

- start the healthy cluster
- start persistent clients
- add another host with a pair of OSDs, let them take part in the data placement
- wait for the data to rearrange
- [22:06 timestamp] mark the OSDs out, or simply kill them and wait (since I have a half-hour delay on readjustment in such a case, I did ``ceph osd out'' manually)
- watch the data rearrange again
- [22:51 timestamp] when it ends, start a manual rescrub; a non-zero number of placement groups end up in the active+clean+scrubbing state at the end of the process, and they will stay in that state forever until something happens

After that, I can restart the osds one by one, if I want to get rid of the scrubbing states immediately, and then do a deep-scrub (if I don't, those states will return at the next ceph self-scrub), or do a per-osd deep-scrub if I have a lot of time. The case I described in the previous message took place when I removed an osd from the data placement which had existed at the moment the client(s) started, and indeed it is more harmful than the current one (frozen I/O hangs the entire guest, for example). Since testing this flow took a lot of time, I'll send the logs related to that case tomorrow.

>> The second question is not directly related to this problem, but I have thought about it for a long time - are there planned features to control the scrub process more precisely, e.g. a pg scrub rate or scheduled scrubs, instead of the current set of timeouts, which are of course not very predictable as to when they run?
>
> Not yet. I would be interested in hearing what kind of control/config options/whatever you (and others) would like to see!

It would be awesome to have a deterministic scheduler, or at least an option to disable automated scrubbing, since it is not very deterministic in time, and a deep-scrub eats a lot of I/O when the command is issued against an entire OSD. Rate limiting is not in first place - at least it can be recreated in an external script - but for those who prefer to leave control to Ceph it may be very useful.

Thanks!
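A quick way to watch for the stuck state described above from the cli:

$ ceph pg dump | grep -c 'active+clean+scrubbing'   # count of pgs lingering in scrubbing
$ ceph -s                                           # the summary line also shows scrubbing pgs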
Re: 'zombie snapshot' problem
On Thu, Nov 22, 2012 at 2:05 AM, Josh Durgin josh.dur...@inktank.com wrote:
> On 11/21/2012 04:50 AM, Andrey Korolyov wrote:
>> Hi,
>>
>> Somehow I have managed to produce an unkillable snapshot, which does not allow removing itself or its parent image:
>>
>> $ rbd snap purge dev-rack0/vm2
>> Removing all snapshots: 100% complete...done.
>
> I see one bug with 'snap purge' ignoring the return code when removing snaps. I just fixed this in the next branch. It's probably getting the same error as 'rbd snap rm' below. Could you post the output of:
>
> rbd snap purge dev-rack0/vm2 --debug-ms 1 --debug-rbd 20
>
>> [rest of the transcript snipped - see the original message below]
>>
>> Meanwhile, ``rbd ls -l dev-rack0'' segfaults with an attached log. Is there any reliable way to kill the problematic snap?
>
> From this log it looks like vm2 used to be a clone, and the snapshot vm2.snap-yxf was taken before it was flattened. Later, the parent of vm2.snap-yxf was deleted. Is this correct?

I have attached the log you asked for, hope it will be useful. Here are two possible flows, with the snapshot created before and during flatten.

Completely linear flow:

$ rbd cp install/debian7 dev-rack0/testimg
Image copy: 100% complete...done.
$ rbd snap create --snap test1 dev-rack0/testimg
$ rbd snap clone --snap test1 dev-rack0/testimg dev-rack0/testimg2
rbd: error parsing command 'clone'
$ rbd snap protect --snap test1 dev-rack0/testimg
$ rbd clone --snap test1 dev-rack0/testimg dev-rack0/testimg2
$ rbd snap create --snap test2 dev-rack0/testimg2
$ rbd flatten dev-rack0/testimg2
Image flatten: 100% complete...done.
$ rbd snap unprotect --snap test1 dev-rack0/testimg
2012-11-22 15:11:03.446892 7ff9fb7c1780 -1 librbd: snap_unprotect: can't unprotect; at least 1 child(ren) in pool dev-rack0
rbd: unprotecting snap failed: (16) Device or resource busy
$ rbd snap purge dev-rack0/testimg2
Removing all snapshots: 100% complete...done.
$ rbd snap ls dev-rack0/testimg2
$ rbd snap unprotect --snap test1 dev-rack0/testimg

Snapshot created over an image with ``flatten'' in progress:

$ rbd snap create --snap test3 dev-rack0/testimg
$ rbd snap protect --snap test3 dev-rack0/testimg
$ rbd clone --snap test3 dev-rack0/testimg dev-rack0/testimg3
$ rbd flatten dev-rack0/testimg3
[here was executed: rbd snap create --snap test43 dev-rack0/testimg3]
Image flatten: 100% complete...done.
$ rbd snap unprotect --snap test3 dev-rack0/testimg
$ rbd snap ls dev-rack0/testimg3
SNAPID NAME   SIZE
   323 test43 640 MB
$ rbd snap purge dev-rack0/testimg3
Removing all snapshots: 100% complete...done.
$ rbd snap ls dev-rack0/testimg3
SNAPID NAME   SIZE
   323 test43 640 MB
$ rbd snap rm --snap test43 dev-rack0/testimg3
rbd: failed to remove snapshot: (2) No such file or directory

Hooray, the problem is found! For now I'll avoid it by treating the flatten state as exclusive over the image.

>> ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
>
> It was a bug in 0.53 that protected snapshots could be deleted.
>
> Josh

[attachment: snap.txt.gz]
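Given the reproducer above, a sketch of the ordering that avoids the zombie state on affected versions - image names are hypothetical - is simply to let flatten finish before taking any snapshot of the clone:

$ rbd clone --snap test3 dev-rack0/testimg dev-rack0/img4
$ rbd flatten dev-rack0/img4           # wait for 100% before snapshotting
$ rbd snap create --snap s1 dev-rack0/img4
$ rbd snap purge dev-rack0/img4        # now completes for real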
Hangup during scrubbing - possible solutions
Hi,

In recent versions, Ceph introduces some unexpected behavior for permanent connections (VM or kernel clients) - after crash recovery, I/O will hang on the next planned scrub in the following scenario:

- launch a bunch of clients doing non-intensive writes,
- lose one or more osds, mark them down, wait for recovery completion,
- do a slow scrub, e.g. scrubbing one osd per 5m inside a bash script, or wait for ceph to do the same,
- observe a rising number of pgs stuck in the active+clean+scrubbing state (they took over the primary role from ones which were on the killed osd, and almost surely they were being written to at the time of the crash),
- some time later, clients will hang hard, and the ceph log reports stuck (old) I/O requests.

The only way to bring the clients back without losing their I/O state is a per-osd restart, which also helps to get rid of the active+clean+scrubbing pgs.

First of all, I'll be happy to help solve this problem by providing logs.

The second question is not directly related to this problem, but I have thought about it for a long time - are there planned features to control the scrub process more precisely, e.g. a pg scrub rate or scheduled scrubs, instead of the current set of timeouts, which are of course not very predictable as to when they run?

Thanks!
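For reference, the ``one osd per 5m'' slow scrub mentioned above can be as simple as this sketch (the osd id range is a placeholder):

#!/bin/bash
# kick a deep-scrub on each osd in turn, spaced five minutes apart
for id in $(seq 0 31); do
    ceph osd deep-scrub $id
    sleep 300
done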
'zombie snapshot' problem
Hi,

Somehow I have managed to produce an unkillable snapshot, which does not allow removing itself or its parent image:

$ rbd snap purge dev-rack0/vm2
Removing all snapshots: 100% complete...done.
$ rbd rm dev-rack0/vm2
2012-11-21 16:31:24.184626 7f7e0d172780 -1 librbd: image has snapshots - not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
$ rbd snap ls dev-rack0/vm2
SNAPID NAME         SIZE
   188 vm2.snap-yxf 16384 MB
$ rbd info dev-rack0/vm2
rbd image 'vm2':
    size 16384 MB in 4096 objects
    order 22 (4096 KB objects)
    block_name_prefix: rbd_data.1fa164c960874
    format: 2
    features: layering
$ rbd snap rm --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to remove snapshot: (2) No such file or directory
$ rbd snap create --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to create snapshot: (17) File exists
$ rbd snap rollback --snap vm2.snap-yxf dev-rack0/vm2
Rolling back to snapshot: 100% complete...done.
$ rbd snap protect --snap vm2.snap-yxf dev-rack0/vm2
$ rbd snap unprotect --snap vm2.snap-yxf dev-rack0/vm2

Meanwhile, ``rbd ls -l dev-rack0'' segfaults with the attached log. Is there any reliable way to kill the problematic snap?

[attachment: log-crash.txt.gz]
Re: Authorization issues in the 0.54
On Thu, Nov 15, 2012 at 5:03 PM, Andrey Korolyov and...@xdel.ru wrote:
On Thu, Nov 15, 2012 at 5:12 AM, Yehuda Sadeh yeh...@inktank.com wrote:
On Wed, Nov 14, 2012 at 4:20 AM, Andrey Korolyov and...@xdel.ru wrote:
Hi, In 0.54 cephx is probably broken somehow:
$ ceph auth add client.qemukvm osd 'allow *' mon 'allow *' mds 'allow *' -i qemukvm.key
2012-11-14 15:51:23.153910 7ff06441f780 -1 read 65 bytes from qemukvm.key
added key for client.qemukvm
$ ceph auth list
...
client.admin
key: [xx]
caps: [mds] allow *
Note that for mds you just specify 'allow' and not 'allow *'. It shouldn't affect the stuff that you're testing though.
Thanks for the hint!
caps: [mon] allow *
caps: [osd] allow *
client.qemukvm
key: [yy]
caps: [mds] allow *
caps: [mon] allow *
caps: [osd] allow *
...
$ virsh secret-set-value --secret uuid --base64 yy
set username in the VM` xml...
$ virsh start testvm
kvm: -drive file=rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789,if=none,id=drive-virtio-disk0,format=raw: could not open disk image rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789: Operation not permitted
$ virsh secret-set-value --secret uuid --base64 xx
set username again to admin for the VM` disk
$ virsh start testvm
Finally, the vm started successfully. All rbd commands issued from the cli work fine with the appropriate credentials, and the qemu binary is linked against the same librbd as the running one. Does anyone have a suggestion?
There wasn't any change that I'm aware of that should make that happen. Can you reproduce it with 'debug ms = 1' and 'debug auth = 20'?
I`ll provide detailed logs some time later, after I do an upgrade of the production rack. The situation is quite strange: when I upgraded from an older version (tested with 0.51 and 0.53), auth stopped working exactly as above, and no action on the key (importing and elevating privileges, or importing with the maximum possible privileges) had any effect for an rbd-backed QEMU vm; only the ``admin'' credentials were able to pass authentication. When I finally reformatted the cluster using mkcephfs for 0.54, authentication worked with ``rwx'' rights on osd, where earlier ``rw'' was enough. It seemed to be some kind of bug in the monfs resulting in misbehaving authentication. Also, 0.53 to 0.54 was the first upgrade which made a version rollback impossible - the mons complained about an empty set of some ``missing features'' on start, so I recreated the monfs on every mon during an online downgrade (I know that downgrades are bad by nature, but since the on-disk format for the osds had not changed, I tried to do it).
Sorry, it was three overlapping factors: my inattention, the additional ``x'' attribute in the required key capabilities, and a ``backup'' mon left over from the time of the upgrade - I had simply forgotten to kill it, and this mon alone somehow caused authentication requests from qemu VMs to be dropped, while in the meantime allowing plain cluster operations using the ``rbd'' command and the same credentials (very, very strange). By the way, it seems that a monitor not included in the cluster can easily flood any of the existing mons if it has the same name, even if it is completely outside the authentication keyring. Output from the flooded mon is very close to #2645 by footprint. I would suggest it is reasonable to introduce temporary bans, or some other kind of foolproof behavior, for bad authentication requests on the monitors in the future. Thanks!
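For reference, a hedged sketch of a narrower key that satisfies the ``rwx''-on-osd requirement discovered above; the pool restriction and the get-or-create form are illustrative of later releases (a 0.54-era setup would create the key with ceph auth add, as shown above):

$ ceph auth get-or-create client.qemukvm mon 'allow r' osd 'allow rwx pool=rbd'
$ virsh secret-set-value --secret <uuid> --base64 "$(ceph auth get-key client.qemukvm)"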
Re: changed rbd cp behavior in 0.53
On Thu, Nov 15, 2012 at 8:43 PM, Deb Barba deb.ba...@inktank.com wrote:
This is not common UNIX/posix behavior. If you just give the destination a bare file name, it should assume . (the current directory) as its location, not whatever path you started from. I would expect most UNIX users would lose a lot of files if they copied from path x/y/z and just provided a new name; that would indicate they wanted it stashed in ., not cloned in path x/y/z. I am concerned this would confuse most users out in the field.
Thanks, Deborah Barba
Speaking of standards, the rbd layout is closer to the /dev layout, or at least to iSCSI targets, where not specifying the full path, or falling back on some predefined default prefix, makes no sense at all.
On Wed, Nov 14, 2012 at 10:43 PM, Andrey Korolyov and...@xdel.ru wrote:
On Thu, Nov 15, 2012 at 4:56 AM, Dan Mick dan.m...@inktank.com wrote:
On 11/12/2012 02:47 PM, Josh Durgin wrote:
On 11/12/2012 08:30 AM, Andrey Korolyov wrote:
Hi, For this version, rbd cp assumes that the destination pool is the same as the source, not 'rbd', if the pool in the destination path is omitted.
rbd cp install/img testimg
rbd ls install
img
testimg
Is this change permanent? Thanks!
This is a regression. The previous behavior will be restored for 0.54. I added http://tracker.newdream.net/issues/3478 to track it.
Actually, on detailed examination, it looks like this has been the behavior for a long time; I think the wiser course would be not to change this defaulting. One could argue the value of such defaulting, but it's also true that you can specify the source and destination pools explicitly. Andrey, any strong objection to leaving this the way it is?
I`m not complaining - this behavior seems more logical in the first place, and of course I use the full path even when doing something by hand.
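As Dan notes, the ambiguity disappears when both pools are spelled out; a minimal illustration using the names from the thread:

$ rbd cp install/img rbd/testimg      # explicit destination pool
$ rbd cp install/img install/testimg  # explicit copy within the source pool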
Authorization issues in the 0.54
Hi, In 0.54 cephx is probably broken somehow:
$ ceph auth add client.qemukvm osd 'allow *' mon 'allow *' mds 'allow *' -i qemukvm.key
2012-11-14 15:51:23.153910 7ff06441f780 -1 read 65 bytes from qemukvm.key
added key for client.qemukvm
$ ceph auth list
...
client.admin
key: [xx]
caps: [mds] allow *
caps: [mon] allow *
caps: [osd] allow *
client.qemukvm
key: [yy]
caps: [mds] allow *
caps: [mon] allow *
caps: [osd] allow *
...
$ virsh secret-set-value --secret uuid --base64 yy
set username in the VM` xml...
$ virsh start testvm
kvm: -drive file=rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789,if=none,id=drive-virtio-disk0,format=raw: could not open disk image rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789: Operation not permitted
$ virsh secret-set-value --secret uuid --base64 xx
set username again to admin for the VM` disk
$ virsh start testvm
Finally, the vm started successfully. All rbd commands issued from the cli work fine with the appropriate credentials, and the qemu binary is linked against the same librbd as the running one. Does anyone have a suggestion?
Re: changed rbd cp behavior in 0.53
On Thu, Nov 15, 2012 at 4:56 AM, Dan Mick dan.m...@inktank.com wrote: On 11/12/2012 02:47 PM, Josh Durgin wrote: On 11/12/2012 08:30 AM, Andrey Korolyov wrote: Hi, For this version, rbd cp assumes that destination pool is the same as source, not 'rbd', if pool in the destination path is omitted. rbd cp install/img testimg rbd ls install img testimg Is this change permanent? Thanks! This is a regression. The previous behavior will be restored for 0.54. I added http://tracker.newdream.net/issues/3478 to track it. Actually, on detailed examination, it looks like this has been the behavior for a long time; I think the wiser course would be not to change this defaulting. One could argue the value of such defaulting, but it's also true that you can specify the source and destination pools explicitly. Andrey, any strong objection to leaving this the way it is? I`m not complaining - this behavior seems more logical in the first place and of course I use full path even doing something by hand. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
changed rbd cp behavior in 0.53
Hi, For this version, rbd cp assumes that the destination pool is the same as the source, not 'rbd', if the pool in the destination path is omitted:
rbd cp install/img testimg
rbd ls install
img
testimg
Is this change permanent? Thanks!
``rbd mv'' crash when no destination issued
Hi,
Please take a look; it seems harmless:
$ rbd mv vm0
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
*** Caught signal (Aborted) **
in thread 7f85f5981780
ceph version 0.53 (commit:2528b5ee105b16352c91af064af5c0b5a7d45d7c)
1: rbd() [0x431a92]
2: (()+0xfcb0) [0x7f85f451fcb0]
3: (gsignal()+0x35) [0x7f85f2c63405]
4: (abort()+0x17b) [0x7f85f2c66b5b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f85f35616dd]
6: (()+0x637e6) [0x7f85f355f7e6]
7: (()+0x63813) [0x7f85f355f813]
8: (()+0x63a3e) [0x7f85f355fa3e]
9: (std::__throw_logic_error(char const*)+0x5d) [0x7f85f35b133d]
10: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0xa9) [0x7f85f35bd3e9]
11: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)+0x43) [0x7f85f35bd453]
12: (librbd::rename(librados::IoCtx&, char const*, char const*)+0x119) [0x7f85f5528099]
13: (main()+0x2ba5) [0x42aff5]
14: (__libc_start_main()+0xed) [0x7f85f2c4e76d]
15: rbd() [0x42d599]
2012-11-09 20:51:50.082452 7f85f5981780 -1 *** Caught signal (Aborted) **
in thread 7f85f5981780
ceph version 0.53 (commit:2528b5ee105b16352c91af064af5c0b5a7d45d7c)
1: rbd() [0x431a92]
2: (()+0xfcb0) [0x7f85f451fcb0]
3: (gsignal()+0x35) [0x7f85f2c63405]
4: (abort()+0x17b) [0x7f85f2c66b5b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f85f35616dd]
6: (()+0x637e6) [0x7f85f355f7e6]
7: (()+0x63813) [0x7f85f355f813]
8: (()+0x63a3e) [0x7f85f355fa3e]
9: (std::__throw_logic_error(char const*)+0x5d) [0x7f85f35b133d]
10: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0xa9) [0x7f85f35bd3e9]
11: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)+0x43) [0x7f85f35bd453]
12: (librbd::rename(librados::IoCtx&, char const*, char const*)+0x119) [0x7f85f5528099]
13: (main()+0x2ba5) [0x42aff5]
14: (__libc_start_main()+0xed) [0x7f85f2c4e76d]
15: rbd() [0x42d599]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.
--- begin dump of recent events ---
-47 2012-11-09 20:51:50.060811 7f85f5981780 5 asok(0x208ba50) register_command perfcounters_dump hook 0x208b9b0
-46 2012-11-09 20:51:50.060838 7f85f5981780 5 asok(0x208ba50) register_command 1 hook 0x208b9b0
-45 2012-11-09 20:51:50.060843 7f85f5981780 5 asok(0x208ba50) register_command perf dump hook 0x208b9b0
-44 2012-11-09 20:51:50.060858 7f85f5981780 5 asok(0x208ba50) register_command perfcounters_schema hook 0x208b9b0
-43 2012-11-09 20:51:50.060864 7f85f5981780 5 asok(0x208ba50) register_command 2 hook 0x208b9b0
-42 2012-11-09 20:51:50.060866 7f85f5981780 5 asok(0x208ba50) register_command perf schema hook 0x208b9b0
-41 2012-11-09 20:51:50.060868 7f85f5981780 5 asok(0x208ba50) register_command config show hook 0x208b9b0
-40 2012-11-09 20:51:50.060874 7f85f5981780 5 asok(0x208ba50) register_command config set hook 0x208b9b0
-39 2012-11-09 20:51:50.060877 7f85f5981780 5 asok(0x208ba50) register_command log flush hook 0x208b9b0
-38 2012-11-09 20:51:50.060880 7f85f5981780 5 asok(0x208ba50) register_command log dump hook 0x208b9b0
-37 2012-11-09 20:51:50.060886 7f85f5981780 5 asok(0x208ba50) register_command log reopen hook 0x208b9b0
-36 2012-11-09 20:51:50.063300 7f85f5981780 1 librados: starting msgr at :/0
-35 2012-11-09 20:51:50.063317 7f85f5981780 1 librados: starting objecter
-34 2012-11-09 20:51:50.063399 7f85f5981780 1 librados: setting wanted keys
-33 2012-11-09 20:51:50.063406 7f85f5981780 1 librados: calling monclient init
-32 2012-11-09 20:51:50.063589 7f85f5981780 2 auth: KeyRing::load: loaded key file /etc/ceph/keyring.bin
-31 2012-11-09 20:51:50.064794 7f85efc26700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) get 473 (0 -> 473)
-30 2012-11-09 20:51:50.064953 7f85efc26700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) get 33 (473 -> 506)
-29 2012-11-09 20:51:50.065026 7f85f1c2a700 1 monclient(hunting): found mon.2
-28 2012-11-09 20:51:50.065055 7f85f1c2a700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) put 473 (0x6a9d28 - 33)
-27 2012-11-09 20:51:50.065332 7f85f1c2a700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) put 33 (0x6a9d28 - 0)
-26 2012-11-09 20:51:50.065627 7f85efc26700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) get 206 (0 -> 206)
-25 2012-11-09 20:51:50.065798 7f85f1c2a700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) put 206 (0x6a9d28 - 0)
-24 2012-11-09 20:51:50.066082 7f85efc26700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) get 393 (0 -> 393)
-23 2012-11-09 20:51:50.066177 7f85f1c2a700 5 throttle(msgr_dispatch_throttler-radosclient
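The crash is just the CLI constructing a std::string from the missing destination argument on its way into librbd::rename(); with both arguments supplied the command works as documented (the new name here is illustrative):

$ rbd mv vm0 vm0-renamed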
Re: clock syncronisation
On Thu, Nov 8, 2012 at 4:00 PM, Wido den Hollander w...@widodh.nl wrote:
On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote:
Hello list, is there any preferred way to do clock synchronisation? I've tried running openntpd and ntpd on all servers but i'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped 0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped 0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped 0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped 0.063360s in the future, clocks not synchronized
What NTP server are you using? Network latency might cause the clocks not to be synchronised.
There is no real reason to worry here: the quorum only suffers from large desync delays, on the order of seconds or more. If you have unsynchronized clocks on the mon nodes with such big delays, requests issued from the cli, e.g. creating a new connection, may wait as long as the delay itself, depending on the clock value of the selected monitor node. Clock drift is caused mostly by heavy load, and of course playing with clocksources may have some effect (since most systems already use the HPET timer, there is only one real remedy: sync with an ntp server as frequently as needed to prevent drift).
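If the warnings themselves are the nuisance, the monitors' tolerance is tunable; a hedged ceph.conf sketch (the default for this option is 0.05 s, and raising it only hides small drift rather than fixing it):

[mon]
    mon clock drift allowed = 0.1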
Re: SSD journal suggestion
On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott atchle...@ornl.gov wrote:
On Nov 8, 2012, at 10:00 AM, Scott Atchley atchle...@ornl.gov wrote:
On Nov 8, 2012, at 9:39 AM, Mark Nelson mark.nel...@inktank.com wrote:
On 11/08/2012 07:55 AM, Atchley, Scott wrote:
On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:
2012/11/8 Mark Nelson mark.nel...@inktank.com:
I haven't done much with IPoIB (just RDMA), but my understanding is that it tends to top out at like 15Gb/s. Some others on this mailing list can probably speak more authoritatively.
Even with RDMA you are going to top out at around 3.1-3.2GB/s.
15Gb/s is still faster than 10Gbe. But this speed limit seems to be kernel-related and should be the same even in a 10Gbe environment, or not?
We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not use process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s.
Scott, this is very interesting! Does setting the interrupt affinity make the biggest difference then when you have concurrent netperf processes going? For some reason I thought that setting interrupt affinity wasn't even guaranteed in linux any more, but this is just some half-remembered recollection from a year or two ago.
We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
Default (irqbalance running): 12.8 Gb/s
IRQ balance off: 13.0 Gb/s
IRQ affinity set to socket 0: 17.3 Gb/s # using the Mellanox script
When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.
Did you try the Mellanox-baked modules for 2.6.32 before that?
Note, I used hwloc to determine which socket was closer to the mlx4 device on our dual socket machines. On these nodes, hwloc reported that both sockets were equally close, but a colleague has machines where one socket is closer than the other. In that case, bind to the closer socket (or to cores within the closer socket). We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf: http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf We looked at their interrupt affinity setting scripts and then wrote our own. Our testing is with IPoIB in connected mode, not datagram mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it. We are getting a new test cluster with FDR HCAs and I will look into those as well.
Nice! At some point I'll probably try to justify getting some FDR cards in house. I'd definitely like to hear how FDR ends up working for you.
I'll post the numbers when I get access after they are set up.
Scott
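A minimal sketch of the affinity setup described above, assuming the mlx4 vectors are visible in /proc/interrupts and that cores 0 and 1 sit on the socket closest to the HCA:

# stop irqbalance so it does not rewrite the masks behind our back
service irqbalance stop
# pin the first two mlx4 interrupt vectors to cores 0 and 1
irqs=$(awk '/mlx4/ {sub(":","",$1); print $1}' /proc/interrupts | head -2)
mask=1
for irq in $irqs; do
    printf '%x' "$mask" > /proc/irq/$irq/smp_affinity   # hex cpu bitmask
    mask=$((mask << 1))
done
# bind a single-stream netperf to core 0, as in the ~22 Gb/s result above
taskset -c 0 netperf -H server-host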
Re: less cores more iops / speed
On Thu, Nov 8, 2012 at 7:53 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
So it is a problem of KVM which lets the processes jump between cores a lot.
Maybe numad from redhat can help? http://fedoraproject.org/wiki/Features/numad It tries to keep a process on the same numa node, and I think it also does some dynamic pinning.
Numad only keeps memory chunks on the preferred node; cpu pinning, which is the primary goal there, should be done separately via libvirt, or manually for the qemu process via a cpuset (see the sketch after this message). Libvirt does its pinning via taskset, and that seems to be broken at least in debian wheezy - even with an affinity mask set for the qemu process, load spreads all over the numa node, including cpus outside the set.
----- Original message -----
From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: Mark Nelson mark.nel...@inktank.com
Cc: Joao Eduardo Luis joao.l...@inktank.com, ceph-devel@vger.kernel.org
Sent: Thursday, 8 November 2012 16:14:32
Subject: Re: less cores more iops / speed
On 08.11.2012 14:19, Mark Nelson wrote:
On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:
On 08.11.2012 01:59, Mark Nelson wrote:
There's also the context switching overhead. It'd be interesting to know how much the writer processes were shifting around on cores.
What do you mean by that? I'm talking about the KVM guest, not about the ceph nodes.
in this case, is fio bouncing around between cores?
Thanks, you're correct. If I bind fio to two cores on an 8 core VM it runs with 16,000 iops. So it is a problem of KVM, which lets the processes jump between cores a lot.
Greets, Stefan
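A hedged illustration of doing the pinning by hand; the domain name, vcpu ids and core numbers are made up:

# pin each guest vcpu to a dedicated host core via libvirt
$ virsh vcpupin testvm 0 2
$ virsh vcpupin testvm 1 3
# or constrain a running qemu process directly (assumes you know its pid)
$ taskset -pc 2,3 <qemu-pid>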
Re: BUG: kvm crashing in void librbd::AioCompletion::complete_request
On Mon, Nov 5, 2012 at 11:33 PM, Stefan Priebe s.pri...@profihost.ag wrote:
On 04.11.2012 15:12, Sage Weil wrote:
On Sun, 4 Nov 2012, Stefan Priebe wrote:
Can i merge wip-rbd-read into master?
Yeah. I'm going to do a bit more testing first before I do it, but it should apply cleanly. Hopefully later today.
Thanks - it seems to be fixed with wip-rbd-read, but I have a memory leak now. The kvm process grows and grows, by around 5GB of memory with each new test. Should I write a new mail?
Do you use qemu 1.2.0? It has a memory leak, at least with the rbd backend, but I haven't profiled it yet, so it may be something generic :)
Re: Different geoms for an rbd block device
On Wed, Oct 31, 2012 at 1:07 AM, Josh Durgin josh.dur...@inktank.com wrote:
On 10/28/2012 03:02 AM, Andrey Korolyov wrote:
Hi, Should the following behavior be considered normal?
$ rbd map test-rack0/debiantest --user qemukvm --secret qemukvm.key
$ fdisk /dev/rbd1
Command (m for help): p
Disk /dev/rbd1: 671 MB, 671088640 bytes
255 heads, 63 sectors/track, 81 cylinders, total 1310720 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4194304 bytes / 4194304 bytes
Disk identifier: 0x00056f14
Device Boot Start End Blocks Id System
/dev/rbd1p1 2048 63487 30720 82 Linux swap / Solaris
Partition 1 does not start on physical sector boundary.
/dev/rbd1p2 63488 1292287 614400 83 Linux
Partition 2 does not start on physical sector boundary.
Meanwhile, in the guest vm over the same image:
fdisk /dev/vda
Command (m for help): p
Disk /dev/vda: 671 MB, 671088640 bytes
16 heads, 63 sectors/track, 1300 cylinders, total 1310720 sectors
I'm guessing the reported number of cylinders is the issue? You can control that with a qemu option. I think -drive ...cyls=81 will do it. You can also set the min/opt i/o sizes via the qemu device properties min_io_size and opt_io_size, in the same way you can adjust discard granularity: http://ceph.com/docs/master/rbd/qemu-rbd/#enabling-discard-trim Unfortunately min_io_size is a uint16 in qemu, so it won't be able to store 4194304.
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056f14
Device Boot Start End Blocks Id System
/dev/vda1 2048 63487 30720 82 Linux swap / Solaris
/dev/vda2 63488 1292287 614400 83 Linux
The real pain starts when I try to repartition the disk after 'rbd map' using its geometry - it simply breaks the partition layout; for example, the first block offset moves from 2048 to 8192. Of course I can specify the geometry by hand, but before that I would need to start the vm at least once, or do something else which prints out the actual layout. Thanks!
Setting the geometry at qemu boot time should work, and is a bit easier. qemu actually has code to try to guess disk geometry from a partition table, but perhaps it doesn't support the format you're using. Josh
So the preferable geometry is the one provided by the kernel client, right? Are there any advantages to using large blocks for I/O with discard (of course not right now; I`ll wait for virtio bus support :) )? At first sight, TCP transfers should not differ in resulting speed on typical workloads, only on exotic ones - like delayed commit on the guest FS plus intensive writes.
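A sketch of the boot-time overrides Josh suggests, with illustrative values; on qemu of that era the geometry keys went on -drive, while opt_io_size is a virtio-blk device property (min_io_size is left out since, as noted, a uint16 cannot hold 4194304):

$ qemu-system-x86_64 ... \
    -drive file=rbd:test-rack0/debiantest,format=raw,if=none,id=drive0,cyls=81,heads=255,secs=63 \
    -device virtio-blk-pci,drive=drive0,opt_io_size=4194304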
Ignore O_SYNC for rbd cache
Hi,
Recent tests on my test rack with a 20G IB interconnect (ipoib, 64k mtu, default CUBIC, CFQ, LSI SAS 2108 w/ wb cache) show quite fantastic performance - on both reads and writes Ceph utilizes the disk bandwidth at up to 0.9 of the theoretical limit (the sum of all disk bandwidths, bearing in mind the replication level). The only thing that may drag overall performance down is O_SYNC|O_DIRECT writes, which will be issued by almost every database server in its default setup. Assuming the database config may be untouchable, and that somehow I can build a very reliable hardware setup which will never fail on power, should ceph have an option to ignore these flags? Maybe there are other real-world cases for including such an option - or I am very wrong to even think of fooling the client application in this way. Thank you for any suggestions!
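For context, the write pattern in question can be approximated from a guest with dd; a hedged one-liner (the target device is illustrative):

# O_DIRECT plus a data sync per write, roughly what a database journal issues
$ dd if=/dev/zero of=/dev/vda bs=4k count=1000 oflag=direct,dsync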
Re: Collection of strange lockups on 0.51
On Mon, Oct 1, 2012 at 8:42 PM, Tommi Virtanen t...@inktank.com wrote: On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov and...@xdel.ru wrote: Short post mortem - EX3200/12.1R2.9 may begin to drop packets (seems to appear more likely on 0.51 traffic patterns, which is very strange for L2 switching) when a bunch of the 802.3ad pairs, sixteen in my case, exposed to extremely high load - database benchmark over 700+ rbd-backed VMs and cluster rebalance at same time. It explains post-reboot lockups in igb driver and all types of lockups above. I would very appreciate any suggestions of switch models which do not expose such behavior in simultaneous conditions both off-list and in this thread. I don't see how a switch dropping packets would give an ethernet card driver any excuse to crash, but I'm simultaneously happy to hear that it doesn't seem like Ceph is at fault, and sorry for your troubles. I don't have an up to date 1GbE card recommendation to share, but I would recommend making sure you're using a recent Linux kernel. I have incorrectly formulated a reason - of course drops can not cause a lockup by themselves, but switch may create somehow a long-lasting `corrupt` state on the trunk ports which leads to such lockups at the ethernet card. Of course I`ll play with the driver versions and card|port settings, thanks for suggestion :) I`m still investigating the issue since it is a quite hard to repeat in the right time and hope I`m able to capture this state using tcpdump-like, e.g. s/w methods - if card driver locks on something, it may prevent to process problematic byte sequence at packet sniffer level. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Collection of strange lockups on 0.51
On Thu, Sep 13, 2012 at 1:43 AM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen t...@inktank.com wrote: On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, This is completely off-list, but I`m asking because only ceph trigger such a bug :) . With 0.51, following happens: if I kill an osd, one or more neighbor nodes may go to hanged state with cpu lockups, not related to temperature or overall interrupt count or la and it happens randomly over 16-node cluster. Almost sure that ceph triggerizing some hardware bug, but I don`t quite sure of which origin. Also after a short time after reset from such crash a new lockup may be created by any action. From the log, it looks like your ethernet driver is crapping out. [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out ... [172517.058622] [812b2975] ? netif_tx_lock+0x40/0x76 etc. The later oopses are talking about paravirt_write_msr etc, which makes me thing you're using Xen? You probably don't want to run Ceph servers inside virtualization (for production). NOPE. Xen was my choice for almost five years, but right now I am replaced it with kvm everywhere due to buggy 4.1 '-stable'. 4.0 has same poor network performance as 3.x but can be really named stable. All those backtraces comes from bare hardware. At the end you can see nice backtrace which comes out soon after end of the boot sequence when I manually typed 'modprobe rbd', it may be any other command assuming from experience. As soon as I don`t know anything about long-lasting states in intel, especially of those which will survive ipmi reset button, I think that first-sight complain about igb may be not quite right. If there cards may save some of runtime states to EEPROM and pull them back then I`m wrong. Short post mortem - EX3200/12.1R2.9 may begin to drop packets (seems to appear more likely on 0.51 traffic patterns, which is very strange for L2 switching) when a bunch of the 802.3ad pairs, sixteen in my case, exposed to extremely high load - database benchmark over 700+ rbd-backed VMs and cluster rebalance at same time. It explains post-reboot lockups in igb driver and all types of lockups above. I would very appreciate any suggestions of switch models which do not expose such behavior in simultaneous conditions both off-list and in this thread. [172696.503900] [8100d025] ? paravirt_write_msr+0xb/0xe [172696.503942] [810325f3] ? leave_mm+0x3e/0x3e and *then* you get [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0 [172695.041745] megasas: [ 0]waiting for 35 commands to complete [172696.045602] megaraid_sas: no pending cmds after reset [172696.045644] megasas: reset successful which just adds more awesomeness to the soup -- though I do wonder if this could be caused by the soft hang from earlier. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: enabling cephx by default
On Tue, Sep 18, 2012 at 4:37 PM, Guido Winkelmann guido-c...@thisisnotatest.de wrote:
On Tuesday, 11 September 2012, 17:25:49 you wrote:
The next stable release will have cephx authentication enabled by default.
Hm, that could be a problem for me. I have tried multiple times to get cephx working in the past, without lasting success. (I cannot recall at the moment what the problem was the last time around, but it was probably qemu/libvirt.)
BTW, libvirt 0.10.x has somehow broken cephx support. It forms the same string for -drive as 0.9.x (at least in the log) but fails to pass authentication all the same.
IMHO, the documentation badly needs a high-level overview for cephx (or maybe I just haven't found it yet): what it does, what dangers it protects you from, and how it achieves that.
Guido
Re: enabling cephx by default
On Tue, Sep 18, 2012 at 5:34 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Sep 18, 2012 at 4:37 PM, Guido Winkelmann guido-c...@thisisnotatest.de wrote: Am Dienstag, 11. September 2012, 17:25:49 schrieben Sie: The next stable release will have cephx authentication enabled by default. Hm, that could be a problem for me. I have tried multiple times to get cephx working in the past, without lasting success. (I cannot recall at the moment what the problem was the last time around, but it was probably qemu/libvirt.) BTW, libvirt 0.10.x has a broken cephx support somehow. It forms same string for -drive as 0.9x(at least in a log) but failing to pass authentication same moment. Please nevermind, I have build incorrect regex for log parsing previously. https://www.redhat.com/archives/libvirt-users/2012-September/msg00082.html IMHO, the documentation badly needs a high-level overview for cephx (or maybe I just haven't found it yet); what it does, what dangers it protects you from and how it achieves that. Guido -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Collection of strange lockups on 0.51
Hi,
This is completely off-list, but I`m asking because only ceph triggers such a bug :) . With 0.51 the following happens: if I kill an osd, one or more neighbor nodes may go into a hung state with cpu lockups, unrelated to temperature, overall interrupt count or load average, and it happens randomly across the 16-node cluster. I am almost sure that ceph is triggering some hardware bug, but I am not quite sure of its origin. Also, within a short time after a reset from such a crash, a new lockup may be produced by any action. Before blaming system drivers and continuing to investigate the problem, may I ask if someone has faced a similar problem? I am using 802.3ad bonding over a pair of Intel I350 NICs for general connectivity. I have attached a bit of the traces which were pushed to netconsole (in some cases the machine died hard, e.g. without even sending a final bye over netconsole, so the log is not complete).
netcon.log.gz Description: GNU Zip compressed data
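For anyone wanting to capture similar traces, a minimal netconsole setup of the sort used here; the addresses, ports, interface and MAC are placeholders:

# stream kernel messages over UDP to a collector at 10.0.0.2
$ modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55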
Re: Collection of strange lockups on 0.51
On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen t...@inktank.com wrote: On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, This is completely off-list, but I`m asking because only ceph trigger such a bug :) . With 0.51, following happens: if I kill an osd, one or more neighbor nodes may go to hanged state with cpu lockups, not related to temperature or overall interrupt count or la and it happens randomly over 16-node cluster. Almost sure that ceph triggerizing some hardware bug, but I don`t quite sure of which origin. Also after a short time after reset from such crash a new lockup may be created by any action. From the log, it looks like your ethernet driver is crapping out. [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out ... [172517.058622] [812b2975] ? netif_tx_lock+0x40/0x76 etc. The later oopses are talking about paravirt_write_msr etc, which makes me thing you're using Xen? You probably don't want to run Ceph servers inside virtualization (for production). NOPE. Xen was my choice for almost five years, but right now I am replaced it with kvm everywhere due to buggy 4.1 '-stable'. 4.0 has same poor network performance as 3.x but can be really named stable. All those backtraces comes from bare hardware. At the end you can see nice backtrace which comes out soon after end of the boot sequence when I manually typed 'modprobe rbd', it may be any other command assuming from experience. As soon as I don`t know anything about long-lasting states in intel, especially of those which will survive ipmi reset button, I think that first-sight complain about igb may be not quite right. If there cards may save some of runtime states to EEPROM and pull them back then I`m wrong. [172696.503900] [8100d025] ? paravirt_write_msr+0xb/0xe [172696.503942] [810325f3] ? leave_mm+0x3e/0x3e and *then* you get [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0 [172695.041745] megasas: [ 0]waiting for 35 commands to complete [172696.045602] megaraid_sas: no pending cmds after reset [172696.045644] megasas: reset successful which just adds more awesomeness to the soup -- though I do wonder if this could be caused by the soft hang from earlier. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD crash
Hi, Almost always one or more osd dies when doing overlapped recovery - e.g. add new crushmap and remove some newly added osds from cluster some minutes later during remap or inject two slightly different crushmaps after a short time(surely preserving at least one of replicas online). Seems that osd dying on excessive amount of operations in queue because under normal test, e.g. rados, iowait does not break one percent barrier but during recovery it may raise up to ten percents(2108 w/ cache, splitted disks as R0 each). #0 0x7f62f193a445 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f62f193db9b in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x7f62f2236665 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #3 0x7f62f2234796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #4 0x7f62f22347c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #5 0x7f62f22349ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #6 0x00844e11 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () #7 0x0073148f in FileStore::_do_transaction(ObjectStore::Transaction, unsigned long, int) () #8 0x0073484e in FileStore::do_transactions(std::listObjectStore::Transaction*, std::allocatorObjectStore::Transaction* , unsigned long) () #9 0x0070c680 in FileStore::_do_op(FileStore::OpSequencer*) () #10 0x0083ce01 in ThreadPool::worker() () #11 0x006823ed in ThreadPool::WorkThread::entry() () #12 0x7f62f345ee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #13 0x7f62f19f64cd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #14 0x in ?? () ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) On Sun, Aug 26, 2012 at 8:52 PM, Andrey Korolyov and...@xdel.ru wrote: During recovery, following crash happens(simular to http://tracker.newdream.net/issues/2126 which marked resolved long ago): http://xdel.ru/downloads/ceph-log/osd-2012-08-26.txt On Sat, Aug 25, 2012 at 12:30 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Aug 23, 2012 at 4:09 AM, Gregory Farnum g...@inktank.com wrote: The tcmalloc backtrace on the OSD suggests this may be unrelated, but what's the fd limit on your monitor process? You may be approaching that limit if you've got 500 OSDs and a similar number of clients. Thanks! I didn`t measured a # of connection because of bearing in mind 1 conn per client, raising limit did the thing. Previously mentioned qemu-kvm zombie does not related to rbd itself - it can be created by destroying libvirt domain which is in saving state or vice-versa, so I`ll put a workaround on this. Right now I am faced different problem - osds dying silently, e.g. not leaving a core, I`ll check logs on the next testing phase. On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote: On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during heavy test a pair of osds and one mon died, resulting to hard lockup of some kvm processes - they went unresponsible and was killed leaving zombie processes ([kvm] defunct). Entire cluster contain sixteen osd on eight nodes and three mons, on first and last node and on vm outside cluster. 
osd bt: #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 (gdb) bt #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #1 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #2 0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4 #3 0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246 #4 ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at /usr/include/c++/4.7/bits/basic_string.h:536 #5 ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out) at /usr/include/c++/4.7/sstream:60 #6 ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439 #7 pretty_version_to_str () at common/version.cc:40 #8 0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19 #9 0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91 #10 signal handler called #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4 #14
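Checking and raising the monitor's descriptor limit, as Greg suggests above, is straightforward; a hedged sketch (the value is illustrative, and the 'max open files' option is applied by the init script at daemon start):

$ grep 'open files' /proc/$(pidof ceph-mon)/limits
# in ceph.conf:
[global]
    max open files = 65536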
Re: Ceph benchmarks
On Tue, Aug 28, 2012 at 12:47 AM, Sébastien Han han.sebast...@gmail.com wrote:
Hi community, For those of you who are interested, I performed several benchmarks of RADOS and RBD on different types of hardware and use cases. You can find my results here: http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/ Hope it helps :) Feel free to comment, critique... :) Cheers!
My two cents - on an ultrafast journal (tmpfs) it matters which tcp congestion control algorithm you are using. With the default CUBIC delays, the aggregated sixteen-osd write speed is about 450MBps, but with DCTCP it rises up to 550MBps. For a device such as an SLC disk (ext4, w/o journal, commit=100) there is no observable difference - both times the aggregated speed measured about 330MBps. I have not yet tried H(S)TCP; it should do the same as DCTCP. For delays lower than regular gigabit ethernet, different congestion algorithms should show a bigger difference, though.
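Switching the congestion control algorithm for a quick A/B comparison is a one-liner, assuming the corresponding module is available in the running kernel (dctcp was an out-of-tree patch at the time; htcp ships with mainline):

$ sysctl net.ipv4.tcp_available_congestion_control
$ modprobe tcp_htcp
$ sysctl -w net.ipv4.tcp_congestion_control=htcp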
Re: OSD crash
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote: On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during heavy test a pair of osds and one mon died, resulting to hard lockup of some kvm processes - they went unresponsible and was killed leaving zombie processes ([kvm] defunct). Entire cluster contain sixteen osd on eight nodes and three mons, on first and last node and on vm outside cluster. osd bt: #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 (gdb) bt #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #1 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #2 0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4 #3 0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246 #4 ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at /usr/include/c++/4.7/bits/basic_string.h:536 #5 ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out) at /usr/include/c++/4.7/sstream:60 #6 ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439 #7 pretty_version_to_str () at common/version.cc:40 #8 0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19 #9 0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91 #10 signal handler called #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #16 0x7fc37d1c47c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #17 0x7fc37d1c49ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c 0 == \unexpected error\, file=optimized out, line=3007, func=0x90ef80 unsigned int FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int)) at common/assert.cc:77 This means it got an unexpected error when talking to the file system. If you look in the osd log, it may tell you what that was. (It may not--there isn't usually the other tcmalloc stuff triggered from the assert handler.) What happens if you restart that ceph-osd daemon? sage Unfortunately I have completely disabled logs during test, so there are no suggestion of assert_fail. The main problem was revealed - created VMs was pointed to one monitor instead set of three, so there may be some unusual things(btw, crashed mon isn`t one from above, but a neighbor of crashed osds on first node). After IPMI reset node returns back well and cluster behavior seems to be okay - stuck kvm I/O somehow prevented even other module load|unload on this node, so I finally decided to do hard reset. Despite I`m using almost generic wheezy, glibc was updated to 2.15, may be because of this my trace appears first time ever. I`m almost sure that fs does not triggered this crash and mainly suspecting stuck kvm processes. 
I`ll rerun test with same conditions tomorrow(~500 vms pointed to one mon and very high I/O, but with osd logging). #19 0x0073148f in FileStore::_do_transaction (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545, trans_num=trans_num@entry=0) at os/FileStore.cc:3007 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000, tls=..., op_seq=429545) at os/FileStore.cc:2436 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000, osr=optimized out) at os/FileStore.cc:2259 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at common/WorkQueue.cc:54 #23 0x006823ed in ThreadPool::WorkThread::entry (this=optimized out) at ./common/WorkQueue.h:126 #24 0x7fc37e3eee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #26 0x in ?? () mon bt was exactly the same as in http://tracker.newdream.net/issues/2762 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info
another performance-related thread
Hi,
I`ve finally managed to run an rbd-related test on relatively powerful machines, and here is what I got:
1) Reads on an almost fairly balanced cluster (eight nodes) did very well, utilizing almost all disk and network bandwidth (dual gbit 802.3ad nics and sata disks behind an lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear and sequential reads, which is close to the overall disk throughput).
2) Writes do much worse, both in rados bench and in a fio test when I ran fio simultaneously on 120 vms - at best, overall performance is about 400Mbyte/s, using rados bench -t 12 on three host nodes.
fio config:
rw=(randread|randwrite|seqread|seqwrite)
size=256m
direct=1
directory=/test
numjobs=1
iodepth=12
group_reporting
name=random-read-direct
bs=1M
loops=12
For the 120 vm set, in Mbyte/s:
linear reads: MEAN: 14156 STDEV: 612.596
random reads: MEAN: 14128 STDEV: 911.789
linear writes: MEAN: 2956 STDEV: 283.165
random writes: MEAN: 2986 STDEV: 361.311
Each node holds 15 vms, and with a 64M rbd cache all three possible states - wb, wt and no-cache - show almost the same numbers in the tests. I wonder if it is possible to raise the write/read ratio somehow. It seems that the osds underutilize themselves; e.g. I am not able to get a single-threaded rbd write above 35Mb/s. Adding a second osd on the same disk only raises iowait time, not the benchmark results.
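For reference, a hedged form of the rados side of the comparison (the pool name and duration are illustrative; rados bench writes 4M objects by default):

$ rados bench -p rbd 60 write -t 12 --no-cleanup   # 12 concurrent writers
$ rados bench -p rbd 60 seq -t 12                  # sequential reads of the objects just written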
Re: another performance-related thread
On 07/31/2012 07:17 PM, Mark Nelson wrote: Hi Andrey! On 07/31/2012 10:03 AM, Andrey Korolyov wrote: Hi, I`ve finally managed to run rbd-related test on relatively powerful machines and what I have got: 1) Reads on almost fair balanced cluster(eight nodes) did very well, utilizing almost all disk and bandwidth (dual gbit 802.3ad nics, sata disks beyond lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear and sequential reads, which is close to overall disk throughput) Does your 2108 have the RAID or JBOD firmware? I'm guessing the RAID firmware given that you are able to change the caching behavior? How do you have the arrays setup for the OSDs? Exactly, I am able to change cache behavior on-the-fly using 'famous' megacli binary. Each node contains three disks, each of them configured as raid0 single-disk - two 7200 server sata and intel 313 for journal. On satas I am using xfs with default mount options and on ssd I`ve put ext4 with disabled journal and of course with discard/noatime. This 2108 comes with SuperMicro firmware 2.120.243-1482 - guessing it is RAID variant and I didn`t tried to reflash it yet. For tests, I have forced write-through cache on - this should be very good at small writes aggregation. Before using such config, I have configured two disks to RAID0 and get slightly worse results on write bench. Thanks for suggesting to try JBOD firmware, I`ll do tests using it this week and post results. 2) Writes get much worse, both on rados bench and on fio test when I ran fio simularly on 120 vms - at it best, overall performance is about 400Mbyte/s, using rados bench -t 12 on three host nodes fio config: rw=(randread|randwrite|seqread|seqwrite) size=256m direct=1 directory=/test numjobs=1 iodepth=12 group_reporting name=random-ead-direct bs=1M loops=12 for 120 vm set, Mbyte/s linear reads: MEAN: 14156 STDEV: 612.596 random reads: MEAN: 14128 STDEV: 911.789 linear writes: MEAN: 2956 STDEV: 283.165 random writes: MEAN: 2986 STDEV: 361.311 each node holds 15 vms and for 64M rbd cache all possible three states - wb, wt and no-cache has almost same numbers at the tests. I wonder if it possible to raise write/read ratio somehow. Seems that osd underutilize itself, e.g. I am not able to get single-threaded rbd write to get above 35Mb/s. Adding second osd on same disk only raising iowait time, but not benchmark results. I've seen high IO wait times (especially with small writes) via rados bench as well. It's something we are actively investigating. Part of the issue with rados bench is that every single request is getting written to a seperate file, so especially at small IO sizes there is a lot of underlying filesystem metadata traffic. For us, this is happening on 9260 controllers with RAID firmware. I think we may see some improvement by switching to 2X08 cards with the JBOD (ie IT) firmware, but we haven't confirmed it yet. For 24 HT cores I have seen 2 percent iowait at most(at writes), so almost surely there is no IO bottleneck at all(except breaking the rule 'one osd per physical disk', when iowait raising up to 50 percent on entire system). Rados bench is not an universal measurement tool, thought - using VM` IO requests instead of manipulating rados objects will lead to almost fair result, by my opinion. We actually just purchased a variety of alternative RAID and SAS controllers to test with to see how universal this problem is. Theoretically RBD shouldn't suffer from this as badly as small writes to the same file should get buffered. 
The same is true for CephFS when doing buffered IO to a single file due to the Linux buffer cache. Small writes to many files will likely suffer in the same way that rados bench does though. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
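The on-the-fly cache toggling mentioned above looks roughly like this with the megacli binary; the exact selector flags vary by version and are illustrative here:

$ MegaCli -LDSetProp WT -LAll -aAll   # force write-through on all logical drives
$ MegaCli -LDSetProp WB -LAll -aAll   # switch back to write-back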
Re: another performance-related thread
On 07/31/2012 07:53 PM, Josh Durgin wrote:
On 07/31/2012 08:03 AM, Andrey Korolyov wrote:
Hi, I`ve finally managed to run an rbd-related test on relatively powerful machines, and here is what I got:
1) Reads on an almost fairly balanced cluster (eight nodes) did very well, utilizing almost all disk and bandwidth (dual gbit 802.3ad nics, sata disks behind an lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear and sequential reads, which is close to the overall disk throughput)
2) Writes do much worse, both in rados bench and in the fio test when I ran fio simultaneously on 120 vms - at best, overall performance is about 400Mbyte/s, using rados bench -t 12 on three host nodes
How are your osd journals configured? What's your ceph.conf for the osds?
fio config:
rw=(randread|randwrite|seqread|seqwrite)
size=256m
direct=1
directory=/test
numjobs=1
iodepth=12
group_reporting
name=random-read-direct
bs=1M
loops=12
for the 120 vm set, Mbyte/s:
linear reads: MEAN: 14156 STDEV: 612.596
random reads: MEAN: 14128 STDEV: 911.789
linear writes: MEAN: 2956 STDEV: 283.165
random writes: MEAN: 2986 STDEV: 361.311
each node holds 15 vms, and with a 64M rbd cache all three possible states - wb, wt and no-cache - show almost the same numbers in the tests. I wonder if it is possible to raise the write/read ratio somehow. It seems that the osds underutilize themselves; e.g. I am not able to get a single-threaded rbd write above 35Mb/s. Adding a second osd on the same disk only raises iowait time, not the benchmark results.
Are these write tests using direct I/O? That will bypass the cache for writes, which would explain the similar numbers with different cache modes.
I had previously forgotten that the direct flag may affect rbd cache behaviour. Without it, on the wb cache, the read rate remained the same and writes increased by roughly 15%:
random writes: MEAN: 3370 STDEV: 939.99
linear writes: MEAN: 3561 STDEV: 824.954
Re: ceph status reporting non-existing osd
On Thu, Jul 19, 2012 at 1:28 AM, Gregory Farnum g...@inktank.com wrote: On Wed, Jul 18, 2012 at 12:07 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jul 18, 2012 at 10:30 PM, Gregory Farnum g...@inktank.com wrote: On Wed, Jul 18, 2012 at 12:47 AM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jul 18, 2012 at 11:18 AM, Gregory Farnum g...@inktank.com wrote: On Tuesday, July 17, 2012 at 11:22 PM, Andrey Korolyov wrote: On Wed, Jul 18, 2012 at 10:09 AM, Gregory Farnum g...@inktank.com (mailto:g...@inktank.com) wrote: Hrm. That shouldn't be possible if the OSD has been removed. How did you take it out? It sounds like maybe you just marked it in the OUT state (and turned it off quite quickly) without actually taking it out of the cluster? -Greg As I have did removal, it was definitely not like that - at first place, I have marked osds(4 and 5 on same host) out, then rebuilt crushmap and then kill osd processes. As I mentioned before, osd.4 doest not exist in crushmap and therefore it shouldn`t be reported at all(theoretically). Okay, that's what happened — marking an OSD out in the CRUSH map means all the data gets moved off it, but that doesn't remove it from all the places where it's registered in the monitor and in the map, for a couple reasons: 1) You might want to mark an OSD out before taking it down, to allow for more orderly data movement. 2) OSDs can get marked out automatically, but the system shouldn't be able to forget about them on its own. 3) You might want to remove an OSD from the CRUSH map in the process of placing it somewhere else (perhaps you moved the physical machine to a new location). etc. You want to run ceph osd rm 4 5 and that should unregister both of them from everything[1]. :) -Greg [1]: Except for the full lists, which have a bug in the version of code you're running — remove the OSDs, then adjust the full ratios again, and all will be well. $ ceph osd rm 4 osd.4 does not exist $ ceph -s health HEALTH_WARN 1 near full osd(s) monmap e3: 3 mons at {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0}, election epoch 58, quorum 0,1,2 0,1,2 osdmap e2198: 4 osds: 4 up, 4 in pgmap v586056: 464 pgs: 464 active+clean; 66645 MB data, 231 GB used, 95877 MB / 324 GB avail mdsmap e207: 1/1/1 up {0=a=up:active} $ ceph health detail HEALTH_WARN 1 near full osd(s) osd.4 is near full at 89% $ ceph osd dump max_osd 4 osd.0 up in weight 1 up_from 2183 up_thru 2187 down_at 2172 last_clean_interval [2136,2171) 192.168.10.128:6800/4030 192.168.10.128:6801/4030 192.168.10.128:6802/4030 exists,up 68b3deec-e80a-48b7-9c29-1b98f5de4f62 osd.1 up in weight 1 up_from 2136 up_thru 2186 down_at 2135 last_clean_interval [2115,2134) 192.168.10.129:6800/2980 192.168.10.129:6801/2980 192.168.10.129:6802/2980 exists,up b2a26fe9-aaa8-445f-be1f-fa7d2a283b57 osd.2 up in weight 1 up_from 2181 up_thru 2187 down_at 2172 last_clean_interval [2136,2171) 192.168.10.128:6803/4128 192.168.10.128:6804/4128 192.168.10.128:6805/4128 exists,up 378d367a-f7fb-4892-9ec9-db8ffdd2eb20 osd.3 up in weight 1 up_from 2136 up_thru 2186 down_at 2135 last_clean_interval [2115,2134) 192.168.10.129:6803/3069 192.168.10.129:6804/3069 192.168.10.129:6805/3069 exists,up faf8eda8-55fc-4a0e-899f-47dbd32b81b8 Hrm. How did you create your new crush map? 
All the normal avenues of removing an OSD from the map set a flag which the PGMap uses to delete its records (which would prevent it reappearing in the full list), and I can't see how setcrushmap would remove an OSD from the map (although there might be a code path I haven't found).

Manually, by deleting the osd.4 and osd.5 entries and reweighting the remaining nodes.

So you extracted the CRUSH map, edited it, and injected it using ceph osd setcrushmap?

Yep, exactly.
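For anyone replaying this thread, the orderly removal sequence being described amounts to roughly the following (a sketch; the service-stop line is a hypothetical placeholder that varies by distribution and init system):

    ceph osd out 4                # mark out and let the data migrate off first
    service ceph stop osd.4       # stop the daemon once recovery finishes
    ceph osd crush remove osd.4   # drop it from the CRUSH map
    ceph auth del osd.4           # remove its authentication key
    ceph osd rm 4                 # unregister it from the osdmap

Skipping the last three steps is exactly what leaves a ghost osd.4 lingering in the monitor's records.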
Re: ceph status reporting non-existing osd
On Wed, Jul 18, 2012 at 10:09 AM, Gregory Farnum g...@inktank.com wrote:
On Monday, July 16, 2012 at 11:55 AM, Andrey Korolyov wrote:
On Mon, Jul 16, 2012 at 10:48 PM, Gregory Farnum g...@inktank.com wrote:

ceph pg set_full_ratio 0.95
ceph pg set_nearfull_ratio 0.94

On Monday, July 16, 2012 at 11:42 AM, Andrey Korolyov wrote:
On Mon, Jul 16, 2012 at 8:12 PM, Gregory Farnum g...@inktank.com wrote:
On Saturday, July 14, 2012 at 7:20 AM, Andrey Korolyov wrote:
On Fri, Jul 13, 2012 at 9:09 PM, Sage Weil s...@inktank.com wrote:
On Fri, 13 Jul 2012, Gregory Farnum wrote:
On Fri, Jul 13, 2012 at 1:17 AM, Andrey Korolyov and...@xdel.ru wrote:

Hi,

Recently I`ve reduced my test suite from 6 to 4 osds at ~60% usage on a six-node cluster, and I removed a bunch of rbd objects during recovery to avoid overfilling. Right now I`m constantly receiving a warning about a nearfull state on a non-existing osd:

health HEALTH_WARN 1 near full osd(s)
monmap e3: 3 mons at {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0}, election epoch 240, quorum 0,1,2 0,1,2
osdmap e2098: 4 osds: 4 up, 4 in
pgmap v518696: 464 pgs: 464 active+clean; 61070 MB data, 181 GB used, 143 GB / 324 GB avail
mdsmap e181: 1/1/1 up {0=a=up:active}

HEALTH_WARN 1 near full osd(s)
osd.4 is near full at 89%

Needless to say, osd.4 remains only in ceph.conf, not in the crushmap. The reduction was done 'on-line', i.e. without restarting the entire cluster.

Whoops! It looks like Sage has written some patches to fix this, but for now you should be good if you just update your ratios to a larger number, and then bring them back down again. :)

Restarting ceph-mon should also do the trick. Thanks for the bug report!
sage

Should I restart mons simultaneously?

I don't think restarting will actually do the trick for you — you actually will need to set the ratios again.

Restarting one by one had no effect, same as filling up the data pool to ~95 percent (btw, when I deleted this 50Gb file on cephfs, the mds was stuck permanently and usage remained the same until I dropped and recreated the data pool - I hope it`s one of the known posix layer bugs). I also deleted the entry from the config and then restarted the mons, with no effect. Any suggestions?

I'm not sure what you're asking about here?
-Greg

Oh, sorry, I misread and thought that you suggested filling up the osds. How can I set the full/nearfull ratios correctly?

$ ceph injectargs '--mon_osd_full_ratio 96'
parsed options
$ ceph injectargs '--mon_osd_near_full_ratio 94'
parsed options
$ ceph pg dump | grep 'full'
full_ratio 0.95
nearfull_ratio 0.85

Setting the parameters in ceph.conf and then restarting the mons does not affect the ratios either.

Thanks, it worked, but setting the values back brings the warning back.

Hrm. That shouldn't be possible if the OSD has been removed. How did you take it out? It sounds like maybe you just marked it in the OUT state (and turned it off quite quickly) without actually taking it out of the cluster?
-Greg

As I did the removal, it was definitely not like that - in the first place, I marked the osds (4 and 5, on the same host) out, then rebuilt the crushmap and then killed the osd processes. As I mentioned before, osd.4 does not exist in the crushmap and therefore it shouldn`t be reported at all (theoretically).
Re: ceph status reporting non-existing osd
On Wed, Jul 18, 2012 at 11:18 AM, Gregory Farnum g...@inktank.com wrote:
On Tuesday, July 17, 2012 at 11:22 PM, Andrey Korolyov wrote:

Hrm. That shouldn't be possible if the OSD has been removed. How did you take it out? It sounds like maybe you just marked it in the OUT state (and turned it off quite quickly) without actually taking it out of the cluster?
-Greg

As I did the removal, it was definitely not like that - in the first place, I marked the osds (4 and 5, on the same host) out, then rebuilt the crushmap and then killed the osd processes. As I mentioned before, osd.4 does not exist in the crushmap and therefore it shouldn`t be reported at all (theoretically).
Okay, that's what happened — marking an OSD out in the CRUSH map means all the data gets moved off it, but that doesn't remove it from all the places where it's registered in the monitor and in the map, for a couple of reasons:
1) You might want to mark an OSD out before taking it down, to allow for more orderly data movement.
2) OSDs can get marked out automatically, but the system shouldn't be able to forget about them on its own.
3) You might want to remove an OSD from the CRUSH map in the process of placing it somewhere else (perhaps you moved the physical machine to a new location).
etc. You want to run ceph osd rm 4 5 and that should unregister both of them from everything[1]. :)
-Greg
[1]: Except for the full lists, which have a bug in the version of code you're running — remove the OSDs, then adjust the full ratios again, and all will be well.
Re: ceph status reporting non-existing osd
On Wed, Jul 18, 2012 at 10:30 PM, Gregory Farnum g...@inktank.com wrote:
On Wed, Jul 18, 2012 at 12:47 AM, Andrey Korolyov and...@xdel.ru wrote:

Hrm. How did you create your new crush map? All the normal avenues of removing an OSD from the map set a flag which the PGMap uses to delete its records (which would prevent it reappearing in the full list), and I can't see how setcrushmap would remove an OSD from the map (although there might be a code path I haven't found).

Manually, by deleting the osd.4 and osd.5 entries and reweighting the remaining nodes.
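The manual editing described here corresponds to the standard extract/decompile/edit/recompile/inject cycle; a sketch (the filenames are arbitrary):

    ceph osd getcrushmap -o crush.bin     # grab the current map
    crushtool -d crush.bin -o crush.txt   # decompile to editable text
    # edit crush.txt: remove the osd.4/osd.5 device and bucket entries, fix weights
    crushtool -c crush.txt -o crush.new   # recompile
    ceph osd setcrushmap -i crush.new     # inject the edited map

As Greg notes, this path bypasses the bookkeeping that ceph osd rm performs, which is how the stale full-list entry survived.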
Re: ceph status reporting non-existing osd
On Mon, Jul 16, 2012 at 10:48 PM, Gregory Farnum g...@inktank.com wrote:

ceph pg set_full_ratio 0.95
ceph pg set_nearfull_ratio 0.94

On Monday, July 16, 2012 at 11:42 AM, Andrey Korolyov wrote:

Oh, sorry, I misread and thought that you suggested filling up the osds. How can I set the full/nearfull ratios correctly?

$ ceph injectargs '--mon_osd_full_ratio 96'
parsed options
$ ceph injectargs '--mon_osd_near_full_ratio 94'
parsed options
$ ceph pg dump | grep 'full'
full_ratio 0.95
nearfull_ratio 0.85

Setting the parameters in ceph.conf and then restarting the mons does not affect the ratios either.

Thanks, it worked, but setting the values back brings the warning back.
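A side note for anyone replaying this exchange: the injectargs calls above pass what look like percentages, but these options take fractions, which would explain why pg dump kept showing the old ratios. A corrected sketch, using the pg-level commands from Greg's reply (which are what actually update the PGMap):

    # ratios are fractions of capacity, not percentages
    ceph pg set_full_ratio 0.96
    ceph pg set_nearfull_ratio 0.94
    # verify the PGMap picked them up
    ceph pg dump | grep full_ratio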
Re: ceph status reporting non-existing osd
On Fri, Jul 13, 2012 at 9:09 PM, Sage Weil s...@inktank.com wrote:
On Fri, 13 Jul 2012, Gregory Farnum wrote:
On Fri, Jul 13, 2012 at 1:17 AM, Andrey Korolyov and...@xdel.ru wrote:

Hi,

Recently I`ve reduced my test suite from 6 to 4 osds at ~60% usage on a six-node cluster, and I removed a bunch of rbd objects during recovery to avoid overfilling. Right now I`m constantly receiving a warning about a nearfull state on a non-existing osd:

health HEALTH_WARN 1 near full osd(s)
monmap e3: 3 mons at {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0}, election epoch 240, quorum 0,1,2 0,1,2
osdmap e2098: 4 osds: 4 up, 4 in
pgmap v518696: 464 pgs: 464 active+clean; 61070 MB data, 181 GB used, 143 GB / 324 GB avail
mdsmap e181: 1/1/1 up {0=a=up:active}

HEALTH_WARN 1 near full osd(s)
osd.4 is near full at 89%

Needless to say, osd.4 remains only in ceph.conf, not in the crushmap. The reduction was done 'on-line', i.e. without restarting the entire cluster.

Whoops! It looks like Sage has written some patches to fix this, but for now you should be good if you just update your ratios to a larger number, and then bring them back down again. :)

Restarting ceph-mon should also do the trick. Thanks for the bug report!
sage

Should I restart mons simultaneously?

Restarting one by one had no effect, same as filling up the data pool to ~95 percent (btw, when I deleted this 50Gb file on cephfs, the mds was stuck permanently and usage remained the same until I dropped and recreated the data pool - I hope it`s one of the known posix layer bugs). I also deleted the entry from the config and then restarted the mons, with no effect. Any suggestions?