Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-05 Thread J David
On Tue, Nov 5, 2019 at 2:21 PM Janne Johansson wrote: > I seem to recall some ticket where zap would "only" clear 100M of the drive, > but lvm and all partition info needed more to be cleared, so using dd > bs=1M count=1024 (or more!) would be needed to make sure no part of the OS > picks
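
For anyone hitting the same zombie-filesystem problem, a hedged sketch of clearing a drive more thoroughly before reuse (the device name /dev/sdX is a placeholder; this is destructive, so double-check it first):

    sudo wipefs --all /dev/sdX                                       # clear known filesystem/LVM/partition signatures
    sudo dd if=/dev/zero of=/dev/sdX bs=1M count=1024 oflag=direct   # zero the first 1 GiB, per the suggestion above
    sudo sgdisk --zap-all /dev/sdX                                   # GPT keeps a backup table at the end of the disk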

Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-05 Thread J David
On Tue, Nov 5, 2019 at 3:18 AM Paul Emmerich wrote: > could be a new feature, I've only realized this exists/works since Nautilus. > You seem to be a relatively old version since you still have ceph-disk > installed None of this is using ceph-disk? It's all done with ceph-volume. The ceph clus

Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-04 Thread J David
On Mon, Nov 4, 2019 at 1:32 PM Paul Emmerich wrote: > BTW: you can run destroy before stopping the OSD, you won't need the > --yes-i-really-mean-it if it's drained in this case This actually does not seem to work: $ sudo ceph osd safe-to-destroy 42 OSD(s) 42 are safe to destroy without reducing
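
For context, the drain-then-destroy sequence under discussion looks roughly like this (OSD id 42 taken from the example above; a sketch on a systemd host, not the poster's exact commands):

    ceph osd safe-to-destroy 42                    # reports whether the OSD can go without reducing data durability
    systemctl stop ceph-osd@42                     # stopping it first...
    ceph osd destroy 42 --yes-i-really-mean-it     # ...is what lets destroy proceed without complaint in practice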

Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-04 Thread J David
On Mon, Nov 4, 2019 at 1:32 PM Paul Emmerich wrote: > That's probably the ceph-disk udev script being triggered from > something somewhere (and a lot of things can trigger that script...) That makes total sense. > Work-around: convert everything to ceph-volume simple first by running > "ceph-vol
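
The work-around quoted above is cut off; it presumably refers to ceph-volume's "simple" mode, which looks roughly like this on each OSD host (paths and ids are illustrative):

    ceph-volume simple scan /var/lib/ceph/osd/ceph-42   # writes a JSON descriptor for the existing ceph-disk OSD
    ceph-volume simple activate --all                   # takes over activation so the old udev path is no longer needed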

[ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-04 Thread J David
While converting a luminous cluster from filestore to bluestore, we are running into a weird race condition on a fairly regular basis. We have a master script that writes upgrade scripts for each OSD server. The script for an OSD looks like this: ceph osd out 68 while ! ceph osd safe-to-destroy
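
The per-OSD script above is truncated; a hedged reconstruction of the general drain-and-rebuild pattern it describes (OSD 68 from the example, device name is a placeholder, and this is a sketch rather than the poster's exact script):

    ceph osd out 68
    while ! ceph osd safe-to-destroy 68; do sleep 60; done          # wait for the data to drain off
    systemctl stop ceph-osd@68
    ceph osd destroy 68 --yes-i-really-mean-it
    ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 68  # recreate the same id as bluestore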

Re: [ceph-users] Luminous OSD crashes every few seconds: FAILED assert(0 == "past_interval end mismatch")

2018-08-01 Thread J David
On Wed, Aug 1, 2018 at 9:53 PM, Brad Hubbard wrote: > What is the status of the cluster with this osd down and out? Briefly, miserable. All client IO was blocked. 36 pgs were stuck “down.” pg query reported that they were blocked by that OSD, despite that OSD not holding any replicas for them,

Re: [ceph-users] Luminous OSD crashes every few seconds: FAILED assert(0 == "past_interval end mismatch")

2018-08-01 Thread J David
seemed very happy to see it again. Not sure if this solution works generally or if it was specific to this case, or if it was not a solution and the cluster will eat itself overnight. But, so far so good! Thanks! On Wed, Aug 1, 2018 at 3:42 PM, J David wrote: > Hello all, > > On

[ceph-users] Luminous OSD crashes every few seconds: FAILED assert(0 == "past_interval end mismatch")

2018-08-01 Thread J David
Hello all, On Luminous 12.2.7, during the course of recovering from a failed OSD, one of the other OSDs started repeatedly crashing every few seconds with an assertion failure: 2018-08-01 12:17:20.584350 7fb50eded700 -1 log_channel(cluster) log [ERR] : 2.621 past_interal bound [19300,21449) end d

Re: [ceph-users] Slow requests

2017-10-19 Thread J David
On Thu, Oct 19, 2017 at 9:42 PM, Brad Hubbard wrote: > I guess you have both read and followed > http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/?highlight=backfill#debugging-slow-requests > > What was the result? Not sure if you’re asking Ольга or myself, but in my cas
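
For anyone following along, the troubleshooting page referenced above mostly boils down to asking the affected OSD what its stuck requests are waiting on, via the admin socket on that OSD's host (osd.0 is a placeholder):

    ceph daemon osd.0 dump_ops_in_flight     # ops currently blocked and the step they are stuck at
    ceph daemon osd.0 dump_historic_ops      # recent slow ops and where they spent their time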

Re: [ceph-users] Slow requests

2017-10-19 Thread J David
On Wed, Oct 18, 2017 at 8:12 AM, Ольга Ухина wrote: > I have a problem with ceph luminous 12.2.1. > […] > I have slow requests on different OSDs on random time (for example at night, > but I don’t see any problems at the time of problem > […] > 2017-10-18 01:20:38.187326 mon.st3 mon.0 10.192.1.78:

Re: [ceph-users] osd max scrubs not honored?

2017-10-16 Thread J David
ely starving other normal or lower priority requests. Is that how it works? Or is the queue in question a simple FIFO queue? Is there anything else I can try to help narrow this down? Thanks! On Sat, Oct 14, 2017 at 6:51 PM, J David wrote: > On Sat, Oct 14, 2017 at 9:33 AM, David Turner

Re: [ceph-users] osd max scrubs not honored?

2017-10-14 Thread J David
On Sat, Oct 14, 2017 at 9:33 AM, David Turner wrote: > First, there is no need to deep scrub your PGs every 2 days. They aren’t being deep scrubbed every two days, nor is there any attempt (or desire) to do so. That would require 8+ scrubs running at once. Currently, it takes between 2 and 3

Re: [ceph-users] osd max scrubs not honored?

2017-10-13 Thread J David
Thanks all for input on this. It’s taken a couple of weeks, but based on the feedback from the list, we’ve got our version of a scrub-one-at-a-time cron script running and confirmed that it’s working properly. Unfortunately, this hasn’t really solved the real problem. Even with just one scrub an
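
The cron script itself is not reproduced here, but the general shape of a scrub-one-at-a-time job is something like the following sketch (the pgid is illustrative; the real script picks the PG with the oldest deep-scrub stamp from "ceph pg dump"):

    ceph pg dump pgs_brief 2>/dev/null | grep -q 'scrubbing+deep' && exit 0   # bail out if a deep scrub is already running
    ceph pg deep-scrub 2.1f                                                   # otherwise kick exactly one PG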

[ceph-users] osd max scrubs not honored?

2017-09-26 Thread J David
With “osd max scrubs” set to 1 in ceph.conf, which I believe is also the default, at almost all times, there are 2-3 deep scrubs running. 3 simultaneous deep scrubs is enough to cause a constant stream of: mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32 sec (REQUEST_SLOW)
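
For reference, the setting in question and a way to confirm what a running OSD is actually using (osd.0 is a placeholder):

    # ceph.conf, [osd] section
    osd max scrubs = 1

    # confirm via the admin socket on the OSD's host
    ceph daemon osd.0 config get osd_max_scrubs

Note that osd_max_scrubs limits concurrent scrubs per OSD, not per cluster, so several OSDs can each be scrubbing one PG at the same time, which is one way to end up with 2-3 deep scrubs cluster-wide even with the setting at 1.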

Re: [ceph-users] How is split brain situations handled in ceph?

2016-10-26 Thread J David
On Wed, Oct 26, 2016 at 8:55 AM, Andreas Davour wrote: > If there are 1 MON in B, that cluster will have quorum within itself and > keep running, and in A the MON cluster will vote and reach quorum again. Quorum requires a majority of all monitors. One monitor by itself (in a cluster with at lea

Re: [ceph-users] Out-of-date RBD client libraries

2016-10-25 Thread J David
On Tue, Oct 25, 2016 at 3:10 PM, Steve Taylor wrote: > Recently we tested an upgrade from 0.94.7 to 10.2.3 and found exactly the > opposite. Upgrading the clients first worked for many operations, but we > got "function not implemented" errors when we would try to clone RBD > snapshots. > Yes, w

[ceph-users] Out-of-date RBD client libraries

2016-10-25 Thread J David
What are the potential consequences of using out-of-date client libraries with RBD against newer clusters? Specifically, what are the potential ill-effects of using Firefly client libraries (0.80.7 and 0.80.8) to access Hammer or Jewel (10.2.3) clusters? The upgrading instructions ( http://docs.c

Re: [ceph-users] Tuning ZFS + QEMU/KVM + Ceph RBD’s

2015-12-28 Thread J David
Yes, given the architectural design limitations of ZFS, there will indeed always be performance consequences for using it in an environment its creators never envisioned, like Ceph. But ZFS offers many advanced features not found on other filesystems, and for production environments that depend on

[ceph-users] Tuning ZFS + QEMU/KVM + Ceph RBD’s

2015-12-24 Thread J David
For a variety of reasons, a ZFS pool in a QEMU/KVM virtual machine backed by a Ceph RBD doesn’t perform very well. Does anyone have any tuning tips (on either side) for this workload? A fair amount of the problem is probably related to two factors. First, ZFS always assumes it is talking to bare
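
As a hedged starting point (commonly suggested rather than tested advice, and the dataset name is a placeholder), the guest-side knob most often mentioned is aligning the ZFS recordsize with the I/O the RBD layer actually sees, instead of letting ZFS tune itself for local spindles:

    zfs set recordsize=64K tank/data   # match the dominant I/O size of the workload; 128K is the default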

Re: [ceph-users] Minimum failure domain

2015-10-20 Thread J David
On Mon, Oct 19, 2015 at 7:09 PM, John Wilkins wrote: > The classic case is when you are just trying Ceph out on a laptop (e.g., > using file directories for OSDs, setting the replica size to 2, and setting > osd_crush_chooseleaf_type to 0). Sure, but the text isn’t really applicable in that situa
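
For completeness, the laptop/single-node case John describes corresponds to settings like these in ceph.conf (a sketch of the documented quick-start example, not a production recommendation):

    [global]
    osd pool default size = 2
    osd crush chooseleaf type = 0    # 0 = osd, so replicas may land on the same host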

[ceph-users] Minimum failure domain

2015-10-15 Thread J David
In the Ceph docs, at: http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-osd/ It says (under "Prepare OSDs"): "Note: When running multiple Ceph OSD daemons on a single node, and sharing a partioned journal with each OSD daemon, you should consider the entire node the minimum failure d

Re: [ceph-users] Ceph, SSD, and NVMe

2015-10-01 Thread J David
Thanks & Regards > Somnath > > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark > Nelson > Sent: Wednesday, September 30, 2015 12:04 PM > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Ceph, SSD, and NVMe

Re: [ceph-users] Issue with journal on another drive

2015-09-30 Thread J David
On Tue, Sep 29, 2015 at 7:32 AM, Jiri Kanicky wrote: > Thank you for your reply. In this case I am considering to create separate > partitions for each disk on the SSD drive. Would be good to know what is the > performance difference, because creating partitions is kind of waste of > space. It ma

[ceph-users] Ceph, SSD, and NVMe

2015-09-30 Thread J David
Because we have a good thing going, we are still running Firefly on all of our clusters, including our largest, all-SSD cluster. If I understand right, newer versions of Ceph make much better use of SSDs and give overall much higher performance on the same equipment. However, the imp

Re: [ceph-users] high density machines

2015-09-30 Thread J David
On Wed, Sep 30, 2015 at 8:19 AM, Mark Nelson wrote: > FWIW, I've mentioned to Supermicro that I would *really* love a version of the > 5018A-AR12L that replaced the Atom with an embedded Xeon-D 1540. :) Is even that enough? (It's a serious question; due to our insatiable need for IOPs rather tha

Re: [ceph-users] high density machines

2015-09-29 Thread J David
On Thu, Sep 3, 2015 at 3:49 PM, Gurvinder Singh wrote: >> The density would be higher than the 36 drive units but lower than the >> 72 drive units (though with shorter rack depth afaik). > You mean the 1U solution with 12 disk is longer in length than 72 disk > 4U version ? This is a bit old and

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson wrote: > Maybe try some iperf tests between the different OSD nodes in your > cluster and also the client to the OSDs. This proved to be an excellent suggestion. One of these is not like the others: f16 inbound: 6Gbps f16 outbound: 6Gbps f17 inbound
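
For anyone wanting to reproduce this kind of check, a minimal iperf pairing (hostnames are placeholders; swap the roles to measure the other direction):

    iperf -s               # on node f16
    iperf -c f16 -t 30     # from each of the other nodes, and from the client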

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman wrote: > That looks a lot like what I was seeing initially. The OSDs getting marked > out was relatively rare and it took a bit before I saw it. Our problem is "most of the time" and does not appear confined to a specific ceph cluster node or OSD:

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman wrote: > What does "ceph status" say? Usually it says everything is cool. However just now it gave this: cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7 health HEALTH_WARN 2 requests are blocked > 32 sec monmap e3: 3 mons at {f16=192.

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson wrote: > rados -p 30 bench write > > just to see how it handles 4MB object writes. Here's that, from the VM host: Total time run: 52.062639 Total writes made: 66 Write size: 4194304 Bandwidth (MB/sec): 5.071 Stddev Ban
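
For reference, the usual form of that benchmark is (pool name and duration are illustrative; --no-cleanup keeps the objects around for a follow-up read test):

    rados -p rbd bench 30 write --no-cleanup
    rados -p rbd bench 30 seq        # read back the objects written above
    rados -p rbd cleanup             # remove the benchmark objects afterwards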

[ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
This is the same cluster I posted about back in April. Since then, the situation has gotten significantly worse. Here is what iostat looks like for the one active RBD image on this cluster: Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_awai

Re: [ceph-users] Having trouble getting good performance

2015-04-24 Thread J David
On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk wrote: > 7.2k drives tend to do about 80 iops at 4kb IO sizes, as the IO size > increases the number of iops will start to fall. You will probably get > around 70 iops for 128kb. But please benchmark your raw disks to get some > accurate numbers if neede
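
A hedged example of benchmarking a raw drive directly, along the lines Nick suggests (the device is a placeholder and the test overwrites its contents):

    fio --name=raw-randwrite --filename=/dev/sdX --direct=1 --ioengine=libaio \
        --rw=randwrite --bs=128k --iodepth=1 --runtime=60 --time_based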

Re: [ceph-users] Having trouble getting good performance

2015-04-24 Thread J David
On Fri, Apr 24, 2015 at 6:39 AM, Nick Fisk wrote: > From the Fio runs, I see you are getting around 200 iops at 128kb write io > size. I would imagine you should be getting somewhere around 200-300 iops > for the cluster you posted in the initial post, so it looks like its > performing about right

Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Thu, Apr 23, 2015 at 4:23 PM, Mark Nelson wrote: > If you want to adjust the iodepth, you'll need to use an asynchronous > ioengine like libaio (you also need to use direct=1) Ah yes, libaio makes a big difference. With 1 job: testfile: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K,

Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Thu, Apr 23, 2015 at 3:05 PM, Nick Fisk wrote: > I have had a look through the fio runs, could you also try and run a couple > of jobs with iodepth=64 instead of numjobs=64. I know they should do the > same thing, but the numbers with the former are easier to understand. Maybe it's an issue of

Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Thu, Apr 23, 2015 at 3:05 PM, Nick Fisk wrote: > If you can let us know the avg queue depth that ZFS is generating that will > probably give a good estimation of what you can expect from the cluster. How would that be measured? > I have had a look through the fio runs, could you also try and
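
One hedged way to answer the measurement question: inside the VM, watch the average queue size column for the device backing the ZFS pool while the workload runs (the column is avgqu-sz in older sysstat releases, aqu-sz in newer ones):

    iostat -x 1      # look at the row for the virtual disk holding the pool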

Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Wed, Apr 22, 2015 at 4:07 PM, Somnath Roy wrote: > I am suggesting synthetic workload like fio to run on top of VM to identify > where the bottleneck is. For example, if fio is giving decent enough output, > I guess ceph layer is doing fine. It is your client that is not driving > enough. A

Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Wed, Apr 22, 2015 at 4:30 PM, Nick Fisk wrote: > I suspect you are hitting problems with sync writes, which Ceph isn't known > for being the fastest thing for. There's "not being the fastest thing" and "an expensive cluster of hardware that performs worse than a single SATA drive." :-( > I'm

Re: [ceph-users] Getting placement groups to place evenly (again)

2015-04-22 Thread J David
On Wed, Apr 22, 2015 at 2:16 PM, Gregory Farnum wrote: > Uh, looks like it's the contents of the "omap" directory (inside of > "current") are the levelDB store. :) OK, here's du -sk of all of those: 36740 ceph-0/current/omap 35736 ceph-1/current/omap 37356 ceph-2/current/omap 38096 ceph-3/curren

Re: [ceph-users] Having trouble getting good performance

2015-04-22 Thread J David
On Wed, Apr 22, 2015 at 2:54 PM, Somnath Roy wrote: > What ceph version are you using ? Firefly, 0.80.9. > Could you try with rbd_cache=false or true and see if behavior changes ? As this is ZFS, running a cache layer below it that it is not aware of violates data integrity and can cause corrup
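
For reference, the knob Somnath is asking about lives in the [client] section of ceph.conf on the hypervisor, and librbd only reads it when the image is opened, so the guest (or at least its QEMU process) has to be restarted for a change to take effect:

    [client]
    rbd cache = false    # or true, to compare behaviour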

[ceph-users] Having trouble getting good performance

2015-04-22 Thread J David
A very small 3-node Ceph cluster with this OSD tree: http://pastebin.com/mUhayBk9 has some performance issues. All 27 OSDs are 5TB SATA drives, it keeps two copies of everything, and it's really only intended for nearline backups of large data objects. All of the OSDs look OK in terms of utiliz

Re: [ceph-users] Getting placement groups to place evenly (again)

2015-04-22 Thread J David
On Thu, Apr 16, 2015 at 8:02 PM, Gregory Farnum wrote: > Since I now realize you did a bunch of reweighting to try and make > data match up I don't think you'll find something like badly-sized > LevelDB instances, though. It's certainly something I can check, just to be sure. Erm, what does a Le

Re: [ceph-users] unbalanced OSDs

2015-04-22 Thread J David
On Wed, Apr 22, 2015 at 7:12 AM, Stefan Priebe - Profihost AG wrote: > Also a reweight-by-utilization does nothing. As a fellow sufferer from this issue, mostly what I can offer you is sympathy rather than actual help. However, this may be beneficial: By default, reweight-by-utilization only al
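
For reference, the command takes a threshold argument expressed as a percentage of the average utilization (a hedged example; newer releases also offer a test-reweight-by-utilization dry run):

    ceph osd reweight-by-utilization 110   # consider OSDs above 110% of average instead of the default 120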

Re: [ceph-users] Getting placement groups to place evenly (again)

2015-04-11 Thread J David
On Thu, Apr 9, 2015 at 7:20 PM, Gregory Farnum wrote: > Okay, but 118/85 = 1.38. You say you're seeing variance from 53% > utilization to 96%, and 53%*1.38 = 73.5%, which is *way* off your > numbers. 53% to 96% is with all weights set to default (i.e. disk size) and all reweights set to 1. (I.e.

Re: [ceph-users] Getting placement groups to place evenly (again)

2015-04-08 Thread J David
On Wed, Apr 8, 2015 at 11:40 AM, Gregory Farnum wrote: > "ceph pg dump" will output the size of each pg, among other things. Among many other things. :) Here is the raw output, in case I'm misinterpreting it: http://pastebin.com/j4ySNBdQ It *looks* like the pg's are roughly uniform in size. T
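
A hedged way to summarize that dump without eyeballing the whole paste (the byte-count column position differs between releases, so check the header line of "ceph pg dump" before trusting the field number):

    ceph pg dump 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $7, $1}' | sort -n | tail   # $7 assumed to be BYTES; verify first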

Re: [ceph-users] Getting placement groups to place evenly (again)

2015-04-08 Thread J David
On Wed, Apr 8, 2015 at 11:33 AM, Gregory Farnum wrote: > Is this a problem with your PGs being placed unevenly, with your PGs being > sized very differently, or both? Please forgive the silly question, but how would one check that? Thanks!

[ceph-users] Getting placement groups to place evenly (again)

2015-04-07 Thread J David
Getting placement groups to be placed evenly continues to be a major challenge for us, bordering on impossible. When we first reported trouble with this, the ceph cluster had 12 OSD's (each Intel DC S3700 400GB) spread across three nodes. Since then, it has grown to 8 nodes with 38 OSD's. The av

Re: [ceph-users] How to do maintenance without falling out of service?

2015-01-27 Thread J David
On Wed, Jan 21, 2015 at 5:53 PM, Gregory Farnum wrote: > Depending on how you configured things it's possible that the min_size > is also set to 2, which would be bad for your purposes (it should be > at 1). This was exactly the problem. Setting min_size=1 (which I believe used to be the default
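
For anyone finding this thread later, the setting in question (pool name is illustrative):

    ceph osd pool set rbd min_size 1
    ceph osd pool get rbd min_size    # confirm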

[ceph-users] How to do maintenance without falling out of service?

2015-01-21 Thread J David
A couple of weeks ago, we had some involuntary maintenance come up that required us to briefly turn off one node of a three-node ceph cluster. To our surprise, this resulted in failure to write on the VM's on that ceph cluster, even though we set noout before the maintenance. This cluster is for
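
For reference, the flag mentioned above is set before the node goes down and cleared once its OSDs have rejoined:

    ceph osd set noout
    ceph osd unset noout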

Re: [ceph-users] Asked for emperor, got firefly. (You can't take the sky from me?)

2014-09-02 Thread J David
On Tue, Sep 2, 2014 at 3:47 PM, Alfredo Deza wrote: > This is an actual issue, so I created: > > http://tracker.ceph.com/issues/9319 > > And should be fixing it soon. Thank you!

Re: [ceph-users] Asked for emperor, got firefly. (You can't take the sky from me?)

2014-09-02 Thread J David
On Tue, Sep 2, 2014 at 2:50 PM, Konrad Gutkowski wrote: > You need to set higher priority for ceph repo, check "ceph-deploy with > --release (--stable) for dumpling?" thread. Right, this is the same issue as that. It looks like the 0.80.1 packages are coming from Ubuntu; this is the first time w
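
A hedged example of the pinning being referred to, so the ceph.com packages win over the Ubuntu archive (the origin string is an assumption; it must match whatever "apt-cache policy" reports for the ceph repository):

    # /etc/apt/preferences.d/ceph.pref
    Package: *
    Pin: origin "ceph.com"
    Pin-Priority: 1001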

Re: [ceph-users] Asked for emperor, got firefly. (You can't take the sky from me?)

2014-09-02 Thread J David
On Tue, Sep 2, 2014 at 1:00 PM, Alfredo Deza wrote: > correct, if you don't specify what release you want/need, ceph-deploy > will use the latest stable release (firefly as of this writing) So, ceph-deploy set up emperor repositories in /etc/apt/sources.list.d/ceph.list and then didn't use them?
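
For reference, the release can be pinned explicitly at install time (node name is a placeholder):

    ceph-deploy install --release emperor newnode1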

[ceph-users] Asked for emperor, got firefly. (You can't take the sky from me?)

2014-08-30 Thread J David
While adding some nodes to a ceph emperor cluster using ceph-deploy, the new nodes somehow wound up with 0.80.1, which I think is a Firefly release. The ceph version on existing nodes: $ ceph --version ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60) The repository on the new nodes

Re: [ceph-users] Uneven OSD usage

2014-08-30 Thread J David
On Fri, Aug 29, 2014 at 2:53 AM, Christian Balzer wrote: >> Now, 1200 is not a power of two, but it makes sense. (12 x 100). > Should have been 600 and then upped to 1024. At the time, there was a reason why doing that did not work, but I don't remember the specifics. All messages sent back in

Re: [ceph-users] question about monitor and paxos relationship

2014-08-29 Thread J David
On Fri, Aug 29, 2014 at 12:52 AM, pragya jain wrote: > #2: why odd no. of monitors are recommended for production cluster, not even > no.? Because to achieve a quorum, you must always have participation of more than 50% of the monitors. Not 50%. More than 50%. With an even number of monitors,
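
A concrete way to see it: with N monitors, quorum needs floor(N/2)+1 of them, so 3 monitors need 2 and tolerate 1 failure, 4 monitors need 3 and still tolerate only 1 failure, and 5 need 3 and tolerate 2. An even-numbered monitor adds another thing that can fail without adding any failure tolerance, which is why odd counts are recommended.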

Re: [ceph-users] Uneven OSD usage

2014-08-28 Thread J David
On Thu, Aug 28, 2014 at 10:47 PM, Christian Balzer wrote: >> There are 1328 PG's in the pool, so about 110 per OSD. >> > And just to be pedantic, the PGP_NUM is the same? Ah, "ceph status" reports 1328 pgs. But: $ sudo ceph osd pool get rbd pg_num pg_num: 1200 $ sudo ceph osd pool get rbd pgp_n
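
The 1328 figure in "ceph status" most likely includes the PGs of the other pools in the cluster, not just rbd. If pgp_num really does lag pg_num, bringing it in line is a single command:

    ceph osd pool set rbd pgp_num 1200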

Re: [ceph-users] Uneven OSD usage

2014-08-28 Thread J David
On Thu, Aug 28, 2014 at 7:00 PM, Robert LeBlanc wrote: > How many PGs do you have in your pool? This should be about 100/OSD. There are 1328 PG's in the pool, so about 110 per OSD. Thanks!

[ceph-users] Uneven OSD usage

2014-08-28 Thread J David
Hello, Is there any way to provoke a ceph cluster to level out its OSD usage? Currently, a cluster of 3 servers with 4 identical OSDs each is showing disparity of about 20% between the most-used OSD and the least-used OSD. This wouldn't be too big of a problem, but the most-used OSD is now at 86