[ceph-users] Repository with some internal utils
Hi,

Someone asked me if he could get access to the BTRFS defragmenter we use for our Ceph OSDs. I took a few minutes to put together a small GitHub repository with:
- the defragmenter I was asked about (tested on 7200 rpm drives and designed to put low IO load on them),
- the scrub scheduler we use to avoid load spikes on Firefly,
- some basic documentation (this is still rough around the edges, so you'd better like reading Ruby code if you want to peek at most of the logic, tune or hack these).

Here it is: https://github.com/jtek/ceph-utils

This has been running in production for several months now, and I haven't touched the code or the numerous internal tunables these scripts have for several weeks, so it probably won't destroy your clusters. These scripts come without warranties though.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to properly deal with NEAR FULL OSD
On 19/02/2016 17:17, Don Laursen wrote:
> Thanks. To summarize:
> Your data, images+volumes = 27.15% space used
> Raw used = 81.71% used
>
> This is a big difference that I can't account for. Can anyone? So is
> your cluster actually full?

I believe this is the pool size (the replication factor) being accounted for, and it is harmless: 3 x 27.15 = 81.45, which is awfully close to 81.71. We see the same behavior on our Ceph cluster.

> I had the same problem with my small cluster. Raw used was about 85%
> and actual data, with replication, was about 30%. My OSDs were also
> BTRFS. BTRFS was causing its own problems. I fixed my problem by
> removing each OSD one at a time and re-adding it with the default XFS
> filesystem. Doing so brought the percentages used to about the same
> and it's good now.

That's odd: AFAIK we had the same behaviour with XFS before migrating to BTRFS.

Best regards,

Lionel
Re: [ceph-users] ZFS or BTRFS for performance?
Hi,

On 18/03/2016 20:58, Mark Nelson wrote:
> FWIW, from purely a performance perspective Ceph usually looks pretty
> fantastic on a fresh BTRFS filesystem. In fact it will probably
> continue to look great until you do small random writes to large
> objects (like say to blocks in an RBD volume). Then COW starts
> fragmenting the objects into oblivion. I've seen sequential read
> performance drop by 300% after 5 minutes of 4K random writes to the
> same RBD blocks.
>
> Autodefrag might help.

With 3.19 it wasn't enough for our workload and we had to develop our own defragmentation scheduler, see https://github.com/jtek/ceph-utils. We tried autodefrag again with a 4.0.5 kernel but it wasn't good enough yet (and based on my reading of the linux-btrfs list I don't think there is any work being done on it currently).

> A long time ago I recall Josef told me it was dangerous to use (I
> think it could run the node out of memory and corrupt the FS), but it
> may be that it's safer now.

No problem here (as long as we use our defragmentation scheduler; otherwise performance degrades over time/amount of rewrites).

> In any event we don't really do a lot of testing with BTRFS these
> days as bluestore is indeed the next gen OSD backend.

Will bluestore provide the same protection against bitrot as BTRFS? Ie: with BTRFS the deep-scrubs detect inconsistencies *and* the OSD(s) with invalid data get IO errors when trying to read the corrupted data, and as such can't be used as the source for repairs even if they are primary OSD(s). So with BTRFS you get a pretty good overall protection against bitrot in Ceph (it allowed us to automate the repair process in the most common cases). With XFS, IIRC, unless you override the default behavior, the primary OSD is always the source for repairs (even if all the secondaries agree on another version of the data).

Best regards,

Lionel
Re: [ceph-users] Deprecating ext4 support
On 12/04/2016 01:40, Lindsay Mathieson wrote:
> On 12/04/2016 9:09 AM, Lionel Bouton wrote:
>> * If the journal is not on a separate partition (SSD), it should
>> definitely be re-created NoCoW to avoid unnecessary fragmentation. From
>> memory: stop OSD, touch journal.new, chattr +C journal.new, dd
>> if=journal of=journal.new (your dd options here for best perf/least
>> amount of cache eviction), rm journal, mv journal.new journal, start OSD
>> again.
>
> Flush the journal after stopping the OSD!

No need to: dd makes an exact duplicate.

Lionel
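The procedure quoted above can be sketched as a small script. This is only a sketch under stated assumptions: the OSD id and paths are examples, the OSD must be stopped first, and the dd options should be tuned for your hardware:

```shell
#!/bin/sh
# Sketch of the NoCoW journal re-creation described above.
# Assumes the OSD is stopped before calling this, and that the journal
# lives on a BTRFS filesystem (chattr +C only warns elsewhere).
recreate_nocow_journal() {
  j=$1
  touch "$j.new"
  # NoCoW must be set while the file is still empty to be effective.
  chattr +C "$j.new" 2>/dev/null || echo "warning: chattr +C failed (not BTRFS?)" >&2
  # Pick dd options (bs, direct I/O, ...) suited to your hardware.
  dd if="$j" of="$j.new" bs=4M conv=fsync 2>/dev/null
  rm "$j"
  mv "$j.new" "$j"
}

# Hypothetical usage (OSD id 0): stop the OSD, then
#   recreate_nocow_journal /var/lib/ceph/osd/ceph-0/journal
# then start the OSD again.
```

As discussed in this thread, no journal flush is needed since dd produces an exact copy.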
Re: [ceph-users] Deprecating ext4 support
Hi,

On 11/04/2016 23:57, Mark Nelson wrote:
> [...]
> To add to this on the performance side, we stopped doing regular
> performance testing on ext4 (and btrfs) sometime back around when ICE
> was released to focus specifically on filestore behavior on xfs.
> There were some cases at the time where ext4 was faster than xfs, but
> not consistently so. btrfs is often quite fast on a fresh fs, but
> degrades quickly due to fragmentation induced by cow with
> small-writes-to-large-object workloads (IE RBD small writes). If
> btrfs auto-defrag is now safe to use in production it might be worth
> looking at again, but probably not ext4.

For BTRFS, autodefrag is probably not performance-safe (yet), at least with RBD access patterns. At least it wasn't in 4.1.9 when we last tested it: performance degraded slowly but surely over several weeks, from an initially well-performing filesystem to the point where we measured a 100% increase in average latencies along with large spikes, and stopped the experiment. I haven't seen any patches on linux-btrfs since then (it might have benefited from other modifications, but the autodefrag algorithm itself wasn't reworked AFAIK). That's not an inherent problem of BTRFS but of the autodefrag implementation, though.

Deactivating autodefrag and reimplementing a basic, cautious defragmentation scheduler gave us noticeably better latencies with BTRFS vs XFS (~30% better) on the same hardware and workload long term (as in almost a year and countless full-disk rewrites on the same filesystems, due to both normal writes and rebalancing, with 3 to 4 months of XFS and BTRFS OSDs coexisting for comparison purposes).

I'll certainly remount a subset of our OSDs with autodefrag, as I did with 4.1.9, when we deploy 4.4.x or a later LTS kernel, so I might have more up-to-date information in the coming months. I don't plan to compare BTRFS to XFS anymore though: XFS only saves us from running our defragmentation scheduler; BTRFS is far more suited to our workload, and we've seen constant improvements in behavior along the (arguably bumpy until late 3.19 versions) 3.16.x to 4.1.x road.

Other things:
* If the journal is not on a separate partition (SSD), it should definitely be re-created NoCoW to avoid unnecessary fragmentation. From memory: stop OSD, touch journal.new, chattr +C journal.new, dd if=journal of=journal.new (your dd options here for best perf/least amount of cache eviction), rm journal, mv journal.new journal, start OSD again.
* filestore btrfs snap = false is mandatory if you want consistent performance (at least on HDDs). It may not be felt with almost empty OSDs, but performance hiccups appear as soon as any non-trivial amount of data is added to the filesystems. IIRC, after debugging, surprisingly the snapshot creation didn't seem to be the actual cause of the performance problems, but the snapshot deletion was... It's so bad that the default should probably be false and not true.

Lionel
Re: [ceph-users] ZFS or BTRFS for performance?
On 19/03/2016 18:38, Heath Albritton wrote:
> If you google "ceph bluestore" you'll be able to find a couple slide
> decks on the topic. One of them by Sage is easy to follow without the
> benefit of the presentation. There's also the "Redhat Ceph Storage
> Roadmap 2016" deck.
>
> In any case, bluestore is not intended to address bitrot. Given that
> ceph is a distributed file system, many of the posix file system
> features are not required for the underlying block storage device.
> Bluestore is intended to address this and reduce the disk IO required
> to store user data.
>
> Ceph protects against bitrot at a much higher level by validating the
> checksum of the entire placement group during a deep scrub.

My impression is that the only protection against bitrot is provided by the underlying filesystem, which means that you don't get any if you use XFS or EXT4. I can't trust Ceph on this alone until its bitrot protection (if any) is clearly documented, and the situation is far from clear right now. The documentation states that deep scrubs use checksums to validate data, but this is not good enough, at least because we don't know what these checksums are supposed to cover (see below for another reason). There is even this howto by Sebastien Han about repairing a PG: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ which clearly concludes that with only 2 replicas you can't reliably find out which object is corrupted with Ceph alone. If Ceph really stored checksums for all the objects it stores, we could manually check which replica is valid.

Even if deep scrubs used checksums to verify data, this would not be enough to protect against bitrot: there is a window between a corruption event and the next deep scrub during which the data on a primary can be returned to a client. BTRFS solves this problem by returning an IO error for any data read that doesn't match its checksum (or by automatically rebuilding the data if the allocation group uses RAID1/10/5/6). I've never seen this kind of behavior documented for Ceph.

Lionel
Re: [ceph-users] ZFS or BTRFS for performance?
Hi,

On 20/03/2016 15:23, Francois Lafont wrote:
> Hello,
>
> On 20/03/2016 04:47, Christian Balzer wrote:
>
>> That's not protection, that's an "uh-oh, something is wrong, you better
>> check it out" notification, after which you get to spend a lot of time
>> figuring out which is the good replica.
>
> In fact, I have never been confronted with this case so far and I have a
> couple of questions.
>
> 1. When it happens (ie a deep scrub fails), is it mentioned in the output
> of the "ceph status" command and, in this case, can you confirm that
> the health of the cluster in the output is different from "HEALTH_OK"?

Yes. This is obviously a threat to your data, so the cluster isn't HEALTH_OK (HEALTH_WARN IIRC).

> 2. For instance, say it happens with PG id == 19.10 and I have 3 OSDs
> for this PG (because my pool has replica size == 3). I suppose that the
> concerned OSDs are OSD ids 1, 6 and 12. Can you tell me if this "naive"
> method is valid to solve the problem (and, if not, why)?
>
> a) ssh into the node which hosts osd-1 and launch this command:
> ~# id=1 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* |
>      sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 055b0fd18cee4b158a8d336979de74d25fadc1a3 -
>
> b) ssh into the node which hosts osd-6 and launch this command:
> ~# id=6 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* |
>      sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 055b0fd18cee4b158a8d336979de74d25fadc1a3 -
>
> c) ssh into the node which hosts osd-12 and launch this command:
> ~# id=12 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* |
>      sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 3f786850e387550fdab836ed7e6dc881de23001b -

You may get 3 different hashes because of concurrent writes on the PG, so you may have to restart your commands, and probably try to launch them at the same time on all nodes to avoid this problem. If you have constant heavy writes on all your PGs this will probably never give a useful result.

> I notice that the result is different for osd-12 so it's the "bad" osd.
> So, on the node which hosts osd-12, I launch this command:
>
> id=12 && rm /var/lib/ceph/osd/ceph-$id/current/19.10_head/*

You should stop the OSD and flush its journal, then do this, before restarting the OSD.

> And now I can safely launch this command:
>
> ceph pg repair 19.10
>
> Is there a problem with this "naive" method?

It is probably overkill (and may not work, see above). Usually you can find out the exact file in this directory which differs and should be deleted (see the link in my previous post). I believe that if the offending file isn't on the primary you can directly launch the repair command.

Lionel
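The sequence discussed above (stop the OSD, flush its journal, remove the bad copy, restart, repair) could be scripted along these lines. This is only a sketch, not an official tool: the OSD id, PG id and object filename are hypothetical, the service commands are examples for a systemd host, and the helper defaults to printing the commands instead of running them:

```shell
#!/bin/sh
# Dry-run sketch of the manual PG repair sequence discussed above.
# Args: osd id, pg id, object filename relative to the pg directory.
# With no 4th argument the commands are only printed; pass "" as the
# 4th argument to actually execute them.
repair_pg_object() {
  osd=$1 pg=$2 obj=$3 run=${4-echo}
  $run systemctl stop "ceph-osd@$osd"
  $run ceph-osd -i "$osd" --flush-journal
  $run rm "/var/lib/ceph/osd/ceph-$osd/current/${pg}_head/$obj"
  $run systemctl start "ceph-osd@$osd"
  $run ceph pg repair "$pg"
}

# Dry run with hypothetical values (prints the five commands):
repair_pg_object 12 19.10 some_object_file
```

The dry-run default is deliberate: deleting the wrong replica on the primary is exactly the failure mode this thread warns about.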
Re: [ceph-users] Fwd: Ceph OSD suicide himself
Hi,

On 12/07/2016 02:51, Brad Hubbard wrote:
> [...]
>>>> This is probably a fragmentation problem : typical rbd access patterns
>>>> cause heavy BTRFS fragmentation.
>>> To the extent that operations take over 120 seconds to complete? Really?
>> Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
>> aggressive way, rewriting data all over the place and creating/deleting
>> snapshots every filestore sync interval (5 seconds max by default IIRC).
>>
>> As I said there are 3 main causes of performance degradation :
>> - the snapshots,
>> - the journal in a standard copy-on-write file (move it out of the FS or
>> use NoCow),
>> - the weak auto defragmentation of BTRFS (autodefrag mount option).
>>
>> Each one of them is enough to impact or even destroy performance in the
>> long run. The 3 combined make BTRFS unusable by default. This is why
>> BTRFS is not recommended : if you want to use it you have to be prepared
>> for some (heavy) tuning. The first 2 points are easy to address, for the
>> last (which begins to be noticeable when you accumulate rewrites on your
>> data) I'm not aware of any other tool than the one we developed and
>> published on github (link provided in previous mail).
>>
>> Another thing : you better have a recent 4.1.x or 4.4.x kernel on your
>> OSDs if you use BTRFS. We've used it since 3.19.x but I wouldn't advise
>> it now and would recommend 4.4.x if it's possible for you and 4.1.x
>> otherwise.
> Thanks for the information. I wasn't aware things were that bad with BTRFS as
> I haven't had much to do with it up to this point.

Bad is relative. BTRFS was very time consuming to set up (mainly because of the defragmentation scheduler development, but finding the sources of inefficiency was no picnic either). Once used properly, though, it has 3 unique advantages:
- data checksums: this forces Ceph to use a good replica, by refusing to hand over corrupted data, and makes it far easier to handle silent data corruption (and some of our RAID controllers, probably damaged by electrical surges, had this nasty habit of flipping bits, so it really was a big time/data saver here),
- compression: you get more space for free,
- speed: we get better latencies than with XFS.

Until bluestore is production ready (it should address these points even better than BTRFS does), unless I find a use case where BTRFS falls on its face, there's no way I'd use anything but BTRFS with Ceph.

Best regards,

Lionel
Re: [ceph-users] ceph OSD with 95% full
Hi,

On 19/07/2016 13:06, Wido den Hollander wrote:
>> On 19 July 2016 at 12:37, M Ranga Swami Reddy wrote:
>>
>> Thanks for the correction... so even if one OSD reaches 95% full, the
>> total ceph cluster IO (R/W) will be blocked... Ideally read IO should
>> work...
> That should be a config option, since reading while writes still block is
> also a danger. Multiple clients could read the same object, perform an
> in-memory change and their writes will block.
>
> Now, which client will 'win' after the full flag has been removed?
>
> That could lead to data corruption.

If it did, the clients would be broken: normal usage (without writes being blocked) doesn't prevent multiple clients from reading the same data and trying to write at the same time. So if multiple writes (I suppose on the same data blocks) can be waiting, the order in which they are performed *must not* matter in your system. The alternative is to prevent simultaneous write accesses from multiple clients (this is how non-cluster filesystems must be configured on top of Ceph/RBD; they must even be prevented from read-only accessing an already mounted fs).

> Just make sure you have proper monitoring on your Ceph cluster. At nearfull
> it goes into WARN and you should act on that.

+1: monitoring is not an option.

Lionel
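For reference, the thresholds discussed above are configurable in ceph.conf. The values below are, as far as I know, the defaults of this era; check the documentation for your release before changing them:

```ini
[global]
; cluster reports HEALTH_WARN when an OSD crosses this ratio
mon osd nearfull ratio = 0.85
; writes are blocked cluster-wide when any OSD crosses this ratio
mon osd full ratio = 0.95
```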
Re: [ceph-users] Fwd: Ceph OSD suicide himself
On 11/07/2016 04:48, 한승진 wrote:
> Hi cephers.
>
> I need your help for some issues.
>
> The ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>
> I run 1 Mon and 48 OSDs in 4 nodes (each node has 12 OSDs).
>
> I've experienced one of the OSDs killing itself.
>
> It always issued a suicide timeout message.

This is probably a fragmentation problem: typical rbd access patterns cause heavy BTRFS fragmentation.

If you already use the autodefrag mount option, you can try this instead, which performs much better for us: https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
Note that it can take some time to fully defragment the filesystems, but it shouldn't put more stress on them than autodefrag while doing so.

If you don't already use it, set:

filestore btrfs snap = false

in ceph.conf and restart your OSDs.

Finally, if you use journals on the filesystem and not on dedicated partitions, you'll have to recreate them with the NoCow attribute (there's no way to defragment journals that doesn't kill performance otherwise).

Best regards,

Lionel
Re: [ceph-users] Fwd: Ceph OSD suicide himself
On 11/07/2016 11:56, Brad Hubbard wrote:
> On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
> <lionel-subscript...@bouton.name> wrote:
>> On 11/07/2016 04:48, 한승진 wrote:
>>> Hi cephers.
>>>
>>> I need your help for some issues.
>>>
>>> The ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>>>
>>> I run 1 Mon and 48 OSDs in 4 nodes (each node has 12 OSDs).
>>>
>>> I've experienced one of the OSDs killing itself.
>>>
>>> It always issued a suicide timeout message.
>> This is probably a fragmentation problem : typical rbd access patterns
>> cause heavy BTRFS fragmentation.
> To the extent that operations take over 120 seconds to complete? Really?

Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very aggressive way, rewriting data all over the place and creating/deleting snapshots every filestore sync interval (5 seconds max by default IIRC).

As I said, there are 3 main causes of performance degradation:
- the snapshots,
- the journal in a standard copy-on-write file (move it out of the FS or use NoCow),
- the weak auto-defragmentation of BTRFS (autodefrag mount option).

Each one of them is enough to impact or even destroy performance in the long run. The 3 combined make BTRFS unusable by default. This is why BTRFS is not recommended: if you want to use it you have to be prepared for some (heavy) tuning. The first 2 points are easy to address; for the last (which begins to be noticeable when you accumulate rewrites on your data) I'm not aware of any other tool than the one we developed and published on GitHub (link provided in my previous mail).

Another thing: you'd better have a recent 4.1.x or 4.4.x kernel on your OSDs if you use BTRFS. We've used it since 3.19.x, but I wouldn't advise that now; I'd recommend 4.4.x if it's possible for you, and 4.1.x otherwise.

Best regards,

Lionel
Re: [ceph-users] Another cluster completely hang
Hi,

On 29/06/2016 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has put out two disks because scrub has
> failed (I think it is not a disk fault but due to mark-complete)

There is something odd going on. I've only seen deep-scrub failing (ie detecting one inconsistency and marking the pg accordingly), so I'm not sure what happens in the case of a "simple" scrub failure, but what should not happen is the whole OSD going down on a scrub or deep-scrub failure, which you seem to imply did happen. Do you have logs for these two failures giving a hint at what happened (probably /var/log/ceph/ceph-osd..log)? Any kernel log pointing to hardware failure(s) around the time these events happened?

Another point: you said that you had one disk "broken". Usually ceph handles this case in the following manner:
- the OSD detects the problem and commits suicide (unless it's configured to ignore IO errors, which is not the default),
- your cluster is then in degraded state with one OSD down/in,
- after a timeout (several minutes), Ceph decides that the OSD won't come back up soon and marks it "out" (so one OSD down/out),
- as the OSD is out, crush adapts pg positions based on the remaining available OSDs and brings all degraded pgs back to clean state by creating the missing replicas while moving pgs around. You see a lot of IO and many pgs in wait_backfill/backfilling states at this point,
- when all is done the cluster is back to HEALTH_OK.

When your disk was broken and you waited 24 hours, how far along this process was your cluster?

Best regards,

Lionel
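The "timeout (several minutes)" in the sequence above corresponds, if I'm not mistaken, to this ceph.conf setting (default of this era shown, in seconds):

```ini
[mon]
; how long an OSD may stay "down" before it is marked "out"
; and backfilling to the remaining OSDs starts
mon osd down out interval = 600
```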
Re: [ceph-users] pg scrub and auto repair in hammer
Hi,

On 29/06/2016 18:33, Stefan Priebe - Profihost AG wrote:
>> On 28.06.2016 at 09:43, Lionel Bouton
>> <lionel-subscript...@bouton.name> wrote:
>>
>> Hi,
>>
>> On 28/06/2016 08:34, Stefan Priebe - Profihost AG wrote:
>>> [...]
>>> Yes but at least BTRFS is still not working for ceph due to
>>> fragmentation. I've even tested a 4.6 kernel a few weeks ago. But it
>>> doubles its I/O after a few days.
>> BTRFS autodefrag is not working over the long term. That said, BTRFS
>> itself is working far better than XFS on our cluster (noticeably better
>> latencies). As not having checksums wasn't an option, we coded and are
>> using this:
>>
>> https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
>>
>> This actually saved us from 2 faulty disk controllers which were
>> infrequently corrupting data in our cluster.
>>
>> Mandatory too for performance:
>> filestore btrfs snap = false
> This sounds interesting. For how long have you been using this method?

More than a year now. Since the beginning, almost two years ago, we have always had at least one or two BTRFS OSDs to test and compare to the XFS ones. At the very beginning we had to recycle them regularly because their performance degraded over time. This was not a problem as Ceph makes it easy to move data around safely. We only switched over after finding out both that "filestore btrfs snap = false" was mandatory (when true it creates large write spikes every filestore sync interval) and that a custom defragmentation process was needed to maintain performance over the long run.

> What kind of workload do you have?

A dozen VMs using rbd through KVM's built-in support. There are different kinds of access patterns: a large PostgreSQL instance (75+ GB on disk, 300+ tx/s with peaks of ~2000, a mean of 50+ IO/s with peaks to 1000, mostly writes), a small MySQL instance (hard to say: it was very large, but we moved most of its content to PostgreSQL, which left only a small database for a proprietary tool and large ibdata* files with mostly holes), a very large NFS server (~10 TB), and lots of Ruby on Rails applications and background workers. On the whole storage system Ceph reports an average of 170 op/s, with peaks that can reach 3000.

> How did you measure the performance and latency?

Every useful metric we can get is fed to a Zabbix server. Latency is measured both by the kernel on each disk, as the average time a request stays in queue (accumulated wait time / number of IOs over a given period: you can find these values in /sys/block/<dev>/stat), and at the Ceph level by monitoring the apply latency (we now have journals on SSD, so our commit latency is mostly limited by the available CPU). The most interesting metric is the apply latency; block device latency is useful to see how hard the device itself is pushed and how well reads perform (apply latency only gives us the write side of the story). The behavior during backfills confirmed the latency benefits too: BTRFS OSDs were less frequently involved in slow requests than the XFS ones.

> What kernel do you use with btrfs?

4.4.6 currently (we just finished migrating all servers last weekend), but the switch from XFS to BTRFS occurred with late 3.19 kernels IIRC. I don't have measurements for this, but when we switched from 4.1.15-r1 ("-r1" is for Gentoo patches) to 4.4.6 we saw faster OSD startups (including the initial filesystem mount). The only drawback of BTRFS (if you don't count having to develop and run a custom defragmentation scheduler) was the OSD startup time vs XFS: it was very slow when starting from an unmounted filesystem, at least until 4.1.x. This was not really a problem as we don't restart OSDs often.

Best regards,

Lionel
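The per-device queue-time metric described above (average time per request = accumulated time-in-queue delta divided by completed-IOs delta, both taken from /sys/block/&lt;dev&gt;/stat) can be sketched like this. The two samples below are made up for illustration; on a live host you would read the stat file twice, a few seconds apart:

```shell
#!/bin/sh
# Average request wait time between two samples of /sys/block/<dev>/stat.
# Fields used (see Documentation/block/stat.txt): 1 = reads completed,
# 5 = writes completed, 11 = accumulated time_in_queue in ms.
avg_wait_ms() {
  echo "$1 $2" | awk '{
    d_ios = ($12 + $16) - ($1 + $5)   # completed IOs during the interval
    d_q   = $22 - $11                 # time_in_queue delta (ms)
    printf "%.1f\n", (d_ios ? d_q / d_ios : 0)
  }'
}

# Made-up samples, e.g. taken a few seconds apart with: cat /sys/block/sda/stat
s1="1000 0 8000 500 2000 0 16000 1500 0 1800 2000"
s2="1100 0 8800 550 2200 0 17600 1650 0 1980 2600"
avg_wait_ms "$s1" "$s2"   # 600 ms spread over 300 IOs -> 2.0
```

A monitoring agent (Zabbix in our case) would feed such deltas at a fixed sampling interval.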
Re: [ceph-users] how possible is that ceph cluster crash
On 19/11/2016 00:52, Brian :: wrote:
> This is like your mother telling you not to cross the road when you were 4
> years of age but not telling you it was because you could be flattened
> by a car :)
>
> Can you expand on your answer? If you are in a DC with AB power,
> redundant UPS, dual feed from the electric company, onsite generators,
> dual PSU servers, is it still a bad idea?

Yes it is. In one such datacenter, where we have a Ceph cluster, there was a complete shutdown because of a design error: the probes used by the system responsible for starting and stopping the generators were installed before the breakers on the feeds. After a blackout where the generators kicked in, the breakers opened due to a surge when power was restored. The generators were stopped because power was "restored", and the UPS systems failed 3 minutes later. The breakers couldn't be closed again in time (you don't approach them without being heavily protected, and putting on the protective suit takes more time than simply closing the breaker).

There's no such thing as an uninterruptible power supply.

Best regards,

Lionel
Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release
Hi,

On 10/01/2017 19:32, Brian Andrus wrote:
> [...]
>
> I think the main point I'm trying to address is - as long as the
> backing OSD isn't egregiously handling large amounts of writes and it
> has a good journal in front of it (that properly handles O_DSYNC [not
> D_SYNC as Sebastien's article states]), it is unlikely inconsistencies
> will occur upon a crash and subsequent restart.

I don't see how you can guess whether it is "unlikely". If you need SSDs you are probably handling relatively large amounts of accesses (so large amounts of writes aren't unlikely), or you would have used cheap 7200rpm or even slower drives. Remember that in the default configuration, if any 3 OSDs fail at the same time, you have a chance of losing data. For <30 OSDs and size=3 this is highly probable, as there are only a few thousand possible combinations of 3 OSDs (and you typically have a thousand or two of pgs, each picking OSDs in a more or less random pattern). With SSDs not handling write barriers properly, I wouldn't bet on recovering the filesystems of all OSDs properly after a cluster-wide power loss shutting down all the SSDs at the same time... In fact, as the hardware will lie about the stored data, the filesystem might not even detect the crash properly and might apply its own journal on outdated data, leading to unexpected results. So losing data is a possibility, and testing for it is almost impossible (you would have to reproduce all the different access patterns your Ceph cluster could experience at the time of a power loss and trigger power losses in each case).

> Therefore - while not ideal to rely on journals to maintain consistency,

Ceph journals aren't designed to maintain the filestore's consistency. They *might* restrict the access patterns to the filesystems in such a way that running fsck on them after a "let's throw away committed data" crash has better chances of restoring enough data, but if that's the case it's only a happy coincidence (and you would have to run these fscks *manually*, as the filesystem can't detect the inconsistencies by itself).

> that is what they are there for.

No. They are there for Ceph's internal consistency, not for the consistency of the filesystem backing the filestore. Ceph relies both on journals and on filesystems able to maintain internal consistency and supporting syncfs; if the journal or the filesystem fails, the OSD is damaged. If 3 OSDs are damaged at the same time on a size=3 pool, you enter "probable data loss" territory.

> There is a situation where "consumer-grade" SSDs could be used as
> OSDs. While not ideal, it can and has been done before, and may be
> preferable to tossing out $500k of SSDs (Seen it firsthand!)

For these I'd like to know:
- which SSD models were used?
- how long did the SSDs survive (some consumer SSDs not only lie to the system about write completions, but they usually don't handle large amounts of writes nearly as well as DC models)?
- how many cluster-wide power losses did the cluster survive?
- what were the access patterns on the cluster during the power losses?

If, for a model not guaranteed for sync writes, there haven't been dozens of power losses on clusters under large load without any problem detected in the following week (think deep-scrub), using them is playing Russian roulette with your data. AFAIK there have only been reports of data losses and/or heavy maintenance when people tried to use consumer SSDs (admittedly mainly for journals). I've yet to spot a long-running robust cluster built with consumer SSDs.

Lionel
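The back-of-the-envelope risk estimate above can be made explicit. This is a deliberately simplistic sketch: it assumes each pg picks its 3 OSDs uniformly and independently at random, which real CRUSH placement does not do, so take the number as an order of magnitude only:

```shell
#!/bin/sh
# Chance that one simultaneous 3-OSD failure loses data, i.e. hits at
# least one pg, assuming (simplistically) each pg picks its 3 OSDs
# uniformly and independently at random.
est_loss_probability() { # args: number of OSDs, number of pgs
  awk -v n="$1" -v pgs="$2" 'BEGIN {
    triples = n * (n - 1) * (n - 2) / 6     # possible 3-OSD combinations
    p = 1 - (1 - 1 / triples) ^ pgs         # >=1 pg on the failed triple
    printf "%d triples, p = %.2f\n", triples, p
  }'
}

est_loss_probability 30 2000   # 30 OSDs, size=3, ~2000 pgs
```

With only ~4000 possible triples and a couple thousand pgs, a simultaneous 3-OSD failure has a very real chance of destroying some pg entirely, which is the point made above.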
Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release
Le 07/01/2017 à 14:11, kevin parrikar a écrit : > Thanks for your valuable input. > We were using these SSD in our NAS box(synology) and it was giving > 13k iops for our fileserver in raid1.We had a few spare disks which we > added to our ceph nodes hoping that it will give good performance same > as that of NAS box.(i am not comparing NAS with ceph ,just the reason > why we decided to use these SSD) > > We dont have S3520 or S3610 at the moment but can order one of these > to see how it performs in ceph .We have 4xS3500 80Gb handy. > If i create a 2 node cluster with 2xS3500 each and with replica of > 2,do you think it can deliver 24MB/s of 4k writes . Probably not. See http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ According to the page above the DC S3500 reaches 39MB/s. Its capacity isn't specified, yours are 80GB only which is the lowest capacity I'm aware of and for all DC models I know of the speed goes down with the capacity so you probably will get lower than that. If you put both data and journal on the same device you cut your bandwidth in half : so this would give you an average <20MB/s per OSD (with occasional peaks above that if you don't have a sustained 20MB/s). With 4 OSDs and size=2, your total write bandwidth is <40MB/s. For a single stream of data you will only get <20MB/s though (you won't benefit from parallel writes to the 4 OSDs and will only write on 2 at a time). Not that by comparison the 250GB 840 EVO only reaches 1.9MB/s. But even if you reach the 40MB/s, these models are not designed for heavy writes, you will probably kill them long before their warranty is expired (IIRC these are rated for ~24GB writes per day over the warranty period). In your configuration you only have to write 24G each day (as you have 4 of them, write both to data and journal and size=2) to be in this situation (this is an average of only 0.28 MB/s compared to your 24 MB/s target). 
> We bought the S3500 because last time when we tried ceph, people were > suggesting this model :) :) The 3500 series might be enough with the higher capacities in some rare cases, but the 80GB model is almost useless. You have to do the math considering: - how much you will write to the cluster (guess high if you have to guess), - whether you will use the SSDs for both journals and data (which means writing twice to them), - your replication level (which means you will write the same data multiple times), - when you expect to replace the hardware, - the amount of writes per day they support under warranty (if the manufacturer doesn't present this number prominently they are probably trying to sell you a fast car headed for a brick wall). If your hardware can't handle the amount of writes you expect to put on it, then you are screwed. There were reports of new Ceph users not aware of this who used cheap SSDs that failed in a matter of months, all at the same time. You definitely don't want to be in their position. In fact, as problems happen (hardware failure leading to cluster storage rebalancing for example), you should probably get a system able to handle 10x the amount of writes you expect it to handle, then monitor the SSD SMART attributes to be alerted long before they die and replace them before problems happen. You definitely want a controller allowing access to this information. If you can't get it, you will have to monitor the writes and estimate this value yourself, which is risky as write amplification inside SSDs is not easy to guess... Lionel
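The checklist above can be condensed into a small sizing helper. This is a sketch, not an authoritative formula: `required_dwpd` and all its parameters are names I made up here, and `safety_factor=10` mirrors the "10x the amount of writes" suggestion from the mail:

```python
def required_dwpd(client_write_gb_day, replication, journal_on_ssd,
                  n_ssd, ssd_capacity_gb, safety_factor=10):
    """Rough per-SSD endurance requirement in drive-writes-per-day (DWPD).

    Inputs mirror the checklist above; all values are assumptions to be
    replaced with your own measurements (guess high if you have to guess).
    """
    # Replication multiplies writes; a colocated journal doubles them again.
    amplification = replication * (2 if journal_on_ssd else 1)
    per_ssd_gb_day = client_write_gb_day * amplification / n_ssd
    return safety_factor * per_ssd_gb_day / ssd_capacity_gb

# The 80GB S3500 scenario from the previous mail:
print(required_dwpd(24, replication=2, journal_on_ssd=True,
                    n_ssd=4, ssd_capacity_gb=80))  # -> 3.0 DWPD
```

3 DWPD is squarely in write-intensive datacenter SSD territory, which is why the entry-level 80GB model falls short even for a modest 24 GB/day of client writes.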
Re: [ceph-users] slow requests and short OSD failures in small cluster
On 13/04/2017 at 17:47, mj wrote: > Hi, > > On 04/13/2017 04:53 PM, Lionel Bouton wrote: >> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any >> measurable impact on performance... until we tried to remove them. > > What exactly do you mean with that? Just what I said: having snapshots doesn't impact performance, only removing them does (obviously until Ceph has finished cleaning up). Lionel
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi, On 13/04/2017 at 10:51, Peter Maloney wrote: > [...] > Also more things to consider... > > Ceph snapshots really slow things down. We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on performance... until we tried to remove them. We usually have at least one snapshot per VM image, often 3 or 4. Note that we use BTRFS filestores where IIRC the CoW is handled by the filesystem, so it might be faster compared to the default/recommended XFS filestores. > They aren't efficient like on > zfs and btrfs. Having one might take away some % performance, and having > 2 snaps potentially takes double, etc. until it is crawling. And it's > not just the CoW... even just rbd snap rm, rbd diff, etc. start to take > many times longer. See http://tracker.ceph.com/issues/10823 for an > explanation of CoW. My goal is just to keep max 1 long term snapshot. [...] In my experience with BTRFS filestores, snap rm impact is proportional to the amount of data specific to the snapshot being removed (i.e. not present in any other snapshot) but completely unrelated to the number of existing snapshots. For example the first one removed can be handled very fast, and it can be the last one removed that takes the most time and impacts performance the most. Best regards, Lionel
Re: [ceph-users] slow requests and short OSD failures in small cluster
On 18/04/2017 at 11:24, Jogi Hofmüller wrote: > Hi, > > thanks for all your comments so far. > > On Thursday, 13.04.2017 at 16:53 +0200, Lionel Bouton wrote: >> Hi, >> >> On 13/04/2017 at 10:51, Peter Maloney wrote: >>> Ceph snapshots really slow things down. > I can confirm that now :( > >> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any >> measurable impact on performance... until we tried to remove them. We >> usually have at least one snapshot per VM image, often 3 or 4. > This might have been true for hammer and older versions of ceph. From > what I can tell now, every snapshot taken reduces performance of the > entire cluster :( The version isn't the only difference here. We use BTRFS with a custom defragmentation process for the filestores, which is highly uncommon for Ceph users. As I said, Ceph has support for BTRFS CoW, so part of the snapshot handling process is actually handled by BTRFS. Lionel
Re: [ceph-users] dropping filestore+btrfs testing for luminous
On 04/07/2017 at 19:00, Jack wrote: > You may just upgrade to Luminous, then replace filestore by bluestore You don't just "replace" filestore with bluestore on a production cluster: you transition over several weeks/months from one to the other. The two must be rock stable and have predictable performance characteristics to do that. We took more than 6 months with Firefly to migrate from XFS to Btrfs and studied/tuned the cluster along the way. Simply replacing one store with another without any experience of the real-world behavior of the new one is just playing with fire (and a huge heap of customer data). Best regards, Lionel
Re: [ceph-users] dropping filestore+btrfs testing for luminous
On 30/06/2017 at 18:48, Sage Weil wrote: > On Fri, 30 Jun 2017, Lenz Grimmer wrote: >> Hi Sage, >> >> On 06/30/2017 05:21 AM, Sage Weil wrote: >> >>> The easiest thing is to >>> >>> 1/ Stop testing filestore+btrfs for luminous onward. We've recommended >>> against btrfs for a long time and are moving toward bluestore anyway. >> Searching the documentation for "btrfs" does not really give a user any >> clue that the use of Btrfs is discouraged. >> >> Where exactly has this been recommended? >> >> The documentation currently states: >> >> http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=btrfs#osds >> >> "We recommend using the xfs file system or the btrfs file system when >> running mkfs." >> >> http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=btrfs#filesystems >> >> "btrfs is still supported and has a comparatively compelling set of >> features, but be mindful of its stability and support status in your >> Linux distribution." >> >> http://docs.ceph.com/docs/master/start/os-recommendations/?highlight=btrfs#ceph-dependencies >> >> "If you use the btrfs file system with Ceph, we recommend using a recent >> Linux kernel (3.14 or later)." >> >> As an end user, none of these statements would really sound as >> recommendations *against* using Btrfs to me. >> >> I'm therefore concerned about just disabling the tests related to >> filestore on Btrfs while still including and shipping it. This has >> potential to introduce regressions that won't get caught and fixed. > Ah, crap. This is what happens when devs don't read their own > documentation. I recommend against btrfs every time it ever comes up, the > downstream distributions all support only xfs, but yes, it looks like the > docs never got updated... despite the xfs focus being 5ish years old now. 
> > I'll submit a PR to clean this up, but > >>> 2/ Leave btrfs in the mix for jewel, and manually tolerate and filter out >>> the occasional ENOSPC errors we see. (They make the test runs noisy but >>> are pretty easy to identify.) >>> >>> If we don't stop testing filestore on btrfs now, I'm not sure when we >>> would ever be able to stop, and that's pretty clearly not sustainable. >>> Does that seem reasonable? (Pretty please?) >> If you want to get rid of filestore on Btrfs, start a proper deprecation >> process and inform users that support for it it's going to be removed in >> the near future. The documentation must be updated accordingly and it >> must be clearly emphasized in the release notes. >> >> Simply disabling the tests while keeping the code in the distribution is >> setting up users who happen to be using Btrfs for failure. > I don't think we can wait *another* cycle (year) to stop testing this. > > We can, however, > > - prominently feature this in the luminous release notes, and > - require the 'enable experimental unrecoverable data corrupting features = > btrfs' in order to use it, so that users are explicitly opting in to > luminous+btrfs territory. > > The only good(ish) news is that we aren't touching FileStore if we can > help it, so it less likely to regress than other things. And we'll > continue testing filestore+btrfs on jewel for some time. > > Is that good enough? I'm not sure how we will handle the transition. Is bluestore considered stable in Jewel? If so, our current clusters (recently migrated from Firefly to Hammer) will have support for both BTRFS+Filestore and Bluestore when the next upgrade takes place. If Bluestore is only considered stable in Luminous I don't see how we can manage the transition easily. 
The only path I see is to: - migrate to XFS+filestore with Jewel (which will not only take time but will be a regression for us: it will cause performance and sizing problems on at least one of our clusters, and we will lose the silent corruption detection from BTRFS), - then upgrade to Luminous and migrate again to Bluestore. I was not expecting the transition from Btrfs+Filestore to Bluestore to be this convoluted (we were planning to add Bluestore OSDs one at a time and study the performance/stability for months before migrating the whole clusters). Is there any way to restrict your BTRFS tests to at least a given stable configuration? (BTRFS is known to have problems with the high rate of snapshot deletion Ceph generates by default, for example, and we use 'filestore btrfs snap = false'.) Best regards, Lionel
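For reference, the setting mentioned above lives in the OSD section of ceph.conf. A minimal fragment (this reflects this cluster's tuning, not a general recommendation):

```ini
[osd]
# Disable the btrfs-snapshot-based journal consistency mechanism;
# filestore then falls back to write-ahead journaling as on XFS,
# avoiding the high rate of btrfs snapshot creation/deletion.
filestore btrfs snap = false
```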
Re: [ceph-users] HW Raid vs. Multiple OSD
On 13/11/2017 at 15:47, Oscar Segarra wrote: > Thanks Mark, Peter, > > For clarification, the configuration with RAID5 is having many servers > (2 or more) with RAID5 and CEPH on top of it. Ceph will replicate data > between servers. Of course, each server will have just one OSD daemon > managing a big disk. > > It looks like functionally it is the same using RAID5 + 1 Ceph daemon as 8 > CEPH daemons. Functionally it's the same, but RAID5 will kill your write performance. For example if you start with 3 OSD hosts and a pool size of 3, due to RAID5 every write on your Ceph cluster will imply a read of all disks minus one on one server, then a write to *all* the disks of the cluster. If you use one OSD per disk you'll have a read on one disk only and a write on 3 disks only: you'll get approximately 8 times the IOPS for writes (with 8 disks per server). Clever RAID5 logic can minimize this for some I/O patterns, but it is a bet and will never be as good as what you'll get with one disk per OSD. Best regards, Lionel
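The "approximately 8 times" figure can be checked by counting disk operations per client write, under the worst-case assumption in the mail (a naive full-stripe RAID5 update with no write-back cache absorbing it):

```python
# Disk operations per client write: one big RAID5 OSD per host vs one OSD
# per disk, for the mail's example (3 hosts, 8 disks each, pool size 3).
disks_per_host, pool_size = 8, 3

# RAID5: the parity update reads the other disks on one host, then the
# stripe write hits every disk on all 3 replica hosts (worst case).
raid5_ops = (disks_per_host - 1) + disks_per_host * pool_size

# One OSD per disk: one read plus one write per replica.
per_disk_osd_ops = 1 + pool_size

print(raid5_ops, per_disk_osd_ops, raid5_ops / per_disk_osd_ops)
# -> 31 ops vs 4 ops, a ratio of 7.75 -- roughly the "8 times" above
```

Real controllers batch and cache, so the observed ratio varies, but the asymmetry is structural: every extra disk in the RAID5 set makes the per-write cost worse, while the per-disk OSD layout stays constant.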
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
Hi, On 22/02/2018 23:32, Mike Lovell wrote: > hrm. intel has, until a year ago, been very good with ssds. the > description of your experience definitely doesn't inspire confidence. > intel also dropping the entire s3xxx and p3xxx series last year before > having a viable replacement has been driving me nuts. > > i don't know that i have the luxury of being able to return all of the > ones i have or just buying replacements. i'm going to need to at least > try them in production. it'll probably happen with the s4600 limited > to a particular fault domain. these are also going to be filestore > osds so maybe that will result in a different behavior. i'll try to > post updates as i have them. Sorry for digging so deep into the archives. I might be in a situation where I could get S4600s (with filestore initially, but I would very much like them to support Bluestore without bursting into flames). To expand a Ceph cluster and test EPYC in our context we have ordered a server based on a Supermicro EPYC motherboard and SM863a SSDs. For reference: https://www.supermicro.nl/Aplus/motherboard/EPYC7000/H11DSU-iN.cfm Unfortunately I just learned that Supermicro found an incompatibility between this motherboard and SM863a SSDs (I don't have more information yet) and they proposed the S4600 as an alternative. I immediately remembered that there were problems, asked for a delay/more information, and dug out this old thread. Has anyone successfully used Ceph with the S4600? If so, could you share whether you used filestore or bluestore, which firmware was used, and approximately how much data was written on the most used SSDs? Best regards, Lionel
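On the earlier point about watching SMART attributes before SSDs die: a simple linear extrapolation of a countdown-style wear indicator is often enough for alerting. This is a hypothetical sketch — the function name and parameters are mine, and the attribute to read (e.g. Intel's Media_Wearout_Indicator, SMART attribute 233) is vendor-specific, so check your model's documentation:

```python
def days_until_worn_out(wearout_now, wearout_before, days_elapsed, alert_floor=10):
    """Linear extrapolation of an SSD wear indicator that counts down from 100.

    `wearout_before` is the value observed `days_elapsed` days ago. Returns the
    estimated days left before the indicator reaches `alert_floor`; alert well
    before that, since write amplification can accelerate wear unpredictably.
    """
    points_per_day = (wearout_before - wearout_now) / days_elapsed
    if points_per_day <= 0:
        return float('inf')  # no measurable wear over the observation window
    return (wearout_now - alert_floor) / points_per_day

# Example: the indicator dropped from 100 to 94 over 90 days.
print(round(days_until_worn_out(94, 100, 90)))  # -> 1260 (about 3.5 years)
```

Reading the raw values typically means parsing `smartctl -A` output, which requires a controller that passes SMART through, as noted above.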
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
On 31/05/2018 14:41, Simon Ironside wrote: > On 24/05/18 19:21, Lionel Bouton wrote: > >> Unfortunately I just learned that Supermicro found an incompatibility >> between this motherboard and SM863a SSDs (I don't have more information >> yet) and they proposed S4600 as an alternative. I immediately remembered >> that there were problems and asked for a delay/more information and dug >> out this old thread. > > In case it helps you, I'm about to go down the same Supermicro EPYC > and SM863a path as you. I asked about the incompatibility you > mentioned and they knew what I was referring to. The incompatibility > is between the on-board SATA controller and the SM863a and has > apparently already been fixed. That's good news. > Even if not fixed, the incompatibility wouldn't be present if you're > using a RAID controller instead of the on board SATA (which I intend > to - don't know if you were?). I wasn't: we plan to use the 14 on-board SATA connectors. As long as we can, we use a standard SATA/AHCI controller as they cause fewer headaches than RAID controllers, even in HBA mode. Thanks a lot for this information, I've forwarded it to our Supermicro reseller. Best regards, Lionel
Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
On 11/12/2018 at 15:51, Konstantin Shalygin wrote: > >> Currently I plan a migration of a large VM (MS Exchange, 300 Mailboxes >> and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous >> cluster (which already holds lots of images). >> The server has access to both local and cluster storage, I only need >> to live migrate the storage, not the machine. >> >> I have never used live migration as it can cause more issues, and the >> VMs that were already migrated had planned downtime. >> Taking the VM offline and converting/importing using qemu-img would take >> some hours but I would like to still serve clients, even if it is >> slower. >> >> The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with >> BBU). There are two HDDs bound as RAID1 which are constantly under 30% >> - 60% load (this goes up to 100% during reboot, updates or login >> prime-time). >> >> What happens when either the local compute node or the ceph cluster >> fails (degraded)? Or network is unavailable? >> Are all writes performed to both locations? Is this fail-safe? Or does >> the VM crash in the worst case, which can lead to a dirty shutdown for MS-EX >> DBs? >> >> The node currently has 4GB free RAM and 29GB listed as cache / >> available. These numbers need caution because we have "tuned" enabled, >> which causes de-duplication of RAM, and this host runs about 10 Windows >> VMs. >> During reboots or updates, RAM can get full again. >> >> Maybe I am too cautious about live storage migration, maybe I am not. >> >> What are your experiences or advice? >> >> Thank you very much! > > I read your message two times and still can't figure out what your > question is. > > You need to move your block image from some storage to Ceph? No, you > can't do this without downtime because of fs consistency. > > You can easily migrate your filesystem via rsync for example, with a small > downtime to reboot the VM. > I believe the OP is trying to use the storage migration feature of QEMU. 
I've never tried it and I wouldn't recommend it (probably not very well tested, and there is a large window for failure). One tactic that can be used, assuming the OP is using LVM in the VM for storage, is to add a Ceph volume to the VM (probably needs a reboot), add the corresponding virtual disk to the VM's volume group, and then migrate all data from the logical volume(s) to the new disk. LVM uses mirroring internally during the transfer, so you get robustness by using it. It can be slow (especially with old kernels) but at least it is safe. I did a DRBD to Ceph migration with this process 5 years ago. When all logical volumes have been moved to the new disk you can remove the old disk from the volume group. Assuming everything is on LVM including the root filesystem, only moving the boot partition will have to be done outside of LVM. Best regards, Lionel
Re: [ceph-users] Major ceph disaster
On 13/05/2019 at 16:20, Kevin Flöh wrote: > Dear ceph experts, > > [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...] > Here is what happened: One osd daemon could not be started and > therefore we decided to mark the osd as lost and set it up from > scratch. Ceph started recovering and then we lost another osd with the > same behavior. We did the same as for the first osd. With 3+1 you only allow a single OSD failure per pg at any given time. You have 4096 pgs and 96 OSDs; having 2 OSDs fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on any pg using both of those OSDs (the ones not fully recovered before the second failure). Depending on the data stored (CephFS?) you can probably recover most of it, but some of it is irremediably lost. If you can recover the data from the failed OSDs as it was at the time they failed, you might be able to recover some of your lost data (with the help of the Ceph devs); if not, there's nothing to do. In the latter case I'd add a new server, use at least 3+2 for a fresh pool instead of 3+1, and begin moving the data to it. The 12.2 + 13.2 mix is a potential problem in addition to the one above, but it's a different one. Best regards, Lionel
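To get a feel for how many pgs a double failure condemns with this layout, here is an illustrative Monte Carlo sketch. It uses uniform random placement of one shard per host as a crude stand-in for CRUSH (not a real CRUSH simulation), with the pg/OSD counts from the mail:

```python
import random

def pgs_hitting_both(n_pgs, hosts, osds_per_host, failed, rng):
    """Count pgs that placed a shard on both failed OSDs.

    Each pg gets one shard (3 data + 1 parity with a 3+1 profile)
    on a randomly chosen OSD of each host.
    """
    hit = 0
    for _ in range(n_pgs):
        shards = {(host, rng.randrange(osds_per_host)) for host in range(hosts)}
        if failed <= shards:
            hit += 1
    return hit

rng = random.Random(0)
failed = {(0, 0), (1, 0)}   # one lost OSD on each of two different hosts
trials = [pgs_hitting_both(4096, 4, 24, failed, rng) for _ in range(100)]
print(sum(trials) / len(trials))  # close to 4096 / 24**2, i.e. ~7 pgs doomed
```

With m=1 any pg touching both lost OSDs has two missing shards and is unrecoverable; even though only a handful of pgs are affected, each can hold objects from many files, which matches the partial-loss outcome described above. With 3+2 the same double failure would leave every pg recoverable.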