Re: [ceph-users] Issues with Nautilus 14.2.6 ceph-volume lvm batch --bluestore ?

2020-01-20 Thread Janne Johansson
Den mån 20 jan. 2020 kl 09:03 skrev Dave Hall :

> Hello,
>
> Since upgrading to Nautilus (+ Debian 10 Backports), when I issue
> 'ceph-volume lvm batch --bluestore ' it fails with
>
> bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
>
> I previously had Luminous + Debian 9 running on the same hardware with the
> same OSD layout, but I decided to do a full wipe and start over.
>

I seem to recall some instances of "wiping" that only wrote some 100M or so
to the beginning of the old OSD drives, after which LVM or some other layer
still picked up the old data. So make sure the "full wipe" really means either
using clean drives, or dd'ing /dev/zero over a significant part of the disk,
say 3-4G, before handing it back to the new ceph installation.
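
As a minimal sketch (the device name is just an example, adjust to your
setup), something like this before re-using a drive:

  # zero the first few GB so old LVM/partition/OSD metadata cannot resurface
  dd if=/dev/zero of=/dev/sdX bs=1M count=4096 oflag=direct
  # and/or clear any remaining filesystem/LVM signatures
  wipefs --all /dev/sdX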


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Janne Johansson
(sorry for empty mail just before)


> I'm planning to split the block db onto a separate flash device which I
>> also would like to use as an OSD for erasure coding metadata for rbd
>> devices.
>>
>> If I want to use 14x 14TB HDDs per Node
>>
>> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>>
>> recommends a minimum size of 140GB per 14TB HDD.
>>
>> Is there any recommendation for how many OSDs a single flash device can
>> serve? The Optane ones can do 2000MB/s write + 500,000 IOPS.
>>
>
>
I think many ceph admins are more concerned with having many drives
co-using the same DB drive, since if the DB drive fails, it also means all
OSDs are lost at the same time.
Optanes and decent NVMEs are probably capable of handling tons of HDDs, so
that the bottleneck ends up being somewhere else, but the failure scenarios
are a bit scary if the whole host is lost just by that one DB device acting
up.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Janne Johansson
Den mån 13 jan. 2020 kl 08:09 skrev Stefan Priebe - Profihost AG <
s.pri...@profihost.ag>:

> Hello,
>
> I'm planning to split the block db onto a separate flash device which I
> also would like to use as an OSD for erasure coding metadata for rbd
> devices.
>
> If I want to use 14x 14TB HDDs per Node
>
> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>
> recommends a minimum size of 140GB per 14TB HDD.
>
> Is there any recommendation for how many OSDs a single flash device can
> serve? The Optane ones can do 2000MB/s write + 500,000 IOPS.
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for experience

2020-01-09 Thread Janne Johansson
>
>
> I'm currently trying to work out a concept for a ceph cluster which can
> be used as a target for backups and which satisfies the following requirements:
>
> - approx. write speed of 40,000 IOPS and 2500 MByte/s
>

You might need a fairly large number of parallel writers (certainly more than
one) to reach that aggregate rate, as opposed to trying to get there with one
single stream written from one single client.
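
As a rough sketch of how to check this (pool name and numbers are just
examples), run several concurrent clients/threads rather than one:

  # on each of a few client machines, 64 concurrent ops per client
  rados bench -p backuptest 60 write -t 64 --run-name client-$(hostname) --no-cleanup

and compare the summed throughput/IOPS against a single-threaded run.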
-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shall host weight auto reduce on hdd failure?

2019-12-04 Thread Janne Johansson
Den tors 5 dec. 2019 kl 00:28 skrev Milan Kupcevic <
milan_kupce...@harvard.edu>:

>
>
> There is plenty of space to take more than a few failed nodes. But the
> question was about what is going on inside a node with a few failed
> drives. Current Ceph behavior keeps increasing the number of placement
> groups on surviving drives inside the same node. It does not spread them
> across the cluster. So, let's get back to the original question. Shall
> host weight auto reduce on hdd failure, or not?
>

If the OSDs are still in the crush map with non-zero weights, they still add
"value" to the host, and hence the host gets as many PGs as the sum of the
crush weights (i.e., sizes) says it can bear.
If some of the OSDs have an OSD-reweight of zero, they will not take their
share of the burden, but rather let the "surviving" OSDs on the host take
more load, until the cluster decides the broken OSDs are down and out. At
that point the cluster rebalances according to the general algorithm, which
should(*) even things out, letting the OSD hosts with fewer OSDs get fewer
PGs and hence less data.

*) There are reports of Nautilus (only, as far as I remember) having weird
placement ideas that tend to fill up OSDs that already have much data,
leaving it to the ceph admin to force values down in order to
not go over 85% at which point some rebalancing ops will stop.
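
If you want the host's effective weight to drop as soon as a drive dies, a
minimal sketch (the OSD id is just an example) is to take the failed OSD out
of crush yourself instead of waiting for the down/out timer:

  ceph osd out 12
  # or remove its crush weight entirely so the host weight shrinks by that amount
  ceph osd crush reweight osd.12 0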


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs behind Hardware Raid

2019-12-04 Thread Janne Johansson
Den ons 4 dec. 2019 kl 09:57 skrev Marc Roos :

>
> But I guess that in 'ceph osd tree' the ssd's were then also displayed
> as hdd?
>

Probably, and the difference in performance would come from the different
bluestore cache defaults that hdd-class OSDs get compared to ssd-class ones.
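
If the RAID controller hides the SSDs so they get classed as hdd, you can
override the device class by hand; a sketch (the OSD id is an example):

  ceph osd crush rm-device-class osd.12
  ceph osd crush set-device-class ssd osd.12

and/or raise the cache size explicitly, since (with the luminous-era defaults,
if I remember right) bluestore_cache_size_hdd is about 1 GB while
bluestore_cache_size_ssd is about 3 GB.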

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shall host weight auto reduce on hdd failure?

2019-12-04 Thread Janne Johansson
Den ons 4 dec. 2019 kl 01:37 skrev Milan Kupcevic <
milan_kupce...@harvard.edu>:

> This cluster can handle this case at this moment as it has got plenty of
> free space. I wonder how is this going to play out when we get to 90% of
> usage on the whole cluster. A single backplane failure in a node takes
>

You should not run any file storage system to 90% full, ceph or otherwise.

You should set a target for how full it can get before you must add new
hardware, be it more drives or more hosts with drives, and as noted below you
should probably include at least one failed node in this calculation, so that
planned maintenance doesn't become a critical situation. This means that in
terms of raw disk space, the cluster should probably aim for at most 50-60%
usage until it gets large in terms of number of hosts, and up to that point,
aim to have more resources added when it hits 70% or so (perhaps something as
simple as 'start planning expansion at 50%, get delivery before 75%').
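
A small sketch of what that could look like in practice (the threshold values
are examples, not recommendations):

  # watch overall and per-OSD usage regularly
  ceph df
  ceph osd df
  # optionally warn earlier than the default 85% nearfull level
  ceph osd set-nearfull-ratio 0.70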

Hopefully, when building storage clusters, the raw disk space should be one
of the cheaper resources to expand, compared to network, power, rack space,
admin time/salaries and all that.


> four drives out at once; that is 30% of storage space on a node. The
> whole cluster would have enough space to host the failed placement
> groups but one node would not.
> 



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impact of a small DB size with Bluestore

2019-11-26 Thread Janne Johansson
It's mentioned here among other places
https://books.google.se/books?id=vuiLDwAAQBAJ=PA79=PA79=rocksdb+sizes+3+30+300+g=bl=TlH4GR0E8P=ACfU3U0QOJQZ05POZL9DQFBVwTapML81Ew=en=X=2ahUKEwiPscq57YfmAhVkwosKHY1bB1YQ6AEwAnoECAoQAQ#v=onepage=rocksdb%20sizes%203%2030%20300%20g=false

The 4% was a quick ballpark figure someone came up with to give early
adopters a decent start, but later analysis of the RocksDB level sizes
(L0/L1/L2) has shown that 3, 30 and 300 GB are the "optimal" sizes that do
not waste SSD space which will never be used.
You can set 240, but it will not be better than 30. It will be better than
24, so "not super bad, but not optimal".


Den tis 26 nov. 2019 kl 12:18 skrev Vincent Godin :

> The documentation tells you to size the DB to 4% of the disk data, i.e. 240GB
> for a 6 TB disk. Please give more explanation when your answer disagrees
> with the documentation!
>
> Le lun. 25 nov. 2019 à 11:00, Konstantin Shalygin  a
> écrit :
> >
> > I have an Ceph cluster which was designed for file store. Each host
> > have 5 SSDs write intensive of 400GB and 20 HDD of 6TB. So each HDD
> > have a WAL of 5 GB on SSD
> > If i want to put Bluestore on this cluster, i can only allocate ~75GB
> > of WAL and DB on SSD for each HDD which is far below the 4% limit of
> > 240GB (for 6TB)
> > In the doc, i read "It is recommended that the block.db size isn’t
> > smaller than 4% of block. For example, if the block size is 1TB, then
> > block.db shouldn’t be less than 40GB."
> > Are the 4% mandatory ? What should i expect ? Only relative slow
> > performance or problem with such a configuration ?
> >
> > You should use not more 1Gb for WAL and 30Gb for RocksDB. Numbers !
> 3,30,300 (Gb) for block.db is useless.
> >
> >
> >
> > k
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating from block to lvm

2019-11-15 Thread Janne Johansson
Den fre 15 nov. 2019 kl 19:40 skrev Mike Cave :

> So would you recommend doing an entire node at the same time or per-osd?
>

You should be able to do it per-OSD (or per-disk in case you run more than
one OSD per disk), to minimize data movement over the network, letting the
other OSDs on the same host take a bit of the load while re-making the disks
one by one. You can use "ceph osd reweight {osd-id} 0.0" to make the
particular OSD release its data but still claim its crush weight on the host,
meaning the other disks there will have to take over its data, more or less.
Moving data between disks in the same host usually goes a lot faster than
over the network to other hosts.
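
A per-OSD sketch of the whole cycle (ids and device names are examples):

  ceph osd reweight 12 0    # drain PGs off osd.12, host keeps its crush weight
  # ... wait for rebalancing, then:
  while ! ceph osd safe-to-destroy osd.12; do sleep 60; done
  systemctl stop ceph-osd@12
  ceph osd destroy 12 --yes-i-really-mean-it
  ceph-volume lvm zap /dev/sdX --destroy
  ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 12
  ceph osd reweight 12 1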

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange CEPH_ARGS problems

2019-11-15 Thread Janne Johansson
Is the flip between the client name "rz" and "user" also a mistype? It's
hard to tell whether it is intentional, since the two are used interchangeably
in your examples.
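
For reference, a variant that should work (names taken from your own examples)
keeps the user name and keyring consistent and uses a single "=":

  export CEPH_ARGS="-n client.user --keyring=/etc/ceph/ceph.client.user.keyring"
  rbd ls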


Den fre 15 nov. 2019 kl 10:57 skrev Rainer Krienke :

> I found a typo in my post:
>
> Of course I tried
>
> export CEPH_ARGS="-n client.rz --keyring="
>
> and not
>
> export CEPH_ARGS=="-n client.rz --keyring="
>
> Thanks
> Rainer
>
> Am 15.11.19 um 07:46 schrieb Rainer Krienke:
> > Hello,
> >
> > I try to use CEPH_ARGS in order to use eg rbd with a non client.admin
> > user and keyring without extra parameters. On a ceph-client with Ubuntu
> > 18.04.3 I get this:
> >
> > # unset CEPH_ARGS
> > # rbd --name=client.user --keyring=/etc/ceph/ceph.client.user.keyring ls
> > a
> > b
> > c
> >
> > # export CEPH_ARGS=="-n client.rz --keyring=/etc
> > /ceph/ceph.client.user.keyring"
> > # rbd ls
> > 
> > rbd: couldn't connect to the cluster!
> > rbd: listing images failed: (22) Invalid argument
> >
> > # export CEPH_ARGS=="--keyring=/etc/ceph/ceph.client.user.keyring"
> > # rbd -n client.user ls
> > a
> > b
> > c
> >
> > Is this the desired behavior? I would like to set both user name and
> > keyring to be used, so that I can run rbd without any parameters.
> >
> > How do you do this?
> >
> > Thanks
> > Rainer
> >
>
>
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-05 Thread Janne Johansson
Den tis 5 nov. 2019 kl 19:10 skrev J David :

> On Tue, Nov 5, 2019 at 3:18 AM Paul Emmerich 
> wrote:
> > could be a new feature, I've only realized this exists/works since
> Nautilus.
> > You seem to be a relatively old version since you still have ceph-disk
> installed
>
> The next approach may be to just try to stop udev while ceph-volume
> lvm zap is running.
>
>
I seem to recall some ticket where zap would "only" clear 100M of the
drive, but LVM and all partition info needed more to be cleared, so using
dd if=/dev/zero of={device} bs=1M count=1024 (or more!) would be needed to
make sure no part of the OS picks up anything from the previous contents.
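
A sketch combining both ideas (the device name is an example; treat this as a
starting point rather than a recipe):

  udevadm control --stop-exec-queue         # keep udev from re-activating LVM mid-wipe
  ceph-volume lvm zap /dev/sdX --destroy     # --destroy also removes the LVs/VG/PV
  dd if=/dev/zero of=/dev/sdX bs=1M count=1024 oflag=direct
  udevadm control --start-exec-queue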

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore runs out of space and dies

2019-10-31 Thread Janne Johansson
Den tors 31 okt. 2019 kl 15:07 skrev George Shuklin <
george.shuk...@gmail.com>:

> Thank you everyone, I got it. There is no way to fix out-of-space
> bluestore without expanding it.
>
> Therefore, in production we would stick with 99%FREE size for LV, as it
> gives operators 'last chance' to repair the cluster in case of
> emergency. It's a bit unfortunate that we need to give up the whole per
> cent (1 % is too much for 4Tb drives).
>

In production, things will start giving warnings at 85%, so you should just
not get into the kind of situation where the last percent matters or not.
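
Those thresholds are visible (and adjustable) in the osdmap, e.g.:

  ceph osd dump | grep ratio
  # full_ratio 0.95, backfillfull_ratio 0.9, nearfull_ratio 0.85 are the usual defaults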


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph pg in inactive state

2019-10-31 Thread Janne Johansson
Den tors 31 okt. 2019 kl 04:22 skrev soumya tr :

> Thanks 潘东元 for the response.
>
> The creation of a new pool works, and all the PGs corresponding to that
> pool have active+clean state.
>
> When I initially set ceph 3 node cluster using juju charms (replication
> count per object was set to 3), there were issues with ceph-osd services.
> So I had to delete the units and readd them (I did all of them together,
> which must have created issues with rebalancing). I assume that the PGs in
> the inactive state points to the 3 old OSDs which were deleted.
>
> I assume I will have to create all the pools again. But my concern is
> about the default pools.
>
> ---
> pool 1 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule
> 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 15 flags hashpspool
> stripe_width 0 application rgw
> pool 2 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 19 flags hashpspool
> stripe_width 0 application rgw
> pool 3 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 23 flags hashpspool
> stripe_width 0 application rgw
> pool 4 'default.rgw.gc' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 27 flags hashpspool
> stripe_width 0 application rgw
> pool 5 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 31 flags hashpspool
> stripe_width 0 application rgw
> pool 6 'default.rgw.intent-log' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 35 flags hashpspool
> stripe_width 0 application rgw
> pool 7 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 39 flags hashpspool
> stripe_width 0 application rgw
> pool 8 'default.rgw.usage' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 43 flags hashpspool
> stripe_width 0 application rgw
> pool 9 'default.rgw.users.keys' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 47 flags hashpspool
> stripe_width 0 application rgw
> pool 10 'default.rgw.users.email' replicated size 3 min_size 2 crush_rule
> 0 object_hash rjenkins pg_num 2 pgp_num 2 last_change 51 flags hashpspool
> stripe_width 0 application rgw
> pool 11 'default.rgw.users.swift' replicated size 3 min_size 2 crush_rule
> 0 object_hash rjenkins pg_num 2 pgp_num 2 last_change 55 flags hashpspool
> stripe_width 0 application rgw
> pool 12 'default.rgw.users.uid' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 2 pgp_num 2 last_change 59 flags hashpspool
> stripe_width 0 application rgw
> pool 13 'default.rgw.buckets.extra' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 2 pgp_num 2 last_change 63 flags
> hashpspool stripe_width 0 application rgw
> pool 14 'default.rgw.buckets.index' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 4 pgp_num 4 last_change 67 flags
> hashpspool stripe_width 0 application rgw
> ---
>
> Can you please update if recreating them using rados cli will break
> anything?
>
>
Those pools belong to radosgw, and if they are missing, they will be
created on demand the next time radosgw starts up.
The "default" in the names is the name of the radosgw zone, which defaults
to... "default". They are not needed by any other part of ceph.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?

2019-10-24 Thread Janne Johansson
(Slightly abbreviated)

Den tors 24 okt. 2019 kl 09:24 skrev Frank Schilder :

>  What I learned are the following:
>
> 1) Avoid this work-around too few hosts for EC rule at all cost.
>
> 2) Do not use EC 2+1. It does not offer anything interesting for
> production. Use 4+2 (or 8+2, 8+3 if you have the hosts).
>
> 3) If you have no perspective of getting at least 7 servers in the long
> run (4+2=6 for EC profile, +1 for fail-over automatic rebuild), do not go
> for EC.
>
> 4) Before you start thinking about replicating to a second site, you
> should have a primary site running solid first.
>
> This is collected from my experience. I would do things different now and
> maybe it helps you with deciding how to proceed. Its basically about what
> resources can you expect in the foreseeable future and what compromises are
> you willing to make with regards to sleep and sanity.
>

Amen to all of those points. We made similar-but-not-identical mistakes on an
EC cluster here. You are going to produce more tears than I/O if you make
the mis-designs mentioned above.
We could add:

5) Never buy SMR drives, pretend they don't even exist. If a similar
technology appears tomorrow for cheap SSD/NVME, skip it.
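
For reference, a 4+2 profile with host as the failure domain, as discussed
above, might look like this (profile and pool names are examples):

  ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
  ceph osd pool create ecpool 128 128 erasure ec42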

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster network down

2019-09-30 Thread Janne Johansson
>
> I don't remember where I read it, but it was told that the cluster is
> migrating its complete traffic over to the public network when the cluster
> networks goes down. So this seems not to be the case?
>

Be careful with generalizations like "when a network acts up, it will be
completely down and noticeably unreachable for all parts", since networks
can break in thousands of not-very-obvious ways which are not 0%-vs-100%
but somewhere in between.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm create leaves half-built OSDs lying around

2019-09-11 Thread Janne Johansson
Den ons 11 sep. 2019 kl 12:18 skrev Matthew Vernon :

> We keep finding part-made OSDs (they appear not attached to any host,
> and down and out; but still counting towards the number of OSDs); we
> never saw this with ceph-disk. On investigation, this is because
> ceph-volume lvm create makes the OSD (ID and auth at least) too early in
> the process and is then unable to roll-back cleanly (because the
> bootstrap-osd credential isn't allowed to remove OSDs).
>
>
---8<


> This is annoying to have to clear up, and it seems to me could be
> avoided by either:
>
> i) ceph-volume should (attempt to) set up the LVM volumes  before
> making the new OSD id
> or
> ii) allow the bootstrap-osd credential to purge OSDs
>
> i) seems like clearly the better answer...?
>

This happens to me too at times. Even a simple
iii) print "Run 'ceph osd purge XYZ'" for my cut-n-paste convenience
would be an improvement over the current situation. Though it might be better
to have some kind of state that tells the cluster whether an OSD has ever run
for even the slightest amount of time, and if not, allow the bootstrap-osd key
to delete such a never-really-seen OSD id from all the relevant places it
might appear in when the disk setup fails you.
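
Until then, the manual cleanup of such a half-made OSD is something like
(the id is an example):

  ceph osd purge 17 --yes-i-really-mean-it   # removes the osd entry, crush item and cephx key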

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-15 Thread Janne Johansson
Den tors 15 aug. 2019 kl 00:16 skrev Anthony D'Atri :

> Good points in both posts, but I think there’s still some unclarity.
>

...


> We’ve seen good explanations on the list of why only specific DB sizes,
> say 30GB, are actually used _for the DB_.
> If the WAL goes along with the DB, shouldn’t we also explicitly determine
> an appropriate size N for the WAL, and make the partition (30+N) GB?
> If so, how do we derive N?  Or is it a constant?
>
> Filestore was so much simpler, 10GB set+forget for the journal.  Not that
> I miss XFS, mind you.
>

But we did get a simple handwaving best-effort guesstimate that went "a WAL
of 1GB is fine, yes", so there you have an N you can use for the 30+N or 60+N
sizings. I can't see how that N needs more science behind it than the
filestore N=10G you mentioned. Not that I think journal=10G was wrong or
anything.
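
In practice that could look something like this (VG/LV and device names are
examples); if only --block.db is given, the WAL ends up on the same device
anyway:

  lvcreate -L 31G -n db-sdc vg-nvme      # 30G DB + ~1G WAL headroom
  ceph-volume lvm create --bluestore --data /dev/sdc --block.db vg-nvme/db-sdc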

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange backfill delay after outing one node

2019-08-14 Thread Janne Johansson
Den ons 14 aug. 2019 kl 09:49 skrev Simon Oosthoek :

> Hi all,
>
> Yesterday I marked out all the osds on one node in our new cluster to
> reconfigure them with WAL/DB on their NVMe devices, but it is taking
> ages to rebalance.
>




> > ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
> > ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
> Since the cluster is currently hardly loaded, backfilling can take up
> all the unused bandwidth as far as I'm concerned...
> Is it a good idea to give the above commands or other commands to speed
> up the backfilling? (e.g. like increasing "osd max backfills")
>
>
OSD max backfills is going to have a very large effect on recovery time, so
that would be the obvious knob to twist first. Check what it defaults to now,
raise it to 4, 8, 12, 16 in steps and check that it doesn't slow rebalancing
down too much.
Spinning drives without any ssd/nvme journal/wal/db should perhaps have 1 or 2
at most; ssds can take more than that, and nvme even more, before diminishing
gains occur.
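
Checking the running values before and after the injectargs calls above can
be done on an OSD host, e.g.:

  ceph daemon osd.0 config get osd_max_backfills
  ceph daemon osd.0 config get osd_recovery_max_active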

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High memory usage OSD with BlueStore

2019-08-01 Thread Janne Johansson
Den tors 1 aug. 2019 kl 11:31 skrev dannyyang(杨耿丹) :

> H all:
>
> we have a cephfs env,ceph version is 12.2.10,server in arm,but fuse clients 
> are x86,
> osd disk size is 8T,some osd use 12GB memory,is that normal?
>
>
For bluestore, there are certain tuneables you can use to limit memory a
bit. For filestore it would not be "normal" but very much possible in
recovery scenarios for memory to shoot up like that.
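
For bluestore on that release, the knobs I'm thinking of are along these
lines (the values are examples, not recommendations):

  [osd]
  # luminous-era bluestore cache size for HDD-backed OSDs (bytes)
  bluestore_cache_size_hdd = 1073741824
  # on newer releases you would rather set an overall memory budget per OSD:
  # osd_memory_target = 4294967296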

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-08-01 Thread Janne Johansson
Den tors 1 aug. 2019 kl 07:31 skrev Muhammad Junaid :

> Your email has cleared many things to me. Let me repeat my understanding.
> Every Critical data (Like Oracle/Any Other DB) writes will be done with
> sync, fsync flags, meaning they will be only confirmed to DB/APP after it
> is actually written to Hard drives/OSD's. Any other application can do it
> also.
> All other writes, like OS logs etc will be confirmed immediately to
> app/user but later on written  passing through kernel, RBD Cache, Physical
> drive Cache (If any)  and then to disks. These are susceptible to
> power-failure-loss but overall things are recoverable/non-critical.
>

That last part is probably simplified a bit. Between a program in a guest
sending its data to the virtualised device, running in a KVM guest on top of
an OS that has remote storage over the network, to a storage server with its
own OS and drive controller chip, and lastly the physical drive(s) that store
the write, I suspect there are something like ~10 layers where write caching
is possible, of which the RBD cache you were asking about is just one.

It is just located very conveniently before the I/O has to leave the KVM host
and go back and forth over the network, so it is the last place where you can
see huge gains in the guest's I/O response time. At the same time it can be
shared between lots of guests on the KVM host, which should have tons of RAM
available compared to any single guest, so it is a nice way to get a large
cache for outgoing writes.
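
The librbd cache behaviour itself is controlled on the client (KVM host)
side; a sketch of the relevant ceph.conf settings (values are examples):

  [client]
  rbd cache = true
  rbd cache writethrough until flush = true   # stay safe until the guest issues a flush
  rbd cache size = 67108864                   # 64 MB per image
  rbd cache max dirty = 50331648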

Also, to answer your first part: yes, all critical software that depends
heavily on write ordering and integrity is hopefully already doing write
operations that way, asking for sync(), fsync() or fdatasync() and similar
calls, but I can't produce a list of all programs that do. Since there
already are many layers of delayed, cached writes even without
virtualisation and/or ceph, applications that are important have mostly
learned their lessons by now, so chances are very high that all your
important databases and similar programs are doing the right thing.

But if the guest is instead running a mail filter that does antivirus
checks, spam checks and so on, operating on files that live on the machine
for something like one second, and then either get dropped or sent to the
destination mailbox somewhere else, then having aggressive write caches
would be very useful, since the effects of a crash would still mostly mean
"the emails that were in the queue were lost, not acked by the final
mailserver and will probably be resent by the previous smtp server". For
such a guest VM, forcing sync writes would only be a net loss, it would
gain much by having large ram write caches.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-07-31 Thread Janne Johansson
Den ons 31 juli 2019 kl 06:55 skrev Muhammad Junaid :

> The question is about RBD Cache in write-back mode using KVM/libvirt. If
> we enable this, it uses local KVM Host's RAM as cache for VM's write
> requests. And KVM Host immediately responds to VM's OS that data has been
> written to Disk (Actually it is still not on OSD's yet). Then how can be it
> power failure safe?
>
It is not. Nothing is power-failure safe. However you design things, you
will always have the possibility of some long write being almost done when
the power goes away, and that write (and perhaps others in caches) will be
lost. Different filesystems handle such losses well or badly, and databases
running on those filesystems will have their I/O interrupted and not
acknowledged, which may be not-so-bad or very bad.

The write-caches you have in the KVM guest and this KVM RBD cache will make
the guest I/Os faster, at the expense of higher risk of losing data in a
power outage, but there will be some kind of roll-back, some kind of
fsck/trash somewhere to clean up if a KVM host dies with guests running.
In 99% of the cases, this is ok, the only lost things are "last lines to
this log file" or "those 3 temp files for the program that ran" and in the
last 1% you need to pull out your backups like when the physical disks die.

If someone promises "power failure safe" then I think they are misguided.
Chances may be super small for bad things to happen, but it will just never
be 0%. Also, I think the guest VMs have the possibility to ask kvm-rbd to
flush data out, and as such take the "pain" of waiting for real completion
only when it is actually needed, so that other operations can go fast (and
slightly less safely) while the I/O that needs harder guarantees can call for
flushing and know when the data actually is on disk for real.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems understanding 'ceph-features' output

2019-07-30 Thread Janne Johansson
Den tis 30 juli 2019 kl 10:33 skrev Massimo Sgaravatto <
massimo.sgarava...@gmail.com>:

> The documentation that I have seen says that the minimum requirements for
> clients to use upmap are:
> - CentOs 7.5 or kernel 4.5
> - Luminous version
> E.g. right now I am interested about 0x1ffddff8eea4fffb. Is this also good
> enough for upmap ?
>
>
Someone should make a webpage where you can enter that hex-string and get a
list back.
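
Until such a page exists, the cluster itself can do some of the decoding for
you: "ceph features" lists the connected clients grouped by release, which is
usually enough to decide whether upmap is safe to require, e.g.:

  ceph features
  # if everything shows up as luminous or newer:
  ceph osd set-require-min-compat-client luminous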

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Janne Johansson
Den tors 25 juli 2019 kl 10:47 skrev 展荣臻(信泰) :

>
> 1. Adding OSDs in the same failure domain is to ensure that only one PG in
> the pg up set (as ceph pg dump shows) needs to remap.
> 2. Setting "osd_pool_default_min_size=1" is to ensure objects can be
> read/written uninterruptedly while PGs remap.
> Is this wrong?
>

How did you read the first email where he described how 3 copies was not
enough, wanting to perhaps go to 4 copies
to make sure he is not putting data at risk?

The effect you describe is technically correct, it will allow writes to
pass, but it would also go 100% against what ceph tries to do here, retain
the data even while doing planned maintenance, even while getting
unexpected downtime.

Setting min_size=1 means you don't care at all for your data, and that you
will be placing it under extreme risks.

Not only will that single copy be a danger, but you can easily get into a
situation where your single-copy write gets accepted and then that drive
gets destroyed. The cluster will know the latest writes ended up on it,
and even getting the two older copies back will not help, since it has
already registered that somewhere there is a newer version. For a single
object, reverting to an older version (if possible) isn't all that bad, but
for a section in the middle of a VM drive, that could mean total disaster.

There are lots of people losing data with 1 copy, lots of posts on how
repl_size=2, min_size=1 lost data for people using ceph, so I think posting
advice to that effect goes against what ceph is good for.

Not that I think the original poster would fall into that trap, but others
might find this post later and think that it would be a good solution to
maximize risk while adding/rebuilding 100s of OSDs. I don't agree.


> Den tors 25 juli 2019 kl 04:36 skrev zhanrzh...@teamsun.com.cn <
> zhanrzh...@teamsun.com.cn>:
>
>> I think it should to set "osd_pool_default_min_size=1" before you add osd
>> ,
>> and the osd that you add  at a time  should in same Failure domain.
>>
>
> That sounds like weird or even bad advice?
> What is the motivation behind it?
>
>
-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Janne Johansson
Den tors 25 juli 2019 kl 04:36 skrev zhanrzh...@teamsun.com.cn <
zhanrzh...@teamsun.com.cn>:

> I think it should to set "osd_pool_default_min_size=1" before you add osd ,
> and the osd that you add  at a time  should in same Failure domain.
>

That sounds like weird or even bad advice?
What is the motivation behind it?

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-25 Thread Janne Johansson
Den ons 24 juli 2019 kl 21:48 skrev Wido den Hollander :

> Right now I'm just trying to find a clever solution to this. It's a 2k
> OSD cluster and the likelihood of an host or OSD crashing is reasonable
> while you are performing maintenance on a different host.
>
> All kinds of things have crossed my mind where using size=4 is one of them.
>

The slow and boring solution would be to empty the host first. 8-(
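
i.e. mark everything on that host out and let the data drain off before the
maintenance window; a sketch (the hostname is an example):

  for id in $(ceph osd ls-tree myhost01); do ceph osd out $id; done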

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Future of Filestore?

2019-07-19 Thread Janne Johansson
Den fre 19 juli 2019 kl 12:43 skrev Marc Roos :

>
> Maybe a bit off topic, just curious: what speeds did you get previously?
> Depending on how you test your native 5400rpm drive, the performance
> could be similar. 4k random read on my 7200rpm/5400rpm drives results in
> ~60 iops at 260kB/s.
> I also wonder why filestore could be that much faster, is this not
> something else? Maybe some dangerous caching method was on?
>

Then again, filestore will use the OS filesystem caches normally, which
bluestore will not, so unless you tune your bluestore OSDs carefully it will
be far easier to get at least read caches working in your favor with
filestore, if you have RAM to spare on your OSD hosts.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What if etcd is lost

2019-07-16 Thread Janne Johansson
Den tis 16 juli 2019 kl 18:15 skrev Oscar Segarra :

> Hi Paul,
> That is the initial question, is it possible to recover my ceph cluster
> (docker based) if I loose all information stored in the etcd...
> I don't know if anyone has a clear answer to these questions..
> 1.- I bootstrap a complete cluster mons, osds, mgr, mds, nfs, etc using
> etcd as a key store
> 2.- There is an electric blackout and all nodes of my cluster goes down
> and all data in my etcd is lost (but muy osd disks have useful data)
>

Not having run the container versions myself made me reply wrongly last
time, but I think it is still true that a running ceph will not store
anything useful in etcd; the info stuffed there by your deployment scripts is
just _a_ way to get keys out to newly made containers while deploying, not
the _only_ way.

Also, a running ceph cluster will (probably!) not read anything useful out
of etcd later on, since it keeps its own databases for all keys. Hence, the
specific question about "I have ceph running, it was deployed using scripts
and etcd for transport of keys, then it bombs out and for reason X, all
etcd files are gone but rest is still there, then it boots up" should
probably give you a working ceph cluster back, assuming no other part (like
network configs, firewall rules or so) needs etcd for its own purposes. If
the OS is gone, or all of /etc is gone or something like that, then the
machine will not boot and ceph will not run but that is kind of obvious.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What if etcd is lost

2019-07-16 Thread Janne Johansson
Nah, it was me not running container versions.

The bootstrap keys are used to get daemons up without giving out too much
admin access to them, so my guess is that as soon as you have a cluster
going, you can always read out or create new bootstrap keys if needed
later, but they are not necessary for getting a crashed cluster up, rather
to expand or initially create nodes.
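
If you ever do need them again, the bootstrap keys can be re-read or
recreated from a healthy cluster; a sketch:

  ceph auth get client.bootstrap-osd
  # or, if it is gone, recreate it with the standard profile:
  ceph auth get-or-create client.bootstrap-osd mon 'allow profile bootstrap-osd' \
      -o /var/lib/ceph/bootstrap-osd/ceph.keyring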

Den tis 16 juli 2019 kl 17:58 skrev Oscar Segarra :

> Thanks a lot Janne,
>
> Well, maybe I'm missunderstanding how ceph stores keyrings in etcd...
>
>
> https://github.com/ceph/ceph-container/blob/master/src/daemon/config.kv.etcd.sh
>
> bootstrap="bootstrap${array[1]}Keyring"
> etcdctl "${ETCDCTL_OPTS[@]}" "${KV_TLS[@]}" set "${CLUSTER_PATH}"/"
> ${bootstrap}" < "$keyring"
> But I'd like to know what happens if etcd loses the keyrings sotred in it
> when etcd is used to deploy ceph daemons as containers:
>
> https://hub.docker.com/r/ceph/daemon/
>
> With KV backend:
>
> docker run -d --net=host \
> --privileged=true \
> --pid=host \
> -v /dev/:/dev/ \
> -e OSD_DEVICE=/dev/vdd \
> -e KV_TYPE=etcd \
> -e KV_IP=192.168.0.20 \
> ceph/daemon osd
>
> Thanks a lot for your help,
>
> Óscar
>
>
>
>
> El mar., 16 jul. 2019 17:34, Janne Johansson 
> escribió:
>
>> Den mån 15 juli 2019 kl 23:05 skrev Oscar Segarra <
>> oscar.sega...@gmail.com>:
>>
>>> Hi Frank,
>>> Thanks a lot for your quick response.
>>> Yes, the use case that concerns me is the following:
>>> 1.- I bootstrap a complete cluster mons, osds, mgr, mds, nfs, etc using
>>> etcd as a key store
>>>
>>
>> as a key store ... for what? Are you stuffing anything ceph-related
>> there? If so, please tell us what.
>>
>> As previously said, ceph has no etcd concept so unless you somehow pull
>> stuff out of ceph and feed it into etcd, ceph will be completely careless
>> if you lose etcd data.
>>
>> --
>> May the most significant bit of your life be positive.
>>
>

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What if etcd is lost

2019-07-16 Thread Janne Johansson
Den mån 15 juli 2019 kl 23:05 skrev Oscar Segarra :

> Hi Frank,
> Thanks a lot for your quick response.
> Yes, the use case that concerns me is the following:
> 1.- I bootstrap a complete cluster mons, osds, mgr, mds, nfs, etc using
> etcd as a key store
>

as a key store ... for what? Are you stuffing anything ceph-related there?
If so, please tell us what.

As previously said, ceph has no etcd concept so unless you somehow pull
stuff out of ceph and feed it into etcd, ceph will be completely careless
if you lose etcd data.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pools limit

2019-07-16 Thread Janne Johansson
Den tis 16 juli 2019 kl 16:16 skrev M Ranga Swami Reddy <
swamire...@gmail.com>:

> Hello - I have created 10 nodes ceph cluster with 14.x version. Can you
> please confirm below:
> Q1 - Can I create 100+ pool (or more) on the cluster? (the reason is -
> creating a pool per project). Any limitation on pool creation?
>
> Q2 - In the above pool - I use 128 PG-NUM - to start with and enable
> autoscale for PG_NUM, so that based on the data in the pool, PG_NUM will
> increase by ceph itself.
>
>
12800 PGs in total might be a bit much, depending on how many OSDs you have
in total for these pools. OSDs aim for something like ~100 PGs per OSD at
most, so for 12800 PGs in total, times 3 for replication=3 makes it
necessary to have quite many OSDs per host. I guess the autoscaler might be
working downwards for your pools instead of upwards. There is nothing wrong
with starting with PG_NUM 8 or so, and have autoscaler increase the pools
that actually do get a lot of data.

100 pools * repl = 3 * pg_num 8 => 2400 PGs, which is fine for 24 OSDs but
would need more OSDs as some of those pools grow in data/objects.

100 * 3 * 128 => 38400 PGs, which requires 384 OSDs, or close to 40 OSDs
per host in your setup. That might become a limiting factor in itself,
sticking so many OSDs in a single box.
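
A sketch of the "start small and let the autoscaler grow it" approach on
Nautilus (pool name and numbers are examples):

  ceph mgr module enable pg_autoscaler
  ceph osd pool create projectA 8 8 replicated
  ceph osd pool set projectA pg_autoscale_mode on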

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD's won't start - thread abort

2019-07-03 Thread Janne Johansson
Den ons 3 juli 2019 kl 20:51 skrev Austin Workman :

>
> But a very strange number shows up in the active sections of the pg's
> that's the same number roughly as 2147483648.  This seems very odd,
> and maybe the value got lodged somewhere it doesn't belong which is causing
> an issue.
>
>
That number is how "-1" (or something similar) in a signed 32-bit int shows
up, meaning "I don't know which OSD it was anymore", which you can get in PG
listings when OSDs are gone.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests due to scrubbing of very small pg

2019-07-03 Thread Janne Johansson
Den ons 3 juli 2019 kl 09:01 skrev Luk :

> Hello,
>
> I have strange problem with scrubbing.
>
> When  scrubbing starts on PG which belong to default.rgw.buckets.index
> pool,  I  can  see that this OSD is very busy (see attachment), and starts
> showing many
> slow  request,  after  the  scrubbing  of this PG stops, slow requests
> stops immediately.
>
> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# du -sh 20.2_*
> 636K20.2_head
> 0   20.2_TEMP
> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# ls -1 -R 20.2_head |
> wc -l
> 4125
> [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#
>

I think that pool might contain tons of metadata (omap) entries and other
non-obvious data that takes a while to process.
But I see similar things on my rgw pools during scrubs too. :-(
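
If you want to see how big those index objects really are (the directory on
disk says almost nothing, since bucket indexes live in omap), a sketch:

  for obj in $(rados -p default.rgw.buckets.index ls); do
    echo "$obj: $(rados -p default.rgw.buckets.index listomapkeys "$obj" | wc -l) keys"
  done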

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-03 Thread Janne Johansson
Den ons 3 juli 2019 kl 05:41 skrev Bryan Henderson :

> I may need to modify the above, though, now that I know how Ceph works,
> because I've seen storage server products that use Ceph inside.  However,
> I'll
> bet the people who buy those are not aware that it's designed never to go
> down
> and if something breaks while the system is coming up, a repair action may
> be
> necessary before data is accessible again.
>

I think you would be hard pressed to find any storage cluster that could
never get into a situation where repair is needed before coming up again,
given all the random events that might occur while a non-trivial number of
members suffer sudden power outages.

I appreciate that you had a bad experience, but don't believe all the others
will gracefully and automagically handle any kind of disturbance without
issues, when parts of the cluster come up at different times and have their
member disks checked at different speeds before being allowed in again.

Not saying ceph is perfect, but work long enough in the storage sector and
you'll see all
kinds of odd surprises, and when total power loss happens, vendors are
quite likely to shrug
it off just like the replies you got here, in a "well don't get more
outages" fashion.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding - FPGA / Hardware Acceleration

2019-06-14 Thread Janne Johansson
Den fre 14 juni 2019 kl 15:47 skrev Sean Redmond :

> Hi James,
> Thanks for your comments.
> I think the CPU burn is more of a concern to soft iron here as they are
> using low power ARM64 CPU's to keep the power draw low compared to using
> Intel CPU's where like you say the problem maybe less of a concern.
>
> Using less power by using ARM64 and providing EC using an FPGA does sound
> interesting as I often run into power constrains when deploying. I am just
> concerned that this FPGA functionally seems limited to a single vendor, who
> I assume is packing their own EC plugin to get this to work (hopefully a
> soft iron employee can explain to us how that is implemented). as I like
> the flexibility we have with ceph to change or use multiple vendor over time
>
>
I get slightly the same vibes as from the old "crypto accelerator cards" that
were once sold to boost your single/3DES performance, and which could show
very impressive numbers of encrypted bytes/s and whatnot. But in those cases,
if you did ipsec for instance, unless you could also checksum and re-package
the packets on the card, it became almost a net loss: take packets in, ship
them over the buses to the accelerator card and back, then loop over the
packets once more for checksums after decrypt, and quite possibly copy them
once again in memory before sending the decapsulated packets onwards from
your ipsec box. So even if the crypto takes zero time, it becomes a rather
small part of the whole operation; and of course these days a packet fits
quite well inside the L1 caches of almost all CPUs, so looping over it once
for decrypt and once for checksum isn't going to take much more time than
doing it only once plus two trips over the buses.

I get that this design is slightly different, but the vibes are there.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding - FPGA / Hardware Acceleration

2019-06-14 Thread Janne Johansson
Den fre 14 juni 2019 kl 13:58 skrev Sean Redmond :

> Hi Ceph-Uers,
> I noticed that Soft Iron now have hardware acceleration for Erasure
> Coding[1], this is interesting as the CPU overhead can be a problem in
> addition to the extra disk I/O required for EC pools.
> Does anyone know if any other work is ongoing to support generic FPGA
> Hardware Acceleration for EC pools, or if this is just a vendor specific
> feature.
>
> [1]
> https://www.theregister.co.uk/2019/05/20/softiron_unleashes_accepherator_an_erasure_coding_accelerator_for_ceph/
>

Are there numbers anywhere to see how "tough" on a CPU it would be to
calculate an EC code compared to "writing a sector to
a disk on a remote server and getting an ack back" ? To my very untrained
eye, it seems like a very small part of the whole picture,
especially if you are meant to buy a ton of cards to do it.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD caching on EC-pools (heavy cross OSD communication on cached reads)

2019-06-10 Thread Janne Johansson
Den sön 9 juni 2019 kl 18:29 skrev :

> make sense - makes the cases for ec pools smaller though.
>
> Sunday, 9 June 2019, 17.48 +0200 from paul.emmer...@croit.io <
> paul.emmer...@croit.io>:
>
> Caching is handled in BlueStore itself, erasure coding happens on a higher
> layer.
>
>
>
In your case, caching at cephfs MDS would be even more efficient then?

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW metadata pool migration

2019-05-23 Thread Janne Johansson
Den ons 22 maj 2019 kl 17:43 skrev Nikhil Mitra (nikmitra) <
nikmi...@cisco.com>:

> Hi All,
>
> What are the metadata pools in an RGW deployment that need to sit on the
> fastest medium to better the client experience from an access standpoint ?
>
> Also is there an easy way to migrate these pools in a PROD scenario with
> minimal to no-outage if possible ?
>

We have lots of the non-data pools on SSD and the data (and log) pools on
HDD.
It's a simple matter of making a crush ruleset for SSD and telling the
pools you want to move that they should use the SSD ruleset; they will then
move over by themselves.
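
A sketch of that (rule and pool names are examples; this assumes the SSDs
already carry the "ssd" device class):

  ceph osd crush rule create-replicated ssd-rule default host ssd
  ceph osd pool set default.rgw.meta crush_rule ssd-rule
  ceph osd pool set default.rgw.buckets.index crush_rule ssd-rule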

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is there a Ceph-mon data size partition max limit?

2019-05-10 Thread Janne Johansson
Den fre 10 maj 2019 kl 14:48 skrev Poncea, Ovidiu <
ovidiu.pon...@windriver.com>:

> Oh... joy :) Do you know if, after replay, ceph-mon data will decrease or
> do we need to do some manual cleanup? Hopefully we don't keep it in there
> forever.
>

You get the storage back as soon as the situation clears.
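
You can see how much the mon store currently uses, and ask it to compact,
with something like (use your own mon id):

  du -sh /var/lib/ceph/mon/*/store.db
  ceph tell mon.a compact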


>
> Depends on cluster size and how long you keep your cluster in a degraded
> state.
>
> Having ~64 GB available is a good idea
>


> >>
> >> What is the commanded size for the ceph-mon data partitions? Is there a
> maximum limit to it? If not is there a way to limit it's growth (or celan
> it up)? To my knowledge ceph-mon doesn't use a lot of data (500MB - 1GB
> should be enough, but I'm not the expert here :)
>

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] maximum rebuild speed for erasure coding pool

2019-05-10 Thread Janne Johansson
Den tors 9 maj 2019 kl 17:46 skrev Feng Zhang :

> Thanks, guys.
>
> I forgot the IOPS. So since I have 100disks, the total
> IOPS=100X100=10K. For the 4+2 erasure, one disk fail, then it needs to
> read 5 and write 1 objects.Then the whole 100 disks can do 10K/6 ~ 2K
> rebuilding actions per seconds.
>
> While for the 100X6TB disks, suppose the object size is set to 4MB,
> then 6TB/4MB=1.25 million objects. Not considering the disk throughput
> IO or CPUs, fully rebuilding takes:
>
> 1.25M/2K=600 seconds?
>

I think you will _never_ see a full cluster all helping out at 100% to fix
such an issue,
so while your math is probably correctly describing the absolute best-case,
reality will
be somewhere below that.

Still, it is quite possible to cause this situation on purpose and make a
measurement of your own under exactly your own circumstances, since
everyone's setup is slightly different. Replacing broken drives is normal for
any large storage system, and ceph will prioritize client traffic over normal
repairs most of the time, so that will add to the total calendar time
recovery takes, but keeps your users happy while it is going on.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] maximum rebuild speed for erasure coding pool

2019-05-09 Thread Janne Johansson
Den tors 9 maj 2019 kl 16:17 skrev Marc Roos :

>
>  > Fancy fast WAL/DB/Journals probably help a lot here, since they do
> affect the "iops"
>  > you experience from your spin-drive OSDs.
>
> What difference can be expected if you have a 100 iops hdd and you start
> using
> wal/db/journals on ssd? What would this 100 iops increase to
> (estimating)?
>
>
I don't know. There is a factor of reading objects, which won't get much perf
from WAL/DB/journals at all, only the destination writes do. The relative
sizes of the WAL/journals also matter, since they need to be large enough to
let the drive flush out data (albeit in a nicer order with larger IOs,
presumably), or you will just have nice IOPS for a while and then fall back
to spin-drive speeds as the WAL/journal fills up and has to wait for the
drives anyhow.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] maximum rebuild speed for erasure coding pool

2019-05-09 Thread Janne Johansson
Den tors 9 maj 2019 kl 15:46 skrev Feng Zhang :

>
> For erasure pool, suppose I have 10 nodes, each has 10 6TB drives, so
> in total 100 drives. I make a 4+2 erasure pool, failure domain is
> host/node. Then if one drive failed, (assume the 6TB is fully used),
> what the maximum speed the recovering process can have? Also suppose
> the cluster network is 10GbE, each disk has maximum 200MB/s sequential
> throughput.
>


I think IOPS, in my experience, is what makes the largest impact during
recovery of spinning drives, not sequential speed, so when recovering you
will see progress (on older clusters at least) in terms of the number of
misplaced/degraded objects, like:

misplaced 3454/34989583 objects (0.0123%)

and the first number (this is my guess) moves at the speed of the IOPS of the
drives being repaired. So if the drive(s) from which you rebuild can do 100
IOPS, the above scenario will take ~34 seconds, even if the sizes and raw
speeds would indicate something else about how fast they could move 0.0123%
of your stored data.

As soon as you get super fast SSDs and NVMEs, the limit moves somewhere
else since they
have crazy IOPS numbers, and hence will repair lots faster, but if you have
only spinning
drives, then "don't hold your breath" is good advice, since it will take
longer than 6 TB divided
by 200MB/s (8h20m) if you are unlucky and other drives can't help out in
the rebuild.

Fancy fast WAL/DB/Journals probably help a lot here, since they do affect
the "iops"
you experience from your spin-drive OSDs.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is there a Ceph-mon data size partition max limit?

2019-05-09 Thread Janne Johansson
Den tors 9 maj 2019 kl 11:52 skrev Poncea, Ovidiu <
ovidiu.pon...@windriver.com>:

> Hi folks,
>
> What is the recommended size for the ceph-mon data partitions? Is there a
> maximum limit to it? If not, is there a way to limit its growth (or clean
> it up)? To my knowledge ceph-mon doesn't use a lot of data (500MB - 1GB
> should be enough, but I'm not the expert here :)
>

Our long-lived cluster mons have some 1.5G under /var/lib/ceph for the
monitors; we have given them 50-ish G on /var.
I think if you have missing/downed OSDs for a long while, the mons will
retain info for replays, which will make the store grow a lot while that
condition lasts, so you want some margin there.


> We are working on the StarlingX project and need want to decide if user
> will ever need to resize this partition. If yes then we have to implement a
> good partition resize mechanism else we can leave it static and be done
> with it.
>

Just make it on an LVM volume so you can live-expand if needed.
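
e.g. a sketch, assuming /var lives on an LV called "var" in a VG called
"vg00" and the filesystem supports online grow:

  lvextend -r -L +20G /dev/vg00/var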

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd ssd pool for (windows) vms

2019-05-06 Thread Janne Johansson
Den mån 6 maj 2019 kl 10:03 skrev Marc Roos :

>
> Yes but those 'changes' can be relayed via the kernel rbd driver not?
> Besides I don't think you can move a rbd block device being used to a
> different pool anyway.
>
>
No, but you can move the whole pool, which takes all RBD images with it.


> On the manual page [0] there is nothing mentioned about configuration
> settings needed for rbd use. Nor for ssd. They are also using in the
> example the virtio/vda, while I learned here that you should use the
> virtio-scsi/sda.
>
>
There are differences here. One is "tell the guest to use virtio-scsi", which
makes it possible for TRIM from the guest to reclaim space on
thin-provisioned RBD images, and that is probably a good thing.

That doesn't mean the guest TRIM commands will pass on to the pool's OSD
storage sectors underneath.
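
A sketch of the libvirt side of that (only the relevant attributes shown,
disk source etc. omitted), plus how one might verify it from the guest:

  <controller type='scsi' model='virtio-scsi'/>
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
    <target dev='sda' bus='scsi'/>
  </disk>

Then inside the guest, "fstrim -av" (or mounting with discard) should let a
thin RBD image shrink again, which "rbd du pool/image" on the outside will
show.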

So on the end devices you don't gain anything directly from
letting a guest know
that it currently lies on SSDs or HDDs, because the guest will not be
sending SSD
commands to the real device. Conversely, the TRIMs sent from a guest would
allow re-thinning
on an HDD pool as well, since it's not a factor of the underlying devices,
but rather of the ceph
code and pool/rbd properties, which are the same regardless.
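
As a concrete illustration of that (pool and image names are examples, and it
assumes the libvirt disk has discard enabled, e.g. discard='unmap' on a
virtio-scsi disk):

    fstrim -av                 # inside the guest: issue TRIM for all mounted filesystems
    rbd du rbd/vm-disk-1       # on a ceph client: compare provisioned vs actually used space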

Also, if the guest makes other decisions based on whether there is HDD or SSD
underneath,
those decisions can be wrong both ways, like "I was told it's hdd and
therefore I assume
only X iops are possible" where the kvm librbd layer can cache tons of
things for you,
and filestore OSD ram caches could give you awesome RAM-like write
performance
never seen on normal HDDs. (at the risk of data loss in the worst case)

On the other hand, being told as a guest that there is SSD or NVME
and deciding 100k iops should be the norm from now on would be equally
wrong the
other way around if your ceph network between the guest and OSDs prevents
you from
doing more than 1k iops.

If you find out that there is no certain way to tell a guest about where it
really is stored,
that may actually be a conscious decision that is for the best. Let the
guest try to do as
much IO as it thinks it needs and get the results when they are ready.

(a nod to Xen, which likes to tell its guests they are on IDE drives so
guests never send out
more than one IO request at a time because IDE just doesn't have that
concept, regardless
of how fancy a host you have with super-deep request queues and all... 8-/ )

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restricting access to RadosGW/S3 buckets

2019-05-03 Thread Janne Johansson
Den tors 2 maj 2019 kl 23:41 skrev Vladimir Brik <
vladimir.b...@icecube.wisc.edu>:

> Hello
> I am trying to figure out a way to restrict access to S3 buckets. Is it
> possible to create a RadosGW user that can only access specific bucket(s)?
>

You can have a user with very small bucket/bytes quota so they can't make
buckets of their own, then have another ID make these buckets and add the
first user as an allowed user to be able to write in them. The first user will
not be able to list all buckets given to it, but if the names are
predetermined, this might not be a showstopper.
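
A rough outline of that approach (uids, bucket names and the exact quota
semantics are assumptions that vary a bit between releases, so verify before
relying on it):

    # restricted user with a tiny quota so it cannot store anything under buckets of its own
    radosgw-admin user create --uid=writer --display-name="Restricted writer"
    radosgw-admin quota set --quota-scope=user --uid=writer --max-size=1 --max-objects=1
    radosgw-admin quota enable --quota-scope=user --uid=writer
    # the bucket-owning user creates the bucket and grants access, e.g. with s3cmd
    s3cmd setacl s3://shared-bucket --acl-grant=write:writer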

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd ssd pool for (windows) vms

2019-05-03 Thread Janne Johansson
Den ons 1 maj 2019 kl 23:00 skrev Marc Roos :

> Do you need to tell the vm's that they are on a ssd rbd pool? Or does
> ceph and the libvirt drivers do this automatically for you?
> When testing a nutanix acropolis virtual install, I had to 'cheat' it by
> adding this
>  
> To make the installer think there was a ssd drive.
>

Being or not being on an SSD pool is a (possibly) temporary condition, so
if the guest OS makes certain assumptions based on it, those might be invalid
an hour later.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] clock skew

2019-04-25 Thread Janne Johansson
Den tors 25 apr. 2019 kl 13:00 skrev huang jun :

> mj  于2019年4月25日周四 下午6:34写道:
> >
> > Hi all,
> >
> > On our three-node cluster, we have setup chrony for time sync, and even
> > though chrony reports that it is synced to ntp time, at the same time
> > ceph occasionally reports time skews that can last several hours.
> >
> > But two questions:
> >
> > - can anyone explain why this is happening, is it looks as if ceph and
> > NTP/chrony disagree on just how time-synced the servers are..?
>
> Not familiar with chrony, but for our practice is using NTP, and it works
> fine.
>

What we do with ntpd (and that is probably possible with chrony also) is to
have all mons grab the date from some generic NTP servers, but also add
each other as peers, which means they sync with each other about what time it
is, and since the mons are super close to each other network-wise, this is
very stable, compared to what you might get from a random time server on
the internet. It's not super important that they are right about what time
it actually is, only that they all agree with each other.
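
A minimal ntpd sketch of that setup (hostnames are examples; chrony has
equivalent server/peer directives):

    # /etc/ntp.conf fragment on each mon
    server 0.pool.ntp.org iburst
    server 1.pool.ntp.org iburst
    peer mon1.example.net
    peer mon2.example.net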

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] getting pg inconsistent periodly

2019-04-24 Thread Janne Johansson
Den ons 24 apr. 2019 kl 08:46 skrev Zhenshi Zhou :

> Hi,
>
> I'm running a cluster for a period of time. I find the cluster usually
> run into unhealthy state recently.
>
> With 'ceph health detail', one or two pg are inconsistent. What's
> more, pg in wrong state each day are not placed on the same disk,
> so that I don't think it's a disk problem.
>
> The cluster is using version 12.2.5. Any idea about this strange issue?
>
>
There was lots of fixes for releases around that version,
do read https://ceph.com/releases/12-2-7-luminous-released/
and later release notes on the 12.2.x series.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Are there any statistics available on how most production ceph clusters are being used?

2019-04-19 Thread Janne Johansson
Den fre 19 apr. 2019 kl 12:10 skrev Marc Roos :

>
> [...]since nobody here is interested in a better rgw client for end
> users. I am wondering if the rgw is even being used like this, and what
> most production environments look like.
>
>
"Like this" ?

People use tons of scriptable and built-in clients, from s3cmd to "my
backup software can use S3 as a remote backend".
You mentioned looking at two and now conclude no one wants S3...


> This could also be interesting information to decide in what direction
> ceph should develop in the future not?
>
>
Find an area which bugs you and fix that, present your results, don't go
ape over a failed "survey" during easter vacations.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw windows/mac clients shitty, develop a new one?

2019-04-18 Thread Janne Johansson
https://www.reddit.com/r/netsec/comments/8t4xrl/filezilla_malware/

not saying it definitely is, or isn't malware-ridden, but it sure was shady
at that time.
I would suggest not pointing people to it.


Den tors 18 apr. 2019 kl 16:41 skrev Brian : :

> Hi Marc
>
> Filezilla has decent S3 support https://filezilla-project.org/
>
> ymmv of course!
>
> On Thu, Apr 18, 2019 at 2:18 PM Marc Roos 
> wrote:
> >
> >
> > I have been looking a bit at the s3 clients available to be used, and I
> > think they are quite shitty, especially this Cyberduck that processes
> > files with default reading rights to everyone. I am in the process to
> > advice clients to use for instance this mountain duck. But I am not to
> > happy about it. I don't like the fact that everything has default
> > settings for amazon or other stuff in there for ftp or what ever.
> >
> > I am thinking of developing something in-house, more aimed at the ceph
> > environments, easier/better to use.
> >
> > What I can think of:
> >
> > - cheaper, free or maybe even opensource
> > - default settings for your ceph cluster
> > - only configuration for object storage (no amazon, rackspace, backblaze
> > shit)
> > - default secure settings
> > - offer in the client only functionality that is available from the
> > specific ceph release
> > - integration with the finder / explorer windows
> >
> > I am curious who would be interested in a such new client? Maybe better
> > to send me your wishes directly, and not clutter the mailing list with
> > this.
> >
> >
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] showing active config settings

2019-04-10 Thread Janne Johansson
Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :

> > If you don't specify which daemon to talk to, it tells you what the
> > defaults would be for a random daemon started just now using the same
> > config as you have in /etc/ceph/ceph.conf.
>
> I tried that, too, but the result is not correct:
>
> host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
> osd_recovery_max_active = 3
>

I always end up using "ceph --admin-daemon
/var/run/ceph/name-of-socket-here.asok config show | grep ..." to get what
is in effect now for a certain daemon.
Needs you to be on the host of the daemon of course.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] showing active config settings

2019-04-10 Thread Janne Johansson
Den ons 10 apr. 2019 kl 13:31 skrev Eugen Block :

>
> While --show-config still shows
>
> host1:~ # ceph --show-config | grep osd_recovery_max_active
> osd_recovery_max_active = 3
>
>
> It seems as if --show-config is not really up-to-date anymore?
> Although I can execute it, the option doesn't appear in the help page
> of a Mimic and Luminous cluster. So maybe this is deprecated.
>
>
If you don't specify which daemon to talk to, it tells you what the
defaults would be for a random daemon started just now using the same
config as you have in /etc/ceph/ceph.conf.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster is not stable

2019-03-15 Thread Janne Johansson
Den tors 14 mars 2019 kl 17:00 skrev Zhenshi Zhou :
> I think I've found the root cause which make the monmap contains no
> feature. As I moved the servers from one place to another, I modified
> the monmap once.

If this was the empty cluster that you refused to redo from scratch, then I
feel it might be right to quote myself from the discussion before the move:
--
 If the cluster is clean I see no
reason for doing brain surgery on monmaps
just to "save" a few minutes of redoing correctly from scratch.



*What if you miss some part, some command gives you an error you really
aren't comfortable with, something doesn't really feel right after doing it,
then the whole lifetime of that cluster will be followed by a small nagging
feeling* that it might have been
that time you followed a guide that tries to talk you out of
doing it that way, for a cluster with no data.
--

I think the part in bold is *exactly* what happened to you now: you did
something quite far out of the ordinary which was doable, but recommended
against, and somehow some part not anticipated or covered in the "blindly
type these commands into your ceph" guide occurred.

From this point on, you _will_ know that your cluster is not 100% like
everyone else's, and any future errors and crashes just _might_ be from it
being different in a way no one has ever tested before. Some bit unset, some
string left uninitialised, some value left untouched that never could be
like that if done right.

If you have little data in it now, I would still recommend moving data
elsewhere and setting it up correctly.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph migration

2019-02-25 Thread Janne Johansson
Den mån 25 feb. 2019 kl 13:40 skrev Eugen Block :
> I just moved a (virtual lab) cluster to a different network, it worked
> like a charm.
> In an offline method - you need to:
>
> - set osd noout, ensure there are no OSDs up
> - Change the MONs IP, See the bottom of [1] "CHANGING A MONITOR’S IP
> ADDRESS", MONs are the only ones really
> sticky with the IP
> - Ensure ceph.conf has the new MON IPs and network IPs
> - Start MONs with new monmap, then start OSDs
>
> > No, certain ips will be visible in the databases, and those will not change.
> I'm not sure where old IPs will be still visible, could you clarify
> that, please?

Well, I've just reacted to all the text at the beginning of
http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
including the title "the messy way". If the cluster is clean I see no
reason for doing brain surgery on monmaps
just to "save" a few minutes of redoing correctly from scratch. What
if you miss some part, some command gives you an error
you really aren't comfortable with, something doesn't really feel
right after doing it, then the whole lifetime of that cluster
will be followed by a small nagging feeling that it might have been
that time you followed a guide that tries to talk you out of
doing it that way, for a cluster with no data.

I think that is the wrong way to learn how to run clusters.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph migration

2019-02-25 Thread Janne Johansson
Den mån 25 feb. 2019 kl 12:33 skrev Zhenshi Zhou :
> I deployed a new cluster(mimic). Now I have to move all servers
> in this cluster to another place, with new IP.
> I'm not sure if the cluster will run well or not after I modify config
> files, include /etc/hosts and /etc/ceph/ceph.conf.

No, certain ips will be visible in the databases, and those will not change.

> Fortunately, the cluster has no data at present. I never encounter
> such an issue like this. Is there any suggestion for me?

If you recently created the cluster, it should be easy to just
recreate it again,
using the same scripts so you don't have to repeat yourself as an admin since
computers are very good at repetitive tasks.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread Janne Johansson
Den fre 22 feb. 2019 kl 12:35 skrev M Ranga Swami Reddy <
swamire...@gmail.com>:

> No seen the CPU limitation because we are using the 4 cores per osd daemon.
> But still using "ms_crc_data = true and ms_crc_header = true". Will
> disable these and try the performance.
>

I am a bit sceptical of crc being so heavy that it would impact a CPU made
after 1990...

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to trim default.rgw.log pool?

2019-02-14 Thread Janne Johansson
While we're at it, a way to know what one can remove from the
default.rgw...non-ec pool would be nice. We have tons of old zero-size objects
there which are probably useless and just take up (meta)space.
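
If anyone wants to hunt for them, something along these lines can at least
enumerate the zero-size objects (the pool name is an example, it will be slow
on big pools, and it says nothing about whether the objects are safe to
delete):

    rados -p default.rgw.buckets.non-ec ls | while read obj; do
        rados -p default.rgw.buckets.non-ec stat "$obj"
    done | awk '$NF == 0'        # rados stat prints the size as the last field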


Den tors 14 feb. 2019 kl 09:26 skrev Charles Alva :

> Hi All,
>
> Is there a way to trim Ceph default.rgw.log pool so it won't take huge
> space? Or perhaps is there logrotate mechanism in placed?
>
>
> Kind regards,
>
> Charles Alva
> Sent from Gmail Mobile
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multicast communication compuverde

2019-02-06 Thread Janne Johansson
For EC coded stuff, at 10+4 with 13 others needing data apart from the
primary, they are specifically NOT getting the same data: they are getting
either one of the 10 data pieces or one of the 4 different checksums, so it
would be nasty to send full data to all OSDs each expecting 1/14th of the data.


Den ons 6 feb. 2019 kl 10:14 skrev Marc Roos :

>
> Yes indeed, but for osd's writing the replication or erasure objects you
> get sort of parrallel processing not?
>
>
>
> Multicast traffic from storage has a point in things like the old
> Windows provisioning software Ghost where you could netboot a room full
> och computers, have them listen to a mcast stream of the same data/image
> and all apply it at the same time, and perhaps re-sync potentially
> missing stuff at the end, which would be far less data overall than
> having each client ask the server(s) for the same image over and over.
> In the case of ceph, I would say it was much less probable that many
> clients would ask for exactly same data in the same order, so it would
> just mean all clients hear all traffic (or at least more traffic than
> they asked for) and need to skip past a lot of it.
>
>
> Den tis 5 feb. 2019 kl 22:07 skrev Marc Roos :
>
>
>
>
> I am still testing with ceph mostly, so my apologies for bringing
> up
> something totally useless. But I just had a chat about compuverde
> storage. They seem to implement multicast in a scale out solution.
>
> I was wondering if there is any experience here with compuverde
> and
> how
> it compared to ceph. And maybe this multicast approach could be
> interesting to use with ceph?
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
>
> May the most significant bit of your life be positive.
>
>
>
>

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multicast communication compuverde

2019-02-06 Thread Janne Johansson
Multicast traffic from storage has a point in things like the old Windows
provisioning software Ghost where you could netboot a room full of
computers, have them listen to a mcast stream of the same data/image and
all apply it at the same time, and perhaps re-sync potentially missing
stuff at the end, which would be far less data overall than having each
client ask the server(s) for the same image over and over.
In the case of ceph, I would say it was much less probable that many
clients would ask for exactly same data in the same order, so it would just
mean all clients hear all traffic (or at least more traffic than they asked
for) and need to skip past a lot of it.


Den tis 5 feb. 2019 kl 22:07 skrev Marc Roos :

>
>
> I am still testing with ceph mostly, so my apologies for bringing up
> something totally useless. But I just had a chat about compuverde
> storage. They seem to implement multicast in a scale out solution.
>
> I was wondering if there is any experience here with compuverde and how
> it compared to ceph. And maybe this multicast approach could be
> interesting to use with ceph?
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph block - volume with RAID#0

2019-01-31 Thread Janne Johansson
Den fre 1 feb. 2019 kl 06:30 skrev M Ranga Swami Reddy :

> Here user requirement is - less write and more reads...so not much
> worried on performance .
>

So why go for raid0 at all?
It is the least secure way to store data.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph block - volume with RAID#0

2019-01-30 Thread Janne Johansson
Den ons 30 jan. 2019 kl 14:47 skrev M Ranga Swami Reddy <
swamire...@gmail.com>:

> Hello - Can I use the ceph block volume with RAID#0? Are there any
> issues with this?
>

Hard to tell if you mean raid0 over a block volume or a block volume over
raid0. Still, it is seldom a good idea to stack redundancies on top of each
other.
It will work, but may not give the gains you might expect from it.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice for increasing number of pg and pgp

2019-01-30 Thread Janne Johansson
Den ons 30 jan. 2019 kl 05:24 skrev Linh Vu :
>
> We use https://github.com/cernceph/ceph-scripts  ceph-gentle-split script to 
> slowly increase by 16 pgs at a time until we hit the target.

>
> Somebody recommends that this adjustment should be done in multiple stages, 
> e.g. increase 1024 pg each time. Is this a good practice? or should we 
> increase it to 8192 in one time. Thanks!

We also do a few at a time, mostly 8 I think.
-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Modify ceph.mon network required

2019-01-25 Thread Janne Johansson
Den fre 25 jan. 2019 kl 09:52 skrev cmonty14 <74cmo...@gmail.com>:
>
> Hi,
> I have identified a major issue with my cluster setup consisting of 3 nodes:
> all monitors are connected to cluster network.
>
> Question:
> How can I modify the network configuration of mon?
>
> It's not working to simply change the parameters in ceph.conf because
> then the quorum fails.
>

Look at the docs for Adding and Removing Mons and then you remove one,
and add it at the new address, and repeat for the other mons.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] quick questions about a 5-node homelab setup

2019-01-22 Thread Janne Johansson
Den tis 22 jan. 2019 kl 00:50 skrev Brian Topping :
> > I've scrounged up 5 old Atom Supermicro nodes and would like to run them 
> > 365/7 for limited production as RBD with Bluestore (ideally latest 13.2.4 
> > Mimic), triple copy redundancy. Underlying OS is a Debian 9 64 bit, minimal 
> > install.
>
> The other thing to consider about a lab is “what do you want to learn?” If 
> reliability isn’t an issue (ie you aren’t putting your family pictures on 
> it), regardless of the cluster technology, you can often learn basics more 
> quickly without the overhead of maintaining quorums and all that stuff on day 
> one. So at risk of being a heretic, start small, for instance with single 
> mon/manager and add more later.

Well, if you start small with one OSD, you are going to run into "the
defaults will work against you", since as you make your first pool, it
will want to place 3 copies on separate hosts, so not only are you
trying to get accustomed to ceph terms and technologies, you are also
working against the whole cluster idea by not building a cluster at
all, so you will encounter problems regular ceph admins never see,
and the chances of getting help are smaller. Things like "OSD will
pre-allocate so much data that a 10G OSD crashes at start" or "my pool won't
start since my pgs are in a bad state since I have only one OSD or
only one host and I didn't change the crush rules" are things only
people starting small will ever experience. Anyone with 3 or more real
hosts with real drives attached will just never see them.

Telling people to learn clusters by building a non-cluster might be
counter-productive. When you have a working ceph cluster you can
practice getting it to run on an RPi with a USB stick for a drive,
but starting with that will make you fight two or more unknowns at the
same time: both ceph being new to you, and un-clustering a cluster
software suite (and possibly running on non-x86_64 for a third
unknown).

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] quick questions about a 5-node homelab setup

2019-01-21 Thread Janne Johansson
Den fre 18 jan. 2019 kl 12:42 skrev Robert Sander
:

> > Assuming BlueStore is too fat for my crappy nodes, do I need to go to 
> > FileStore? If yes, then with xfs as the file system? Journal on the SSD as 
> > a directory, then?
>
> Journal for FileStore is also a block device.

It can be a file in a pre-existing file system also.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with CephFS - No space left on device

2019-01-08 Thread Janne Johansson
Den tis 8 jan. 2019 kl 16:05 skrev Yoann Moulin :
> The best thing you can do here is added two disks to pf-us1-dfs3.

After that, get a fourth host with 4 OSDs on it and add it to the cluster.
If you have 3 replicas (which is good!), then any downtime will mean
the cluster is
kept in a degraded mode. If you have 4 or more hosts then the cluster
will repair
itself and get back into a decent state when you lose a server for
whatever reason.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer=on with crush-compat mode

2019-01-06 Thread Janne Johansson
Den sön 6 jan. 2019 kl 13:22 skrev Marc Roos :
>
>  >If I understand the balancer correct, it balances PGs not data.
>  >This worked perfectly fine in your case.
>  >
>  >I prefer a PG count of ~100 per OSD, you are at 30. Maybe it would
>  >help to bump the PGs.
>  >

> I am not sure if I should increase from 8 to 16. Because that would just
>
> half the data in the pg's and they probably end up on the same osd's in
> the same ratio as now? Or am I assuming this incorrectly?
> Is 16 adviced?
>

If you had only one PG (the most extreme use case) it would always be optimally
misplaced. If you have lots, there are many more chances of ceph spreading them
correctly. There is some hashing and pseudo-randomness in there to screw it
up at times,
but considering what you can do with few -vs- many PGs, having many allows for
better spread than few, up to some point where the cpu handling all PGs eats more
resources than it's worth.
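
If you do decide to bump it, it is the usual pool settings (the pool name is an
example; on Luminous you bump pg_num and pgp_num yourself, and doing it in
small steps keeps the data movement manageable):

    ceph osd pool set rbd pg_num 16
    ceph osd pool set rbd pgp_num 16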

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] list admin issues

2018-12-26 Thread Janne Johansson
Den lör 22 dec. 2018 kl 19:18 skrev Brian : :
> Sorry to drag this one up again.

Not as sorry to drag it up as you

> Just got the unsubscribed due to excessive bounces thing.

And me.

> 'Your membership in the mailing list ceph-users has been disabled due
> to excessive bounces The last bounce received from you was dated
[..]
> this before your membership in the list is deleted.'
> can anyone check MTA logs to see what the bounce is?

^^^
this

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore nvme DB/WAL size

2018-12-21 Thread Janne Johansson
Den tors 20 dec. 2018 kl 22:45 skrev Vladimir Brik
:
> Hello
> I am considering using logical volumes of an NVMe drive as DB or WAL
> devices for OSDs on spinning disks.
> The documentation recommends against DB devices smaller than 4% of slow
> disk size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so
> dividing it equally will result in each OSD getting ~90GB DB NVMe
> volume, which is a lot less than 4%. Will this cause problems down the road?

Well, apart from the reply you already got on "one nvme fails all the
HDDs it is WAL/DB for",
the recommendations are about getting the best out of them, especially
for the DB I suppose.

If one can size things up beforehand, then following the recommendations is a
good choice, but I think
you should test using it for WALs for instance, and bench it against
another host with data,
wal and db on the HDD, and see if it helps a lot in your expected use case.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scheduling deep-scrub operations

2018-12-14 Thread Janne Johansson
Den fre 14 dec. 2018 kl 12:25 skrev Caspar Smit :
> We have operating hours from 4 pm until 7 am each weekday and 24 hour days in 
> the weekend.
> I was wondering if it's possible to allow deep-scrubbing from 7 am until 15 
> pm only on weekdays and prevent any deep-scrubbing in the weekend.
> I've seen the osd scrub begin/end hour settings but that doesn't allow for 
> preventing deep-scrubs in the weekend.

It is somewhat simple to run scrubs manually.


At 7 AM, get a PG dump and sort based on last-scrubbed, then while the time
is before 3 PM, deep-scrub the longest-since-scrubbed PG, and repeat.
With additions for weekend logic on top of that.
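
A rough sketch of what that could look like as a cron-started script (weekdays
at 07:00); jq is assumed to be available, and the JSON field names/layout
should be checked against "ceph pg dump --format json-pretty" on your release
before trusting any of it:

    #!/bin/bash
    # stop issuing new deep-scrubs at 15:00; skip weekends entirely
    [ "$(date +%u)" -ge 6 ] && exit 0
    while [ "$(date +%H)" -lt 15 ]; do
        # pick the PG that has gone longest without a deep scrub
        pgid=$(ceph pg dump --format json 2>/dev/null |
               jq -r '.pg_stats | sort_by(.last_deep_scrub_stamp) | .[0].pgid')
        ceph pg deep-scrub "$pgid"
        sleep 600    # crude pacing; ideally wait for the scrub to actually finish instead
    done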

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] yet another deep-scrub performance topic

2018-12-11 Thread Janne Johansson
Den tis 11 dec. 2018 kl 12:54 skrev Caspar Smit :
>
> On a Luminous 12.2.7 cluster these are the defaults:
> ceph daemon osd.x config show

thank you very much.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] yet another deep-scrub performance topic

2018-12-11 Thread Janne Johansson
Den tis 11 dec. 2018 kl 12:26 skrev Caspar Smit :
>
> Furthermore, presuming you are running Jewel or Luminous you can change some 
> settings in ceph.conf to mitigate the deep-scrub impact:
>
> osd scrub max interval = 4838400
> osd scrub min interval = 2419200
> osd scrub interval randomize ratio = 1.0
> osd scrub chunk max = 1
> osd scrub chunk min = 1
> osd scrub priority = 1
> osd scrub sleep = 0.1
> osd deep scrub interval = 2419200
> osd deep scrub stride = 1048576
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 7
>

It would be interesting to see what the defaults for those were, so
one can see which go up and which go down.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High average apply latency Firefly

2018-12-04 Thread Janne Johansson
Den tis 4 dec. 2018 kl 11:20 skrev Klimenko, Roman :
>
> Hi everyone!
>
> On the old prod cluster
> - baremetal, 5 nodes (24 cpu, 256G RAM)
> - ceph 0.80.9 filestore
> - 105 osd, size 114TB (each osd 1.1T, SAS Seagate ST1200MM0018) , raw used 60%
> - 15 journals (eash journal 0.4TB, Toshiba PX04SMB040)
> - net 20Gbps
> - 5 pools, size 2, min_size 1
> we have discovered recently pretty high Average Apply latency, near 20 ms.
> Using ceph osd perf I can see that sometimes osd's apply latency could be 
> high as 300-400 ms on some disks. How I can tune ceph in order to reduce this 
> latency?

I would start with running "iostat" on all OSD hosts and see if one
or more drives show very high utilization%.
Having one or a few drives that are much slower than the rest (in many
cases it shows up as taking a long time to finish IO
and hence higher utilization% than the other OSD drives) will hurt the
whole cluster speed.
If you find one or a few drives being extra slow, lower their crush weight
so data moves off them to other healthy drives.
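
As a concrete sketch (the osd id and weight below are made-up examples):

    iostat -x 5                          # a drive stuck near 100% util while its peers are idle is a good suspect
    ceph osd crush reweight osd.17 0.5   # shrink a suspect OSD's weight so data drains off it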

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] all vms can not start up when boot all the ceph hosts.

2018-12-04 Thread Janne Johansson
Den tis 4 dec. 2018 kl 10:37 skrev linghucongsong :
> Thank you for reply!
> But it is just in case suddenly power off for all the hosts!
> So the best way for this it is to have the snapshot on  the import vms or 
> have to mirror the
> images to other ceph cluster?

The best way is probably to handle it just like you would handle power outages
for physical machines:
make sure you have working backups with tested restores, AND/OR
scripts that can reinstall your
guests again if needed, and then avoid power outages as much as possible.

ceph will only know that certain writes and reads are being made from
the openstack compute hosts,
it will not know the meaning of an individual write, so it can't know
if one write just before an outage
is half of an xfs update needed to create a new file or if it is a
complete transaction, so if you remove
the storage at random, then random errors will occur on your
filesystem, just like pulling out a USB
stick while you are writing to it.

If power outages are very common, you might consider having lots of
small filesystems on your guests
and mounting them read-only as much as possible, and then for short
intervals making them read-write
so that the chance of each partition getting broken during outages
gets as small as possible.

>> HI all!
>>
>> I have a ceph test envirment use ceph with openstack. There are some vms run 
>> on the openstack. It is just a test envirment.
>> my ceph version is 12.2.4. Last day I reboot all the ceph hosts before this 
>> I do not shutdown the vms on the openstack.
>> When all the hosts boot up and the ceph become healthy. I  found all the vms 
>> can not start up. All the vms have the
>> below xfs error. Even I use xfs_repair also can not repair this problem .
>>
>
> So you removed the underlying storage while the machines were running. What 
> did you expect would happen?
> If you do this to a physical machine, or guests running with some other kind 
> of remote storage like iscsi, what
> do you think will happen to running machines in that case?
>
>
> --
> May the most significant bit of your life be positive.
>
>
>
>



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] all vms can not start up when boot all the ceph hosts.

2018-12-04 Thread Janne Johansson
Den tis 4 dec. 2018 kl 09:49 skrev linghucongsong :

> HI all!
>
> I have a ceph test envirment use ceph with openstack. There are some vms
> run on the openstack. It is just a test envirment.
> my ceph version is 12.2.4. Last day I reboot all the ceph hosts before
> this I do not shutdown the vms on the openstack.
> When all the hosts boot up and the ceph become healthy. I  found all the
> vms can not start up. All the vms have the
> below xfs error. Even I use xfs_repair also can not repair this problem .
>
>
So you removed the underlying storage while the machines were running. What
did you expect would happen?
If you do this to a physical machine, or guests running with some other
kind of remote storage like iscsi, what
do you think will happen to running machines in that case?


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disable intra-host replication?

2018-11-26 Thread Janne Johansson
Den mån 26 nov. 2018 kl 12:11 skrev Marco Gaiarin :
> Mandi! Janne Johansson
>   In chel di` si favelave...
>
> > The default crush rules with replication=3 would only place PGs on
> > separate hosts,
> > so in that case it would go into degraded mode if a node goes away,
> > and not place
> > replicas on different disks on the remaining hosts.
>
> 'hosts' mean 'hosts with OSDs', right?
> Because my cluster have 5 hosts, 2 are only MONs.

Yes, only hosts with OSDs.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sizing for bluestore db and wal

2018-11-26 Thread Janne Johansson
Den mån 26 nov. 2018 kl 10:10 skrev Felix Stolte :
>
> Hi folks,
>
> i upgraded our ceph cluster from jewel to luminous and want to migrate
> from filestore to bluestore. Currently we use one SSD as journal for
> thre 8TB Sata Drives with a journal partition size of 40GB. If my
> understanding of the bluestore documentation is correct, i can use a wal
> partition for the writeahead log (to decrease write latency, similar to
> filestore) and a db partition for metadata (decreasing write AND read
> latency/throughput). Now I have two questions:
>
> a) Do I really need an WAL partition if both wal and db are on the same SSD?

I think the answer is no here; if you point the DB at an SSD, bluestore will
use it for the WAL also.

> b) If so, what would the ratio look like? 99% db, 1% wal?

..which means just let ceph handle this itself.
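
For example, creating such an OSD with ceph-volume could look roughly like
this (device names are examples; with no separate --block.wal given, the WAL
ends up on the DB device):

    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1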

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects afte: ceph osd in $osd

2018-11-26 Thread Janne Johansson
Den mån 26 nov. 2018 kl 09:39 skrev Stefan Kooman :

> > It is a slight mistake in reporting it in the same way as an error,
> > even if it looks to the
> > cluster just as if it was in error and needs fixing. This gives the
> > new ceph admins a
> > sense of urgency or danger whereas it should be perfectly normal to add 
> > space to
> > a cluster. Also, it could have chosen to add a fourth PG in a repl=3
> > PG and fill from
> > the one going out into the new empty PG and somehow keep itself with 3 
> > working
> > replicas, but ceph chooses to first discard one replica, then backfill
> > into the empty
> > one, leading to this kind of "error" report.
>
> Thanks for the explanation. I agree with you that it would be more safe to
> first backfill to the new PG instead of just assuming the new OSD will
> be fine and discarding a perfectly healthy PG. We do have max_size 3 in
> the CRUSH ruleset ... I wonder if Ceph would behave differently if we
> would have max_size 4 ... to actually allow a fourth copy in the first
> place ...

I don't think the replication number is important, it's more of a choice which
PERHAPS is meant to allow you to move PGs to a new drive when the cluster is
near full, since it will clear out space much faster if you just kill
off one unneeded
replica and start writing to a new drive, whereas keeping all old
replicas until data is
100% ok on the new replica will make new space not appear until a large
amount of data has moved, which for large drives and large PGs might take
a very long time.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects afte: ceph osd in $osd

2018-11-26 Thread Janne Johansson
Den sön 25 nov. 2018 kl 22:10 skrev Stefan Kooman :
>
> Hi List,
>
> Another interesting and unexpected thing we observed during cluster
> expansion is the following. After we added  extra disks to the cluster,
> while "norebalance" flag was set, we put the new OSDs "IN". As soon as
> we did that a couple of hundered objects would become degraded. During
> that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> weight host=$storage-node" would cause extra degraded objects.
>
> I don't expect objects to become degraded when extra OSDs are added.
> Misplaced, yes. Degraded, no
>
> Someone got an explantion for this?
>

Yes, when you add a drive (or 10), some PGs decide they should have one or more
replicas on the new drives, a new empty PG is created there, and
_then_ that replica
will make that PG get into the "degraded" mode, meaning if it had 3
fine active+clean
replicas before, it now has 2 active+clean and one needing backfill to
get into shape.

It is a slight mistake in reporting it in the same way as an error,
even if it looks to the
cluster just as if it was in error and needs fixing. This gives the
new ceph admins a
sense of urgency or danger whereas it should be perfectly normal to add space to
a cluster. Also, it could have chosen to add a fourth replica to a repl=3
PG and fill from
the one going out into the new empty PG, and somehow keep itself with 3 working
replicas, but ceph chooses to first discard one replica, then backfill
into the empty
one, leading to this kind of "error" report.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disable intra-host replication?

2018-11-23 Thread Janne Johansson
Den fre 23 nov. 2018 kl 15:19 skrev Marco Gaiarin :
>
>
> Previous (partial) node failures and my current experiments on adding a
> node lead me to the fact that, when rebalancing are needed, ceph
> rebalance also on intra-node: eg, if an OSD of a node die, data are
> rebalanced on all OSD, even if i've pool molteplicity 3 and 3 nodes.
>
> This, indeed, make perfectly sense: overral data scattering have better
> performance and safety.
>
>
> But... there's some way to se to crush 'don't rebalance in the same node, go
> in degradated mode'?
>

The default crush rules with replication=3 would only place PGs on
separate hosts,
so in that case it would go into degraded mode if a node goes away,
and not place
replicas on different disks on the remaining hosts.
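
For reference, this is roughly what that default rule looks like in a
decompiled crushmap (shown for illustration; dump your own first to check):

    ceph osd getcrushmap -o /tmp/crushmap.bin
    crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

    # excerpt from the decompiled map - "type host" is what forces replicas onto separate hosts
    rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }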

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New OSD with weight 0, rebalance still happen...

2018-11-23 Thread Janne Johansson
Den fre 23 nov. 2018 kl 11:08 skrev Marco Gaiarin :

> Reading ceph docs lead to me that 'ceph osd reweight' and 'ceph osd crush
> reweight' was roughly the same, the first is effectively 'temporary'
> and expressed in percentage (0-1), while the second is 'permanent' and
> expressed, normally, as disk terabyte.
>
> You are saying that insted the first modify only the disk occupation,
> while only the latter alter the crush map.

The crush weight tells the cluster how much this disk adds to the
capacity of the host it is attached to,
the OSD weight says (from 0 to 1) how much of the advertised size it
actually wants to receive/handle.
If you add crush weight, data will flow to the node, but if it has low
OSD weight, the other OSDs on the
host will have to bear the extra data. So starting out with 0 for
crush and 1.0 for OSD weight is fine, it
will not cause data movement, until you start (slowly perhaps) to add
to the crush weight until it matches
the size of the disk.
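
A sketch of that sequence (osd id, hostname and weights are examples):

    # add the new OSD to the crush map at weight 0, leaving the OSD (reweight)
    # weight at its default of 1.0
    ceph osd crush add osd.42 0 host=node3
    # then ramp the crush weight up in small steps, waiting for rebalancing
    # to settle in between, until it matches the disk size
    ceph osd crush reweight osd.42 0.5
    ceph osd crush reweight osd.42 3.64    # e.g. final value for a 4 TB disk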

--
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I can't find the configuration of user connection log in RADOSGW

2018-11-12 Thread Janne Johansson
Den mån 12 nov. 2018 kl 06:19 skrev 대무무 :
>
> Hello.
> I installed ceph framework in 6 servers and I want to manage the user access 
> log. So I configured ceph.conf in the server which installing the rgw.
>
> ceph.conf
> [client.rgw.~~~]
> ...
> rgw enable usage log = True
>
> However, I cannot find the connection log of each user.
> I want to know the method of user connection log like the linux command 
> 'last'.
>
> I’m looking forward to hearing from you.
>

That config setting will record summary statistics on bucket creation,
file ops and so on.

For a simple "login"/"logout" view you should look at the webserver
log (civetweb or
whatever frontend you have in front of radosgw).
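
For the usage-log side, the summary can be pulled out with radosgw-admin (the
uid is an example; the access-log location depends on how your frontend
logging is configured):

    radosgw-admin usage show --uid=johndoe --show-log-entries=false
    tail -f /var/log/ceph/*rgw*.log     # per-request lines from the frontend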


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 12.2.9 release

2018-11-08 Thread Janne Johansson
Den ons 7 nov. 2018 kl 18:43 skrev David Turner :
>
> My big question is that we've had a few of these releases this year that are 
> bugged and shouldn't be upgraded to... They don't have any release notes or 
> announcement and the only time this comes out is when users finally ask about 
> it weeks later.  Why is this not proactively announced to avoid a problematic 
> release and hopefully prevent people from installing it?  It would be great 
> if there was an actual release notes saying not to upgrade to this version or 
> something.

I think the big question is why these packages end up published publicly so
that scripts, updates and anyone not actively trying to hold back get
exposed to them, and then we are somehow supposed to notice that the
accompanying release notes are lacking and from that divine
that we shouldn't have upgraded into this release at all. This seems
all backwards in most possible ways.

I'm not even upset about releases having bugs, stuff happens, but about the
way people are forced into it, and then it's somehow your fault for
running ceph-deploy or yum/apt upgrade against official release repos.
It's almost as if it was meant to push people into slow-moving dists
like Blue^H^H^H^HRedhat with ceph on top.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] list admin issues

2018-11-06 Thread Janne Johansson
Den lör 6 okt. 2018 kl 15:06 skrev Elias Abacioglu
:
> I'm bumping this old thread cause it's getting annoying. My membership get 
> disabled twice a month.
> Between my two Gmail accounts I'm in more than 25 mailing lists and I see 
> this behavior only here. Why is only ceph-users only affected? Maybe 
> Christian was on to something, is this intentional?
> Reality is that there is a lot of ceph-users with Gmail accounts, perhaps it 
> wouldn't be so bad to actually trying to figure this one out?
> So can the maintainers of this list please investigate what actually gets 
> bounced? Look at my address if you want.
> I got disabled 20181006, 20180927, 20180916, 20180725, 20180718 most recently.

Guess it's time for it again.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC K + M Size

2018-11-03 Thread Janne Johansson
Den lör 3 nov. 2018 kl 09:10 skrev Ashley Merrick :
>
> Hello,
>
> Tried to do some reading online but was unable to find much.
>
> I can imagine a higher K + M size with EC requires more CPU to re-compile the 
> shards into the required object.
>
> But is there any benefit or negative going with a larger K + M, obviously 
> their is the size benefit but technically could it also improve reads due to 
> more OSD's providing a smaller section of the data required to compile the 
> shard?
>
> Is there any gotchas that should be known for example going with a 4+2 vs 10+2
>

If one host goes down in a 10+2 scenario, then 11 or 12 other
machines need to get involved in order to repair the lost data. This
means that if your cluster has close to 12 hosts, most
or all of the servers get extra work now. I saw some old yahoo post from
long ago that stated that the primary (whose job it is to piece them
together) would only send out 8 requests at any given time, and IF
still true, that would make 6+2 somewhat more efficient. Still, EC is
seldom about performance, but rather about saving space while still
allowing 1-2-3 drives to die without losing data by using 1-2-3
checksum pieces.
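
For reference, a 6+2 profile like the one mentioned above would be set up
roughly like this (profile name, pool name and pg count are just examples):

    ceph osd erasure-code-profile set ec62 k=6 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec62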


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Priority for backfilling misplaced and degraded objects

2018-11-01 Thread Janne Johansson
I think that all the misplaced PGs that are in the queue that get
writes _while_ waiting for backfill will get the "degraded" status,
meaning that before they were just in the wrong place, now they are in
the wrong place, AND the newly made PG they should backfill into will
get an old dump made first, then an incremental with all the changes
that came in while waiting or while finishing the first backfill, then
become active+clean.
Nothing to worry about, that is how recovery looks on all clusters.

Den ons 31 okt. 2018 kl 22:29 skrev Jonas Jelten :
>
> Hello!
>
> My cluster currently has this health state:
>
> 2018-10-31 21:20:13.694633 mon.lol [WRN] Health check update: 
> 39010709/192173470 objects misplaced (20.300%)
> (OBJECT_MISPLACED)
> 2018-10-31 21:20:13.694684 mon.lol [WRN] Health check update: Degraded data 
> redundancy: 1624786/192173470 objects
> degraded (0.845%), 49 pgs degraded, 57 pgs undersized (PG_DEGRADED)
> [...]
> 2018-10-31 21:39:24.113440 mon.lol [WRN] Health check update: 
> 38897646/192173470 objects misplaced (20.241%)
> (OBJECT_MISPLACED)
> 2018-10-31 21:39:24.113526 mon.lol [WRN] Health check update: Degraded data 
> redundancy: 1613658/192173470 objects
> degraded (0.840%), 49 pgs degraded, 57 pgs undersized (PG_DEGRADED)
>
>
> It is recovering slowly, but apparenly does not recover the 0.8% degraded 
> objects first. Instead it recovers both at the
> same relative rate, which even means that the misplaced objects are recovered 
> way slower than the misplaced objects!
>
> Is there a way to recover the degraded objects first?
>
>
> Cheers
>   -- Jonas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Misplaced/Degraded objects priority

2018-10-24 Thread Janne Johansson
Den ons 24 okt. 2018 kl 13:09 skrev Florent B :
> On a Luminous cluster having some misplaced and degraded objects after
> outage :
>
> health: HEALTH_WARN
> 22100/2496241 objects misplaced (0.885%)
> Degraded data redundancy: 964/2496241 objects degraded
> (0.039%), 3 p
> gs degraded
>
> I can that Ceph gives priority on replacing objects instead of repairing
> degraded ones.
>
> Number of misplaced objects is decreasing, while number of degraded
> objects does not decrease.
> Is it expected ?

I think it is. It can even increase.

My theory is that you have a certain PG (or many) that is misplaced
during the outage, and
the cluster runs on with the replicas of the PG taking reads and
writes during recovery.
As long as there are only reads, the PG (and the % of objects it
holds) will only be
misplaced, and as the cluster slowly gets stuff back to where it
belongs (or makes a
new copy in a new OSD) this will decrease the % misplaced.

This takes non-zero time, and if there are writes to the PG (or other
queueing PGs) while
the move is running, ceph will know that not only is this PG lacking
one or more replicas,
the data that was recently written is available in less-than-optimal numbers.

I guess a PG has some kind of timestamp saying "last write was at time
xyz", so when it
recovers, a stream job makes a new empty PG, does a copy of all data
up to xyz into it
and after that is done, checks to see if the original PG still is at
version xyz, in which case
it just jumps into service directly, or if the PG is at version xyz+10
then it asks for the last
10 changes, and repeats the check again.

Since there is a queue which is limited to max_recovery or
max_backfills, the longer the repair
takes to complete, the bigger the chance to see degraded as well as
misplaced, but as the
number of misplaced goes down close to zero, the degraded number will
shrink really fast.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW stale buckets

2018-10-23 Thread Janne Johansson
When you run rgw it creates a ton of pools, so one of the other pools
was holding the indexes of what buckets there are, while the actual
data is what got stored in default.rgw.data (or whatever name it had),
so the cleanup was not complete and this is what causes your issues,
I'd say.

How to move on from here depends on how much work/data you have put into
the badly cleaned pools and whether you can redo the last part again after
a good clean restart.

Den tis 23 okt. 2018 kl 00:27 skrev Robert Stanford :
>
>
>  Someone deleted our rgw data pool to clean up.  They recreated it afterward. 
>  This is fine in one respect, we don't need the data.  But listing with 
> radosgw-admin still shows all the buckets.  How can we clean things up and 
> get rgw to understand what actually exists, and what doesn't?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is rgw.none

2018-10-22 Thread Janne Johansson
Den mån 6 aug. 2018 kl 12:58 skrev Tomasz Płaza :

> Hi all,
>
> I have a bucket with a vary big num_objects in rgw.none:
>
> {
> "bucket": "dyna",
>
> "usage": {
> "rgw.none": {
>
> "num_objects": 18446744073709551615
> }
>
> What is rgw.none and is this big number OK?
>
That number is exactly -1 read as an unsigned 64-bit integer, so it might
either be some kind of "we accidentally subtracted 1 from 0" code bug or just
a marker with an "impossible" number telling some other code part this is
special.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Janne Johansson
Yes, if you have uneven sizes I guess you could end up in a situation
where you have
lots of 1TB OSDs and a number of 2TB OSDs, but pool replication forces
the pool to have one
PG replica on a 1TB OSD; then it would be possible to state "this
pool can't write more than X G",
but when it is full, there would be free space left on some of the
2TB OSDs which the pool
can't utilize. Probably the same for uneven OSD hosts if you have those.

Den lör 20 okt. 2018 kl 20:28 skrev Oliver Freyermuth
:
>
> Dear Janne,
>
> yes, of course. But since we only have two pools here, this can not explain 
> the difference.
> The metadata is replicated (3 copies) across ssd drives, and we have < 3 TB 
> of total raw storage for that.
> So looking at the raw space usage, we can ignore that.
>
> All the rest is used for the ceph_data pool. So the ceph_data pool, in terms 
> of raw storage, is about 50 % used.
>
> But in terms of storage shown for that pool, it's almost 63 % %USED.
> So I guess this can purely be from bad balancing, correct?
>
> Cheers,
> Oliver
>
> Am 20.10.18 um 19:49 schrieb Janne Johansson:
> > Do mind that drives may have more than one pool on them, so RAW space
> > is what it says, how much free space there is. Then the avail and
> > %USED on per-pool stats will take replication into account, it can
> > tell how much data you may write into that particular pool, given that
> > pools replication or EC settings.
> >
> > Den lör 20 okt. 2018 kl 19:09 skrev Oliver Freyermuth
> > :
> >>
> >> Dear Cephalopodians,
> >>
> >> as many others, I'm also a bit confused by "ceph df" output
> >> in a pretty straightforward configuration.
> >>
> >> We have a CephFS (12.2.7) running, with 4+2 EC profile.
> >>
> >> I get:
> >> 
> >> # ceph df
> >> GLOBAL:
> >> SIZE AVAIL RAW USED %RAW USED
> >> 824T  410T 414T 50.26
> >> POOLS:
> >> NAMEID USED %USED MAX AVAIL OBJECTS
> >> cephfs_metadata 1  452M  0.05  860G   365774
> >> cephfs_data 2  275T 62.68  164T 75056403
> >> 
> >>
> >> So about 50 % of raw space are used, but already ~63 % of filesystem space 
> >> are used.
> >> Is this purely from imperfect balancing?
> >> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage 
> >> down to 37.12 %.
> >>
> >> We did not yet use the balancer plugin.
> >> We don't have any pre-luminous clients.
> >> In that setup, I take it that "upmap" mode would be recommended - correct?
> >> Any "gotchas" using that on luminous?
> >>
> >> Cheers,
> >> Oliver
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
>
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Janne Johansson
Do mind that drives may have more than one pool on them, so RAW space
is what it says: how much free space there is. The avail and %USED in
the per-pool stats then take replication into account, so they tell you
how much data you may write into that particular pool, given that
pool's replication or EC settings.

On Sat, 20 Oct 2018 at 19:09, Oliver Freyermuth wrote:
>
> Dear Cephalopodians,
>
> as many others, I'm also a bit confused by "ceph df" output
> in a pretty straightforward configuration.
>
> We have a CephFS (12.2.7) running, with 4+2 EC profile.
>
> I get:
> 
> # ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 824T  410T 414T 50.26
> POOLS:
> NAMEID USED %USED MAX AVAIL OBJECTS
> cephfs_metadata 1  452M  0.05  860G   365774
> cephfs_data 2  275T 62.68  164T 75056403
> 
>
> So about 50 % of raw space are used, but already ~63 % of filesystem space 
> are used.
> Is this purely from imperfect balancing?
> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage 
> down to 37.12 %.
>
> We did not yet use the balancer plugin.
> We don't have any pre-luminous clients.
> In that setup, I take it that "upmap" mode would be recommended - correct?
> Any "gotchas" using that on luminous?
>
> Cheers,
> Oliver
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does anyone use interactive CLI mode?

2018-10-11 Thread Janne Johansson
On Wed, 10 Oct 2018 at 16:20, John Spray wrote:
> So the question is: does anyone actually use this feature?  It's not
> particularly expensive to maintain, but it might be nice to have one
> less path through the code if this is entirely unused.

It can go as far as I am concerned too.

Better spend the time on making the verbose flag do better than just
listing the 189 sub-commands you didn't use and then producing more or
less the same output as if you ran without --verbose.

(try 'ceph status' -vs- 'ceph status --verbose' for instance)

In the cases where it actually needs to perform things, talk over the
network and so on, and then fails, it would be helpful if those messages
didn't drown in all that "this ceph status wasn't ceph osd crush
reweight, and not ceph pg dump, and not ..."

I get that someone wanted to debug the parser once, but that could live
behind some other flag than "the admin wants to know the substeps this
command tries so the fault can be found when it doesn't work", which is
what --verbose flags tend to mean in other software.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] list admin issues

2018-10-06 Thread Janne Johansson
On Sat, 6 Oct 2018 at 15:06, Elias Abacioglu wrote:
>
> Hi,
>
> I'm bumping this old thread cause it's getting annoying. My membership get 
> disabled twice a month.
> Between my two Gmail accounts I'm in more than 25 mailing lists and I see 
> this behavior only here. Why is only ceph-users only affected? Maybe 
> Christian was on to something, is this intentional?
> Reality is that there is a lot of ceph-users with Gmail accounts, perhaps it 
> wouldn't be so bad to actually trying to figure this one out?
>
> So can the maintainers of this list please investigate what actually gets 
> bounced? Look at my address if you want.
> I got disabled 20181006, 20180927, 20180916, 20180725, 20180718 most recently.
> Please help!

Same here.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hardware heterogeneous in same pool

2018-10-04 Thread Janne Johansson
On Thu, 4 Oct 2018 at 00:09, Bruno Carvalho wrote:

> Hi Cephers, I would like to know how you are growing the cluster.
> Using dissimilar hardware in the same pool or creating a pool for each
> different hardware group.
> What problems would I have using different hardware (CPU,
> memory, disk) in the same pool?


I don't think CPU and RAM (and other hw-related things like HBA
controller card brand) matter a lot; more is always nicer, but as long
as you don't add worse machines, like Jonathan wrote, you should not
see any degradation.

What you might want to look out for is if the new disks are very uneven
compared to the old
setup, so if you used to have servers with 10x2TB drives and suddenly add
one with 2x10TB,
things might become very unbalanced, since those differences will not be
handled seamlessly
by the crush map.

Apart from that, the only issue for us is "add drives, quickly set
crush reweight to 0.0 before all existing OSD hosts shoot massive
amounts of I/O at them, then script a slower raise of crush weight up
to what they should end up at", to lessen the impact on our 24/7
operations.
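
The "script a slower raise" part doesn't have to be anything fancy,
roughly like this (OSD ids, weight steps and the sleep are made-up
examples, not tested as-is):

  #!/bin/bash
  # Ramp newly added OSDs from crush weight 0.0 up to their final
  # weight in steps, letting recovery settle in between.
  OSDS="120 121 122 123"
  for w in 2.0 4.0 6.0 8.0 10.0 12.7; do
    for id in $OSDS; do
      ceph osd crush reweight osd.$id "$w"
    done
    # wait for the backfill caused by this step to finish
    until ceph health | grep -q HEALTH_OK; do
      sleep 300
    done
  done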

If you have weekends where no one accesses the cluster or night-time low-IO
usage patterns,
just upping the weight at the right hour might suffice.

Lastly, for ssd/nvme setups with good networking, this is almost moot, they
converge so fast
it's almost unfair. A real joy working with expanding flash-only
pools/clusters.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-02 Thread Janne Johansson
On Mon, 1 Oct 2018 at 22:08, John Spray wrote:

>
> > totally new for me, also not what I would expect of a mv on a fs. I know
> > this is normal to expect copying between pools, also from the s3cmd
> > client. But I think more people will not expect this behaviour. Can't
> > the move be implemented as a move?
>
> In almost all filesystems, a rename (like "mv") is a pure metadata
> operation -- it doesn't involve reading all the file's data and
> re-writing it.  It would be very surprising for most users if they
> found that their "mv" command blocked for a very long time while
> waiting for a large file's content to be e.g. read out of one pool and
> written into another.


There are other networked filesystems which do behave like that, where
the OS thinks the whole mount is one single FS, but when you move stuff
around with mv it actually needs to move all the data to other
servers/disks and incur the slowness of a copy/delete operation.
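
As a side note on why the data doesn't follow: in CephFS the data pool
is part of the file layout, set per directory and inherited by new
files, so a rename never rewrites objects. A rough sketch of how that
looks (paths and pool names are made-up examples, and I'm only guessing
this relates to the I/O errors seen here):

  # new files created under this directory will go to the other pool
  setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/ecdir
  # an existing file that was merely mv'ed keeps its old layout/pool
  getfattr -n ceph.file.layout.pool /mnt/cephfs/ecdir/somefile
  # to really migrate the data, the file has to be rewritten, e.g.
  cp -a /mnt/cephfs/olddir/somefile /mnt/cephfs/ecdir/somefile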

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic upgrade failure

2018-09-10 Thread Janne Johansson
On Mon, 10 Sep 2018 at 08:10, Kevin Hrpcek wrote:

> Update for the list archive.
>
> I went ahead and finished the mimic upgrade with the osds in a fluctuating
> state of up and down. The cluster did start to normalize a lot easier after
> everything was on mimic since the random mass OSD heartbeat failures
> stopped and the constant mon election problem went away. I'm still battling
> with the cluster reacting poorly to host reboots or small map changes, but
> I feel like my current pg:osd ratio may be playing a factor in that since
> we are 2x normal pg count while migrating data to new EC pools.
>

We found a setting that helped us when we had constant re-elections,
though ours were a lot more frequent and not related to Mimic in the
least: bumping the time between elections allowed our cluster to at
least start. It voted, decided on a master, the master started
(re)playing transactions, got so busy that the others called for a new
election, the same mon won again, restarted the job, and so it repeated.
Bumping the lease to 30s instead of the default (5?) allowed the mon to
finish looking over the things it had to do and start replying to
heartbeats as expected, and then it went smoother from there.

mon_lease = 30 for future reference.
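
In case someone wants to try the same, roughly how to set it (a sketch;
the value is just what worked for us, and the safe route is ceph.conf
plus a mon restart since not every option picks up runtime changes):

  # ceph.conf on the mon hosts:
  #   [mon]
  #   mon lease = 30

  # or injected into the running mons:
  ceph tell mon.* injectargs '--mon_lease 30'

  # on Mimic and later the centralised config store works as well:
  ceph config set mon mon_lease 30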


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] advice with erasure coding

2018-09-07 Thread Janne Johansson
On Fri, 7 Sep 2018 at 13:44, Maged Mokhtar wrote:

>
> Good day Cephers,
>
> I want to get some guidance on erasure coding, the docs do state the
> different plugins and settings but to really understand them all and their
> use cases is not easy:
>
> -Are the majority of implementations using jerasure and just configuring k
> and m ?
>

Probably, yes


> -For jerasure: when/if would i need to change
> stripe_unit/osd_pool_erasure_code_stripe_unit/packetsize/algorithm ? The
> main usage is rbd with 4M object size, the workload is virtualization with
> average block size of 64k.
>
> Any help based on people's actual experience will be greatly appreciated..
>
>
Running VMs on top of EC pools is possible, but probably not
recommended: all the random reads and writes they usually cause will
make EC less suitable than replicated pools.
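
If you do end up putting RBD on EC anyway, the usual pattern is a small
replicated pool for image metadata plus an EC data pool with overwrites
enabled. A rough sketch (pool names, pg counts and sizes are
placeholders, not recommendations):

  ceph osd erasure-code-profile set ec-k4m2 k=4 m=2 plugin=jerasure
  ceph osd pool create rbd_meta 64 64 replicated
  ceph osd pool create rbd_data 256 256 erasure ec-k4m2
  ceph osd pool set rbd_data allow_ec_overwrites true
  ceph osd pool application enable rbd_meta rbd
  ceph osd pool application enable rbd_data rbd
  # image metadata lives in the replicated pool, data goes to the EC pool
  rbd create --size 100G --pool rbd_meta --data-pool rbd_data vm-disk-1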

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous RGW errors at start

2018-09-03 Thread Janne Johansson
Did you change the default pg_num or pgp_num, so that the pools that
did get created pushed the cluster past mon_max_pg_per_osd?
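
A quick way to check whether that is the case (a rough sketch; the
raised value is just an example, and the config get needs to run on a
mon host with access to the admin socket):

  # the PGS column shows how many PGs each OSD already carries
  ceph osd df
  # current limit, read from a mon's admin socket
  ceph daemon mon.$(hostname -s) config get mon_max_pg_per_osd
  # raise it temporarily if the RGW pool creation is just hitting the cap
  ceph tell mon.* injectargs '--mon_max_pg_per_osd 400'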


On Fri, 31 Aug 2018 at 17:20, Robert Stanford wrote:

>
>  I installed a new Luminous cluster.  Everything is fine so far.  Then I
> tried to start RGW and got this error:
>
> 2018-08-31 15:15:41.998048 7fc350271e80  0 rgw_init_ioctx ERROR:
> librados::Rados::pool_create returned (34) Numerical result out of range
> (this can be due to a pool or placement group misconfiguration, e.g. pg_num
> < pgp_num or mon_max_pg_per_osd exceeded)
> 2018-08-31 15:15:42.005732 7fc350271e80 -1 Couldn't init storage provider
> (RADOS)
>
>  I notice that the only pools that exist are the data and index RGW pools
> (no user or log pools like on Jewel).  What is causing this?
>
>  Thank you
>  R
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network cluster / addr

2018-08-21 Thread Janne Johansson
On Tue, 21 Aug 2018 at 09:31, Nino Bosteels wrote:

>
> * Does ceph interpret multiple values for this in the ceph.conf (I
> wouldn’t say so out of my tests)?
>
> * Shouldn’t public network be your internet facing range and cluster
> network the private range?
>

"Public" doesn't necessarily mean "reachable from internet", it means
"where ceph consumers and clients can talk", and the private network is
"where only OSDs and ceph infrastructure can talk to eachother".

Ceph clients can still be non-reachable from the internet, it's not the
same meaning that firewall vendors place on "private" and "public".
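
For completeness, this is roughly what the two options look like in
ceph.conf (the subnets are made-up examples); the cluster network is
optional and only carries OSD-to-OSD replication, recovery and
heartbeat traffic, everything else uses the public network:

  [global]
  public network = 192.0.2.0/24
  cluster network = 198.51.100.0/24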

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] limited disk slots - should I ran OS on SD card ?

2018-08-15 Thread Janne Johansson
On Wed, 15 Aug 2018 at 10:04, Wido den Hollander wrote:

> > This is the case for filesystem journals (xfs, ext4, almost all modern
> > filesystems). Been there, done that, had two storage systems failing due
> > to SD wear
> >
>
> I've been running OS on the SuperMicro 64 and 128GB SATA-DOMs for a
> while now and work fine.
>
> I disable Ceph's OSD logging though for performance reasons, but it also
> saves writes.
>
> They work just fine.
>

We had OS on DOMs and ETOOMANY of them failed for us to be comfortable with
them, so we moved away from that.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Least impact when adding PG's

2018-08-14 Thread Janne Johansson
On Mon, 13 Aug 2018 at 23:30, wrote:

>
>
> On 7 August 2018 at 18:08:05 CEST, John Petrini <jpetr...@coredial.com> wrote:
> >Hi All,
>
> Hi John,
>
> >
> >Any advice?
> >
>
> I am not sure, but what I would do is increase the PGs step by step and
> always to a value that is a power of two, i.e. 256.
>

The end result should be a power of two for best spread, but you can still
do it in intervals of say 16 PGs at a time or so.
When you go from 4 to 8, 8 to 16 and so on, the change isn't so big, but
when you move from 1024 to 2048 you would be
asking 1024 PGs to split "at the same time", which might very well be
noticeable to client I/O, so even if you mean to end up
at 2048, you should still do it in small steps to allow client I/O to run
as smoothly as possible in between those operations.



> Also have a look at pg_num/pgp_num. One of these should be increased
> first - not sure which one, but the docs and mailing list history should
> be helpful.
>
>
>
You can only raise pg_num first (the mon refuses a pgp_num higher than
pg_num), so there is little chance of doing them in the wrong order.
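
A sketch of the stepping approach (pool name, target and step size are
examples; note the pg_num-before-pgp_num order mentioned above):

  #!/bin/bash
  POOL=rbd
  TARGET=2048
  STEP=16
  current=$(ceph osd pool get "$POOL" pg_num | awk '{print $2}')
  while (( current < TARGET )); do
    current=$(( current + STEP > TARGET ? TARGET : current + STEP ))
    ceph osd pool set "$POOL" pg_num  "$current"
    ceph osd pool set "$POOL" pgp_num "$current"
    # let the splits and the resulting data movement finish
    until ceph health | grep -q HEALTH_OK; do
      sleep 60
    done
  done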

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

