[ceph-users] Rebalancing an Erasure coded pool seems to move far more data than necessary

2018-05-25 Thread Jesus Cea
I have an Erasure Coded 8+2 pool with 8 PGs.

Each PG is spread on 10 OSDs using Reed-Solomon (the Erasure Code).

When I rebalance the cluster I see two PGs moving:
"active+remapped+backfilling".

A "pg dump" shows this:

"""
root@jcea:/srv# ceph --id jcea pg dump|grep backf
dumped all
75.5  25536  0  0  18690  0
107105746944  3816  3816  active+remapped+backfilling  2018-05-25
23:53:06.341894  117576'47616  117576:61186
[1,11,0,18,19,21,4,5,15,12]  1   [3,11,0,18,19,21,4,5,15,12]
 3  0'0  2018-05-25 14:01:30.889768  0'0
2018-05-25 14:01:30.889768  0
73.7  29849  0  0  21587  0
125195780096  1537  1537  active+remapped+backfilling  2018-05-25
23:49:47.085736  117466'60337  117576:77332
[18,21,4,12,6,17,2,15,10,23] 18   [18,3,4,12,6,17,2,15,10,23]
 18  117466'60337  2018-05-25 15:53:40.005828  0'0
2018-05-24 10:47:07.592897  0
"""

In my application each file is 4 MB in size and the erasure code (8+2) expands
it by 25%, so each OSD stores 512 KB of it, including the erasure overhead.

We can see that PG 75.5 is moving from the [3,11,0,18,19,21,4,5,15,12] OSDs
to the [1,11,0,18,19,21,4,5,15,12] OSDs. Comparing the tuples, we see that the
content on OSD 3 is moving to OSD 1. That should be 512 KB per object
(the slice assigned to this OSD).

My "ceph -s" shows:

"""
io:
   recovery: 41308 kB/s, 10 objects/s
"""

So each object moved requires ~4 MB of traffic instead of 512 KB. According to
this, we are moving complete objects instead of only the slices
belonging to the evacuated OSD.
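
A quick back-of-the-envelope check of those numbers (my own arithmetic, not
anything Ceph reports directly):

"""
# Each 4 MB object is cut into 8 data chunks plus 2 coding chunks of 512 KB,
# so evacuating one OSD should only move the single 512 KB chunk it holds
# per object -- yet the observed recovery rate is a whole object per object.
object_kb = 4 * 1024                 # 4 MB objects
k, m = 8, 2                          # erasure profile 8+2
chunk_kb = object_kb / k             # 512 KB stored on each of the k+m OSDs

recovery_kb_s = 41308                # from "ceph -s" above
objects_s = 10

print(chunk_kb)                      # 512.0  -> expected traffic per object
print(recovery_kb_s / objects_s)     # 4130.8 -> observed, i.e. the whole object
"""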

Am I interpreting this correctly?

Is this something known? Will it be solved in the future?

This is especially costly because an Erasure Coded pool has far fewer PGs
than a regular replicated pool, so each PG is huge (in my case, 150 GB,
including EC overhead). Rebuilding the whole file from the EC slices just to
move a single slice seems way overkill.

Am I missing anything?

Thanks for your time and expertise!

-- 
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea_/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG explosion with erasure codes, power of two and "x pools have many more objects per pg than average"

2018-05-25 Thread Jesus Cea
On 25/05/18 20:26, Paul Emmerich wrote:
> Answers inline.
> 
>> 2018-05-25 17:57 GMT+02:00 Jesus Cea : recommendation. Would be nice to know too if
>> being "close" to a power of two is better than be far away and if it
>> is better to be close but below or close but a little bit more. If
>> ideal value is 128 but I only can be 120 or 130, what should I
>> choose?. 120 or 130?. Why?
> 
> Go for the next larger power of two under the assumption that your 
> cluster will grow.

I now know better. Check my other emails.

Not being a power of two always creates imbalance. You cannot overcome
that.

If you are close to a power of two but under it (120), most of your PGs
will be of size "X" and a few of them will be of size "2*X".

If you are close to a power of two but over it (130), most of your PGs
will be of size "X" and a few of them will be of size "X/2".

>> 3. Is there any negative effect for CRUSH of using erasure code 8+2 
>> instead of 6+2 or 14+2 (power of two)?. I have 25 OSDs, so requiring
>> 16 for a single operation seems a bad idea, even more when my OSD 
>> capacities are very spread (from 150 GB to 1TB) and filling a small
>> OSD would block writes in the entire pool.
> 
> EC rules don't have to be powers of two. And yes, too many chunks
> for EC pools is a bad idea. It's rarely advisable to have a total of
> k + m larger than 8 or so.

I verified it. My objects are 4 MB fixed size and immutable (no rewrites),
so each OSD provides 512 KB. Seems nice. I could even use wider EC
codes in my personal environment.

If your objects are small, requests per OSD will be tiny and performance
will suffer. You would be better off using narrower EC codes.

> Also, you should have at least k + m + 1 servers, otherwise full
> server failures cannot be handled properly.

Good advice, of course. "crush-failure-domain=host" (or bigger failure
domain) is also important, if you have enough resources.

> A large spread between the OSD capacities within one crush rule is
> also usually a bad idea, 150 GB to 1 TB is typically too big.

I know. Legacy sins. I spend my days reweighting.

> Well, you reduced the number of PGs by a factor of 64, so you'll of
> course see a large skew here. The option mon_pg_warn_max_object_skew 
> controls when this warning is shown, default is 10.

So you are advising me to increase that value to silence the warning?

What I am thinking is that mixing regular replicated pools with EC pools
in the same cluster will always generate this "warning". It is almost a
natural effect.

>> What is the actual memory hungry factor in a OSD, PGs or objects
>> per PG?.
> 
> PGs typically impose a bigger overhead. But PGs with a large number
> of objects can become annoying...

I find this difficult to believe, but you have far more experience with
Ceph than me. Do you have any reference I can learn the details from?
Besides the source code :-).

Using EC will inevitably create PGs with a large number of objects.

My pools have around 240,000 4 MB immutable objects (~1 TB). A replicated
pool would be configured with 128 PGs, each PG holding about 1,875 objects (7.5 GB).

The same pool using EC 8+2 would use 13 PGs (internally it would use 130
"pseudo PGs", close to the original 128). Spare me the power of two rule
for now. 240,000 objects in 13 PGs is 18,461 objects per PG and 92 GB
(74 GB * 10/8) per PG including EC overhead (internally each PG is stored
on 10 OSDs, each providing 9.2 GB). I am actually using 8 PGs, so in my
configuration it is more in the 30,000 objects per PG range, 150 GB per PG,
15 GB per OSD per PG.

This compares badly with the original 1,875 objects per PG, although each
OSD used to take care of 7.5 GB per PG and that has only grown to 15 GB.
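
A sketch of the arithmetic behind those figures, using decimal GB as above
(my own back-of-the-envelope, nothing Ceph computes):

"""
objects = 240_000
object_mb = 4
k, m = 8, 2                                     # erasure profile 8+2

def per_pg(pg_num, ec=True):
    objs = objects // pg_num
    data_gb = objects * object_mb / 1000 / pg_num          # user data per PG
    stored_gb = data_gb * (k + m) / k if ec else data_gb   # incl. EC overhead
    per_osd_gb = stored_gb / (k + m) if ec else stored_gb  # per OSD holding the PG
    return objs, round(data_gb, 1), round(stored_gb, 1), round(per_osd_gb, 1)

print(per_pg(128, ec=False))   # (1875, 7.5, 7.5, 7.5)       -> replicated comparison
print(per_pg(13))              # (18461, 73.8, 92.3, 9.2)    -> EC pool with 13 PGs
print(per_pg(8))               # (30000, 120.0, 150.0, 15.0) -> my actual 8 PG pool
"""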

Is 30,000 objects per PG an issue? What price am I paying here?

Can I do something to improve the situation? Increasing pg_num to 16
would be a bit better, but not much, and going to 32 would push the PG count
per OSD well over the <500 PGs per OSD advice, considering that I have
quite a few of those EC pools.

Advice?

Thanks!

-- 
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea_/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
On 25/05/18 20:21, David Turner wrote:
> If you start your pool with 12 PGs, 4 of them will have double the size
> of the other 8.  It is 100% based on a power of 2 and has absolutely
> nothing to do with the number you start with vs the number you increase
> to.  If your PG count is not a power of 2 then you will have 2 different
> sizes of PGs with some being double the size of the others.

Thanks for correcting my wild speculation in a friendly way and with facts.
I have spent quite a few hours digging into this and I now understand the
issue far better. I wrote up some explanations in another email.

> Once upon a time I started a 2 rack cluster with 12,000 PGs.  All data
> was in 1 pool and I attempted to balance the cluster by making sure that
> every OSD in the cluster was within 2 PGs of each other.

I have spent some time thinking about how important it is for PGs to be of
equal size and realized it depends a lot on the workload, on whether several
pools share the cluster, etc. In my particular situation (most data
under CephFS using a quite wide 8+2 erasure code, low activity, immutable
data, write once / read many), it seems to be a non-issue.

I need to think more about it.

Having a wild spread of OSD capacities (120 GB - 1 TB) seems to be a far
worse idea. I spend my days reweighting and it is quite difficult to
fully utilize the capacity of the cluster.

Thanks.

-- 
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea_/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
OK, I am writing this so you don't waste your time correcting me. I beg
your pardon.


On 25/05/18 18:28, Jesus Cea wrote:
> So, if I understand correctly, ceph tries to do the minimum splits. If
> you increase PG from 8 to 12, it will split 4 PGs and leave the other 4
> PGs alone, creating an imbalance.
> 
> According to that, would be far more advisable to create the pool with
> 12 PGs from the very beginning.
> 
> If I understand correctly, then, the advice of "power of two" is an
> oversimplification. The real advice would be: you better double your PG
> when you increase the PG count. That is: 12->24->48->96... Not real need
> for power of two.

Instead of trying to be smart, I just spent a few hours building a Ceph
experiment myself, testing different scenarios, PG resizing and such.

The "power of two" rule is law.

If you don't follow it, some PGs will contain double the number of objects
of others.

The rule is something like this:

Let's say your PG_num satisfies:

2^(n-1) < PG_num <= 2^n

The name of the object you create is hashed and "n" bits of the hash are
considered. Let's call that number x.

If x < PG_num, your object will be stored in PG number x.

If x >= PG_num, then drop a bit (i.e. use only "n-1" bits) and the result
is the PG that will store your object.

This algorithm means that if your PG_num is not a power of two, some of
your PGs will be double the size.

For instance, suppose PG_num = 13: (first number is the "x" of your
object, the second number is the PG used to store it)

 0 -> 0    1 -> 1    2 -> 2    3 -> 3
 4 -> 4    5 -> 5    6 -> 6    7 -> 7
 8 -> 8    9 -> 9   10 -> 10  11 -> 11
12 -> 12

Good so far. But now:

13 -> 5  14 -> 6   15 -> 7

So PGs 0-4 and 8-12 will store "COUNT" objects, but PGs 5, 6 and 7
will store "2*COUNT" objects. PGs 5, 6 and 7 have twice the probability
of storing your object.

Interestingly, the maximum object count difference between the biggest
PG and the smallest PG will be a factor of TWO, statistically.
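
A few lines of code reproducing the table above under the rule just
described (a sketch of my simplified model, not Ceph's actual implementation):

"""
def pg_for(x, pg_num):
    # keep x if it is a valid PG id, otherwise drop the top bit
    n = (pg_num - 1).bit_length()        # 2^(n-1) < pg_num <= 2^n
    return x if x < pg_num else x - 2 ** (n - 1)

pg_num = 13
n = (pg_num - 1).bit_length()
for x in range(2 ** n):                  # every possible n-bit hash value
    print(f"{x:2d} -> {pg_for(x, pg_num):2d}")

# PGs 5, 6 and 7 show up twice in the output: they hold two "lottery
# tickets" each, so statistically they receive twice as many objects.
"""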

How important it is that PG sizes are equal is something I am not sure
I understand yet.

> Also, a bad split is not important if the pool creates/destroys objects
> constantly, because new objects will be spread evenly. This could be an
> approach to rebalance a badly expanded pool: just copy & rename your
> objects (I am thinking about cephfs).
> 
> What am I saying makes sense?.

I answer to myself.

No, fool, it doesn't make sense. Ceph doesn't work that way. The PG
allocation is far simpler and more scalable, but also more dumb. The
imbalance only depends on the number of PGs (which should be a power of
two), not on the process used to get there.

The described idea doesn't work because if the PG number is not a power
of two, some PGs simply get twice the lottery tickets and will end up with
double the number of objects. Copying, moving or replacing objects will not
change that.

> How Ceph decide what PG to split?. Per PG object count or by PG byte size?.

Following the algorithm described at the top of this post, Ceph will
simply split PGs in increasing order. If my PG_num is 13 and I
increase it to 14, Ceph will split PG 5. Fully deterministic and
unrelated to the size of that PG or how many objects it stores.
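
A tiny sketch of that bookkeeping (again my simplified model, not Ceph
source): the split source of a newly created PG is just its id with the top
bit cleared.

"""
def split_source(new_pg):
    # which existing PG the newly created PG takes its objects from
    return new_pg - 2 ** (new_pg.bit_length() - 1)

# Growing pg_num from 13 to 14 creates PG 13, which is carved out of PG 5;
# the next increase (to 15) creates PG 14 out of PG 6, and so on.
print(split_source(13))   # -> 5
print(split_source(14))   # -> 6
"""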

Since the association of an object to a PG is based on the hash of the
object name, we would expect every PG to have (statistically) the same
number of objects. Object size is not used here, so a huge object will
create a huge PG. This is a well-known problem in Ceph (a few large
objects will imbalance your cluster).

> Thank for your post. It deserves to be a blog!.

The original post was great. My reply was lame. I was just too smart for
my own good :).

Sorry for wasting your time.

-- 
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea_/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependencies

2018-05-25 Thread David Turner
Admin nodes have zero impact on a Ceph cluster other than the commands you
run on them.  I personally like creating a single admin node for all of my
clusters and creating tooling to use the proper config file and keyrings from
there.  Other than any scripts you keep on your admin node there is nothing
on there that you need for the cluster to run or to re-create an admin
node.  You can get your config from any node and keyrings from the ceph cli
tool.

You cannot move disks with data between clusters.  You would need to wipe
them and start them as brand new OSDs in the new cluster.

On Fri, May 25, 2018 at 12:00 PM Marc-Antoine Desrochers <
marc-antoine.desroch...@sogetel.com> wrote:

> Hi,
>
>
>
> I want to know if there is any dependencies between the ceph admin node
> and the other nodes ?
>
>
>
> Can I delete my ceph admin node and create a new one and link it to my
> OSD’s nodes ?
>
>
>
> Or can I take all my existing OSD’s in a node from Cluster “A” and
> transfert it to cluster “B” ?
>
>
>
>
>
> Cluster A
>
>
>
> AdminNode ---node1
>
> node2
>
>node3(take that node and bring it
> to cluster « B »)
>
>
>
> Cluster B
>
>
>
> AdminNode node1
>
>  -Node 2(clusterA-node3)
>
>
>
>
>
> Cheers, Marc-Antoine
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG explosion with erasure codes, power of two and "x pools have many more objects per pg than average"

2018-05-25 Thread Paul Emmerich
Answers inline.

2018-05-25 17:57 GMT+02:00 Jesus Cea :

> Hi there.
>
> I have configured a POOL with a 8+2 erasure code. My target by space
> usage and OSD configuration, would be 128 PG, but since each configure
> PG will be using 10 actual "PGs", I have created the pool with only 8 PG
> (80 real PG). Since I can increase PGs but not decreasing it, this
> decision seems sensible.
>
> Some questions:
>
> 1. Documentation insists everywhere that the PG could should be a power
> of two. Would be nice to know the consequences of not following this
> recommendation. Would be nice to know too if being "close" to a power of
> two is better than be far away and if it is better to be close but below
> or close but a little bit more. If ideal value is 128 but I only can be
> 120 or 130, what should I choose?. 120 or 130?. Why?
>

Go for the next larger power of two under the assumption that your cluster
will grow.


>
> 2. As I understand, the PG count that should be "power of two" is "8",
> in this case (real 80 PG underneath). Good. In this case, the next step
> would be 16 (160 real PG). I would rather prefer to increase it to 12 or
> 13 (120/130 real PGs). Would it be reasonable?. What are the
> consequences of increasing PG to 12 or 13 instead of choosing 16 (the
> next power of two).
>

Data will be poorly balanced between PGs if it's not a power of two.


>
> 3. Is there any negative effect for CRUSH of using erasure code 8+2
> instead of 6+2 or 14+2 (power of two)?. I have 25 OSDs, so requiring 16
> for a single operation seems a bad idea, even more when my OSD
> capacities are very spread (from 150 GB to 1TB) and filling a small OSD
> would block writes in the entire pool.
>

EC rules don't have to be powers of two. And yes, too many chunks for
EC pools is a bad idea. It's rarely advisable to have a total of k + m
larger
than 8 or so.

Also, you should have at least k + m + 1 servers, otherwise full server
failures cannot be handled properly.

A large spread between the OSD capacities within one crush rule is also
usually a bad idea, 150 GB to 1 TB is typically too big.


>
> 4. Since I have created a erasure coded pool with 8 PG, I am getting
> warnings of "x pools have many more objects per pg than average". The
> data I am copying is coming from a legacy pool with PG=512. New pool PG
> is 8. That is creating ~30.000 objects per PG, far above average (616
> objects). What can I do?. Moving to 16 or 32 PGs is not going to improve
> the situation, but will consume PGs (32*10). Advice?.
>

Well, you reduced the number of PGs by a factor of 64, so you'll of course
see a large skew here. The option mon_pg_warn_max_object_skew
controls when this warning is shown, default is 10.


>
> 5. I understand the advice of having <300 PGs per OSD because memory
> usage, but I am wondering about the impact of the number of objects in
> each PG. I wonder if memory and resource wise, having 100 PG with 10.000
> objects each is far more demanding than 1000 PGs with 50 objects each.
> Since I have PGs with 300 objects and PGs with 30.000 objects, I wonder
> about the memory impact of each. What is the actual memory hungry factor
> in a OSD, PGs or objects per PG?.
>

PGs typically impose a bigger overhead. But PGs with a large number of
objects
can become annoying...


Paul


>
> Thanks for your time and knowledge :).
>
> --
> Jesús Cea Avión _/_/  _/_/_/_/_/_/
> j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
> Twitter: @jcea_/_/_/_/  _/_/_/_/_/
> jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
> "Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
> "My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
> "El amor es poner tu felicidad en la felicidad de otro" - Leibniz
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread David Turner
If you start your pool with 12 PGs, 4 of them will have double the size of
the other 8.  It is 100% based on a power of 2 and has absolutely nothing
to do with the number you start with vs the number you increase to.  If
your PG count is not a power of 2 then you will have 2 different sizes of
PGs with some being double the size of the others.

When increasing your PG count, ceph chooses which PGs to split in half
based on the pg name, not with how big the PG is or how many objects it
has.  The PG names are based on how many PGs you have in your pool and are
perfectly evenly split when and only if your PG count is a power of 2.

Once upon a time I started a 2 rack cluster with 12,000 PGs.  All data was
in 1 pool and I attempted to balance the cluster by making sure that every
OSD in the cluster was within 2 PGs of each other.  That is to say that if
the average PGs per OSD was 100, then no OSD had more than 101 PGs for that
pool and no OSD had less than 99 PGs.  My tooling made this possible and is
how we balanced our other clusters.  The resulting balance in this cluster
was AWFUL!!!  Digging in I found that some of the PGs were twice as big as
the other PGs.  It was actually very mathematical in how many.  Of the
12,000 PGs 4,384 PGs were twice as big as the remaining 7,616.  We
increased the PG count in the pool to 16,384 and all of the PGs were the
same in size when the backfilling finished.
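
That 4,384 / 7,616 split is exactly what the simple "fold back onto the
lower half" model predicts for 12,000 PGs (a quick arithmetic check of my
own, not output from Ceph):

"""
pg_num = 12_000
n = (pg_num - 1).bit_length()       # 14, since 2^13 < 12000 <= 2^14 = 16384
doubled = 2 ** n - pg_num           # hash values folded back onto existing PGs
print(doubled, pg_num - doubled)    # 4384 7616 -> double-sized vs normal PGs
"""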

On Fri, May 25, 2018 at 12:48 PM Jesus Cea  wrote:

> On 17/05/18 20:36, David Turner wrote:
> > By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of
> > your PGs will be the same size and easier to balance and manage.  What
> > happens when you have a non base 2 number is something like this.  Say
> > you have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to
> > 6, then you will have 2 PGs that are 2GB and 4 PGs that are 1GB as
> > you've split 2 of the PGs into 4 to get to the 6 total.  If you increase
> > the pg(p)_num to 8, then all 8 PGs will be 1GB.  Depending on how you
> > manage your cluster, that doesn't really matter, but for some methods of
> > balancing your cluster, that will greatly imbalance things.
>
> So, if I understand correctly, ceph tries to do the minimum splits. If
> you increase PG from 8 to 12, it will split 4 PGs and leave the other 4
> PGs alone, creating an imbalance.
>
> According to that, would be far more advisable to create the pool with
> 12 PGs from the very beginning.
>
> If I understand correctly, then, the advice of "power of two" is an
> oversimplification. The real advice would be: you better double your PG
> when you increase the PG count. That is: 12->24->48->96... Not real need
> for power of two.
>
> Also, a bad split is not important if the pool creates/destroys objects
> constantly, because new objects will be spread evenly. This could be an
> approach to rebalance a badly expanded pool: just copy & rename your
> objects (I am thinking about cephfs).
>
> What am I saying makes sense?.
>
> How Ceph decide what PG to split?. Per PG object count or by PG byte size?.
>
> Thank for your post. It deserves to be a blog!.
>
> --
> Jesús Cea Avión _/_/  _/_/_/_/_/_/
> j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
> Twitter: @jcea_/_/_/_/  _/_/_/_/_/
> jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
> "Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
> "My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
> "El amor es poner tu felicidad en la felicidad de otro" - Leibniz
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
On 17/05/18 20:36, David Turner wrote:
> By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of
> your PGs will be the same size and easier to balance and manage.  What
> happens when you have a non base 2 number is something like this.  Say
> you have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to
> 6, then you will have 2 PGs that are 2GB and 4 PGs that are 1GB as
> you've split 2 of the PGs into 4 to get to the 6 total.  If you increase
> the pg(p)_num to 8, then all 8 PGs will be 1GB.  Depending on how you
> manage your cluster, that doesn't really matter, but for some methods of
> balancing your cluster, that will greatly imbalance things.

So, if I understand correctly, Ceph tries to do the minimum number of splits.
If you increase the PG count from 8 to 12, it will split 4 PGs and leave the
other 4 PGs alone, creating an imbalance.

According to that, it would be far more advisable to create the pool with
12 PGs from the very beginning.

If I understand correctly, then, the advice of "power of two" is an
oversimplification. The real advice would be: you had better double your PG
count when you increase it. That is: 12->24->48->96... No real need
for a power of two.

Also, a bad split is not important if the pool creates/destroys objects
constantly, because new objects will be spread evenly. This could be an
approach to rebalance a badly expanded pool: just copy & rename your
objects (I am thinking about cephfs).

Does what I am saying make sense?

How does Ceph decide which PG to split? By per-PG object count or by PG byte size?

Thanks for your post. It deserves to be a blog!

-- 
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea_/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Dependencies

2018-05-25 Thread Marc-Antoine Desrochers
Hi,

 

I want to know if there are any dependencies between the ceph admin node and
the other nodes?

 

Can I delete my ceph admin node, create a new one, and link it to my OSD
nodes?

 

Or can I take a node with all its existing OSDs from cluster "A" and transfer
it to cluster "B"?

 

 

Cluster A

AdminNode --- node1
          --- node2
          --- node3   (take that node and bring it to cluster "B")

Cluster B

AdminNode --- node1
          --- node2   (clusterA-node3)

 

 

Cheers, Marc-Antoine

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG explosion with erasure codes, power of two and "x pools have many more objects per pg than average"

2018-05-25 Thread Jesus Cea
Hi there.

I have configured a pool with an 8+2 erasure code. My target, by space
usage and OSD configuration, would be 128 PGs, but since each configured
PG will be using 10 actual "PGs", I have created the pool with only 8 PGs
(80 real PGs). Since I can increase the PG count but not decrease it, this
decision seems sensible.

Some questions:

1. Documentation insists everywhere that the PG count should be a power
of two. It would be nice to know the consequences of not following this
recommendation, whether being "close" to a power of two is better than being
far away, and whether it is better to be close but below or close but a
little bit above. If the ideal value is 128 but I can only have 120 or 130,
what should I choose? 120 or 130? Why?

2. As I understand it, the PG count that should be a "power of two" is "8"
in this case (80 real PGs underneath). Good. In this case, the next step
would be 16 (160 real PGs). I would rather increase it to 12 or
13 (120/130 real PGs). Would that be reasonable? What are the
consequences of increasing the PG count to 12 or 13 instead of choosing 16
(the next power of two)?

3. Is there any negative effect on CRUSH of using erasure code 8+2
instead of 6+2 or 14+2 (a power-of-two total)? I have 25 OSDs, so requiring
16 for a single operation seems a bad idea, even more so when my OSD
capacities are very spread out (from 150 GB to 1 TB) and filling a small OSD
would block writes in the entire pool.

4. Since I have created an erasure coded pool with 8 PGs, I am getting
warnings of "x pools have many more objects per pg than average". The
data I am copying comes from a legacy pool with PG=512; the new pool has
8 PGs. That creates ~30,000 objects per PG, far above the average (616
objects). What can I do? Moving to 16 or 32 PGs is not going to improve
the situation much, but it will consume PGs (32*10). Advice?

5. I understand the advice of having <300 PGs per OSD because of memory
usage, but I am wondering about the impact of the number of objects in
each PG. Memory- and resource-wise, is having 100 PGs with 10,000
objects each far more demanding than 1,000 PGs with 50 objects each?
Since I have PGs with 300 objects and PGs with 30,000 objects, I wonder
about the memory impact of each. What is the actual memory-hungry factor
in an OSD: PGs, or objects per PG?

Thanks for your time and knowledge :).

-- 
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea_/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-25 Thread Brady Deetz
I'm not sure this is a cache issue. To me, this feels like a memory leak.
I'm now at 129GB (haven't had a window to upgrade yet) on a configured 80GB
cache.

[root@mds0 ceph-admin]# ceph daemon mds.mds0 cache status
{
"pool": {
"items": 166753076,
"bytes": 71766944952
}
}


I ran a 10-minute heap profile.

[root@mds0 ceph-admin]# ceph tell mds.mds0 heap start_profiler
2018-05-25 08:15:04.428519 7f3f657fa700  0 client.127046191 ms_handle_reset
on 10.124.103.50:6800/2248223690
2018-05-25 08:15:04.447528 7f3f667fc700  0 client.127055541 ms_handle_reset
on 10.124.103.50:6800/2248223690
mds.mds0 started profiler


[root@mds0 ceph-admin]# ceph tell mds.mds0 heap dump
2018-05-25 08:25:14.265450 7f1774ff9700  0 client.127057266 ms_handle_reset
on 10.124.103.50:6800/2248223690
2018-05-25 08:25:14.356292 7f1775ffb700  0 client.127057269 ms_handle_reset
on 10.124.103.50:6800/2248223690
mds.mds0 dumping heap profile now.

MALLOC:   123658130320 (117929.6 MiB) Bytes in use by application
MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
MALLOC: +   6969713096 ( 6646.8 MiB) Bytes in central cache freelist
MALLOC: + 26700832 (   25.5 MiB) Bytes in transfer cache freelist
MALLOC: + 54460040 (   51.9 MiB) Bytes in thread cache freelists
MALLOC: +531034272 (  506.4 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: = 131240038560 (125160.3 MiB) Actual memory used (physical + swap)
MALLOC: +   7426875392 ( 7082.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: = 138666913952 (132243.1 MiB) Virtual address space used
MALLOC:
MALLOC:7434952  Spans in use
MALLOC: 20  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via
madvise()).
Bytes released to the OS take up virtual address space but no physical
memory.
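
For scale, converting the figures above to GiB (my own arithmetic, just to
line the numbers up):

"""
print(71_766_944_952 / 2**30)        # ~66.8 GiB accounted to the MDS cache pool
print(123_658_130_320 / 2**30)       # ~115.2 GiB "in use by application" per tcmalloc
print(131_240_038_560 / 2**30)       # ~122.2 GiB actual memory used (physical + swap)
"""

The ~48 GiB gap between the first two figures is what makes this look like
more than just cache to me.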

[root@mds0 ceph-admin]# ceph tell mds.mds0 heap stop_profiler
2018-05-25 08:25:26.394877 7fbe48ff9700  0 client.127047898 ms_handle_reset
on 10.124.103.50:6800/2248223690
2018-05-25 08:25:26.736909 7fbe49ffb700  0 client.127035608 ms_handle_reset
on 10.124.103.50:6800/2248223690
mds.mds0 stopped profiler

[root@mds0 ceph-admin]# pprof --pdf /bin/ceph-mds
/var/log/ceph/mds.mds0.profile.000* > profile.pdf



On Thu, May 10, 2018 at 2:11 PM, Patrick Donnelly 
wrote:

> On Thu, May 10, 2018 at 12:00 PM, Brady Deetz  wrote:
> > [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
> > ceph1841  3.5 94.3 133703308 124425384 ? Ssl  Apr04 1808:32
> > /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup
> ceph
> >
> >
> > [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
> > {
> > "pool": {
> > "items": 173261056,
> > "bytes": 76504108600
> > }
> > }
> >
> > So, 80GB is my configured limit for the cache and it appears the mds is
> > following that limit. But, the mds process is using over 100GB RAM in my
> > 128GB host. I thought I was playing it safe by configuring at 80. What
> other
> > things consume a lot of RAM for this process?
> >
> > Let me know if I need to create a new thread.
>
> The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade
> ASAP.
>
> [1] https://tracker.ceph.com/issues/22972
>
> --
> Patrick Donnelly
>


profile.pdf
Description: Adobe PDF document
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph tech talk on deploy ceph with rook on kubernetes

2018-05-25 Thread Brett Niver
Is the recording available?  I wasn't able to attend.
Thanks,
Brett


On Thu, May 24, 2018 at 10:04 AM, Sage Weil  wrote:
> Starting now!
>
> https://redhat.bluejeans.com/967991495/
>
> It'll be recorded and go up on youtube shortly as well.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Patrick Donnelly
On Fri, May 25, 2018 at 6:46 AM, Oliver Freyermuth
 wrote:
>> It might be possible to allow rename(2) to proceed in cases where
>> nlink==1, but the behavior will probably seem inconsistent (some files get
>> EXDEV, some don't).
>
> I believe even this would be extremely helpful, performance-wise. At least in 
> our case, hardlinks are seldomly used,
> it's more about data movement between user, group and scratch areas.
> For files with nlinks>1, it's more or less expected a copy has to be 
> performed when crossing quota boundaries (I think).

It may be possible to allow the rename in the MDS and check quotas
there. I've filed a tracker ticket here:
http://tracker.ceph.com/issues/24305


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
On 25.05.2018 at 15:39, Sage Weil wrote:
> On Fri, 25 May 2018, Oliver Freyermuth wrote:
>> Dear Ric,
>>
>> I played around a bit - the common denominator seems to be: Moving it 
>> within a directory subtree below a directory for which max_bytes / 
>> max_files quota settings are set, things work fine. Moving it to another 
>> directory tree without quota settings / with different quota settings, 
>> rename() returns EXDEV.
> 
> Aha, yes, this is the issue.
> 
> When you set a quota you force subvolume-like behavior.  This is done 
> because hard links across this quota boundary won't correctly account for 
> utilization (only one of the file links will accrue usage).  The 
> expectation is that quotas are usually set in locations that aren't 
> frequently renamed across.

Understood, that explains it. That's indeed also true for our application in 
most cases - 
but sometimes, we have the case that users want to migrate their data to group 
storage, or vice-versa. 

> 
> It might be possible to allow rename(2) to proceed in cases where 
> nlink==1, but the behavior will probably seem inconsistent (some files get 
> EXDEV, some don't).

I believe even this would be extremely helpful, performance-wise. At least in
our case, hardlinks are seldom used;
it's more about data movement between user, group and scratch areas.
For files with nlink>1, it's more or less expected that a copy has to be
performed when crossing quota boundaries (I think).

Cheers,
Oliver

> 
> sage
> 
> 
> 
>>
>> Cheers, Oliver
>>
>>
>> Am 25.05.2018 um 15:18 schrieb Ric Wheeler:
>>> That seems to be the issue - we need to understand why rename sees them as 
>>> different.
>>>
>>> Ric
>>>
>>>
>>> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth 
>>> > 
>>> wrote:
>>>
>>> Mhhhm... that's funny, I checked an mv with an strace now. I get:
>>> 
>>> -
>>> access("/cephfs/some_folder/file", W_OK) = 0
>>> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid 
>>> cross-device link)
>>> unlink("/cephfs/some_folder/file") = 0
>>> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 
>>> 255) = 30
>>> 
>>> -
>>> But I can assure it's only a single filesystem, and a single ceph-fuse 
>>> client running.
>>>
>>> Same happens when using absolute paths.
>>>
>>> Cheers,
>>>         Oliver
>>>
>>> Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
>>> > We should look at what mv uses to see if it thinks the directories 
>>> are on different file systems.
>>> >
>>> > If the fstat or whatever it looks at is confused, that might explain 
>>> it.
>>> >
>>> > Ric
>>> >
>>> >
>>> > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth 
>>>  
>>> >> >> wrote:
>>> >
>>> >     Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
>>> >     > Is this move between directories on the same file system?
>>> >
>>> >     It is, we only have a single CephFS in use. There's also only a 
>>> single ceph-fuse client running.
>>> >
>>> >     What's different, though, are different ACLs set for source and 
>>> target directory, and owner / group,
>>> >     but I hope that should not matter.
>>> >
>>> >     All the best,
>>> >     Oliver
>>> >
>>> >     > Rename as a system call only works within a file system.
>>> >     >
>>> >     > The user space mv command becomes a copy when not the same file 
>>> system. 
>>> >     >
>>> >     > Regards,
>>> >     >
>>> >     > Ric
>>> >     >
>>> >     >
>>> >     > On Fri, May 25, 2018, 8:51 AM John Spray >>  >> > >>  >> >> >     >
>>> >     >     On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
>>> >     >     >>  
>>> >> > 
>>> >>  
>>> >> >> >     >     > Dear Cephalopodians,
>>> >     >     >
>>> >     >     > I was wondering why a simple "mv" is taking 
>>> extraordinarily long on CephFS and must note that,
>>> >     >     > at least with the fuse-client (12.2.5) and when moving a 
>>> file from one directory to another,
>>>   

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Sage Weil
On Fri, 25 May 2018, Oliver Freyermuth wrote:
> Dear Ric,
> 
> I played around a bit - the common denominator seems to be: Moving it 
> within a directory subtree below a directory for which max_bytes / 
> max_files quota settings are set, things work fine. Moving it to another 
> directory tree without quota settings / with different quota settings, 
> rename() returns EXDEV.

Aha, yes, this is the issue.

When you set a quota you force subvolume-like behavior.  This is done 
because hard links across this quota boundary won't correctly account for 
utilization (only one of the file links will accrue usage).  The 
expectation is that quotas are usually set in locations that aren't 
frequently renamed across.

It might be possible to allow rename(2) to proceed in cases where 
nlink==1, but the behavior will probably seem inconsistent (some files get 
EXDEV, some don't).

sage



> 
> Cheers, Oliver
> 
> 
> Am 25.05.2018 um 15:18 schrieb Ric Wheeler:
> > That seems to be the issue - we need to understand why rename sees them as 
> > different.
> > 
> > Ric
> > 
> > 
> > On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth 
> > > 
> > wrote:
> > 
> > Mhhhm... that's funny, I checked an mv with an strace now. I get:
> > 
> > -
> > access("/cephfs/some_folder/file", W_OK) = 0
> > rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid 
> > cross-device link)
> > unlink("/cephfs/some_folder/file") = 0
> > lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 
> > 255) = 30
> > 
> > -
> > But I can assure it's only a single filesystem, and a single ceph-fuse 
> > client running.
> > 
> > Same happens when using absolute paths.
> > 
> > Cheers,
> >         Oliver
> > 
> > Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
> > > We should look at what mv uses to see if it thinks the directories 
> > are on different file systems.
> > >
> > > If the fstat or whatever it looks at is confused, that might explain 
> > it.
> > >
> > > Ric
> > >
> > >
> > > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth 
> >  
> >  > >> wrote:
> > >
> > >     Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> > >     > Is this move between directories on the same file system?
> > >
> > >     It is, we only have a single CephFS in use. There's also only a 
> > single ceph-fuse client running.
> > >
> > >     What's different, though, are different ACLs set for source and 
> > target directory, and owner / group,
> > >     but I hope that should not matter.
> > >
> > >     All the best,
> > >     Oliver
> > >
> > >     > Rename as a system call only works within a file system.
> > >     >
> > >     > The user space mv command becomes a copy when not the same file 
> > system. 
> > >     >
> > >     > Regards,
> > >     >
> > >     > Ric
> > >     >
> > >     >
> > >     > On Fri, May 25, 2018, 8:51 AM John Spray  >   > >  >   >  > >     >
> > >     >     On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> > >     >      >  
> >  > > 
> >  >  
> >  >  > >     >     > Dear Cephalopodians,
> > >     >     >
> > >     >     > I was wondering why a simple "mv" is taking 
> > extraordinarily long on CephFS and must note that,
> > >     >     > at least with the fuse-client (12.2.5) and when moving a 
> > file from one directory to another,
> > >     >     > the file appears to be copied first (byte by byte, 
> > traffic going through the client?) before the initial file is deleted.
> > >     >     >
> > >     >     > Is this true, or am I missing something?
> > >     >
> > >     >     A mv should not involve copying a file through the client 
> > -- it's
> > >     >     implemented in the MDS as a rename from one location to 
> > another.
> > >     >     What's the observation that's making it seem like the data 
> > is going
> > >     >     through the client?
> > >     >
> > >     >     John
> > >     >

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
On 25.05.2018 at 15:26, Luis Henriques wrote:
> Oliver Freyermuth  writes:
> 
>> Mhhhm... that's funny, I checked an mv with an strace now. I get:
>> -
>> access("/cephfs/some_folder/file", W_OK) = 0
>> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device 
>> link)
> 
> I believe this could happen if you have quotas set on any of the paths,
> or different snapshot realms.

Wow - yes, this matches my observations!
So in this case, e.g. moving files from a "user" directory with a quota to a
"group" directory with a different quota,
it is currently expected that files cannot be renamed across those boundaries?
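
For reference, a minimal client-side sketch of how such a boundary could be
detected and worked around, assuming the quotas are visible through the
ceph.quota.* virtual xattrs (the helper names are made up; mv(1) itself
already falls back to copy+unlink on EXDEV):

"""
import errno, os, shutil

def quota_root(path):
    # nearest ancestor with a CephFS quota set (ceph.quota.* xattr), or None
    path = os.path.abspath(path)
    while True:
        for attr in ("ceph.quota.max_bytes", "ceph.quota.max_files"):
            try:
                if int(os.getxattr(path, attr).decode() or "0") > 0:
                    return path
            except OSError:
                pass                      # xattr not set / not readable here
        parent = os.path.dirname(path)
        if parent == path:
            return None
        path = parent

def move(src, dst):
    # mv-like helper: rename when possible, copy + unlink when rename gets EXDEV
    if quota_root(src) != quota_root(os.path.dirname(dst)):
        print("crossing a quota boundary; rename() may return EXDEV")
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)
        os.unlink(src)
"""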

Cheers,
Oliver

> 
> Cheers,
> 




smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Sage,

here you go, some_folder in reality is "/cephfs/group":


# stat foo
  File: ‘foo’
  Size: 1048576000  Blocks: 2048000IO Block: 4194304 regular file
Device: 27h/39d Inode: 1099515065517  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:fusefs_t:s0
Access: 2018-05-25 15:27:59.433279424 +0200
Modify: 2018-05-25 15:28:01.379754052 +0200
Change: 2018-05-25 15:28:01.379754052 +0200
 Birth: -

# stat -f foo
  File: "foo"
ID: 0Namelen: 255 Type: fuseblk
Block size: 4194304Fundamental block size: 4194304
Blocks: Total: 104471885  Free: 79096968   Available: 79096968
Inodes: Total: 26258533   Free: -1


# stat -f /cephfs/group/
  File: "/cephfs/group/"
ID: 0Namelen: 255 Type: fuseblk
Block size: 4194304Fundamental block size: 4194304
Blocks: Total: 104471835  Free: 79098264   Available: 79098264
Inodes: Total: 26257190   Free: -1

# stat /cephfs/group/
  File: ‘/cephfs/group/’
  Size: 73167320986856  Blocks: 1  IO Block: 4096   directory
Device: 27h/39d Inode: 1099511627888  Links: 1
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:fusefs_t:s0
Access: 2018-03-09 18:22:47.061501906 +0100
Modify: 2018-05-25 15:18:02.164391701 +0200
Change: 2018-05-25 15:18:02.164391701 +0200
 Birth: -


Cheers,
Oliver

On 25.05.2018 at 15:21, Sage Weil wrote:
> Can you paste the output of 'stat foo' and 'stat /cephfs/some_folder'?  
> (Maybe also the same with 'stat -f'.)
> 
> Thanks!
> sage
> 
> 
> On Fri, 25 May 2018, Ric Wheeler wrote:
>> That seems to be the issue - we need to understand why rename sees them as
>> different.
>>
>> Ric
>>
>>
>> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth <
>> freyerm...@physik.uni-bonn.de> wrote:
>>
>>> Mhhhm... that's funny, I checked an mv with an strace now. I get:
>>>
>>> -
>>> access("/cephfs/some_folder/file", W_OK) = 0
>>> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device
>>> link)
>>> unlink("/cephfs/some_folder/file") = 0
>>> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255)
>>> = 30
>>>
>>> -
>>> But I can assure it's only a single filesystem, and a single ceph-fuse
>>> client running.
>>>
>>> Same happens when using absolute paths.
>>>
>>> Cheers,
>>> Oliver
>>>
>>> Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
 We should look at what mv uses to see if it thinks the directories are
>>> on different file systems.

 If the fstat or whatever it looks at is confused, that might explain it.

 Ric


 On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth <
>>> freyerm...@physik.uni-bonn.de >
>>> wrote:

 Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
 > Is this move between directories on the same file system?

 It is, we only have a single CephFS in use. There's also only a
>>> single ceph-fuse client running.

 What's different, though, are different ACLs set for source and
>>> target directory, and owner / group,
 but I hope that should not matter.

 All the best,
 Oliver

 > Rename as a system call only works within a file system.
 >
 > The user space mv command becomes a copy when not the same file
>>> system.
 >
 > Regards,
 >
 > Ric
 >
 >
 > On Fri, May 25, 2018, 8:51 AM John Spray >>  > jsp...@redhat.com>>> wrote:
 >
 > On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
 > > freyerm...@physik.uni-bonn.de> >> >> wrote:
 > > Dear Cephalopodians,
 > >
 > > I was wondering why a simple "mv" is taking extraordinarily
>>> long on CephFS and must note that,
 > > at least with the fuse-client (12.2.5) and when moving a
>>> file from one directory to another,
 > > the file appears to be copied first (byte by byte, traffic
>>> going through the client?) before the initial file is deleted.
 > >
 > > Is this true, or am I missing something?
 >
 > A mv should not involve copying a file through the client --
>>> it's
 > implemented in the MDS as a rename from one location to
>>> another.

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Ric,

I played around a bit - the common denominator seems to be: when moving
within a directory subtree below a directory for which max_bytes /
max_files quota settings are set,
things work fine.
When moving to another directory tree without quota settings / with
different quota settings, rename() returns EXDEV.

Cheers,
Oliver


On 25.05.2018 at 15:18, Ric Wheeler wrote:
> That seems to be the issue - we need to understand why rename sees them as 
> different.
> 
> Ric
> 
> 
> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth 
> > wrote:
> 
> Mhhhm... that's funny, I checked an mv with an strace now. I get:
> 
> -
> access("/cephfs/some_folder/file", W_OK) = 0
> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid 
> cross-device link)
> unlink("/cephfs/some_folder/file") = 0
> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 
> 255) = 30
> 
> -
> But I can assure it's only a single filesystem, and a single ceph-fuse 
> client running.
> 
> Same happens when using absolute paths.
> 
> Cheers,
>         Oliver
> 
> Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
> > We should look at what mv uses to see if it thinks the directories are 
> on different file systems.
> >
> > If the fstat or whatever it looks at is confused, that might explain it.
> >
> > Ric
> >
> >
> > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth 
>  
>  >> wrote:
> >
> >     Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> >     > Is this move between directories on the same file system?
> >
> >     It is, we only have a single CephFS in use. There's also only a 
> single ceph-fuse client running.
> >
> >     What's different, though, are different ACLs set for source and 
> target directory, and owner / group,
> >     but I hope that should not matter.
> >
> >     All the best,
> >     Oliver
> >
> >     > Rename as a system call only works within a file system.
> >     >
> >     > The user space mv command becomes a copy when not the same file 
> system. 
> >     >
> >     > Regards,
> >     >
> >     > Ric
> >     >
> >     >
> >     > On Fri, May 25, 2018, 8:51 AM John Spray    >     >     >
> >     >     On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> >     >        >     >     >     > Dear Cephalopodians,
> >     >     >
> >     >     > I was wondering why a simple "mv" is taking extraordinarily 
> long on CephFS and must note that,
> >     >     > at least with the fuse-client (12.2.5) and when moving a 
> file from one directory to another,
> >     >     > the file appears to be copied first (byte by byte, traffic 
> going through the client?) before the initial file is deleted.
> >     >     >
> >     >     > Is this true, or am I missing something?
> >     >
> >     >     A mv should not involve copying a file through the client -- 
> it's
> >     >     implemented in the MDS as a rename from one location to 
> another.
> >     >     What's the observation that's making it seem like the data is 
> going
> >     >     through the client?
> >     >
> >     >     John
> >     >
> >     >     >
> >     >     > For large files, this might be rather time consuming,
> >     >     > and we should certainly advise all our users to not move 
> files around needlessly if this is the case.
> >     >     >
> >     >     > Cheers,
> >     >     >         Oliver
> >     >     >
> >     >     >
> >     >     > ___
> >     >     > ceph-users mailing list
> >     >     > ceph-users@lists.ceph.com 
>   >    >>
> >     >     > 

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Sage Weil
Can you paste the output of 'stat foo' and 'stat /cephfs/some_folder'?  
(Maybe also the same with 'stat -f'.)

Thanks!
sage


On Fri, 25 May 2018, Ric Wheeler wrote:
> That seems to be the issue - we need to understand why rename sees them as
> different.
> 
> Ric
> 
> 
> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de> wrote:
> 
> > Mhhhm... that's funny, I checked an mv with an strace now. I get:
> >
> > -
> > access("/cephfs/some_folder/file", W_OK) = 0
> > rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device
> > link)
> > unlink("/cephfs/some_folder/file") = 0
> > lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255)
> > = 30
> >
> > -
> > But I can assure it's only a single filesystem, and a single ceph-fuse
> > client running.
> >
> > Same happens when using absolute paths.
> >
> > Cheers,
> > Oliver
> >
> > Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
> > > We should look at what mv uses to see if it thinks the directories are
> > on different file systems.
> > >
> > > If the fstat or whatever it looks at is confused, that might explain it.
> > >
> > > Ric
> > >
> > >
> > > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth <
> > freyerm...@physik.uni-bonn.de >
> > wrote:
> > >
> > > Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> > > > Is this move between directories on the same file system?
> > >
> > > It is, we only have a single CephFS in use. There's also only a
> > single ceph-fuse client running.
> > >
> > > What's different, though, are different ACLs set for source and
> > target directory, and owner / group,
> > > but I hope that should not matter.
> > >
> > > All the best,
> > > Oliver
> > >
> > > > Rename as a system call only works within a file system.
> > > >
> > > > The user space mv command becomes a copy when not the same file
> > system.
> > > >
> > > > Regards,
> > > >
> > > > Ric
> > > >
> > > >
> > > > On Fri, May 25, 2018, 8:51 AM John Spray  >   jsp...@redhat.com>>> wrote:
> > > >
> > > > On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> > > >  freyerm...@physik.uni-bonn.de>  > >> wrote:
> > > > > Dear Cephalopodians,
> > > > >
> > > > > I was wondering why a simple "mv" is taking extraordinarily
> > long on CephFS and must note that,
> > > > > at least with the fuse-client (12.2.5) and when moving a
> > file from one directory to another,
> > > > > the file appears to be copied first (byte by byte, traffic
> > going through the client?) before the initial file is deleted.
> > > > >
> > > > > Is this true, or am I missing something?
> > > >
> > > > A mv should not involve copying a file through the client --
> > it's
> > > > implemented in the MDS as a rename from one location to
> > another.
> > > > What's the observation that's making it seem like the data is
> > going
> > > > through the client?
> > > >
> > > > John
> > > >
> > > > >
> > > > > For large files, this might be rather time consuming,
> > > > > and we should certainly advise all our users to not move
> > files around needlessly if this is the case.
> > > > >
> > > > > Cheers,
> > > > > Oliver
> > > > >
> > > > >
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com 
> > >
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com 
> > >
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > >
> >
> >
> >
> >
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Ric Wheeler
That seems to be the issue - we need to understand why rename sees them as
different.

Ric


On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Mhhhm... that's funny, I checked an mv with an strace now. I get:
>
> -
> access("/cephfs/some_folder/file", W_OK) = 0
> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device
> link)
> unlink("/cephfs/some_folder/file") = 0
> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255)
> = 30
>
> -
> But I can assure it's only a single filesystem, and a single ceph-fuse
> client running.
>
> Same happens when using absolute paths.
>
> Cheers,
> Oliver
>
> Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
> > We should look at what mv uses to see if it thinks the directories are
> on different file systems.
> >
> > If the fstat or whatever it looks at is confused, that might explain it.
> >
> > Ric
> >
> >
> > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >
> > Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> > > Is this move between directories on the same file system?
> >
> > It is, we only have a single CephFS in use. There's also only a
> single ceph-fuse client running.
> >
> > What's different, though, are different ACLs set for source and
> target directory, and owner / group,
> > but I hope that should not matter.
> >
> > All the best,
> > Oliver
> >
> > > Rename as a system call only works within a file system.
> > >
> > > The user space mv command becomes a copy when not the same file
> system.
> > >
> > > Regards,
> > >
> > > Ric
> > >
> > >
> > > On Fri, May 25, 2018, 8:51 AM John Spray   >> wrote:
> > >
> > > On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> > >   >> wrote:
> > > > Dear Cephalopodians,
> > > >
> > > > I was wondering why a simple "mv" is taking extraordinarily
> long on CephFS and must note that,
> > > > at least with the fuse-client (12.2.5) and when moving a
> file from one directory to another,
> > > > the file appears to be copied first (byte by byte, traffic
> going through the client?) before the initial file is deleted.
> > > >
> > > > Is this true, or am I missing something?
> > >
> > > A mv should not involve copying a file through the client --
> it's
> > > implemented in the MDS as a rename from one location to
> another.
> > > What's the observation that's making it seem like the data is
> going
> > > through the client?
> > >
> > > John
> > >
> > > >
> > > > For large files, this might be rather time consuming,
> > > > and we should certainly advise all our users to not move
> files around needlessly if this is the case.
> > > >
> > > > Cheers,
> > > > Oliver
> > > >
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com 
> >
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com 
> >
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Mhhhm... that's funny, I checked an mv with an strace now. I get:
-
access("/cephfs/some_folder/file", W_OK) = 0
rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device link)
unlink("/cephfs/some_folder/file") = 0
lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255) = 30
-
But I can assure you it's only a single filesystem, with only a single ceph-fuse
client running.

Same happens when using absolute paths. 
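
One way to cross-check what rename() is complaining about (just a debugging sketch;
the paths are the ones from the strace above, and the format specifiers assume GNU
coreutils stat) is to compare the device IDs and mount points of source and target:

stat -c 'dev=%D  mount=%m  name=%n' foo /cephfs/some_folder

If both show the same device and mount point, rename() should not be returning EXDEV.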

Cheers,
Oliver

Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
> We should look at what mv uses to see if it thinks the directories are on 
> different file systems.
> 
> If the fstat or whatever it looks at is confused, that might explain it.
> 
> Ric
> 
> 
> On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth 
> > wrote:
> 
> Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> > Is this move between directories on the same file system?
> 
> It is, we only have a single CephFS in use. There's also only a single 
> ceph-fuse client running.
> 
> What's different, though, are different ACLs set for source and target 
> directory, and owner / group,
> but I hope that should not matter.
> 
> All the best,
> Oliver
> 
> > Rename as a system call only works within a file system.
> >
> > The user space mv command becomes a copy when not the same file system. 
> >
> > Regards,
> >
> > Ric
> >
> >
> > On Fri, May 25, 2018, 8:51 AM John Spray    >> wrote:
> >
> >     On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> >        >> wrote:
> >     > Dear Cephalopodians,
> >     >
> >     > I was wondering why a simple "mv" is taking extraordinarily long 
> on CephFS and must note that,
> >     > at least with the fuse-client (12.2.5) and when moving a file 
> from one directory to another,
> >     > the file appears to be copied first (byte by byte, traffic going 
> through the client?) before the initial file is deleted.
> >     >
> >     > Is this true, or am I missing something?
> >
> >     A mv should not involve copying a file through the client -- it's
> >     implemented in the MDS as a rename from one location to another.
> >     What's the observation that's making it seem like the data is going
> >     through the client?
> >
> >     John
> >
> >     >
> >     > For large files, this might be rather time consuming,
> >     > and we should certainly advise all our users to not move files 
> around needlessly if this is the case.
> >     >
> >     > Cheers,
> >     >         Oliver
> >     >
> >     >
> >     > ___
> >     > ceph-users mailing list
> >     > ceph-users@lists.ceph.com  
> >
> >     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >     >
> >     ___
> >     ceph-users mailing list
> >     ceph-users@lists.ceph.com  
> >
> >     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 





smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different disk sizes after Luminous upgrade 12.2.2 --> 12.2.5

2018-05-25 Thread Eugen Block

Hi Igor,

This difference was introduced by the following PR:  
https://github.com/ceph/ceph/pull/20487 (commit os/bluestore: do not  
account DB volume space in total one reported by statfs method).


The rationale is to show block device capacity as total only. And  
don't add DB space to it. This makes no sense since data stored at  
these locations aren't cumulative.


So this just an effect of a bit different calculation.


Thank you very much for this quick response and the confirmation of
our assumption.
I totally agree that it makes more sense to *not* count the DB size
in the total disk size; we were just wondering if something went wrong
during the upgrade.


Regards,
Eugen


Zitat von Igor Fedotov :


Hi Eugen,

This difference was introduced by the following PR:  
https://github.com/ceph/ceph/pull/20487 (commit os/bluestore: do not  
account DB volume space in total one reported by statfs method).


The rationale is to show block device capacity as total only. And  
don't add DB space to it. This makes no sense since data stored at  
these locations aren't cumulative.


So this just an effect of a bit different calculation.

Thanks,

Igor



On 5/25/2018 2:22 PM, Eugen Block wrote:

Hi list,

we have a Luminous bluestore cluster with separate  
block.db/block.wal on SSDs. We were running version 12.2.2 and  
upgraded yesterday to 12.2.5. The upgrade went smoothly, but since  
the restart of the OSDs I noticed that 'ceph osd df' shows a  
different total disk size:


---cut here---
ceph1:~ #  ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
 1   hdd 0.92429  1.0   931G   557G  373G 59.85 1.03 681
 4   hdd 0.92429  1.0   931G   535G  395G 57.52 0.99 645
 6   hdd 0.92429  1.0   931G   532G  398G 57.19 0.99 640
13   hdd 0.92429  1.0   931G   587G  343G 63.08 1.09 671
16   hdd 0.92429  1.0   931G   562G  368G 60.40 1.04 665
18   hdd 0.92429  1.0   931G   531G  399G 57.07 0.98 623
10   ssd 0.72769  1.0   745G 18423M  727G  2.41 0.04  37
---cut here---

Before the upgrade the displayed size for each 1TB disk was 946G  
where each OSD has a block.db size of 15G (931 + 15 = 946). So it  
seems that in one of the recent changes within 12.2.X the output  
has changed, also resulting in a slightly smaller total cluster  
size. Is this just a code change for the size calculation or is  
there something else I should look out for?


Regards,
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Ric Wheeler
We should look at what mv uses to see if it thinks the directories are on
different file systems.

If the fstat or whatever it looks at is confused, that might explain it.

Ric


On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> > Is this move between directories on the same file system?
>
> It is, we only have a single CephFS in use. There's also only a single
> ceph-fuse client running.
>
> What's different, though, are different ACLs set for source and target
> directory, and owner / group,
> but I hope that should not matter.
>
> All the best,
> Oliver
>
> > Rename as a system call only works within a file system.
> >
> > The user space mv command becomes a copy when not the same file system.
> >
> > Regards,
> >
> > Ric
> >
> >
> > On Fri, May 25, 2018, 8:51 AM John Spray > wrote:
> >
> > On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> > >
> wrote:
> > > Dear Cephalopodians,
> > >
> > > I was wondering why a simple "mv" is taking extraordinarily long
> on CephFS and must note that,
> > > at least with the fuse-client (12.2.5) and when moving a file from
> one directory to another,
> > > the file appears to be copied first (byte by byte, traffic going
> through the client?) before the initial file is deleted.
> > >
> > > Is this true, or am I missing something?
> >
> > A mv should not involve copying a file through the client -- it's
> > implemented in the MDS as a rename from one location to another.
> > What's the observation that's making it seem like the data is going
> > through the client?
> >
> > John
> >
> > >
> > > For large files, this might be rather time consuming,
> > > and we should certainly advise all our users to not move files
> around needlessly if this is the case.
> > >
> > > Cheers,
> > > Oliver
> > >
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> Is this move between directories on the same file system?

It is, we only have a single CephFS in use. There's also only a single 
ceph-fuse client running. 

What's different, though, are different ACLs set for source and target 
directory, and owner / group,
but I hope that should not matter. 

All the best,
Oliver

> Rename as a system call only works within a file system.
> 
> The user space mv command becomes a copy when not the same file system. 
> 
> Regards,
> 
> Ric
> 
> 
> On Fri, May 25, 2018, 8:51 AM John Spray  > wrote:
> 
> On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> > 
> wrote:
> > Dear Cephalopodians,
> >
> > I was wondering why a simple "mv" is taking extraordinarily long on 
> CephFS and must note that,
> > at least with the fuse-client (12.2.5) and when moving a file from one 
> directory to another,
> > the file appears to be copied first (byte by byte, traffic going 
> through the client?) before the initial file is deleted.
> >
> > Is this true, or am I missing something?
> 
> A mv should not involve copying a file through the client -- it's
> implemented in the MDS as a rename from one location to another.
> What's the observation that's making it seem like the data is going
> through the client?
> 
> John
> 
> >
> > For large files, this might be rather time consuming,
> > and we should certainly advise all our users to not move files around 
> needlessly if this is the case.
> >
> > Cheers,
> >         Oliver
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 14:50 schrieb John Spray:
> On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
>  wrote:
>> Dear Cephalopodians,
>>
>> I was wondering why a simple "mv" is taking extraordinarily long on CephFS 
>> and must note that,
>> at least with the fuse-client (12.2.5) and when moving a file from one 
>> directory to another,
>> the file appears to be copied first (byte by byte, traffic going through the 
>> client?) before the initial file is deleted.
>>
>> Is this true, or am I missing something?
> 
> A mv should not involve copying a file through the client -- it's
> implemented in the MDS as a rename from one location to another.
> What's the observation that's making it seem like the data is going
> through the client?

The fact that it's happening at only about 1 GBit/s while all OSDs are reading
and writing.
I will also check the network interface of the client the next time it occurs.
In addition, ceph-fuse was taking 50 % CPU load just from this.

Also, I observe the file at the source being kept during the copy,
and the file at the target growing slowly. So it's definitely a copy, and only
at the end is the source file deleted.
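
In case it helps, a rough sketch of what I plan to look at next time (the interface
name and admin socket path are only guesses for a typical setup, and assume the
client has an admin socket configured):

# client network throughput while the mv runs:
sar -n DEV 1 | grep eth0
# client/objecter counters from the ceph-fuse admin socket:
ceph daemon /var/run/ceph/ceph-client.admin.asok perf dump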

> 
> John
> 
>>
>> For large files, this might be rather time consuming,
>> and we should certainly advise all our users to not move files around 
>> needlessly if this is the case.
>>
>> Cheers,
>> Oliver
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Ric Wheeler
Is this move between directories on the same file system?

Rename as a system call only works within a file system.

The user space mv command becomes a copy when the source and target are not on
the same file system.

Regards,

Ric


On Fri, May 25, 2018, 8:51 AM John Spray  wrote:

> On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
>  wrote:
> > Dear Cephalopodians,
> >
> > I was wondering why a simple "mv" is taking extraordinarily long on
> CephFS and must note that,
> > at least with the fuse-client (12.2.5) and when moving a file from one
> directory to another,
> > the file appears to be copied first (byte by byte, traffic going through
> the client?) before the initial file is deleted.
> >
> > Is this true, or am I missing something?
>
> A mv should not involve copying a file through the client -- it's
> implemented in the MDS as a rename from one location to another.
> What's the observation that's making it seem like the data is going
> through the client?
>
> John
>
> >
> > For large files, this might be rather time consuming,
> > and we should certainly advise all our users to not move files around
> needlessly if this is the case.
> >
> > Cheers,
> > Oliver
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread John Spray
On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
 wrote:
> Dear Cephalopodians,
>
> I was wondering why a simple "mv" is taking extraordinarily long on CephFS 
> and must note that,
> at least with the fuse-client (12.2.5) and when moving a file from one 
> directory to another,
> the file appears to be copied first (byte by byte, traffic going through the 
> client?) before the initial file is deleted.
>
> Is this true, or am I missing something?

A mv should not involve copying a file through the client -- it's
implemented in the MDS as a rename from one location to another.
What's the observation that's making it seem like the data is going
through the client?

John

>
> For large files, this might be rather time consuming,
> and we should certainly advise all our users to not move files around 
> needlessly if this is the case.
>
> Cheers,
> Oliver
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How high-touch is ceph?

2018-05-25 Thread John Spray
On Fri, May 25, 2018 at 1:17 PM, Rhugga Harper  wrote:
>
> I've been evaluating ceph as a solution for persistent block in our
> kubrenetes clusters for low-iops requirement applications. It doesn't do too
> terribly bad with 32k workloads even though it's object storage under the
> hood.
>
> However it seems this is a very high maintenance solution requiring you to
> be on top of it at every minute of the day coupled with impeccable
> monitoring and alert response times.

It's difficult to respond intelligently to a comment like this without
something more specific.  What specific tasks do you envisage needing
to do every minute of every day?  What made it seem that way?

John

>
> Also how stable/reliable are erasure coded pools?
>
> Thx
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How high-touch is ceph?

2018-05-25 Thread Rhugga Harper
I've been evaluating Ceph as a solution for persistent block storage in our
Kubernetes clusters for applications with low IOPS requirements. It doesn't do
too terribly badly with 32k workloads, even though it's object storage under
the hood.

However, it seems this is a very high-maintenance solution, requiring you to
be on top of it every minute of the day, coupled with impeccable monitoring
and alert response times.

Also, how stable/reliable are erasure-coded pools?

Thx
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Cephalopodians,

I was wondering why a simple "mv" is taking extraordinarily long on CephFS and 
must note that,
at least with the fuse-client (12.2.5) and when moving a file from one 
directory to another,
the file appears to be copied first (byte by byte, traffic going through the 
client?) before the initial file is deleted. 

Is this true, or am I missing something? 

For large files, this might be rather time consuming,
and we should certainly advise all our users to not move files around 
needlessly if this is the case. 

Cheers,
Oliver



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issues with RBD when rebooting

2018-05-25 Thread Maged Mokhtar
On 2018-05-25 12:11, Josef Zelenka wrote:

> Hi, we are running a jewel cluster (54OSDs, six nodes, ubuntu 16.04) that 
> serves as a backend for openstack(newton) VMs. TOday we had to reboot one of 
> the nodes(replicated pool, x2) and some of our VMs oopsed with issues with 
> their FS(mainly database VMs, postgresql) - is there a reason for this to 
> happen? if data is replicated, the VMs shouldn't even notice we rebooted one 
> of the nodes, right? Maybe i just don't understand how this works correctly, 
> but i hope someone around here can either tell me why this is happenning or 
> how to fix it.
> 
> Thanks
> 
> Josef
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

It could be a timeout setting issue. Typically your higher application-level
timeouts should be larger than your low-level I/O timeouts to allow for
recovery. Check whether your PostgreSQL has timeouts that may be set too low.

At the low level, an OSD will be detected as failed via
osd_heartbeat_grace + osd_heartbeat_interval; you can lower this to, for
example, 20 s via:
osd heartbeat grace = 15
osd heartbeat interval = 5
This will give 20 s before an OSD is reported as dead and remapping occurs.
Do not lower it too much, or else you may trigger remaps on false alarms.

At the higher levels, it may be worth double-checking:
rados_osd_op_timeout in the case of librbd
osd_request_timeout in the case of kernel rbd (if enabled)
They need to be larger than the OSD timeouts above.

Also at that level, the OS disk timeout (usually high enough already) is
/sys/block/sdX/device/timeout

Your PostgreSQL timeouts need to be higher than 20 s in this case.
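
As a minimal sketch (section placement and the client-side value are examples only,
not recommendations), the ceph.conf side of this could look like:

[global]
osd heartbeat grace = 15
osd heartbeat interval = 5

[client]
# librbd: fail a hung request after 60 s instead of blocking forever
rados osd op timeout = 60

and the OS-level disk timeout can be checked with:

cat /sys/block/sdX/device/timeout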

/Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different disk sizes after Luminous upgrade 12.2.2 --> 12.2.5

2018-05-25 Thread Igor Fedotov

Hi Eugen,

This difference was introduced by the following PR: 
https://github.com/ceph/ceph/pull/20487 (commit os/bluestore: do not 
account DB volume space in total one reported by statfs method).


The rationale is to report the block device capacity alone as the total and
not add the DB space to it. Adding it makes no sense, since data stored at
these locations isn't cumulative.


So this is just an effect of a slightly different calculation.

Thanks,

Igor



On 5/25/2018 2:22 PM, Eugen Block wrote:

Hi list,

we have a Luminous bluestore cluster with separate block.db/block.wal 
on SSDs. We were running version 12.2.2 and upgraded yesterday to 
12.2.5. The upgrade went smoothly, but since the restart of the OSDs I 
noticed that 'ceph osd df' shows a different total disk size:


---cut here---
ceph1:~ #  ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
 1   hdd 0.92429  1.0   931G   557G  373G 59.85 1.03 681
 4   hdd 0.92429  1.0   931G   535G  395G 57.52 0.99 645
 6   hdd 0.92429  1.0   931G   532G  398G 57.19 0.99 640
13   hdd 0.92429  1.0   931G   587G  343G 63.08 1.09 671
16   hdd 0.92429  1.0   931G   562G  368G 60.40 1.04 665
18   hdd 0.92429  1.0   931G   531G  399G 57.07 0.98 623
10   ssd 0.72769  1.0   745G 18423M  727G  2.41 0.04  37
---cut here---

Before the upgrade the displayed size for each 1TB disk was 946G where 
each OSD has a block.db size of 15G (931 + 15 = 946). So it seems that 
in one of the recent changes within 12.2.X the output has changed, 
also resulting in a slightly smaller total cluster size. Is this just 
a code change for the size calculation or is there something else I 
should look out for?


Regards,
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Different disk sizes after Luminous upgrade 12.2.2 --> 12.2.5

2018-05-25 Thread Eugen Block

Hi list,

we have a Luminous bluestore cluster with separate block.db/block.wal  
on SSDs. We were running version 12.2.2 and upgraded yesterday to  
12.2.5. The upgrade went smoothly, but since the restart of the OSDs I  
noticed that 'ceph osd df' shows a different total disk size:


---cut here---
ceph1:~ #  ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USEAVAIL %USE  VAR  PGS
 1   hdd 0.92429  1.0   931G   557G  373G 59.85 1.03 681
 4   hdd 0.92429  1.0   931G   535G  395G 57.52 0.99 645
 6   hdd 0.92429  1.0   931G   532G  398G 57.19 0.99 640
13   hdd 0.92429  1.0   931G   587G  343G 63.08 1.09 671
16   hdd 0.92429  1.0   931G   562G  368G 60.40 1.04 665
18   hdd 0.92429  1.0   931G   531G  399G 57.07 0.98 623
10   ssd 0.72769  1.0   745G 18423M  727G  2.41 0.04  37
---cut here---

Before the upgrade the displayed size for each 1TB disk was 946G where  
each OSD has a block.db size of 15G (931 + 15 = 946). So it seems that  
in one of the recent changes within 12.2.X the output has changed,  
also resulting in a slightly smaller total cluster size. Is this just  
a code change for the size calculation or is there something else I  
should look out for?
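
(For what it's worth, the DB portion should still be visible per OSD via the BlueFS
perf counters; a rough sketch, with the OSD id just as an example:)

ceph daemon osd.1 perf dump | egrep 'db_total_bytes|db_used_bytes'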


Regards,
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete pool nicely

2018-05-25 Thread Paul Emmerich
Also, upgrade to Luminous and migrate your OSDs to BlueStore before using
erasure coding.
Luminous + BlueStore performs so much better for erasure coding than any of
the old configurations.

Also, I've found that deleting a large number of objects is far less
stressful on a BlueStore OSD than on a Filestore OSD.
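
A rough sketch of the throttled approach David describes in the quoted message
below (pool name and sleep interval are placeholders, adjust to taste):

rados -p .rgw.buckets ls > objects.txt
while IFS= read -r obj; do
    rados -p .rgw.buckets rm "$obj"
    sleep 0.05   # crude rate limit; raise it if client I/O starts to suffer
done < objects.txt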

Paul


2018-05-22 19:28 GMT+02:00 David Turner :

> From my experience, that would cause you some troubles as it would throw
> the entire pool into the deletion queue to be processed as it cleans up the
> disks and everything.  I would suggest using a pool listing from `rados -p
> .rgw.buckets ls` and iterate on that using some scripts around the `rados
> -p .rgw.buckets rm <object>` command that you could stop, restart at a
> faster pace, slow down, etc.  Once the objects in the pool are gone, you
> can delete the empty pool without any problems.  I like this option because
> it makes it simple to stop it if you're impacting your VM traffic.
>
>
> On Tue, May 22, 2018 at 11:05 AM Simon Ironside 
> wrote:
>
>> Hi Everyone,
>>
>> I have an older cluster (Hammer 0.94.7) with a broken radosgw service
>> that I'd just like to blow away before upgrading to Jewel after which
>> I'll start again with EC pools.
>>
>> I don't need the data but I'm worried that deleting the .rgw.buckets
>> pool will cause performance degradation for the production RBD pool used
>> by VMs. .rgw.buckets is a replicated pool (size=3) with ~14TB data in
>> 5.3M objects. A little over half the data in the whole cluster.
>>
>> Is deleting this pool simply using ceph osd pool delete likely to cause
>> me a performance problem? If so, is there a way I can do it better?
>>
>> Thanks,
>> Simon.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Paul Emmerich
If you are so worried about the storage efficiency: why not use erasure
coding?
EC performs really well with Luminous in our experience.
Yes, you generate more IOPS and somewhat more CPU load and a higher latency.
But it's often worth a try.

Simple example for everyone considering 2/1 replicas: consider 2/2 erasure
coding.

* Data durability and availability of 3/2 replicas
* Storage efficiency of 2/1 replicas
* 33% more write IOPS than 3/2 replicas
* 100% more read IOPS than any replica setup (400% more to reduce latency
with fast_read)

Of course, 2/2 erasure coding might seem stupid. We typically use 4/2, 5/2,
or 5/3.

So if you are worried about storage overhead: try it out and see for yourself
how it performs for your use case.
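
If you want to try it, a minimal sketch for Luminous (profile and pool names are
made up; 4+2 stores 1.5x the raw data versus 3x for replica 3):

ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec-4-2
ceph osd pool set ecpool allow_ec_overwrites true   # needed for RBD/CephFS data on EC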

I've rescued several clusters that were configured with 2/1 replicas and broke
down in various ways... it's not pretty and can be annoying and time-consuming
to fix. As in tracking down a broken disk where the OSD doesn't start up
properly and trying to get the last copy of a PG off it with
ceph-objectstore-tool...



Paul


2018-05-25 9:48 GMT+02:00 Janne Johansson :

>
>
> Den fre 25 maj 2018 kl 00:20 skrev Jack :
>
>> On 05/24/2018 11:40 PM, Stefan Kooman wrote:
>> >> What are your thoughts, would you run 2x replication factor in
>> >> Production and in what scenarios?
>> Me neither, mostly because I have yet to read a technical point of view,
>> from someone who read and understand the code
>>
>> I do not buy Janne's "trust me, I am an engineer", whom btw confirmed
>> that the "replica 3" stuff is subject to probability and function to the
>> cluster size, thus is not a generic "always-true" rule
>>
>
> I did not call for trust on _my_ experience or value, but on the ones
> posting the
> first "everyone should probably use 3 replicas" over which you showed
> doubt.
> I agree with them, but did not intend to claim that my post had extra
> value because
> it was written by me.
>
> Also, the last part of my post was very much intended to add "not
> everything in 3x is true for everyone",
> but if you value your data, it would be very prudent to listen to
> experienced people who took risks and lost data before.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Donny Davis
Nobody cares about their data until they don't have it anymore. Using
replica 3 follows the same logic as RAID 6: it's likely that if one drive has
crapped out, more will meet their maker soon. If you care about your data,
then do what you can to keep it around. If it's a lab like mine, who cares;
it's all ephemeral to me. The decision is about your use case and workload.

If it were my production data, I would spend the money.

On Fri, May 25, 2018 at 3:48 AM, Janne Johansson 
wrote:

>
>
> Den fre 25 maj 2018 kl 00:20 skrev Jack :
>
>> On 05/24/2018 11:40 PM, Stefan Kooman wrote:
>> >> What are your thoughts, would you run 2x replication factor in
>> >> Production and in what scenarios?
>> Me neither, mostly because I have yet to read a technical point of view,
>> from someone who read and understand the code
>>
>> I do not buy Janne's "trust me, I am an engineer", whom btw confirmed
>> that the "replica 3" stuff is subject to probability and function to the
>> cluster size, thus is not a generic "always-true" rule
>>
>
> I did not call for trust on _my_ experience or value, but on the ones
> posting the
> first "everyone should probably use 3 replicas" over which you showed
> doubt.
> I agree with them, but did not intend to claim that my post had extra
> value because
> it was written by me.
>
> Also, the last part of my post was very much intended to add "not
> everything in 3x is true for everyone",
> but if you value your data, it would be very prudent to listen to
> experienced people who took risks and lost data before.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Issues with RBD when rebooting

2018-05-25 Thread Josef Zelenka
Hi, we are running a Jewel cluster (54 OSDs, six nodes, Ubuntu 16.04)
that serves as a backend for OpenStack (Newton) VMs. Today we had to
reboot one of the nodes (replicated pool, x2) and some of our VMs oopsed
with issues with their FS (mainly database VMs, PostgreSQL). Is there a
reason for this to happen? If data is replicated, the VMs shouldn't even
notice we rebooted one of the nodes, right? Maybe I just don't
understand how this works correctly, but I hope someone around here can
either tell me why this is happening or how to fix it.


Thanks

Josef

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-25 Thread Yan, Zheng
On Fri, May 25, 2018 at 4:28 PM, Yan, Zheng  wrote:
> I found some memory leak. could you please try
> https://github.com/ceph/ceph/pull/22240
>

The leak only affects setups with multiple active MDS daemons, so I think it's unrelated to your issue.

>
> On Fri, May 25, 2018 at 1:49 PM, Alexandre DERUMIER  
> wrote:
>> Here the result:
>>
>>
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net flush journal
>> {
>> "message": "",
>> "return_code": 0
>> }
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 
>> 1
>> {
>> "success": "mds_cache_size = '1' (not observed, change may require 
>> restart) "
>> }
>>
>> wait ...
>>
>>
>> root@ceph4-2:~# ceph tell mds.ceph4-2.odiso.net heap stats
>> 2018-05-25 07:44:02.185911 7f4cad7fa700  0 client.50748489 ms_handle_reset 
>> on 10.5.0.88:6804/994206868
>> 2018-05-25 07:44:02.196160 7f4cae7fc700  0 client.50792764 ms_handle_reset 
>> on 10.5.0.88:6804/994206868
>> mds.ceph4-2.odiso.net tcmalloc heap 
>> stats:
>> MALLOC:13175782328 (12565.4 MiB) Bytes in use by application
>> MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
>> MALLOC: +   1774628488 ( 1692.4 MiB) Bytes in central cache freelist
>> MALLOC: + 34274608 (   32.7 MiB) Bytes in transfer cache freelist
>> MALLOC: + 57260176 (   54.6 MiB) Bytes in thread cache freelists
>> MALLOC: +120582336 (  115.0 MiB) Bytes in malloc metadata
>> MALLOC:   
>> MALLOC: =  15162527936 (14460.1 MiB) Actual memory used (physical + swap)
>> MALLOC: +   4974067712 ( 4743.6 MiB) Bytes released to OS (aka unmapped)
>> MALLOC:   
>> MALLOC: =  20136595648 (19203.8 MiB) Virtual address space used
>> MALLOC:
>> MALLOC:1852388  Spans in use
>> MALLOC: 18  Thread heaps in use
>> MALLOC:   8192  Tcmalloc page size
>> 
>> Call ReleaseFreeMemory() to release freelist memory to the OS (via 
>> madvise()).
>> Bytes released to the OS take up virtual address space but no physical 
>> memory.
>>
>>
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 0
>> {
>> "success": "mds_cache_size = '0' (not observed, change may require 
>> restart) "
>> }
>>
>> - Mail original -
>> De: "Zheng Yan" 
>> À: "aderumier" 
>> Envoyé: Vendredi 25 Mai 2018 05:56:31
>> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>>
>> On Thu, May 24, 2018 at 11:34 PM, Alexandre DERUMIER
>>  wrote:
>Still don't find any clue. Does the cephfs have idle period. If it
>has, could you decrease mds's cache size and check what happens. For
>example, run following commands during the old period.
>>>
>ceph daemon mds.xx flush journal
>ceph daemon mds.xx config set mds_cache_size 1;
>"wait a minute"
>ceph tell mds.xx heap stats
>ceph daemon mds.xx config set mds_cache_size 0
>>>
>>> ok thanks. I'll try this night.
>>>
>>> I have already mds_cache_memory_limit = 5368709120,
>>>
>>> does it need to remove it first before setting mds_cache_size 1 ?
>>
>> no
>>>
>>>
>>>
>>>
>>> - Mail original -
>>> De: "Zheng Yan" 
>>> À: "aderumier" 
>>> Cc: "ceph-users" 
>>> Envoyé: Jeudi 24 Mai 2018 16:27:21
>>> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>>>
>>> On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER  
>>> wrote:
 Thanks!


 here the profile.pdf

 10-15min profiling, I can't do it longer because my clients where lagging.

 but I think it should be enough to observe the rss memory increase.


>>>
>>> Still don't find any clue. Does the cephfs have idle period. If it
>>> has, could you decrease mds's cache size and check what happens. For
>>> example, run following commands during the old period.
>>>
>>> ceph daemon mds.xx flush journal
>>> ceph daemon mds.xx config set mds_cache_size 1;
>>> "wait a minute"
>>> ceph tell mds.xx heap stats
>>> ceph daemon mds.xx config set mds_cache_size 0
>>>
>>>


 - Mail original -
 De: "Zheng Yan" 
 À: "aderumier" 
 Cc: "ceph-users" 
 Envoyé: Jeudi 24 Mai 2018 11:34:20
 Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

 On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER  
 wrote:
> Hi,some new stats, mds memory is not 16G,
>
> I have almost same number of items and bytes in cache vs some weeks ago 
> when mds was using 8G. (ceph 12.2.5)
>
>
> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf 
> dump | jq '.mds_mem.rss'; ceph daemon 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-25 Thread Yan, Zheng
I found a memory leak. Could you please try
https://github.com/ceph/ceph/pull/22240
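
(Side note, just a sketch: the tcmalloc stats quoted below show a couple of GiB
sitting in freelists; tcmalloc can be asked to hand that memory back to the OS with:)

ceph tell mds.ceph4-2.odiso.net heap release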


On Fri, May 25, 2018 at 1:49 PM, Alexandre DERUMIER  wrote:
> Here the result:
>
>
> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net flush journal
> {
> "message": "",
> "return_code": 0
> }
> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 
> 1
> {
> "success": "mds_cache_size = '1' (not observed, change may require 
> restart) "
> }
>
> wait ...
>
>
> root@ceph4-2:~# ceph tell mds.ceph4-2.odiso.net heap stats
> 2018-05-25 07:44:02.185911 7f4cad7fa700  0 client.50748489 ms_handle_reset on 
> 10.5.0.88:6804/994206868
> 2018-05-25 07:44:02.196160 7f4cae7fc700  0 client.50792764 ms_handle_reset on 
> 10.5.0.88:6804/994206868
> mds.ceph4-2.odiso.net tcmalloc heap 
> stats:
> MALLOC:13175782328 (12565.4 MiB) Bytes in use by application
> MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
> MALLOC: +   1774628488 ( 1692.4 MiB) Bytes in central cache freelist
> MALLOC: + 34274608 (   32.7 MiB) Bytes in transfer cache freelist
> MALLOC: + 57260176 (   54.6 MiB) Bytes in thread cache freelists
> MALLOC: +120582336 (  115.0 MiB) Bytes in malloc metadata
> MALLOC:   
> MALLOC: =  15162527936 (14460.1 MiB) Actual memory used (physical + swap)
> MALLOC: +   4974067712 ( 4743.6 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   
> MALLOC: =  20136595648 (19203.8 MiB) Virtual address space used
> MALLOC:
> MALLOC:1852388  Spans in use
> MALLOC: 18  Thread heaps in use
> MALLOC:   8192  Tcmalloc page size
> 
> Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
> Bytes released to the OS take up virtual address space but no physical memory.
>
>
> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 0
> {
> "success": "mds_cache_size = '0' (not observed, change may require 
> restart) "
> }
>
> - Mail original -
> De: "Zheng Yan" 
> À: "aderumier" 
> Envoyé: Vendredi 25 Mai 2018 05:56:31
> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>
> On Thu, May 24, 2018 at 11:34 PM, Alexandre DERUMIER
>  wrote:
Still don't find any clue. Does the cephfs have idle period. If it
has, could you decrease mds's cache size and check what happens. For
example, run following commands during the old period.
>>
ceph daemon mds.xx flush journal
ceph daemon mds.xx config set mds_cache_size 1;
"wait a minute"
ceph tell mds.xx heap stats
ceph daemon mds.xx config set mds_cache_size 0
>>
>> ok thanks. I'll try this night.
>>
>> I have already mds_cache_memory_limit = 5368709120,
>>
>> does it need to remove it first before setting mds_cache_size 1 ?
>
> no
>>
>>
>>
>>
>> - Mail original -
>> De: "Zheng Yan" 
>> À: "aderumier" 
>> Cc: "ceph-users" 
>> Envoyé: Jeudi 24 Mai 2018 16:27:21
>> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>>
>> On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER  
>> wrote:
>>> Thanks!
>>>
>>>
>>> here the profile.pdf
>>>
>>> 10-15min profiling, I can't do it longer because my clients where lagging.
>>>
>>> but I think it should be enough to observe the rss memory increase.
>>>
>>>
>>
>> Still don't find any clue. Does the cephfs have idle period. If it
>> has, could you decrease mds's cache size and check what happens. For
>> example, run following commands during the old period.
>>
>> ceph daemon mds.xx flush journal
>> ceph daemon mds.xx config set mds_cache_size 1;
>> "wait a minute"
>> ceph tell mds.xx heap stats
>> ceph daemon mds.xx config set mds_cache_size 0
>>
>>
>>>
>>>
>>> - Mail original -
>>> De: "Zheng Yan" 
>>> À: "aderumier" 
>>> Cc: "ceph-users" 
>>> Envoyé: Jeudi 24 Mai 2018 11:34:20
>>> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>>>
>>> On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER  
>>> wrote:
 Hi,some new stats, mds memory is not 16G,

 I have almost same number of items and bytes in cache vs some weeks ago 
 when mds was using 8G. (ceph 12.2.5)


 root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf 
 dump | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools 
 | jq -c '.mds_co'; done
 16905052
 {"items":43350988,"bytes":5257428143}
 16905052
 {"items":43428329,"bytes":5283850173}
 16905052
 {"items":43209167,"bytes":5208578149}
 16905052
 {"items":43177631,"bytes":5198833577}
 16905052
 

Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Janne Johansson
Den fre 25 maj 2018 kl 00:20 skrev Jack :

> On 05/24/2018 11:40 PM, Stefan Kooman wrote:
> >> What are your thoughts, would you run 2x replication factor in
> >> Production and in what scenarios?
> Me neither, mostly because I have yet to read a technical point of view,
> from someone who read and understand the code
>
> I do not buy Janne's "trust me, I am an engineer", whom btw confirmed
> that the "replica 3" stuff is subject to probability and function to the
> cluster size, thus is not a generic "always-true" rule
>

I did not call for trust in _my_ experience or value, but in those who posted
the first "everyone should probably use 3 replicas", over which you showed doubt.
I agree with them, but did not intend to claim that my post had extra value
because it was written by me.

Also, the last part of my post was very much intended to add "not everything
about 3x is true for everyone", but if you value your data, it would be very
prudent to listen to experienced people who took risks and lost data before.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk is getting removed from master

2018-05-25 Thread Konstantin Shalygin

ceph-disk should be considered as "frozen" and deprecated for Mimic,
in favor of ceph-volume.


Will ceph-volume continue to support bare block devices, i.e. without the
LVM-ish stuff?






k


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can Bluestore work with 2 replicas or still need 3 for data integrity?

2018-05-25 Thread Pardhiv Karri
Thank you, Linh, for the info. I've started reading about this solution. It
could mean a lot of cost savings, though I need to check the limitations. I'm
not sure how it works with OpenStack as a front end to Ceph with erasure-coded
pools in Luminous.
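
From what I have read so far (a sketch only; pool and image names are made up and
assume an EC profile called ec-4-2 already exists), RBD, which OpenStack uses
underneath, can sit on an EC pool in Luminous as long as the image metadata stays
on a small replicated pool:

ceph osd pool create rbd-meta 64 64 replicated
ceph osd pool create rbd-ec-data 256 256 erasure ec-4-2
ceph osd pool set rbd-ec-data allow_ec_overwrites true
rbd create --size 100G --pool rbd-meta --data-pool rbd-ec-data test-image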

Thanks,
Pardhiv Karri


On Thu, May 24, 2018 at 6:39 PM, Linh Vu  wrote:

> You can use erasure code for your SSDs in Luminous if you're worried about
> cost per TB.
> --
> *From:* ceph-users  on behalf of
> Pardhiv Karri 
> *Sent:* Friday, 25 May 2018 11:16:07 AM
> *To:* ceph-users
> *Subject:* [ceph-users] Can Bluestore work with 2 replicas or still need
> 3 for data integrity?
>
> Hi,
>
> Can Ceph Bluestore in Luminous work with 2 replicas using crc32c checksum
> which is more powerful than hashing in filestore versions or do we still
> need 3 replicas for data integrity?
>
> In our current Hammer-filestore environment we are using 3 replicas with
> HDD but planning to move to Bluestore-Luminous all SSD. Due to the cost of
> SSD's want to know if 2 replica is good or still need 3.
>
> Thanks,
> Pardhiv Karri
>
>
>


-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com