[ceph-users] Re: monitor sst files continue growing

2020-11-13 Thread Zhenshi Zhou
Hi Wido,

thanks for the explanation. I think the root cause is that the disks are too
slow for compaction.
I added two new mons with SSDs to the cluster to speed it up and the issue is
now resolved.

That's good advice, and I plan to migrate my mons to bigger SSD disks.

Thanks again.
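
In case it helps anyone searching the archives later: a rough sketch of checking
the mon store size and triggering a manual compaction (mon ID and path are
placeholders, and this obviously won't help much if the mon disks themselves are
the bottleneck):

du -sh /var/lib/ceph/mon/ceph-<id>/store.db   # how big the store currently is
ceph tell mon.<id> compact                    # ask that monitor to compact its RocksDB store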

Wido den Hollander wrote on Fri, Oct 30, 2020 at 4:39 PM:

>
>
> On 29/10/2020 19:29, Zhenshi Zhou wrote:
> > Hi Alex,
> >
> > We found that there were a huge number of keys in the "logm" and "osdmap"
> > table
> > while using ceph-monstore-tool. I think that could be the root cause.
> >
>
> But that is exactly how Ceph works. It might need that very old OSDMap
> to get all the PGs clean again: an OSD which has been gone for a very
> long time needs that history to catch up and make its PGs clean.
>
> If not all PGs are active+clean you will and can see the MON databases
> grow rapidly.
>
> Therefore I always deploy 1TB SSDs in all Monitors. Not expensive anymore
> and they give breathing room.
>
> I always deploy physical and dedicated machines for Monitors just to
> prevent these cases.
>
> Wido
>
> > Well, some pages also say that disabling the 'insight' module can resolve
> > this issue, but I checked our cluster and we didn't enable this module.
> > Check this page .
> >
> > Anyway, our cluster is still unhealthy; it just needs time to keep
> > recovering data :)
> >
> > Thanks
> >
Alex Gracie wrote on Thu, Oct 29, 2020 at 10:57 PM:
> >
> >> We hit this issue over the weekend on our HDD backed EC Nautilus cluster
> >> while removing a single OSD. We also did not have any luck using
> >> compaction. The mon-logs filled up our entire root disk on the mon
> servers
> >> and we were running on a single monitor for hours while we tried to
> finish
> >> recovery and reclaim space. The past couple of weeks we also noticed "pg
> >> not scrubbed in time" errors but are unsure if they are related. I'm still
> >> unsure of the exact cause of this (other than the general misplaced/degraded
> >> objects) and what kind of growth is acceptable for these store.db files.
> >>
> >> In order to get our downed mons restarted, we ended up backing up and
> >> copying the /var/lib/ceph/mon/* contents to a remote host, setting up an
> >> sshfs mount to that new host with large NVME and SSDs, ensuring the
> mount
> >> paths were owned by ceph, then clearing up enough space on the monitor
> host
> >> to start the service. This allowed our store.db directory to grow freely
> >> until the misplaced/degraded objects could recover and monitors all
> >> rejoined eventually.
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] BLUEFS_SPILLOVER BlueFS spillover detected

2020-11-13 Thread Zhenshi Zhou
Hi,

I have a cluster running 14.2.8.
I created the OSDs with a dedicated PCIe device for WAL/DB when I deployed the
cluster, and set 72G for the DB and 3G for the WAL on each OSD.

And now my cluster has been stuck in a WARN state for a long time.
# ceph health detail
HEALTH_WARN BlueFS spillover detected on 1 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
 osd.63 spilled over 33 MiB metadata from 'db' device (1.5 GiB used of
72 GiB) to slow device

I looked this up on Google and found https://tracker.ceph.com/issues/38745
I'm not sure if it's the same issue.
How can I deal with this?
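
The options I've seen suggested so far (untested on my side, so treat this as a
sketch) are a manual compaction of the OSD's RocksDB, or moving the spilled-over
data back onto the DB device with ceph-bluestore-tool, if the version you have
supports bluefs-bdev-migrate, while the OSD is stopped:

ceph daemon osd.63 compact     # online RocksDB compaction via the admin socket

systemctl stop ceph-osd@63     # bluefs-bdev-migrate needs the OSD to be down
ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-63 \
    --devs-source /var/lib/ceph/osd/ceph-63/block \
    --dev-target /var/lib/ceph/osd/ceph-63/block.db
systemctl start ceph-osd@63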

THANKS
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Beginner's installation questions about network

2020-11-13 Thread Stefan Kooman
On 2020-11-13 21:19, E Taka wrote:
> Hello,
> 
> I want to install Ceph Octopus on Ubuntu 20.04. The nodes have 2
> network interfaces: 192.168.1.0/24 for the cluster network, and
> 10.10.0.0/16 is the public network. When I bootstrap with cephadm, which
> network do I use? That is, do I use cephadm bootstrap --mon-ip
> 192.168.1.1 or do I have to use the other network?
> 
> When adding the other hosts with ceph orch, which network do I have to use:
> ceph orch host add ceph03i 192.168.20.2 …  (or the 10.10 network)?

I found the following info in the documentation:

"You need to know which IP address to use for the cluster’s first
monitor daemon. This is normally just the IP for the first host. If
there are multiple networks and interfaces, be sure to choose one that
will be accessible by any host accessing the Ceph cluster."

So that would be the public network in your case. See [1].
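
A sketch with your example networks (the IPs are placeholders; use whatever
10.10.0.0/16 address the first host actually has):

cephadm bootstrap --mon-ip 10.10.0.1    # public-network address of the first host
ceph orch host add ceph03 10.10.0.3     # further hosts are also added by their public address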

Just curious: why do you want to use separate networks? You might as
well use bonded interfaces on the public network (i.e. LACP) and have
more redundancy there. I figure that you might even make more effective
use of the bandwidth as well.

Gr. Stefan

[1]:
https://docs.ceph.com/en/latest/cephadm/install/#bootstrap-a-new-cluster
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Beginner's installation questions about network

2020-11-13 Thread Janne Johansson
On Fri, Nov 13, 2020 at 21:50, E Taka <0eta...@gmail.com> wrote:

> Hi Stefan, the cluster network has its own switch and is faster than the
> public network.
> Thanks for pointing me to the documentation. I must have overlooked this
> sentence.
>
> But let me ask another question: do the OSDs use the cluster network
> "magically"? I did not find this in the docs, but that may be my fault…
>

Yes, the cluster network is used for OSD<->OSD traffic; all the rest (OSD->MON,
clients to OSD, rgw/mds -> OSD) goes via the public interface.
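
Roughly, the OSDs pick the cluster network up from the cluster_network option;
a sketch of how to set and verify it (adjust the networks to yours):

ceph config set global cluster_network 192.168.1.0/24
ceph config get osd cluster_network
ceph osd dump | grep '^osd\.'    # shows the addresses each OSD registered with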

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Beginner's installation questions about network

2020-11-13 Thread E Taka
Hi Stefan, the cluster network has its own switch and is faster than the
public network.
Thanks for pointing me to the documentation. I must have overlooked this
sentence.

But let me ask another question: do the OSDs use the cluster network
"magically"? I did not find this in the docs, but that may be my fault…

On Fri, Nov 13, 2020 at 21:40, Stefan Kooman wrote:

> On 2020-11-13 21:19, E Taka wrote:
> > Hello,
> >
> > I want to install Ceph Octopus on Ubuntu 20.04. The nodes have 2
> > network interfaces: 192.168.1.0/24 for the cluster network, and
> > 10.10.0.0/16 is the public network. When I bootstrap with cephadm, which
> > network do I use? That is, do I use cephadm bootstrap --mon-ip
> > 192.168.1.1 or do I have to use the other network?
> >
> > When adding the other hosts with ceph orch, which network do I have to use:
> > ceph orch host add ceph03i 192.168.20.2 …  (or the 10.10 network)?
>
> I found the following info in the documentation:
>
> "You need to know which IP address to use for the cluster’s first
> monitor daemon. This is normally just the IP for the first host. If
> there are multiple networks and interfaces, be sure to choose one that
> will be accessible by any host accessing the Ceph cluster."
>
> So that would be the public network in your case. See [1].
>
> Just curious: why do you want to use separate networks? You might as
> well use bonded interfaces on the public network (i.e. LACP) and have
> more redundancy there. I figure that you might even make more effective
> use of the bandwidth as well.
>
> Gr. Stefan
>
> [1]:
> https://docs.ceph.com/en/latest/cephadm/install/#bootstrap-a-new-cluster
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Beginner's installation questions about network

2020-11-13 Thread E Taka
Hello,

I want to install Ceph Octopus on Ubuntu 20.04. The nodes have 2
network interfaces: 192.168.1.0/24 for the cluster network, and
10.10.0.0/16 is the public network. When I bootstrap with cephadm, which
network do I use? That is, do I use cephadm bootstrap --mon-ip
192.168.1.1 or do I have to use the other network?

When adding the other hosts with ceph orch, which network do I have to use:
ceph orch host add ceph03i 192.168.20.2 …  (or the 10.10 network)?

Thanks, Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-13 Thread Nathan Fish
We have 12TB HDD OSDs with 32GiB of (Optane) NVMe for block.db, used
for cephfs_data pools, and NVMe-only OSDs used for cephfs_data pools.
The NVMe DB about doubled our random IO performance - a great
investment - doubling max CPU load as a result. We had to turn up "osd
op num threads per shard hdd" from 1 to 2. (2 is the default for
SSDs). This didn't noticeably improve performance, but without it,
OSDs under max load would sometimes fail to respond to heartbeats. So
with the load that we have - millions of mostly small files on CephFS
- I wouldn't go below 2 real cores per OSD. But this may be a fringe
workload.
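
For reference, the knob mentioned above can be set roughly like this (a sketch;
I believe the OSDs need a restart for it to take effect):

ceph config set osd osd_op_num_threads_per_shard_hdd 2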

On Fri, Nov 13, 2020 at 3:36 AM Frank Schilder  wrote:
>
> > If each OSD requires 4T
>
> Nobody said that. What was said is HDD=1T,  SSD=3T. It depends on the drive 
> type!
>
> The %-utilisation information is just from top observed during heavy load. It 
> does not show how the kernel schedules things on physical Ts. So, 2x50% 
> utilisation could run on the same HT. I don't know how the OSDs are organised 
> into threads, I'm just stating observations from real life (mimic cluster). 
> So, for an SSD OSD I have seen a maximum of 4 threads in R state, two with 
> 100% and two with 50% CPU, a load that fits on 3HT.
>
> So, real life says 1HT per HDD and 3HT per SSD plus a bit for kernel and 
> networking and you are set - based on worst-case performance monitoring I 
> have seen in 2 years. Note that this is worst-case load. The average load is 
> much lower.
>
> A 16 core machine is totally overpowered. Assuming 1C=2HT, I count 
> (2*3+8*1)/2=7 or (1*3+10*1)/2=6.5. So an 8 core CPU should do in either case. 
> A 10 core CPU might be better, but 16C is a waste of money.
>
> I should mention that these estimates apply to Intel CPUs (x86_64 
> architectures). Other architectures might not provide the same cycle 
> efficiency.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Tony Liu 
> Sent: 13 November 2020 08:32:55
> To: Frank Schilder; Nathan Fish
> Cc: ceph-users@ceph.io
> Subject: RE: [ceph-users] Re: which of cpu frequency and number of threads 
> servers osd better?
>
> You all mentioned first 2T and another 2T. Could you give more
> details how OSD works with multi-thread, or share the link if
> it's already documented somewhere?
>
> Is it always 4T, or start with 1T and grow up to 4T? Is it max 4T?
> Does each T run different job or just multiple instances of the
> same job? Does disk type affect how T works, like 1T is good enough
> for HDD while 4T is required for SSD?
>
> If I change my plan to 2 SSD OSDs and 8 HDD OSDs (with 1 SSD for
> WAL and DB). If each OSD requires 4T, then 16C/32T 3.0GHz could
> be a better choice, because it provides sufficient Ts?
> If SSD OSD requires 4T and HDD OSD only requires 1T, then 8C/16T
> 3.2GHz would be better, because it provides sufficient Ts as well
> as stronger computing?
>
> Thanks!
> Tony
> > -Original Message-
> > From: Frank Schilder 
> > Sent: Thursday, November 12, 2020 10:59 PM
> > To: Tony Liu ; Nathan Fish 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Re: which of cpu frequency and number of
> > threads servers osd better?
> >
> > I think this depends on the type of backing disk. We use the following
> > CPUs:
> >
> > Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
> > Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
> > Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
> >
> > My experience is, that a HDD OSD hardly gets to 100% of 1 hyper thread
> > load even under heavy recovery/rebalance operations on 8+2 and 6+2 EC
> > pools with compression set to aggressive. The CPU is mostly doing wait-
> > IO, that is, the disk is the real bottle neck, not the processor power.
> > With SSDs I have seen 2HT at 100% and 2 more at 50% each. I guess NVMe
> > might be more demanding.
> >
> > A server with 12 HDD and 1 SSD should be fine with a modern CPU with 8
> > cores. 16 threads sounds like an 8 core CPU. The 2nd generation Intel®
> > Xeon® Silver 4209T with 8 cores should easily handle that (single socket
> > system). We have the 16-core Intel silver in a dual socket system
> > currently connected to 5HDD and 7SSD and I did a rebalance operation
> > yesterday. The CPU user load did not exceed 2%, it can handle OSD
> > processes easily. The server is dimensioned to run up to 12HDD and 14SSD
> > OSDs (Dell R740xd2). As far as I can tell, the CPU configuration is
> > overpowered for that.
> >
> > Just for info, we use ganglia to record node utilisation. I use 1-year
> > records and pick peak loads I observed for dimensioning the CPUs. These
> > records include some very heavy recovery periods.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Tony Liu 
> > Sent: 13 November 2020 04:57:53
> > To: Nathan Fish
> > Cc: 

[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-13 Thread Tony Liu
Thank you Frank for the clarification!
Tony
> -Original Message-
> From: Frank Schilder 
> Sent: Friday, November 13, 2020 12:37 AM
> To: Tony Liu ; Nathan Fish 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: which of cpu frequency and number of
> threads servers osd better?
> 
> > If each OSD requires 4T
> 
> Nobody said that. What was said is HDD=1T,  SSD=3T. It depends on the
> drive type!
> 
> The %-utilisation information is just from top observed during heavy
> load. It does not show how the kernel schedules things on physical Ts.
> So, 2x50% utilisation could run on the same HT. I don't know how the
> OSDs are organised into threads, I'm just stating observations from real
> life (mimic cluster). So, for an SSD OSD I have seen a maximum of 4
> threads in R state, two with 100% and two with 50% CPU, a load that fits
> on 3HT.
> 
> So, real life says 1HT per HDD and 3HT per SSD plus a bit for kernel and
> networking and you are set - based on worst-case performance monitoring
> I have seen in 2 years. Note that this is worst-case load. The average
> load is much lower.
> 
> A 16 core machine is totally overpowered. Assuming 1C=2HT, I count
> (2*3+8*1)/2=7 or (1*3+10*1)/2=6.5. So an 8 core CPU should do in either
> case. A 10 core CPU might be better, but 16C is a waste of money.
> 
> I should mention that these estimates apply to Intel CPUs (x86_64
> architectures). Other architectures might not provide the same cycle
> efficiency.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Tony Liu 
> Sent: 13 November 2020 08:32:55
> To: Frank Schilder; Nathan Fish
> Cc: ceph-users@ceph.io
> Subject: RE: [ceph-users] Re: which of cpu frequency and number of
> threads servers osd better?
> 
> You all mentioned first 2T and another 2T. Could you give more details
> how OSD works with multi-thread, or share the link if it's already
> documented somewhere?
> 
> Is it always 4T, or start with 1T and grow up to 4T? Is it max 4T?
> Does each T run different job or just multiple instances of the same job?
> Does disk type affect how T works, like 1T is good enough for HDD while
> 4T is required for SSD?
> 
> If I change my plan to 2 SSD OSDs and 8 HDD OSDs (with 1 SSD for WAL and
> DB). If each OSD requires 4T, then 16C/32T 3.0GHz could be a better
> choice, because it provides sufficient Ts?
> If SSD OSD requires 4T and HDD OSD only requires 1T, then 8C/16T 3.2GHz
> would be better, because it provides sufficient Ts as well as stronger
> computing?
> 
> Thanks!
> Tony
> > -Original Message-
> > From: Frank Schilder 
> > Sent: Thursday, November 12, 2020 10:59 PM
> > To: Tony Liu ; Nathan Fish
> > 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Re: which of cpu frequency and number of
> > threads servers osd better?
> >
> > I think this depends on the type of backing disk. We use the following
> > CPUs:
> >
> > Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
> > Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
> > Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
> >
> > My experience is, that a HDD OSD hardly gets to 100% of 1 hyper thread
> > load even under heavy recovery/rebalance operations on 8+2 and 6+2 EC
> > pools with compression set to aggressive. The CPU is mostly doing
> > wait- IO, that is, the disk is the real bottle neck, not the processor
> power.
> > With SSDs I have seen 2HT at 100% and 2 more at 50% each. I guess NVMe
> > might be more demanding.
> >
> > A server with 12 HDD and 1 SSD should be fine with a modern CPU with 8
> > cores. 16 threads sounds like an 8 core CPU. The 2nd generation Intel®
> > Xeon® Silver 4209T with 8 cores should easily handle that (single
> > socket system). We have the 16-core Intel silver in a dual socket
> > system currently connected to 5HDD and 7SSD and I did a rebalance
> > operation yesterday. The CPU user load did not exceed 2%, it can
> > handle OSD processes easily. The server is dimensioned to run up to
> > 12HDD and 14SSD OSDs (Dell R740xd2). As far as I can tell, the CPU
> > configuration is overpowered for that.
> >
> > Just for info, we use ganglia to record node utilisation. I use 1-year
> > records and pick peak loads I observed for dimensioning the CPUs.
> > These records include some very heavy recovery periods.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Tony Liu 
> > Sent: 13 November 2020 04:57:53
> > To: Nathan Fish
> > Cc: ceph-users@ceph.io
> > Subject: [ceph-users] Re: which of cpu frequency and number of threads
> > servers osd better?
> >
> > Thanks Nathan!
> > Tony
> > > -Original Message-
> > > From: Nathan Fish 
> > > Sent: Thursday, November 12, 2020 7:43 PM
> > > To: Tony Liu 
> > > Cc: ceph-users@ceph.io
> > > Subject: Re: [ceph-users] which of cpu frequency and number of
> > > threads servers 

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-13 Thread Eric Ivancich
Thank you for the answers to those questions, Janek.

And in case anyone hasn’t seen it, we do have a tracker for this issue:

https://tracker.ceph.com/issues/47866

We may want to move most of the conversation to the comments there, so 
everything’s together.

I do want to follow up on your answer to Question 4, Janek:

> On Nov 13, 2020, at 12:22 PM, Janek Bevendorff wrote:
>> 
>> 4. Is anyone experiencing this issue willing to run their RGWs with 
>> 'debug_ms=1'? That would allow us to see a request from an RGW to either 
>> remove a tail object or decrement its reference counter (and when its 
>> counter reaches 0 it will be deleted).
> I haven't had any new data loss in the last few days (at least I think so, I 
> read 1byte from all objects, but didn't compare checksums, so I cannot say if 
> all objects are complete, but at least all are there).
> 
With multipart uploads I believe this is a sufficient test, as the first bit of 
data is in the first tail object, and it’s tail objects that seem to be 
disappearing.

However if the object is not uploaded via multipart and if it does have tail 
(_shadow_) objects, then the initial data is stored in the head object. So this 
test would not be truly diagnostic. This could be done with a large object, for 
example, with `s3cmd put --disable-multipart …`.
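
Roughly, such a check could look like this (a sketch; the pool name is the
default data pool, adjust to your zone, and <bucket>/<key>/<prefix> are
placeholders):

radosgw-admin object stat --bucket=<bucket> --object=<key>    # shows the manifest incl. the prefix
rados -p default.rgw.buckets.data ls | grep '<prefix>'        # list the head/tail rados objects
s3cmd put --disable-multipart <bigfile> s3://<bucket>/<key>   # upload a large test object without multipart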

Eric

--
J. Eric Ivancich
he / him / his
Red Hat Storage
Ann Arbor, Michigan, USA
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-13 Thread Janek Bevendorff


1. It seems like those reporting this issue are seeing it strictly 
after upgrading to Octopus. From what version did each of these sites 
upgrade to Octopus? From Nautilus? Mimic? Luminous?


I upgraded from the latest Luminous release.




2. Does anyone have any lifecycle rules on a bucket experiencing this 
issue? If so, please describe.


Nope.




3. Is anyone making copies of the affected objects (to same or to a 
different bucket) prior to seeing the issue? And if they are making 
copies, does the destination bucket have lifecycle rules? And if they 
are making copies, are those copies ever being removed?


We are not making copies, but we have bucket ACLs in place, which allow 
different users to access the objects. I doubt this is the problem 
though, otherwise we probably would have lost terabytes upon terabytes 
and not 16 objects so far.




4. Is anyone experiencing this issue willing to run their RGWs with 
'debug_ms=1'? That would allow us to see a request from an RGW to 
either remove a tail object or decrement its reference counter (and 
when its counter reaches 0 it will be deleted).


I haven't had any new data loss in the last few days (at least I think 
so, I read 1byte from all objects, but didn't compare checksums, so I 
cannot say if all objects are complete, but at least all are there).





Thanks,

Eric


On Nov 12, 2020, at 4:54 PM, huxia...@horebdata.cn wrote:


Looks like this is a very dangerous bug for data safety. Hope the bug
will be quickly identified and fixed.


best regards,

Samuel



huxia...@horebdata.cn 

From: Janek Bevendorff
Date: 2020-11-12 18:17
To: huxia...@horebdata.cn; EDH - Manuel Rios; Rafael Lopez

CC: Robin H. Johnson; ceph-users
Subject: Re: [ceph-users] Re: NoSuchKey on key that is visible in s3 
list/radosgw bk
I have never seen this on Luminous. I recently upgraded to Octopus 
and the issue started occurring only few weeks later.


On 12/11/2020 16:37, huxia...@horebdata.cn 
 wrote:
which Ceph versions are affected by this RGW bug/issue? Luminous,
Mimic, Octopus, or the latest?


any idea?

samuel



huxia...@horebdata.cn 

From: EDH - Manuel Rios
Date: 2020-11-12 14:27
To: Janek Bevendorff; Rafael Lopez
CC: Robin H. Johnson; ceph-users
Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 
list/radosgw bk
This same error caused us to wipe a full cluster of 300TB... it is probably
related to some RADOS index/database bug, not to S3.


As Janek explained, it is a major issue, because the error happens silently
and you can only detect it with S3, when you're going to delete/purge
an S3 bucket and it returns NoSuchKey. The error is not related to S3 logic.


Hope this time the devs can take enough time to find and resolve the
issue. The error happens with low EC profiles, even with replica x3 in
some cases.


Regards



-Original Message-
From: Janek Bevendorff
Sent: Thursday, November 12, 2020 14:06
To: Rafael Lopez
CC: Robin H. Johnson; ceph-users
Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk


Here is a bug report concerning (probably) this exact issue:
https://tracker.ceph.com/issues/47866

I left a comment describing the situation and my (limited) 
experiences with it.



On 11/11/2020 10:04, Janek Bevendorff wrote:


Yeah, that seems to be it. There are 239 objects prefixed
.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none
of the multiparts from the other file to be found and the head object
is 0 bytes.

I checked another multipart object with an end pointer of 11.
Surprisingly, it had way more than 11 parts (39 to be precise) named
.1, .1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I
could find them in the dump at least.

I have no idea why the objects disappeared. I ran a Spark job over all
buckets, read 1 byte of every object and recorded errors. Of the 78
buckets, two are missing objects. One bucket is missing one object,
the other 15. So, luckily, the incidence is still quite low, but the
problem seems to be expanding slowly.


On 10/11/2020 23:46, Rafael Lopez wrote:

Hi Janek,

What you said sounds right - an S3 single part obj won't have an S3
multipart string as part of the prefix. S3 multipart string looks
like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".

From memory, single part S3 objects that don't fit in a single rados
object are assigned a random prefix that has nothing to do with
the object name, and the rados tail/data objects (not the head
object) have that prefix.
As per your working example, the prefix for that would be
'.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow"
objects with names containing that prefix, and if you add up the
sizes it should be 

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-13 Thread Eric Ivancich
I have some questions for those who’ve experienced this issue.

1. It seems like those reporting this issue are seeing it strictly after 
upgrading to Octopus. From what version did each of these sites upgrade to 
Octopus? From Nautilus? Mimic? Luminous?

2. Does anyone have any lifecycle rules on a bucket experiencing this issue? If 
so, please describe.

3. Is anyone making copies of the affected objects (to same or to a different 
bucket) prior to seeing the issue? And if they are making copies, does the 
destination bucket have lifecycle rules? And if they are making copies, are 
those copies ever being removed?

4. Is anyone experiencing this issue willing to run their RGWs with 
'debug_ms=1'? That would allow us to see a request from an RGW to either remove 
a tail object or decrement its reference counter (and when its counter reaches 
0 it will be deleted).
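
For example (a sketch; the instance name and how you restart the gateway depend
on your deployment):

# in ceph.conf on the RGW host, followed by a restart of the gateway:
[client.rgw.<instance-name>]
    debug ms = 1

# or at runtime through the RGW admin socket:
ceph daemon client.rgw.<instance-name> config set debug_ms 1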

Thanks,

Eric


> On Nov 12, 2020, at 4:54 PM, huxia...@horebdata.cn wrote:
> 
> Looks like this is a very dangerous bug for data safety. Hope the bug will
> be quickly identified and fixed.
> 
> best regards,
> 
> Samuel
> 
> 
> 
> huxia...@horebdata.cn 
> 
> From: Janek Bevendorff
> Date: 2020-11-12 18:17
> To: huxia...@horebdata.cn ; EDH - Manuel Rios; 
> Rafael Lopez
> CC: Robin H. Johnson; ceph-users
> Subject: Re: [ceph-users] Re: NoSuchKey on key that is visible in s3 
> list/radosgw bk
> I have never seen this on Luminous. I recently upgraded to Octopus and the 
> issue started occurring only few weeks later.
> 
> On 12/11/2020 16:37, huxia...@horebdata.cn wrote:
> which Ceph versions are affected by this RGW bug/issue? Luminous, Mimic,
> Octopus, or the latest?
> 
> any idea?
> 
> samuel
> 
> 
> 
> huxia...@horebdata.cn
> 
> From: EDH - Manuel Rios
> Date: 2020-11-12 14:27
> To: Janek Bevendorff; Rafael Lopez
> CC: Robin H. Johnson; ceph-users
> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw 
> bk
> This same error caused us to wipe a full cluster of 300TB... it is probably
> related to some RADOS index/database bug, not to S3.
> 
> As Janek explained, it is a major issue, because the error happens silently and
> you can only detect it with S3, when you're going to delete/purge an S3 bucket
> and it returns NoSuchKey. The error is not related to S3 logic.
> 
> Hope this time the devs can take enough time to find and resolve the issue.
> The error happens with low EC profiles, even with replica x3 in some cases.
> 
> Regards
> 
> 
> 
> -Original Message-
> From: Janek Bevendorff
> Sent: Thursday, November 12, 2020 14:06
> To: Rafael Lopez <rafael.lo...@monash.edu>
> CC: Robin H. Johnson <robb...@gentoo.org>; ceph-users <ceph-users@ceph.io>
> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw
> bk
> 
> Here is a bug report concerning (probably) this exact issue: 
> https://tracker.ceph.com/issues/47866 
> 
> I left a comment describing the situation and my (limited) experiences with 
> it.
> 
> 
> On 11/11/2020 10:04, Janek Bevendorff wrote:
>> 
>> Yeah, that seems to be it. There are 239 objects prefixed 
>> .8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none 
>> of the multiparts from the other file to be found and the head object 
>> is 0 bytes.
>> 
>> I checked another multipart object with an end pointer of 11. 
>> Surprisingly, it had way more than 11 parts (39 to be precise) named 
>> .1, .1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I 
>> could find them in the dump at least.
>> 
>> I have no idea why the objects disappeared. I ran a Spark job over all 
>> buckets, read 1 byte of every object and recorded errors. Of the 78 
>> buckets, two are missing objects. One bucket is missing one object, 
>> the other 15. So, luckily, the incidence is still quite low, but the 
>> problem seems to be expanding slowly.
>> 
>> 
>> On 10/11/2020 23:46, Rafael Lopez wrote:
>>> Hi Janek,
>>> 
>>> What you said sounds right - an S3 single part obj won't have an S3 
>>> multipart string as part of the prefix. S3 multipart string looks 
>>> like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".
>>> 
>>> From memory, single part S3 objects that don't fit in a single rados 
>>> object are assigned a random prefix that has nothing to do with 
>>> the object name, and the rados tail/data objects (not the head 
>>> object) have that prefix.
>>> As per your working example, the prefix for that would be 
>>> '.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow" 
>>> objects with names containing that prefix, and if you add up the 
>>> sizes it should be the size of your S3 object.
>>> 
>>> You should look at working and non working examples of both single 
>>> and multipart S3 objects, as they are probably all a bit different 
>>> when you look in rados.
>>> 
>>> I agree it is a serious issue, because 

[ceph-users] build nautilus 14.2.13 packages and container

2020-11-13 Thread Engelmann Florian
Hi,

I was not able to find any complete guide on how to build ceph (14.2.x) from 
source, create packages and build containers based on those packages.

Ubuntu or centos, does not matter.

I tried so far:
###
docker pull centos:7
docker run -ti centos:7 /bin/bash

yum install -y git rpm-build rpmdevtools wget epel-release
yum install -y python-virtualenv python-pip jq cmake3 make gcc-c++ rpm-build 
which sudo createrepo

git clone https://github.com/ceph/ceph
cd ceph
git checkout v14.2.13
./make-srpm.sh
./install-deps.sh
#

but install-deps.sh fails with:
Error: No Package found for python-scipy

The following error message appeared before:

http://vault.centos.org/centos/7/sclo/Source/rh/repodata/repomd.xml: [Errno 14] 
HTTP Error 404 - Not Found
Trying other mirror.
To address this issue please refer to the below wiki article

https://wiki.centos.org/yum-errors

If above article doesn't help to resolve this issue please use 
https://bugs.centos.org/.

http://vault.centos.org/centos/7/sclo/Source/sclo/repodata/repomd.xml: [Errno 
14] HTTP Error 404 - Not Found


centos:8 fails as well, with a dependency error:
Error:
 Problem: package 
python36-rpm-macros-3.6.8-2.module_el8.1.0+245+c39af44f.noarch conflicts with 
python-modular-rpm-macros > 3.6 provided by 
python38-rpm-macros-3.8.0-6.module_el8.2.0+317+61fa6e7d.noarch
  - conflicting requests

Any helpful links?

All the best,
Florian

EveryWare AG
Florian Engelmann
Cloud Platform Architect
Zurlindenstrasse 52a
CH-8003 Zürich

T  +41 44 466 60 00
F  +41 44 466 60 10

florian.engelm...@everyware.ch
www.everyware.ch


smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to Improve RGW Bucket Stats Performance

2020-11-13 Thread Denis Krienbühl
Hi!

To bill our customers we regularly call radosgw-admin bucket stats --uid .

Since upgrading from Mimic to Octopus (with a short stop at Nautilus), we’ve 
been seeing much slower response times for this command.

It went from less than a minute for our largest customers, to 5 minutes (with 
some variance depending on load).

Assuming this is not a bug, is there any way to get these stats quicker?

Ceph seems to do this in a single call here, which seems to me like something you
could spread out over time (keep a counter somewhere and just return the latest
value on request).

One thing we did notice, is that we get a lot of these when the stats are 
synced:

2020-11-13T14:56:17.288+0100 7f15347e0700  0 check_bucket_shards: 
resharding needed: stats.num_objects=5776982 shard max_objects=320

Could that hint at a problem in our configuration?
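
From the docs it looks like I can at least check the shard status and reshard
manually, roughly like this (untested here; bucket name and shard count are
placeholders):

radosgw-admin bucket limit check                               # per-bucket objects-per-shard fill status
radosgw-admin reshard list                                     # anything queued by dynamic resharding
radosgw-admin reshard add --bucket=<bucket> --num-shards=<n>   # queue a manual reshard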

Anything else we could maybe tune to get this time down?

Appreciate any hints and I hope everyone is about to have a great weekend.

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-13 Thread Frank Schilder
> If each OSD requires 4T

Nobody said that. What was said is HDD=1T,  SSD=3T. It depends on the drive 
type!

The %-utilisation information is just from top observed during heavy load. It 
does not show how the kernel schedules things on physical Ts. So, 2x50% 
utilisation could run on the same HT. I don't know how the OSDs are organised 
into threads, I'm just stating observations from real life (mimic cluster). So, 
for an SSD OSD I have seen a maximum of 4 threads in R state, two with 100% and 
two with 50% CPU, a load that fits on 3HT.

So, real life says 1HT per HDD and 3HT per SSD plus a bit for kernel and 
networking and you are set - based on worst-case performance monitoring I have 
seen in 2 years. Note that this is worst-case load. The average load is much 
lower.

A 16 core machine is totally overpowered. Assuming 1C=2HT, I count 
(2*3+8*1)/2=7 or (1*3+10*1)/2=6.5. So an 8 core CPU should do in either case. A 
10 core CPU might be better, but 16C is a waste of money.

I should mention that these estimates apply to Intel CPUs (x86_64 
architectures). Other architectures might not provide the same cycle efficiency.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Tony Liu 
Sent: 13 November 2020 08:32:55
To: Frank Schilder; Nathan Fish
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Re: which of cpu frequency and number of threads 
servers osd better?

You all mentioned first 2T and another 2T. Could you give more
details how OSD works with multi-thread, or share the link if
it's already documented somewhere?

Is it always 4T, or start with 1T and grow up to 4T? Is it max 4T?
Does each T run different job or just multiple instances of the
same job? Does disk type affect how T works, like 1T is good enough
for HDD while 4T is required for SSD?

If I change my plan to 2 SSD OSDs and 8 HDD OSDs (with 1 SSD for
WAL and DB). If each OSD requires 4T, then 16C/32T 3.0GHz could
be a better choice, because it provides sufficient Ts?
If SSD OSD requires 4T and HDD OSD only requires 1T, then 8C/16T
3.2GHz would be better, because it provides sufficient Ts as well
as stronger computing?

Thanks!
Tony
> -Original Message-
> From: Frank Schilder 
> Sent: Thursday, November 12, 2020 10:59 PM
> To: Tony Liu ; Nathan Fish 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: which of cpu frequency and number of
> threads servers osd better?
>
> I think this depends on the type of backing disk. We use the following
> CPUs:
>
> Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
> Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
> Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
>
> My experience is, that a HDD OSD hardly gets to 100% of 1 hyper thread
> load even under heavy recovery/rebalance operations on 8+2 and 6+2 EC
> pools with compression set to aggressive. The CPU is mostly doing wait-
> IO, that is, the disk is the real bottle neck, not the processor power.
> With SSDs I have seen 2HT at 100% and 2 more at 50% each. I guess NVMe
> might be more demanding.
>
> A server with 12 HDD and 1 SSD should be fine with a modern CPU with 8
> cores. 16 threads sounds like an 8 core CPU. The 2nd generation Intel®
> Xeon® Silver 4209T with 8 cores should easily handle that (single socket
> system). We have the 16-core Intel silver in a dual socket system
> currently connected to 5HDD and 7SSD and I did a rebalance operation
> yesterday. The CPU user load did not exceed 2%, it can handle OSD
> processes easily. The server is dimensioned to run up to 12HDD and 14SSD
> OSDs (Dell R740xd2). As far as I can tell, the CPU configuration is
> overpowered for that.
>
> Just for info, we use ganglia to record node utilisation. I use 1-year
> records and pick peak loads I observed for dimensioning the CPUs. These
> records include some very heavy recovery periods.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Tony Liu 
> Sent: 13 November 2020 04:57:53
> To: Nathan Fish
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: which of cpu frequency and number of threads
> servers osd better?
>
> Thanks Nathan!
> Tony
> > -Original Message-
> > From: Nathan Fish 
> > Sent: Thursday, November 12, 2020 7:43 PM
> > To: Tony Liu 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] which of cpu frequency and number of threads
> > servers osd better?
> >
> > From what I've seen, OSD daemons tend to bottleneck on the first 2
> > threads, while getting some use out of another 2. So 32 threads at 3.0
> > would be a lot better. Note that you may get better performance
> > splitting off some of that SSD for block.db partitions or at least
> > block.wal for the HDDs.
> >
> > On Thu, Nov 12, 2020 at 9:57 PM Tony Liu 
> wrote:
> > >
> > > Hi,
> > >
> > > For example, 16 threads with 3.2GHz and 32 threads with 3.0GHz,
> > > which makes 11 OSDs (10x12TB HDD and 

[ceph-users] Re: question about rgw delete speed

2020-11-13 Thread Adrian Nicolae

Hi Brent,

Thanks for your input.

We will use Swift instead of S3. The deletes are mainly done by our 
customers using the sync app (i.e., they are syncing their folders with 
their storage accounts, and every file change is translated to a delete in 
the cloud). We have a frontend cluster between the customers and the 
storage providing access via FTP/HTTP/WebDAV and so on.


 The delete speed is important for us because we want to reclaim the 
'deleted' storage capacity as quickly as possible so we can keep the costs 
down. I'm pretty obsessed with that because I went through some 
nightmares in the past when our storage was full for some time :).
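
(For context, my understanding is that with RGW the space only comes back once
the garbage collector has processed the deleted objects, so the GC backlog is
what I intend to watch; a rough sketch of the commands:)

radosgw-admin gc list --include-all | head   # how much is still queued for deletion
radosgw-admin gc process                     # trigger a garbage-collection run manually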


For the network we will probably use 1x10Gbps or 2x10Gbps for every OSD 
server.


Actually I have a Ceph cluster in production already, but it's been used 
as secondary storage. We are moving the 'cold data' (i.e. the big files) 
from the primary storage there. We are using it as secondary 
storage because it was deployed with 'zombie' OSD servers (old DDR2 
servers recovered from other projects but with new SATA drives). It's 
working very well so far and it helped us lower the costs by 
30-40% per TB, but we cannot use it as primary.




On 11/13/2020 1:07 AM, Brent Kennedy wrote:

Ceph is definitely a good choice for storing millions of files.  It sounds like 
you plan to use this like S3, so my first question would be: are the deletes 
done for a specific reason? (e.g. the files are used for a process and 
discarded.)  If it's an age thing, you can set the files to expire when putting 
them in, and then Ceph will automatically clear them.

The more spinners you have the more performance you will end up with.  Network 
10Gb or higher?

Octopus is production stable and contains many performance enhancements.  
Depending on the OS, you may not be able to upgrade from nautilus until they 
work out that process ( e.g. centos 7/8 ).

Delete speed is not that great but you would have to test it with your cluster 
to see how it performs for your use case.  If you have enough space present, is 
there a process that breaks if the files are not deleted?


Regards,
-Brent

Existing Clusters:
Test: Ocotpus 15.2.5 ( all virtual on nvme )
US Production(HDD): Nautilus 14.2.11 with 11 osd servers, 3 mons, 4 gateways, 2 
iscsi gateways
UK Production(HDD): Nautilus 14.2.11 with 18 osd servers, 3 mons, 4 gateways, 2 
iscsi gateways
US Production(SSD): Nautilus 14.2.11 with 6 osd servers, 3 mons, 4 gateways, 2 
iscsi gateways
UK Production(SSD): Octopus 15.2.5 with 5 osd servers, 3 mons, 4 gateways




-Original Message-
From: Adrian Nicolae 
Sent: Wednesday, November 11, 2020 3:42 PM
To: ceph-users 
Subject: [ceph-users] question about rgw delete speed


Hey guys,


I'm in charge of a local cloud-storage service. Our primary object storage is a 
vendor-based one and I want to replace it in the near future with Ceph with the 
following setup :

- 6 OSD servers with 36 SATA 16TB drives each and 3 big NVME per server
(1 big NVME for every 12 drives so I can reserve 300GB NVME storage for every 
SATA drive), 3 MON, 2 RGW with Epyc 7402p and 128GB RAM. So in the end we'll 
have ~ 3PB of raw data and 216 SATA drives.

Currently we have ~ 100 millions of files on the primary storage with the 
following distribution :

- ~10% = very small files ( less than 1MB - thumbnails, text files and 
so on)

- ~60%= small files (between 1MB and 10MB)

-  20% = medium files ( between 10MB and 1GB)

- 10% = big files (over 1GB).

My main concern is the speed of delete operations. We have around 500k-600k 
delete ops every 24 hours, so quite a lot. Our current storage is not deleting 
all the files fast enough (it's always 1 week-10 days
behind); I guess it is not only a software issue, and the delete speed 
will probably get better if we add more drives (we now have 108).

What do you think about Ceph delete speed ? I read on other threads that it's 
not very fast . I wonder if this hw setup can handle our current delete load 
better than our current storage. On RGW servers I want to use Swift , not S3.

And another question: can I start deploying the latest Ceph version (Octopus) 
directly in production, or is it safer to start with Nautilus until 
Octopus is more stable?

Any input would be greatly appreciated !


Thanks,

Adrian.




___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-13 Thread Frank Schilder
I think this depends on the type of backing disk. We use the following CPUs:

Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz

My experience is, that a HDD OSD hardly gets to 100% of 1 hyper thread load 
even under heavy recovery/rebalance operations on 8+2 and 6+2 EC pools with 
compression set to aggressive. The CPU is mostly doing wait-IO, that is, the 
disk is the real bottle neck, not the processor power. With SSDs I have seen 
2HT at 100% and 2 more at 50% each. I guess NVMe might be more demanding.

A server with 12 HDD and 1 SSD should be fine with a modern CPU with 8 cores. 
16 threads sounds like an 8 core CPU. The 2nd generation Intel® Xeon® Silver 
4209T with 8 cores should easily handle that (single socket system). We have 
the 16-core Intel silver in a dual socket system currently connected to 5HDD 
and 7SSD and I did a rebalance operation yesterday. The CPU user load did not 
exceed 2%, it can handle OSD processes easily. The server is dimensioned to run 
up to 12HDD and 14SSD OSDs (Dell R740xd2). As far as I can tell, the CPU 
configuration is overpowered for that.

Just for info, we use ganglia to record node utilisation. I use 1-year records 
and pick peak loads I observed for dimensioning the CPUs. These records include 
some very heavy recovery periods.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Tony Liu 
Sent: 13 November 2020 04:57:53
To: Nathan Fish
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: which of cpu frequency and number of threads servers 
osd better?

Thanks Nathan!
Tony
> -Original Message-
> From: Nathan Fish 
> Sent: Thursday, November 12, 2020 7:43 PM
> To: Tony Liu 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] which of cpu frequency and number of threads
> servers osd better?
>
> From what I've seen, OSD daemons tend to bottleneck on the first 2
> threads, while getting some use out of another 2. So 32 threads at 3.0
> would be a lot better. Note that you may get better performance
> splitting off some of that SSD for block.db partitions or at least
> block.wal for the HDDs.
>
> On Thu, Nov 12, 2020 at 9:57 PM Tony Liu  wrote:
> >
> > Hi,
> >
> > For example, 16 threads with 3.2GHz and 32 threads with 3.0GHz, which
> > makes 11 OSDs (10x12TB HDD and 1x960GB SSD) with better performance?
> >
> >
> > Thanks!
> > Tony
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OverlayFS with Cephfs to mount a snapshot read/write

2020-11-13 Thread Frédéric Nass

Hi Jeff,

I understand the idea behind patch [1] but it breaks the operation of overlayfs 
with cephfs. Should the patch be abandoned and tests be modified or should 
overlayfs code be adapted to work with cephfs, if that's possible?
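
(For context, the setup I'm trying to get working again is roughly the
following, with example paths; the lower layer is a CephFS snapshot and the
upper/work dirs live on a local filesystem:)

mount -t overlay overlay \
  -o lowerdir=/mnt/cephfs/mydir/.snap/snap1,upperdir=/upperdir,workdir=/workdir \
  /mnt/merged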

Either way, it'd be nice if overlayfs could work again with cephfs out of the 
box without requiring users to patch and build their own kernels past 5.4+.

Regards,

Frédéric.

[1] https://www.spinics.net/lists/ceph-devel/msg46183.html (CCing Greg)

PS: Please forgive me if you received this message twice. My previous message 
was flagged as spam because the dynamic IP address of my router was seen in a 
spam campaign in the past, so I sent it again.

- Le 9 Nov 20, à 19:52, Jeff Layton jlay...@kernel.org a écrit :

> Yes, you'd have to apply the patch to that kernel yourself. No RHEL7
> kernels have that patch (so far). Newer RHEL8 kernels _do_ if that's an
> option for you.
> -- Jeff
> 
> On Mon, 2020-11-09 at 19:21 +0100, Frédéric Nass wrote:
>> I feel lucky to have you on this one. ;-) Do you mean applying a
>> specific patch to the 3.10 kernel? Or is this one too old to have it working
>> anyway?
>> 
>> Frédéric.
>> 
>> Le 09/11/2020 à 19:07, Luis Henriques a écrit :
>> > Frédéric Nass  writes:
>> > 
>> > > Hi Luis,
>> > > 
>> > > Thanks for your help. Sorry I forgot about the kernel details. This is 
>> > > latest
>> > > RHEL 7.9.
>> > > 
>> > > ~/ uname -r
>> > > 3.10.0-1160.2.2.el7.x86_64
>> > > 
>> > > ~/ grep CONFIG_TMPFS_XATTR /boot/config-3.10.0-1160.2.2.el7.x86_64
>> > > CONFIG_TMPFS_XATTR=y
>> > > 
>> > > upper directory /upperdir is using xattrs
>> > > 
>> > > ~/ ls -l /dev/mapper/vg0-racine
>> > > lrwxrwxrwx 1 root root 7  6 mars   2020 /dev/mapper/vg0-racine -> ../dm-0
>> > > 
>> > > ~/ cat /proc/fs/ext4/dm-0/options | grep xattr
>> > > user_xattr
>> > > 
>> > > ~/ setfattr -n user.name -v upperdir /upperdir
>> > > 
>> > > ~/ getfattr -n user.name /upperdir
>> > > getfattr: Removing leading '/' from absolute path names
>> > > # file: upperdir
>> > > user.name="upperdir"
>> > > 
>> > > Are you able to modify the content of a snapshot directory using 
>> > > overlayfs on
>> > > your side?
>> > [ Cc'ing Jeff ]
>> > 
>> > Yes, I'm able to do that using a *recent* kernel.  I got curious and after
>> > some digging I managed to reproduce the issue with kernel 5.3.  The
>> > culprit was commit e09580b343aa ("ceph: don't list vxattrs in
>> > listxattr()"), in 5.4.
>> > 
>> > Getting a bit more into the whole rabbit hole, it looks like
>> > ovl_copy_xattr() will try to copy all the ceph-related vxattrs.  And that
>> > won't work (for ex. for ceph.dir.entries).
>> > 
>> > Can you try cherry-picking this commit into your kernel to see if that
>> > fixes it for you?
>> > 
>> > Cheers,
> 
> --
> Jeff Layton 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: question about rgw delete speed

2020-11-13 Thread Janne Johansson
On Wed, Nov 11, 2020 at 21:42, Adrian Nicolae <adrian.nico...@rcs-rds.ro>
wrote:

> Hey guys,
> - 6 OSD servers with 36 SATA 16TB drives each and 3 big NVME per server
> (1 big NVME for every 12 drives so I can reserve 300GB NVME storage for
> every SATA drive), 3 MON, 2 RGW with Epyc 7402p and 128GB RAM. So in the
> end we'll have ~ 3PB of raw data and 216 SATA drives.


> My main concern is the speed of delete operations. We have around
> 500k-600k delete ops every 24 hours so quite a lot. Our current storage
> is not deleting all the files fast enough (it's always 1 week-10 days
> behind) , I guess is not only a software issue and probably the delete
> speed will get better if we add more drives (we now have 108).
>

I did some tests on a mimic cluster of mine where the data is on 100+
spin-drives, but all rgw metadata and index pools are on SSDs, and I think
we could create 1M 0-byte objects and delete them at a rate of 1M objs in
24h, so having the index pools on fast OSDs is probably important for doing
large index operations like creating and deleting small files.
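
For reference, pinning the index pool to SSD-class OSDs can be done roughly like
this (a sketch; pool and rule names are examples and assume device classes are
set on the OSDs):

ceph osd crush rule create-replicated rgw-index-ssd default host ssd
ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd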

Also I think our cluster was quite silent and idle at the time, so doing
this while the cluster is being in full use would probably make it lots
slower or affect other clients. Our host specs are lower than yours, except
we have fewer OSDs per host (8-10 spin-drives and one ssd per host) so
perhaps our boxes could spread the load better.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io