[ceph-users] Re: Archive in Ceph similar to Hadoop Archive Utility (HAR)

2022-02-24 Thread Anthony D'Atri

There was a similar discussion last year around Software Heritage’s archive 
project; I suggest digging up that thread.

Some ideas:

* Pack them into (optionally compressed) tarballs - from a quick search it 
sorta looks like HAR uses a similar model. Store the tarballs as RGW objects, 
or as RBD volumes, or on CephFS (a rough sketch follows below).
* Create conventional filesystems on RBD volumes, though depending on size and 
number you might have some space lost to padding.
* SeaweedFS looks like it has small-object packing built in; use it instead (or 
on RBD volumes).
* I’ve been told that the Mass. Open Cloud folks had prototyped some sort of 
packing for RGW, but I’ve not been able to find any details or a contact.
* With any of these strategies, 30TB Intel / Solidigm QLC SSDs would be fine 
media to use.  With the right chassis and form factor, >1PB/RU raw capacity can 
be realized.  RUs == money ;)
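
A rough sketch of the tarball approach, assuming an RGW S3 endpoint and the aws 
CLI (the bucket, endpoint, and paths are made up):

---snip---
# pack a directory of small files into one compressed tarball
tar -czf photos-2022-02.tar.gz -C /data/small-files photos/2022-02

# keep a listing so individual files can be located later
# without pulling the whole archive
tar -tzf photos-2022-02.tar.gz > photos-2022-02.index

# store both as RGW objects
aws --endpoint-url https://rgw.example.com s3 cp photos-2022-02.tar.gz s3://archive/
aws --endpoint-url https://rgw.example.com s3 cp photos-2022-02.index  s3://archive/
---snip---

To read a single file back you fetch the tarball (or, if you pack uncompressed 
and record member offsets, just a byte range) and extract only that member.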

— aad


> 
> 
> Hi,
> 
> Is there any archive utility in Ceph similar to Hadoop Archive Utility (HAR)? 
> Or in other words, how can one archive small files in Ceph?
> 
> Thanks
> 
> 
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD Container keeps restarting after drive crash

2022-02-24 Thread Eugen Block

Hi,

these are the defaults set by cephadm in Octopus and Pacific:

---snip---
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash {data_dir}/{fsid}/%i/unit.run
ExecStop=-{container_path} stop ceph-{fsid}-%i
ExecStopPost=-/bin/bash {data_dir}/{fsid}/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
---snip---

So there are StartLimit options.

What are other options to prevent OSD containers from trying to  
restart after a valid crash?


The question is how you determine a "valid" crash. I wouldn't want the  
first crash to result in an out OSD. First I would try to get to the  
root cause of the crash. Of course, if there are signs of a disk  
failure it's only a matter of time until the OSD fails for good. But  
since there are a lot more things that could kill a process, I would  
want Ceph to try to bring the OSDs back online. I think the defaults  
are a reasonable compromise, although one might argue about the specific  
values, of course.
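
If you did want stricter limits for the OSD units without editing the generated 
service file, a systemd drop-in would be one way. A rough sketch (the unit name 
depends on the deployment, e.g. ceph-osd@3.service with ceph-ansible or 
ceph-{fsid}@osd.3.service with cephadm; the values are only examples):

---snip---
# systemctl edit ceph-osd@3.service
[Unit]
# give up after 3 failed starts within 30 minutes
StartLimitIntervalSec=30min
StartLimitBurst=3
---snip---

Older systemd versions expect these settings in the [Service] section instead, 
as in the template above.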


Regards,
Eugen


Zitat von "Frank de Bot (lists)" :


Hi,

I have a small containerized Ceph cluster rolled out with  
ceph-ansible. The WAL and DB for each drive are on a separate NVMe  
drive; the data is on spinning SAS disks. The cluster is running 16.2.7.
Today a disk failed, but not quite catastrophically. The block device is  
present and the LVM metadata is good, but reading certain blocks gives  
'Sense: Unrecovered read error' in the syslog (SMART indicates  
the drive is failing). The OSD crashes on reads/writes.


But the container kept restarting and crashing until manual  
intervention. This left the faulty OSD flapping up and down, so it  
never went out and the cluster never rebalanced.
I could set StartLimitIntervalSec and StartLimitBurst in the OSD  
service file, but they're not there by default and I'd like to keep  
everything as standard as possible.
What are other options to prevent OSD containers from trying to  
restart after a valid crash?


Regards,

Frank de Bot
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Archive in Ceph similar to Hadoop Archive Utility (HAR)

2022-02-24 Thread Bobby
Hi,

Is there any archive utility in Ceph similar to Hadoop Archive Utility
(HAR)? Or in other words, how can one archive small files in Ceph?

Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: One PG stuck in active+clean+remapped

2022-02-24 Thread Erwin Lubbers
Hi Dan,

That did the trick. Thanks!

Regards,
Erwin


> On 24 Feb 2022, at 20:25, Dan van der Ster wrote:
> 
> Hi Erwin,
> 
> This may be one of the rare cases where the default choose_total_tries
> = 50 is too low.
> You can try increasing it to 75 or 100 and see if crush can find 3 up OSDs.
> 
> Here's the basic recipe:
> 
> # ceph osd getcrushmap -o crush.map
> # crushtool -d crush.map -o crush.txt
> # vi crush.txt  # and change to tunable choose_total_tries 100
> # crushtool -c crush.txt -o crush.map2
> # ceph osd setcrushmap -i crush.map2
> 
> Cheers, dan
> 
> On Thu, Feb 24, 2022 at 6:29 PM Erwin Lubbers  wrote:
>> 
>> Hi all,
>> 
>> I have one active+clean+remapped PG on a 152 OSD Octopus (15.2.15) cluster 
>> with equally balanced OSDs (around 40% usage). The cluster has three replicas 
>> spread across three datacenters (A+B+C).
>> 
>> All PGs are available in each datacenter (as defined in the crush map), but 
>> only this one (which is in a pool containing 2048 PGs) is up on OSD.34 and 
>> OSD.42 and acting on OSD.34, OSD.42 and OSD.38.
>> 
>> OSD.34 is located in datacenter A, 42 in B and 38 in A again, but it should 
>> be in C.
>> 
>> I restarted all OSDs, monitors, managers and servers. I marked the OSDs 
>> that the PG is acting on out and brought them back in a minute later. In all 
>> cases the PG keeps the same state after backfilling, but one of the A replicas 
>> switches to another OSD in datacenter A. I also turned the balancer off and 
>> on again, but nothing brings the PG back to active+clean.
>> 
>> Any suggestions?
>> 
>> Regards,
>> Erwin
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: One PG stuck in active+clean+remapped

2022-02-24 Thread Dan van der Ster
Hi Erwin,

This may be one of the rare cases where the default choose_total_tries
= 50 is too low.
You can try increasing it to 75 or 100 and see if crush can find 3 up OSDs.

Here's the basic recipe:

# ceph osd getcrushmap -o crush.map
# crushtool -d crush.map -o crush.txt
# vi crush.txt  # and change to tunable choose_total_tries 100
# crushtool -c crush.txt -o crush.map2
# ceph osd setcrushmap -i crush.map2
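
You can also test the edited map with crushtool before injecting it, to check 
that the mappings become complete (a sketch; substitute your pool's crush rule 
id and replica count):

# crushtool -i crush.map2 --test --show-bad-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 10000

If that prints no bad mappings, crush can now find 3 up OSDs for every input.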

Cheers, dan

On Thu, Feb 24, 2022 at 6:29 PM Erwin Lubbers  wrote:
>
> Hi all,
>
> I have one active+clean+remapped PG on a 152 OSD Octopus (15.2.15) cluster 
> with equally balanced OSDs (around 40% usage). The cluster has three replicas 
> spread across three datacenters (A+B+C).
>
> All PGs are available in each datacenter (as defined in the crush map), but 
> only this one (which is in a pool containing 2048 PGs) is up on OSD.34 and 
> OSD.42 and acting on OSD.34, OSD.42 and OSD.38.
>
> OSD.34 is located in datacenter A, 42 in B and 38 in A again, but it should 
> be in C.
>
> I restarted all OSDs, monitors, managers and servers. I marked the OSDs 
> that the PG is acting on out and brought them back in a minute later. In all 
> cases the PG keeps the same state after backfilling, but one of the A replicas 
> switches to another OSD in datacenter A. I also turned the balancer off and 
> on again, but nothing brings the PG back to active+clean.
>
> Any suggestions?
>
> Regards,
> Erwin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] One PG stuck in active+clean+remapped

2022-02-24 Thread Erwin Lubbers
Hi all,

I have one active+clean+remapped PG on a 152 OSD Octopus (15.2.15) cluster with 
equally balanced OSDs (around 40% usage). The cluster has three replicas 
spread across three datacenters (A+B+C).

All PGs are available in each datacenter (as defined in the crush map), but 
only this one (which is in a pool containing 2048 PGs) is up on OSD.34 and 
OSD.42 and acting on OSD.34, OSD.42 and OSD.38. 

OSD.34 is located in datacenter A, 42 in B and 38 in A again, but it should be 
in C.

I restarted all OSDs, monitors, managers and servers. I marked the OSDs 
that the PG is acting on out and brought them back in a minute later. In all 
cases the PG keeps the same state after backfilling, but one of the A replicas 
switches to another OSD in datacenter A. I also turned the balancer off and on 
again, but nothing brings the PG back to active+clean.

Any suggestions?

Regards,
Erwin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs snaptrim speed

2022-02-24 Thread Dan van der Ster
Hi Frank,

Thanks for the feedback -- improving the docs is in everyone's best interest.

This semantic of "override if non-zero" is quite common in the OSD.
See https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L3388-L3451
for a few examples.
So it doesn't make sense to change the way this works -- this is IMO a
clean way to have defaults tuned per device type, while letting a user
easily override the setting without having to dive into which device
type setting to change, which can easily be confusing.

But we can improve the doc to make this more obvious. Does this help?
https://github.com/ceph/ceph/pull/45144

Cheers, Dan

On Thu, Feb 24, 2022 at 12:47 PM Frank Schilder  wrote:
>
> Hi Dan,
>
> thanks for the fast reply. Yes, I think the doc should be updated, because:
>
> > if (osd_snap_trim_sleep > 0)
> > return osd_snap_trim_sleep;
>
> means "a value of osd_snap_trim_sleep>0 will override backend specific 
> variants.", which is very different from the current formulation. The current 
> formulation implies, strictly speaking, that osd_snap_trim_sleep will 
> *always* be used and that there is no point in even having the other options.
>
> Apart from that, I wonder if it is a good idea to have such an inconsistency 
> in how options are applied. There is the good old software engineering 
> paradigm of "closeness": the closer an option is to its object, the higher its 
> precedence, just as settings in your /home override global settings and so 
> on. More specific options should always override less specific options. But 
> as a bare minimum, the options should at least behave the same (=parallelism, 
> also a good old software paradigm). The current behaviour seems to violate 
> more or less all of this.
>
> I would expect and also prefer that all options behave the same, that is,
>
> daemon specific (eg. osd.0) overrides masked (eg. class:hdd) overrides 
> built-in specific (*_hdd) overrides daemon group (eg. ceph config osd) 
> overrides global defaults (eg. ceph config global ...)
>
> Then I don't have to look at the docs all the time to figure out how things work. 
> It would all be the same.
>
> Thanks again!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 24 February 2022 12:31:11
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] ceph fs snaptrim speed
>
> Hi Frank,
>
> The semantic of osd_snap_trim_sleep was copied from osd_delete_sleep.
>
> The general setting "osd_snap_trim_sleep" is used only to override the
> _hdd _hybrid _ssd tuned values.
>
> Here's the code to get the effective sleep value:
>
> if (osd_snap_trim_sleep > 0)
> return osd_snap_trim_sleep;
> if (!store_is_rotational && !journal_is_rotational)
> return osd_snap_trim_sleep_ssd
> if (store_is_rotational && !journal_is_rotational)
> return osd_snap_trim_sleep_hybrid
> return osd_snap_trim_sleep_hdd
>
>
> Do you think the doc needs to be updated?
>
> Cheers, Dan
>
>
> On Thu, Feb 24, 2022 at 12:27 PM Frank Schilder  wrote:
> >
> > I seem to have the opposite fs snaptrim problem from what others reported 
> > recently: I would like to speed it up. I looked at the docs and find the 
> > description of osd_snap_trim_sleep* 
> > (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_snap_trim_sleep)
> >  counter-intuitive and confusing.
> >
> > For osd_snap_trim_sleep it says "This option overrides backend specific 
> > variants.", which is exactly the opposite of how other options are applied, 
> > for example, the osd_memory_target family of options. The usual way is that 
> > more specific options (back-end specific) supersede less specific options 
> > (general defaults). What is correct? Is the overall default sleep=0 for all 
> > types of OSDs, or is it 5s(=osd_snap_trim_sleep_hdd) for HDDs?
> >
> > If the documentation is wrong, please correct.
> >
> > Thanks and best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Mon crash - abort in RocksDB

2022-02-24 Thread Chris Palmer
We have a small Pacific 16.2.7 test cluster that has been ticking over 
for a couple of years with no problems whatsoever. The last "event" was 14 
days ago when I was testing some OSD replacement procedures - nothing 
remarkable.


At 01:46 this morning, though, mon03 signalled an abort in the RocksDB 
code. The monitor crashed, and systemd successfully restarted it 10 
seconds later. Although it's difficult to tell whether any RBD VMs were doing 
much, it is unlikely. Deep-scrubs are likely to have been running. There 
is nothing of interest in the OS logs.


Config summary:

 * Built manually (not using cephadm)
 * Debian 10 (buster)
 * Host mon01 - mon, mgr, mds, rgw
 * Host mon02 - mon, mgr, mds, rgw
 * Host mon03 - mon
 * Hosts osd01-03 - each have 2 Optane NVMe for HDD DB/WAL, 24 x HDD, 2
   x NVMe OSD
 * Running at very low load and capacity utilisation

It's working fine now, but I'm wondering if anyone knows what might have 
happened and if there is any lurking problem that should be looked at.


Crash info (slightly sanitised) below...

Thanks, Chris


ceph@xxmon03:~$ ceph crash info 
2022-02-24T01:46:41.241025Z_7bcaa4fa-d202-4e48-91ac-f3070493bc73
{
    "backtrace": [
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fb338e7d730]",
    "gsignal()",
    "abort()",
    "/lib/x86_64-linux-gnu/libc.so.6(+0x2240f) [0x7fb33894b40f]",
    "/lib/x86_64-linux-gnu/libc.so.6(+0x30102) [0x7fb338959102]",
    "(rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice 
const&)+0x119) [0x5633d27a792f]",
    
"(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0xaf8)
 [0x5633d275cf04]",
    "(rocksdb::CompactionJob::Run()+0x235) [0x5633d275adfb]",
    "(rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, 
rocksdb::Env::Priority)+0x248a) [0x5633d248a74a]",
    
"(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, 
rocksdb::Env::Priority)+0x20d) [0x5633d2487a93]",
    "(rocksdb::DBImpl::BGWorkCompaction(void*)+0xc5) [0x5633d248637d]",
    "(void std::__invoke_impl(std::__invoke_other, void 
(*&)(void*), void*&)+0x34) [0x5633d26e7f6e]",
    "(std::__invoke_result::type std::__invoke(void (*&)(void*), void*&)+0x37) [0x5633d26e7ad3]",
    "(void std::_Bind::__call(std::tuple<>&&, 
std::_Index_tuple<0ul>)+0x48) [0x5633d26e71c2]",
    "(void std::_Bind::operator()<, void>()+0x24) 
[0x5633d26e6318]",
    "(std::_Function_handler 
>::_M_invoke(std::_Any_data const&)+0x20) [0x5633d26e5404]",
    "(std::function::operator()() const+0x32) [0x5633d242c58c]",
    "(rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x26b) 
[0x5633d26e1941]",
    "(rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x108) 
[0x5633d26e1aa4]",
    "(void std::__invoke_impl(std::__invoke_other, void (*&&)(void*), 
rocksdb::BGThreadMetadata*&&)+0x34) [0x5633d26e4bdf]",
    "(std::__invoke_result::type std::__invoke(void (*&&)(void*), rocksdb::BGThreadMetadata*&&)+0x37) 
[0x5633d26e3dbf]",
    "(decltype (__invoke((_S_declval<0ul>)(), (_S_declval<1ul>)())) 
std::thread::_Invoker >::_M_invoke<0ul, 
1ul>(std::_Index_tuple<0ul, 1ul>)+0x43) [0x5633d26e8779]",
    "(std::thread::_Invoker 
>::operator()()+0x18) [0x5633d26e8734]",
    "(std::thread::_State_impl > >::_M_run()+0x1c) [0x5633d26e8718]",
    "/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7fb338d44b2f]",
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7fb338e72fa3]",
    "clone()"
    ],
    "ceph_version": "16.2.7",
    "crash_id": 
"2022-02-24T01:46:41.241025Z_7bcaa4fa-d202-4e48-91ac-f3070493bc73",
    "entity_name": "mon.xxmon03",
    "os_id": "10",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "os_version": "10 (buster)",
    "os_version_id": "10",
    "process_name": "ceph-mon",
    "stack_sig": 
"f5274691c6982e320f630eb9e025f3db660bd3a110bd7ec1400c7ae121feebb7",
    "timestamp": "2022-02-24T01:46:41.241025Z",
    "utsname_hostname": "xxmon03.x.y.z",
    "utsname_machine": "x86_64",
    "utsname_release": "4.19.0-18-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 4.19.208-1 (2021-09-29)"
}
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph os filesystem in read only

2022-02-24 Thread Eugen Block

Hi,

1. How long will ceph continue to run before it starts complaining  
about this?
Looks like it is fine for a few hours; ceph osd tree and ceph -s  
seem not to notice anything.


if the OSDs don't have to log anything to disk (which can take quite  
some time depending on the log settings) they won't notice. And since  
clients communicate directly with the OSDs, they won't notice either.  
For example, if the OSDs on that host start deep-scrubbing, they will  
try to log to disk, which would then fail.


2. This is still Nautilus with a majority of ceph-disk and maybe some  
ceph-volume disks.
What would be a good procedure to try to recover data from this  
drive to use on a new OS disk?


The OSDs should be usable after OS reinstallation as long as you have  
the ceph.conf and the directory structure in /var/lib/ceph/ (incl.  
keyrings and permissions). Once the OS is back you can run 'ceph-volume  
lvm activate --all' for the LVM-based OSDs; the ceph-disk OSDs should  
probably recover after running 'ceph-volume simple scan /dev/sdX1' and  
then 'ceph-volume simple activate  '.
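
Roughly (a sketch; the device name, OSD id and fsid are placeholders and come  
from your own cluster and the scan output):

# LVM-based OSDs
ceph-volume lvm activate --all

# ceph-disk based OSDs: scan the data partition, then activate
ceph-volume simple scan /dev/sdb1
ceph-volume simple activate <osd-id> <osd-fsid>   # values end up in /etc/ceph/osd/<id>-<fsid>.json after the scan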




Zitat von Marc :

I have a ceph node that has an OS filesystem going read-only  
for whatever reason [1].


1. How long will ceph continue to run before it starts complaining  
about this?
Looks like it is fine for a few hours; ceph osd tree and ceph -s  
seem not to notice anything.


2. This is still Nautilus with a majority of ceph-disk and maybe some  
ceph-volume disks.
What would be a good procedure to try to recover data from this  
drive to use on a new OS disk?




[1]
Feb 21 14:41:30 kernel: XFS (dm-0): writeback error on sector 11610872
Feb 21 14:41:30 systemd: ceph-mon@c.service failed.
Feb 21 14:41:31 kernel: XFS (dm-0): metadata I/O error: block  
0x2ee001 ("xfs_buf_iodone_callback_error") error 121 numblks 1
Feb 21 14:41:31 kernel: XFS (dm-0): metadata I/O error: block  
0x5dd5cd ("xlog_iodone") error 121 numblks 64
Feb 21 14:41:31 kernel: XFS (dm-0): Log I/O Error Detected. Shutting  
down filesystem
Feb 21 14:41:31 kernel: XFS (dm-0): Please umount the filesystem and  
rectify the problem(s)



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs snaptrim speed

2022-02-24 Thread Dan van der Ster
Hi Frank,

The semantic of osd_snap_trim_sleep was copied from osd_delete_sleep.

The general setting "osd_snap_trim_sleep" is used only to override the
_hdd _hybrid _ssd tuned values.

Here's the code to get the effective sleep value:

if (osd_snap_trim_sleep > 0)
return osd_snap_trim_sleep;
if (!store_is_rotational && !journal_is_rotational)
return osd_snap_trim_sleep_ssd
if (store_is_rotational && !journal_is_rotational)
return osd_snap_trim_sleep_hybrid
return osd_snap_trim_sleep_hdd
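
So to speed up snaptrim on HDD OSDs you would lower the hdd-specific value
rather than set the generic override, e.g. (a sketch; osd.0 and 0.5 are only
example values):

ceph config set osd osd_snap_trim_sleep_hdd 0.5
ceph config show osd.0 | grep snap_trim_sleep   # effective values on a running OSD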


Do you think the doc needs to be updated?

Cheers, Dan


On Thu, Feb 24, 2022 at 12:27 PM Frank Schilder  wrote:
>
> I seem to have the opposite fs snaptrim problem from what others reported 
> recently: I would like to speed it up. I looked at the docs and find the 
> description of osd_snap_trim_sleep* 
> (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_snap_trim_sleep)
>  counter-intuitive and confusing.
>
> For osd_snap_trim_sleep it says "This option overrides backend specific 
> variants.", which is exactly the opposite of how other options are applied, 
> for example, the osd_memory_target family of options. The usual way is that 
> more specific options (back-end specific) supersede less specific options 
> (general defaults). What is correct? Is the overall default sleep=0 for all 
> types of OSDs, or is it 5s(=osd_snap_trim_sleep_hdd) for HDDs?
>
> If the documentation is wrong, please correct.
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unclear on metadata config for new Pacific cluster

2022-02-24 Thread Kai Stian Olstad
On Wed, Feb 23, 2022 at 12:02:53PM +, Adam Huffman wrote:
> On Wed, 23 Feb 2022 at 11:25, Eugen Block  wrote:
> 
> > How exactly did you determine that there was actual WAL data on the HDDs?
> >
> I couldn't say exactly what it was, but 7 or so TBs was in use, even with
> no user data at all.

When you have the DB on a separate disk, the DB size counts towards the total
size of the OSD. But this DB space is considered used, so you will see a lot of
used space.
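
You can see this per OSD with something like (a sketch; osd.0 is just an example):

ceph osd df tree                # the META column includes the BlueFS DB space
ceph daemon osd.0 perf dump bluefs | grep -E 'db_(total|used)_bytes'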

-- 
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snaptrim bug?

2022-02-24 Thread Arthur Outhenin-Chalandre
On 2/24/22 09:26, Arthur Outhenin-Chalandre wrote:
> On 2/23/22 21:43, Linkriver Technology wrote:
>> Could someone shed some light please? Assuming that snaptrim didn't run to
>> completion, how can I manually delete objects from now-removed snapshots? I
>> believe this is what the Ceph documentation calls a "backwards scrub" - but I
>> didn't find anything in the Ceph suite that can run such a scrub. This pool 
>> is
>> filling up fast, I'll throw in some more OSDs for the moment to buy some 
>> time,
>> but I certainly would appreciate your help!
> 
> You are probably hitting a bug related to the 52026 tracker [1]. You can
> probably guess all the pgs that still need snaptrim by checking
> snaptrimq_len with the command `ceph pg dump pgs`. Basically all the pgs
> that have a non-zero value need snaptrim, and you can trigger the
> snaptrim by re-peering them.
> 
> [1]: https://tracker.ceph.com/issues/52026

I missed the point where you say that you have a snaptrimq_len of 0,
sorry. Not sure whether re-peering or restarting some OSDs would help in your
case, unfortunately (you can still try, though) :/.

-- 
Arthur Outhenin-Chalandre
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snaptrim bug?

2022-02-24 Thread Dan van der Ster
See https://tracker.ceph.com/issues/54396

I don't know how to tell the OSDs to rediscover those trimmed snaps.
Neha, is that possible?

Cheers, Dan

On Thu, Feb 24, 2022 at 9:27 AM Dan van der Ster  wrote:
>
> Hi,
>
> I had a look at the code -- looks like there's a flaw in the logic:
> the snaptrim queue is cleared if osd_pg_max_concurrent_snap_trims = 0.
>
> I'll open a tracker and send a PR to restrict
> osd_pg_max_concurrent_snap_trims to >= 1.
>
> Cheers, Dan
>
> On Wed, Feb 23, 2022 at 9:44 PM Linkriver Technology
>  wrote:
> >
> > Hello,
> >
> > I have upgraded our Ceph cluster from Nautilus to Octopus (15.2.15) over the
> > weekend. The upgrade went well as far as I can tell.
> >
> > Earlier today, noticing that our CephFS data pool was approaching capacity, 
> > I
> > removed some old CephFS snapshots (taken weekly at the root of the 
> > filesystem),
> > keeping only the most recent one (created today, 2022-02-21). As expected, a
> > good fraction of the PGs transitioned from active+clean to 
> > active+clean+snaptrim
> > or active+clean+snaptrim_wait. On previous occasions when I removed a 
> > snapshot
> > it took a few days for snaptrimming to complete. This would happen without
> > noticeably impacting other workloads, and would also free up an appreciable
> > amount of disk space.
> >
> > This time around, after a few hours of snaptrimming, users complained of 
> > high IO
> > latency, and indeed Ceph reported "slow ops" on a number of OSDs and on the
> > active MDS. I attributed this to the snaptrimming and decided to reduce it 
> > by
> > initially setting osd_pg_max_concurrent_snap_trims to 1, which didn't seem 
> > to
> > help much, so I then set it to 0, which had the surprising effect of
> > transitioning all PGs back to active+clean (is this intended?). I also 
> > restarted
> > the MDS which seemed to be struggling. IO latency went back to normal
> > immediately.
> >
> > Outside of users' working hours, I decided to resume snaptrimming by setting
> > osd_pg_max_concurrent_snap_trims back to 1. Much to my surprise, nothing
> > happened. All PGs remained (and still remain at time of writing) in the 
> > state
> > active+clean, even after restarting some of them. This definitely seems
> > abnormal, as I mentioned earlier, snaptrimming this FS previously would 
> > take in
> > the order of multiple days. Moreover, if snaptrim were truly complete, I 
> > would
> > expect pool usage to have dropped by appreciable amounts (at least a dozen
> > terabytes), but that doesn't seem to be the case.
> >
> > A du on the CephFS root gives:
> >
> > # du -sh /mnt/pve/cephfs
> > 31T    /mnt/pve/cephfs
> >
> > But:
> >
> > # ceph df
> > 
> > --- POOLS ---
> > POOL             ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> > cephfs_data       7  512  43 TiB   190.83M  147 TiB   93.2   23.6 TiB
> > cephfs_metadata   8   32  89 GiB   694.60k  266 GiB    1.3   26.4 TiB
> > 
> >
> > ceph pg dump reports a SNAPTRIMQ_LEN of 0 on all PGs.
> >
> > Did CephFS just leak a massive 12 TiB worth of objects...? It seems to me 
> > that
> > the snaptrim operation did not complete at all.
> >
> > Perhaps relatedly:
> >
> > # ceph daemon mds.choi dump snaps
> > {
> > "last_created": 93,
> > "last_destroyed": 94,
> > "snaps": [
> > {
> > "snapid": 93,
> > "ino": 1,
> > "stamp": "2022-02-21T00:00:01.245459+0800",
> > "name": "2022-02-21"
> > }
> > ]
> > }
> >
> > How can last_destroyed > last_created? The last snapshot to have been taken 
> > on
> > this FS is indeed #93, and the removed snapshots were all created on 
> > previous
> > weeks.
> >
> > Could someone shed some light please? Assuming that snaptrim didn't run to
> > completion, how can I manually delete objects from now-removed snapshots? I
> > believe this is what the Ceph documentation calls a "backwards scrub" - but 
> > I
> > didn't find anything in the Ceph suite that can run such a scrub. This pool 
> > is
> > filling up fast, I'll throw in some more OSDs for the moment to buy some 
> > time,
> > but I certainly would appreciate your help!
> >
> > Happy to attach any logs or info you deem necessary.
> >
> > Regards,
> >
> > LRT
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snaptrim bug?

2022-02-24 Thread Dan van der Ster
Hi,

I had a look at the code -- looks like there's a flaw in the logic:
the snaptrim queue is cleared if osd_pg_max_concurrent_snap_trims = 0.

I'll open a tracker and send a PR to restrict
osd_pg_max_concurrent_snap_trims to >= 1.

Cheers, Dan

On Wed, Feb 23, 2022 at 9:44 PM Linkriver Technology
 wrote:
>
> Hello,
>
> I have upgraded our Ceph cluster from Nautilus to Octopus (15.2.15) over the
> weekend. The upgrade went well as far as I can tell.
>
> Earlier today, noticing that our CephFS data pool was approaching capacity, I
> removed some old CephFS snapshots (taken weekly at the root of the 
> filesystem),
> keeping only the most recent one (created today, 2022-02-21). As expected, a
> good fraction of the PGs transitioned from active+clean to 
> active+clean+snaptrim
> or active+clean+snaptrim_wait. On previous occasions when I removed a snapshot
> it took a few days for snaptrimming to complete. This would happen without
> noticeably impacting other workloads, and would also free up an appreciable
> amount of disk space.
>
> This time around, after a few hours of snaptrimming, users complained of high 
> IO
> latency, and indeed Ceph reported "slow ops" on a number of OSDs and on the
> active MDS. I attributed this to the snaptrimming and decided to reduce it by
> initially setting osd_pg_max_concurrent_snap_trims to 1, which didn't seem to
> help much, so I then set it to 0, which had the surprising effect of
> transitioning all PGs back to active+clean (is this intended?). I also 
> restarted
> the MDS which seemed to be struggling. IO latency went back to normal
> immediately.
>
> Outside of users' working hours, I decided to resume snaptrimming by setting
> osd_pg_max_concurrent_snap_trims back to 1. Much to my surprise, nothing
> happened. All PGs remained (and still remain at time of writing) in the state
> active+clean, even after restarting some of them. This definitely seems
> abnormal, as I mentioned earlier, snaptrimming this FS previously would take 
> in
> the order of multiple days. Moreover, if snaptrim were truly complete, I would
> expect pool usage to have dropped by appreciable amounts (at least a dozen
> terabytes), but that doesn't seem to be the case.
>
> A du on the CephFS root gives:
>
> # du -sh /mnt/pve/cephfs
> 31T    /mnt/pve/cephfs
>
> But:
>
> # ceph df
> 
> --- POOLS ---
> POOL             ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> cephfs_data       7  512  43 TiB   190.83M  147 TiB   93.2   23.6 TiB
> cephfs_metadata   8   32  89 GiB   694.60k  266 GiB    1.3   26.4 TiB
> 
>
> ceph pg dump reports a SNAPTRIMQ_LEN of 0 on all PGs.
>
> Did CephFS just leak a massive 12 TiB worth of objects...? It seems to me that
> the snaptrim operation did not complete at all.
>
> Perhaps relatedly:
>
> # ceph daemon mds.choi dump snaps
> {
> "last_created": 93,
> "last_destroyed": 94,
> "snaps": [
> {
> "snapid": 93,
> "ino": 1,
> "stamp": "2022-02-21T00:00:01.245459+0800",
> "name": "2022-02-21"
> }
> ]
> }
>
> How can last_destroyed > last_created? The last snapshot to have been taken on
> this FS is indeed #93, and the removed snapshots were all created on previous
> weeks.
>
> Could someone shed some light please? Assuming that snaptrim didn't run to
> completion, how can I manually delete objects from now-removed snapshots? I
> believe this is what the Ceph documentation calls a "backwards scrub" - but I
> didn't find anything in the Ceph suite that can run such a scrub. This pool is
> filling up fast, I'll throw in some more OSDs for the moment to buy some time,
> but I certainly would appreciate your help!
>
> Happy to attach any logs or info you deem necessary.
>
> Regards,
>
> LRT
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snaptrim bug?

2022-02-24 Thread Arthur Outhenin-Chalandre
Hi,

On 2/23/22 21:43, Linkriver Technology wrote:
> Could someone shed some light please? Assuming that snaptrim didn't run to
> completion, how can I manually delete objects from now-removed snapshots? I
> believe this is what the Ceph documentation calls a "backwards scrub" - but I
> didn't find anything in the Ceph suite that can run such a scrub. This pool is
> filling up fast, I'll throw in some more OSDs for the moment to buy some time,
> but I certainly would appreciate your help!

You are probably hitting a bug related to the 52026 tracker [1]. You can
probably guess all the pgs that still need snaptrim by checking
snaptrimq_len with the command `ceph pg dump pgs`. Basically all the pgs
that have a non-zero value need snaptrim, and you can trigger the
snaptrim by re-peering them.

[1]: https://tracker.ceph.com/issues/52026
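
Something like this should list the stragglers (a rough sketch; the JSON layout
of `ceph pg dump` differs a bit between releases, so the jq path may need
adjusting, and the pgid is only an example):

ceph pg dump pgs -f json 2>/dev/null | jq -r '.pg_stats[] | select(.snap_trimq_len > 0) | .pgid'
ceph pg repeer 7.1a   # then re-peer each of them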

Cheers,

-- 
Arthur Outhenin-Chalandre
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster crash after 2B objects pool removed

2022-02-24 Thread Dan van der Ster
Hi,

Basically, a deletion of any size shouldn't cause osds to crash. So
please open a tracker with some example osd logs showing the crash
backtraces.

Cheers, Dan

On Thu, Feb 24, 2022 at 6:20 AM Szabo, Istvan (Agoda)
 wrote:
>
> Hi,
>
> I've removed the old RGW data pool with 2B objects because in multisite it 
> seems the user removal with data purge doesn't work, so I needed to clean up 
> somehow. Luckily I expected that the cluster would crash, so there were no 
> users on it, but I wonder how this can be done in a smooth way.
>
> I deleted the pool on 23 Feb around 2-3pm, and I'd say 90-95% of the OSDs 
> went down around 7:30am on the 24th.
> I've manually compacted all the OSDs so it got back to normal, but I'd be 
> curious what operation happens at 7:30am in Ceph?
>
> The only guess I have for preventing this: if this cleanup operation happens 
> at 7:30am, I should do this kind of delete around 8am and compact all the 
> OSDs before 7:30am the next day, so it might not crash?
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io