Re: [ceph-users] OSDs keep crashing after cluster reboot
We got our OSDs back. Since we removed the EC pool (cephfs.data), we had to
figure out how to remove its PGs from the offline OSDs; here is how we did it.

Remove the CephFS, remove the cache tier, remove the pools:

#ceph mds fail 0
#ceph fs rm cephfs --yes-i-really-mean-it
#ceph osd tier remove-overlay cephfs.data
there is now (or already was) no overlay for 'cephfs.data'
#ceph osd tier remove cephfs.data cephfs.cache
pool 'cephfs.cache' is now (or already was) not a tier of 'cephfs.data'
#ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
#ceph osd pool delete cephfs.cache cephfs.cache --yes-i-really-really-mean-it
pool 'cephfs.cache' removed
#ceph osd pool delete cephfs.data cephfs.data --yes-i-really-really-mean-it
pool 'cephfs.data' removed
#ceph osd pool delete cephfs.metadata cephfs.metadata --yes-i-really-really-mean-it
pool 'cephfs.metadata' removed

Remove the placement groups of pool 23 (cephfs.data) from all offline OSDs
(a consolidated per-host sketch of this loop follows after this thread):

DATAPATH=/var/lib/ceph/osd/ceph-${OSD}
a=`ceph-objectstore-tool --data-path ${DATAPATH} --op list-pgs | grep "^23\."`
for i in $a; do
  echo "INFO: removing ${i} from OSD ${OSD}"
  ceph-objectstore-tool --data-path ${DATAPATH} --pgid ${i} --op remove --force
done

Since we have now removed our CephFS, we still don't know whether we could have
solved this without data loss by upgrading to Nautilus.

Have a nice weekend,
Ansgar

On Wed, 7 Aug 2019 at 17:03, Ansgar Jazdzewski wrote:
>
> another update,
>
> we now took the more destructive route and removed the cephfs pools
> (luckily we had only test data in the filesystem).
> Our hope was that during the startup process the OSDs would delete the
> no longer needed PGs, but this is NOT the case.
>
> So we still have the same issue; the only difference is that the PG
> does not belong to a pool anymore.
>
> -360> 2019-08-07 14:52:32.655 7fb14db8de00  5 osd.44 pg_epoch: 196586
> pg[23.f8s0(unlocked)] enter Initial
> -360> 2019-08-07 14:52:32.659 7fb14db8de00 -1
> /build/ceph-13.2.6/src/osd/ECUtil.h: In function
> 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
> 7fb14db8de00 time 2019-08-07 14:52:32.660169
> /build/ceph-13.2.6/src/osd/ECUtil.h: 34: FAILED assert(stripe_width %
> stripe_size == 0)
>
> We can now take one route and try to delete the PG by hand on the OSD
> (bluestore). How can this be done? Or we try to upgrade to Nautilus and
> hope for the best.
>
> Any help or hints are welcome,
> have a nice one
> Ansgar
>
> On Wed, 7 Aug 2019 at 11:32, Ansgar Jazdzewski wrote:
> >
> > Hi,
> >
> > as a follow-up:
> > * a full log of one OSD failing to start https://pastebin.com/T8UQ2rZ6
> > * our ec-pool creation in the first place https://pastebin.com/20cC06Jn
> > * ceph osd dump and ceph osd erasure-code-profile get cephfs
> >   https://pastebin.com/TRLPaWcH
> >
> > As we dig deeper into it, it looks like a bug in the cephfs or
> > erasure-coding part of ceph.
> >
> > Ansgar
> >
> > On Tue, 6 Aug 2019 at 14:50, Ansgar Jazdzewski wrote:
> > >
> > > hi folks,
> > >
> > > we had to move one of our clusters, so we had to reboot all servers;
> > > now we see an error on all OSDs with the EC pool.
> > >
> > > Did we miss some options? Will an upgrade to 13.2.6 help?
> > >
> > > Thanks,
> > > Ansgar
> > >
> > > 2019-08-06 12:10:16.265 7fb337b83200 -1
> > > /build/ceph-13.2.4/src/osd/ECUtil.h: In function
> > > 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
> > > 7fb337b83200 time 2019-08-06 12:10:16.263025
> > > /build/ceph-13.2.4/src/osd/ECUtil.h: 34: FAILED assert(stripe_width %
> > > stripe_size == 0)
> > >
> > > ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
> > > 1: (ceph::ceph_assert_fail(char const, char const, int, char const)+0x102) [0x7fb32eeb83c2]
> > > 2: (()+0x2e5587) [0x7fb32eeb8587]
> > > 3: (ECBackend::ECBackend(PGBackend::Listener, coll_t const&,
> > >    boost::intrusive_ptr&, ObjectStore, CephContext,
> > >    std::shared_ptr, unsigned long)+0x4de) [0xa4cbbe]
> > > 4: (PGBackend::build_pg_backend(pg_pool_t const&, std::map<...> const&,
> > >    PGBackend::Listener, coll_t, boost::intrusive_ptr&, ObjectStore,
> > >    CephContext)+0x2f9) [0x9474e9]
> > > 5: (PrimaryLogPG::PrimaryLogPG(OSDService, std::shared_ptr,
> > >    PGPool const&, std::map<...> const&, spg_t)+0x138) [0x8f96e8]
> > > 6: (OSD::_make_pg(std::shared_ptr, spg_t)+0x11d3) [0x753553]
> > > 7: (OSD::load_pgs()+0x4a9) [0x758339]
> > > 8: (OSD::ini
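For completeness, the consolidated sketch of the cleanup loop described in the
first message of this thread, wrapped so it can be run for a list of offline
OSDs on one host. Pool id 23 and the DATAPATH/ceph-objectstore-tool invocation
are taken from the thread; the OSD id list and the systemctl calls are
assumptions to adapt. ceph-objectstore-tool must only be run against a stopped
OSD.

#!/bin/bash
# Remove all PGs of pool 23 (the deleted cephfs.data EC pool) from a set of
# offline OSDs on this host. The OSD ids below are placeholders.
set -eu
POOL_ID=23
OSDS="44 45 46"          # assumed list of affected OSDs on this host

for OSD in ${OSDS}; do
    DATAPATH=/var/lib/ceph/osd/ceph-${OSD}
    systemctl stop ceph-osd@${OSD}        # the daemon must not be running
    for PG in $(ceph-objectstore-tool --data-path ${DATAPATH} --op list-pgs | grep "^${POOL_ID}\."); do
        echo "INFO: removing ${PG} from OSD ${OSD}"
        ceph-objectstore-tool --data-path ${DATAPATH} --pgid ${PG} --op remove --force
    done
    systemctl start ceph-osd@${OSD}
done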
Re: [ceph-users] CephFS snapshot for backup & disaster recovery
Hi,

>> I'm running a single-host Ceph cluster for CephFS and I'd like to keep
>> backups in Amazon S3 for disaster recovery. Is there a simple way to extract
>> a CephFS snapshot as a single file and/or to create a file that represents
>> the incremental difference between two snapshots?

I think it's on the roadmap for the next ceph version.

----- Original Message -----
From: "Eitan Mosenkis"
To: "Vitaliy Filippov"
Cc: "ceph-users"
Sent: Monday, 5 August 2019 18:43:00
Subject: Re: [ceph-users] CephFS snapshot for backup & disaster recovery

I'm using it for a NAS to make backups from the other machines on my home
network. Since everything is in one location, I want to keep a copy offsite
for disaster recovery. Running Ceph across the internet is not recommended and
is also very expensive compared to just storing snapshots.

On Sun, Aug 4, 2019 at 3:08 PM Виталий Филиппов <vita...@yourcmc.ru> wrote:

Afaik no. What's the idea of running a single-host cephfs cluster?

On 4 August 2019 13:27:00 GMT+03:00, Eitan Mosenkis <ei...@mosenkis.net> wrote:

I'm running a single-host Ceph cluster for CephFS and I'd like to keep backups
in Amazon S3 for disaster recovery. Is there a simple way to extract a CephFS
snapshot as a single file and/or to create a file that represents the
incremental difference between two snapshots?

--
With best regards,
Vitaliy Filippov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
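While native snapshot export is still on the roadmap, one common stopgap is to
archive the CephFS .snap directories yourself and diff two of them to get an
incremental. A minimal sketch, assuming the filesystem is mounted at
/mnt/cephfs, the snapshots are named snap-2019-08-04 and snap-2019-08-05, the
AWS CLI is configured, and the bucket name my-cephfs-backups is purely
illustrative:

# Full backup: stream one snapshot as a single tarball straight to S3
tar -czf - -C /mnt/cephfs/.snap/snap-2019-08-04 . \
  | aws s3 cp - s3://my-cephfs-backups/snap-2019-08-04.tar.gz

# Incremental: list files that differ between the two snapshots (dry-run rsync
# with checksum comparison), drop directory entries, then pack only those files
rsync -rnc --out-format='%n' \
  /mnt/cephfs/.snap/snap-2019-08-05/ /mnt/cephfs/.snap/snap-2019-08-04/ \
  | grep -v '/$' > /tmp/changed-files.txt

( cd /mnt/cephfs/.snap/snap-2019-08-05 \
  && tar -czf /tmp/incr-2019-08-05.tar.gz -T /tmp/changed-files.txt )
aws s3 cp /tmp/incr-2019-08-05.tar.gz s3://my-cephfs-backups/

Deletions between snapshots are not captured by this listing, and checksumming
the whole tree is slow; it only shows the shape of the approach, not a
production backup tool.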
[ceph-users] MDS corruption
I had a machine with insufficient memory and it seems to have corrupted data on
my MDS. The filesystem seems to be working fine, with the exception of
accessing specific files. The ceph-mds logs include things like:

mds.0.1596621 unhandled write error (2) No such file or directory, force readonly...
dir 0x100fb03 object missing on disk; some files may be lost (/adam/programming/bash)

I'm using mimic and trying to follow the instructions here:
https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/

The punchline is this:

cephfs-journal-tool --rank all journal export backup.bin
Error ((22) Invalid argument)
2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.

I have a backup (outside of ceph) of all data which is inaccessible, and I can
back up anything which is still accessible if need be. There's some more
information below, but my main question is: what are my next steps?

On a side note, I'd like to get involved with helping with documentation (man
pages, the ceph website, usage text, etc). Where can I get started?

Here's the context:

cephfs-journal-tool event recover_dentries summary
Error ((22) Invalid argument)
2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank" argument

Seems like a bug in the documentation, since `--rank` is a "mandatory option"
according to the help text. It looks like the rank of this node for MDS is 0,
based on `ceph health detail`, but using `--rank 0` or `--rank all` doesn't
work either:

ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
MDS_DAMAGE 1 MDSs report damaged metadata
    mdsge.hax0rbana.org(mds.0): Metadata damage detected
MDS_READ_ONLY 1 MDSs are read only
    mdsge.hax0rbana.org(mds.0): MDS in read-only mode

cephfs-journal-tool --rank 0 event recover_dentries summary
Error ((22) Invalid argument)
2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.

The only place I've found this error message is in an unanswered stackoverflow
question and in the source code here:
https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114

It looks like that is trying to read a filesystem map (fsmap), which might be
corrupted. Running `rados export` prints part of the help text and then
segfaults, which is rather concerning. This is 100% repeatable (outside of gdb,
details below). I tried `rados df` and that worked fine, so it's not all rados
commands which are having this problem. However, I tried `rados bench 60 seq`
and that also printed out the usage text and then segfaulted.

Info on the `rados export` crash:

rados export
usage: rados [options] [commands]
POOL COMMANDS
  IMPORT AND EXPORT
   export [filename]
       Serialize pool contents to a file or standard out.
OMAP OPTIONS:
   --omap-key-file file        read the omap key from a file

*** Caught signal (Segmentation fault) **
 in thread 7fcb6bfff700 thread_name:fn_anonymous

When running it in gdb:

(gdb) bt
#0  0x7fffef07331f in std::_Rb_tree<std::__cxx11::basic_string<...>,
    std::pair<std::__cxx11::basic_string<...> const, std::map<std::__cxx11::basic_string<...>,
    <... unsigned long, long, double, bool, entity_addr_t, std::chrono::duration<...>,
    Option::size_t, uuid_d>, ...> >, ...>::find(std::__cxx11::basic_string<...> const&) const ()
    from /usr/lib/ceph/libceph-common.so.0
Backtrace stopped: Cannot access memory at address 0x7fffd9ff89f8

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
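One thing worth trying before deeper surgery, based on how recent
cephfs-journal-tool builds parse the rank argument: they expect the rank
qualified with the filesystem name, not a bare number or `all` on its own. A
sketch, assuming the filesystem is named cephfs (check with `ceph fs ls`);
verify against the help output of your 13.2.x binaries:

# Export the journal of rank 0 of the filesystem "cephfs" (name assumed)
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

# Or address every rank of that filesystem
cephfs-journal-tool --rank=cephfs:all event recover_dentries summary

If the tool still can't determine the rank, the fsmap suspicion above is the
next thing to chase.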
Re: [ceph-users] tcmu-runner: "Acquired exclusive lock" every 21s
On 06.08.19 at 18:28, Mike Christie wrote:
On 08/06/2019 07:51 AM, Matthias Leopold wrote:
On 05.08.19 at 18:31, Mike Christie wrote:
On 08/05/2019 05:58 AM, Matthias Leopold wrote:

Hi,

I'm still testing my 2-node (dedicated) iSCSI gateway with ceph 12.2.12 before
I dare to put it into production. I installed the latest tcmu-runner release
(1.5.1) and (like before) I'm seeing that both nodes switch exclusive locks for
the disk images every 21 seconds. tcmu-runner logs look like this:

2019-08-05 12:53:04.184 13742 [WARN] tcmu_notify_lock_lost:222 rbd/iscsi.test03: Async lock drop. Old state 1
2019-08-05 12:53:04.714 13742 [WARN] tcmu_rbd_lock:762 rbd/iscsi.test03: Acquired exclusive lock.
2019-08-05 12:53:25.186 13742 [WARN] tcmu_notify_lock_lost:222 rbd/iscsi.test03: Async lock drop. Old state 1
2019-08-05 12:53:25.773 13742 [WARN] tcmu_rbd_lock:762 rbd/iscsi.test03: Acquired exclusive lock.

Old state can sometimes be 0 or 2. Is this expected behaviour?

What initiator OS are you using?

I'm using CentOS 7 initiators and I had somehow failed to configure multipathd
on them correctly (device { vendor "LIO.ORG" ... }). After fixing that, the
above problem disappeared and the output of "multipath -ll" finally looks
correct. Thanks for pointing me to this.

Nevertheless there's now another problem visible in the logs. As soon as an
initiator logs in, tcmu-runner on the gateway node that doesn't own the image
being accessed logs

[ERROR] tcmu_rbd_has_lock:516 rbd/iscsi.test02: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.

This disappears after the osd blacklist entries for the node expire (visible
with "ceph osd blacklist ls"). I haven't yet understood how this is supposed to
work; right now I restarted from scratch (logged out, waited until all
blacklist entries disappeared, logged in) and I'm again seeing several
blacklist entries for both gateway nodes (and the above error message in
tcmu-runner.log). This doesn't seem to interfere with the iSCSI service, but I
want this explained/resolved before I can start using the gateways.

This is expected. Before multipath kicks in, during path addition/readdition
and during failover/failback, you can have IO on multiple paths, so the lock is
going to bounce temporarily and gateways are going to be blacklisted. It should
not happen non-stop like you saw in the original email.

Thank you for the explanation.
Matthias

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
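For readers hitting the same multipath gap: a sketch of the device section the
upstream ceph-iscsi initiator documentation recommends for CentOS 7 initiators,
reproduced from memory rather than from this thread, so the exact attribute set
should be checked against the docs matching your ceph-iscsi release:

# /etc/multipath.conf on the CentOS 7 initiator (illustrative)
devices {
        device {
                vendor                 "LIO-ORG"
                hardware_handler       "1 alua"
                path_grouping_policy   "failover"
                path_selector          "queue-length 0"
                failback               60
                path_checker           "tur"
                prio                   "alua"
                prio_args              "exclusive_pref_bit"
                fast_io_fail_tmo       25
                no_path_retry          queue
        }
}

After editing, reload with `systemctl reload multipathd` and check
`multipath -ll`; with ALUA failover it should show one active/ready path group
and the other paths as enabled/ghost.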
Re: [ceph-users] out of memory bluestore osds
Hi Mark,

thanks a lot for your explanation and clarification. Adjusting
osd_memory_target to fit our systems did the trick.

Jaime

On 07/08/2019 14:09, Mark Nelson wrote:

Hi Jaime,

we only use the cache size parameters now if you've disabled autotuning. With
autotuning we adjust the cache size on the fly to try and keep the mapped
process memory under the osd_memory_target. You can set a lower memory target
than the default, though you will have far less cache for bluestore onodes and
rocksdb. You may notice that it's slower, especially if you have a big active
data set you are processing.

I don't usually recommend setting the osd_memory_target below 2GB. At some
point it will have shrunk the caches as far as it can and the process memory
may start exceeding the target (with our default rocksdb and pglog settings
this usually happens somewhere between 1.3-1.7GB once the OSD has been
sufficiently saturated with IO). Given memory prices right now, I'd still
recommend upgrading RAM if you have the ability though. You might be able to
get away with setting each OSD to 2-2.5GB in your scenario, but you'll be
pushing it.

I would not recommend lowering osd_memory_cache_min. You really want rocksdb
indexes/filters fitting in cache, and as many bluestore onodes as you can get.
In any event, you'll still be bound by the (currently hardcoded) 64MB cache
chunk allocation size in the autotuner, which osd_memory_cache_min can't reduce
(and that's per cache, while osd_memory_cache_min is global for the kv, buffer,
and rocksdb block caches). I.e. each cache is going to get 64MB plus growth
room regardless of how low you set osd_memory_cache_min. That's intentional: we
don't want a single SST file in rocksdb to be able to completely blow
everything else out of the block cache during compaction, only to quickly
become invalid, be removed from the cache, and make it look to the priority
cache system like rocksdb doesn't actually need any more memory for cache.

Mark

On 8/7/19 7:44 AM, Jaime Ibar wrote:

Hi all,

we run a Ceph Luminous 12.2.12 cluster, 7 OSD servers with 12x4TB disks each.
Recently we redeployed the OSDs of one of them using the bluestore backend;
however, after this, we're facing out-of-memory errors (invoked oom-killer) and
the OS kills one of the ceph-osd processes. The osd is restarted automatically
and is back online after one minute.

We're running Ubuntu 16.04, kernel 4.15.0-55-generic. The server has 32GB of
RAM and a 4GB swap partition. All the disks are hdd, no ssd disks.

Bluestore settings are the default ones:

"osd_memory_target": "4294967296"
"osd_memory_cache_min": "134217728"
"bluestore_cache_size": "0"
"bluestore_cache_size_hdd": "1073741824"
"bluestore_cache_autotune": "true"

As stated in the documentation, bluestore assigns by default 4GB of RAM per osd
(1GB of RAM per 1TB). So in this case 48GB of RAM would be needed. Am I right?
Are these the minimum requirements for bluestore?

In case adding more RAM is not an option, can any of osd_memory_target,
osd_memory_cache_min, bluestore_cache_size_hdd be decreased to fit our server
specs? Would this have any impact on performance?

Thanks

Jaime

--
Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
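A sketch of the arithmetic behind "2-2.5GB in your scenario" and one way to
apply it on Luminous, where osd_memory_target is set in ceph.conf (the
centralized `ceph config set` store only arrives in Mimic). The headroom
reserved for the OS and the exact target value are assumptions to adapt:

# 32 GB RAM, 12 OSDs per host: keep ~4 GB for the OS and page cache, so roughly
#   (32 - 4) GB / 12 OSDs ~= 2.3 GB per OSD, staying at or above the 2 GB floor

# /etc/ceph/ceph.conf on the OSD hosts
[osd]
# ~2.25 GiB per OSD daemon
osd_memory_target = 2415919104

# Push it to the running OSDs as well (it is applied fully on restart anyway)
ceph tell osd.* injectargs '--osd_memory_target=2415919104'

# Then restart the OSDs one at a time and watch the resident memory settle
systemctl restart ceph-osd@12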