[ceph-users] Slow Requests when deep scrubbing PGs that hold Bucket Index
Hi,

I'm using Ceph primarily for block storage (which works quite well) and as an object gateway using the S3 API. Here is some info about my system:

Ceph: 12.2.4, OS: Ubuntu 18.04
OSD: Bluestore
6 servers in total, about 60 OSDs, 2TB SSDs each, no HDDs, CFQ scheduler
20 GBit private network
20 GBit public network
Block storage and object storage run on separate disks

Main use case: saving small (30KB - 2MB) objects in rgw buckets.
- dynamic bucket index resharding is disabled for now, but I keep the index objects per shard at about 100k
- data pool: EC 4+2
- index pool: replicated (3)
- at the moment around 500k objects in each bucket

My problem: sometimes I get "slow request" warnings like so:

"[WRN] Health check update: 7 slow requests are blocked > 32 sec (REQUEST_SLOW)"

It turned out that these warnings appear whenever specific PGs are being deep scrubbed. After further investigation, I figured out that these PGs hold the bucket index of the rados gateway. I already tried some configuration changes:

ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 0'
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
ceph tell osd.* injectargs '--osd_scrub_sleep 1'
ceph tell osd.* injectargs '--osd_deep_scrub_stride 1048576'
ceph tell osd.* injectargs '--osd_scrub_chunk_max 1'
ceph tell osd.* injectargs '--osd_scrub_chunk_min 1'

This helped a lot to mitigate the effects, but the problem is still there. Does anybody else have this issue?

I have a few questions to better understand what's going on: As far as I know, the bucket index is stored in RocksDB, and the (empty) objects in the index pool are just references to the data in RocksDB. Is that correct? How does a deep scrub affect RocksDB? Does the index pool even need deep scrubbing, or could I just disable it?

Also: does it make sense to create more index shards to get the objects per shard down to, let's say, 50k or 20k? Right now, I have about 500k objects per bucket.
I want to increase that number to a couple of hundred million objects. Do you see any problems with that, provided that the bucket index is sharded appropriately?

Any help is appreciated. Let me know if you need anything like logs, configs, etc.

Thanks!
Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
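One way to confirm which PGs back the index pool, and to shard a bucket further, is with the standard CLI. This is a sketch only: the pool and bucket names below are placeholders, not taken from Christian's cluster, and disabling deep scrub trades away scrub coverage on that pool.

```
# Which PGs (and thus OSDs) hold the index pool? (pool name is a placeholder)
ceph pg ls-by-pool default.rgw.buckets.index

# Offline reshard of one bucket to drive down objects per shard
# (dynamic resharding is disabled on this cluster anyway)
radosgw-admin bucket reshard --bucket=mybucket --num-shards=32

# If deep scrub of the index pool proves too disruptive, it can be
# disabled per pool (at the cost of losing scrub coverage there)
ceph osd pool set default.rgw.buckets.index nodeep-scrub 1
```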
Re: [ceph-users] Journal SSD recommendation
> I have lots of Samsung 850 EVO but they are consumer drives. Do you think a consumer drive should be good for the journal?

No. Since the fall of 2017, purchasing an Intel P3700 has not been difficult; you should buy one if you can.

k
Re: [ceph-users] Erasure coding RBD pool for OpenStack Glance, Nova and Cinder
> So if you want, two more questions for you:
> - How do you handle your ceph.conf configuration (default data pool by user) / distribution? Manually, config management, openstack-ansible...?
> - Did you make comparisons/benchmarks between replicated pools and EC pools on the same hardware/drives? I read that small writes are not very performant with EC.

The ceph.conf default data pool is only needed by Cinder at image creation time; after that, a luminous+ rbd client will find the "data-pool" feature on the image and direct data I/O to that pool.

# rbd info erasure_rbd_meta/volume-09ed44bf-7d16-453a-b712-a636a0d3d812   <- meta pool!
rbd image 'volume-09ed44bf-7d16-453a-b712-a636a0d3d812':
    size 1500 GB in 384000 objects
    order 22 (4096 kB objects)
    data_pool: erasure_rbd_data   <- our data pool
    block_name_prefix: rbd_data.6.a2720a1ec432bf
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool   <- "data-pool" feature
    flags:
    create_timestamp: Sat Jan 27 20:24:04 2018

k
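As I read the thread, the setting lives in the ceph.conf consulted by Cinder at image-create time. The section name below is an assumption (it depends on which cephx user Cinder runs as); the pool name is the one from this thread:

```ini
[client.cinder]
# Only consulted at image-create time; the data-pool feature is then
# recorded in the image itself and honoured by luminous+ clients.
rbd default data pool = erasure_rbd_data
```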
Re: [ceph-users] Journal SSD recommendation
I am planning to use Intel 3700 (200GB) for journal and 500GB Samsung 850 EVO for OSD. Do you think this design makes sense?

On Tue, Jul 10, 2018 at 3:04 PM, Simon Ironside wrote:
>
> On 10/07/18 19:32, Robert Stanford wrote:
>>
>> Do the recommendations apply to both data and journal SSDs equally?
>
> Search the list for "Many concurrent drive failures - How do I activate
> pgs?" to read about the Intel DC S4600 failure story. The OP had several 2TB
> models of these fail when used as Bluestore data devices. The Samsung SM863a
> is discussed as a good alternative in the same thread.
>
> Simon
Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
Hi John,

Thanks for the explanation, that command is a lot more impacting than I thought! I hope the change of name for the verb "reset" comes through in the next version, because that is very easy to misunderstand.

"The first question is why we're talking about running it at all. What chain of reasoning led you to believe that your inotable needed erasing?"

I thought the reset inode command is just like the reset session command: as you can pass an MDS rank to it as a param, it only resets whatever that MDS was holding.

"The most typical case is where the journal has been recovered/erased, and take_inos is used to skip forward to avoid re-using any inode numbers that had been claimed by journal entries that we threw away."

We had the situation where our MDS was crashing at MDCache::add_inode(CInode*), as discussed earlier. take_inos should fix this, as you mentioned, but we thought that we would need to reset what the MDS was holding, just like the session. So with your clarification, I believe we only need to do these:

1. journal backup
2. recover dentries
3. reset mds journal (it wasn't replaying anyway, kept crashing)
4. reset session
5. take_inos
6. start mds up again

Is that correct? Many thanks, I've learned a lot more about this process.

Cheers,
Linh

From: John Spray
Sent: Tuesday, 10 July 2018 7:24 PM
To: Linh Vu
Cc: Wido den Hollander; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

On Tue, Jul 10, 2018 at 2:49 AM Linh Vu wrote:
>
> While we're on this topic, could someone please explain to me what
> `cephfs-table-tool all reset inode` does?

The inode table stores an interval set of free inode numbers. Active MDS daemons consume inode numbers as they create files. Resetting the inode table means rewriting it to its original state (i.e. everything free). Using the "take_inos" command consumes some range of inodes, to reflect that the inodes up to a certain point aren't really free, but in use by some files that already exist.
> Does it only reset what the MDS has in its cache, and after starting up > again, the MDS will read in new inode range from the metadata pool? I'm repeating myself a bit, but for the benefit of anyone reading this thread in the future: no, it's nothing like that. It effectively *erases the inode table* by overwriting it ("resetting") with a blank one. As with the journal tool (https://github.com/ceph/ceph/pull/22853), perhaps the verb "reset" is too prone to misunderstanding. > If so, does it mean *before* we run `cephfs-table-tool take_inos`, we must > run `cephfs-table-tool all reset inode`? The first question is why we're talking about running it at all. What chain of reasoning led you to believe that your inotable needed erasing? The most typical case is where the journal has been recovered/erased, and take_inos is used to skip forward to avoid re-using any inode numbers that had been claimed by journal entries that we threw away. John > > Cheers, > > Linh > > > From: ceph-users on behalf of Wido den > Hollander > Sent: Saturday, 7 July 2018 12:26:15 AM > To: John Spray > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors > > > > On 07/06/2018 01:47 PM, John Spray wrote: > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander wrote: > >> > >> > >> > >> On 07/05/2018 03:36 PM, John Spray wrote: > >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) > >>> wrote: > > Hi list, > > I have a serious problem now... I think. > > One of my users just informed me that a file he created (.doc file) has > a different content then before. It looks like the file's inode is > completely wrong and points to the wrong object. I myself have found > another file with the same symptoms. I'm afraid my (production) FS is > corrupt now, unless there is a possibility to fix the inodes. 
> >>> > >>> You can probably get back to a state with some valid metadata, but it > >>> might not necessarily be the metadata the user was expecting (e.g. if > >>> two files are claiming the same inode number, one of them's is > >>> probably going to get deleted). > >>> > Timeline of what happend: > > Last week I upgraded our Ceph Jewel to Luminous. > This went without any problem. > > I already had 5 MDS available and went with the Multi-MDS feature and > enabled it. The seemed to work okay, but after a while my MDS went > beserk and went flapping (crashed -> replay -> rejoin -> crashed) > > The only way to fix this and get the FS back online was the disaster > recovery procedure: > > cephfs-journal-tool event recover_dentries summary > ceph fs set cephfs cluster_down true > cephfs-table-tool all reset session > cephfs-table-tool all reset inode > cephfs-journal-tool --rank=cephfs:0
Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
Thanks John :) Has it - asserting out on dupe inode - already been logged as a bug yet? I could put one in if needed. Cheers, Linh From: John Spray Sent: Tuesday, 10 July 2018 7:11 PM To: Linh Vu Cc: Wido den Hollander; ceph-users@lists.ceph.com Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors On Tue, Jul 10, 2018 at 12:43 AM Linh Vu wrote: > > We're affected by something like this right now (the dup inode causing MDS to > crash via assert(!p) with add_inode(CInode) function). > > In terms of behaviours, shouldn't the MDS simply skip to the next available > free inode in the event of a dup, than crashing the entire FS because of one > file? Probably I'm missing something but that'd be a no brainer picking > between the two? Historically (a few years ago) the MDS asserted out on any invalid metadata. Most of these cases have been picked up and converted into explicit damage handling, but this one appears to have been missed -- so yes, it's a bug that the MDS asserts out. John > > From: ceph-users on behalf of Wido den > Hollander > Sent: Saturday, 7 July 2018 12:26:15 AM > To: John Spray > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors > > > > On 07/06/2018 01:47 PM, John Spray wrote: > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander wrote: > >> > >> > >> > >> On 07/05/2018 03:36 PM, John Spray wrote: > >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) > >>> wrote: > > Hi list, > > I have a serious problem now... I think. > > One of my users just informed me that a file he created (.doc file) has > a different content then before. It looks like the file's inode is > completely wrong and points to the wrong object. I myself have found > another file with the same symptoms. I'm afraid my (production) FS is > corrupt now, unless there is a possibility to fix the inodes. 
> >>> > >>> You can probably get back to a state with some valid metadata, but it > >>> might not necessarily be the metadata the user was expecting (e.g. if > >>> two files are claiming the same inode number, one of them's is > >>> probably going to get deleted). > >>> > Timeline of what happend: > > Last week I upgraded our Ceph Jewel to Luminous. > This went without any problem. > > I already had 5 MDS available and went with the Multi-MDS feature and > enabled it. The seemed to work okay, but after a while my MDS went > beserk and went flapping (crashed -> replay -> rejoin -> crashed) > > The only way to fix this and get the FS back online was the disaster > recovery procedure: > > cephfs-journal-tool event recover_dentries summary > ceph fs set cephfs cluster_down true > cephfs-table-tool all reset session > cephfs-table-tool all reset inode > cephfs-journal-tool --rank=cephfs:0 journal reset > ceph mds fail 0 > ceph fs reset cephfs --yes-i-really-mean-it > >>> > >>> My concern with this procedure is that the recover_dentries and > >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks > >>> would have retained lots of content in their journals. I wonder if we > >>> should be adding some more multi-mds aware checks to these tools, to > >>> warn the user when they're only acting on particular ranks (a > >>> reasonable person might assume that recover_dentries with no args is > >>> operating on all ranks, not just 0). Created > >>> http://tracker.ceph.com/issues/24780 to track improving the default > >>> behaviour. > >>> > Restarted the MDS and I was back online. Shortly after I was getting a > lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It > looks like it had trouble creating new inodes. 
Right before the crash > it mostly complained something like: > > -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 mds.0.server > handle_client_request client_request(client.324932014:1434 create > #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0, > caller_gid=0{}) v2 > -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 mds.0.log > _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1 > dirs], 1 open files > 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph- > 12.2.5/src/mds/MDCache.cc: In function 'void > MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05 > 05:05:01.615123 > /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p) > > I also tried to counter the create inode crash by doing the following: > > cephfs-journal-tool event recover_dentries > cephfs-journal-tool journal reset > cephfs-table-tool all reset session > cephfs-table-tool all reset inode > cephfs-table-tool all take_inos 10 >
Re: [ceph-users] size of journal partitions pretty small
1) Yes, 5 GB is the default. You can control this with the 'osd journal size' option during creation. (Or partition the disk manually.)

2) No. Well, maybe a little bit in weird edge cases with tuned configs, but that's rarely advisable. Using Bluestore instead of Filestore might help with the performance, though.

Paul

2018-07-10 21:03 GMT+02:00 Robert Stanford:
>
> I installed my OSDs using ceph-disk. The journals are SSDs and are 1TB.
> I notice that Ceph has only dedicated 5GB each to the four OSDs that use
> the journal.
>
> 1) Is this normal
>
> 2) Would performance increase if I made the partitions bigger?
>
> Thank you

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
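The gap between the 5 GB filestore default and the ~1%-of-device rule of thumb mentioned elsewhere in this thread is easy to sanity-check. The device size below is an assumption matching the OP's 1 TB SSD, and the 1% figure is the list's rule of thumb, not an official formula:

```shell
# Rough journal/DB partition sizing helper (assumption: 1 TB journal device).
device_mb=1000000                 # 1 TB expressed in MB (assumed)
one_percent_mb=$((device_mb / 100))
default_mb=5120                   # filestore 'osd journal size' default (5 GB)
echo "1% rule suggests ${one_percent_mb} MB vs default ${default_mb} MB"
```

Note that a larger 'osd journal size' must be set in ceph.conf before the OSD is created; changing it afterwards means recreating the journal.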
Re: [ceph-users] Journal SSD recommendation
On 10/07/18 19:32, Robert Stanford wrote:
> Do the recommendations apply to both data and journal SSDs equally?

Search the list for "Many concurrent drive failures - How do I activate pgs?" to read about the Intel DC S4600 failure story. The OP had several 2TB models of these fail when used as Bluestore data devices. The Samsung SM863a is discussed as a good alternative in the same thread.

Simon
[ceph-users] size of journal partitions pretty small
I installed my OSDs using ceph-disk. The journals are SSDs and are 1TB. I notice that Ceph has only dedicated 5GB each to the four OSDs that use the journal.

1) Is this normal?

2) Would performance increase if I made the partitions bigger?

Thank you
Re: [ceph-users] Journal SSD recommendation
On 10/07/18 18:59, Satish Patel wrote:
> Thanks, I would also like to know about Intel SSD 3700 (Intel SSD SC 3700 Series SSDSC2BA400G3P), price also looking promising. Do you have an opinion on it?

I can't quite tell from Google what exactly that is. If it's the Intel DC S3700 then I believe those are discontinued now, but if you can still get hold of them they were used successfully and recommended by lots of cephers, myself included.

Cheers,
Simon.
Re: [ceph-users] Erasure coding RBD pool for OpenStack Glance, Nova and Cinder
2018-07-10 6:26 GMT+02:00 Konstantin Shalygin:
>> rbd default data pool = erasure_rbd_data
>
> Keep in mind, your minimal client version is Luminous.

Specifically, it's 12.2.2 or later for the clients! 12.2.0/1 clients have serious bugs in the rbd EC code that will ruin your day as soon as you try to use snapshots.

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
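The client-version requirement can be enforced and verified with the standard CLI; a sketch (exact output format varies by point release):

```
# Require luminous-or-newer clients before relying on the data-pool feature
ceph osd set-require-min-compat-client luminous

# Inspect which feature bits currently connected clients report
ceph features
```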
Re: [ceph-users] Looking for some advise on distributed FS: Is Ceph the right option for me?
Yes, Ceph is probably a good fit for what you are planning. The documentation should answer your questions: http://docs.ceph.com/docs/master/ Look for erasure coding, crush rules, and CephFS-specific pages in particular. Paul 2018-07-10 18:40 GMT+02:00 Jones de Andrade : > Hi all. > > I'm looking for some information on several distributed filesystems for > our application. > > It looks like it finally came down to two candidates, Ceph being one of > them. But there are still a few questions about ir that I would really like > to clarify, if possible. > > Our plan, initially on 6 workstations, is to have it hosting a distributed > file system that can withstand two simultaneous computers failures without > data loss (something that can remember a raid 6, but over the network). > This file system will also need to be also remotely mounted (NFS server > with fallbacks) by other 5+ computers. Students will be working on all 11+ > computers at the same time (different requisites from different softwares: > some use many small files, other a few really big, 100s gb, files), and > absolutely no hardware modifications are allowed. This initial test bed is > for undergraduate students usage, but if successful will be employed also > for our small clusters. The connection is a simple GbE. > > Our actual concerns are: > 1) Data Resilience: It seems that double copy of each block is the > standard setting, is it correct? As such, it will strip-parity data among > three computers for each block? > > 2) Metadata Resilience: We seen that we can now have more than a single > Metadata Server (which was a show-stopper on previous versions). However, > do they have to be dedicated boxes, or they can share boxes with the Data > Servers? Can it be configured in such a way that even if two metadata > server computers fail the whole system data will still be accessible from > the remaining computers, without interruptions, or they share different > data aiming only for performance? 
> > 3) Other softwares compability: We seen that there is NFS incompability, > is it correct? Also, any posix issues? > > 4) No single (or double) point of failure: every single possible stance > has to be able to endure a *double* failure (yes, things can get time to be > fixed here). Does Ceph need s single master server for any of its > activities? Can it endure double failure? How long would it take to any > sort of "fallback" to be completed, users would need to wait to regain > access? > > I think that covers the initial questions we have. Sorry if this is the > wrong list, however. > > Looking forward for any answer or suggestion, > > Regards, > > Jones > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Journal SSD recommendation
Do the recommendations apply to both data and journal SSDs equally? On Tue, Jul 10, 2018 at 12:59 PM, Satish Patel wrote: > On Tue, Jul 10, 2018 at 11:51 AM, Simon Ironside > wrote: > > Hi, > > > > On 10/07/18 16:25, Satish Patel wrote: > >> > >> Folks, > >> > >> I am in middle or ordering hardware for my Ceph cluster, so need some > >> recommendation from communities. > >> > >> - What company/Vendor SSD is good ? > > > > > > Samsung SM863a is the current favourite I believe. > > Thanks, I would also like to know about Intel SSD 3700 (Intel SSD SC > 3700 Series SSDSC2BA400G3P), price also looking promising, Do you have > opinion on it? also should i get 1 SSD driver for journal or need > two? I am planning to put 5 OSD per server > > > > > > The Intel DC S4600 is one to specifically avoid at the moment unless the > > latest firmware has resolved some of the list member reported issues. > > > >> - What size should be good for Journal (for BlueStore) > > > > > > ceph-disk defaults to a RocksDB partition that is 1% of the main device > > size. That'll get you in the right ball park. > > > >> I have lots of Samsung 850 EVO but they are consumer, Do you think > >> consume drive should be good for journal? > > > > > > No :) > > > > Cheers, > > Simon. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Journal SSD recommendation
On Tue, Jul 10, 2018 at 11:51 AM, Simon Ironside wrote: > Hi, > > On 10/07/18 16:25, Satish Patel wrote: >> >> Folks, >> >> I am in middle or ordering hardware for my Ceph cluster, so need some >> recommendation from communities. >> >> - What company/Vendor SSD is good ? > > > Samsung SM863a is the current favourite I believe. Thanks, I would also like to know about Intel SSD 3700 (Intel SSD SC 3700 Series SSDSC2BA400G3P), price also looking promising, Do you have opinion on it? also should i get 1 SSD driver for journal or need two? I am planning to put 5 OSD per server > > The Intel DC S4600 is one to specifically avoid at the moment unless the > latest firmware has resolved some of the list member reported issues. > >> - What size should be good for Journal (for BlueStore) > > > ceph-disk defaults to a RocksDB partition that is 1% of the main device > size. That'll get you in the right ball park. > >> I have lots of Samsung 850 EVO but they are consumer, Do you think >> consume drive should be good for journal? > > > No :) > > Cheers, > Simon. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recovering from no quorum (2/3 monitors down) via 1 good monitor
Hi Paul, Yes that's what I did - caused some errors. In the end I had to delete the /var/lib/ceph/mon/* directory in the bad node and run inject with --mkfs argument to recreate the database. I am good now - thanks. :) On Tue, Jul 10, 2018 at 10:46 PM, Paul Emmerich wrote: > easy: > > 1. make sure that none of the mons are running > 2. extract the monmap from the good one > 3. use monmaptool to remove the two other mons from it > 4. inject the mon map back into the good mon > 5. start the good mon > 6. you now have a running cluster with only one mon, add two new ones > > > Paul > > > 2018-07-10 5:50 GMT+02:00 Syahrul Sazli Shaharir : >> >> Hi, >> >> I am running proxmox pve-5.1, with ceph luminous 12.2.4 as storage. I >> have been running on 3 monitors, up until an abrupt power outage, >> resulting in 2 monitors down and unable to start, while 1 monitor up >> but with no quorum. >> >> I tried extracting monmap from the good monitor and injecting it into >> the other two, but got different errors for each:- >> >> 1. mon.mail1 >> >> # ceph-mon -i mail1 --inject-monmap /tmp/monmap >> 2018-07-10 11:29:03.562840 7f7d82845f80 -1 abort: Corruption: Bad >> table magic number*** Caught signal (Aborted) ** >> in thread 7f7d82845f80 thread_name:ceph-mon >> >> ceph version 12.2.4 (4832b6f0acade977670a37c20ff5dbe69e727416) >> luminous (stable) >> 1: (()+0x9439e4) [0x5652655669e4] >> 2: (()+0x110c0) [0x7f7d81bfe0c0] >> 3: (gsignal()+0xcf) [0x7f7d7ee12fff] >> 4: (abort()+0x16a) [0x7f7d7ee1442a] >> 5: (RocksDBStore::get(std::__cxx11::basic_string> std::char_traits, std::allocator > const&, >> std::__cxx11::basic_string, >> std::allocator > const&, ceph::buffer::list*)+0x2f9) >> [0x5652650a2eb9] >> 6: (main()+0x1377) [0x565264ec3c57] >> 7: (__libc_start_main()+0xf1) [0x7f7d7ee002e1] >> 8: (_start()+0x2a) [0x565264f5954a] >> 2018-07-10 11:29:03.563721 7f7d82845f80 -1 *** Caught signal (Aborted) ** >> in thread 7f7d82845f80 thread_name:ceph-mon >> >> 2. 
mon.mail2
>>
>> # ceph-mon -i mail2 --inject-monmap /tmp/monmap
>> 2018-07-10 11:18:07.536097 7f161e2e3f80 -1 rocksdb: Corruption: Can't
>> access /065339.sst: IO error:
>> /var/lib/ceph/mon/ceph-mail2/store.db/065339.sst: No such file or
>> directory
>> Can't access /065337.sst: IO error:
>> /var/lib/ceph/mon/ceph-mail2/store.db/065337.sst: No such file or
>> directory
>>
>> 2018-07-10 11:18:07.536106 7f161e2e3f80 -1 error opening mon data
>> directory at '/var/lib/ceph/mon/ceph-mail2': (22) Invalid argument
>>
>> Any other way I can recover other than rebuilding the monitor store
>> from the OSDs?
>>
>> Thanks.
>>
>> --
>> --sazli
>> Syahrul Sazli Shaharir
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90

--
--sazli
Syahrul Sazli Shaharir
Mobile: +6019 385 8301 - YM/Skype: syahrulsazli
System Administrator
TMK Pulasan (002339810-M)
http://pulasan.my/
11 Jalan 3/4, 43650 Bandar Baru Bangi, Selangor, Malaysia.
Tel/Fax: +603 8926 0338
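Paul's six numbered steps map onto commands roughly like this. A sketch only: 'mail3' is an assumed name for the healthy mon (the thread never names it), and all mons must be stopped first:

```
# 1-2. with all mons stopped, extract the monmap from the good mon
ceph-mon -i mail3 --extract-monmap /tmp/monmap

# 3. drop the two broken mons from the map
monmaptool /tmp/monmap --rm mail1 --rm mail2

# 4. inject the trimmed map back into the good mon
ceph-mon -i mail3 --inject-monmap /tmp/monmap

# 5. start the good mon; 6. then add two fresh mons as usual
systemctl start ceph-mon@mail3
```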
[ceph-users] Looking for some advise on distributed FS: Is Ceph the right option for me?
Hi all.

I'm looking for some information on several distributed filesystems for our application.

It looks like it finally came down to two candidates, Ceph being one of them. But there are still a few questions about it that I would really like to clarify, if possible.

Our plan, initially on 6 workstations, is to have it hosting a distributed file system that can withstand two simultaneous computer failures without data loss (something reminiscent of RAID 6, but over the network). This file system will also need to be remotely mounted (NFS server with fallbacks) by 5+ other computers. Students will be working on all 11+ computers at the same time (different requirements from different software: some use many small files, others a few really big, 100s of GB, files), and absolutely no hardware modifications are allowed. This initial test bed is for undergraduate student usage, but if successful it will also be employed for our small clusters. The connection is a simple GbE.

Our actual concerns are:

1) Data resilience: It seems that a double copy of each block is the standard setting, is that correct? As such, will it stripe parity data among three computers for each block?

2) Metadata resilience: We have seen that we can now have more than a single metadata server (the lack of which was a show-stopper on previous versions). However, do they have to be dedicated boxes, or can they share boxes with the data servers? Can it be configured in such a way that even if two metadata server computers fail, the whole system's data will still be accessible from the remaining computers, without interruptions, or do they serve different data, aiming only for performance?

3) Other software compatibility: We have seen that there is NFS incompatibility, is that correct? Also, any POSIX issues?

4) No single (or double) point of failure: every single possible component has to be able to endure a *double* failure (yes, things can take time to be fixed here). Does Ceph need a single master server for any of its activities?
Can it endure a double failure? How long would any sort of "fallback" take to complete; would users need to wait to regain access?

I think that covers the initial questions we have. Sorry if this is the wrong list, however.

Looking forward to any answer or suggestion,

Regards,

Jones
Re: [ceph-users] rbd lock remove unable to parse address
2018-07-10 14:37 GMT+02:00 Jason Dillaman:
> On Tue, Jul 10, 2018 at 2:37 AM Kevin Olbrich wrote:
>
>> 2018-07-10 0:35 GMT+02:00 Jason Dillaman:
>>
>>> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
>>> present on the client computer you used? I would have expected the OSD to
>>> determine the client address, so it's odd that it was able to get a
>>> link-local address.
>>
>> Yes, it is. eth0 is part of bond0, which is a vlan trunk. Bond0.X is
>> attached to brX, which has an ULA prefix for the ceph cluster.
>> Eth0 has no address itself. In this case this must mean the address has
>> been carried down to the hardware interface.
>>
>> I am wondering why it uses link-local when there is an ULA prefix
>> available.
>>
>> The address is available on brX on this client node.
>
> I'll open a tracker ticket to get that issue fixed, but in the meantime,
> you can run "rados -p <pool> rmxattr rbd_header.<image id>
> lock.rbd_lock" to remove the lock.

Worked perfectly, thank you very much!

>> - Kevin
>>
>>> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich wrote:
>>>
>>>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman:
>>>>
>>>>> BTW -- are you running Ceph on a one-node computer? I thought IPv6
>>>>> addresses starting w/ fe80 were link-local addresses which would probably
>>>>> explain why an interface scope id was appended. The current IPv6 address
>>>>> parser stops reading after it encounters a character that is neither hex
>>>>> nor a colon [1].
>>>>
>>>> No, this is a compute machine attached to the storage vlan where I
>>>> previously had also local disks.
>>>>
>>>>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman wrote:
>>>>>
>>>>>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
>>>>>> since it is failing to parse the address as valid. Perhaps it's barfing on
>>>>>> the "%eth0" scope id suffix within the address.
>>>>>>
>>>>>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I tried to convert a qcow2 file to rbd and set the wrong pool.
>>>>>>> Immediately I stopped the transfer, but the image is stuck locked.
>>>>>>> Previously when that happened, I was able to remove the image after
>>>>>>> 30 secs.
>>>>>>>
>>>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
>>>>>>> There is 1 exclusive lock on this image.
>>>>>>> Locker ID Address
>>>>>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>>>>
>>>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02
>>>>>>> "auto 93921602220416" client.1195723
>>>>>>> rbd: releasing lock failed: (22) Invalid argument
>>>>>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
>>>>>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to
>>>>>>> blacklist client: (22) Invalid argument
>>>>>>>
>>>>>>> The image is not in use anywhere!
>>>>>>>
>>>>>>> How can I force removal of all locks for this image?
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Kevin

[1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108
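For reference, the full sequence from this thread, with the elided identifiers written as placeholders (<pool>, <image> and <image id> are whatever `rbd info` reports for the stuck image):

```
# List locks and find the lock id and locker
rbd -p <pool> lock list <image>

# Normal removal (fails here because the locker address can't be parsed)
rbd -p <pool> lock rm <image> "<lock id>" <locker>

# Jason's workaround: remove the lock xattr from the header object directly
rados -p <pool> rmxattr rbd_header.<image id> lock.rbd_lock
```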
Re: [ceph-users] Journal SSD recommendation
Hi, On 10/07/18 16:25, Satish Patel wrote: Folks, I am in the middle of ordering hardware for my Ceph cluster, so I need some recommendations from the community. - What company/vendor SSD is good? Samsung SM863a is the current favourite, I believe. The Intel DC S4600 is one to specifically avoid at the moment, unless the latest firmware has resolved some of the issues reported by list members. - What size should be good for the journal (for BlueStore)? ceph-disk defaults to a RocksDB partition that is 1% of the main device size. That'll get you in the right ballpark. I have lots of Samsung 850 EVOs, but they are consumer drives. Do you think a consumer drive would be good for the journal? No :) Cheers, Simon.
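That 1% rule of thumb can be sanity-checked with a quick back-of-the-envelope calculation (a rough sketch, not the exact ceph-disk partitioning logic):

```shell
# 1% of the data device, per the rule of thumb above.
# Example: a 1800 GiB (~2 TB) SSD.
dev_gib=1800
db_gib=$(( dev_gib / 100 ))
echo "${db_gib} GiB"   # -> 18 GiB
```

So for the 2 TB SSDs discussed on this list, expect a DB partition in the high-teens-of-GiB range by default.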
[ceph-users] OSDs stalling on Intel SSDs
Hi everybody, I have a situation that occurs under moderate I/O load on Ceph Luminous: 2018-07-10 10:27:01.257916 mon.node4 mon.0 172.16.0.4:6789/0 15590 : cluster [INF] mon.node4 is new leader, mons node4,node5,node6,node7,node8 in quorum (ranks 0,1,2,3,4) 2018-07-10 10:27:01.306329 mon.node4 mon.0 172.16.0.4:6789/0 15595 : cluster [INF] Health check cleared: MON_DOWN (was: 1/5 mons down, quorum node4,node6,node7,node8) 2018-07-10 10:27:01.386124 mon.node4 mon.0 172.16.0.4:6789/0 15596 : cluster [WRN] overall HEALTH_WARN 1 osds down; Reduced data availability: 1 pg peering; Degraded data redundancy: 58774/10188798 objects degraded (0.577%), 13 pgs degraded; 412 slow requests are blocked > 32 sec 2018-07-10 10:27:02.598175 mon.node4 mon.0 172.16.0.4:6789/0 15597 : cluster [WRN] Health check update: Degraded data redundancy: 77153/10188798 objects degraded (0.757%), 17 pgs degraded (PG_DEGRADED) 2018-07-10 10:27:02.598225 mon.node4 mon.0 172.16.0.4:6789/0 15598 : cluster [WRN] Health check update: 381 slow requests are blocked > 32 sec (REQUEST_SLOW) 2018-07-10 10:27:02.598264 mon.node4 mon.0 172.16.0.4:6789/0 15599 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering) 2018-07-10 10:27:02.608006 mon.node4 mon.0 172.16.0.4:6789/0 15600 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down) 2018-07-10 10:27:02.701029 mon.node4 mon.0 172.16.0.4:6789/0 15601 : cluster [INF] osd.36 172.16.0.5:6800/3087 boot 2018-07-10 10:27:01.184334 osd.36 osd.36 172.16.0.5:6800/3087 23 : cluster [WRN] Monitor daemon marked osd.36 down, but it is still running 2018-07-10 10:27:04.861372 mon.node4 mon.0 172.16.0.4:6789/0 15604 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 381 slow requests are blocked > 32 sec) The OSDs that seem to be affected are Intel SSDs, specific model is SSDSC2BX480G4L. I have throttled backups to try to lessen the situation, but it seems to affect the same OSDs when it happens. 
It has the added side effect of taking down the mon on the same node for a few seconds and triggering a monitor election. I am wondering if this may be a firmware issue on this drive and if anyone has any insight or additional troubleshooting steps I should try to get a deeper look at this behavior. I am going to upgrade firmware on these drives and see if it helps. -- Shawn Iverson, CETL Director of Technology Rush County Schools 765-932-3901 x1171 ivers...@rushville.k12.in.us ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
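When chasing a suspected drive firmware issue like this, it helps to record the firmware revision before and after the upgrade. A sketch (the device path is a placeholder, and Intel's `isdct` tool is only available if installed separately):

```shell
# Firmware revision as reported by SMART
smartctl -i /dev/sdX | grep -i firmware

# Intel's datacenter SSD tool, if installed, lists drives and firmware
isdct show -intelssd
```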
Re: [ceph-users] Journal SSD recommendation
I think you will get some useful information from this link: https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ Even though it is dated 2014, it still points you in the right direction. Anton On 10.07.2018 18:25, Satish Patel wrote: Folks, I am in the middle of ordering hardware for my Ceph cluster, so I need some recommendations from the community. - What company/vendor SSD is good? - What size should be good for the journal (for BlueStore)? I have lots of Samsung 850 EVOs, but they are consumer drives. Do you think a consumer drive would be good for the journal? ~S
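The core of the test in that post is a single-threaded synchronous write benchmark, since that is the I/O pattern a journal/WAL device sees. A sketch along those lines (destructive -- it writes directly to the device, so only run it against an empty disk, and treat the exact flags as an approximation of the article's method):

```shell
# Measure sync 4k write performance; --sync=1 forces O_DSYNC writes,
# which is what separates good journal SSDs from consumer drives.
# WARNING: destroys data on /dev/sdX (placeholder device name).
fio --name=journal-test --filename=/dev/sdX \
    --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting
```

Datacenter SSDs with power-loss protection sustain tens of thousands of sync write IOPS here; consumer drives like the 850 EVO often collapse to a few hundred.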
[ceph-users] Journal SSD recommendation
Folks, I am in the middle of ordering hardware for my Ceph cluster, so I need some recommendations from the community. - What company/vendor SSD is good? - What size should be good for the journal (for BlueStore)? I have lots of Samsung 850 EVOs, but they are consumer drives. Do you think a consumer drive would be good for the journal? ~S
Re: [ceph-users] Recovering from no quorum (2/3 monitors down) via 1 good monitor
easy: 1. make sure that none of the mons are running 2. extract the monmap from the good one 3. use monmaptool to remove the two other mons from it 4. inject the mon map back into the good mon 5. start the good mon 6. you now have a running cluster with only one mon, add two new ones Paul 2018-07-10 5:50 GMT+02:00 Syahrul Sazli Shaharir : > Hi, > > I am running proxmox pve-5.1, with ceph luminous 12.2.4 as storage. I > have been running on 3 monitors, up until an abrupt power outage, > resulting in 2 monitors down and unable to start, while 1 monitor up > but with no quorum. > > I tried extracting the monmap from the good monitor and injecting it into > the other two, but got different errors for each:- > > 1. mon.mail1 > > # ceph-mon -i mail1 --inject-monmap /tmp/monmap > 2018-07-10 11:29:03.562840 7f7d82845f80 -1 abort: Corruption: Bad > table magic number *** Caught signal (Aborted) ** > in thread 7f7d82845f80 thread_name:ceph-mon > > ceph version 12.2.4 (4832b6f0acade977670a37c20ff5dbe69e727416) > luminous (stable) > 1: (()+0x9439e4) [0x5652655669e4] > 2: (()+0x110c0) [0x7f7d81bfe0c0] > 3: (gsignal()+0xcf) [0x7f7d7ee12fff] > 4: (abort()+0x16a) [0x7f7d7ee1442a] > 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, > std::allocator<char> > const&, > std::__cxx11::basic_string<char, std::char_traits<char>, > std::allocator<char> > const&, ceph::buffer::list*)+0x2f9) > [0x5652650a2eb9] > 6: (main()+0x1377) [0x565264ec3c57] > 7: (__libc_start_main()+0xf1) [0x7f7d7ee002e1] > 8: (_start()+0x2a) [0x565264f5954a] > 2018-07-10 11:29:03.563721 7f7d82845f80 -1 *** Caught signal (Aborted) ** > in thread 7f7d82845f80 thread_name:ceph-mon > >
> 2. mon.mail2 > > # ceph-mon -i mail2 --inject-monmap /tmp/monmap > 2018-07-10 11:18:07.536097 7f161e2e3f80 -1 rocksdb: Corruption: Can't > access /065339.sst: IO error: > /var/lib/ceph/mon/ceph-mail2/store.db/065339.sst: No such file or > directory > Can't access /065337.sst: IO error: > /var/lib/ceph/mon/ceph-mail2/store.db/065337.sst: No such file or > directory > > 2018-07-10 11:18:07.536106 7f161e2e3f80 -1 error opening mon data > directory at '/var/lib/ceph/mon/ceph-mail2': (22) Invalid argument > > Any other way I can recover other than rebuilding the monitor store > from the OSDs? > > Thanks. > > -- > --sazli > Syahrul Sazli Shaharir > -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90
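Concretely, Paul's six steps map onto commands like these (a sketch -- `<good-mon-id>` is a placeholder for the surviving mon's id; the two names removed are the dead mons from this thread):

```shell
# 1. make sure no mons are running (on each mon host)
systemctl stop ceph-mon.target

# 2. extract the monmap from the good mon
ceph-mon -i <good-mon-id> --extract-monmap /tmp/monmap

# 3. remove the two dead mons from the map
monmaptool /tmp/monmap --rm mail1 --rm mail2

# 4. inject the edited map back into the good mon
ceph-mon -i <good-mon-id> --inject-monmap /tmp/monmap

# 5. start the good mon again; 6. then deploy two fresh mons
systemctl start ceph-mon@<good-mon-id>
```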
Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
On Tue, Jul 10, 2018 at 3:14 PM Dennis Kramer (DBS) wrote: > > Hi John, > > On Tue, 2018-07-10 at 10:11 +0100, John Spray wrote: > > On Tue, Jul 10, 2018 at 12:43 AM Linh Vu wrote: > > > > > > > > > We're affected by something like this right now (the dup inode > > > causing MDS to crash via assert(!p) with add_inode(CInode) > > > function). > > > > > > In terms of behaviours, shouldn't the MDS simply skip to the next > > > available free inode in the event of a dup, than crashing the > > > entire FS because of one file? Probably I'm missing something but > > > that'd be a no brainer picking between the two? > > Historically (a few years ago) the MDS asserted out on any invalid > > metadata. Most of these cases have been picked up and converted into > > explicit damage handling, but this one appears to have been missed -- > > so yes, it's a bug that the MDS asserts out. > > I have followed the disaster recovery and now all my files and > directories in CephFS which complained about duplicate inodes > disappeared from my FS. I see *some* data in "lost+found", but thats > only a part of it. Is there any way to retrieve those missing files? If you had multiple files trying to use the same inode number, then the contents of the data pool would only have been storing the contents of one of those files (or, worst case, some interspersed mixture of both files). So the chances are that if something wasn't linked into lost+found, it is gone for good. Now that your damaged filesystem is up and running again, if you have the capacity then it's a good precaution to create a fresh filesystem, copy the files over, and then restore anything missing from backups. The multi-filesystem functionality is officially an experimental feature (mainly because it gets little testing), but when you've gone through a metadata damage episode it's the lesser of two evils. 
John > > > John > > > > > > > > > > > From: ceph-users on behalf of > > > Wido den Hollander > > > Sent: Saturday, 7 July 2018 12:26:15 AM > > > To: John Spray > > > Cc: ceph-users@lists.ceph.com > > > Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" > > > errors > > > > > > > > > > > > On 07/06/2018 01:47 PM, John Spray wrote: > > > > > > > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > On 07/05/2018 03:36 PM, John Spray wrote: > > > > > > > > > > > > On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) > > > > > lmes.nl> wrote: > > > > > > > > > > > > > > > > > > > > > Hi list, > > > > > > > > > > > > > > I have a serious problem now... I think. > > > > > > > > > > > > > > One of my users just informed me that a file he created > > > > > > > (.doc file) has > > > > > > > a different content then before. It looks like the file's > > > > > > > inode is > > > > > > > completely wrong and points to the wrong object. I myself > > > > > > > have found > > > > > > > another file with the same symptoms. I'm afraid my > > > > > > > (production) FS is > > > > > > > corrupt now, unless there is a possibility to fix the > > > > > > > inodes. > > > > > > You can probably get back to a state with some valid > > > > > > metadata, but it > > > > > > might not necessarily be the metadata the user was expecting > > > > > > (e.g. if > > > > > > two files are claiming the same inode number, one of them's > > > > > > is > > > > > > probably going to get deleted). > > > > > > > > > > > > > > > > > > > > Timeline of what happend: > > > > > > > > > > > > > > Last week I upgraded our Ceph Jewel to Luminous. > > > > > > > This went without any problem. > > > > > > > > > > > > > > I already had 5 MDS available and went with the Multi-MDS > > > > > > > feature and > > > > > > > enabled it. 
The seemed to work okay, but after a while my > > > > > > > MDS went > > > > > > > beserk and went flapping (crashed -> replay -> rejoin -> > > > > > > > crashed) > > > > > > > > > > > > > > The only way to fix this and get the FS back online was the > > > > > > > disaster > > > > > > > recovery procedure: > > > > > > > > > > > > > > cephfs-journal-tool event recover_dentries summary > > > > > > > ceph fs set cephfs cluster_down true > > > > > > > cephfs-table-tool all reset session > > > > > > > cephfs-table-tool all reset inode > > > > > > > cephfs-journal-tool --rank=cephfs:0 journal reset > > > > > > > ceph mds fail 0 > > > > > > > ceph fs reset cephfs --yes-i-really-mean-it > > > > > > My concern with this procedure is that the recover_dentries > > > > > > and > > > > > > journal reset only happened on rank 0, whereas the other 4 > > > > > > MDS ranks > > > > > > would have retained lots of content in their journals. I > > > > > > wonder if we > > > > > > should be adding some more multi-mds aware checks to these > > > > > > tools, to > > > > > > warn the user when they're only acting on particular ranks (a > > > > > > reasonable person might assume that recover_dentries
Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
Hi John, On Tue, 2018-07-10 at 10:11 +0100, John Spray wrote: > On Tue, Jul 10, 2018 at 12:43 AM Linh Vu wrote: > > > > > > We're affected by something like this right now (the dup inode > > causing MDS to crash via assert(!p) with add_inode(CInode) > > function). > > > > In terms of behaviours, shouldn't the MDS simply skip to the next > > available free inode in the event of a dup, than crashing the > > entire FS because of one file? Probably I'm missing something but > > that'd be a no brainer picking between the two? > Historically (a few years ago) the MDS asserted out on any invalid > metadata. Most of these cases have been picked up and converted into > explicit damage handling, but this one appears to have been missed -- > so yes, it's a bug that the MDS asserts out. I have followed the disaster recovery and now all my files and directories in CephFS which complained about duplicate inodes disappeared from my FS. I see *some* data in "lost+found", but thats only a part of it. Is there any way to retrieve those missing files? > John > > > > > > > From: ceph-users on behalf of > > Wido den Hollander > > Sent: Saturday, 7 July 2018 12:26:15 AM > > To: John Spray > > Cc: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" > > errors > > > > > > > > On 07/06/2018 01:47 PM, John Spray wrote: > > > > > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander > > > wrote: > > > > > > > > > > > > > > > > > > > > On 07/05/2018 03:36 PM, John Spray wrote: > > > > > > > > > > On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) > > > > lmes.nl> wrote: > > > > > > > > > > > > > > > > > > Hi list, > > > > > > > > > > > > I have a serious problem now... I think. > > > > > > > > > > > > One of my users just informed me that a file he created > > > > > > (.doc file) has > > > > > > a different content then before. It looks like the file's > > > > > > inode is > > > > > > completely wrong and points to the wrong object. 
I myself > > > > > > have found > > > > > > another file with the same symptoms. I'm afraid my > > > > > > (production) FS is > > > > > > corrupt now, unless there is a possibility to fix the > > > > > > inodes. > > > > > You can probably get back to a state with some valid > > > > > metadata, but it > > > > > might not necessarily be the metadata the user was expecting > > > > > (e.g. if > > > > > two files are claiming the same inode number, one of them's > > > > > is > > > > > probably going to get deleted). > > > > > > > > > > > > > > > > > Timeline of what happend: > > > > > > > > > > > > Last week I upgraded our Ceph Jewel to Luminous. > > > > > > This went without any problem. > > > > > > > > > > > > I already had 5 MDS available and went with the Multi-MDS > > > > > > feature and > > > > > > enabled it. The seemed to work okay, but after a while my > > > > > > MDS went > > > > > > beserk and went flapping (crashed -> replay -> rejoin -> > > > > > > crashed) > > > > > > > > > > > > The only way to fix this and get the FS back online was the > > > > > > disaster > > > > > > recovery procedure: > > > > > > > > > > > > cephfs-journal-tool event recover_dentries summary > > > > > > ceph fs set cephfs cluster_down true > > > > > > cephfs-table-tool all reset session > > > > > > cephfs-table-tool all reset inode > > > > > > cephfs-journal-tool --rank=cephfs:0 journal reset > > > > > > ceph mds fail 0 > > > > > > ceph fs reset cephfs --yes-i-really-mean-it > > > > > My concern with this procedure is that the recover_dentries > > > > > and > > > > > journal reset only happened on rank 0, whereas the other 4 > > > > > MDS ranks > > > > > would have retained lots of content in their journals. 
I > > > > > wonder if we > > > > > should be adding some more multi-mds aware checks to these > > > > > tools, to > > > > > warn the user when they're only acting on particular ranks (a > > > > > reasonable person might assume that recover_dentries with no > > > > > args is > > > > > operating on all ranks, not just 0). Created > > > > > https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?doma > > > > > in=tracker.ceph.com to track improving the default > > > > > behaviour. > > > > > > > > > > > > > > > > > Restarted the MDS and I was back online. Shortly after I > > > > > > was getting a > > > > > > lot of "loaded dup inode". In the meanwhile the MDS kept > > > > > > crashing. It > > > > > > looks like it had trouble creating new inodes. Right before > > > > > > the crash > > > > > > it mostly complained something like: > > > > > > > > > > > > -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 > > > > > > mds.0.server > > > > > > handle_client_request client_request(client.324932014:1434 > > > > > > create > > > > > > #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 > > > > > > caller_uid=0, > > > > > > caller_gid=0{}) v2 > > > > > > -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 > > > > >
Re: [ceph-users] rbd lock remove unable to parse address
On Tue, Jul 10, 2018 at 2:37 AM Kevin Olbrich wrote: > 2018-07-10 0:35 GMT+02:00 Jason Dillaman : > >> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least >> present on the client computer you used? I would have expected the OSD to >> determine the client address, so it's odd that it was able to get a >> link-local address. >> > > Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is > attached to brX which has an ULA-prefix for the ceph cluster. > Eth0 has no address itself. In this case this must mean, the address has > been carried down to the hardware interface. > > I am wondering why it uses link local when there is an ULA-prefix > available. > > The address is available on brX on this client node. > I'll open a tracker ticket to get that issue fixed, but in the meantime, you can run "rados -p rmxattr rbd_header. lock.rbd_lock" to remove the lock. > > - Kevin > > >> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich wrote: >> >>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman : >>> BTW -- are you running Ceph on a one-node computer? I thought IPv6 addresses starting w/ fe80 were link-local addresses which would probably explain why an interface scope id was appended. The current IPv6 address parser stops reading after it encounters a non-hex, non-colon character [1]. >>> >>> No, this is a compute machine attached to the storage vlan where I >>> previously also had local disks. >>> >>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman wrote: > Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses > since it is failing to parse the address as valid. Perhaps it's barfing on > the "%eth0" scope id suffix within the address. > > On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote: > >> Hi! >> >> I tried to convert a qcow2 file to rbd and set the wrong pool. >> Immediately I stopped the transfer but the image is stuck locked: >> >> Previously when that happened, I was able to remove the image after 30 >> secs. 
>> >> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02 >> There is 1 exclusive lock on this image. >> Locker ID Address >> >> client.1195723 auto 93921602220416 >> [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089 >> >> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto >> 93921602220416" client.1195723 >> rbd: releasing lock failed: (22) Invalid argument >> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse >> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089 >> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to >> blacklist client: (22) Invalid argument >> >> The image is not in use anywhere! >> >> How can I force removal of all locks for this image? >> >> Kind regards, >> Kevin >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > -- > Jason > [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108 -- Jason >>> >>> >> >> -- >> Jason >> > > -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
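Filling in the workaround above with the pool and image names from this thread (a sketch -- `<id>` is a placeholder for the image's internal id, which has to be read from `rbd info` first; do not guess it):

```shell
# The image id is the suffix of block_name_prefix (rbd_data.<id>)
rbd -p rbd_vms_hdd info fpi_server02 | grep block_name_prefix

# Remove the lock xattr directly from the image's header object,
# bypassing the broken address parsing in "rbd lock rm"
rados -p rbd_vms_hdd rmxattr rbd_header.<id> lock.rbd_lock

# Verify the lock is gone
rbd -p rbd_vms_hdd lock list fpi_server02
```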
Re: [ceph-users] Erasure coding RBD pool for OpenStack Glance, Nova and Cinder
On 2018-07-10 06:26, Konstantin Shalygin wrote: Has anyone used EC pools with OpenStack in production? By chance, I found this link: https://www.reddit.com/r/ceph/comments/72yc9m/ceph_openstack_with_ec/ Yes, this is a good post. My configuration is: cinder.conf: [erasure-rbd-hdd] volume_driver = cinder.volume.drivers.rbd.RBDDriver volume_backend_name = erasure-rbd-hdd rbd_pool = erasure_rbd_meta rbd_user = cinder_erasure_hdd rbd_ceph_conf = /etc/ceph/ceph.conf ceph.conf: [client.cinder_erasure_hdd] rbd default data pool = erasure_rbd_data Keep in mind that the minimum client version is Luminous. So the trick is: tell everyone your pool is "erasure_rbd_meta"; rbd clients will find the data pool "erasure_rbd_data" automatically. k Thank you for your feedback, Konstantin! If you don't mind, two more questions for you: - How do you handle your ceph.conf configuration (default data pool per user) / distribution? Manually, config management, openstack-ansible...? - Did you make any comparisons or benchmarks between replicated pools and EC pools on the same hardware/drives? I read that small writes are not very performant with EC. Thanks again, -- Gilles
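As background to the meta/data pool split described above, a sketch of how such a pool pair could be created on Luminous (PG counts and the default EC profile are placeholder assumptions; `allow_ec_overwrites` is what makes RBD on an EC pool possible at all):

```shell
# Replicated metadata pool -- the one clients are actually pointed at
ceph osd pool create erasure_rbd_meta 64 64 replicated

# Erasure-coded data pool; overwrites must be enabled for RBD use
ceph osd pool create erasure_rbd_data 128 128 erasure
ceph osd pool set erasure_rbd_data allow_ec_overwrites true

# Tag both pools for the rbd application
ceph osd pool application enable erasure_rbd_meta rbd
ceph osd pool application enable erasure_rbd_data rbd
```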
Re: [ceph-users] Mimic 13.2.1 release date
> On 9 Jul 2018, at 17:12, Wido den Hollander wrote: > > Hi, > > Is there a release date for Mimic 13.2.1 yet? > > There are a few issues which currently make deploying with Mimic 13.2.0 > a bit difficult, for example: > > - https://tracker.ceph.com/issues/24423 > - https://github.com/ceph/ceph/pull/22393 > > Especially the first one makes it difficult. > > 13.2.1 would be very welcome with these fixes in there. > > Is there an ETA for this version yet? > > Wido Also looking forward to this release; we had to revert to luminous to continue expanding our cluster. An ETA would be great, thanks. Best regards, Martin Overgaard Hansen MultiHouse IT Partner A/S
[ceph-users] Add Partitions to Ceph Cluster
Hi, is it possible to use just a partition instead of a whole disk for an OSD? On a server I already use sdb for Ceph and want to add sda4 to the cluster, but it didn't work for me. On the server with the partition I tried: ceph-disk prepare /dev/sda4 and ceph-disk activate /dev/sda4 And with df I see that Ceph did something on the partition: /dev/sda4 1.8T 2.8G 1.8T 1% /var/lib/ceph/osd/ceph-4 My problem is that after I activated the disk, I didn't see a change in the ceph status output: data: pools: 6 pools, 168 pgs objects: 25.84 k objects, 100 GiB usage: 305 GiB used, 6.8 TiB / 7.1 TiB avail pgs: 168 active+clean Can someone help me?
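A few things worth checking when an activated OSD does not show up in the capacity numbers -- typically the new OSD is up but carries a CRUSH weight of 0. A sketch (the osd id and weight are illustrative; take them from your own `ceph osd tree` output):

```shell
# Is the new OSD up/in, and does it have a non-zero CRUSH weight?
ceph osd tree
ceph osd df

# If it joined with weight 0, give it a weight matching its size in TiB
ceph osd crush reweight osd.4 1.8
```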
Re: [ceph-users] Luminous 12.2.6 release date?
Hi Sean, On Tue, 10 Jul 2018, Sean Redmond said: > Can you please link me to the tracker 12.2.6 fixes? I have disabled > resharding in 12.2.5 due to it running endlessly. http://tracker.ceph.com/issues/22721 Sean > Thanks > > On Tue, Jul 10, 2018 at 9:07 AM, Sean Purdy > wrote: > > > While we're at it, is there a release date for 12.2.6? It fixes a > > reshard/versioning bug for us. > > > > Sean > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
On Tue, Jul 10, 2018 at 2:49 AM Linh Vu wrote: > > While we're on this topic, could someone please explain to me what > `cephfs-table-tool all reset inode` does? The inode table stores an interval set of free inode numbers. Active MDS daemons consume inode numbers as they create files. Resetting the inode table means rewriting it to its original state (i.e. everything free). Using the "take_inos" command consumes some range of inodes, to reflect that the inodes up to a certain point aren't really free, but in use by some files that already exist. > Does it only reset what the MDS has in its cache, and after starting up > again, the MDS will read in new inode range from the metadata pool? I'm repeating myself a bit, but for the benefit of anyone reading this thread in the future: no, it's nothing like that. It effectively *erases the inode table* by overwriting it ("resetting") with a blank one. As with the journal tool (https://github.com/ceph/ceph/pull/22853), perhaps the verb "reset" is too prone to misunderstanding. > If so, does it mean *before* we run `cephfs-table-tool take_inos`, we must > run `cephfs-table-tool all reset inode`? The first question is why we're talking about running it at all. What chain of reasoning led you to believe that your inotable needed erasing? The most typical case is where the journal has been recovered/erased, and take_inos is used to skip forward to avoid re-using any inode numbers that had been claimed by journal entries that we threw away. 
John > > Cheers, > > Linh > > > From: ceph-users on behalf of Wido den > Hollander > Sent: Saturday, 7 July 2018 12:26:15 AM > To: John Spray > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors > > > > On 07/06/2018 01:47 PM, John Spray wrote: > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander wrote: > >> > >> > >> > >> On 07/05/2018 03:36 PM, John Spray wrote: > >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) > >>> wrote: > > Hi list, > > I have a serious problem now... I think. > > One of my users just informed me that a file he created (.doc file) has > a different content then before. It looks like the file's inode is > completely wrong and points to the wrong object. I myself have found > another file with the same symptoms. I'm afraid my (production) FS is > corrupt now, unless there is a possibility to fix the inodes. > >>> > >>> You can probably get back to a state with some valid metadata, but it > >>> might not necessarily be the metadata the user was expecting (e.g. if > >>> two files are claiming the same inode number, one of them's is > >>> probably going to get deleted). > >>> > Timeline of what happend: > > Last week I upgraded our Ceph Jewel to Luminous. > This went without any problem. > > I already had 5 MDS available and went with the Multi-MDS feature and > enabled it. 
The seemed to work okay, but after a while my MDS went > beserk and went flapping (crashed -> replay -> rejoin -> crashed) > > The only way to fix this and get the FS back online was the disaster > recovery procedure: > > cephfs-journal-tool event recover_dentries summary > ceph fs set cephfs cluster_down true > cephfs-table-tool all reset session > cephfs-table-tool all reset inode > cephfs-journal-tool --rank=cephfs:0 journal reset > ceph mds fail 0 > ceph fs reset cephfs --yes-i-really-mean-it > >>> > >>> My concern with this procedure is that the recover_dentries and > >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks > >>> would have retained lots of content in their journals. I wonder if we > >>> should be adding some more multi-mds aware checks to these tools, to > >>> warn the user when they're only acting on particular ranks (a > >>> reasonable person might assume that recover_dentries with no args is > >>> operating on all ranks, not just 0). Created > >>> https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com > >>> to track improving the default > >>> behaviour. > >>> > Restarted the MDS and I was back online. Shortly after I was getting a > lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It > looks like it had trouble creating new inodes. Right before the crash > it mostly complained something like: > > -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 mds.0.server > handle_client_request client_request(client.324932014:1434 create > #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0, > caller_gid=0{}) v2 > -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 mds.0.log > _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1 > dirs], 1 open files > 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1
Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
On Tue, Jul 10, 2018 at 12:43 AM Linh Vu wrote: > > We're affected by something like this right now (the dup inode causing MDS to > crash via assert(!p) with add_inode(CInode) function). > > In terms of behaviours, shouldn't the MDS simply skip to the next available > free inode in the event of a dup, than crashing the entire FS because of one > file? Probably I'm missing something but that'd be a no brainer picking > between the two? Historically (a few years ago) the MDS asserted out on any invalid metadata. Most of these cases have been picked up and converted into explicit damage handling, but this one appears to have been missed -- so yes, it's a bug that the MDS asserts out. John > > From: ceph-users on behalf of Wido den > Hollander > Sent: Saturday, 7 July 2018 12:26:15 AM > To: John Spray > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors > > > > On 07/06/2018 01:47 PM, John Spray wrote: > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander wrote: > >> > >> > >> > >> On 07/05/2018 03:36 PM, John Spray wrote: > >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) > >>> wrote: > > Hi list, > > I have a serious problem now... I think. > > One of my users just informed me that a file he created (.doc file) has > a different content then before. It looks like the file's inode is > completely wrong and points to the wrong object. I myself have found > another file with the same symptoms. I'm afraid my (production) FS is > corrupt now, unless there is a possibility to fix the inodes. > >>> > >>> You can probably get back to a state with some valid metadata, but it > >>> might not necessarily be the metadata the user was expecting (e.g. if > >>> two files are claiming the same inode number, one of them's is > >>> probably going to get deleted). > >>> > Timeline of what happend: > > Last week I upgraded our Ceph Jewel to Luminous. > This went without any problem. 
> > I already had 5 MDS available and went with the Multi-MDS feature and > enabled it. The seemed to work okay, but after a while my MDS went > beserk and went flapping (crashed -> replay -> rejoin -> crashed) > > The only way to fix this and get the FS back online was the disaster > recovery procedure: > > cephfs-journal-tool event recover_dentries summary > ceph fs set cephfs cluster_down true > cephfs-table-tool all reset session > cephfs-table-tool all reset inode > cephfs-journal-tool --rank=cephfs:0 journal reset > ceph mds fail 0 > ceph fs reset cephfs --yes-i-really-mean-it > >>> > >>> My concern with this procedure is that the recover_dentries and > >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks > >>> would have retained lots of content in their journals. I wonder if we > >>> should be adding some more multi-mds aware checks to these tools, to > >>> warn the user when they're only acting on particular ranks (a > >>> reasonable person might assume that recover_dentries with no args is > >>> operating on all ranks, not just 0). Created > >>> https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com > >>> to track improving the default > >>> behaviour. > >>> > Restarted the MDS and I was back online. Shortly after I was getting a > lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It > looks like it had trouble creating new inodes. 
> Right before the crash it mostly complained with something like:
>
> -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 mds.0.server
> handle_client_request client_request(client.324932014:1434 create
> #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
> caller_gid=0{}) v2
> -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 mds.0.log
> _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1
> dirs], 1 open files
> 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
> 12.2.5/src/mds/MDCache.cc: In function 'void
> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
> 05:05:01.615123
> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
>
> I also tried to counter the create-inode crash by doing the following:
>
> cephfs-journal-tool event recover_dentries
> cephfs-journal-tool journal reset
> cephfs-table-tool all reset session
> cephfs-table-tool all reset inode
> cephfs-table-tool all take_inos 10
> >>>
> >>> This procedure is recovering some metadata from the journal into the
> >>> main tree, then resetting everything, but duplicate inodes are
> >>> happening when the main tree has multiple dentries containing inodes
> >>> using the same inode number.
> >>>
> >>> What you need is something that scans
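The FAILED assert(!p) in the log above can be modelled in miniature: add_inode expects the inode number to be absent from the cache, and a duplicate trips the assertion instead of being skipped. A toy sketch (Python; the names are illustrative, not the actual Ceph C++ API):

```python
class MDCacheSketch:
    """Toy model of MDCache::add_inode: the cache maps inode numbers
    to inodes, and inserting a number that is already present trips
    the equivalent of FAILED assert(!p)."""

    def __init__(self):
        self._inode_map = {}

    def add_inode(self, ino, inode):
        # In the real code, 'p' is the existing entry found for this
        # inode number; assert(!p) demands that there is none.
        assert ino not in self._inode_map, f"duplicate inode 0x{ino:x}"
        self._inode_map[ino] = inode


cache = MDCacheSketch()
cache.add_inode(0x1360346, "pyfiles.txt")
try:
    cache.add_inode(0x1360346, "another-file")  # duplicate -> assertion
except AssertionError as e:
    print(e)  # duplicate inode 0x1360346
```

Linh Vu's suggestion amounts to replacing the assert with "pick the next free inode number"; John's reply explains why the current code asserts instead.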
Re: [ceph-users] Luminous 12.2.6 release date?
Hi Sean (good name btw),

Can you please link me to the tracker issues that 12.2.6 fixes? I have
disabled resharding in 12.2.5 due to it running endlessly.

Thanks

On Tue, Jul 10, 2018 at 9:07 AM, Sean Purdy wrote:
> While we're at it, is there a release date for 12.2.6? It fixes a
> reshard/versioning bug for us.
>
> Sean
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Mimic 13.2.1 release date
> On 9 Jul 2018, at 17.11, Wido den Hollander wrote:
>
> Hi,
>
> Is there a release date for Mimic 13.2.1 yet?
>
> There are a few issues which currently make deploying with Mimic 13.2.0
> a bit difficult, for example:
>
> - https://tracker.ceph.com/issues/24423
> - https://github.com/ceph/ceph/pull/22393
>
> Especially the first one makes it difficult.

+1

> 13.2.1 would be very welcome with these fixes in there.

+1

> Is there an ETA for this version yet?
>
> Wido
[ceph-users] Luminous 12.2.6 release date?
While we're at it, is there a release date for 12.2.6? It fixes a
reshard/versioning bug for us.

Sean
[ceph-users] ceph poor performance when compress files
Hi Ceph experts,

When I compress files stored in my Ceph cluster using the gzip command, the
command takes a long time. The poor performance occurs only when gzipping
files stored on Ceph. Any ideas about this problem?

Thank you
Re: [ceph-users] rbd lock remove unable to parse address
2018-07-10 0:35 GMT+02:00 Jason Dillaman:
> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
> present on the client computer you used? I would have expected the OSD to
> determine the client address, so it's odd that it was able to get a
> link-local address.

Yes, it is. eth0 is part of bond0, which is a vlan trunk. bond0.X is
attached to brX, which has a ULA prefix for the ceph cluster. eth0 has no
address itself. In this case this must mean the address has been carried
down to the hardware interface. I am wondering why it uses link-local when
there is a ULA prefix available. The address is available on brX on this
client node.

- Kevin

> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich wrote:
>
>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman:
>>
>>> BTW -- are you running Ceph on a one-node computer? I thought IPv6
>>> addresses starting w/ fe80 were link-local addresses, which would
>>> probably explain why an interface scope id was appended. The current
>>> IPv6 address parser stops reading after it encounters a non-hex,
>>> non-colon character [1].
>>
>> No, this is a compute machine attached to the storage vlan where I
>> previously also had local disks.
>>
>>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman wrote:
>>>
>>>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
>>>> since it is failing to parse the address as valid. Perhaps it's
>>>> barfing on the "%eth0" scope id suffix within the address.
>>>>
>>>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I tried to convert a qcow2 file to rbd and set the wrong pool.
>>>>> Immediately I stopped the transfer, but the image is stuck locked:
>>>>>
>>>>> Previously, when that happened, I was able to remove the image after
>>>>> 30 secs.
>>>>>
>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
>>>>> There is 1 exclusive lock on this image.
>>>>> Locker ID Address
>>>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>>
>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto
>>>>> 93921602220416" client.1195723
>>>>> rbd: releasing lock failed: (22) Invalid argument
>>>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
>>>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to
>>>>> blacklist client: (22) Invalid argument
>>>>>
>>>>> The image is not in use anywhere!
>>>>>
>>>>> How can I force removal of all locks for this image?
>>>>>
>>>>> Kind regards,
>>>>> Kevin
>>>
>>> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108
>>>
>>> --
>>> Jason
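Jason's description of the parser can be reproduced in a few lines: a reader that accepts only hex digits and colons stops at the '%' of the scope id, which is consistent with the "unable to parse address" error above. A minimal sketch (Python; the function name and behaviour are illustrative, not Ceph's actual parser):

```python
import string

# Characters a hex/colon-only IPv6 parser would accept.
VALID = set(string.hexdigits + ":")

def accepted_prefix(addr: str) -> str:
    """Return the leading run of characters such a parser would
    consume before giving up at the first invalid character."""
    out = []
    for ch in addr:
        if ch not in VALID:
            break
        out.append(ch)
    return "".join(out)

addr = "fe80::219:99ff:fe9e:3a86%eth0"
print(accepted_prefix(addr))  # fe80::219:99ff:fe9e:3a86 -- scope id lost
```

The sketch shows why the full string "[fe80::...%eth0]:0/1200385089" fails validation: everything from '%' onward is rejected, so the stored lock address can never round-trip through the parser.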