Re: [ceph-users] Ceph nautilus upgrade problem
Quoting Paul Emmerich (paul.emmer...@croit.io):
> This also happened sometimes during a Luminous -> Mimic upgrade due to
> a bug in Luminous; however I thought it was fixed on the ceph-mgr side.
> Maybe the fix was (also) required in the OSDs and you are seeing this
> because the running OSDs have that bug?
>
> Anyways, it's harmless and you can ignore it.

Ah, so it's merely "cosmetic" rather than those PGs really being inactive. Because that would *freak me out* if I were doing an upgrade.

Thanks for the clarification.

Gr. Stefan

--
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / i...@bit.nl
Re: [ceph-users] op_w_latency
Thanks for the updated command – much cleaner!

The OSD nodes each have a single 6-core X5650 @ 2.67GHz, 72GB of RAM and around 8 x 10TB HDD OSDs / 4 x 2TB SSD OSDs. CPU usage is around 20% and about 22GB of RAM is available. The 3 MON nodes are the same hardware but with no OSDs. The cluster has around 150 drives and is only doing 500-1000 ops overall. The network is dual 10Gbit using LACP, with a VLAN for private Ceph traffic and untagged for public.

Glen

From: Konstantin Shalygin
Sent: Wednesday, 3 April 2019 11:39 AM
To: Glen Baars
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] op_w_latency

Hello Ceph Users,

I am finding that the write latency across my ceph clusters isn't great and I wanted to see what other people are getting for op_w_latency. Generally I am getting 70-110ms latency. I am using:

ceph --admin-daemon /var/run/ceph/ceph-osd.102.asok perf dump | grep -A3 '"op_w_latency' | grep 'avgtime'

Better like this:

ceph daemon osd.102 perf dump | jq '.osd.op_w_latency.avgtime'

Ram, CPU and network don't seem to be the bottleneck. The drives are behind a Dell H810p raid card with a 1GB writeback cache and battery. I have tried with LSI JBOD cards and haven't found it faster (as you would expect with write cache). The disks through iostat -xyz 1 show 10-30% usage with general service + write latency around 3-4ms. Queue depth is normally less than one. RocksDB write latency is around 0.6ms, read 1-2ms. Usage is RBD backend for Cloudstack.

What is your hardware? Your CPU, RAM, Eth?

k
Re: [ceph-users] op_w_latency
> Hello Ceph Users,
>
> I am finding that the write latency across my ceph clusters isn't great and I wanted to see what other people are getting for op_w_latency. Generally I am getting 70-110ms latency. I am using:
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.102.asok perf dump | grep -A3 '"op_w_latency' | grep 'avgtime'

Better like this:

ceph daemon osd.102 perf dump | jq '.osd.op_w_latency.avgtime'

> Ram, CPU and network don't seem to be the bottleneck. The drives are behind a dell H810p raid card with a 1GB writeback cache and battery. I have tried with LSI JBOD cards and haven't found it faster (as you would expect with write cache). The disks through iostat -xyz 1 show 10-30% usage with general service + write latency around 3-4ms. Queue depth is normally less than one. RocksDB write latency is around 0.6ms, read 1-2ms. Usage is RBD backend for Cloudstack.

What is your hardware? Your CPU, RAM, Eth?

k
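For comparing numbers across a whole node rather than a single OSD, a small loop over the local admin sockets can help. A minimal sketch, assuming the default /var/run/ceph asok paths and that jq is installed:

#!/bin/sh
# Print op_w_latency avgtime for every OSD admin socket on this host.
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok | sed 's/^ceph-osd\.//')
    lat=$(ceph --admin-daemon "$sock" perf dump | jq '.osd.op_w_latency.avgtime')
    echo "osd.$id op_w_latency avgtime: ${lat}s"
done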
Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat
Looks like http://tracker.ceph.com/issues/37399. Which version of ceph-mds do you use?

On Tue, Apr 2, 2019 at 7:47 AM Sergey Malinin wrote:
>
> These steps pretty well correspond to
> http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> Were you able to replay the journal manually with no issues? IIRC,
> "cephfs-journal-tool recover_dentries" would lead to OOM in case of MDS doing
> so, and it has already been discussed on this list.
>
>
> April 2, 2019 1:37 AM, "Pickett, Neale T" wrote:
>
> Here is what I wound up doing to fix this:
>
> - Bring down all MDSes so they stop flapping
> - Back up journal (as seen in previous message)
> - Apply journal manually
> - Reset journal manually
> - Clear session table
> - Clear other tables (not sure I needed to do this)
> - Mark FS down
> - Mark the rank 0 MDS as failed
> - Reset the FS (yes, I really mean it)
> - Restart MDSes
> - Finally get some sleep
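For reference, the steps listed above roughly map onto the commands from the disaster-recovery guide linked by Sergey. This is a hedged sketch only: the filesystem name "cephfs" and rank 0 are placeholders, newer releases want an explicit --rank argument for cephfs-journal-tool, and several of these commands are destructive, so check the docs for your release first.

cephfs-journal-tool journal export backup.bin            # back up the journal
cephfs-journal-tool event recover_dentries summary       # apply journal events to the metadata store
cephfs-journal-tool journal reset                        # reset the journal
cephfs-table-tool all reset session                      # clear the session table
ceph fs set cephfs cluster_down true                     # mark the FS down (syntax differs on newer releases)
ceph mds fail 0                                          # mark the rank 0 MDS as failed
ceph fs reset cephfs --yes-i-really-mean-it              # reset the FS
# then restart the MDS daemons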
Re: [ceph-users] Erasure Coding failure domain (again)
On Tue, 2 Apr 2019 19:04:28 +0900 Hector Martin wrote:

> On 02/04/2019 18.27, Christian Balzer wrote:
> > I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
> > pool with 1024 PGs.
>
> (20 choose 2) is 190, so you're never going to have more than that many
> unique sets of OSDs.
>
And this is why one shouldn't send mails when in a rush, w/o fully groking the math one was just given. Thanks for setting me straight.

> I just looked at the OSD distribution for a replica 3 pool across 48
> OSDs with 4096 PGs that I have and the result is reasonable. There are
> 3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this
> is a random process, due to the birthday paradox, some duplicates are
> expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs
> having 3782 unique choices seems to pass the gut feeling test. Too lazy
> to do the math closed form, but here's a quick simulation:
>
> >>> len(set(random.randrange(17296) for i in range(4096)))
> 3671
>
> So I'm actually slightly ahead.
>
> At the numbers in my previous example (1500 OSDs, 50k pool PGs),
> statistically you should get something like ~3 collisions on average, so
> negligible.
>
Sounds promising.

> > Another thing to look at here is of course critical period and disk
> > failure probabilities, these guys explain the logic behind their
> > calculator, would be delighted if you could have a peek and comment.
> >
> > https://www.memset.com/support/resources/raid-calculator/
>
> I'll take a look tonight :)
>
Thanks. A look at the Backblaze disk failure rates (picking the worst ones) gives a good insight into real life probabilities, too.
https://www.backblaze.com/blog/hard-drive-stats-for-2018/

If we go with 2%/year, that's an average failure every 12 days (for the 1500-OSD example: 1500 x 2%/year = ~30 failures/year, i.e. one roughly every 12 days).

Aside from how likely the actual failure rate is, another concern of course is extended periods of the cluster being unhealthy; with certain versions there was that "mon map will grow indefinitely" issue, and other more subtle ones might lurk still.

Christian

> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
Re: [ceph-users] Ceph nautilus upgrade problem
This also happened sometimes during a Luminous -> Mimic upgrade due to a bug in Luminous; however I thought it was fixed on the ceph-mgr side. Maybe the fix was (also) required in the OSDs and you are seeing this because the running OSDs have that bug? Anyways, it's harmless and you can ignore it. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Tue, Apr 2, 2019 at 7:49 PM Jan-Willem Michels wrote: > > Op 2-4-2019 om 12:16 schreef Stefan Kooman: > > Quoting Stadsnet (jwil...@stads.net): > >> On 26-3-2019 16:39, Ashley Merrick wrote: > >>> Have you upgraded any OSD's? > >> > >> No didn't go through with the osd's > > Just checking here: are your sure all PGs have been scrubbed while > > running Luminous? As the release notes [1] mention this: > > > > "If you are unsure whether or not your Luminous cluster has completed a > > full scrub of all PGs, you can check your clusters state by running: > > > > # ceph osd dump | grep ^flags > > > > In order to be able to proceed to Nautilus, your OSD map must include > > the recovery_deletes and purged_snapdirs flags." > > Yes I did check that. > > No everything went fine, exactly as Ashley predicted > > "On a test cluster I saw the same and as I upgraded / restarted the > OSD's the PG's started to show online till it was 100%." > > So I upgraded the first osd, and exactly that amount of percentage of > OSD's became active. > And every server the same percentage was added. > And then finaly, with the last one I got 100% active. > > So went without problems. > But it looked a bit uggly that's why I asked. > > And the new Nautilus versions is really a big plus in almost every way. > > Sorry for not getting back how it went. I was not sure if I should > bother the mailing list. > > Thanks for your time. > > > > > > Gr. Stefan > > > > [1]: > > http://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous > > > > P.s. I expect most users upgrade to Mimic first, then go to Nautilus. > > It might be a better tested upgrade path ... > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph nautilus upgrade problem
On 2-4-2019 at 12:16, Stefan Kooman wrote:
> Quoting Stadsnet (jwil...@stads.net):
>> On 26-3-2019 16:39, Ashley Merrick wrote:
>>> Have you upgraded any OSD's?
>>
>> No didn't go through with the osd's
>
> Just checking here: are you sure all PGs have been scrubbed while
> running Luminous? As the release notes [1] mention this:
>
> "If you are unsure whether or not your Luminous cluster has completed a
> full scrub of all PGs, you can check your clusters state by running:
>
> # ceph osd dump | grep ^flags
>
> In order to be able to proceed to Nautilus, your OSD map must include
> the recovery_deletes and purged_snapdirs flags."

Yes, I did check that.

Everything went fine, exactly as Ashley predicted:

"On a test cluster I saw the same and as I upgraded / restarted the OSD's the PG's started to show online till it was 100%."

So I upgraded the first OSD, and exactly that percentage of OSDs became active. With every server the same percentage was added, and with the last one I finally got to 100% active.

So it went without problems, but it looked a bit ugly - that's why I asked.

And the new Nautilus version is really a big plus in almost every way.

Sorry for not getting back about how it went; I was not sure if I should bother the mailing list.

Thanks for your time.

> Gr. Stefan
>
> [1]:
> http://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous
>
> P.s. I expect most users upgrade to Mimic first, then go to Nautilus.
> It might be a better tested upgrade path ...
Re: [ceph-users] inline_data (was: CephFS and many small files)
On Tue, Apr 2, 2019 at 9:10 PM Paul Emmerich wrote: > > On Tue, Apr 2, 2019 at 3:05 PM Yan, Zheng wrote: > > > > On Tue, Apr 2, 2019 at 8:23 PM Clausen, Jörn wrote: > > > > > > Hi! > > > > > > Am 29.03.2019 um 23:56 schrieb Paul Emmerich: > > > > There's also some metadata overhead etc. You might want to consider > > > > enabling inline data in cephfs to handle small files in a > > > > store-efficient way (note that this feature is officially marked as > > > > experimental, though). > > > > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data > > > > > > Is there something missing from the documentation? I have turned on this > > > feature: > > > > > > > I don't use this feature. We don't have plan to mark this feature > > stable. (probably we will remove this feature in the furthure). > > We also don't use this feature in any of our production clusters > (because it's marked experimental). > > But it seems like a really useful feature and I know of at least one > real-world production cluster using this with great success... > So why remove it? > mds needs to serve both data/metadata requests. It only suites for small amount of data. > > Paul > > > > > Yan, Zheng > > > > > > > > > $ ceph fs dump | grep inline_data > > > dumped fsmap epoch 1224 > > > inline_data enabled > > > > > > I have reduced the size of the bonnie-generated files to 1 byte. But > > > this is the situation halfway into the test: (output slightly shortened) > > > > > > $ rados df > > > POOL_NAME USED OBJECTS CLONES COPIES > > > fs-data 3.2 MiB 3390041 0 10170123 > > > fs-metadata 772 MiB2249 0 6747 > > > > > > total_objects3392290 > > > total_used 643 GiB > > > total_avail 957 GiB > > > total_space 1.6 TiB > > > > > > i.e. bonnie has created a little over 3 million files, for which the > > > same number of objects was created in the data pool. So the raw usage is > > > again at more than 500 GB. > > > > > > If the data was inlined, I would expect far less objects in the data > > > pool - actually none at all - and maybe some more usage in the metadata > > > pool. > > > > > > Do I have to restart any daemons after turning on inline_data? Am I > > > missing anything else here? > > > > > > For the record: > > > > > > $ ceph versions > > > { > > > "mon": { > > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > > nautilus (stable)": 3 > > > }, > > > "mgr": { > > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > > nautilus (stable)": 3 > > > }, > > > "osd": { > > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > > nautilus (stable)": 16 > > > }, > > > "mds": { > > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > > nautilus (stable)": 2 > > > }, > > > "overall": { > > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > > nautilus (stable)": 24 > > > } > > > } > > > > > > -- > > > Jörn Clausen > > > Daten- und Rechenzentrum > > > GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel > > > Düsternbrookerweg 20 > > > 24105 Kiel > > > > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inline_data (was: CephFS and many small files)
On 02/04/2019 15.05, Yan, Zheng wrote:
> I don't use this feature. We don't have plan to mark this feature
> stable. (probably we will remove this feature in the future).

Oh no! We have activated inline_data since our cluster does have lots of small files (but also big ones), and performance increased significantly, especially when deleting those small files on CephFS.

I hope inline_data is not removed, but improved (or at least marked stable) instead! Of course, if CephFS can handle small files with the same performance after inline_data was removed, that would also be okay. But stepping back would hurt :)

-- Jonas
Re: [ceph-users] CephFS and many small files
Hello,

I haven't had any issues either with a 4k allocation size in a cluster holding 358M objects for 116TB (237TB raw) and 2.264B chunks/replicas. This is an average of 324k per object and 12.6M chunks/replicas per OSD, with RocksDB sizes going from 12.1GB to 21.14GB depending on how many PGs the OSDs have. RocksDB sizes will come down as we add more OSDs to the cluster by the end of this year.

We've seen a huge latency improvement by moving OSDs to Bluestore. Filestore (XFS) wouldn't operate well anymore with over 10M files, even with a negligible fragmentation factor and 8/40 split/merge thresholds.

Frédéric.

On 01/04/2019 at 14:47, Sergey Malinin wrote:
> I haven't had any issues with 4k allocation size in a cluster holding 189M files.
>
> April 1, 2019 2:04 PM, "Paul Emmerich" wrote:
>> I'm not sure about the real-world impacts of a lower min alloc size or
>> the rationale behind the default values for HDDs (64kb) and SSDs (16kb).
>> Paul
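For anyone wanting to reproduce the 4k allocation size mentioned above: the value is baked in when an OSD is created, so it has to be in ceph.conf before deployment, and changing it later has no effect on existing OSDs. A sketch of the relevant options:

[osd]
# Applied at OSD creation (mkfs) time only; existing OSDs keep their old value.
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096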
Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory
Sorry -- you need the "" as part of that command. My bad, I only read this from the help page ignoring the (and forgot the pool name): -a [ --all ] list snapshots from all namespaces I figured this would list all existing snapshots, similar to the "rbd -p ls --long" command. Thanks for the clarification. Eugen Zitat von Jason Dillaman : On Tue, Apr 2, 2019 at 8:42 AM Eugen Block wrote: Hi, > If you run "rbd snap ls --all", you should see a snapshot in > the "trash" namespace. I just tried the command "rbd snap ls --all" on a lab cluster (nautilus) and get this error: ceph-2:~ # rbd snap ls --all rbd: image name was not specified Sorry -- you need the "" as part of that command. Are there any requirements I haven't noticed? This lab cluster was upgraded from Mimic a couple of weeks ago. ceph-2:~ # ceph version ceph version 14.1.0-559-gf1a72cff25 (f1a72cff2522833d16ff057ed43eeaddfc17ea8a) nautilus (dev) Regards, Eugen Zitat von Jason Dillaman : > On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich > wrote: >> >> Hi, >> >> on one of my clusters, I'm getting error message which is getting >> me a bit nervous.. while listing contents of a pool I'm getting >> error for one of images: >> >> [root@node1 ~]# rbd ls -l nvme > /dev/null >> rbd: error processing image xxx: (2) No such file or directory >> >> [root@node1 ~]# rbd info nvme/xxx >> rbd image 'xxx': >> size 60 GiB in 15360 objects >> order 22 (4 MiB objects) >> id: 132773d6deb56 >> block_name_prefix: rbd_data.132773d6deb56 >> format: 2 >> features: layering, operations >> op_features: snap-trash >> flags: >> create_timestamp: Wed Aug 29 12:25:13 2018 >> >> volume contains production data and seems to be working correctly (it's used >> by VM) >> >> is this something to worry about? What is snap-trash feature? >> wasn't able to google >> much about it.. > > This implies that you are (or were) using transparent image clones and > that you deleted a snapshot that had one or more child images attached > to it. If you run "rbd snap ls --all", you should see a snapshot in > the "trash" namespace. You can also list its child images by running > "rbd children --snap-id ". > > There definitely is an issue w/ the "rbd ls --long" command in that > when it attempts to list all snapshots in the image, it is incorrectly > using the snapshot's name instead of it's ID. I've opened a tracker > ticket to get the bug fixed [1]. It was fixed in Nautilus but it > wasn't flagged for backport to Mimic. > >> I'm running ceph 13.2.4 on centos 7. >> >> I'd be gratefull any help >> >> BR >> >> nik >> >> >> -- >> - >> Ing. Nikola CIPRICH >> LinuxBox.cz, s.r.o. >> 28.rijna 168, 709 00 Ostrava >> >> tel.: +420 591 166 214 >> fax:+420 596 621 273 >> mobil: +420 777 093 799 >> www.linuxbox.cz >> >> mobil servis: +420 737 238 656 >> email servis: ser...@linuxbox.cz >> - >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > [1] http://tracker.ceph.com/issues/39081 > > -- > Jason > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inline_data (was: CephFS and many small files)
On Tue, Apr 2, 2019 at 3:05 PM Yan, Zheng wrote: > > On Tue, Apr 2, 2019 at 8:23 PM Clausen, Jörn wrote: > > > > Hi! > > > > Am 29.03.2019 um 23:56 schrieb Paul Emmerich: > > > There's also some metadata overhead etc. You might want to consider > > > enabling inline data in cephfs to handle small files in a > > > store-efficient way (note that this feature is officially marked as > > > experimental, though). > > > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data > > > > Is there something missing from the documentation? I have turned on this > > feature: > > > > I don't use this feature. We don't have plan to mark this feature > stable. (probably we will remove this feature in the furthure). We also don't use this feature in any of our production clusters (because it's marked experimental). But it seems like a really useful feature and I know of at least one real-world production cluster using this with great success... So why remove it? Paul > > Yan, Zheng > > > > > $ ceph fs dump | grep inline_data > > dumped fsmap epoch 1224 > > inline_data enabled > > > > I have reduced the size of the bonnie-generated files to 1 byte. But > > this is the situation halfway into the test: (output slightly shortened) > > > > $ rados df > > POOL_NAME USED OBJECTS CLONES COPIES > > fs-data 3.2 MiB 3390041 0 10170123 > > fs-metadata 772 MiB2249 0 6747 > > > > total_objects3392290 > > total_used 643 GiB > > total_avail 957 GiB > > total_space 1.6 TiB > > > > i.e. bonnie has created a little over 3 million files, for which the > > same number of objects was created in the data pool. So the raw usage is > > again at more than 500 GB. > > > > If the data was inlined, I would expect far less objects in the data > > pool - actually none at all - and maybe some more usage in the metadata > > pool. > > > > Do I have to restart any daemons after turning on inline_data? Am I > > missing anything else here? > > > > For the record: > > > > $ ceph versions > > { > > "mon": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 3 > > }, > > "mgr": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 3 > > }, > > "osd": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 16 > > }, > > "mds": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 2 > > }, > > "overall": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 24 > > } > > } > > > > -- > > Jörn Clausen > > Daten- und Rechenzentrum > > GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel > > Düsternbrookerweg 20 > > 24105 Kiel > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inline_data (was: CephFS and many small files)
On Tue, Apr 2, 2019 at 9:05 PM Yan, Zheng wrote: > > On Tue, Apr 2, 2019 at 8:23 PM Clausen, Jörn wrote: > > > > Hi! > > > > Am 29.03.2019 um 23:56 schrieb Paul Emmerich: > > > There's also some metadata overhead etc. You might want to consider > > > enabling inline data in cephfs to handle small files in a > > > store-efficient way (note that this feature is officially marked as > > > experimental, though). > > > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data > > > > Is there something missing from the documentation? I have turned on this > > feature: > > > > I don't use this feature. We don't have plan to mark this feature > stable. (probably we will remove this feature in the furthure). > I mean "don't use this feature" > Yan, Zheng > > > > > $ ceph fs dump | grep inline_data > > dumped fsmap epoch 1224 > > inline_data enabled > > > > I have reduced the size of the bonnie-generated files to 1 byte. But > > this is the situation halfway into the test: (output slightly shortened) > > > > $ rados df > > POOL_NAME USED OBJECTS CLONES COPIES > > fs-data 3.2 MiB 3390041 0 10170123 > > fs-metadata 772 MiB2249 0 6747 > > > > total_objects3392290 > > total_used 643 GiB > > total_avail 957 GiB > > total_space 1.6 TiB > > > > i.e. bonnie has created a little over 3 million files, for which the > > same number of objects was created in the data pool. So the raw usage is > > again at more than 500 GB. > > > > If the data was inlined, I would expect far less objects in the data > > pool - actually none at all - and maybe some more usage in the metadata > > pool. > > > > Do I have to restart any daemons after turning on inline_data? Am I > > missing anything else here? > > > > For the record: > > > > $ ceph versions > > { > > "mon": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 3 > > }, > > "mgr": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 3 > > }, > > "osd": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 16 > > }, > > "mds": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 2 > > }, > > "overall": { > > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > > nautilus (stable)": 24 > > } > > } > > > > -- > > Jörn Clausen > > Daten- und Rechenzentrum > > GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel > > Düsternbrookerweg 20 > > 24105 Kiel > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inline_data (was: CephFS and many small files)
On Tue, Apr 2, 2019 at 8:23 PM Clausen, Jörn wrote: > > Hi! > > Am 29.03.2019 um 23:56 schrieb Paul Emmerich: > > There's also some metadata overhead etc. You might want to consider > > enabling inline data in cephfs to handle small files in a > > store-efficient way (note that this feature is officially marked as > > experimental, though). > > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data > > Is there something missing from the documentation? I have turned on this > feature: > I don't use this feature. We don't have plan to mark this feature stable. (probably we will remove this feature in the furthure). Yan, Zheng > $ ceph fs dump | grep inline_data > dumped fsmap epoch 1224 > inline_data enabled > > I have reduced the size of the bonnie-generated files to 1 byte. But > this is the situation halfway into the test: (output slightly shortened) > > $ rados df > POOL_NAME USED OBJECTS CLONES COPIES > fs-data 3.2 MiB 3390041 0 10170123 > fs-metadata 772 MiB2249 0 6747 > > total_objects3392290 > total_used 643 GiB > total_avail 957 GiB > total_space 1.6 TiB > > i.e. bonnie has created a little over 3 million files, for which the > same number of objects was created in the data pool. So the raw usage is > again at more than 500 GB. > > If the data was inlined, I would expect far less objects in the data > pool - actually none at all - and maybe some more usage in the metadata > pool. > > Do I have to restart any daemons after turning on inline_data? Am I > missing anything else here? > > For the record: > > $ ceph versions > { > "mon": { > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > nautilus (stable)": 3 > }, > "mgr": { > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > nautilus (stable)": 3 > }, > "osd": { > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > nautilus (stable)": 16 > }, > "mds": { > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > nautilus (stable)": 2 > }, > "overall": { > "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) > nautilus (stable)": 24 > } > } > > -- > Jörn Clausen > Daten- und Rechenzentrum > GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel > Düsternbrookerweg 20 > 24105 Kiel > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory
On Tue, Apr 2, 2019 at 8:42 AM Eugen Block wrote: > > Hi, > > > If you run "rbd snap ls --all", you should see a snapshot in > > the "trash" namespace. > > I just tried the command "rbd snap ls --all" on a lab cluster > (nautilus) and get this error: > > ceph-2:~ # rbd snap ls --all > rbd: image name was not specified Sorry -- you need the "" as part of that command. > Are there any requirements I haven't noticed? This lab cluster was > upgraded from Mimic a couple of weeks ago. > > ceph-2:~ # ceph version > ceph version 14.1.0-559-gf1a72cff25 > (f1a72cff2522833d16ff057ed43eeaddfc17ea8a) nautilus (dev) > > Regards, > Eugen > > > Zitat von Jason Dillaman : > > > On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich > > wrote: > >> > >> Hi, > >> > >> on one of my clusters, I'm getting error message which is getting > >> me a bit nervous.. while listing contents of a pool I'm getting > >> error for one of images: > >> > >> [root@node1 ~]# rbd ls -l nvme > /dev/null > >> rbd: error processing image xxx: (2) No such file or directory > >> > >> [root@node1 ~]# rbd info nvme/xxx > >> rbd image 'xxx': > >> size 60 GiB in 15360 objects > >> order 22 (4 MiB objects) > >> id: 132773d6deb56 > >> block_name_prefix: rbd_data.132773d6deb56 > >> format: 2 > >> features: layering, operations > >> op_features: snap-trash > >> flags: > >> create_timestamp: Wed Aug 29 12:25:13 2018 > >> > >> volume contains production data and seems to be working correctly (it's > >> used > >> by VM) > >> > >> is this something to worry about? What is snap-trash feature? > >> wasn't able to google > >> much about it.. > > > > This implies that you are (or were) using transparent image clones and > > that you deleted a snapshot that had one or more child images attached > > to it. If you run "rbd snap ls --all", you should see a snapshot in > > the "trash" namespace. You can also list its child images by running > > "rbd children --snap-id ". > > > > There definitely is an issue w/ the "rbd ls --long" command in that > > when it attempts to list all snapshots in the image, it is incorrectly > > using the snapshot's name instead of it's ID. I've opened a tracker > > ticket to get the bug fixed [1]. It was fixed in Nautilus but it > > wasn't flagged for backport to Mimic. > > > >> I'm running ceph 13.2.4 on centos 7. > >> > >> I'd be gratefull any help > >> > >> BR > >> > >> nik > >> > >> > >> -- > >> - > >> Ing. Nikola CIPRICH > >> LinuxBox.cz, s.r.o. > >> 28.rijna 168, 709 00 Ostrava > >> > >> tel.: +420 591 166 214 > >> fax:+420 596 621 273 > >> mobil: +420 777 093 799 > >> www.linuxbox.cz > >> > >> mobil servis: +420 737 238 656 > >> email servis: ser...@linuxbox.cz > >> - > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > [1] http://tracker.ceph.com/issues/39081 > > > > -- > > Jason > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory
Hi, If you run "rbd snap ls --all", you should see a snapshot in the "trash" namespace. I just tried the command "rbd snap ls --all" on a lab cluster (nautilus) and get this error: ceph-2:~ # rbd snap ls --all rbd: image name was not specified Are there any requirements I haven't noticed? This lab cluster was upgraded from Mimic a couple of weeks ago. ceph-2:~ # ceph version ceph version 14.1.0-559-gf1a72cff25 (f1a72cff2522833d16ff057ed43eeaddfc17ea8a) nautilus (dev) Regards, Eugen Zitat von Jason Dillaman : On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich wrote: Hi, on one of my clusters, I'm getting error message which is getting me a bit nervous.. while listing contents of a pool I'm getting error for one of images: [root@node1 ~]# rbd ls -l nvme > /dev/null rbd: error processing image xxx: (2) No such file or directory [root@node1 ~]# rbd info nvme/xxx rbd image 'xxx': size 60 GiB in 15360 objects order 22 (4 MiB objects) id: 132773d6deb56 block_name_prefix: rbd_data.132773d6deb56 format: 2 features: layering, operations op_features: snap-trash flags: create_timestamp: Wed Aug 29 12:25:13 2018 volume contains production data and seems to be working correctly (it's used by VM) is this something to worry about? What is snap-trash feature? wasn't able to google much about it.. This implies that you are (or were) using transparent image clones and that you deleted a snapshot that had one or more child images attached to it. If you run "rbd snap ls --all", you should see a snapshot in the "trash" namespace. You can also list its child images by running "rbd children --snap-id ". There definitely is an issue w/ the "rbd ls --long" command in that when it attempts to list all snapshots in the image, it is incorrectly using the snapshot's name instead of it's ID. I've opened a tracker ticket to get the bug fixed [1]. It was fixed in Nautilus but it wasn't flagged for backport to Mimic. I'm running ceph 13.2.4 on centos 7. I'd be gratefull any help BR nik -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1] http://tracker.ceph.com/issues/39081 -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory
On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich wrote:
>
> Hi,
>
> on one of my clusters, I'm getting error message which is getting
> me a bit nervous.. while listing contents of a pool I'm getting
> error for one of images:
>
> [root@node1 ~]# rbd ls -l nvme > /dev/null
> rbd: error processing image xxx: (2) No such file or directory
>
> [root@node1 ~]# rbd info nvme/xxx
> rbd image 'xxx':
>         size 60 GiB in 15360 objects
>         order 22 (4 MiB objects)
>         id: 132773d6deb56
>         block_name_prefix: rbd_data.132773d6deb56
>         format: 2
>         features: layering, operations
>         op_features: snap-trash
>         flags:
>         create_timestamp: Wed Aug 29 12:25:13 2018
>
> volume contains production data and seems to be working correctly (it's used
> by VM)
>
> is this something to worry about? What is snap-trash feature?
> wasn't able to google much about it..

This implies that you are (or were) using transparent image clones and that you deleted a snapshot that had one or more child images attached to it. If you run "rbd snap ls <image-spec> --all", you should see a snapshot in the "trash" namespace. You can also list its child images by running "rbd children <image-spec> --snap-id <snap-id>".

There definitely is an issue w/ the "rbd ls --long" command in that when it attempts to list all snapshots in the image, it is incorrectly using the snapshot's name instead of its ID. I've opened a tracker ticket to get the bug fixed [1]. It was fixed in Nautilus but it wasn't flagged for backport to Mimic.

> I'm running ceph 13.2.4 on centos 7.
>
> I'd be grateful for any help
>
> BR
>
> nik
>
> --
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.: +420 591 166 214
> fax: +420 596 621 273
> mobil: +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz

[1] http://tracker.ceph.com/issues/39081

--
Jason
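To make that concrete with the pool/image names from the report above (the snapshot ID is a placeholder that has to be read from the first command's output):

# List all snapshots of the image, including the hidden "trash" namespace.
rbd snap ls nvme/xxx --all

# List the clones still referencing the trashed snapshot; 42 is a placeholder
# for the SNAPID shown by the previous command.
rbd children nvme/xxx --snap-id 42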
Re: [ceph-users] inline_data (was: CephFS and many small files)
Hi!

On 29.03.2019 at 23:56, Paul Emmerich wrote:
> There's also some metadata overhead etc. You might want to consider
> enabling inline data in cephfs to handle small files in a
> store-efficient way (note that this feature is officially marked as
> experimental, though).
> http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data

Is there something missing from the documentation? I have turned on this feature:

$ ceph fs dump | grep inline_data
dumped fsmap epoch 1224
inline_data enabled

I have reduced the size of the bonnie-generated files to 1 byte. But this is the situation halfway into the test: (output slightly shortened)

$ rados df
POOL_NAME       USED     OBJECTS  CLONES    COPIES
fs-data      3.2 MiB     3390041       0  10170123
fs-metadata  772 MiB        2249       0      6747

total_objects    3392290
total_used       643 GiB
total_avail      957 GiB
total_space      1.6 TiB

i.e. bonnie has created a little over 3 million files, for which the same number of objects was created in the data pool. So the raw usage is again at more than 500 GB.

If the data was inlined, I would expect far fewer objects in the data pool - actually none at all - and maybe some more usage in the metadata pool.

Do I have to restart any daemons after turning on inline_data? Am I missing anything else here?

For the record:

$ ceph versions
{
    "mon": {
        "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 16
    },
    "mds": {
        "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 24
    }
}

--
Jörn Clausen
Daten- und Rechenzentrum
GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
Düsternbrookerweg 20
24105 Kiel
Re: [ceph-users] Ceph nautilus upgrade problem
Quoting Stadsnet (jwil...@stads.net):
> On 26-3-2019 16:39, Ashley Merrick wrote:
> > Have you upgraded any OSD's?
>
> No didn't go through with the osd's

Just checking here: are you sure all PGs have been scrubbed while running Luminous? As the release notes [1] mention this:

"If you are unsure whether or not your Luminous cluster has completed a full scrub of all PGs, you can check your clusters state by running:

# ceph osd dump | grep ^flags

In order to be able to proceed to Nautilus, your OSD map must include the recovery_deletes and purged_snapdirs flags."

Gr. Stefan

[1]: http://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous

P.s. I expect most users upgrade to Mimic first, then go to Nautilus. It might be a better tested upgrade path ...

--
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / i...@bit.nl
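A small pre-flight check along the lines of those release notes could look like this (a sketch; it only checks the two flags quoted above):

#!/bin/sh
# Fail if the OSD map is missing either flag required for the Nautilus upgrade.
flags=$(ceph osd dump | grep ^flags)
for f in recovery_deletes purged_snapdirs; do
    if ! echo "$flags" | grep -q "$f"; then
        echo "flag '$f' missing - complete a full scrub of all PGs on Luminous first"
        exit 1
    fi
done
echo "OSD map flags look OK for the Nautilus upgrade"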
Re: [ceph-users] Erasure Coding failure domain (again)
On 02/04/2019 18.27, Christian Balzer wrote:
> I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
> pool with 1024 PGs.

(20 choose 2) is 190, so you're never going to have more than that many unique sets of OSDs.

I just looked at the OSD distribution for a replica 3 pool across 48 OSDs with 4096 PGs that I have and the result is reasonable. There are 3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this is a random process, due to the birthday paradox, some duplicates are expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs having 3782 unique choices seems to pass the gut feeling test. Too lazy to do the math closed form, but here's a quick simulation:

>>> len(set(random.randrange(17296) for i in range(4096)))
3671

So I'm actually slightly ahead.

At the numbers in my previous example (1500 OSDs, 50k pool PGs), statistically you should get something like ~3 collisions on average, so negligible.

> Another thing to look at here is of course critical period and disk
> failure probabilities, these guys explain the logic behind their
> calculator, would be delighted if you could have a peek and comment.
>
> https://www.memset.com/support/resources/raid-calculator/

I'll take a look tonight :)

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
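For completeness, the closed form that was skipped above is short. A sketch, assuming (like the simulation) that PG sets are drawn uniformly and independently, which real CRUSH placement only approximates:

#!/usr/bin/env python3
# Expected number of distinct sets when k PGs each pick one of n possible
# OSD sets uniformly at random: n * (1 - (1 - 1/n)**k).
def expected_unique(n, k):
    return n * (1 - (1 - 1 / n) ** k)

# 48 OSDs, replica 3, 4096 PGs: n = C(48, 3) = 17296
print(round(expected_unique(17296, 4096)))   # ~3647, close to the simulated 3671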
Re: [ceph-users] Moving pools between cluster
Quoting Burkhard Linke (burkhard.li...@computational.bio.uni-giessen.de):
> Hi,
> Images:
>
> Straight-forward attempt would be exporting all images with qemu-img from
> one cluster, and uploading them again on the second cluster. But this will
> break snapshots, protections etc.

You can use rbd-mirror [1] (RBD mirroring requires the Ceph Jewel release or later). You do need to be able to set the "journaling" and "exclusive-lock" features on the rbd images (rbd feature enable {pool-name}/{image-name} --image-feature exclusive-lock,journaling). This will preserve snapshots, etc.

When everything is mirrored you can shut down the VMs (or go one by one) and promote the image(s) on the new cluster, and have the VM(s) use the new cluster for their storage.

Note: you can also mirror a whole pool instead of mirroring on the image level.

Gr. Stefan

[1]: http://docs.ceph.com/docs/mimic/rbd/rbd-mirroring/

--
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / i...@bit.nl
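To make the suggested path concrete, a rough command sketch; the pool/image names and the per-image mirroring mode are only examples, and a running rbd-mirror daemon plus peer configuration (see the link above) are prerequisites:

# On the source cluster ('volumes' and 'volume-1234' are placeholder names):
rbd feature enable volumes/volume-1234 exclusive-lock journaling
rbd mirror pool enable volumes image
rbd mirror image enable volumes/volume-1234

# Once the destination's rbd-mirror reports the image as up+replaying,
# do an orderly failover:
rbd mirror image demote volumes/volume-1234      # on the old cluster
rbd mirror image promote volumes/volume-1234     # on the new cluster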
Re: [ceph-users] Erasure Coding failure domain (again)
Hello Hector,

Firstly I'm so happy somebody actually replied.

On Tue, 2 Apr 2019 16:43:10 +0900 Hector Martin wrote:

> On 31/03/2019 17.56, Christian Balzer wrote:
> > Am I correct that unlike with replication there isn't a maximum size
> > of the critical path OSDs?
>
> As far as I know, the math for calculating the probability of data loss
> wrt placement groups is the same for EC and for replication. Replication
> to n copies should be equivalent to EC with k=1 and m=(n-1).
>
> > Meaning that with replication x3 and typical values of 100 PGs per OSD at
> > most 300 OSDs form a set out of which 3 OSDs need to fail for data loss.
> > The statistical likelihood for that based on some assumptions
> > is significant, but not nightmarishly so.
> > A cluster with 1500 OSDs in total is thus as susceptible as one with just
> > 300.
> > Meaning that 3 disk losses in the big cluster don't necessarily mean data
> > loss at all.
>
> Someone might correct me on this, but here's my take on the math.
>
> If you have 100 PGs per OSD, 1500 OSDs, and replication 3, you have:
>
> 1500 * 100 / 3 = 50000 pool PGs, and thus 50000 (hopefully) different
> 3-sets of OSDs.
>
I think your math is essentially correct, but so seems to be the "hopefully" part.

I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2 pool with 1024 PGs, which should give us 1000 sets of OSDs to choose from given your formula.

Just looking at OSD 0 and the first 6 other OSDs out of that list of 1024 PGs gives us this:

---
UP_PRIMARY ACTING
[0,1] 0
[0,2] 0
[0,2] 0
[0,2] 0
[0,3] 0
[0,3] 0
[0,3] 0
[0,3] 0
[0,3] 0
[0,5] 0
[0,5] 0
[0,5] 0
[0,5] 0
[0,5] 0
[0,6] 0
[0,6] 0
[1,0] 1
[1,0] 1
[1,0] 1
[2,0] 2
[2,0] 2
[2,0] 2
[3,0] 3
[3,0] 3
[5,0] 5
[5,0] 5
[5,0] 5
[6,0] 6
[6,0] 6
[6,0] 6
---

So this looks significantly worse than the theoretical set of choices.

Another thing to look at here is of course critical period and disk failure probabilities; these guys explain the logic behind their calculator, would be delighted if you could have a peek and comment.

https://www.memset.com/support/resources/raid-calculator/

Thanks again for the feedback!

Christian

> (1500 choose 3) = 561375500 possible sets of 3 OSDs
>
> Therefore if you lose 3 random OSDs, your chance of (any) data loss is
> 50000/561375500 = ~0.009%. (and if you *do* get unlucky and hit the
> wrong set of 3 OSDs, you can expect to lose 1/50000 = ~0.002% of your data)
>
> > However it feels that with EC all OSDs can essentially be in the same set
> > and thus having 6 out of 1500 OSDs fail in a 10+5 EC pool with 100 PGs per
> > OSD would affect every last object in that cluster, not just a subset.
>
> The math should work essentially the same way:
>
> 1500 * 100 / 15 = 10000 15-sets of OSDs
>
> (1500 choose 15) = 3.1215495e+35 possible 15-sets of OSDs
>
> Now if 6 OSDs fail, that will affect many potential 15-sets of OSDs
> chosen with the remaining OSDs in the cluster:
>
> ((1500 - 6) choose 9) = 9.9748762e+22
>
> Putting it together, the chance of any data loss from a simultaneous
> loss of 6 random OSDs:
>
> 10000 / 3.1215495e+35 * 9.9748762e+22 = ~3.2e-9, i.e. about 0.0032 ppm
>
> And if you *do* get unlucky you can expect to lose 1/10000 = ~0.01% of
> your data.
>
> So your chance of data loss is much smaller with such a wide EC
> encoding, but if you do lose a PG you'll lose more data because there
> are fewer PGs.
>
> Feedback on my math welcome.
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
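To repeat the check above across a whole cluster, counting distinct acting sets over all PGs gives a quick feel for how much overlap CRUSH actually produced. A sketch using jq; the JSON layout of "ceph pg dump" differs between releases (the list may live under .pg_stats on newer ones), so the filter may need adjusting:

ceph pg dump pgs_brief -f json 2>/dev/null \
  | jq -r '.[] | .acting | sort | map(tostring) | join(",")' \
  | sort -u | wc -l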
[ceph-users] Moving pools between cluster
Hi,

we are about to set up a new Ceph cluster for our Openstack cloud. Ceph is used for images, volumes and object storage. I'm unsure how to handle these cases and how to move the data correctly.

Object storage: I consider this the easiest case, since RGW itself provides the necessary means to synchronize clusters. But the pools are rather small (~5 TB for buckets), so maybe there's an easier way? How does RGW refer to the various pools internally? By name? By ID? (ID would be a problem, since a simple pool copy won't work in this case)

Images: The straight-forward attempt would be exporting all images with qemu-img from one cluster, and uploading them again on the second cluster. But this will break snapshots, protections etc.

Volumes: This is the most difficult case. The pool is the largest one affected (~60 TB), and many volumes are boot-from-volume instances acting as COW copies of an image. I would prefer not to flatten these images and thus generate a lot more data.

There are other pools we use outside of Openstack, so adding the new hosts to the existing cluster, moving the data by crush rules and splitting the cluster afterwards is not an option. Keeping all hosts in a single cluster and separating the pools logically within crush is also undesired for administrative reasons (but will be the last resort if necessary).

Any comments on this? How did you move individual pools to a new cluster in the past?

Regards,
Burkhard
[ceph-users] rbd: error processing image xxx (2) No such file or directory
Hi,

on one of my clusters, I'm getting an error message which is making me a bit nervous. While listing the contents of a pool I'm getting an error for one of the images:

[root@node1 ~]# rbd ls -l nvme > /dev/null
rbd: error processing image xxx: (2) No such file or directory

[root@node1 ~]# rbd info nvme/xxx
rbd image 'xxx':
        size 60 GiB in 15360 objects
        order 22 (4 MiB objects)
        id: 132773d6deb56
        block_name_prefix: rbd_data.132773d6deb56
        format: 2
        features: layering, operations
        op_features: snap-trash
        flags:
        create_timestamp: Wed Aug 29 12:25:13 2018

The volume contains production data and seems to be working correctly (it's used by a VM).

Is this something to worry about? What is the snap-trash feature? I wasn't able to google much about it.

I'm running ceph 13.2.4 on centos 7.

I'd be grateful for any help.

BR

nik

--
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
Re: [ceph-users] Erasure Coding failure domain (again)
On 31/03/2019 17.56, Christian Balzer wrote:
> Am I correct that unlike with replication there isn't a maximum size
> of the critical path OSDs?

As far as I know, the math for calculating the probability of data loss wrt placement groups is the same for EC and for replication. Replication to n copies should be equivalent to EC with k=1 and m=(n-1).

> Meaning that with replication x3 and typical values of 100 PGs per OSD at
> most 300 OSDs form a set out of which 3 OSDs need to fail for data loss.
> The statistical likelihood for that based on some assumptions
> is significant, but not nightmarishly so.
> A cluster with 1500 OSDs in total is thus as susceptible as one with just
> 300.
> Meaning that 3 disk losses in the big cluster don't necessarily mean data
> loss at all.

Someone might correct me on this, but here's my take on the math.

If you have 100 PGs per OSD, 1500 OSDs, and replication 3, you have:

1500 * 100 / 3 = 50000 pool PGs, and thus 50000 (hopefully) different 3-sets of OSDs.

(1500 choose 3) = 561375500 possible sets of 3 OSDs

Therefore if you lose 3 random OSDs, your chance of (any) data loss is 50000/561375500 = ~0.009%. (and if you *do* get unlucky and hit the wrong set of 3 OSDs, you can expect to lose 1/50000 = ~0.002% of your data)

> However it feels that with EC all OSDs can essentially be in the same set
> and thus having 6 out of 1500 OSDs fail in a 10+5 EC pool with 100 PGs per
> OSD would affect every last object in that cluster, not just a subset.

The math should work essentially the same way:

1500 * 100 / 15 = 10000 15-sets of OSDs

(1500 choose 15) = 3.1215495e+35 possible 15-sets of OSDs

Now if 6 OSDs fail, that will affect many potential 15-sets of OSDs chosen with the remaining OSDs in the cluster:

((1500 - 6) choose 9) = 9.9748762e+22

Putting it together, the chance of any data loss from a simultaneous loss of 6 random OSDs:

10000 / 3.1215495e+35 * 9.9748762e+22 = ~3.2e-9, i.e. about 0.0032 ppm

And if you *do* get unlucky you can expect to lose 1/10000 = ~0.01% of your data.

So your chance of data loss is much smaller with such a wide EC encoding, but if you do lose a PG you'll lose more data because there are fewer PGs.

Feedback on my math welcome.

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
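Since the combinatorics above are easy to fat-finger, here is a small script that reproduces the numbers. Like the mail above, it assumes PG sets are uniformly random and independent, which CRUSH placement only approximates:

#!/usr/bin/env python3
# Back-of-the-envelope chance of losing any PG when `failures` OSDs die at once.
from math import comb   # Python 3.8+

def loss_chance(osds, pgs_per_osd, width, failures):
    pool_pgs = osds * pgs_per_osd // width
    # fraction of possible `width`-sets that contain all the failed OSDs
    per_pg = comb(osds - failures, width - failures) / comb(osds, width)
    return 1 - (1 - per_pg) ** pool_pgs

print(loss_chance(1500, 100, 3, 3))    # replica 3, 3 failures: ~8.9e-05 (~0.009%)
print(loss_chance(1500, 100, 15, 6))   # EC 10+5,  6 failures: ~3.2e-09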
Re: [ceph-users] MDS stuck at replaying status
please set debug_mds=10, and try again On Tue, Apr 2, 2019 at 1:01 PM Albert Yue wrote: > > Hi, > > This happens after we restart the active MDS, and somehow the standby MDS > daemon cannot take over successfully and is stuck at up:replaying. It is > showing the following log. Any idea on how to fix this? > > 2019-04-02 12:54:00.985079 7f6f70670700 1 mds.WXS0023 respawn > 2019-04-02 12:54:00.985095 7f6f70670700 1 mds.WXS0023 e: '/usr/bin/ceph-mds' > 2019-04-02 12:54:00.985097 7f6f70670700 1 mds.WXS0023 0: '/usr/bin/ceph-mds' > 2019-04-02 12:54:00.985099 7f6f70670700 1 mds.WXS0023 1: '-f' > 2019-04-02 12:54:00.985100 7f6f70670700 1 mds.WXS0023 2: '--cluster' > 2019-04-02 12:54:00.985101 7f6f70670700 1 mds.WXS0023 3: 'ceph' > 2019-04-02 12:54:00.985102 7f6f70670700 1 mds.WXS0023 4: '--id' > 2019-04-02 12:54:00.985103 7f6f70670700 1 mds.WXS0023 5: 'WXS0023' > 2019-04-02 12:54:00.985104 7f6f70670700 1 mds.WXS0023 6: '--setuser' > 2019-04-02 12:54:00.985105 7f6f70670700 1 mds.WXS0023 7: 'ceph' > 2019-04-02 12:54:00.985106 7f6f70670700 1 mds.WXS0023 8: '--setgroup' > 2019-04-02 12:54:00.985107 7f6f70670700 1 mds.WXS0023 9: 'ceph' > 2019-04-02 12:54:00.985142 7f6f70670700 1 mds.WXS0023 respawning with exe > /usr/bin/ceph-mds > 2019-04-02 12:54:00.985145 7f6f70670700 1 mds.WXS0023 exe_path > /proc/self/exe > 2019-04-02 12:54:02.139272 7ff8a739a200 0 ceph version 12.2.5 > (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process > (unknown), pid 3369045 > 2019-04-02 12:54:02.141565 7ff8a739a200 0 pidfile_write: ignore empty > --pid-file > 2019-04-02 12:54:06.675604 7ff8a0ecd700 1 mds.WXS0023 handle_mds_map standby > 2019-04-02 12:54:26.114757 7ff8a0ecd700 1 mds.0.136021 handle_mds_map i am > now mds.0.136021 > 2019-04-02 12:54:26.114764 7ff8a0ecd700 1 mds.0.136021 handle_mds_map state > change up:boot --> up:replay > 2019-04-02 12:54:26.114779 7ff8a0ecd700 1 mds.0.136021 replay_start > 2019-04-02 12:54:26.114784 7ff8a0ecd700 1 mds.0.136021 recovery set is > 2019-04-02 12:54:26.114789 7ff8a0ecd700 1 mds.0.136021 waiting for osdmap > 14333 (which blacklists prior instance) > 2019-04-02 12:54:26.141256 7ff89a6c0700 0 mds.0.cache creating system inode > with ino:0x100 > 2019-04-02 12:54:26.141454 7ff89a6c0700 0 mds.0.cache creating system inode > with ino:0x1 > 2019-04-02 12:54:50.148022 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-04-02 12:54:50.148049 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping > beacon, heartbeat map not healthy > 2019-04-02 12:54:52.143637 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-04-02 12:54:54.148122 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-04-02 12:54:54.148157 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping > beacon, heartbeat map not healthy > 2019-04-02 12:54:57.143730 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-04-02 12:54:58.148239 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-04-02 12:54:58.148249 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping > beacon, heartbeat map not healthy > 2019-04-02 12:55:02.143819 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-04-02 12:55:02.148311 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-04-02 12:55:02.148330 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping > beacon, heartbeat map not healthy > 2019-04-02 12:55:06.148393 7ff89dec7700 1 heartbeat_map is_healthy 
'MDSRank' > had timed out after 15 > 2019-04-02 12:55:06.148416 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping > beacon, heartbeat map not healthy > 2019-04-02 12:55:07.143914 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-04-02 12:55:07.615602 7ff89e6c8700 1 heartbeat_map reset_timeout > 'MDSRank' had timed out after 15 > 2019-04-02 12:55:07.618294 7ff8a0ecd700 1 mds.WXS0023 map removed me (mds.-1 > gid:7441294) from cluster due to lost contact; respawning > 2019-04-02 12:55:07.618296 7ff8a0ecd700 1 mds.WXS0023 respawn > 2019-04-02 12:55:07.618314 7ff8a0ecd700 1 mds.WXS0023 e: '/usr/bin/ceph-mds' > 2019-04-02 12:55:07.618318 7ff8a0ecd700 1 mds.WXS0023 0: '/usr/bin/ceph-mds' > 2019-04-02 12:55:07.618319 7ff8a0ecd700 1 mds.WXS0023 1: '-f' > 2019-04-02 12:55:07.618320 7ff8a0ecd700 1 mds.WXS0023 2: '--cluster' > 2019-04-02 12:55:07.618320 7ff8a0ecd700 1 mds.WXS0023 3: 'ceph' > 2019-04-02 12:55:07.618321 7ff8a0ecd700 1 mds.WXS0023 4: '--id' > 2019-04-02 12:55:07.618321 7ff8a0ecd700 1 mds.WXS0023 5: 'WXS0023' > 2019-04-02 12:55:07.618322 7ff8a0ecd700 1 mds.WXS0023 6: '--setuser' > 2019-04-02 12:55:07.618323 7ff8a0ecd700 1 mds.WXS0023 7: 'ceph' > 2019-04-02 12:55:07.618323 7ff8a0ecd700 1 mds.WXS0023 8: '--setgroup' > 2019-04-02 12:55:07.618325 7ff8a0ecd700 1 mds.WXS0023 9: 'ceph' >
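For anyone following along, two ways to apply that suggestion; the daemon name comes from the log above, and since the MDS keeps respawning, persisting the setting in ceph.conf is the more reliable of the two:

# Option 1: inject at runtime (may be lost when the MDS respawns).
ceph tell mds.WXS0023 injectargs '--debug_mds=10'

# Option 2: persist in ceph.conf on the MDS host, then restart the daemon:
#   [mds]
#       debug mds = 10
# Remember to turn it back down afterwards; level 10 is very verbose.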