Re: [ceph-users] Remove rbd image after interrupt of deletion command
On Tue, 11 Jun 2019 at 14:46, Sakirnth Nagarasa wrote:
> On 6/7/19 3:35 PM, Jason Dillaman wrote:
[...]
> > Can you run "rbd rm --log-to-stderr=true --debug-rbd=20
> > ${POOLNAME}/${IMAGE}" and provide the logs via pastebin.com?
> >
> >> Cheers,
> >> Sakirnth
>
> It is not necessary anymore, the remove command worked. The problem was
> only the "rbd info" command. It took approximately one day to remove the
> cloned image (50 TB), which was not flattened. Why did it take so long?
> The clone command completed within seconds.
>
> Thanks,
> Sakirnth

Sakirnth, previously you said (statement A):

"... rbd rm ${POOLNAME}/${IMAGE}
rbd: error opening image ${IMAGE}: (2) No such file or directory ..."

Now you're saying (statement B): "rm worked and the only issue was the info
command". Obviously both statements can't be true at the same time. Can you
elaborate on that matter so that the mailing list users have a better
understanding?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] balancer module makes OSD distribution worse
On Thu, 6 Jun 2019 at 03:01, Josh Haft wrote:
>
> Hi everyone,
>
> On my 13.2.5 cluster, I recently enabled the ceph balancer module in
> crush-compat mode.

Why did you choose compat mode? Don't you want to try another one instead?
Re: [ceph-users] Can I limit OSD memory usage?
On Sat, 8 Jun 2019 at 04:35, Sergei Genchev wrote:
>
> Hi,
> My OSD processes are constantly getting killed by OOM killer. My
> cluster has 5 servers, each with 18 spinning disks, running 18 OSD
> daemons in 48GB of memory.
> I was trying to limit OSD cache, according to
> http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>
> [osd]
> bluestore_cache_size_ssd = 1G
> bluestore_cache_size_hdd = 768M
>
> Yet, my OSDs are using way more memory than that. I have seen as high as 3.2G

Well, it's been widely known for a long time that 640KB isn't enough for
everyone. ;)

Ceph's OSD RAM consumption largely depends on its backing store capacity.
Check out the official recommendations:
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/
-- "ceph-osd: RAM ~1GB for 1TB of storage per daemon".

You didn't specify the capacity of the disks, BTW. 2--3 TB?

[...]
> Is there any way for me to limit how much memory does OSD use?

Try adding to the same [osd] section:

osd_memory_target = ...amount_in_bytes...

Don't set it to 640 KB, though. ;-) The minimum recommendations still make
sense -- reduce with caution.
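For illustration: with 18 OSDs sharing 48GB of RAM, a target of roughly 2GiB
per daemon leaves some headroom for the rest of the system. The value below
is a made-up example, not a recommendation -- size it for your own hardware:

```ini
# /etc/ceph/ceph.conf -- hypothetical sizing for 18 OSDs in 48GB RAM
[osd]
# osd_memory_target is a best-effort target (in bytes), not a hard limit;
# actual RSS can temporarily exceed it.
osd_memory_target = 2147483648
```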
[ceph-users] Any CEPH's iSCSI gateway users?
What is your experience? Does it make sense to use it -- is it solid enough,
or rather beta quality (both in terms of stability and performance)?

I've read it was more or less packaged to work with RHEL. Does that still
hold true? What's the best way to install it on, say, CentOS or
Debian/Ubuntu?
Re: [ceph-users] Massive TCP connection on radosgw
On Wed, 22 May 2019 at 20:32, Torben Hørup wrote:
>
> Which states are all these connections in ?
>
> ss -tn

That set of args won't display anything but ESTABLISHED connections.
One typically needs `ss -atn` instead.
Re: [ceph-users] Nautilus, k+m erasure coding a profile vs size+min_size
On Tue, 21 May 2019 at 19:32, Yoann Moulin wrote:
>
> >> I am doing some tests with Nautilus and cephfs on erasure coding pool.
[...]
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034867.html
>
> Oh thanks, I missed that thread, make sense. I agree with some comment that
> it is a little bit confusing.

Check out this as well:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034242.html
Re: [ceph-users] Tip for erasure code profile?
On Fri, 3 May 2019 at 22:46, Robert Sander wrote:
> The cluster spans 2 rooms ...
> The failure domain would be the room level ...
> Is that even possible with erasure coding?

Sure, but you'd need slightly more rooms then. E. g., a minimal EC(2, 1)
means (2 + 1) rooms.
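A sketch of how that could look (profile and pool names are made up; the
point is that a room-level failure domain needs at least k+m rooms in the
CRUSH map):

```shell
# Create an EC profile with k=2, m=1 and room as the failure domain
ceph osd erasure-code-profile set ec21-rooms \
    k=2 m=1 crush-failure-domain=room

# Create a pool using that profile; CRUSH will then place each of the
# 3 shards (2 data + 1 coding) in a different room
ceph osd pool create ecpool 128 128 erasure ec21-rooms
```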
Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore
On Fri, 3 May 2019 at 21:39, Mark Nelson wrote:
[...]
> > [osd]
> > ...
> > bluestore_allocator = bitmap
> > bluefs_allocator = bitmap
> >
> > I would restart the nodes one by one and see, what happens.
>
> If you are using 12.2.11 you likely still have the old bitmap allocator

Would those config changes just be ignored, or would the OSD fail to start
instead?
Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore
On Fri, 3 May 2019 at 13:38, Denny Fuchs wrote:
[...]
> If I understand correct: I should try to set bitmap allocator

That's one of the options I mentioned. Another was to try using jemalloc
(re-read my emails).

> [osd]
> ...
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap
>
> I would restart the nodes one by one and see, what happens.

Right.
Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore
On Fri, 3 May 2019 at 05:12, Mark Nelson wrote:
[...]
> > -- https://www.kernel.org/doc/Documentation/vm/transhuge.txt
>
> Why are you quoting the description for the madvise setting when that's
> clearly not what was set in the case I just showed you?

Similarly, why are you telling us it must be due to THPs if:

1) by default they're not used unless madvise()'ed,
2) neither jemalloc nor tcmalloc would madvise by default either.

[...]
> previously |malloc|'ed. Because the machine used transparent huge pages,

Is that from DigitalOcean's blog? I read it pretty long ago. And it was
written long ago, referring to some ancient release of jemalloc and, what's
more important, to a system that had THP activated. -- But I've shown you
that the default kernel setting is not to use THP unless madvise() tells
the kernel to.

Your example with CentOS isn't relevant, because the person who started
this thread uses Debian (Proxmox, to be more correct). Moreover, something
tells me that even in default CentOS installs THPs are also set to
madvise()-only.

> I'm not going to argue with you about this.

I'm not arguing with you. I'm merely showing you that instead of making
baseless claims (or wild guesswork), it's worth checking facts first.
Checking whether THPs are used at all (although it might be due not to OSDs
but, say, KVM) is as simple as looking into /proc/meminfo.

> Test it if you want or don't.

I didn't start this thread. ;) As for me -- I've played enough with all
kinds of allocators and THP settings. :)
Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore
On Fri, 3 May 2019 at 01:29, Mark Nelson wrote:
> On 5/2/19 11:46 AM, Igor Podlesny wrote:
> > On Thu, 2 May 2019 at 05:02, Mark Nelson wrote:
> > [...]
> >> FWIW, if you still have an OSD up with tcmalloc, it's probably worth
> >> looking at the heap stats to see how much memory tcmalloc thinks it's
> >> allocated vs how much RSS memory is being used by the process. It's
> >> quite possible that there is memory that has been unmapped but that the
> >> kernel can't (or has decided not yet to) reclaim.
> >> Transparent huge pages can potentially have an effect here both with
> >> tcmalloc and with jemalloc so it's not certain that switching the
> >> allocator will fix it entirely.
> > Most likely wrong. -- Default kernel's settings in regards of THP are
> > "madvise".
> > None of tcmalloc or jemalloc would madvise() to make it happen.
> > With fresh enough jemalloc you could have it, but it needs special
> > malloc.conf'ing.
>
> From one of our centos nodes with no special actions taken to change
> THP settings (though it's possible it was inherited from something else):
>
> $ cat /etc/redhat-release
> CentOS Linux release 7.5.1804 (Core)
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never

"madvise" will enter direct reclaim like "always" but only for regions
that have used madvise(MADV_HUGEPAGE). This is the default behaviour.

-- https://www.kernel.org/doc/Documentation/vm/transhuge.txt

> And regarding madvise and alternate memory allocators:
> https: [...]

Did you ever read any of it? One link's info:

"By default jemalloc does not use huge pages for heap memory (there is
opt.metadata_thp which uses THP for internal metadata though)"

(and I've said

> > None of tcmalloc or jemalloc would madvise() to make it happen.
> > With fresh enough jemalloc you could have it, but it needs special
> > malloc.conf'ing.

before)
Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore
On Thu, 2 May 2019 at 05:02, Mark Nelson wrote:
[...]
> FWIW, if you still have an OSD up with tcmalloc, it's probably worth
> looking at the heap stats to see how much memory tcmalloc thinks it's
> allocated vs how much RSS memory is being used by the process. It's
> quite possible that there is memory that has been unmapped but that the
> kernel can't (or has decided not yet to) reclaim.
> Transparent huge pages can potentially have an effect here both with
> tcmalloc and with jemalloc so it's not certain that switching the
> allocator will fix it entirely.

Most likely wrong. -- The default kernel setting with regard to THP is
"madvise". Neither tcmalloc nor jemalloc would madvise() to make it happen.
With a fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.

> First I would just get the heap stats and then after that I would be
> very curious if disabling transparent huge pages helps. Alternately,
> it's always possible it's a memory leak. :D

RedHat can do better (hopefully). ;-P
Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore
On Tue, 30 Apr 2019 at 20:56, Igor Podlesny wrote:
> On Tue, 30 Apr 2019 at 19:10, Denny Fuchs wrote:
> [..]
> > Any suggestions ?
>
> -- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator. Igor Fedotov wrote that it's expected to have
a smaller memory footprint over time:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs. Igor (Fedotov), can you
please elaborate on this matter?
Re: [ceph-users] Data distribution question
On Wed, 1 May 2019 at 01:58, Dan van der Ster wrote:
> On Tue, Apr 30, 2019 at 8:26 PM Igor Podlesny wrote:
[...]
> All of the clients need to be luminous or newer:
>
> # ceph osd set-require-min-compat-client luminous
>
> You need to enable the module:
>
> # ceph mgr module enable balancer

(Enabled by default according to the docs.)

> You probably don't want it to run 24/7:
>
> # ceph config-key set mgr/balancer/begin_time 0800
> # ceph config-key set mgr/balancer/end_time 1800

Oh, that's handy.

> The default rate at which it balances things is a bit too high for my taste:
>
> # ceph config-key set mgr/balancer/max_misplaced 0.005
> # ceph config-key set mgr/balancer/upmap_max_iterations 2
>
> (Those above are optional... YMMV)

Yep, but good to know!

> Now fail the active mgr so that the new one reads those new options above.
>
> # ceph mgr fail
>
> Enable the upmap mode:
>
> # ceph balancer mode upmap
>
> Test it once to see that it works at all:
>
> # ceph balancer optimize myplan
> # ceph balancer show myplan
> # ceph balancer reset
>
> (any errors, start debugging -- use debug_mgr = 4/5 and check the
> active mgr's log for the balancer details.)
>
> # ceph balancer on
>
> Now it'll start moving the PGs around until things are quite well balanced.
> In our clusters that process takes a week or two... it depends on
> cluster size, numpgs, etc...
>
> Hope that helps!

Thank you :)
Re: [ceph-users] Data distribution question
On Wed, 1 May 2019 at 01:26, Igor Podlesny wrote:
> On Wed, 1 May 2019 at 01:01, Dan van der Ster wrote:
> >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform
> >> > on our clusters.
> >>
> >> mode upmap ?
> >
> > yes, mgr balancer, mode upmap.

Also -- do your Ceph clusters have single-root hierarchy pools (like
"default"), or are there some pools that use non-default roots? Looking
through the docs I didn't find a way to narrow the balancer's scope down to
specific pool(s), although personally I'd prefer it to operate on a small
set of them.
Re: [ceph-users] Data distribution question
On Wed, 1 May 2019 at 01:26, Jack wrote:
> If those pools are useless, you can:
> - drop them

As Dan pointed out, that's unlikely to have any effect. The thing is,
imbalance is a "property" of a pool -- I'd suppose most often of the most
loaded one (or of a few most loaded ones). Pools that aren't used much
don't affect it.
Re: [ceph-users] Data distribution question
On Wed, 1 May 2019 at 01:01, Dan van der Ster wrote:
>> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on
>> > our clusters.
>>
>> mode upmap ?
>
> yes, mgr balancer, mode upmap.

I see. Was it a matter of just:

1) ceph balancer mode upmap
2) ceph balancer on

or were there any other steps?
Re: [ceph-users] Data distribution question
On Wed, 1 May 2019 at 00:24, Dan van der Ster wrote:
>
> The upmap balancer in v12.2.12 works really well... Perfectly uniform on
> our clusters.
>
> .. Dan

mode upmap ?
Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore
On Tue, 30 Apr 2019 at 19:10, Denny Fuchs wrote:
[..]
> Any suggestions ?

-- Try a different allocator. In Proxmox 4 they had this by default in
/etc/default/ceph:

{{
## use jemalloc instead of tcmalloc
#
# jemalloc is generally faster for small IO workloads and when
# ceph-osd is backed by SSDs. However, memory usage is usually
# higher by 200-300mb.
#
#LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1
}}

so you may try using it the same way; the package is still there in
Proxmox 5:

libjemalloc1: /usr/lib/x86_64-linux-gnu/libjemalloc.so.1

No one can tell for sure if it would help, but jemalloc "... is a general
purpose malloc(3) implementation that emphasizes fragmentation avoidance
and scalable concurrency support. ..." -- http://jemalloc.net/

I noticed OSDs with jemalloc tend to have a way bigger VSZ over time, but
RSS should be fine. Looking forward to hearing about your experience with it.
Re: [ceph-users] Need some advice about Pools and Erasure Coding
On Tue, 30 Apr 2019 at 19:11, Adrien Gillard wrote:
> On Tue, Apr 30, 2019 at 10:06 AM Igor Podlesny wrote:
> >
> > On Tue, 30 Apr 2019 at 04:13, Adrien Gillard wrote:
> > > I would add that the use of cache tiering, though still possible, is
> > > not recommended
> >
> > It lacks references. CEPH docs I gave links to didn't say so.
>
> The cache tiering documentation mentions that (your link refers to it):
>
> http://docs.ceph.com/docs/nautilus/rados/operations/cache-tiering/#a-word-of-caution

I saw this and didn't find "not recommended" or anything alike there.

> There are some threads on the mailing list referring to the subject as
> well (by David Turner or Christian Balzer, for instance)

Thanks, will try to find them.
Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic
On Mon, 15 Apr 2019 at 19:40, Wido den Hollander wrote:
>
> Hi,
>
> With the release of 12.2.12 the bitmap allocator for BlueStore is now
> available under Mimic and Luminous.
>
> [osd]
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap

Hi! Have you tried this? :)
Re: [ceph-users] Need some advice about Pools and Erasure Coding
On Tue, 30 Apr 2019 at 04:13, Adrien Gillard wrote:
> I would add that the use of cache tiering, though still possible, is not
> recommended

It lacks references. The Ceph docs I gave links to didn't say so.

> comes with its own challenges.

It's challenging for some not to over-quote when replying, but I don't
think that holds true for everyone.
Re: [ceph-users] Need some advice about Pools and Erasure Coding
On Mon, 29 Apr 2019 at 16:19, Rainer Krienke wrote:
[...]
> - Do I still (nautilus) need two pools for EC based RBD images, one EC
> data pool and a second replicated pool for metadata?

The answer is given at
http://docs.ceph.com/docs/nautilus/rados/operations/erasure-code/#erasure-coding-with-overwrites

"... Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and their
metadata in a replicated pool ..."

Another option is using tiered pools, especially when you can dedicate fast
OSDs for that:
http://docs.ceph.com/docs/nautilus/rados/operations/erasure-code/#erasure-coded-pool-and-cache-tiering
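A sketch of the two-pool setup for RBD (pool names, PG counts, and the
image name here are made up for illustration):

```shell
# EC data pool; overwrites must be enabled for RBD to use it
ceph osd pool create rbd_data 128 128 erasure
ceph osd pool set rbd_data allow_ec_overwrites true

# Replicated pool that holds the image metadata (omap)
ceph osd pool create rbd_meta 64 64 replicated
rbd pool init rbd_meta

# The image "lives" in the replicated pool; its data goes to the EC pool
rbd create --size 1T --data-pool rbd_data rbd_meta/myimage
```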
Re: [ceph-users] Need some advice about Pools and Erasure Coding
On Mon, 29 Apr 2019 at 16:37, Burkhard Linke wrote:
> On 4/29/19 11:19 AM, Rainer Krienke wrote:
[...]
> > - I also thought about the different k+m settings for a EC pool, for
> > example k=4, m=2 compared to k=8 and m=2. Both settings allow for two
> > OSDs to fail without any data loss, but I asked myself which of the two
> > settings would be more performant? On one hand distributing data to more
> > OSDs allows a higher parallel access to the data, that should result in
> > a faster access. On the other hand each OSD has a latency until
> > it can deliver its data shard. So is there a recommandation which of my
> > two k+m examples should be preferred?
>
> I cannot comment on speed (interesting question, since we are about to

In theory, the more stripes you have, the faster it works overall (IO load
is distributed among a bigger number of hosts).
Re: [ceph-users] Is it possible to get list of all the PGs assigned to an OSD?
On Mon, 29 Apr 2019 at 15:13, Eugen Block wrote:
>
> Sure there is:
>
> ceph pg ls-by-osd

Thank you, Eugen, I somehow overlooked it. :)
[ceph-users] Is it possible to get list of all the PGs assigned to an OSD?
Or is there no direct way to accomplish that? What workarounds can be used
then?
[ceph-users] Does ceph osd reweight-by-xxx work correctly if OSDs aren't of same size?
Say some nodes have OSDs that are 1.5 times bigger than other nodes have,
while the weights of all the nodes in question are almost equal (due to
their having different numbers of OSDs, obviously).
Re: [ceph-users] How does CEPH calculates PGs per OSD for erasure coded (EC) pools?
On Sun, 28 Apr 2019 at 16:14, Paul Emmerich wrote:
> Use k+m for PG calculation, that value also shows up as "erasure size"
> in ceph osd pool ls detail

So does it mean that for PG calculation these 2 pools are equivalent:

1) EC(4, 2)
2) replicated, size 6

? Sounds weird, to be honest. Replicated with size 6 means each logical
data unit is stored 6 times: what needed a single PG now requires 6 PGs.
And with EC(4, 2) there's still only 1.5x overhead in terms of raw occupied
space -- how come the PG distribution calculation needs adjusting by 6
instead of 1.5 then?

Also, why does the Ceph documentation say "It is equivalent to a replicated
pool of size __two__" when describing the EC(2, 1) example?
[ceph-users] How does CEPH calculates PGs per OSD for erasure coded (EC) pools?
For replicated pools (w/o rounding to the nearest power of two) the overall
number of PGs is calculated as:

    Pool_PGs = 100 * (OSDs / Pool_Size),

where

    100       -- target number of PGs per single OSD related to that pool,
    Pool_Size -- factor showing how much raw storage would in fact be used
                 to store one logical data unit.

By analogy I can suppose that with EC pools the corresponding Pool_Size can
be calculated as:

    Raw_Storage_Use / Logical_Storage_Use

or, using EC semantics, (k + m) / k. And for EC (k=2, m=1) it gives:

    Raw_Storage_Use = 3
    Logical_Storage_Use = 2

-- hence, Pool_Size should be 1.5.

OTOH, the Ceph documentation says this about the same EC pool (underline is
mine):

"It is equivalent to a replicated pool of size __two__ but requires 1.5TB
instead of 2TB to store 1TB of data"

So how does Ceph calculate the PG distribution per OSD for it? Using
(k + m) / k? Or just k? Or differently at all?
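The arithmetic in question can be sketched as follows. Per Paul Emmerich's
earlier reply, PG placement counts k+m shards per PG (each shard lands on a
distinct OSD), so for PGs-per-OSD purposes the divisor is k+m, not the
(k+m)/k space overhead. Numbers below are purely illustrative:

```python
# PGs-per-OSD load of a pool: each PG of a replicated pool places `size`
# copies; each PG of an EC(k, m) pool places k+m shards -- one per OSD.
def pgs_per_osd(pool_pgs: int, osds: int, shards_per_pg: int) -> float:
    return pool_pgs * shards_per_pg / osds

OSDS = 12

# Replicated pool, size=6: 128 PGs -> 128*6 shards over 12 OSDs
rep6 = pgs_per_osd(128, OSDS, 6)

# EC(4, 2) pool: 128 PGs -> 128*(4+2) shards over the same 12 OSDs
ec42 = pgs_per_osd(128, OSDS, 4 + 2)

# In terms of PG *placement* load the two pools are indeed equivalent,
# even though their raw-space overheads differ (6x vs 1.5x).
print(rep6, ec42)  # 64.0 64.0
```

That is, the "equivalent to replicated size 6" statement is about how many
PG shards each OSD has to carry, not about how much raw space is consumed.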
[ceph-users] Were fixed CephFS lock ups when it's running on nodes with OSDs?
I remember seeing reports about that, but it's been a while now. Can anyone
tell?
Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?
On Tue, 16 Apr 2019 at 17:05, Paul Emmerich wrote:
>
> No, the problem is that a storage system should never tell a client
> that it has written data if it cannot guarantee that the data is still
> there if one device fails.
[...]

Ah, now I get your point. Anyway, it should be the users' choice (with a
warning, probably, but still). I can easily (but with a heavy heart) recall
what happened twice, and not too long ago, when someone decided "we know
better what to do than the users^W pilots do". Too many similar decisions
were (and still are) popping up in the IT industry too. Of course, always
"for good reasons" -- who'd have a doubt(?)...

Oh, and BTW -- it's not possible to change EC(2,1)'s 3/3 to 3/2 in
Luminous, is it?
Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?
On Tue, 16 Apr 2019 at 16:52, Paul Emmerich wrote:
> On Tue, Apr 16, 2019 at 11:50 AM Igor Podlesny wrote:
> > On Tue, 16 Apr 2019 at 14:46, Paul Emmerich wrote:
[...]
> > Looked at it, didn't see any explanation of your point of view. If
> > there're 2 active data instances
> > (and 3rd is missing) how is it different to replicated pools with 3/2
> > config(?)
>
> each of these "copies" has only half the data

I still don't see how. EC(2, 1) is conceptually RAID5 on 3 devices. You're
basically saying that if one of those 3 disks is missing, you can't safely
write to the 2 others that are still in. But Ceph's EC has no partial
update issue, has it? Can you elaborate?
Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?
On Tue, 16 Apr 2019 at 14:46, Paul Emmerich wrote:
> Sorry, I just realized I didn't answer your original question.
[...]

No problemo. -- I figured out the answer to my own question earlier anyway,
and actually gave a hint today, based on those findings:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034278.html

> Regarding min_size: yes, you are right about a 2+1 pool being created
> with min_size 2 by default in the latest Nautilus release.
> This seems like a bug to me, I've opened a ticket here:
> http://tracker.ceph.com/issues/39307

Looked at it, didn't see any explanation of your point of view. If there're
2 active data instances (and the 3rd is missing), how is it different from
replicated pools with a 3/2 config(?)

[... overquoting removed ...]
Re: [ceph-users] 'Missing' capacity
On Tue, 16 Apr 2019 at 06:43, Mark Schouten wrote:
[...]
> So where is the rest of the free space? :X

It makes sense to look at:

    sudo ceph osd df tree
Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?
And as to the min_size choice -- since you've replied exactly to that part
of my message only.

On Sat, 13 Apr 2019 at 06:54, Paul Emmerich wrote:
> On Fri, Apr 12, 2019 at 9:30 PM Igor Podlesny wrote:
> > For e. g., an EC pool with default profile (2, 1) has bogus "sizing"
> > params (size=3, min_size=3). {{
> > Min. size 3 is wrong as far as I know and it's been fixed in fresh
> > releases (but not in Luminous). }}

I didn't give any proof when writing this, being more focused on the EC
pool usage calculation. Take a look at:

https://github.com/ceph/ceph/pull/8008

As can be seen there, the formula for min_size became

    min_size = k + min(1, m - 1)

effectively in March 2019. -- That's why I've said "fixed in fresh releases
but not in Luminous". Let's see what this new formula produces for k=2, m=1
(the default and documented EC profile):

    min_size = 2 + min(1, 1 - 1) = 2 + 0 = 2.

Before that change it would have been 3 instead, thus giving that 3/3 for
EC(2, 1).

[...]
> min_size 3 is the default for that pool, yes. That means your data
> will be unavailable if any OSD is offline.
> Reducing min_size to 2 means you are accepting writes when you cannot
> guarantee durability which will cause problems in the long run.
> See older discussions about min_size here

I'd be glad to, but this is not a forum (here) -- it's a mailing list,
right(?) -- so the only way to "see here" is to rely on a search engine
that might have indexed the list archive. If you have a specific URL, or at
least exact keywords allowing me to find what you're referring to, I'd
gladly look at what you're talking about. And of course I did search before
writing, and the fact that I wrote it anyway means I didn't find anything
answering my question "here or there".
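The old vs new defaults can be sketched as follows (the new formula is the
one from the referenced pull request; the pre-change formula k + m is
inferred from the 3/3 figures quoted above):

```python
def min_size_new(k: int, m: int) -> int:
    # Formula from the referenced PR: require one shard of redundancy
    # before accepting writes, except for the degenerate m=1 case.
    return k + min(1, m - 1)

def min_size_old(k: int, m: int) -> int:
    # Previous behaviour, which made an EC(2, 1) pool effectively 3/3
    return k + m

print(min_size_old(2, 1), min_size_new(2, 1))  # 3 2
print(min_size_new(4, 2))                      # 5
```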
Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?
On Sat, 13 Apr 2019 at 06:54, Paul Emmerich wrote:
>
> Please don't use an EC pool with 2+1, that configuration makes no sense.

That's quite ironic, given that (2, 1) is the default EC profile, and is
described in the Ceph documentation in addition.

> min_size 3 is the default for that pool, yes. That means your data
> will be unavailable if any OSD is offline.
> Reducing min_size to 2 means you are accepting writes when you cannot
> guarantee durability which will cause problems in the long run.
> See older discussions about min_size here

Well, my primary concern wasn't about min_size at all, but about this:

> > But besides that it looks like pool usage isn't calculated according
> > to EC overhead but as if it was replicated pool with size=3 as well.
[ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?
E. g., an EC pool with the default profile (2, 1) has bogus "sizing" params
(size=3, min_size=3). min_size 3 is wrong as far as I know, and it's been
fixed in fresh releases (but not in Luminous). But besides that, it looks
like pool usage isn't calculated according to the EC overhead, but as if it
was a replicated pool with size=3 as well.
[ceph-users] Does Bluestore backed OSD detect bit rot immediately when reading or only when scrubbed?
It's widely known that some filesystems (well, OK -- 2 of them: ZFS and
Btrfs) detect bit rot on any read request, although, of course, an admin
can also initiate "whole platter" scrubbing. Before BlueStore, Ceph could
provide only "on demand" detection. I don't take into consideration
imaginary setups with Btrfs- or ZFS-backed OSDs -- although Btrfs was
supported, it couldn't be trusted due to being too quirky, and ZFS would
mean way higher overhead and resource consumption; moreover, it
conceptually doesn't fit well into Ceph's paradigm.

So when a scrub found a mismatch, it would trigger the infamous HEALTH_ERR
state and require manual tinkering to resolve (although, in typical cases,
when 3 copies of a placement group were used, it seemed more logical to
autofix it -- at least most users would make the same choice in 99% of
occurrences).

Since BlueStore, I'd expect bit rot detection to be made on any read
request, as is the case with Btrfs and ZFS. But expectations can be wrong,
no matter how logical they might seem, and that's why I've decided to clear
it up. Can anyone tell for sure how it works in Ceph with BlueStore?

If it's NOT the same as with those 2 CoW filesystems, and bit rot is
detected with scrubbing only, how prone to data corruption / loss would be:

* 2-copies pools (2/1),
* erasure coded pools (say, 2, 1)?

Let's consider a replicated pool with 2/1 where both data instances are
up-to-date, and then one is found to be corrupted. Would its csum mismatch
be enough for it to be "cured" semi-automatically with ceph pg repair? And
what would happen, and how, in case an erasure coded pool's data was found
to be damaged as well?
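For what it's worth, BlueStore does store a checksum (crc32c by default,
controlled by bluestore_csum_type) per data blob and verifies it on every
read, not only during scrubs. Conceptually that's similar to this toy
sketch (zlib's crc32 stands in here for Ceph's crc32c; the class and its
behaviour are illustrative, not Ceph code):

```python
import zlib

# A toy "blob store" that, like a checksumming object store, keeps a
# checksum alongside each blob and verifies it on read -- rather than
# relying on periodic scrubs alone to notice bit rot.
class ChecksummedStore:
    def __init__(self):
        self._blobs = {}  # name -> (mutable data, checksum)

    def write(self, name: str, data: bytes) -> None:
        self._blobs[name] = (bytearray(data), zlib.crc32(data))

    def read(self, name: str) -> bytes:
        data, csum = self._blobs[name]
        if zlib.crc32(bytes(data)) != csum:
            # A real OSD would return a read error and recover the object
            # from a healthy replica or from the remaining EC shards
            raise IOError(f"checksum mismatch on {name!r}: bit rot detected")
        return bytes(data)

    def corrupt(self, name: str) -> None:
        self._blobs[name][0][0] ^= 0xFF  # simulate a flipped bit on disk

store = ChecksummedStore()
store.write("obj", b"important data")
assert store.read("obj") == b"important data"

store.corrupt("obj")
try:
    store.read("obj")
except IOError as e:
    print("caught:", e)
```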