Re: [ceph-users] Remove rbd image after interrupt of deletion command

2019-06-11 Thread Igor Podlesny
On Tue, 11 Jun 2019 at 14:46, Sakirnth Nagarasa wrote:
> On 6/7/19 3:35 PM, Jason Dillaman wrote:
[...]
> > Can you run "rbd rm --log-to-stderr=true --debug-rbd=20
> > ${POOLNAME}/${IMAGE}" and provide the logs via pastebin.com?
> >
> >> Cheers,
> >> Sakirnth
>
> It is not necessary anymore; the remove command worked. The problem was
> only with the "rbd info" command. It took approximately one day to remove
> the cloned image (50 TB), which was not flattened. Why did it take so long?
> The clone command completed within seconds.
>
> Thanks,
> Sakirnth

Sakirnth,

previously you've said (statement A): "...
rbd rm ${POOLNAME}/${IMAGE}
rbd: error opening image ${IMAGE}: (2) No such file or directory
..."

Now you're saying (statement B): "rm worked and the only issue was the
info command".
Obviously both statements can't be true at the same time.
Can you elaborate on that matter, so that the mailing list users have a
better understanding?



Re: [ceph-users] balancer module makes OSD distribution worse

2019-06-08 Thread Igor Podlesny
On Thu, 6 Jun 2019 at 03:01, Josh Haft  wrote:
>
> Hi everyone,
>
> On my 13.2.5 cluster, I recently enabled the ceph balancer module in
> crush-compat mode.

Why did you choose crush-compat mode? Don't you want to try the upmap mode instead?



Re: [ceph-users] Can I limit OSD memory usage?

2019-06-08 Thread Igor Podlesny
On Sat, 8 Jun 2019 at 04:35, Sergei Genchev  wrote:
>
>  Hi,
>  My OSD processes are constantly getting killed by OOM killer. My
> cluster has 5 servers, each with 18 spinning disks, running 18 OSD
> daemons in 48GB of memory.
>  I was trying to limit OSD cache, according to
> http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>
> [osd]
> bluestore_cache_size_ssd = 1G
> bluestore_cache_size_hdd = 768M
> Yet, my OSDs are using way more memory than that. I have seen as high as 3.2G

Well, it's been widely known for a long time that 640 KB isn't enough for everyone. ;)

CEPH's OSD RAM consumption is largely dependent on its backing store capacity.
Check out official recommendations:
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/
-- "ceph-osd: RAM~1GB for 1TB of storage per daemon".

You didn't specify the capacity of the disks, BTW. 2-3 TB?

[...]
>  Is there any way for me to limit how much memory does OSD use?

Try adding the following to the same [osd] section:

osd_memory_target = ...Amount_in_Bytes...

Don't set it to 640 KB though. ;-) The minimum recommendations still make
sense, so reduce with caution.
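
For illustration, a minimal sketch of such a section -- the 2 GiB figure is
only an assumption, derive your own from (total RAM minus OS/other daemons)
divided by the number of OSDs:

[osd]
# BlueStore OSDs try to keep their overall memory usage around this target
# (it's a target, not a hard cap); available in Mimic and, if I recall
# correctly, backported to the later Luminous 12.2.x releases
osd_memory_target = 2147483648

With 18 OSDs in 48 GB even 2 GiB each is tight, so leave a few GB of
headroom for the OS and page cache.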



[ceph-users] Any CEPH's iSCSI gateway users?

2019-05-28 Thread Igor Podlesny
What is your experience?
Does it make sense to use it -- is it solid enough, or rather beta quality
(both in terms of stability and performance)?

I've read it was more or less packaged to work with RHEL. Does that still
hold true?
What's the best way to install it on, say, CentOS or Debian/Ubuntu?



Re: [ceph-users] Massive TCP connection on radosgw

2019-05-22 Thread Igor Podlesny
On Wed, 22 May 2019 at 20:32, Torben Hørup  wrote:
>
> Which states are all these connections in ?
>
> ss -tn

That set of args won't display anything but ESTABLISHED connections.

One typically needs `-atn` instead.
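
For example, a quick per-state breakdown (plain shell, nothing
radosgw-specific assumed):

ss -atn | awk 'NR > 1 { print $1 }' | sort | uniq -c | sort -rn

If most of them turn out to be TIME-WAIT or CLOSE-WAIT rather than ESTAB,
that already narrows the problem down.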



Re: [ceph-users] Nautilus, k+m erasure coding a profile vs size+min_size

2019-05-21 Thread Igor Podlesny
On Tue, 21 May 2019 at 19:32, Yoann Moulin  wrote:
>
> >> I am doing some tests with Nautilus and cephfs on erasure coding pool.
[...]
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034867.html
>
> Oh thanks, I missed that thread, that makes sense. I agree with the comment that
> it is a little bit confusing.

Check out this as well:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034242.html



Re: [ceph-users] Tip for erasure code profile?

2019-05-03 Thread Igor Podlesny
On Fri, 3 May 2019 at 22:46, Robert Sander  wrote:
> The cluster spans 2 rooms
...
> The failure domain would be the room level
...
> Is that even possible with erasure coding?

Sure thing, but you'd need slightly more rooms then. E.g., a minimal
EC(2, 1) setup means (2 + 1) rooms.
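
E.g., a sketch with 3 rooms -- the profile and pool names are made up, and
the rooms must already exist as buckets in the CRUSH map:

ceph osd erasure-code-profile set ec21-room k=2 m=1 crush-failure-domain=room
ceph osd pool create ecpool 128 128 erasure ec21-room

That places each of the 3 shards (2 data + 1 coding) into a different room.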



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-03 Thread Igor Podlesny
On Fri, 3 May 2019 at 21:39, Mark Nelson  wrote:
[...]
> > [osd]
> > ...
> > bluestore_allocator = bitmap
> > bluefs_allocator = bitmap
> >
> > I would restart the nodes one by one and see, what happens.
>
> If you are using 12.2.11 you likely still have the old bitmap allocator

Would those config changes just be ignored, or would the OSD fail to start instead?
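
(E.g., after a restart, something like this shows the values the running
daemon actually picked up -- assuming the default admin socket location;
osd.0 is just an example id:

ceph daemon osd.0 config get bluestore_allocator
ceph daemon osd.0 config get bluefs_allocator
)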



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-03 Thread Igor Podlesny
On Fri, 3 May 2019 at 13:38, Denny Fuchs  wrote:
[...]
> If I understand correctly, I should try to set the bitmap allocator

That's one of the options I mentioned.

Another one was to try using jemalloc (re-read my emails).

> [osd]
> ...
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap
>
> I would restart the nodes one by one and see, what happens.

Right.



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Fri, 3 May 2019 at 05:12, Mark Nelson  wrote:
[...]
> > -- https://www.kernel.org/doc/Documentation/vm/transhuge.txt
>
> Why are you quoting the description for the madvise setting when that's
> clearly not what was set in the case I just showed you?

Similarly, why are you telling us it must be due to THPs if:

1) by default they're not used unless madvise()'ed, and
2) neither jemalloc nor tcmalloc would madvise() by default anyway?

[...]
> previously malloc'ed. Because the machine used transparent huge pages,

Is it from DigitalOcean's blog? I read it quite a long time ago. And it was
written long ago, referring to some ancient release of jemalloc and, what's
more important, to a system that had THP activated.

-- But I've shown you that using THP is not the kernel's default setting --
unless madvise() tells the kernel to.
Your example with CentOS isn't relevant, because the person who started this
thread uses Debian (Proxmox, to be more precise).
Moreover, something tells me that even in default CentOS installs
THPs are also set to madvise()-only.

> I'm not going to argue with you about this.

I don't argue with you.
I'm merely showing you that instead of making baseless claims (or wild
guesswork), it's worth checking the facts first.
Checking whether THPs are used at all (although it might be not due to OSDs
but, say, KVM) is as simple as looking into /proc/meminfo.
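
E.g. (plain shell; <pid> is a placeholder for a real OSD PID):

cat /sys/kernel/mm/transparent_hugepage/enabled
grep AnonHugePages /proc/meminfo
awk '/AnonHugePages/ { s += $2 } END { print s " kB" }' /proc/<pid>/smaps

If AnonHugePages stays at 0, THP simply isn't in the picture.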

> Test it if you want or don't.

I didn't start this thread. ;)
As for me -- I've played enough with all kinds of allocators and THP settings. :)



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Fri, 3 May 2019 at 01:29, Mark Nelson  wrote:
> On 5/2/19 11:46 AM, Igor Podlesny wrote:
> > On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
> > [...]
> >> FWIW, if you still have an OSD up with tcmalloc, it's probably worth
> >> looking at the heap stats to see how much memory tcmalloc thinks it's
> >> allocated vs how much RSS memory is being used by the process.  It's
> >> quite possible that there is memory that has been unmapped but that the
> >> kernel can't (or has decided not yet to) reclaim.
> >> Transparent huge pages can potentially have an effect here both with 
> >> tcmalloc and with
> >> jemalloc so it's not certain that switching the allocator will fix it 
> >> entirely.
> > Most likely wrong. -- Default kernel's settings in regards of THP are 
> > "madvise".
> > None of tcmalloc or jemalloc would madvise() to make it happen.
> > With fresh enough jemalloc you could have it, but it needs special
> > malloc.conf'ing.
>
>
>  From one of our centos nodes with no special actions taken to change
> THP settings (though it's possible it was inherited from something else):
>
>
> $ cat /etc/redhat-release
> CentOS Linux release 7.5.1804 (Core)
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never

"madvise" will enter direct reclaim like "always" but only for regions
that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

-- https://www.kernel.org/doc/Documentation/vm/transhuge.txt

> And regarding madvise and alternate memory allocators:
> https:
[...]

Did you ever read any of it?

Here's what one of those links says:

"By default jemalloc does not use huge pages for heap memory (there is
opt.metadata_thp which uses THP for internal metadata though)"

(and I've said
> > None of tcmalloc or jemalloc would madvise() to make it happen.
> > With fresh enough jemalloc you could have it, but it needs special
> > malloc.conf'ing.
before)



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]
> FWIW, if you still have an OSD up with tcmalloc, it's probably worth
> looking at the heap stats to see how much memory tcmalloc thinks it's
> allocated vs how much RSS memory is being used by the process.  It's
> quite possible that there is memory that has been unmapped but that the
> kernel can't (or has decided not yet to) reclaim.

> Transparent huge pages can potentially have an effect here both with tcmalloc 
> and with
> jemalloc so it's not certain that switching the allocator will fix it 
> entirely.

Most likely wrong. -- The kernel's default setting with regard to THP is "madvise".
Neither tcmalloc nor jemalloc would madvise() to make it happen.
With a fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.

> First I would just get the heap stats and then after that I would be
> very curious if disabling transparent huge pages helps. Alternately,
> it's always possible it's a memory leak. :D

RedHat can do better (hopefully). ;-P



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:
> On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
> [..]
> > Any suggestions ?
>
> -- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it's expected to have a smaller memory footprint
over time:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs.
Igor (Fedotov), could you please elaborate on this matter?



Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 01:58, Dan van der Ster  wrote:
> On Tue, Apr 30, 2019 at 8:26 PM Igor Podlesny  wrote:
[...]
> All of the clients need to be luminous or newer:
>
> # ceph osd set-require-min-compat-client luminous
>
> You need to enable the module:
>
> # ceph mgr module enable balancer

(Enabled by default according to the docs.)
>
> You probably don't want it to run 24/7:
>
> # ceph config-key set mgr/balancer/begin_time 0800
> # ceph config-key set mgr/balancer/end_time 1800

oh, that's handy.

> The default rate at which it balances things is a bit too high for my taste:
>
> # ceph config-key set mgr/balancer/max_misplaced 0.005
> # ceph config-key set mgr/balancer/upmap_max_iterations 2
>
> (Those above are optional... YMMV)

Yep, but good to know!
>
> Now fail the active mgr so that the new one reads those new options above.
>
> # ceph mgr fail 
>
> Enable the upmap mode:
>
> # ceph balancer mode upmap
>
> Test it once to see that it works at all:
>
> # ceph balancer optimize myplan
> # ceph balancer show myplan
> # ceph balancer reset
>
> (any errors, start debugging -- use debug_mgr = 4/5 and check the
> active mgr's log for the balancer details.)
>
> # ceph balancer on
>
> Now it'll start moving the PGs around until things are quite well balanced.
> In our clusters that process takes a week or two... it depends on
> cluster size, numpgs, etc...
>
> Hope that helps!

Thank you :)



Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 01:26, Igor Podlesny  wrote:
> On Wed, 1 May 2019 at 01:01, Dan van der Ster  wrote:
> >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on 
> >> > our clusters.
> >>
> >> mode upmap ?
> >
> > yes, mgr balancer, mode upmap.

Also -- do your CEPH clusters have all pools under a single root hierarchy
(like "default"), or are there some pools that use non-default roots?

Looking through the docs I didn't find a way to narrow the balancer's scope
down to specific pool(s), although personally I would prefer it to
operate on a small set of them.



Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 01:26, Jack  wrote:
> If those pools are useless, you can:
> - drop them

As Dan pointed out, it's unlikely to have any effect.
The thing is, imbalance is a "property" of a pool -- I'd suppose most often
of the most loaded one (or of a few of the most loaded ones).
Pools that aren't used much don't affect it.



Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 01:01, Dan van der Ster  wrote:
>> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on 
>> > our clusters.
>>
>> mode upmap ?
>
> yes, mgr balancer, mode upmap.

I see. Was it a matter of just:

1) ceph balancer mode upmap
2) ceph balancer on

or were there any other steps?



Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 00:24, Dan van der Ster  wrote:
>
> The upmap balancer in v12.2.12 works really well... Perfectly uniform on our 
> clusters.
>
> .. Dan

mode upmap ?



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-04-30 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]
> Any suggestions ?

-- Try a different allocator.

In Proxmox 4 they had this in /etc/default/ceph by default: {{

## use jemalloc instead of tcmalloc
#
# jemalloc is generally faster for small IO workloads and when
# ceph-osd is backed by SSDs.  However, memory usage is usually
# higher by 200-300mb.
#
#LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1

}},

so you may try using it the same way; the package is still there in
Proxmox 5:

  libjemalloc1: /usr/lib/x86_64-linux-gnu/libjemalloc.so.1

No one can tell for sure if it would help, but jemalloc "...

is a general purpose malloc(3) implementation that emphasizes
fragmentation avoidance and scalable concurrency support.

..." -- http://jemalloc.net/

I've noticed that OSDs with jemalloc tend to have a much bigger VSZ over time,
but RSS should be fine.
Looking forward to hearing about your experience with it.
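
A rough sketch of what that looks like on a Proxmox 5 / Debian node --
assuming the ceph-osd systemd unit reads /etc/default/ceph via
EnvironmentFile (check with `systemctl cat ceph-osd@.service` first);
the OSD id is an example:

apt-get install libjemalloc1
echo 'LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1' >> /etc/default/ceph
systemctl restart ceph-osd@0
# confirm the library really got preloaded:
grep jemalloc /proc/$(pidof -s ceph-osd)/maps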



Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-30 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 19:11, Adrien Gillard wrote:

> On Tue, Apr 30, 2019 at 10:06 AM Igor Podlesny  wrote:
> >
> > On Tue, 30 Apr 2019 at 04:13, Adrien Gillard wrote:
> > > I would add that the use of cache tiering, though still possible, is
> not recommended
> >
> > It lacks references. CEPH docs I gave links to didn't say so.
>
> The cache tiering documention mentions that (your link refers to it) :
>
> http://docs.ceph.com/docs/nautilus/rados/operations/cache-tiering/#a-word-of-caution


I saw this, and didn't find "not recommended" or anything alike.

>
> There are some threads on the mailing list referring to the subject as
> well (by David Turner or
> Christian Balzer, for instance)


Thanks, I will try to find them.


Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

2019-04-30 Thread Igor Podlesny
On Mon, 15 Apr 2019 at 19:40, Wido den Hollander  wrote:
>
> Hi,
>
> With the release of 12.2.12 the bitmap allocator for BlueStore is now
> available under Mimic and Luminous.
>
> [osd]
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap

Hi!

Have you tried this? :)



Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-30 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 04:13, Adrien Gillard  wrote:
> I would add that the use of cache tiering, though still possible, is not 
> recommended

It lacks references. The CEPH docs I gave links to didn't say so.

> comes with its own challenges.

It may be challenging for some not to over-quote when replying, but I don't
think that holds true for everyone.



Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-29 Thread Igor Podlesny
On Mon, 29 Apr 2019 at 16:19, Rainer Krienke  wrote:
[...]
> - Do I still (Nautilus) need two pools for EC-based RBD images, one EC
> data pool and a second replicated pool for metadata?

The answer is given at
http://docs.ceph.com/docs/nautilus/rados/operations/erasure-code/#erasure-coding-with-overwrites
"...
Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and
their metadata in a replicated pool
..."

Another option is using tiered pools, especially when you can dedicate
fast OSDs to that:

http://docs.ceph.com/docs/nautilus/rados/operations/erasure-code/#erasure-coded-pool-and-cache-tiering



Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-29 Thread Igor Podlesny
On Mon, 29 Apr 2019 at 16:37, Burkhard Linke wrote:
> On 4/29/19 11:19 AM, Rainer Krienke wrote:
[...]
> > - I also thought about the different k+m settings for a EC pool, for
> > example k=4, m=2 compared to k=8 and m=2. Both settings allow for two
> > OSDs to fail without any data loss, but I asked myself which of the two
> > settings would be more performant? On one hand distributing data to more
> > OSDs allows a higher parallel access to the data, that should result in
> > a faster access. On the other hand each OSD has a latency until
> > it can deliver its data shard. So is there a recommendation which of my
> > two k+m examples should be preferred?
>
> I cannot comment on speed (interesting question, since we are about to

In theory, the more stripes you have, the faster it works overall (the IO
load is distributed among a bigger number of hosts).



Re: [ceph-users] Is it possible to get list of all the PGs assigned to an OSD?

2019-04-29 Thread Igor Podlesny
On Mon, 29 Apr 2019 at 15:13, Eugen Block  wrote:
>
> Sure there is:
>
> ceph pg ls-by-osd 

Thank you Eugen, I overlooked it somehow :)



[ceph-users] Is it possible to get list of all the PGs assigned to an OSD?

2019-04-29 Thread Igor Podlesny
Or is there no direct way to accomplish that?
What workarounds can be used then?
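
(One workaround sketch, in case nothing direct exists: grep the acting sets
out of a full PG dump. OSD id 12 is just an example, and the column position
of ACTING may differ between releases:

ceph pg dump pgs_brief 2>/dev/null | \
    awk -v osd=12 '$5 ~ ("(^|[^0-9])" osd "([^0-9]|$)") { print $1 }'
)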



[ceph-users] Does ceph osd reweight-by-xxx work correctly if OSDs aren't of same size?

2019-04-29 Thread Igor Podlesny
Say, some nodes have OSDs that are 1.5 times bigger than the OSDs on other
nodes, while the weights of all the nodes in question are almost equal
(due to them having different numbers of OSDs, obviously).
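
(A dry run like the one below would at least show what it intends to change;
the numbers are just example threshold / max_change / max_osds values:

ceph osd test-reweight-by-utilization 110 0.05 10
)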



Re: [ceph-users] How does CEPH calculates PGs per OSD for erasure coded (EC) pools?

2019-04-28 Thread Igor Podlesny
On Sun, 28 Apr 2019 at 16:14, Paul Emmerich  wrote:
> Use k+m for PG calculation, that value also shows up as "erasure size"
> in ceph osd pool ls detail

So does it mean that for PG calculation those 2 pools are equivalent:

1) EC(4, 2)
2) replicated, size 6

? Sounds weird, to be honest. Replicated with size 6 means each logical
piece of data is stored 6 times: what needed a single PG now requires 6 PG copies.
And with EC(4, 2) there's still only a 1.5x overhead in terms of raw
occupied space -- so how come the PG calculation needs adjusting
by 6 instead of 1.5 then?

Also, why does the CEPH documentation say "It is equivalent to a
replicated pool of size __two__" when describing the EC(2, 1) example?



[ceph-users] How does CEPH calculates PGs per OSD for erasure coded (EC) pools?

2019-04-28 Thread Igor Podlesny
For replicated pools (w/o rounding to the nearest power of two) the overall
number of PGs is calculated as:

Pools_PGs = 100 * (OSDs / Pool_Size),

where
100 is the target number of PGs per single OSD related to that pool, and
Pool_Size is the factor showing how much raw storage would in fact be
used to store one logical unit of data.

By analogy I suppose that with EC pools the corresponding Pool_Size
can be calculated as:

Raw_Storage_Use / Logical_Storage_Use

or, using EC semantics, (k + m) / k. And for EC (k=2, m=1) it gives:

Raw_Storage_Use = 3
Logical_Storage_Use = 2

-- Hence, Pool_Size should be 1.5.
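
To make that concrete with a made-up example of 30 OSDs backing the pool:

replicated, size 3:            Pools_PGs = 100 * (30 / 3)   = 1000
EC(2, 1), Pool_Size = 1.5:     Pools_PGs = 100 * (30 / 1.5) = 2000
EC(2, 1), Pool_Size = k+m = 3: Pools_PGs = 100 * (30 / 3)   = 1000

-- the two readings differ by a factor of two, hence the question below.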

OTOH, the CEPH documentation says the following about the same EC pool
(underline is mine):

    "It is equivalent to a replicated pool of size __two__ but
    requires 1.5TB instead of 2TB to store 1TB of data"

So how does CEPH calculate the PG distribution per OSD for it?
Using (k + m) / k? Or just k? Or differently altogether?



[ceph-users] Were CephFS lock-ups fixed when it runs on nodes with OSDs?

2019-04-20 Thread Igor Podlesny
I remember seeing reports about that, but it's been a while now.
Can anyone tell?



Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Igor Podlesny
On Tue, 16 Apr 2019 at 17:05, Paul Emmerich  wrote:
>
> No, the problem is that a storage system should never tell a client
> that it has written data if it cannot guarantee that the data is still
> there if one device fails.
[...]

Ah, now I got your point.

Anyway, it should be the user's choice (with a warning, probably, but
still). I can easily
(though with a heavy heart) recall what happened, twice and not too long ago,
when someone decided
"we know better than the users^W pilots do". Too many similar
decisions were (and still are)
popping up in the IT industry, too. Of course, always "for good reasons" --
who'd have a doubt(?)...

Oh, and BTW -- it is not possible to change EC(2,1)'s 3/3 to 3/2 in
Luminous, is it?



Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Igor Podlesny
On Tue, 16 Apr 2019 at 16:52, Paul Emmerich  wrote:
> On Tue, Apr 16, 2019 at 11:50 AM Igor Podlesny  wrote:
> > On Tue, 16 Apr 2019 at 14:46, Paul Emmerich  wrote:
[...]
> > Looked at it, didn't see any explanation of your point of view. If
> > there're 2 active data instances
> > (and 3rd is missing) how is it different to replicated pools with 3/2 
> > config(?)
>
> each of these "copies" has only half the data

I'm still not seeing how.

EC(2, 1) is conceptually RAID5 on 3 devices. You're basically saying
that if one of those 3 disks is missing
you can't safely write to the 2 others that are still in. But CEPH's EC
has no partial-update issue, does it?

Can you elaborate?



Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Igor Podlesny
On Tue, 16 Apr 2019 at 14:46, Paul Emmerich  wrote:
> Sorry, I just realized I didn't answer your original question.
[...]

No problemo. -- I figured out the answer to my own question earlier anyway.
And I actually gave a hint today:

  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034278.html

based on those findings.

> Regarding min_size: yes, you are right about a 2+1 pool being created
> with min_size 2 by default in the latest Nautilus release.
> This seems like a bug to me, I've opened a ticket here:
> http://tracker.ceph.com/issues/39307

I looked at it and didn't see any explanation of your point of view. If
there are 2 active data instances
(and the 3rd is missing), how is that different from a replicated pool with a 3/2 config(?)

[... overquoting removed ...]



Re: [ceph-users] 'Missing' capacity

2019-04-15 Thread Igor Podlesny
On Tue, 16 Apr 2019 at 06:43, Mark Schouten  wrote:
[...]
> So where is the rest of the free space? :X

It would make sense to look at the output of:

sudo ceph osd df tree



Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-12 Thread Igor Podlesny
And as to the min_size choice -- since you replied to exactly that part
of my message only.

On Sat, 13 Apr 2019 at 06:54, Paul Emmerich  wrote:
> On Fri, Apr 12, 2019 at 9:30 PM Igor Podlesny  wrote:
> > For e. g., an EC pool with default profile (2, 1) has bogus "sizing"
> > params (size=3, min_size=3).

{{
> > Min. size 3 is wrong as far as I know and it's been fixed in fresh
> > releases (but not in Luminous).
}}

I didn't give any proof when writing this, as I was more focused on the EC
pool usage calculation.
Take a look at:

  https://github.com/ceph/ceph/pull/8008

As can be seen there, the formula for min_size effectively became
min_size = k + min(1, m - 1) in March 2019.
-- That's why I said "fixed in fresh releases but not in Luminous".

Let's see what this new formula produces for k=2, m=1 (the default
and documented EC profile):

min_size = 2 + min(1, 1 - 1) = 2 + 0 = 2.

Before that change it would have been 3 instead, thus giving that 3/3 for EC(2, 1).
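
(For comparison, the same formula for a hypothetical k=4, m=2 profile gives
min_size = 4 + min(1, 2 - 1) = 4 + 1 = 5, i.e. with size=6 one lost shard
still leaves the pool writable.)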

[...]
> min_size 3 is the default for that pool, yes. That means your data
> will be unavailable if any OSD is offline.
> Reducing min_size to 2 means you are accepting writes when you cannot
> guarantee durability which will cause problems in the long run.
> See older discussions about min_size here

I would be glad to, but this is not a forum, it's a mailing list
instead, right(?) -- so the only way
to "see here" is to rely on a search engine that might have indexed the
mailing list archive. If you have
a specific URL, or at least exact keywords allowing me to find what you're
referring to, I'd gladly look at
what you're talking about.

And of course I did search before writing, and the fact that I wrote it
anyway means I didn't find
anything answering my question "here or there".



Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-12 Thread Igor Podlesny
On Sat, 13 Apr 2019 at 06:54, Paul Emmerich  wrote:
>
> Please don't use an EC pool with 2+1, that configuration makes no sense.

That's quite ironic, given that (2, 1) is the default EC profile and is,
in addition, described in the CEPH documentation.

> min_size 3 is the default for that pool, yes. That means your data
> will be unavailable if any OSD is offline.
> Reducing min_size to 2 means you are accepting writes when you cannot
> guarantee durability which will cause problems in the long run.
> See older discussions about min_size here

Well, my primary concern wasn't about min_size at all but about this: {
> > But besides that it looks like pool usage isn't calculated according
> > to EC overhead but as if it was replicated pool with size=3 as well.
}



[ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-12 Thread Igor Podlesny
E.g., an EC pool with the default profile (2, 1) has bogus "sizing"
params (size=3, min_size=3).
min_size=3 is wrong as far as I know, and it's been fixed in fresh
releases (but not in Luminous).

But besides that, it looks like pool usage isn't calculated according
to the EC overhead, but as if it were a replicated pool with size=3 as well.



[ceph-users] Does Bluestore backed OSD detect bit rot immediately when reading or only when scrubbed?

2019-04-01 Thread Igor Podlesny
It's widely known that some filesystems (well, OK -- 2 of them: ZFS and
Btrfs) detect bit rot on any read request, although of course an
admin can also initiate "whole platter" scrubbing.

Before BlueStore, CEPH could provide only "on demand" detection. I
don't take into consideration imaginary setups with Btrfs- or ZFS-backed
OSDs -- although Btrfs was supported, it couldn't be trusted due
to being too quirky, and ZFS would mean way higher overhead and resource
consumption; moreover, it conceptually doesn't fit well into CEPH's
paradigm.
So when a scrub found a mismatch, that would trigger the infamous HEALTH_ERR
state and require manual tinkering to resolve (although, in
typical cases where 3 copies of a placement group were used, it seemed
more logical to autofix it -- at least most users would make the same
choice in 99 % of occurrences).

With BlueStore I'd expect bit rot detection to happen on any read
request, as is the case with Btrfs and ZFS. But expectations can be
wrong no matter how logical they might seem, and that's why I've
decided to clear it up. Can anyone tell for sure how it works in
CEPH with BlueStore?

If it's NOT the same as with those 2 CoW FSes, and bit rot is detected
only by scrubbing, how prone to data corruption / loss would the following be:

* 2 copies pools (2/1)
* erasure coded pools (say 2, 1)

?

Let's consider a replicated pool with 2/1 where both data instances are
up to date, and then one of them is found to be corrupted. Would its csum
mismatch be enough for it to be "cured" semi-automatically with ceph pg repair?

And what would happen, and how, in case an erasure coded pool's data was
found to be damaged as well?
