[ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-10 Thread Charles Alva
Hi Ceph Users,

Is there a way to minimize RocksDB compaction events so that they don't
saturate the spinning disk's IO and get the OSD marked down for failing to
send heartbeats to its peers?

Right now we see high disk IO utilization every 20-25 minutes, when RocksDB
reaches level 4 with 67GB of data to compact.


Kind regards,

Charles Alva
Sent from Gmail Mobile


Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-10 Thread Wido den Hollander



On 4/10/19 9:07 AM, Charles Alva wrote:
> Hi Ceph Users,
> 
> Is there a way around to minimize rocksdb compacting event so that it
> won't use all the spinning disk IO utilization and avoid it being marked
> as down due to fail to send heartbeat to others?
> 
> Right now we have frequent high IO disk utilization for every 20-25
> minutes where the rocksdb reaches level 4 with 67GB data to compact.
> 

How big is the disk? RocksDB will need to compact at some point and it
seems that the HDD can't keep up.

I've seen this with many customers and in those cases we offloaded the
WAL+DB to an SSD.
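
For reference, creating an OSD laid out that way looks roughly like this (a
sketch only; the data device and the DB LV name are placeholders for your own
hardware, and with only --block.db given the WAL lives on the DB device too):

# ceph-volume lvm create --bluestore --data /dev/sdb --block.db vg_nvme/osd-sdb-db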

How big is the data drive and the DB?

Wido

> 
> Kind regards,
> 
> Charles Alva
> Sent from Gmail Mobile
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-10 Thread jesper
> On 4/10/19 9:07 AM, Charles Alva wrote:
>> Hi Ceph Users,
>>
>> Is there a way around to minimize rocksdb compacting event so that it
>> won't use all the spinning disk IO utilization and avoid it being marked
>> as down due to fail to send heartbeat to others?
>>
>> Right now we have frequent high IO disk utilization for every 20-25
>> minutes where the rocksdb reaches level 4 with 67GB data to compact.
>>
>
> How big is the disk? RocksDB will need to compact at some point and it
> seems that the HDD can't keep up.
>
> I've seen this with many customers and in those cases we offloaded the
> WAL+DB to an SSD.

I guess the SSD needs to be pretty durable to handle that?

Is there a "migration path" to offload this, or do we need to destroy
and re-create the OSD?

Thanks.

Jesper




Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-10 Thread Wido den Hollander



On 4/10/19 9:25 AM, jes...@krogh.cc wrote:
>> On 4/10/19 9:07 AM, Charles Alva wrote:
>>> Hi Ceph Users,
>>>
>>> Is there a way around to minimize rocksdb compacting event so that it
>>> won't use all the spinning disk IO utilization and avoid it being marked
>>> as down due to fail to send heartbeat to others?
>>>
>>> Right now we have frequent high IO disk utilization for every 20-25
>>> minutes where the rocksdb reaches level 4 with 67GB data to compact.
>>>
>>
>> How big is the disk? RocksDB will need to compact at some point and it
>> seems that the HDD can't keep up.
>>
>> I've seen this with many customers and in those cases we offloaded the
>> WAL+DB to an SSD.
> 
> Guess the SSD need to be pretty durable to handle that?
> 

Always use DC-grade SSDs, but you don't need to buy the most expensive
ones you can find. ~1.5DWPD is sufficient.

> Is there a "migration path" to offload this or is it needed to destroy
> and re-create the OSD?
> 

In the Nautilus release (and maybe Mimic) there is a tool to migrate the DB
to a different device without the need to re-create the OSD. I think it's
bluestore-dev-tool.

Wido

> Thanks.
> 
> Jesper
> 
> 


Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-10 Thread Frédéric Nass

Hi everyone,

So if the kernel is able to reclaim those pages, is there still a point 
in running the heap release on a regular basis?


Regards,
Frédéric.

Le 09/04/2019 à 19:33, Olivier Bonvalet a écrit :

Good point, thanks!

By creating memory pressure (playing with vm.min_free_kbytes), the memory
is freed by the kernel.

So I think I essentially need to update our monitoring rules to avoid
false positives.

Thanks, I'll continue reading the resources you linked.


Le mardi 09 avril 2019 à 09:30 -0500, Mark Nelson a écrit :

My understanding is that basically the kernel is either unable or
uninterested (maybe due to lack of memory pressure?) in reclaiming the
memory. It's possible you might have better behavior if you set
/sys/kernel/mm/khugepaged/max_ptes_none to a low value (maybe 0) or
maybe disable transparent huge pages entirely.
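
Concretely, that would be something like the following (runtime-only
settings, not persistent across reboots; the sysfs paths are documented in
the kernel docs linked below):

# echo 0 > /sys/kernel/mm/khugepaged/max_ptes_none
# echo never > /sys/kernel/mm/transparent_hugepage/enabled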


Some background:

https://github.com/gperftools/gperftools/issues/1073

https://blog.nelhage.com/post/transparent-hugepages/

https://www.kernel.org/doc/Documentation/vm/transhuge.txt


Mark


On 4/9/19 7:31 AM, Olivier Bonvalet wrote:

Well, Dan seems to be right :

_tune_cache_size
  target: 4294967296
heap: 6514409472
unmapped: 2267537408
  mapped: 4246872064
old cache_size: 2845396873
new cache size: 2845397085


So we have 6GB in heap, but "only" 4GB mapped.

But "ceph tell osd.* heap release" should had release that ?


Thanks,

Olivier


Le lundi 08 avril 2019 à 16:09 -0500, Mark Nelson a écrit :

One of the difficulties with the osd_memory_target work is that we can't
tune based on the RSS memory usage of the process. Ultimately it's up to
the kernel to decide to reclaim memory, and especially with transparent
huge pages it's tough to judge what the kernel is going to do even if
memory has been unmapped by the process.  Instead the autotuner looks at
how much memory has been mapped and tries to balance the caches based on
that.


In addition to Dan's advice, you might also want to enable debug
bluestore at level 5 and look for lines containing "target:" and
"cache_size:".  These will tell you the current target, the mapped
memory, unmapped memory, heap size, previous aggregate cache size, and
new aggregate cache size.  The other line will give you a break down of
how much memory was assigned to each of the bluestore caches and how
much each cache is using.  If there is a memory leak, the autotuner can
only do so much.  At some point it will reduce the caches to fit within
cache_min and leave it there.


Mark


On 4/8/19 5:18 AM, Dan van der Ster wrote:

Which OS are you using?
With CentOS we find that the heap is not always automatically
released. (You can check the heap freelist with `ceph tell osd.0 heap
stats`).
As a workaround we run this hourly:

ceph tell mon.* heap release
ceph tell osd.* heap release
ceph tell mds.* heap release

-- Dan

On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
ceph.l...@daevel.fr> wrote:

Hi,

on a Luminous 12.2.11 deployment, my BlueStore OSDs exceed the
osd_memory_target:

daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
ceph  3646 17.1 12.0 6828916 5893136 ? Ssl mars29 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph --setgroup ceph
ceph  3991 12.9 11.2 6342812 5485356 ? Ssl mars29 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser ceph --setgroup ceph
ceph  4361 16.9 11.8 6718432 5783584 ? Ssl mars29 1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser ceph --setgroup ceph
ceph  4731 19.7 12.2 6949584 5982040 ? Ssl mars29 2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser ceph --setgroup ceph
ceph  5073 16.7 11.6 6639568 5701368 ? Ssl mars29 1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser ceph --setgroup ceph
ceph  5417 14.6 11.2 6386764 5519944 ? Ssl mars29 1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser ceph --setgroup ceph
ceph  5760 16.9 12.0 6806448 5879624 ? Ssl mars29 1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser ceph --setgroup ceph
ceph  6105 16.0 11.6 6576336 5694556 ? Ssl mars29 1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser ceph --setgroup ceph

daevel-ob@ssdr712h:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          47771       45210        1643          17         917       43556
Swap:             0           0           0

# ceph daemon osd.147 config show | grep memory_target
   "osd_memory_target": "4294967296",


And there is no recovery / backfilling, the cluster is fine :

  $ ceph status
cluster:
  id: de035250-323d-4cf6-8c4b-cf0faf6296b1
  health: HEALTH_OK

services:
  mon: 5 daemons, quorum
tolriq,tsyne,olkas,lorunde,amphel
  mgr: tsyne(active), standbys: olkas, tolriq,
lorunde,
amphel
  osd: 120 osds: 116 up, 116 in

data:
  pools:   20 pool

Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-10 Thread Igor Fedotov

It's ceph-bluestore-tool.
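
For example, on Nautilus adding a DB device to an existing OSD should look
roughly like this (a sketch only; the LV name is a placeholder, and you
should check ceph-bluestore-tool --help on your release first):

# systemctl stop ceph-osd@<id>
# ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-<id> --dev-target /dev/vg_ssd/osd-<id>-db
# systemctl start ceph-osd@<id>

Optionally follow the new-db step with bluefs-bdev-migrate to move the
BlueFS data that already lives on the slow device.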

On 4/10/2019 10:27 AM, Wido den Hollander wrote:


On 4/10/19 9:25 AM, jes...@krogh.cc wrote:

On 4/10/19 9:07 AM, Charles Alva wrote:

Hi Ceph Users,

Is there a way to minimize RocksDB compaction events so that they don't
saturate the spinning disk's IO and get the OSD marked down for failing
to send heartbeats to its peers?

Right now we see high disk IO utilization every 20-25 minutes, when
RocksDB reaches level 4 with 67GB of data to compact.


How big is the disk? RocksDB will need to compact at some point and it
seems that the HDD can't keep up.

I've seen this with many customers and in those cases we offloaded the
WAL+DB to an SSD.

I guess the SSD needs to be pretty durable to handle that?


Always use DC-grade SSDs, but you don't need to buy the most expensive
ones you can find. ~1.5DWPD is sufficient.


Is there a "migration path" to offload this or is it needed to destroy
and re-create the OSD?


In the Nautilus release (and maybe Mimic) there is a tool to migrate the DB
to a different device without the need to re-create the OSD. I think it's
bluestore-dev-tool.

Wido


Thanks.

Jesper





Re: [ceph-users] Erasure Coding failure domain (again)

2019-04-10 Thread Christian Balzer


Hello,

Another thing that crossed my mind, aside from failure probabilities caused
by actual HDDs dying, is of course the little detail that most Ceph
installations will have WAL/DB (journal) on SSDs, the most typical
ratio being 1:4.
And given the current thread about compaction killing pure HDD OSDs, that's
something you may _have_ to do.

So if you get unlucky and an SSD dies, 4 OSDs are irrecoverably lost, unlike
a dead node that can be recovered.
Combine that with the background noise of HDDs failing, and things get
quite a bit scarier.

And if you have a "crap firmware of the week" situation like the ones
experienced by several people here, you're even more likely to wind up in
trouble very fast.

This is of course all something people know (or should know); I'm more
wondering how to model it to correctly assess risks.

Christian

On Wed, 3 Apr 2019 10:28:09 +0900 Christian Balzer wrote:

> On Tue, 2 Apr 2019 19:04:28 +0900 Hector Martin wrote:
> 
> > On 02/04/2019 18.27, Christian Balzer wrote:  
> > > I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
> > > pool with 1024 PGs.
> > 
> > (20 choose 2) is 190, so you're never going to have more than that many 
> > unique sets of OSDs.
> >   
> And this is why one shouldn't send mails when in a rush, w/o fully groking
> the math one was just given. 
> Thanks for setting me straight. 
> 
> > I just looked at the OSD distribution for a replica 3 pool across 48 
> > OSDs with 4096 PGs that I have and the result is reasonable. There are 
> > 3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this 
> > is a random process, due to the birthday paradox, some duplicates are 
> > expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs 
> > having 3782 unique choices seems to pass the gut feeling test. Too lazy 
> > to do the math closed form, but here's a quick simulation:
> >   
> >  >>> len(set(random.randrange(17296) for i in range(4096)))
> > 3671
> > 
> > So I'm actually slightly ahead.
> > 
> > At the numbers in my previous example (1500 OSDs, 50k pool PGs), 
> > statistically you should get something like ~3 collisions on average, so 
> > negligible.
> >   
> Sounds promising. 
> 
> > > Another thing to look at here is of course critical period and disk
> > > failure probabilities, these guys explain the logic behind their
> > > calculator, would be delighted if you could have a peek and comment.
> > > 
> > > https://www.memset.com/support/resources/raid-calculator/
> > 
> > I'll take a look tonight :)
> >   
> Thanks, a look at the Backblaze disk failure rates (picking the worst
> ones) gives a good insight into real life probabilities, too.
> https://www.backblaze.com/blog/hard-drive-stats-for-2018/
> If we go with 2%/year, that's an average failure ever 12 days.
> 
> Aside from how likely the actual failure rate is, another concern of
> course is extended periods of the cluster being unhealthy, with certain
> versions there was that "mon map will grow indefinitely" issue, other more
> subtle ones might lurk still.
> 
> Christian
> > -- 
> > Hector Martin (hec...@marcansoft.com)
> > Public Key: https://mrcn.st/pub
> >   
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Rakuten Communications
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] Erasure Coding failure domain (again)

2019-04-10 Thread Hector Martin
On 10/04/2019 18.11, Christian Balzer wrote:
> Another thing that crossed my mind aside from failure probabilities caused
> by actual HDDs dying is of course the little detail that most Ceph
> installations will have have WAL/DB (journal) on SSDs, the most typical
> ratio being 1:4. 
> And given the current thread about compaction killing pure HDD OSDs,
> something you may _have_ to do.
> 
> So if you get unlucky and a SSD dies 4 OSDs are irrecoverably lost, unlike
> a dead node that can be recovered.
> Combine that with the background noise of HDDs failing, things got just
> quite a bit scarier. 

Certainly, your failure domain should be at least host, and that changes
the math (even without considering whole-host failure).

Let's say you have 375 hosts and 4 OSDs per host, with the failure
domain correctly set to host. Same 50k pool PGs as before. Now if 3
hosts die:

50k / (375 choose 3) =~ 0.57% chance of data loss

This is equivalent to having 3 shared SSDs die.

If 3 random OSDs die in different hosts, the chances of data loss would
be 0.57% / (4^3) =~ 0.00896 % (1 in 4 chance per host that you hit the
OSD a PG actually lives in, and you need to hit all 3). This is
marginally higher than the ~ 0.00891% with uniformly distributed PGs,
because you've eliminated all sets of OSDs which share a host.
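
For what it's worth, the 0.57% figure is easy to sanity-check (assuming any
python interpreter on the box):

$ python -c "print(100.0 * 50000 / (375*374*373/6))"

which prints roughly 0.573.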


-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-10 Thread Mark Nelson

In fact the autotuner does it itself every time it tunes the cache size:


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L3630


Mark


On 4/10/19 2:53 AM, Frédéric Nass wrote:

Hi everyone,

So if the kernel is able to reclaim those pages, is there still a 
point in running the heap release on a regular basis?


Regards,
Frédéric.

Le 09/04/2019 à 19:33, Olivier Bonvalet a écrit :

Good point, thanks!

By creating memory pressure (playing with vm.min_free_kbytes), the memory
is freed by the kernel.

So I think I essentially need to update our monitoring rules to avoid
false positives.

Thanks, I'll continue reading the resources you linked.


Le mardi 09 avril 2019 à 09:30 -0500, Mark Nelson a écrit :

My understanding is that basically the kernel is either unable or
uninterested (maybe due to lack of memory pressure?) in reclaiming the
memory. It's possible you might have better behavior if you set
/sys/kernel/mm/khugepaged/max_ptes_none to a low value (maybe 0) or
maybe disable transparent huge pages entirely.


Some background:

https://github.com/gperftools/gperftools/issues/1073

https://blog.nelhage.com/post/transparent-hugepages/

https://www.kernel.org/doc/Documentation/vm/transhuge.txt


Mark


On 4/9/19 7:31 AM, Olivier Bonvalet wrote:

Well, Dan seems to be right :

_tune_cache_size
  target: 4294967296
    heap: 6514409472
    unmapped: 2267537408
  mapped: 4246872064
old cache_size: 2845396873
new cache size: 2845397085


So we have 6GB in heap, but "only" 4GB mapped.

But "ceph tell osd.* heap release" should had release that ?


Thanks,

Olivier


Le lundi 08 avril 2019 à 16:09 -0500, Mark Nelson a écrit :

One of the difficulties with the osd_memory_target work is that we can't
tune based on the RSS memory usage of the process. Ultimately it's up to
the kernel to decide to reclaim memory, and especially with transparent
huge pages it's tough to judge what the kernel is going to do even if
memory has been unmapped by the process.  Instead the autotuner looks at
how much memory has been mapped and tries to balance the caches based on
that.


In addition to Dan's advice, you might also want to enable debug
bluestore at level 5 and look for lines containing "target:" and
"cache_size:".  These will tell you the current target, the mapped
memory, unmapped memory, heap size, previous aggregate cache size, and
new aggregate cache size.  The other line will give you a break down of
how much memory was assigned to each of the bluestore caches and how
much each cache is using.  If there is a memory leak, the autotuner can
only do so much.  At some point it will reduce the caches to fit within
cache_min and leave it there.


Mark


On 4/8/19 5:18 AM, Dan van der Ster wrote:

Which OS are you using?
With CentOS we find that the heap is not always automatically
released. (You can check the heap freelist with `ceph tell osd.0 heap
stats`).
As a workaround we run this hourly:

ceph tell mon.* heap release
ceph tell osd.* heap release
ceph tell mds.* heap release

-- Dan

On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
ceph.l...@daevel.fr> wrote:

Hi,

on a Luminous 12.2.11 deployment, my BlueStore OSDs exceed the
osd_memory_target:

daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
ceph  3646 17.1 12.0 6828916 5893136 ? Ssl mars29 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph --setgroup ceph
ceph  3991 12.9 11.2 6342812 5485356 ? Ssl mars29 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser ceph --setgroup ceph
ceph  4361 16.9 11.8 6718432 5783584 ? Ssl mars29 1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser ceph --setgroup ceph
ceph  4731 19.7 12.2 6949584 5982040 ? Ssl mars29 2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser ceph --setgroup ceph
ceph  5073 16.7 11.6 6639568 5701368 ? Ssl mars29 1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser ceph --setgroup ceph
ceph  5417 14.6 11.2 6386764 5519944 ? Ssl mars29 1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser ceph --setgroup ceph
ceph  5760 16.9 12.0 6806448 5879624 ? Ssl mars29 1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser ceph --setgroup ceph
ceph  6105 16.0 11.6 6576336 5694556 ? Ssl mars29 1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser ceph --setgroup ceph

daevel-ob@ssdr712h:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          47771       45210        1643          17         917       43556
Swap:             0           0           0

# ceph daemon osd.147 config show | grep memory_target
   "osd_memory_target": "4294967296",


And there is no recovery / backfilling, the cluster is fine :

  $ ceph status
    cluster:
  id: de035250-323d-4cf6-8c4b-cf0faf6296b1
  health: HEALTH_OK

    services:
  mon: 5 daemons, quorum
tolri

Re: [ceph-users] showing active config settings

2019-04-10 Thread Janne Johansson
Den ons 10 apr. 2019 kl 13:31 skrev Eugen Block :

>
> While --show-config still shows
>
> host1:~ # ceph --show-config | grep osd_recovery_max_active
> osd_recovery_max_active = 3
>
>
> It seems as if --show-config is not really up-to-date anymore?
> Although I can execute it, the option doesn't appear in the help page
> of a Mimic and Luminous cluster. So maybe this is deprecated.
>
>
If you don't specify which daemon to talk to, it tells you what the
defaults would be for a random daemon started just now using the same
config as you have in /etc/ceph/ceph.conf.


-- 
May the most significant bit of your life be positive.


Re: [ceph-users] showing active config settings

2019-04-10 Thread Eugen Block

If you don't specify which daemon to talk to, it tells you what the
defaults would be for a random daemon started just now using the same
config as you have in /etc/ceph/ceph.conf.


I tried that, too, but the result is not correct:

host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3


Zitat von Janne Johansson :


Den ons 10 apr. 2019 kl 13:31 skrev Eugen Block :



While --show-config still shows

host1:~ # ceph --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3


It seems as if --show-config is not really up-to-date anymore?
Although I can execute it, the option doesn't appear in the help page
of a Mimic and Luminous cluster. So maybe this is deprecated.



If you don't specify which daemon to talk to, it tells you what the
defaults would be for a random daemon started just now using the same
config as you have in /etc/ceph/ceph.conf.


--
May the most significant bit of your life be positive.






Re: [ceph-users] showing active config settings

2019-04-10 Thread Janne Johansson
Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :

> > If you don't specify which daemon to talk to, it tells you what the
> > defaults would be for a random daemon started just now using the same
> > config as you have in /etc/ceph/ceph.conf.
>
> I tried that, too, but the result is not correct:
>
> host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
> osd_recovery_max_active = 3
>

I always end up using "ceph --admin-daemon
/var/run/ceph/name-of-socket-here.asok config show | grep ..." to get what
is in effect now for a certain daemon.
Needs you to be on the host of the daemon of course.
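
For example (the socket name follows the default
/var/run/ceph/<cluster>-<type>.<id>.asok pattern, and "ceph daemon osd.1 ..."
is a shortcut for the same thing when run on that host):

# ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep osd_recovery_max_active
# ceph daemon osd.1 config show | grep osd_recovery_max_active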

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] showing active config settings

2019-04-10 Thread Eugen Block

I always end up using "ceph --admin-daemon
/var/run/ceph/name-of-socket-here.asok config show | grep ..." to get what
is in effect now for a certain daemon.
Needs you to be on the host of the daemon of course.


Me too, I just wanted to try what OP reported. And after trying that,  
I'll keep it that way. ;-)



Zitat von Janne Johansson :


Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :


> If you don't specify which daemon to talk to, it tells you what the
> defaults would be for a random daemon started just now using the same
> config as you have in /etc/ceph/ceph.conf.

I tried that, too, but the result is not correct:

host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3



I always end up using "ceph --admin-daemon
/var/run/ceph/name-of-socket-here.asok config show | grep ..." to get what
is in effect now for a certain daemon.
Needs you to be on the host of the daemon of course.

--
May the most significant bit of your life be positive.






Re: [ceph-users] Glance client and RBD export checksum mismatch

2019-04-10 Thread Jason Dillaman
On Wed, Apr 10, 2019 at 1:46 AM Brayan Perera  wrote:
>
> Dear All,
>
> Ceph Version : 12.2.5-2.ge988fb6.el7
>
> We are facing an issue on glance which have backend set to ceph, when
> we try to create an instance or volume out of an image, it throws
> checksum error.
> When we use rbd export and use md5sum, value is matching with glance checksum.
>
> When we use following script, it provides same error checksum as glance.

What version of Python are you using?

> We have used below images for testing.
> 1. Failing image (checksum mismatch): ffed4088-74e1-4f22-86cb-35e7e97c377c
> 2. Passing image (checksum identical): c048f0f9-973d-4285-9397-939251c80a84
>
> Output from storage node:
>
> 1. Failing image: ffed4088-74e1-4f22-86cb-35e7e97c377c
> checksum from glance database: 34da2198ec7941174349712c6d2096d8
> [root@storage01moc ~]# python test_rbd_format.py
> ffed4088-74e1-4f22-86cb-35e7e97c377c admin
> Image size: 681181184
> checksum from ceph: b82d85ae5160a7b74f52be6b5871f596
> Remarks: checksum is different
>
> 2. Passing image: c048f0f9-973d-4285-9397-939251c80a84
> checksum from glance database: 4f977f748c9ac2989cff32732ef740ed
> [root@storage01moc ~]# python test_rbd_format.py
> c048f0f9-973d-4285-9397-939251c80a84 admin
> Image size: 1411121152
> checksum from ceph: 4f977f748c9ac2989cff32732ef740ed
> Remarks: checksum is identical
>
> Wondering whether this issue is from ceph python libs or from ceph itself.
>
> Please note that we do not have ceph pool tiering configured.
>
> Please let us know whether anyone faced similar issue and any fixes for this.
>
> test_rbd_format.py
> ===
> import rados, sys, rbd
>
> image_id = sys.argv[1]
> try:
> rados_id = sys.argv[2]
> except:
> rados_id = 'openstack'
>
>
> class ImageIterator(object):
> """
> Reads data from an RBD image, one chunk at a time.
> """
>
> def __init__(self, conn, pool, name, snapshot, store, chunk_size='8'):

Am I correct in assuming this was adapted from OpenStack code? That
8-byte "chunk" is going to be terribly inefficient to compute a CRC.
Not that it should matter, but does it still fail if you increase this
to 32KiB or 64KiB?

> self.pool = pool
> self.conn = conn
> self.name = name
> self.snapshot = snapshot
> self.chunk_size = chunk_size
> self.store = store
>
> def __iter__(self):
> try:
> with conn.open_ioctx(self.pool) as ioctx:
> with rbd.Image(ioctx, self.name,
>snapshot=self.snapshot) as image:
> img_info = image.stat()
> size = img_info['size']
> bytes_left = size
> while bytes_left > 0:
> length = min(self.chunk_size, bytes_left)
> data = image.read(size - bytes_left, length)
> bytes_left -= len(data)
> yield data
> raise StopIteration()
> except rbd.ImageNotFound:
> raise exceptions.NotFound(
> _('RBD image %s does not exist') % self.name)
>
> conn = rados.Rados(conffile='/etc/ceph/ceph.conf',rados_id=rados_id)
> conn.connect()
>
>
> with conn.open_ioctx('images') as ioctx:
> try:
> with rbd.Image(ioctx, image_id,
>snapshot='snap') as image:
> img_info = image.stat()
> print "Image size: %s " % img_info['size']
> iter, size = (ImageIterator(conn, 'images', image_id,
> 'snap', 'rbd'), img_info['size'])
> import six, hashlib
> md5sum = hashlib.md5()
> for chunk in iter:
> if isinstance(chunk, six.string_types):
> chunk = six.b(chunk)
> md5sum.update(chunk)
> md5sum = md5sum.hexdigest()
> print "checksum from ceph: " + md5sum
> except:
> raise
> ===
>
>
> Thank You !
>
> --
> Best Regards,
> Brayan Perera
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason


Re: [ceph-users] Glance client and RBD export checksum mismatch

2019-04-10 Thread Maged Mokhtar




On 10/04/2019 07:46, Brayan Perera wrote:

Dear All,

Ceph Version : 12.2.5-2.ge988fb6.el7

We are facing an issue on glance which have backend set to ceph, when
we try to create an instance or volume out of an image, it throws
checksum error.
When we use rbd export and use md5sum, value is matching with glance checksum.

When we use following script, it provides same error checksum as glance.

We have used below images for testing.
1. Failing image (checksum mismatch): ffed4088-74e1-4f22-86cb-35e7e97c377c
2. Passing image (checksum identical): c048f0f9-973d-4285-9397-939251c80a84

Output from storage node:

1. Failing image: ffed4088-74e1-4f22-86cb-35e7e97c377c
checksum from glance database: 34da2198ec7941174349712c6d2096d8
[root@storage01moc ~]# python test_rbd_format.py
ffed4088-74e1-4f22-86cb-35e7e97c377c admin
Image size: 681181184
checksum from ceph: b82d85ae5160a7b74f52be6b5871f596
Remarks: checksum is different

2. Passing image: c048f0f9-973d-4285-9397-939251c80a84
checksum from glance database: 4f977f748c9ac2989cff32732ef740ed
[root@storage01moc ~]# python test_rbd_format.py
c048f0f9-973d-4285-9397-939251c80a84 admin
Image size: 1411121152
checksum from ceph: 4f977f748c9ac2989cff32732ef740ed
Remarks: checksum is identical

Wondering whether this issue is from ceph python libs or from ceph itself.

Please note that we do not have ceph pool tiering configured.

Please let us know whether anyone faced similar issue and any fixes for this.

test_rbd_format.py
===
import rados, sys, rbd

image_id = sys.argv[1]
try:
 rados_id = sys.argv[2]
except:
 rados_id = 'openstack'


class ImageIterator(object):
 """
 Reads data from an RBD image, one chunk at a time.
 """

 def __init__(self, conn, pool, name, snapshot, store, chunk_size='8'):
 self.pool = pool
 self.conn = conn
 self.name = name
 self.snapshot = snapshot
 self.chunk_size = chunk_size
 self.store = store

 def __iter__(self):
 try:
 with conn.open_ioctx(self.pool) as ioctx:
 with rbd.Image(ioctx, self.name,
snapshot=self.snapshot) as image:
 img_info = image.stat()
 size = img_info['size']
 bytes_left = size
 while bytes_left > 0:
 length = min(self.chunk_size, bytes_left)
 data = image.read(size - bytes_left, length)
 bytes_left -= len(data)
 yield data
 raise StopIteration()
 except rbd.ImageNotFound:
 raise exceptions.NotFound(
 _('RBD image %s does not exist') % self.name)

conn = rados.Rados(conffile='/etc/ceph/ceph.conf',rados_id=rados_id)
conn.connect()


with conn.open_ioctx('images') as ioctx:
 try:
 with rbd.Image(ioctx, image_id,
snapshot='snap') as image:
 img_info = image.stat()
 print "Image size: %s " % img_info['size']
 iter, size = (ImageIterator(conn, 'images', image_id,
'snap', 'rbd'), img_info['size'])
 import six, hashlib
 md5sum = hashlib.md5()
 for chunk in iter:
 if isinstance(chunk, six.string_types):
 chunk = six.b(chunk)
 md5sum.update(chunk)
 md5sum = md5sum.hexdigest()
 print "checksum from ceph: " + md5sum
 except:
 raise
===


Thank You !



Some comments:

1) The code:
       if isinstance(chunk, six.string_types):
           chunk = six.b(chunk)
   Are there cases where the if does or does not trigger? Does six.b() do
   any encoding?


2) Can you adapt the script to calculate the checksum directly from the
   file created by rbd export, rather than reading the data itself? Does
   it make a difference? (See the sketch after these comments.)


3) The image sizes are not multiples of 1MB, which is a bit strange.

4) You are using rbd --export-format 1 (the default), correct?
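
Regarding (2), a quick way to check without touching the script, using the
failing image id and the 'snap' snapshot from your mail (pool and ids as
posted):

# rbd export images/ffed4088-74e1-4f22-86cb-35e7e97c377c@snap - | md5sum

and compare the result with the glance checksum and with the script's output.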

/Maged




[ceph-users] reshard list

2019-04-10 Thread Andrew Cassera
Hello,

I have been managing a ceph cluster running 12.2.11.  This was running
12.2.5 until the recent upgrade three months ago.  We built another cluster
running 13.2.5 and synced the data between clusters, and now we would like to
run primarily off the 13.2.5 cluster.  The data is all S3 buckets.  There
are 15 buckets with more than 1 million objects in them. I attempted to
start sharding the bucket indexes by using the following process from
the documentation.

Pulling the zonegroup

#radosgw-admin zonegroup get > zonegroup.json

Changing bucket_index_max_shards to a number other than 0 and then

#radosgw-admin zonegroup set < zonegroup.json

Update the period

This had no effect on existing buckets.  What is the methodology to enable
sharding on existing buckets?  Also, I am not able to see the reshard list;
I get the following error.

2019-04-10 10:33:05.074 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.00
2019-04-10 10:33:05.078 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.01
2019-04-10 10:33:05.082 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.02
2019-04-10 10:33:05.082 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.03
2019-04-10 10:33:05.114 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.04
2019-04-10 10:33:05.118 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.05
2019-04-10 10:33:05.118 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.06
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.07
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.08
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.09
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.10
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.11
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.12
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.13
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.14

Any suggestions

AC


Re: [ceph-users] showing active config settings

2019-04-10 Thread Paul Emmerich
To summarize this discussion:

There are two ways to change the configuration:
1. ceph config * is for permanently changing settings
2. ceph injectargs is for temporarily changing a setting until the
next restart of that daemon

* ceph config get or --show-config shows the defaults/permanent
settings, but not anything that was temporarily overridden by
injectargs
* run "ceph daemon type.id config diff" to see what a specific daemon
is currently running with (and where it got that value from)
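
With osd_recovery_max_active from the original question as the example, that
would be (note that "ceph config set" needs the Mimic+ config database; on
Luminous use ceph.conf plus injectargs):

ceph config set osd osd_recovery_max_active 4              (permanent)
ceph tell osd.* injectargs '--osd-recovery-max-active 4'   (until restart)
ceph daemon osd.0 config diff | grep -A 3 osd_recovery_max_active
    (run on the OSD's host; shows where the running value differs from the default)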


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Apr 9, 2019 at 9:00 PM solarflow99  wrote:
>
> I noticed when changing some settings, they appear to stay the same, for 
> example when trying to set this higher:
>
> ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
>
> It gives the usual warning about may need to restart, but it still has the 
> old value:
>
> # ceph --show-config | grep osd_recovery_max_active
> osd_recovery_max_active = 3
>
>
> restarting the OSDs seems fairly intrusive for every configuration change.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding failure domain (again)

2019-04-10 Thread Paul Emmerich
On Wed, Apr 10, 2019 at 11:12 AM Christian Balzer  wrote:
>
>
> Hello,
>
> Another thing that crossed my mind aside from failure probabilities caused
> by actual HDDs dying is of course the little detail that most Ceph
> installations will have have WAL/DB (journal) on SSDs, the most typical
> ratio being 1:4.

Unfortunately the ratios seen "in the wild" seem to be a lot higher.
I've seen 1:100 and 1:60, which is obviously a really bad idea. But
1:24 is also quite common.

1:12 is quite common too: 2 NVMe disks in a 24-bay chassis. I think that's
perfectly reasonable.


Paul

> And given the current thread about compaction killing pure HDD OSDs,
> something you may _have_ to do.
>
> So if you get unlucky and a SSD dies 4 OSDs are irrecoverably lost, unlike
> a dead node that can be recovered.
> Combine that with the background noise of HDDs failing, things got just
> quite a bit scarier.
>
> And if you have a "crap firmware of the week" situation like experienced
> with several people here, you're even more like to wind up in trouble very
> fast.
>
> This is of course all something people do (or should know), I'm more
> wondering how to model it to correctly asses risks.
>
> Christian
>
> On Wed, 3 Apr 2019 10:28:09 +0900 Christian Balzer wrote:
>
> > On Tue, 2 Apr 2019 19:04:28 +0900 Hector Martin wrote:
> >
> > > On 02/04/2019 18.27, Christian Balzer wrote:
> > > > I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
> > > > pool with 1024 PGs.
> > >
> > > (20 choose 2) is 190, so you're never going to have more than that many
> > > unique sets of OSDs.
> > >
> > And this is why one shouldn't send mails when in a rush, w/o fully groking
> > the math one was just given.
> > Thanks for setting me straight.
> >
> > > I just looked at the OSD distribution for a replica 3 pool across 48
> > > OSDs with 4096 PGs that I have and the result is reasonable. There are
> > > 3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this
> > > is a random process, due to the birthday paradox, some duplicates are
> > > expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs
> > > having 3782 unique choices seems to pass the gut feeling test. Too lazy
> > > to do the math closed form, but here's a quick simulation:
> > >
> > >  >>> len(set(random.randrange(17296) for i in range(4096)))
> > > 3671
> > >
> > > So I'm actually slightly ahead.
> > >
> > > At the numbers in my previous example (1500 OSDs, 50k pool PGs),
> > > statistically you should get something like ~3 collisions on average, so
> > > negligible.
> > >
> > Sounds promising.
> >
> > > > Another thing to look at here is of course critical period and disk
> > > > failure probabilities, these guys explain the logic behind their
> > > > calculator, would be delighted if you could have a peek and comment.
> > > >
> > > > https://www.memset.com/support/resources/raid-calculator/
> > >
> > > I'll take a look tonight :)
> > >
> > Thanks, a look at the Backblaze disk failure rates (picking the worst
> > ones) gives a good insight into real life probabilities, too.
> > https://www.backblaze.com/blog/hard-drive-stats-for-2018/
> > If we go with 2%/year, that's an average failure ever 12 days.
> >
> > Aside from how likely the actual failure rate is, another concern of
> > course is extended periods of the cluster being unhealthy, with certain
> > versions there was that "mon map will grow indefinitely" issue, other more
> > subtle ones might lurk still.
> >
> > Christian
> > > --
> > > Hector Martin (hec...@marcansoft.com)
> > > Public Key: https://mrcn.st/pub
> > >
> >
> >
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com Rakuten Communications
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluefs-bdev-expand experience

2019-04-10 Thread Igor Fedotov


On 4/9/2019 1:59 PM, Yury Shevchuk wrote:

Igor, thank you, Round 2 is explained now.

The main aka block aka slow device cannot be expanded in Luminous; this
functionality will be available after the upgrade to Nautilus.
WAL and DB devices can be expanded in Luminous.

Now I have recreated osd2 once again to get rid of the paradoxical
ceph osd df output and tried to test DB expansion, 40G -> 60G:

node2:/# ceph-volume lvm zap --destroy --osd-id 2
node2:/# ceph osd lost 2 --yes-i-really-mean-it
node2:/# ceph osd destroy 2 --yes-i-really-mean-it
node2:/# lvcreate -L1G -n osd2wal vg0
node2:/# lvcreate -L40G -n osd2db vg0
node2:/# lvcreate -L400G -n osd2 vg0
node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 
--block.db vg0/osd2db --block.wal vg0/osd2wal

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
  0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
  1   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
  3   hdd 0.227390 0B  0B 0B00   0
  2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
 TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

node2:/# lvextend -L+20G /dev/vg0/osd2db
   Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 extents) to 
60.00 GiB (15360 extents).
   Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
  slot 0 /var/lib/ceph/osd/ceph-2//block.wal
  slot 1 /var/lib/ceph/osd/ceph-2//block.db
  slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0xf : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0xf
1 : size label updated to 64424509440

node2:/# ceph-bluestore-tool show-label --dev /dev/vg0/osd2db | grep size
 "size": 64424509440,

The label updated correctly, but ceph osd df has not changed.
I expected to see 391 + 20 = 411GiB in the AVAIL column, but it stays at 391:

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
  0   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
  1   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
  3   hdd 0.227390 0B  0B 0B00   0
  2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
 TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

I have restarted monitors on all three nodes, 391GiB stays intact.

OK, but I used bluefs-bdev-expand on a running OSD... probably not good,
it seems to fork by opening bluefs directly... trying once again:

node2:/# systemctl stop ceph-osd@2

node2:/# lvextend -L+20G /dev/vg0/osd2db
   Size of logical volume vg0/osd2db changed from 60.00 GiB (15360 extents) to 
80.00 GiB (20480 extents).
   Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
  slot 0 /var/lib/ceph/osd/ceph-2//block.wal
  slot 1 /var/lib/ceph/osd/ceph-2//block.db
  slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0x14 : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0x14
1 : size label updated to 85899345920

node2:/# systemctl start ceph-osd@2
node2:/# systemctl restart ceph-mon@pier42

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
  0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
  1   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
  3   hdd 0.227390 0B  0B 0B00   0
  2   hdd 0.22739  1.0 400GiB 9.50GiB 391GiB 2.37 0.72   0
 TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

Something is wrong.  Maybe I was wrong to expect the DB change to appear
in the AVAIL column?  From the BlueStore description I understood that db
and slow should sum up, no?


It was a while ago that db and slow were summed to provide the total store
size. In the latest Luminous releases that's not true anymore. Ceph uses
the slow device space to report SIZE/AVAIL only. There is some adjustment
for the BlueFS part residing on the slow device, but the DB device is
definitely left out of the calculation here for now.


You can also note that the reported SIZE for osd.2 is 400GiB in your case,
which is exactly in line with the slow device capacity.  Hence no DB involved.
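
If you want to see how much of the DB device is actually in use, the bluefs
counters in the OSD's perf dump show it (a rough sketch; run on the OSD's
host):

# ceph daemon osd.2 perf dump | grep -E '"db_(total|used)_bytes"'

db_total_bytes / db_used_bytes cover the DB device, and slow_used_bytes shows
any BlueFS data that has spilled over onto the slow device.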




Thanks for your help,


-- Yury

On Mon, Apr 08, 2019 at 10:17:24PM +0300, Igor Fedotov wrote:

Hi Yuri,

both issues from Round 2 relate to unsupported expansion for main device.

In fact it doesn't work and silently bypasses the operation in you case.

Please try with a different device...


Also I've just submitted a PR for mimic to indicate t

Re: [ceph-users] Erasure Coding failure domain (again)

2019-04-10 Thread Christian Balzer


Hello,

On Wed, 10 Apr 2019 20:09:58 +0200 Paul Emmerich wrote:

> On Wed, Apr 10, 2019 at 11:12 AM Christian Balzer  wrote:
> >
> >
> > Hello,
> >
> > Another thing that crossed my mind aside from failure probabilities caused
> > by actual HDDs dying is of course the little detail that most Ceph
> > installations will have have WAL/DB (journal) on SSDs, the most typical
> > ratio being 1:4.  
> 
> Unfortunately the ratios seen "in the wild" seems to be a lot higher.
> I've seen 1:100 and 1:60 which is a obviously a really bad idea. But
> 1:24 is also quite common.
> 
> 1:12 is quite common: 2 NVMe disks in 24 bay chassis. I think that's
> perfectly reasonable.
>
Given the numbers Hector provided in the mail just after this one (thanks
for that!), I'd be even less inclined to go that high.
The time for recovering 12 large OSDs (w/o service impact) is going to be
_significant_, increasing the likelihood of something else (2 somethings)
going bang in the mean time.
Cluster size of course plays a role here, too.

The highest ratio I ever considered was this baby with 6 NVMes and it felt
risky on a pure gut level:
https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR60N.cfm

Christian
 
> 
> Paul
> 
> > And given the current thread about compaction killing pure HDD OSDs,
> > something you may _have_ to do.
> >
> > So if you get unlucky and a SSD dies 4 OSDs are irrecoverably lost, unlike
> > a dead node that can be recovered.
> > Combine that with the background noise of HDDs failing, things got just
> > quite a bit scarier.
> >
> > And if you have a "crap firmware of the week" situation like experienced
> > with several people here, you're even more like to wind up in trouble very
> > fast.
> >
> > This is of course all something people do (or should know), I'm more
> > wondering how to model it to correctly asses risks.
> >
> > Christian
> >
> > On Wed, 3 Apr 2019 10:28:09 +0900 Christian Balzer wrote:
> >  
> > > On Tue, 2 Apr 2019 19:04:28 +0900 Hector Martin wrote:
> > >  
> > > > On 02/04/2019 18.27, Christian Balzer wrote:  
> > > > > I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a 
> > > > > replica 2
> > > > > pool with 1024 PGs.  
> > > >
> > > > (20 choose 2) is 190, so you're never going to have more than that many
> > > > unique sets of OSDs.
> > > >  
> > > And this is why one shouldn't send mails when in a rush, w/o fully groking
> > > the math one was just given.
> > > Thanks for setting me straight.
> > >  
> > > > I just looked at the OSD distribution for a replica 3 pool across 48
> > > > OSDs with 4096 PGs that I have and the result is reasonable. There are
> > > > 3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this
> > > > is a random process, due to the birthday paradox, some duplicates are
> > > > expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs
> > > > having 3782 unique choices seems to pass the gut feeling test. Too lazy
> > > > to do the math closed form, but here's a quick simulation:
> > > >  
> > > >  >>> len(set(random.randrange(17296) for i in range(4096)))  
> > > > 3671
> > > >
> > > > So I'm actually slightly ahead.
> > > >
> > > > At the numbers in my previous example (1500 OSDs, 50k pool PGs),
> > > > statistically you should get something like ~3 collisions on average, so
> > > > negligible.
> > > >  
> > > Sounds promising.
> > >  
> > > > > Another thing to look at here is of course critical period and disk
> > > > > failure probabilities, these guys explain the logic behind their
> > > > > calculator, would be delighted if you could have a peek and comment.
> > > > >
> > > > > https://www.memset.com/support/resources/raid-calculator/  
> > > >
> > > > I'll take a look tonight :)
> > > >  
> > > Thanks, a look at the Backblaze disk failure rates (picking the worst
> > > ones) gives a good insight into real life probabilities, too.
> > > https://www.backblaze.com/blog/hard-drive-stats-for-2018/
> > > If we go with 2%/year, that's an average failure ever 12 days.
> > >
> > > Aside from how likely the actual failure rate is, another concern of
> > > course is extended periods of the cluster being unhealthy, with certain
> > > versions there was that "mon map will grow indefinitely" issue, other more
> > > subtle ones might lurk still.
> > >
> > > Christian  
> > > > --
> > > > Hector Martin (hec...@marcansoft.com)
> > > > Public Key: https://mrcn.st/pub
> > > >  
> > >
> > >
> > > --
> > > Christian BalzerNetwork/Systems Engineer
> > > ch...@gol.com Rakuten Communications
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >  
> >
> >
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Rakuten Communications
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://li

Re: [ceph-users] reshard list

2019-04-10 Thread Konstantin Shalygin

Hello,

I have been managing a ceph cluster running 12.2.11.  This was running
12.2.5 until the recent upgrade three months ago.  We built another cluster
running 13.2.5 and synced the data between clusters, and now we would like to
run primarily off the 13.2.5 cluster.  The data is all S3 buckets.  There
are 15 buckets with more than 1 million objects in them. I attempted to
start sharding the bucket indexes by using the following process from
the documentation.

Pulling the zonegroup

#radosgw-admin zonegroup get > zonegroup.json

Changing bucket_index_max_shards to a number other than 0 and then

#radosgw-admin zonegroup set < zonegroup.json

Update the period

This had no effect on existing buckets.  What is the methodology to enable
sharding on existing buckets?  Also, I am not able to see the reshard list;
I get the following error.

2019-04-10 10:33:05.074 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.00
2019-04-10 10:33:05.078 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.01
2019-04-10 10:33:05.082 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.02
2019-04-10 10:33:05.082 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.03
2019-04-10 10:33:05.114 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.04
2019-04-10 10:33:05.118 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.05
2019-04-10 10:33:05.118 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.06
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.07
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.08
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.09
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.10
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.11
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.12
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.13
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.14

Any suggestions
Andrew, RGW dynamic resharding is enabled via `rgw_dynamic_resharding`
and controlled by `rgw_max_objs_per_shard`.


Or you may reshard a bucket by hand via `radosgw-admin reshard add ...`.
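
For an existing bucket on 12.2.11 that would look roughly like this (bucket
name and shard count below are placeholders):

# radosgw-admin reshard add --bucket=<bucket> --num-shards=<shards>
# radosgw-admin reshard list
# radosgw-admin reshard process
# radosgw-admin reshard status --bucket=<bucket>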



k



Re: [ceph-users] Glance client and RBD export checksum mismatch

2019-04-10 Thread Brayan Perera
Dear Jason,


Thanks for the reply.

We are using Python 2.7.5.

Yes, the script is based on OpenStack code.

As suggested, we have tried chunk_size 32 and 64, and both give the same
incorrect checksum value.

We also tried copying the same image to a different pool, which resulted in
the same incorrect checksum.


Thanks & Regards,
Brayan

On Wed, Apr 10, 2019 at 6:21 PM Jason Dillaman  wrote:
>
> On Wed, Apr 10, 2019 at 1:46 AM Brayan Perera  wrote:
> >
> > Dear All,
> >
> > Ceph Version : 12.2.5-2.ge988fb6.el7
> >
> > We are facing an issue on glance which have backend set to ceph, when
> > we try to create an instance or volume out of an image, it throws
> > checksum error.
> > When we use rbd export and use md5sum, value is matching with glance 
> > checksum.
> >
> > When we use following script, it provides same error checksum as glance.
>
> What version of Python are you using?
>
> > We have used below images for testing.
> > 1. Failing image (checksum mismatch): ffed4088-74e1-4f22-86cb-35e7e97c377c
> > 2. Passing image (checksum identical): c048f0f9-973d-4285-9397-939251c80a84
> >
> > Output from storage node:
> >
> > 1. Failing image: ffed4088-74e1-4f22-86cb-35e7e97c377c
> > checksum from glance database: 34da2198ec7941174349712c6d2096d8
> > [root@storage01moc ~]# python test_rbd_format.py
> > ffed4088-74e1-4f22-86cb-35e7e97c377c admin
> > Image size: 681181184
> > checksum from ceph: b82d85ae5160a7b74f52be6b5871f596
> > Remarks: checksum is different
> >
> > 2. Passing image: c048f0f9-973d-4285-9397-939251c80a84
> > checksum from glance database: 4f977f748c9ac2989cff32732ef740ed
> > [root@storage01moc ~]# python test_rbd_format.py
> > c048f0f9-973d-4285-9397-939251c80a84 admin
> > Image size: 1411121152
> > checksum from ceph: 4f977f748c9ac2989cff32732ef740ed
> > Remarks: checksum is identical
> >
> > Wondering whether this issue is from ceph python libs or from ceph itself.
> >
> > Please note that we do not have ceph pool tiering configured.
> >
> > Please let us know whether anyone faced similar issue and any fixes for 
> > this.
> >
> > test_rbd_format.py
> > ===
> > import rados, sys, rbd
> >
> > image_id = sys.argv[1]
> > try:
> > rados_id = sys.argv[2]
> > except:
> > rados_id = 'openstack'
> >
> >
> > class ImageIterator(object):
> > """
> > Reads data from an RBD image, one chunk at a time.
> > """
> >
> > def __init__(self, conn, pool, name, snapshot, store, chunk_size='8'):
>
> Am I correct in assuming this was adapted from OpenStack code? That
> 8-byte "chunk" is going to be terribly inefficient to compute a CRC.
> Not that it should matter, but does it still fail if you increase this
> to 32KiB or 64KiB?
>
> > self.pool = pool
> > self.conn = conn
> > self.name = name
> > self.snapshot = snapshot
> > self.chunk_size = chunk_size
> > self.store = store
> >
> > def __iter__(self):
> > try:
> > with conn.open_ioctx(self.pool) as ioctx:
> > with rbd.Image(ioctx, self.name,
> >snapshot=self.snapshot) as image:
> > img_info = image.stat()
> > size = img_info['size']
> > bytes_left = size
> > while bytes_left > 0:
> > length = min(self.chunk_size, bytes_left)
> > data = image.read(size - bytes_left, length)
> > bytes_left -= len(data)
> > yield data
> > raise StopIteration()
> > except rbd.ImageNotFound:
> > raise exceptions.NotFound(
> > _('RBD image %s does not exist') % self.name)
> >
> > conn = rados.Rados(conffile='/etc/ceph/ceph.conf',rados_id=rados_id)
> > conn.connect()
> >
> >
> > with conn.open_ioctx('images') as ioctx:
> > try:
> > with rbd.Image(ioctx, image_id,
> >snapshot='snap') as image:
> > img_info = image.stat()
> > print "Image size: %s " % img_info['size']
> > iter, size = (ImageIterator(conn, 'images', image_id,
> > 'snap', 'rbd'), img_info['size'])
> > import six, hashlib
> > md5sum = hashlib.md5()
> > for chunk in iter:
> > if isinstance(chunk, six.string_types):
> > chunk = six.b(chunk)
> > md5sum.update(chunk)
> > md5sum = md5sum.hexdigest()
> > print "checksum from ceph: " + md5sum
> > except:
> > raise
> > ===
> >
> >
> > Thank You !
> >
> > --
> > Best Regards,
> > Brayan Perera
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason



--
Best Regards,
Brayan Perera