Re: [ceph-users] bluefs-bdev-expand experience

2019-04-12 Thread Igor Fedotov



On 4/11/2019 11:23 PM, Yury Shevchuk wrote:

Hi Igor!

I have upgraded from Luminous to Nautilus and now slow device
expansion does indeed work.  The steps are shown below to wrap up the
topic.

node2# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.22739  1.0     233 GiB  91 GiB  90 GiB 208 MiB 816 MiB 142 GiB 38.92 1.04 128     up
 1   hdd 0.22739  1.0     233 GiB  91 GiB  90 GiB 200 MiB 824 MiB 142 GiB 38.92 1.04 128     up
 3   hdd 0.22739  0           0 B     0 B     0 B     0 B     0 B     0 B     0    0   0   down
 2   hdd 0.22739  1.0     481 GiB 172 GiB  90 GiB 201 MiB 823 MiB 309 GiB 35.70 0.96 128     up
                   TOTAL 947 GiB 353 GiB 269 GiB 610 MiB 2.4 GiB 594 GiB 37.28
MIN/MAX VAR: 0.96/1.04  STDDEV: 1.62

node2# lvextend -L+50G /dev/vg0/osd2
   Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 450.00 GiB (115200 extents).
   Logical volume vg0/osd2 successfully resized.

node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
2019-04-11 22:28:00.240 7f2e24e190c0 -1 bluestore(/var/lib/ceph/osd/ceph-2) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-2/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
...
*** Caught signal (Aborted) **
[two pages of stack dump stripped]

My mistake in the first place: I again tried to expand an OSD that had not been stopped.

node2# systemctl stop ceph-osd.target

node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
0 : device size 0x4000 : own 0x[1000~3000] = 0x3000 : using 0x8ff000
1 : device size 0x144000 : own 0x[2000~143fffe000] = 0x143fffe000 : using 0x24dfe000
2 : device size 0x708000 : own 0x[30~4] = 0x4 : using 0x0
Expanding...
2 : expanding  from 0x64 to 0x708000
2 : size label updated to 483183820800

node2# ceph-bluestore-tool show-label --dev /dev/vg0/osd2 | grep size
 "size": 483183820800,

node2# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.22739  1.0     233 GiB  91 GiB  90 GiB 208 MiB 816 MiB 142 GiB 38.92 1.10 128     up
 1   hdd 0.22739  1.0     233 GiB  91 GiB  90 GiB 200 MiB 824 MiB 142 GiB 38.92 1.10 128     up
 3   hdd 0.22739  0           0 B     0 B     0 B     0 B     0 B     0 B     0    0   0   down
 2   hdd 0.22739  1.0     531 GiB 172 GiB  90 GiB 185 MiB 839 MiB 359 GiB 32.33 0.91 128     up
                   TOTAL 997 GiB 353 GiB 269 GiB 593 MiB 2.4 GiB 644 GiB 35.41
MIN/MAX VAR: 0.91/1.10  STDDEV: 3.37

It worked: AVAIL = 594+50 = 644.  Great!
Thanks a lot for your help.
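
For the archives, the whole sequence boils down to something like this (a
sketch assuming the OSD is ceph-2 backed by the LV vg0/osd2 and stopping just
that OSD via its systemd unit; adjust IDs and LV names to your setup):

    systemctl stop ceph-osd@2        # the tool aborts on the fsid lock if the OSD is still running
    lvextend -L+50G /dev/vg0/osd2    # grow the underlying LV first
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
    ceph-bluestore-tool show-label --dev /dev/vg0/osd2 | grep size    # verify the new size label
    systemctl start ceph-osd@2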

And one more question regarding your last remark is inline below.

On Wed, Apr 10, 2019 at 09:54:35PM +0300, Igor Fedotov wrote:

On 4/9/2019 1:59 PM, Yury Shevchuk wrote:

Igor, thank you, Round 2 is explained now.

Main aka block aka slow device cannot be expanded in Luminous; this
functionality will be available after the upgrade to Nautilus.
Wal and db devices can be expanded in Luminous.

Now I have recreated osd2 once again to get rid of the paradoxical
ceph osd df output and tried to test db expansion, 40G -> 60G:

node2:/# ceph-volume lvm zap --destroy --osd-id 2
node2:/# ceph osd lost 2 --yes-i-really-mean-it
node2:/# ceph osd destroy 2 --yes-i-really-mean-it
node2:/# lvcreate -L1G -n osd2wal vg0
node2:/# lvcreate -L40G -n osd2db vg0
node2:/# lvcreate -L400G -n osd2 vg0
node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db vg0/osd2db --block.wal vg0/osd2wal

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE     AVAIL  %USE VAR  PGS
 0   hdd 0.22739  1.0     233GiB 9.49GiB 223GiB 4.08 1.24 128
 1   hdd 0.22739  1.0     233GiB 9.49GiB 223GiB 4.08 1.24 128
 3   hdd 0.22739  0           0B      0B     0B    0    0   0
 2   hdd 0.22739  1.0     400GiB 9.49GiB 391GiB 2.37 0.72 128
             TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

node2:/# lvextend -L+20G /dev/vg0/osd2db
Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 extents) to 60.00 GiB (15360 extents).
Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
   slot 0 /var/lib/ceph/osd/ceph-2//block.wal
   slot 1 /var/lib/ceph/osd/ceph-2//block.db
   slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0xf : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0xf
1 : size label updated to 64424509440

node2:/# ceph-bluestore-tool show-label --dev /dev/vg0/osd2db | grep size
  "size": 64424509440,

The label updated correctly, but ceph 

[ceph-users] Ceph Object storage for physically separating tenants storage infrastructure

2019-04-12 Thread Varun Singh
Hi,
We have a requirement to build an object storage solution with a thin
layer of customization on top. This is to be deployed in our own data
centre. We will be using the objects stored in this system at various
places in our business workflow. The solution should support
multi-tenancy. Multiple tenants can come and store their objects in
it. However, there is also a requirement that a tenant may want to use
their own machines. In that case, their objects should be stored and
replicated within their machines. But those machines should still be
part of our system. This is because we will still need access to the
objects for our business workflows. It's just that their data should
not be stored and replicated outside of their systems. Is it something
that can be achieved using Ceph? Thanks a lot in advance.

-- 
Regards,
Varun Singh



Re: [ceph-users] bluefs-bdev-expand experience

2019-04-12 Thread Alfredo Deza
On Thu, Apr 11, 2019 at 4:23 PM Yury Shevchuk  wrote:
>
> Hi Igor!
>
> I have upgraded from Luminous to Nautilus and now slow device
> expansion works indeed.  The steps are shown below to round up the
> topic.
>
> node2# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETAAVAIL   
> %USE  VAR  PGS STATUS
>  0   hdd 0.22739  1.0 233 GiB  91 GiB  90 GiB 208 MiB 816 MiB 142 GiB 
> 38.92 1.04 128 up
>  1   hdd 0.22739  1.0 233 GiB  91 GiB  90 GiB 200 MiB 824 MiB 142 GiB 
> 38.92 1.04 128 up
>  3   hdd 0.227390 0 B 0 B 0 B 0 B 0 B 0 B 
> 00   0   down
>  2   hdd 0.22739  1.0 481 GiB 172 GiB  90 GiB 201 MiB 823 MiB 309 GiB 
> 35.70 0.96 128 up
> TOTAL 947 GiB 353 GiB 269 GiB 610 MiB 2.4 GiB 594 GiB 
> 37.28
> MIN/MAX VAR: 0.96/1.04  STDDEV: 1.62
>
> node2# lvextend -L+50G /dev/vg0/osd2
>   Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 
> 450.00 GiB (115200 extents).
>   Logical volume vg0/osd2 successfully resized.
>
> node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
> inferring bluefs devices from bluestore path
> 2019-04-11 22:28:00.240 7f2e24e190c0 -1 bluestore(/var/lib/ceph/osd/ceph-2) 
> _lock_fsid failed to lock /var/lib/ceph/osd/ceph-2/fsid (is another ceph-osd 
> still running?)(11) Resource temporarily unavailable
> ...
> *** Caught signal (Aborted) **
> [two pages of stack dump stripped]
>
> My mistake in the first place: I tried to expand non-stopped osd again.
>
> node2# systemctl stop ceph-osd.target
>
> node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
> inferring bluefs devices from bluestore path
> 0 : device size 0x4000 : own 0x[1000~3000] = 0x3000 : using 
> 0x8ff000
> 1 : device size 0x144000 : own 0x[2000~143fffe000] = 0x143fffe000 : using 
> 0x24dfe000
> 2 : device size 0x708000 : own 0x[30~4] = 0x4 : 
> using 0x0
> Expanding...
> 2 : expanding  from 0x64 to 0x708000
> 2 : size label updated to 483183820800
>
> node2# ceph-bluestore-tool show-label --dev /dev/vg0/osd2 | grep size
> "size": 483183820800,
>
> node2# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETAAVAIL   
> %USE  VAR  PGS STATUS
>  0   hdd 0.22739  1.0 233 GiB  91 GiB  90 GiB 208 MiB 816 MiB 142 GiB 
> 38.92 1.10 128 up
>  1   hdd 0.22739  1.0 233 GiB  91 GiB  90 GiB 200 MiB 824 MiB 142 GiB 
> 38.92 1.10 128 up
>  3   hdd 0.227390 0 B 0 B 0 B 0 B 0 B 0 B 
> 00   0   down
>  2   hdd 0.22739  1.0 531 GiB 172 GiB  90 GiB 185 MiB 839 MiB 359 GiB 
> 32.33 0.91 128 up
> TOTAL 997 GiB 353 GiB 269 GiB 593 MiB 2.4 GiB 644 GiB 
> 35.41
> MIN/MAX VAR: 0.91/1.10  STDDEV: 3.37
>
> It worked: AVAIL = 594+50 = 644.  Great!
> Thanks a lot for your help.
>
> And one more question regarding your last remark is inline below.
>
> On Wed, Apr 10, 2019 at 09:54:35PM +0300, Igor Fedotov wrote:
> >
> > On 4/9/2019 1:59 PM, Yury Shevchuk wrote:
> > > Igor, thank you, Round 2 is explained now.
> > >
> > > Main aka block aka slow device cannot be expanded in Luminus, this
> > > functionality will be available after upgrade to Nautilus.
> > > Wal and db devices can be expanded in Luminous.
> > >
> > > Now I have recreated osd2 once again to get rid of the paradoxical
> > > cepf osd df output and tried to test db expansion, 40G -> 60G:
> > >
> > > node2:/# ceph-volume lvm zap --destroy --osd-id 2
> > > node2:/# ceph osd lost 2 --yes-i-really-mean-it
> > > node2:/# ceph osd destroy 2 --yes-i-really-mean-it
> > > node2:/# lvcreate -L1G -n osd2wal vg0
> > > node2:/# lvcreate -L40G -n osd2db vg0
> > > node2:/# lvcreate -L400G -n osd2 vg0
> > > node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 
> > > --block.db vg0/osd2db --block.wal vg0/osd2wal
> > >
> > > node2:/# ceph osd df
> > > ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
> > >   0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
> > >   1   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
> > >   3   hdd 0.227390 0B  0B 0B00   0
> > >   2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
> > >  TOTAL 866GiB 28.5GiB 837GiB 3.29
> > > MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83
> > >
> > > node2:/# lvextend -L+20G /dev/vg0/osd2db
> > >Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 
> > > extents) to 60.00 GiB (15360 extents).
> > >Logical volume vg0/osd2db successfully resized.
> > >
> > > node2:/# ceph-bluestore-tool bluefs-bdev-expand --path 
> > > /var/lib/ceph/osd/ceph-2/
> > > inferring bluefs devices from bluestore path
> > >   slot 0 /var/lib/ceph/osd/ceph-2//block.wal
> > >   slot 1 /var/lib/ceph/osd/ceph-2//block.db
> > >   slot 2 /var/lib/ceph/osd/ceph-2//block
> > > 0 : size 0x4000 : own

Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-12 Thread Mark Nelson

Hi Charles,


Basically the goal is to reduce write-amplification as much as
possible.  The deeper the rocksdb hierarchy gets, the worse the
write-amplification for compaction is going to be.  If you look at the
OSD logs you'll see the write-amp factors for compaction in the rocksdb
compaction summary sections that periodically pop up.  There are a couple
of things we are trying on our end to see if we can improve this:



1) Adam has been working on experimenting with sharding data across
multiple column families.  The idea here is that it might be better to
have multiple L0 and L1 levels rather than L0, L1, L2 and L3.  I'm not
sure if this will pan out or not, but that was one of the goals behind
trying this.



2) Toshiba recently released trocksdb which could have a really big 
impact on compaction write amplification:



Code: https://github.com/ToshibaMemoryAmerica/trocksdb/tree/TRocksRel

Wiki: https://github.com/ToshibaMemoryAmerica/trocksdb/wiki


I recently took a look to see if our key/value size distribution would 
work well with the approach that trocksdb is taking to reduce 
write-amplification:



https://docs.google.com/spreadsheets/d/1fNFI8U-JRkU5uaRJzgg5rNxqhgRJFlDB4TsTAVsuYkk/edit?usp=sharing


The good news is that it sounds like the "Trocks Ratio" for the data we 
put in rocksdb is sufficiently high that we'd see some benefit since it 
should greatly reduce write-amplification during compaction for data 
(but not keys). This doesn't help your immediate problem, but I wanted 
you to know that you aren't the only one and we are thinking about ways 
to reduce the compaction impact.
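
To see the write-amp factors mentioned above in your own OSD logs, something
like this should work (a rough sketch, assuming the rocksdb stats sections end
up in the OSD log as described; the OSD id and the amount of context printed
are placeholders):

    grep -A 15 "Compaction Stats" /var/log/ceph/ceph-osd.0.log | less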



Mark


On 4/10/19 2:07 AM, Charles Alva wrote:

Hi Ceph Users,

Is there a way around to minimize rocksdb compacting event so that it 
won't use all the spinning disk IO utilization and avoid it being 
marked as down due to fail to send heartbeat to others?


Right now we have frequent high IO disk utilization for every 20-25 
minutes where the rocksdb reaches level 4 with 67GB data to compact.



Kind regards,

Charles Alva
Sent from Gmail Mobile



Re: [ceph-users] Remove RBD mirror?

2019-04-12 Thread Magnus Grönlund
Hi Jason,

Tried to follow the instructions and setting the debug level to 15 worked
OK, but the daemon appeared to silently ignore the restart command (nothing
indicating a restart seen in the log).
So I set the log level to 15 in the config file and restarted the rbd
mirror daemon. The output surprised me though, my previous perception of
the issue might be completely wrong...
Lots of "image_replayer::BootstrapRequest: failed to create local
image: (2) No such file or directory" and ":ImageReplayer:   replay
encountered an error: (42) No message of desired type"

https://pastebin.com/1bTETNGs

Best regards
/Magnus

On Tue, Apr 9, 2019 at 18:35, Jason Dillaman wrote:

> Can you pastebin the results from running the following on your backup
> site rbd-mirror daemon node?
>
> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 15
> ceph --admin-socket /path/to/asok rbd mirror restart nova
>  wait a minute to let some logs accumulate ...
> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 0/5
>
> ... and collect the rbd-mirror log from /var/log/ceph/ (should have
> lots of "rbd::mirror"-like log entries.
>
>
> On Tue, Apr 9, 2019 at 12:23 PM Magnus Grönlund 
> wrote:
> >
> >
> >
> > Den tis 9 apr. 2019 kl 17:48 skrev Jason Dillaman :
> >>
> >> Any chance your rbd-mirror daemon has the admin sockets available
> >> (defaults to /var/run/ceph/cephdr-clientasok)? If
> >> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".
> >
> >
> > {
> > "pool_replayers": [
> > {
> > "pool": "glance",
> > "peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00 cluster:
> production client: client.productionbackup",
> > "instance_id": "869081",
> > "leader_instance_id": "869081",
> > "leader": true,
> > "instances": [],
> > "local_cluster_admin_socket":
> "/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
> > "remote_cluster_admin_socket":
> "/var/run/ceph/client.productionbackup.1936211.production.9422567521.asok",
> > "sync_throttler": {
> > "max_parallel_syncs": 5,
> > "running_syncs": 0,
> > "waiting_syncs": 0
> > },
> > "image_replayers": [
> > {
> > "name":
> "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
> > "state": "Replaying"
> > },
> > {
> > "name":
> "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
> > "state": "Replaying"
> > },
> > ---cut--
> > {
> > "name":
> "cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
> > "state": "Replaying"
> > }
> > ]
> > },
> >  {
> > "pool": "nova",
> > "peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702 cluster:
> production client: client.productionbackup",
> > "instance_id": "889074",
> > "leader_instance_id": "889074",
> > "leader": true,
> > "instances": [],
> > "local_cluster_admin_socket":
> "/var/run/ceph/client.backup.1936211.backup.94225678548048.asok",
> > "remote_cluster_admin_socket":
> "/var/run/ceph/client.productionbackup.1936211.production.94225679621728.asok",
> > "sync_throttler": {
> > "max_parallel_syncs": 5,
> > "running_syncs": 0,
> > "waiting_syncs": 0
> > },
> > "image_replayers": []
> > }
> > ],
> > "image_deleter": {
> > "image_deleter_status": {
> > "delete_images_queue": [
> > {
> > "local_pool_id": 3,
> > "global_image_id":
> "ff531159-de6f-4324-a022-50c079dedd45"
> > }
> > ],
> > "failed_deletes_queue": []
> > }
> >>
> >>
> >> On Tue, Apr 9, 2019 at 11:26 AM Magnus Grönlund 
> wrote:
> >> >
> >> >
> >> >
> >> > Den tis 9 apr. 2019 kl 17:14 skrev Jason Dillaman <
> jdill...@redhat.com>:
> >> >>
> >> >> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund 
> wrote:
> >> >> >
> >> >> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund <
> mag...@gronlund.se> wrote:
> >> >> > >>
> >> >> > >> Hi,
> >> >> > >> We have configured one-way replication of pools between a
> production cluster and a backup cluster. But unfortunately the rbd-mirror
> or the backup cluster is unable to keep up with the production cluster so
> the replication fails to reach replaying state.
> >> >> > >
> >> >> > >Hmm, it's odd that they don't at least reach the replaying state.
> Are
> >> >> > >they still performing the initial sync?
> >> >> >
> >> >> > There are three pools we try to mirror, (glance, cinder, and nova,
> no points for guessing what the c

Re: [ceph-users] Remove RBD mirror?

2019-04-12 Thread Jason Dillaman
On Fri, Apr 12, 2019 at 9:52 AM Magnus Grönlund  wrote:
>
> Hi Jason,
>
> Tried to follow the instructions and setting the debug level to 15 worked OK, 
> but the daemon appeared to silently ignore the restart command (nothing 
> indicating a restart seen in the log).
> So I set the log level to 15 in the config file and restarted the rbd mirror 
> daemon. The output surprised me though, my previous perception of the issue 
> might be completely wrong...
> Lots of "image_replayer::BootstrapRequest: failed to create local image: 
> (2) No such file or directory" and ":ImageReplayer:   replay encountered 
> an error: (42) No message of desired type"

What is the result from "rbd mirror pool status --verbose nova"
against your DR cluster now? Are they in up+error now? The ENOENT
errors are most likely related to a parent image that hasn't been
mirrored. The ENOMSG error seems to indicate that there might be some
corruption in a journal and it's missing expected records (like a
production client crashed), but it should be able to recover from
that.

> https://pastebin.com/1bTETNGs
>
> Best regards
> /Magnus
>
> Den tis 9 apr. 2019 kl 18:35 skrev Jason Dillaman :
>>
>> Can you pastebin the results from running the following on your backup
>> site rbd-mirror daemon node?
>>
>> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 15
>> ceph --admin-socket /path/to/asok rbd mirror restart nova
>>  wait a minute to let some logs accumulate ...
>> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 0/5
>>
>> ... and collect the rbd-mirror log from /var/log/ceph/ (should have
>> lots of "rbd::mirror"-like log entries.
>>
>>
>> On Tue, Apr 9, 2019 at 12:23 PM Magnus Grönlund  wrote:
>> >
>> >
>> >
>> > Den tis 9 apr. 2019 kl 17:48 skrev Jason Dillaman :
>> >>
>> >> Any chance your rbd-mirror daemon has the admin sockets available
>> >> (defaults to /var/run/ceph/cephdr-clientasok)? If
>> >> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".
>> >
>> >
>> > {
>> > "pool_replayers": [
>> > {
>> > "pool": "glance",
>> > "peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00 cluster: 
>> > production client: client.productionbackup",
>> > "instance_id": "869081",
>> > "leader_instance_id": "869081",
>> > "leader": true,
>> > "instances": [],
>> > "local_cluster_admin_socket": 
>> > "/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
>> > "remote_cluster_admin_socket": 
>> > "/var/run/ceph/client.productionbackup.1936211.production.9422567521.asok",
>> > "sync_throttler": {
>> > "max_parallel_syncs": 5,
>> > "running_syncs": 0,
>> > "waiting_syncs": 0
>> > },
>> > "image_replayers": [
>> > {
>> > "name": "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
>> > "state": "Replaying"
>> > },
>> > {
>> > "name": "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
>> > "state": "Replaying"
>> > },
>> > ---cut--
>> > {
>> > "name": 
>> > "cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
>> > "state": "Replaying"
>> > }
>> > ]
>> > },
>> >  {
>> > "pool": "nova",
>> > "peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702 cluster: 
>> > production client: client.productionbackup",
>> > "instance_id": "889074",
>> > "leader_instance_id": "889074",
>> > "leader": true,
>> > "instances": [],
>> > "local_cluster_admin_socket": 
>> > "/var/run/ceph/client.backup.1936211.backup.94225678548048.asok",
>> > "remote_cluster_admin_socket": 
>> > "/var/run/ceph/client.productionbackup.1936211.production.94225679621728.asok",
>> > "sync_throttler": {
>> > "max_parallel_syncs": 5,
>> > "running_syncs": 0,
>> > "waiting_syncs": 0
>> > },
>> > "image_replayers": []
>> > }
>> > ],
>> > "image_deleter": {
>> > "image_deleter_status": {
>> > "delete_images_queue": [
>> > {
>> > "local_pool_id": 3,
>> > "global_image_id": 
>> > "ff531159-de6f-4324-a022-50c079dedd45"
>> > }
>> > ],
>> > "failed_deletes_queue": []
>> > }
>> >>
>> >>
>> >> On Tue, Apr 9, 2019 at 11:26 AM Magnus Grönlund  
>> >> wrote:
>> >> >
>> >> >
>> >> >
>> >> > Den tis 9 apr. 2019 kl 17:14 skrev Jason Dillaman :
>> >> >>
>> >> >> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund  
>> >> >> wrote:
>> >> >> >
>> >> >> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönl

Re: [ceph-users] Remove RBD mirror?

2019-04-12 Thread Magnus Grönlund
On Fri, Apr 12, 2019 at 16:37, Jason Dillaman wrote:

> On Fri, Apr 12, 2019 at 9:52 AM Magnus Grönlund 
> wrote:
> >
> > Hi Jason,
> >
> > Tried to follow the instructions and setting the debug level to 15
> worked OK, but the daemon appeared to silently ignore the restart command
> (nothing indicating a restart seen in the log).
> > So I set the log level to 15 in the config file and restarted the rbd
> mirror daemon. The output surprised me though, my previous perception of
> the issue might be completely wrong...
> > Lots of "image_replayer::BootstrapRequest: failed to create local
> image: (2) No such file or directory" and ":ImageReplayer:   replay
> encountered an error: (42) No message of desired type"
>
> What is the result from "rbd mirror pool status --verbose nova"
> against your DR cluster now? Are they in up+error now? The ENOENT
> errors most likely related to a parent image that hasn't been
> mirrored. The ENOMSG error seems to indicate that there might be some
> corruption in a journal and it's missing expected records (like a
> production client crashed), but it should be able to recover from
> that
>

# rbd mirror pool status --verbose nova
health: WARNING
images: 2479 total
2479 unknown

002344ab-c324-4c01-97ff-de32868fa712_disk:
  global_id:   c02e0202-df8f-46ce-a4b6-1a50a9692804
  state:   down+unknown
  description: status not found
  last_update:

002a8fde-3a63-4e32-9c18-b0bf64393d0f_disk:
  global_id:   d412abc4-b37e-44a2-8aba-107f352dec60
  state:   down+unknown
  description: status not found
  last_update:





> > https://pastebin.com/1bTETNGs
> >
> > Best regards
> > /Magnus
> >
> > Den tis 9 apr. 2019 kl 18:35 skrev Jason Dillaman :
> >>
> >> Can you pastebin the results from running the following on your backup
> >> site rbd-mirror daemon node?
> >>
> >> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 15
> >> ceph --admin-socket /path/to/asok rbd mirror restart nova
> >>  wait a minute to let some logs accumulate ...
> >> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 0/5
> >>
> >> ... and collect the rbd-mirror log from /var/log/ceph/ (should have
> >> lots of "rbd::mirror"-like log entries.
> >>
> >>
> >> On Tue, Apr 9, 2019 at 12:23 PM Magnus Grönlund 
> wrote:
> >> >
> >> >
> >> >
> >> > Den tis 9 apr. 2019 kl 17:48 skrev Jason Dillaman <
> jdill...@redhat.com>:
> >> >>
> >> >> Any chance your rbd-mirror daemon has the admin sockets available
> >> >> (defaults to /var/run/ceph/cephdr-clientasok)?
> If
> >> >> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror
> status".
> >> >
> >> >
> >> > {
> >> > "pool_replayers": [
> >> > {
> >> > "pool": "glance",
> >> > "peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00
> cluster: production client: client.productionbackup",
> >> > "instance_id": "869081",
> >> > "leader_instance_id": "869081",
> >> > "leader": true,
> >> > "instances": [],
> >> > "local_cluster_admin_socket":
> "/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
> >> > "remote_cluster_admin_socket":
> "/var/run/ceph/client.productionbackup.1936211.production.9422567521.asok",
> >> > "sync_throttler": {
> >> > "max_parallel_syncs": 5,
> >> > "running_syncs": 0,
> >> > "waiting_syncs": 0
> >> > },
> >> > "image_replayers": [
> >> > {
> >> > "name":
> "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
> >> > "state": "Replaying"
> >> > },
> >> > {
> >> > "name":
> "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
> >> > "state": "Replaying"
> >> > },
> >> > ---cut--
> >> > {
> >> > "name":
> "cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
> >> > "state": "Replaying"
> >> > }
> >> > ]
> >> > },
> >> >  {
> >> > "pool": "nova",
> >> > "peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702
> cluster: production client: client.productionbackup",
> >> > "instance_id": "889074",
> >> > "leader_instance_id": "889074",
> >> > "leader": true,
> >> > "instances": [],
> >> > "local_cluster_admin_socket":
> "/var/run/ceph/client.backup.1936211.backup.94225678548048.asok",
> >> > "remote_cluster_admin_socket":
> "/var/run/ceph/client.productionbackup.1936211.production.94225679621728.asok",
> >> > "sync_throttler": {
> >> > "max_parallel_syncs": 5,
> >> > "running_syncs": 0,
> >> > "waiting_syncs": 0
> >> > },
> >> > "image_replayers": []
> >> > }
> >> > ],
>

[ceph-users] RadosGW ops log lag?

2019-04-12 Thread Aaron Bassett
I have a radosgw log centralizer that we use for an audit trail for data
access in our ceph clusters. We've enabled the ops log socket and added logging
of the http_authorization header to it:

rgw log http headers = "http_authorization"
rgw ops log socket path = /var/run/ceph/rgw-ops.sock
rgw enable ops log = true

We have a daemon that listens on the ops socket, extracts/manipulates some 
information from the ops log, and sends it off to our log aggregator.

This setup works pretty well for the most part, except when the cluster comes
under heavy load, it can get _very_ laggy - sometimes up to several hours
behind. I'm having a hard time nailing down what's causing this lag. The daemon
is rather naive, basically just some nc with jq in between, but the log
aggregator has plenty of spare capacity, so I don't think it's slowing down how
fast the daemon is consuming from the socket.
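
Conceptually it is just this pipeline (a simplified sketch of the nc + jq
setup, assuming the consumer connects to the socket radosgw exposes; the jq
filter and the forwarding command are placeholders for what the real daemon
does):

    nc -U /var/run/ceph/rgw-ops.sock | jq -c . | while read -r entry; do
        # forward each re-serialized ops log entry; "logger" stands in for the real shipper
        logger -t rgw-ops "$entry"
    done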

I was revisiting the documentation about this ops log and noticed the following 
which I hadn't seen previously:

When specifying a UNIX domain socket, it is also possible to specify the 
maximum amount of memory that will be used to keep the data backlog:
rgw ops log data backlog = 
Any backlogged data in excess to the specified size will be lost, so the socket 
needs to be read constantly.

I'm wondering if there's a way I can query radosgw for the current size of that
backlog to help me narrow down where the bottleneck may be occurring.

Thanks,
Aaron





Re: [ceph-users] RadosGW ops log lag?

2019-04-12 Thread Matt Benjamin
Hi Aaron,

I don't think that exists currently.

Matt

On Fri, Apr 12, 2019 at 11:12 AM Aaron Bassett
 wrote:
>
> I have an radogw log centralizer that we use to for an audit trail for data 
> access in our ceph clusters. We've enabled the ops log socket and added 
> logging of the http_authorization header to it:
>
> rgw log http headers = "http_authorization"
> rgw ops log socket path = /var/run/ceph/rgw-ops.sock
> rgw enable ops log = true
>
> We have a daemon that listens on the ops socket, extracts/manipulates some 
> information from the ops log, and sends it off to our log aggregator.
>
> This setup works pretty well for the most part, except when the cluster comes 
> under heavy load, it can get _very_ laggy - sometimes up to several hours 
> behind. I'm having a hard time nailing down whats causing this lag. The 
> daemon is rather naive, basically just some nc with jq in between, but the 
> log aggregator has plenty of spare capacity, so I don't think its slowing 
> down how fast the daemon is consuming from the socket.
>
> I was revisiting the documentation about this ops log and noticed the 
> following which I hadn't seen previously:
>
> When specifying a UNIX domain socket, it is also possible to specify the 
> maximum amount of memory that will be used to keep the data backlog:
> rgw ops log data backlog = 
> Any backlogged data in excess to the specified size will be lost, so the 
> socket needs to be read constantly.
>
> I'm wondering if theres a way I can query radosgw for the current size of 
> that backlog to help me narrow down where the bottleneck may be occuring.
>
> Thanks,
> Aaron
>
>
>


-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


Re: [ceph-users] RadosGW ops log lag?

2019-04-12 Thread Aaron Bassett
Ok thanks. Is the expectation that events will be available on that socket as
soon as they occur, or is it more of a best-effort situation? I'm just trying to
nail down which side of the socket might be lagging. It's pretty difficult to
recreate this as I have to hit the cluster very hard to get it to start lagging.

Thanks, Aaron 

> On Apr 12, 2019, at 11:16 AM, Matt Benjamin  wrote:
> 
> Hi Aaron,
> 
> I don't think that exists currently.
> 
> Matt
> 
> On Fri, Apr 12, 2019 at 11:12 AM Aaron Bassett
>  wrote:
>> 
>> I have an radogw log centralizer that we use to for an audit trail for data 
>> access in our ceph clusters. We've enabled the ops log socket and added 
>> logging of the http_authorization header to it:
>> 
>> rgw log http headers = "http_authorization"
>> rgw ops log socket path = /var/run/ceph/rgw-ops.sock
>> rgw enable ops log = true
>> 
>> We have a daemon that listens on the ops socket, extracts/manipulates some 
>> information from the ops log, and sends it off to our log aggregator.
>> 
>> This setup works pretty well for the most part, except when the cluster 
>> comes under heavy load, it can get _very_ laggy - sometimes up to several 
>> hours behind. I'm having a hard time nailing down whats causing this lag. 
>> The daemon is rather naive, basically just some nc with jq in between, but 
>> the log aggregator has plenty of spare capacity, so I don't think its 
>> slowing down how fast the daemon is consuming from the socket.
>> 
>> I was revisiting the documentation about this ops log and noticed the 
>> following which I hadn't seen previously:
>> 
>> When specifying a UNIX domain socket, it is also possible to specify the 
>> maximum amount of memory that will be used to keep the data backlog:
>> rgw ops log data backlog = 
>> Any backlogged data in excess to the specified size will be lost, so the 
>> socket needs to be read constantly.
>> 
>> I'm wondering if theres a way I can query radosgw for the current size of 
>> that backlog to help me narrow down where the bottleneck may be occuring.
>> 
>> Thanks,
>> Aaron
>> 
>> 
>> 
> 
> 
> -- 
> 
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
> 
> http://www.redhat.com/en/technologies/storage
> 
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309




[ceph-users] can not change log level for ceph-client.libvirt online

2019-04-12 Thread lin zhou
Hi, cephers

We have a ceph cluster backing OpenStack.
Some time ago we set debug_rbd in ceph.conf and then booted the VMs,
but those debug settings no longer exist in the config.
Now we find that ceph-client.libvirt.log has grown to 200GB.
But I cannot use "ceph --admin-daemon ceph-client.libvirt.asok config set
debug_rbd 0/5" to change it online.

I guess that if I reboot all the VMs on this host the log level may reset.
But can I change it online?
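
(For anyone wanting to poke at the same thing: checking what the client is
currently running with looks roughly like this -- a sketch assuming the
default asok location; the real filename also carries the client id and pid.)

    ceph --admin-daemon /var/run/ceph/ceph-client.libvirt.asok config show | grep debug_rbd
    ceph --admin-daemon /var/run/ceph/ceph-client.libvirt.asok config diff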


Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-12 Thread Charles Alva
Thanks Mark,

This is interesting. I'll take a look at the links you provided.

Does rocksdb compacting issue only affect HDDs? Or SSDs are having same
issue?

Kind regards,

Charles Alva
Sent from Gmail Mobile

On Fri, Apr 12, 2019, 9:01 PM Mark Nelson  wrote:

> Hi Charles,
>
>
> Basically the goal is to reduce write-amplification as much as
> possible.  The deeper that the rocksdb hierarchy gets, the worse the
> write-amplifcation for compaction is going to be.  If you look at the
> OSD logs you'll see the write-amp factors for compaction in the rocksdb
> compaction summary sections that periodically pop up. There's a couple
> of things we are trying to see if we can improve things on our end:
>
>
> 1) Adam has been working on experimenting with sharding data across
> multiple column families.  The idea here is that it might be better to
> hav multiple L0 and L1 levels rather than L0, L1, L2 and L3.  I'm not
> sure if this will pan out of not, but that was one of the goals behind
> trying this.
>
>
> 2) Toshiba recently released trocksdb which could have a really big
> impact on compaction write amplification:
>
>
> Code: https://github.com/ToshibaMemoryAmerica/trocksdb/tree/TRocksRel
>
> Wiki: https://github.com/ToshibaMemoryAmerica/trocksdb/wiki
>
>
> I recently took a look to see if our key/value size distribution would
> work well with the approach that trocksdb is taking to reduce
> write-amplification:
>
>
>
> https://docs.google.com/spreadsheets/d/1fNFI8U-JRkU5uaRJzgg5rNxqhgRJFlDB4TsTAVsuYkk/edit?usp=sharing
>
>
> The good news is that it sounds like the "Trocks Ratio" for the data we
> put in rocksdb is sufficiently high that we'd see some benefit since it
> should greatly reduce write-amplification during compaction for data
> (but not keys). This doesn't help your immediate problem, but I wanted
> you to know that you aren't the only one and we are thinking about ways
> to reduce the compaction impact.
>
>
> Mark
>
>
> On 4/10/19 2:07 AM, Charles Alva wrote:
> > Hi Ceph Users,
> >
> > Is there a way around to minimize rocksdb compacting event so that it
> > won't use all the spinning disk IO utilization and avoid it being
> > marked as down due to fail to send heartbeat to others?
> >
> > Right now we have frequent high IO disk utilization for every 20-25
> > minutes where the rocksdb reaches level 4 with 67GB data to compact.
> >
> >
> > Kind regards,
> >
> > Charles Alva
> > Sent from Gmail Mobile
> >


Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-12 Thread Mark Nelson
They have the same issue, but depending on the SSD they may be better at
absorbing the extra IO if network or CPU are bigger bottlenecks.  That's
one of the reasons that a lot of folks like to put the DB on flash for
HDD-based clusters.  It's still possible to oversubscribe them, but
you've got more headroom.



Mark

On 4/12/19 10:25 AM, Charles Alva wrote:

Thanks Mark,

This is interesting. I'll take a look at the links you provided.

Does rocksdb compacting issue only affect HDDs? Or SSDs are having 
same issue?


Kind regards,

Charles Alva
Sent from Gmail Mobile

On Fri, Apr 12, 2019, 9:01 PM Mark Nelson > wrote:


Hi Charles,


Basically the goal is to reduce write-amplification as much as
possible.  The deeper that the rocksdb hierarchy gets, the worse the
write-amplifcation for compaction is going to be.  If you look at the
OSD logs you'll see the write-amp factors for compaction in the
rocksdb
compaction summary sections that periodically pop up. There's a
couple
of things we are trying to see if we can improve things on our end:


1) Adam has been working on experimenting with sharding data across
multiple column families.  The idea here is that it might be
better to
hav multiple L0 and L1 levels rather than L0, L1, L2 and L3. I'm not
sure if this will pan out of not, but that was one of the goals
behind
trying this.


2) Toshiba recently released trocksdb which could have a really big
impact on compaction write amplification:


Code: https://github.com/ToshibaMemoryAmerica/trocksdb/tree/TRocksRel

Wiki: https://github.com/ToshibaMemoryAmerica/trocksdb/wiki


I recently took a look to see if our key/value size distribution
would
work well with the approach that trocksdb is taking to reduce
write-amplification:



https://docs.google.com/spreadsheets/d/1fNFI8U-JRkU5uaRJzgg5rNxqhgRJFlDB4TsTAVsuYkk/edit?usp=sharing


The good news is that it sounds like the "Trocks Ratio" for the
data we
put in rocksdb is sufficiently high that we'd see some benefit
since it
should greatly reduce write-amplification during compaction for data
(but not keys). This doesn't help your immediate problem, but I
wanted
you to know that you aren't the only one and we are thinking about
ways
to reduce the compaction impact.


Mark


On 4/10/19 2:07 AM, Charles Alva wrote:
> Hi Ceph Users,
>
> Is there a way around to minimize rocksdb compacting event so
that it
> won't use all the spinning disk IO utilization and avoid it being
> marked as down due to fail to send heartbeat to others?
>
> Right now we have frequent high IO disk utilization for every 20-25
> minutes where the rocksdb reaches level 4 with 67GB data to compact.
>
>
> Kind regards,
>
> Charles Alva
> Sent from Gmail Mobile
>


Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-12 Thread Charles Alva
Got it. Thanks, Mark!

Kind regards,

Charles Alva
Sent from Gmail Mobile


On Fri, Apr 12, 2019 at 10:53 PM Mark Nelson  wrote:

> They have the same issue, but depending on the SSD may be better at
> absorbing the extra IO if network or CPU are bigger bottlenecks.  That's
> one of the reasons that a lot of folks like to put the DB on flash for
> HDD based clusters.  It's still possible to oversubscribe them, but
> you've got more headroom.
>
>
> Mark
>
> On 4/12/19 10:25 AM, Charles Alva wrote:
> > Thanks Mark,
> >
> > This is interesting. I'll take a look at the links you provided.
> >
> > Does rocksdb compacting issue only affect HDDs? Or SSDs are having
> > same issue?
> >
> > Kind regards,
> >
> > Charles Alva
> > Sent from Gmail Mobile
> >
> > On Fri, Apr 12, 2019, 9:01 PM Mark Nelson  > > wrote:
> >
> > Hi Charles,
> >
> >
> > Basically the goal is to reduce write-amplification as much as
> > possible.  The deeper that the rocksdb hierarchy gets, the worse the
> > write-amplifcation for compaction is going to be.  If you look at the
> > OSD logs you'll see the write-amp factors for compaction in the
> > rocksdb
> > compaction summary sections that periodically pop up. There's a
> > couple
> > of things we are trying to see if we can improve things on our end:
> >
> >
> > 1) Adam has been working on experimenting with sharding data across
> > multiple column families.  The idea here is that it might be
> > better to
> > hav multiple L0 and L1 levels rather than L0, L1, L2 and L3. I'm not
> > sure if this will pan out of not, but that was one of the goals
> > behind
> > trying this.
> >
> >
> > 2) Toshiba recently released trocksdb which could have a really big
> > impact on compaction write amplification:
> >
> >
> > Code:
> https://github.com/ToshibaMemoryAmerica/trocksdb/tree/TRocksRel
> >
> > Wiki: https://github.com/ToshibaMemoryAmerica/trocksdb/wiki
> >
> >
> > I recently took a look to see if our key/value size distribution
> > would
> > work well with the approach that trocksdb is taking to reduce
> > write-amplification:
> >
> >
> >
> https://docs.google.com/spreadsheets/d/1fNFI8U-JRkU5uaRJzgg5rNxqhgRJFlDB4TsTAVsuYkk/edit?usp=sharing
> >
> >
> > The good news is that it sounds like the "Trocks Ratio" for the
> > data we
> > put in rocksdb is sufficiently high that we'd see some benefit
> > since it
> > should greatly reduce write-amplification during compaction for data
> > (but not keys). This doesn't help your immediate problem, but I
> > wanted
> > you to know that you aren't the only one and we are thinking about
> > ways
> > to reduce the compaction impact.
> >
> >
> > Mark
> >
> >
> > On 4/10/19 2:07 AM, Charles Alva wrote:
> > > Hi Ceph Users,
> > >
> > > Is there a way around to minimize rocksdb compacting event so
> > that it
> > > won't use all the spinning disk IO utilization and avoid it being
> > > marked as down due to fail to send heartbeat to others?
> > >
> > > Right now we have frequent high IO disk utilization for every 20-25
> > > minutes where the rocksdb reaches level 4 with 67GB data to
> compact.
> > >
> > >
> > > Kind regards,
> > >
> > > Charles Alva
> > > Sent from Gmail Mobile
> > >


[ceph-users] v12.2.12 Luminous released

2019-04-12 Thread Abhishek Lekshmanan

We are happy to announce the next bugfix release for v12.2.x Luminous
stable release series. We recommend all luminous users to upgrade to
this release. Many thanks to everyone who contributed backports and a
special mention to Yuri for the QE efforts put in to this release.

Notable Changes
---
* In 12.2.11 and earlier releases, keyring caps were not checked for validity,
  so the caps string could be anything. As of 12.2.12, caps strings are
  validated and providing a keyring with an invalid caps string to, e.g.,
  `ceph auth add` will result in an error.
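  For illustration, a caps string in the usual form passes validation
  (example client name and caps only):

    ceph auth add client.example mon 'allow r' osd 'allow rw pool=rbd'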

For the complete changelog, please refer to the release blog entry at 
https://ceph.com/releases/v12-2-12-luminous-released/

Getting ceph:

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-12.2.12.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 1436006594665279fe734b4c15d7e08c13ebd777

-- 
Abhishek Lekshmanan
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)




[ceph-users] Limits of mds bal fragment size max

2019-04-12 Thread Benjeman Meekhof
We have a user syncing data with some kind of rsync + hardlink based
system creating/removing large numbers of hard links.  We've
encountered many of the issues with stray inode re-integration as
described in the thread and tracker below.

As noted, one fix is to increase mds_bal_fragment_size_max so the stray
directories can accommodate the high stray count.  We blew right
through 200,000, then 300,000, and at this point I'm wondering if
there is an upper safe limit on this parameter.  If I go to something
like 1 million to work with this use case, will I have other problems?
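
For concreteness, bumping it looks like the following -- the value is purely
illustrative and the MDS name is a placeholder:

    # persist it in ceph.conf under [mds]:
    #   mds bal fragment size max = 500000
    ceph tell mds.<name> injectargs '--mds_bal_fragment_size_max 500000'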

Background:
https://www.spinics.net/lists/ceph-users/msg51985.html
http://tracker.ceph.com/issues/38849

thanks,
Ben


Re: [ceph-users] Remove RBD mirror?

2019-04-12 Thread Jason Dillaman
On Fri, Apr 12, 2019 at 10:48 AM Magnus Grönlund  wrote:
>
>
>
> Den fre 12 apr. 2019 kl 16:37 skrev Jason Dillaman :
>>
>> On Fri, Apr 12, 2019 at 9:52 AM Magnus Grönlund  wrote:
>> >
>> > Hi Jason,
>> >
>> > Tried to follow the instructions and setting the debug level to 15 worked 
>> > OK, but the daemon appeared to silently ignore the restart command 
>> > (nothing indicating a restart seen in the log).
>> > So I set the log level to 15 in the config file and restarted the rbd 
>> > mirror daemon. The output surprised me though, my previous perception of 
>> > the issue might be completely wrong...
>> > Lots of "image_replayer::BootstrapRequest: failed to create local 
>> > image: (2) No such file or directory" and ":ImageReplayer:   replay 
>> > encountered an error: (42) No message of desired type"
>>
>> What is the result from "rbd mirror pool status --verbose nova"
>> against your DR cluster now? Are they in up+error now? The ENOENT
>> errors most likely related to a parent image that hasn't been
>> mirrored. The ENOMSG error seems to indicate that there might be some
>> corruption in a journal and it's missing expected records (like a
>> production client crashed), but it should be able to recover from
>> that
>
>
> # rbd mirror pool status --verbose nova
> health: WARNING
> images: 2479 total
> 2479 unknown

Odd, so those log messages were probably related to the two images in
the glance pool.  Unfortunately, v12.2.x will actually require "debug
rbd_mirror = 20" to see the progression in the state machines, which
will result in a huge log. Any chance you are willing to collect that
data for a few minutes at that high log level and upload the
compressed log somewhere? You can use "ceph-post-file" if needed.
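For example (a sketch; substitute the actual rbd-mirror daemon log file name):

    ceph-post-file /var/log/ceph/<rbd-mirror-daemon>.log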

> 002344ab-c324-4c01-97ff-de32868fa712_disk:
>   global_id:   c02e0202-df8f-46ce-a4b6-1a50a9692804
>   state:   down+unknown
>   description: status not found
>   last_update:
>
> 002a8fde-3a63-4e32-9c18-b0bf64393d0f_disk:
>   global_id:   d412abc4-b37e-44a2-8aba-107f352dec60
>   state:   down+unknown
>   description: status not found
>   last_update:
>
> 
>
>
>>
>> > https://pastebin.com/1bTETNGs
>> >
>> > Best regards
>> > /Magnus
>> >
>> > Den tis 9 apr. 2019 kl 18:35 skrev Jason Dillaman :
>> >>
>> >> Can you pastebin the results from running the following on your backup
>> >> site rbd-mirror daemon node?
>> >>
>> >> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 15
>> >> ceph --admin-socket /path/to/asok rbd mirror restart nova
>> >>  wait a minute to let some logs accumulate ...
>> >> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 0/5
>> >>
>> >> ... and collect the rbd-mirror log from /var/log/ceph/ (should have
>> >> lots of "rbd::mirror"-like log entries.
>> >>
>> >>
>> >> On Tue, Apr 9, 2019 at 12:23 PM Magnus Grönlund  
>> >> wrote:
>> >> >
>> >> >
>> >> >
>> >> > Den tis 9 apr. 2019 kl 17:48 skrev Jason Dillaman :
>> >> >>
>> >> >> Any chance your rbd-mirror daemon has the admin sockets available
>> >> >> (defaults to /var/run/ceph/cephdr-clientasok)? If
>> >> >> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".
>> >> >
>> >> >
>> >> > {
>> >> > "pool_replayers": [
>> >> > {
>> >> > "pool": "glance",
>> >> > "peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00 
>> >> > cluster: production client: client.productionbackup",
>> >> > "instance_id": "869081",
>> >> > "leader_instance_id": "869081",
>> >> > "leader": true,
>> >> > "instances": [],
>> >> > "local_cluster_admin_socket": 
>> >> > "/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
>> >> > "remote_cluster_admin_socket": 
>> >> > "/var/run/ceph/client.productionbackup.1936211.production.9422567521.asok",
>> >> > "sync_throttler": {
>> >> > "max_parallel_syncs": 5,
>> >> > "running_syncs": 0,
>> >> > "waiting_syncs": 0
>> >> > },
>> >> > "image_replayers": [
>> >> > {
>> >> > "name": 
>> >> > "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
>> >> > "state": "Replaying"
>> >> > },
>> >> > {
>> >> > "name": 
>> >> > "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
>> >> > "state": "Replaying"
>> >> > },
>> >> > ---cut--
>> >> > {
>> >> > "name": 
>> >> > "cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
>> >> > "state": "Replaying"
>> >> > }
>> >> > ]
>> >> > },
>> >> >  {
>> >> > "pool": "nova",
>> >> > "peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702 
>> >> > cluster: production client: client.productionbackup",
>> >> > "instance_id": "889074"

Re: [ceph-users] Ceph Object storage for physically separating tenants storage infrastructure

2019-04-12 Thread Gregory Farnum
Yes, you would do this by setting up separate data pools for segregated
clients, giving those pools a CRUSH rule placing them on their own servers,
and if using S3 assigning the clients to them using either wholly separate
instances or perhaps separate zones and the S3 placement options.
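Roughly, with made-up names ("tenant-a" as the CRUSH root and pool prefix; the
exact layout depends on how the tenant's servers are grouped), it would look
something like:

    # move the tenant's hosts under their own CRUSH root, e.g.:
    #   ceph osd crush move <host> root=tenant-a
    ceph osd crush rule create-replicated tenant-a-rule tenant-a host
    ceph osd pool create tenant-a-data 128 128 replicated tenant-a-rule
    # or point an existing pool at the rule:
    ceph osd pool set <existing-pool> crush_rule tenant-a-rule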
-Greg

On Fri, Apr 12, 2019 at 3:04 AM Varun Singh  wrote:

> Hi,
> We have a requirement to build an object storage solution with thin
> layer of customization on top. This is to be deployed in our own data
> centre. We will be using the objects stored in this system at various
> places in our business workflow. The solution should support
> multi-tenancy. Multiple tenants can come and store their objects in
> it. However, there is also a requirement that a tenant may want to use
> their own machines. In that case, their objects should be stored and
> replicated within their machines. But those machines should still be
> part of our system. This is because we will still need access to the
> objects for our business workflows. It's just that their data should
> not be stored and replicated outside of their systems. Is it something
> that can be achieved using Ceph? Thanks a lot in advance.
>
> --
> Regards,
> Varun Singh
>





[ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-12 Thread Igor Podlesny
For e. g., an EC pool with default profile (2, 1) has bogus "sizing"
params (size=3, min_size=3).
Min. size 3 is wrong as far as I know and it's been fixed in fresh
releases (but not in Luminous).

But besides that it looks like pool usage isn't calculated according
to EC overhead but as if it was replicated pool with size=3 as well.
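
For comparison, the raw-space overhead the two cases should show (simple
arithmetic for this pool's profile):

  EC k=2, m=1:        raw used = logical * (k + m) / k = 1.5 * logical
  replicated size=3:  raw used = 3 * logical

so accounting an EC 2+1 pool as if it were 3-way replicated would overstate
its usage by a factor of 2.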

-- 
End of message. Next message?


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-12 Thread Paul Emmerich
Please don't use an EC pool with 2+1, that configuration makes no sense.

min_size 3 is the default for that pool, yes. That means your data
will be unavailable if any OSD is offline.
Reducing min_size to 2 means you are accepting writes when you cannot
guarantee durability which will cause problems in the long run.
See older discussions about min_size here



Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Apr 12, 2019 at 9:30 PM Igor Podlesny  wrote:
>
> For e. g., an EC pool with default profile (2, 1) has bogus "sizing"
> params (size=3, min_size=3).
> Min. size 3 is wrong as far as I know and it's been fixed in fresh
> releases (but not in Luminous).
>
> But besides that it looks like pool usage isn't calculated according
> to EC overhead but as if it was replicated pool with size=3 as well.
>
> --
> End of message. Next message?


Re: [ceph-users] v12.2.12 Luminous released

2019-04-12 Thread Paul Emmerich
I think the most notable change here is the backport of the new bitmap
allocator, but that's missing completely from the change log.
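
For anyone wanting to try it on 12.2.12: as far as I can tell the backported
allocator is not enabled by default and is selected via the allocator options,
e.g. in ceph.conf -- treat this as a sketch and test on a non-production OSD
first:

    [osd]
    bluestore_allocator = bitmap
    bluefs_allocator = bitmap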


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Apr 12, 2019 at 6:48 PM Abhishek Lekshmanan  wrote:
>
>
> We are happy to announce the next bugfix release for v12.2.x Luminous
> stable release series. We recommend all luminous users to upgrade to
> this release. Many thanks to everyone who contributed backports and a
> special mention to Yuri for the QE efforts put in to this release.
>
> Notable Changes
> ---
> * In 12.2.11 and earlier releases, keyring caps were not checked for validity,
>   so the caps string could be anything. As of 12.2.12, caps strings are
>   validated and providing a keyring with an invalid caps string to, e.g.,
>   `ceph auth add` will result in an error.
>
> For the complete changelog, please refer to the release blog entry at
> https://ceph.com/releases/v12-2-12-luminous-released/
>
> Getting ceph:
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-12.2.12.tar.gz
> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
> * Release git sha1: 1436006594665279fe734b4c15d7e08c13ebd777
>
> --
> Abhishek Lekshmanan
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-12 Thread Igor Podlesny
On Sat, 13 Apr 2019 at 06:54, Paul Emmerich  wrote:
>
> Please don't use an EC pool with 2+1, that configuration makes no sense.

That's rather ironic, given that (2, 1) is the default EC profile and is
additionally described in the Ceph documentation.

> min_size 3 is the default for that pool, yes. That means your data
> will be unavailable if any OSD is offline.
> Reducing min_size to 2 means you are accepting writes when you cannot
> guarantee durability which will cause problems in the long run.
> See older discussions about min_size here

Well, my primary concern wasn't about min_size at all but about this: {
> > But besides that it looks like pool usage isn't calculated according
> > to EC overhead but as if it was replicated pool with size=3 as well.
}

-- 
End of message. Next message?


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-12 Thread Igor Podlesny
And as to the min_size choice -- since you replied to exactly that part
of my message only.

On Sat, 13 Apr 2019 at 06:54, Paul Emmerich  wrote:
> On Fri, Apr 12, 2019 at 9:30 PM Igor Podlesny  wrote:
> > For e. g., an EC pool with default profile (2, 1) has bogus "sizing"
> > params (size=3, min_size=3).

{{
> > Min. size 3 is wrong as far as I know and it's been fixed in fresh
> > releases (but not in Luminous).
}}

I didn't give any proof when writing this because I was more focused on the
EC pool usage calculation.
Take a look at:

  https://github.com/ceph/ceph/pull/8008

As can be seen there, the formula for min_size effectively became
min_size = k + min(1, m - 1) in March 2019.
-- That's why I said "fixed in fresh releases but not in Luminous".

Let's see what does this new formula produce for k=2, m=1 (the default
and documented EC profile):

min_size = 2 + min(1, 1 - 1) = 2 + 0 = 2.

Before that change it would be 3 instead, thus giving that 3/3 for EC (2, 1).
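
For anyone checking their own pools, the values are easy to inspect and, if
you accept the trade-offs discussed elsewhere in this thread, to adjust (the
pool name is a placeholder):

    ceph osd pool get <pool> min_size
    ceph osd pool set <pool> min_size 2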

[...]
> min_size 3 is the default for that pool, yes. That means your data
> will be unavailable if any OSD is offline.
> Reducing min_size to 2 means you are accepting writes when you cannot
> guarantee durability which will cause problems in the long run.
> See older discussions about min_size here

I would be glad to, but this is a mailing list rather than a forum, right? --
so the only way to "see here" is to rely on a search engine that might have
indexed the mailing list archive. If you have a specific URL, or at least
exact keywords that would let me find what you're referring to, I'd gladly
look at what you're talking about.

And of course I did search before writing, and the fact that I wrote it
anyway means I didn't find anything answering my question "here or there".

--
End of message. Next message?