Re: [ceph-users] clust recovery stuck

2019-10-21 Thread Eugen Block

Hi,

can you share `ceph osd tree`? What crush rules are in use in your  
cluster? I assume that the two failed OSDs prevent the remapping  
because the rules can't be applied.
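For reference, something like the following should show the tree and the rules
in use (pool and rule names will of course differ per cluster):

  ceph osd tree
  ceph osd pool ls detail      # shows which crush_rule each pool uses
  ceph osd crush rule dump     # dumps the rule definitions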



Regards,
Eugen


Zitat von Philipp Schwaha :


hi,

I have a problem with a cluster being stuck in recovery after an OSD
failure. At first recovery was doing quite well, but now it just sits
there without any progress. It currently looks like this:

 health HEALTH_ERR
36 pgs are stuck inactive for more than 300 seconds
50 pgs backfill_wait
52 pgs degraded
36 pgs down
36 pgs peering
1 pgs recovering
1 pgs recovery_wait
36 pgs stuck inactive
52 pgs stuck unclean
52 pgs undersized
recovery 261632/2235446 objects degraded (11.704%)
recovery 259813/2235446 objects misplaced (11.622%)
recovery 2/1117723 unfound (0.000%)
 monmap e3: 3 mons at
{0=192.168.19.13:6789/0,1=192.168.19.17:6789/0,2=192.168.19.23:6789/0}
election epoch 78, quorum 0,1,2 0,1,2
 osdmap e7430: 6 osds: 4 up, 4 in; 88 remapped pgs
flags sortbitwise
  pgmap v20023893: 256 pgs, 1 pools, 4366 GB data, 1091 kobjects
8421 GB used, 10183 GB / 18629 GB avail
261632/2235446 objects degraded (11.704%)
259813/2235446 objects misplaced (11.622%)
2/1117723 unfound (0.000%)
 168 active+clean
  50 active+undersized+degraded+remapped+wait_backfill
  36 down+remapped+peering
   1 active+recovering+undersized+degraded+remapped
   1 active+recovery_wait+undersized+degraded+remapped

Is there any way to motivate it to resume recovery?

Thanks
Philipp




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replace ceph osd in a container

2019-10-21 Thread Alex Litvak

Hello cephers,

So I am having trouble with new hardware systems showing strange OSD behavior, 
and I want to replace a disk with a brand new one to test a theory.

I run all daemons in containers, and on one of the nodes I have a mon, a mgr, and 6 
OSDs. So, following 
https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd

I stopped the container with osd.23, waited until it was down and out, ran the 
safe-to-destroy loop, and then destroyed the OSD, all using the monitor container 
on this node.  All good.

Then I swapped the SSDs and started running the additional steps (from step 3) using the same mon container. I have no Ceph packages installed on the bare-metal box. It looks like the mon container doesn't 
see the disk.


podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
 stderr: lsblk: /dev/sdh: not a block device
 stderr: error: /dev/sdh: No such file or directory
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys 
expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
   [--osd-fsid OSD_FSID]
   [DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: /dev/sdh
Error: exit status 2
root@storage2n2-la:~# ls -l /dev/sd
sda   sdc   sdd   sde   sdf   sdg   sdg1  sdg2  sdg5  sdh
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume lvm 
zap sdh
 stderr: lsblk: sdh: not a block device
 stderr: error: sdh: No such file or directory
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys 
expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
   [--osd-fsid OSD_FSID]
   [DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: sdh
Error: exit status 2

I execute lsblk and it sees device sdh
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
lsblk: dm-1: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-1: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-3: failed to get device path
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdf  8:80   0   1.8T  0 disk
sdd  8:48   0   1.8T  0 disk
sdg  8:96   0 223.5G  0 disk
|-sdg5   8:101  0   223G  0 part
|-sdg1   8:97   487M  0 part
`-sdg2   8:98 1K  0 part
sde  8:64   0   1.8T  0 disk
sdc  8:32   0   3.5T  0 disk
sda  8:00   3.5T  0 disk
sdh  8:112  0   3.5T  0 disk

So I used a fellow OSD container (osd.5) on the same node and ran all of the 
operations (zap and prepare) successfully.

I suspect that the mon and mgr containers have no access to /dev or /var/lib while 
the OSD containers do.  The cluster was originally configured by ceph-ansible 
(Nautilus 14.2.2).

The question is: if I want to replace all disks on a single node, and I have 6 
nodes with replication-3 pools, is it safe to restart the mgr container with /dev 
and /var/lib/ceph mounted as volumes (they are not configured right now)?

I cannot use the other OSD containers on the same box, because my controller reverts from RAID to non-RAID mode with all disks lost, not just a single one.  So I need to replace all 6 OSDs before they can run 
in containers again, and the only things that will remain operational on the node are the mon and mgr containers.


I prefer not to install a full cluster or client on the bare metal node if 
possible.
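One alternative sketch that would avoid installing anything on the bare metal
and leave the mon/mgr containers untouched: run ceph-volume from a throw-away
container with the host's /dev and Ceph directories mounted. The image name,
tag and entrypoint below are assumptions - whichever image the ceph-ansible
deployment already pulls should work:

  podman run --rm --privileged --net=host \
    -v /dev:/dev \
    -v /etc/ceph:/etc/ceph \
    -v /var/lib/ceph:/var/lib/ceph \
    -v /var/run/ceph:/var/run/ceph \
    --entrypoint ceph-volume \
    docker.io/ceph/daemon:latest-nautilus \
    lvm zap --destroy /dev/sdh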

Thank you for your help,

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] clust recovery stuck

2019-10-21 Thread Philipp Schwaha
hi,

I have a problem with a cluster being stuck in recovery after an OSD
failure. At first recovery was doing quite well, but now it just sits
there without any progress. It currently looks like this:

 health HEALTH_ERR
36 pgs are stuck inactive for more than 300 seconds
50 pgs backfill_wait
52 pgs degraded
36 pgs down
36 pgs peering
1 pgs recovering
1 pgs recovery_wait
36 pgs stuck inactive
52 pgs stuck unclean
52 pgs undersized
recovery 261632/2235446 objects degraded (11.704%)
recovery 259813/2235446 objects misplaced (11.622%)
recovery 2/1117723 unfound (0.000%)
 monmap e3: 3 mons at
{0=192.168.19.13:6789/0,1=192.168.19.17:6789/0,2=192.168.19.23:6789/0}
election epoch 78, quorum 0,1,2 0,1,2
 osdmap e7430: 6 osds: 4 up, 4 in; 88 remapped pgs
flags sortbitwise
  pgmap v20023893: 256 pgs, 1 pools, 4366 GB data, 1091 kobjects
8421 GB used, 10183 GB / 18629 GB avail
261632/2235446 objects degraded (11.704%)
259813/2235446 objects misplaced (11.622%)
2/1117723 unfound (0.000%)
 168 active+clean
  50 active+undersized+degraded+remapped+wait_backfill
  36 down+remapped+peering
   1 active+recovering+undersized+degraded+remapped
   1 active+recovery_wait+undersized+degraded+remapped

Is there any way to motivate it to resume recovery?

Thanks
Philipp



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Decreasing the impact of reweighting osds

2019-10-21 Thread Mark Kirkwood
We recently needed to reweight a couple of OSDs on one of our clusters 
(Luminous on Ubuntu, 8 hosts, 8 OSDs/host). I think we reweighted by 
approx. 0.2. This was perhaps too much, as IO latency on RBD drives 
spiked to several seconds at times.


We'd like to lessen this effect as much as we can, so we are looking at the 
priority and queue parameters below (OSDs are Filestore-based with S3700 SSD 
or similar NVMe journals):


# priorities
osd_client_op_priority
osd_recovery_op_priority
osd_recovery_priority
osd_scrub_priority
osd_snap_trim_priority

# queue tuning
filestore_queue_max_ops
filestore_queue_low_threshhold
filestore_queue_high_threshhold
filestore_expected_throughput_ops
filestore_queue_high_delay_multiple
filestore_queue_max_delay_multiple
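
If it helps, a sketch of how such settings are usually applied at runtime while
watching client latency (the values are only examples, not recommendations;
current values can be read back with `ceph daemon osd.N config show`):

  ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
  # persist the winners in ceph.conf under [osd] once a good value is found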

My first question is this - do these parameters require the CFQ 
scheduler (like osd_disk_thread_ioprio_priority does)? We are currently 
using deadline (we have not tweaked queue/iosched/write_expire down from 
5000 to 1500 which might be good to do).


My 2nd question is - should we consider increasing 
osd_disk_thread_ioprio_priority (and hence changing to CFQ scheduler)? I 
usually see this parameter discussed WRT scrubbing, and we are not 
having issues with that.


regards

Mark


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: large concurrent rbd operations block for over 15 mins!

2019-10-21 Thread Void Star Nill
Apparently the graph is too big, so my last post is stuck. Resending
without the graph.

Thanks


-- Forwarded message -
From: Void Star Nill 
Date: Mon, Oct 21, 2019 at 4:41 PM
Subject: large concurrent rbd operations block for over 15 mins!
To: ceph-users 


Hello,

I have been running some benchmark tests with a mid-size cluster and I am
seeing some issues. Wanted to know if this is a bug or something that can
be tuned. Appreciate any help on this.

- I have a 15 node Ceph cluster, with 3 monitors and 12 data nodes with
total 61 OSDs on SSDs running 14.2.4 nautilus (stable) version. Each node
has 100G link.
- I have 245 client machines from which I am triggering rbd operations.
Each client has 25G link
- RBD operations include: creating an RBD image of 50G size with the layering
feature, mapping the image to the client machine, formatting the device as
ext4, mounting it, running dd to write the full disk, and cleaning up
(unmount, unmap and remove) - roughly the sequence sketched below.
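
A minimal per-client sketch of that sequence (pool and image names here are
just examples):

  #!/bin/bash
  set -e
  IMG=bench-$(hostname)
  rbd create --size 50G --image-feature layering rbd/$IMG
  DEV=$(rbd map rbd/$IMG)
  mkfs.ext4 -q "$DEV"
  mkdir -p /mnt/$IMG
  mount "$DEV" /mnt/$IMG
  # fill the image; dd exits non-zero on ENOSPC, which is expected here
  dd if=/dev/zero of=/mnt/$IMG/fill bs=4M oflag=direct || true
  umount /mnt/$IMG
  rbd unmap "$DEV"
  rbd rm rbd/$IMG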

If I run these RBD operations concurrently on a small number of machines
(say 16-20), they run very well and I see good throughput. All image
operations (except for dd) take less than 2 seconds.

However, when I scale it up to 245 clients, each running these operations
concurrently, I see a lot of operations hanging for a long time, and the
overall throughput drops drastically.

For example, some of the format operations take over 10-15 mins!!!

Note that all operations do complete - so it's most likely not a deadlock
kind of situation.

I don't see any errors in ceph.log on the monitor nodes. However, the
clients do report "hung_task_timeout" in dmesg logs.

As the graph (dropped from this resend) showed, half the format operations
complete in less than a second, while the other half take over 10 minutes
(y-axis in seconds).



[7.113618] INFO: task umount:9902 blocked for more than 120 seconds.
[7.113677]   Tainted: G   OE4.15.0-51-generic
#55~16.04.1-Ubuntu
[7.113731] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[7.113787] umount  D0  9902   9901 0x
[7.113793] Call Trace:
[7.113804]  __schedule+0x3d6/0x8b0
[7.113810]  ? _raw_spin_unlock_bh+0x1e/0x20
[7.113814]  schedule+0x36/0x80
[7.113821]  wb_wait_for_completion+0x64/0x90
[7.113828]  ? wait_woken+0x80/0x80
[7.113831]  __writeback_inodes_sb_nr+0x8e/0xb0
[7.113835]  writeback_inodes_sb+0x27/0x30
[7.113840]  __sync_filesystem+0x51/0x60
[7.113844]  sync_filesystem+0x26/0x40
[7.113850]  generic_shutdown_super+0x27/0x120
[7.113854]  kill_block_super+0x2c/0x80
[7.113858]  deactivate_locked_super+0x48/0x80
[7.113862]  deactivate_super+0x5a/0x60
[7.113866]  cleanup_mnt+0x3f/0x80
[7.113868]  __cleanup_mnt+0x12/0x20
[7.113874]  task_work_run+0x8a/0xb0
[7.113881]  exit_to_usermode_loop+0xc4/0xd0
[7.113885]  do_syscall_64+0x100/0x130
[7.113887]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[7.113891] RIP: 0033:0x7f0094384487
[7.113893] RSP: 002b:7fff4199efc8 EFLAGS: 0246 ORIG_RAX:
00a6
[7.113897] RAX:  RBX: 00944030 RCX:
7f0094384487
[7.113899] RDX: 0001 RSI:  RDI:
00944210
[7.113900] RBP: 00944210 R08:  R09:
0014
[7.113902] R10: 06b2 R11: 0246 R12:
7f009488d83c
[7.113903] R13:  R14:  R15:
7fff4199f250
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashed MDS (segfault)

2019-10-21 Thread Gustavo Tonini
Is there a possibility to lose data if I use "cephfs-data-scan init
--force-init"?

On Mon, Oct 21, 2019 at 4:36 AM Yan, Zheng  wrote:

> On Fri, Oct 18, 2019 at 9:10 AM Gustavo Tonini 
> wrote:
> >
> > Hi Zheng,
> > the cluster is running ceph mimic. This warning about network only
> appears when using nautilus' cephfs-journal-tool.
> >
> > "cephfs-data-scan scan_links" does not report any issue.
> >
> > How could variable "newparent" be NULL at
> https://github.com/ceph/ceph/blob/master/src/mds/SnapRealm.cc#L599 ? Is
> there a way to fix this?
> >
>
>
> try 'cephfs-data-scan init'. It will setup root inode's snaprealm.
>
> > On Thu, Oct 17, 2019 at 9:58 PM Yan, Zheng  wrote:
> >>
> >> On Thu, Oct 17, 2019 at 10:19 PM Gustavo Tonini <
> gustavoton...@gmail.com> wrote:
> >> >
> >> > No. The cluster was just rebalancing.
> >> >
> >> > The journal seems damaged:
> >> >
> >> > ceph@deployer:~$ cephfs-journal-tool --rank=fs_padrao:0 journal
> inspect
> >> > 2019-10-16 17:46:29.596 7fcd34cbf700 -1 NetHandler create_socket
> couldn't create socket (97) Address family not supported by protocol
> >>
> >> corrupted journal shouldn't cause error like this. This is more like
> >> network issue. please double check network config of your cluster.
> >>
> >> > Overall journal integrity: DAMAGED
> >> > Corrupt regions:
> >> > 0x1c5e4d904ab-1c5e4d9ddbc
> >> > ceph@deployer:~$
> >> >
> >> > Could a journal reset help with this?
> >> >
> >> > I could snapshot all FS pools and export the journal before to
> guarantee a rollback to this state if something goes wrong with jounal
> reset.
> >> >
> >> > On Thu, Oct 17, 2019, 09:07 Yan, Zheng  wrote:
> >> >>
> >> >> On Tue, Oct 15, 2019 at 12:03 PM Gustavo Tonini <
> gustavoton...@gmail.com> wrote:
> >> >> >
> >> >> > Dear ceph users,
> >> >> > we're experiencing a segfault during MDS startup (replay process)
> which is making our FS inaccessible.
> >> >> >
> >> >> > MDS log messages:
> >> >> >
> >> >> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c08f49700  1 -- 192.168.8.195:6800/3181891717 <== osd.26
> 192.168.8.209:6821/2419345 3  osd_op_reply(21 1. [getxattr]
> v0'0 uv0 ondisk = -61 ((61) No data available)) v8  154+0+0 (3715233608
> 0 0) 0x2776340 con 0x18bd500
> >> >> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c00589700 10 MDSIOContextBase::complete:
> 18C_IO_Inode_Fetched
> >> >> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched got 0 and 544
> >> >> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100)  magic is 'ceph fs
> volume v011' (expecting 'ceph fs volume v011')
> >> >> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c00589700 10  mds.0.cache.snaprealm(0x100 seq 1 0x1799c00)
> open_parents [1,head]
> >> >> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched [inode 0x100
> [...2,head] ~mds0/ auth v275131 snaprealm=0x1799c00 f(v0 1=1+0) n(v76166
> rc2020-07-17 15:29:27.00 b41838692297 -3184=-3168+-16)/n() (iversion
> lock) 0x18bf800]
> >> >> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c00589700 10 MDSIOContextBase::complete:
> 18C_IO_Inode_Fetched
> >> >> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x1) _fetched got 0 and 482
> >> >> > Oct 15 03:41:39.894891 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x1)  magic is 'ceph fs volume
> v011' (expecting 'ceph fs volume v011')
> >> >> > Oct 15 03:41:39.894958 mds1 ceph-mds:   -472> 2019-10-15
> 00:40:30.205 7f3c00589700 -1 *** Caught signal (Segmentation fault) **#012
> in thread 7f3c00589700 thread_name:fn_anonymous#012#012 ceph version 13.2.6
> (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)#012 1:
> (()+0x11390) [0x7f3c0e48a390]#012 2: (operator<<(std::ostream&, SnapRealm
> const&)+0x42) [0x72cb92]#012 3: (SnapRealm::merge_to(SnapRealm*)+0x308)
> [0x72f488]#012 4: (CInode::decode_snap_blob(ceph::buffer::list&)+0x53)
> [0x6e1f63]#012 5:
> (CInode::decode_store(ceph::buffer::list::iterator&)+0x76) [0x702b86]#012
> 6: (CInode::_fetched(ceph::buffer::list&, ceph::buffer::list&,
> Context*)+0x1b2) [0x702da2]#012 7: (MDSIOContextBase::complete(int)+0x119)
> [0x74fcc9]#012 8: (Finisher::finisher_thread_entry()+0x12e)
> [0x7f3c0ebffece]#012 9: (()+0x76ba) [0x7f3c0e4806ba]#012 10: (clone()+0x6d)
> [0x7f3c0dca941d]#012 NOTE: a copy of the executable, or `objdump -rdS
> ` is needed to interpret this.
> >> >> > Oct 15 03:41:39.895400 mds1 ceph-mds: --- logging levels ---
> >> >> > Oct 15 03:41:39.895473 mds1 ceph-mds:0/ 5 none
> >> >> > Oct 15 03:41:39.895473 mds1 ceph-mds:0/ 1 lockdep
> >> >> >
> >> >>
> >> >> looks like snap info for root inode is corrupted. did you do a

[ceph-users] Nautilus - inconsistent PGs - stat mismatch

2019-10-21 Thread Andras Pataki

We have a new ceph Nautilus setup (Nautilus from scratch - not upgraded):

# ceph versions
{
    "mon": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 169
    },
    "mds": {},
    "overall": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 175
    }
}

We only have CephFS on it with the two required triple replicated 
pools.  After creating the pools, we've increased the PG numbers on 
them, but did not turn autoscaling on:


# ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode warn last_change 2017 lfor 0/0/886 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 1995 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs


After a few days of running, we started seeing inconsistent placement 
groups:


# ceph pg dump | grep incons
dumped all
1.bb4  21 0 0 0 0 67108864 0 0 3059 3059 active+clean+inconsistent 2019-10-20 11:13:43.270022 5830'5655 5831:18901 [346,426,373] 346 [346,426,373] 346 5830'5655 2019-10-20 11:13:43.269992 1763'5424 2019-10-18 08:14:37.582180 0
1.795  29 0 0 0 0 96468992 0 0 3081 3081 active+clean+inconsistent 2019-10-20 17:06:45.876483 5830'5472 5831:17921 [468,384,403] 468 [468,384,403] 468 5830'5472 2019-10-20 17:06:45.876455 1763'5235 2019-10-18 08:16:07.166754 0
1.1fa  18 0 0 0 0 33554432 0 0 3065 3065 active+clean+inconsistent 2019-10-20 15:35:29.755622 5830'5268 5831:17139 [337,401,455] 337 [337,401,455] 337 5830'5268 2019-10-20 15:35:29.755588 1763'5084 2019-10-18 08:17:17.962888 0
1.579  26 0 0 0 0 75497472 0 0 3068 3068 active+clean+inconsistent 2019-10-20 21:45:42.914200 5830'5218 5831:15405 [477,364,332] 477 [477,364,332] 477 5830'5218 2019-10-20 21:45:42.914173 5830'5218 2019-10-19 12:13:53.259686 0
1.11c5 21 0 0 0 0 71303168 0 0 3010 3010 active+clean+inconsistent 2019-10-20 23:31:36.183053 5831'5183 5831:16214 [458,370,416] 458 [458,370,416] 458 5831'5183 2019-10-20 23:31:36.183030 5831'5183 2019-10-19 16:35:17.195721 0
1.128d 17 0 0 0 0 46137344 0 0 3073 3073 active+clean+inconsistent 2019-10-20 19:14:55.459236 5830'5368 5831:17584 [441,422,377] 441 [441,422,377] 441 5830'5368 2019-10-20 19:14:55.459209 1763'5110 2019-10-18 08:12:51.062548 0
1.19ef 16 0 0 0 0 41943040 0 0 3076 3076 active+clean+inconsistent 2019-10-20 23:33:02.020050 5830'5502 5831:18244 [323,431,439] 323 [323,431,439] 323 5830'5502 2019-10-20 23:33:02.020025 1763'5220 2019-10-18 08:12:51.117020 0


The logs look like this (the 1.bb4 PG for example):

2019-10-20 11:13:43.261 7fffd3633700  0 log_channel(cluster) log [DBG] : 
1.bb4 scrub starts
2019-10-20 11:13:43.265 7fffd3633700 -1 log_channel(cluster) log [ERR] : 
1.bb4 scrub : stat mismatch, got 21/21 objects, 0/0 clones, 21/21 dirty, 
0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
88080384/67108864 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-10-20 11:13:43.265 7fffd3633700 -1 log_channel(cluster) log [ERR] : 
1.bb4 scrub 1 errors
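
For anyone wanting to inspect such an error before repairing, the usual
approach is something along these lines:

  rados list-inconsistent-pg cephfs_data
  rados list-inconsistent-obj 1.bb4 --format=json-pretty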


It looks like doing a pg repair fixes the issue:

2019-10-21 09:17:50.125 7fffd3633700  0 log_channel(cluster) log [DBG] : 
1.bb4 repair starts
2019-10-21 09:17:50.653 7fffd3633700 -1 log_channel(cluster) log [ERR] : 
1.bb4 repair : stat mismatch, got 21/21 objects, 0/0 clones, 21/21 
dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
88080384/67108864 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-10-21 09:17:50.653 7fffd3633700 -1 log_channel(cluster) log [ERR] : 
1.bb4 repair 1 errors, 1 fixed


Is this a known issue with Nautilus?  We have other Luminous/Mimic 
clusters, where I haven't seen this come up.


Thanks,

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?

2019-10-21 Thread Salsa
Just to clarify my situation: we have 2 datacenters with 3 hosts each, and 12 4TB 
disks per host (2 are in a RAID with the OS installed and the remaining 10 are used 
for Ceph). Right now I'm trying a single-DC installation and intend to migrate to a 
multi-site setup mirroring DC1 to DC2, so if we lose DC1 we can activate DC2. (NOTE: 
I have no idea how this is set up and have not planned it at all; I thought of 
getting DC1 to work first and setting up the mirroring later.)

I don't think I'll be able to change the setup in any way, so my next question 
is: should I go with replica 3, or would an erasure-coded 2+1 profile be OK?

There's a very small chance we get 2 extra hosts for each DC in the near future, 
but we'll probably use up all the available storage space even sooner.

We're trying to use as much space as possible.
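
For reference, the CRUSH-rule approach from the links further down in this
thread (placing two EC chunks per host, so that a 4+2 profile fits on 3 hosts)
looks roughly like this - the rule name and id are placeholders:

  rule ec42_on_3_hosts {
          id 2
          type erasure
          min_size 3
          max_size 6
          step set_chooseleaf_tries 5
          step set_choose_tries 100
          step take default
          step choose indep 3 type host
          step chooseleaf indep 2 type osd
          step emit
  }

It would be compiled and injected with crushtool / `ceph osd setcrushmap -i`
after decompiling the map, as described in the luminous crush-map-edits doc
linked below. Note that losing one of the 3 hosts then takes out two chunks
at once.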

Thanks;

--
Salsa

Sent with [ProtonMail](https://protonmail.com) Secure Email.

‐‐‐ Original Message ‐‐‐
On Monday, October 21, 2019 2:53 AM, Martin Verges  
wrote:

> Just don't do such setups for production, It will be a lot of pain, trouble, 
> and cause you problems.
>
> Just take a cheap system, put some of the disks in it and do a way way better 
> deployment than something like 4+2 on 3 hosts. Whatever you do with that 
> cluster (example kernel update, reboot, PSU failure, ...) causes you and all 
> attached clients, especially bad with VMs on that Ceph cluster, to stop any 
> IO or even crash completely.
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
> Am Sa., 19. Okt. 2019 um 01:51 Uhr schrieb Chris Taylor :
>
>> Full disclosure - I have not created an erasure code pool yet!
>>
>> I have been wanting to do the same thing that you are attempting and
>> have these links saved. I believe this is what you are looking for.
>>
>> This link is for decompiling the CRUSH rules and recompiling:
>>
>> https://docs.ceph.com/docs/luminous/rados/operations/crush-map-edits/
>>
>> This link is for creating the EC rules for 4+2 with only 3 hosts:
>>
>> https://ceph.io/planet/erasure-code-on-small-clusters/
>>
>> I hope that helps!
>>
>> Chris
>>
>> On 2019-10-18 2:55 pm, Salsa wrote:
>>> Ok, I'm lost here.
>>>
>>> How am I supposed to write a crush rule?
>>>
>>> So far I managed to run:
>>>
>>> #ceph osd crush rule dump test -o test.txt
>>>
>>> So I can edit the rule. Now I have two problems:
>>>
>>> 1. Whats the functions and operations to use here? Is there
>>> documentation anywhere abuot this?
>>> 2. How may I create a crush rule using this file? 'ceph osd crush rule
>>> create ... -i test.txt' does not work.
>>>
>>> Am I taking the wrong approach here?
>>>
>>>
>>> --
>>> Salsa
>>>
>>> Sent with ProtonMail Secure Email.
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On Friday, October 18, 2019 3:56 PM, Paul Emmerich
>>>  wrote:
>>>
 Default failure domain in Ceph is "host" (see ec profile), i.e., you
 need at least k+m hosts (but at least k+m+1 is better for production
 setups).
 You can change that to OSD, but that's not a good idea for a
 production setup for obvious reasons. It's slightly better to write a
 crush rule that explicitly picks two disks on 3 different hosts

 Paul

 

 Paul Emmerich

 Looking for help with your Ceph cluster? Contact us at
 https://croit.io

 croit GmbH
 Freseniusstr. 31h
 81247 München
 www.croit.io
 Tel: +49 89 1896585 90

 On Fri, Oct 18, 2019 at 8:45 PM Salsa sa...@protonmail.com wrote:

 > I have probably misunderstood how to create erasure coded pools so I may 
 > be in need of some theory and appreciate if you can point me to 
 > documentation that may clarify my doubts.
 > I have so far 1 cluster with 3 hosts and 30 OSDs (10 each host).
 > I tried to create an erasure code profile like so:
 > "
 >
 > ceph osd erasure-code-profile get ec4x2rs
 >
 > ==
 >
 > crush-device-class=
 > crush-failure-domain=host
 > crush-root=default
 > jerasure-per-chunk-alignment=false
 > k=4
 > m=2
 > plugin=jerasure
 > technique=reed_sol_van
 > w=8
 > "
 > If I create a pool using this profile or any profile where K+M > hosts , 
 > then the pool gets stuck.
 > "
 >
 > ceph -s
 >
 > 
 >
 > cluster:
 > id: eb4a

[ceph-users] Getting rid of prometheus messages in /var/log/messages

2019-10-21 Thread Vladimir Brik

Hello

/var/log/messages on machines in our ceph cluster are inundated with 
entries from Prometheus scraping ("GET /metrics HTTP/1.1" 200 - "" 
"Prometheus/2.11.1")


Is it possible to configure ceph to not send those to syslog? If not, 
can I configure something so that none of ceph-mgr messages go to syslog 
and only go to /var/log/ceph/ceph-mgr.log?
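
(I guess the first things to try would be the per-daemon syslog switches, e.g.
in ceph.conf:

[mgr]
    log to syslog = false
    err to syslog = false

but I'm not sure whether those cover the mgr module's HTTP access logging.)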


Thanks,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Science User Group Call October

2019-10-21 Thread Kevin Hrpcek
Hello,

This Wednesday we'll have a ceph science user group call. This is an informal 
conversation focused on using ceph in htc/hpc and scientific research 
environments.

Call details copied from the event:

Wednesday October 23rd
14:00 UTC
4:00PM Central European
10:00AM Eastern American

Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: 
https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or 
bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US) See all numbers: 
https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? 
https://bluejeans.com/111

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Mon, Oct 21, 2019 at 7:58 PM Stefan Kooman  wrote:
>
> Quoting Yan, Zheng (uker...@gmail.com):
>
> > I double checked the code, but didn't find any clue. Can you compile
> > mds with a debug patch?
>
> Sure, I'll try to do my best to get a properly packaged Ceph Mimic
> 13.2.6 with the debug patch in it (and / or get help to get it build).
> Do you already have the patch (on github) somewhere?
>

please apply the following patch, thanks.

diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc
index c0f72d581d..2ca737470d 100644
--- a/src/mds/OpenFileTable.cc
+++ b/src/mds/OpenFileTable.cc
@@ -470,7 +470,11 @@ void OpenFileTable::commit(MDSInternalContextBase *c, uint64_t log_seq, int op_p
  }
  if (omap_idx < 0) {
++omap_num_objs;
-   assert(omap_num_objs <= MAX_OBJECTS);
+   if (omap_num_objs > MAX_OBJECTS) {
+ dout(1) << "omap_num_objs " << omap_num_objs << dendl;
+ dout(1) << "anchor_map size " << anchor_map.size() << dendl;
+ assert(omap_num_objs <= MAX_OBJECTS);
+   }
omap_num_items.resize(omap_num_objs);
omap_updates.resize(omap_num_objs);
omap_updates.back().clear = true;


> Thanks,
>
> Stefan
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-21 Thread Lei Liu
Hello Ilya and Paul,

Thanks for your reply. Yes, you are right: 0x7fddff8ee8cbffb did not come from
a kernel upgrade; it's reported by a docker container
(digitalocean/ceph_exporter) used for ceph monitoring.

Now upmap mode is enabled, client features:

"client": {
"group": {
"features": "0x7010fb86aa42ada",
"release": "jewel",
"num": 6
},
"group": {
"features": "0x3ffddff8eea4fffb",
"release": "luminous",
"num": 6
}
}
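
For anyone searching later, the switch itself was presumably along these lines:

  ceph osd set-require-min-compat-client luminous
  ceph balancer mode upmap
  ceph balancer on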

Thanks again.

Ilya Dryomov  于2019年10月21日周一 下午6:38写道:

> On Sat, Oct 19, 2019 at 2:00 PM Lei Liu  wrote:
> >
> > Hello llya,
> >
> > After updated client kernel version to 3.10.0-862 , ceph features shows:
> >
> > "client": {
> > "group": {
> > "features": "0x7010fb86aa42ada",
> > "release": "jewel",
> > "num": 5
> > },
> > "group": {
> > "features": "0x7fddff8ee8cbffb",
> > "release": "jewel",
> > "num": 1
> > },
> > "group": {
> > "features": "0x3ffddff8eea4fffb",
> > "release": "luminous",
> > "num": 6
> > },
> > "group": {
> > "features": "0x3ffddff8eeacfffb",
> > "release": "luminous",
> > "num": 1
> > }
> > }
> >
> > both 0x7fddff8ee8cbffb and 0x7010fb86aa42ada are reported by new kernel
> client.
> >
> > Is it now possible to force set-require-min-compat-client to be
> luminous, if not how to fix it?
>
> No, you haven't upgraded the one with features 0x7fddff8ee8cbffb (or
> rather it looks like you have upgraded it from 0x7fddff8ee84bffb, but
> to a version that is still too old).
>
> What exactly did you do on that machine?  That change doesn't look like
> it came from a kernel upgrade.  What is the output of "uname -a" there?
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph balancer do not start

2019-10-21 Thread Jan Peters
Hello,

I use Ceph 12.2.12 and would like to activate the ceph balancer.

Unfortunately, no redistribution of the PGs is started:

ceph balancer status
{
"active": true,
"plans": [],
"mode": "crush-compat"
}

ceph balancer eval
current cluster score 0.023776 (lower is better)


ceph config-key dump
{
"initial_mon_keyring":
"AQBLchlbABAA+5CuVU+8MB69xfc3xAXkjQ==",
"mgr/balancer/active": "1",
"mgr/balancer/max_misplaced:": "0.01",
"mgr/balancer/mode": "crush-compat"
}


What am I not doing correctly?

best regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com):

> I double checked the code, but didn't find any clue. Can you compile
> mds with a debug patch?

Sure, I'll try to do my best to get a properly packaged Ceph Mimic
13.2.6 with the debug patch in it (and / or get help to get it build).
Do you already have the patch (on github) somewhere?

Thanks,

Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Mon, Oct 21, 2019 at 4:33 PM Stefan Kooman  wrote:
>
> Quoting Yan, Zheng (uker...@gmail.com):
>
> > delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank
> > of the crashed mds)
>
> OK, MDS crashed again, restarted. I stopped it, deleted the object and
> restarted the MDS. It became active right away.
>
> Any idea on why the openfiles list (object) becomes corrupted? As in:
> have a bugfix in place?
>

I double checked the code, but didn't find any clue. Can you compile
mds with a debug patch?

> Thanks!
>
> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-21 Thread Ilya Dryomov
On Sat, Oct 19, 2019 at 2:00 PM Lei Liu  wrote:
>
> Hello llya,
>
> After updated client kernel version to 3.10.0-862 , ceph features shows:
>
> "client": {
> "group": {
> "features": "0x7010fb86aa42ada",
> "release": "jewel",
> "num": 5
> },
> "group": {
> "features": "0x7fddff8ee8cbffb",
> "release": "jewel",
> "num": 1
> },
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 6
> },
> "group": {
> "features": "0x3ffddff8eeacfffb",
> "release": "luminous",
> "num": 1
> }
> }
>
> both 0x7fddff8ee8cbffb and 0x7010fb86aa42ada are reported by new kernel 
> client.
>
> Is it now possible to force set-require-min-compat-client to be luminous, if 
> not how to fix it?

No, you haven't upgraded the one with features 0x7fddff8ee8cbffb (or
rather it looks like you have upgraded it from 0x7fddff8ee84bffb, but
to a version that is still too old).

What exactly did you do on that machine?  That change doesn't look like
it came from a kernel upgrade.  What is the output of "uname -a" there?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hanging slow requests: failed to authpin, subtree is being exported

2019-10-21 Thread Marc Roos
 
I think I am having this issue also (at least I had it with Luminous). I had 
to remove the hidden temp files rsync had left when the cephfs mount 
'stalled'; otherwise I would never have been able to complete the rsync.


-Original Message-
Cc: ceph-users
Subject: Re: [ceph-users] hanging slow requests: failed to authpin, 
subtree is being exported


I've made a ticket for this issue: https://tracker.ceph.com/issues/42338

Thanks again!

K

On 15/10/2019 18:00, Kenneth Waegeman wrote:
> Hi Robert, all,
>
>
> On 23/09/2019 17:37, Robert LeBlanc wrote:
>> On Mon, Sep 23, 2019 at 4:14 AM Kenneth Waegeman 
>>  wrote:
>>> Hi all,
>>>
>>> When syncing data with rsync, I'm often getting blocked slow 
>>> requests, which also block access to this path.
>>>
 2019-09-23 11:25:49.477 7f4f401e8700 0 log_channel(cluster) log 
 [WRN]
 : slow request 31.895478 seconds old, received at 2019-09-23
 11:25:17.598152: client_request(client.38352684:92684 lookup
 #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:26:19.477 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 61.896079 seconds old, received at 2019-09-23
 11:25:17.598152: client_request(client.38352684:92684 lookup
 #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:27:19.478 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 121.897268 seconds old, received at 2019-09-23
 11:25:17.598152: client_request(client.38352684:92684 lookup
 #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:29:19.488 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 241.899467 seconds old, received at 2019-09-23
 11:25:17.598152: client_request(client.38352684:92684 lookup
 #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:33:19.680 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 482.087927 seconds old, received at 2019-09-23
 11:25:17.598152: client_request(client.38352684:92684 lookup
 #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:36:09.881 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 32.677511 seconds old, received at 2019-09-23
 11:35:37.217113: client_request(client.38347357:111963 lookup 
 #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:36:39.881 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 62.678132 seconds old, received at 2019-09-23
 11:35:37.217113: client_request(client.38347357:111963 lookup 
 #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:37:39.891 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 122.679273 seconds old, received at 2019-09-23
 11:35:37.217113: client_request(client.38347357:111963 lookup 
 #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:39:39.892 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 242.684667 seconds old, received at 2019-09-23
 11:35:37.217113: client_request(client.38347357:111963 lookup 
 #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:41:19.893 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 962.305681 seconds old, received at 2019-09-23
 11:25:17.598152: client_request(client.38352684:92684 lookup
 #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:43:39.923 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 482.712888 seconds old, received at 2019-09-23
 11:35:37.217113: client_request(client.38347357:111963 lookup 
 #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
 caller_gid=0{0,}) currently failed to authpin, subtree is being 
 exported
 2019-09-23 11:51:40.236 7f4f401e8700  0 log_channel(cluster) log 
 [WRN]
 : slow request 963.037049 seconds old, received at 2019-09-23
 11:35:37.217113: client_re

Re: [ceph-users] hanging slow requests: failed to authpin, subtree is being exported

2019-10-21 Thread Kenneth Waegeman


I've made a ticket for this issue: https://tracker.ceph.com/issues/42338

Thanks again!

K

On 15/10/2019 18:00, Kenneth Waegeman wrote:

Hi Robert, all,


On 23/09/2019 17:37, Robert LeBlanc wrote:

On Mon, Sep 23, 2019 at 4:14 AM Kenneth Waegeman
 wrote:

Hi all,

When syncing data with rsync, I'm often getting blocked slow requests,
which also block access to this path.


2019-09-23 11:25:49.477 7f4f401e8700 0 log_channel(cluster) log [WRN]
: slow request 31.895478 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:26:19.477 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 61.896079 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:27:19.478 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 121.897268 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:29:19.488 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 241.899467 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:33:19.680 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 482.087927 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:36:09.881 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 32.677511 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:36:39.881 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 62.678132 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:37:39.891 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 122.679273 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:39:39.892 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 242.684667 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:41:19.893 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 962.305681 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:43:39.923 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 482.712888 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:51:40.236 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 963.037049 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 11:57:20.308 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 1922.719287 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being 
exported

2019-09-23 12:07:40.621 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 1923.409501 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) cur

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
 
The 'xx-*.conf' files are mine, custom, so that I don't have to merge 
changes with newer /etc/collectd.conf rpm updates. 

I would suggest getting a small configuration that works, setting debug 
logging [0], and growing the configuration in small steps until it fails: 
load the ceph plugin empty, then configure one osd - fails? try mgr, try 
mon, etc.


[0] Try something like this?
LoadPlugin logfile
<Plugin logfile>
#not compiled with debug?
#also not writing to the logfile
  LogLevel debug
  File "/tmp/collectd.log"
#  File STDOUT
  Timestamp true
#  PrintSeverity false
</Plugin>


-Original Message-
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] collectd Ceph metric

Is there any instruction to install the plugin configuration?

Attach my RHEL/collectd configuration file under /etc/ directory.
On RHEL:
[rdma@rdmarhel0 collectd.d]$ pwd
/etc/collectd.d
[rdma@rdmarhel0 collectd.d]$ tree .
.

0 directories, 0 files
[rdma@rdmarhel0 collectd.d]$

I've also checked the collectd configuration file under Ubuntu; there's 
no 11-network.conf or 12-memory.conf etc. However, it could still 
collect the cpu and memory information.
On Ubuntu:
   nstcc1@nstcloudcc1:collectd$ pwd
   /etc/collectd
   nstcc1@nstcloudcc1:collectd$ tree .
   .
   ├── collectd.conf
   ├── collectd.conf.d
   │   ├── filters.conf
   │   └── thresholds.conf
   └── collection.conf

   1 directory, 4 files
   nstcc1@nstcloudcc1:collectd$

B.R.
Changcheng

On 10:56 Mon 21 Oct, Marc Roos wrote:
> 
> Your collectd starts without the ceph plugin ok? 
> 
> I have also your error " didn't register a configuration callback", 
> because I configured debug logging, but did not enable it by loading 
> the plugin 'logfile'. Maybe it is the order in which your 
> configuration files a read (I think this used to be important with 
> collectd)
> 
> I have only in my collectd.conf these two lines:
> Include "/etc/collectd.d"
> LoadPlugin syslog
> 
> And in /etc/collectd.d/ these files:
> 10-custom.conf(with network section for influx)
> 11-network.conf   (ethx)
> 12-memory.conf
> 50-ceph.conf
> 51-ipmi.conf
> 52-ipmi-power.conf
> 53-disk.conf
> 
> 
> 
> 
> [0] journalctl -u collectd.service
> Oct 21 10:29:39 c02 collectd[1017750]: Exiting normally.
> Oct 21 10:29:39 c02 systemd[1]: Stopping Collectd statistics daemon...
> Oct 21 10:29:39 c02 collectd[1017750]: collectd: Stopping 5 read 
> threads.
> Oct 21 10:29:39 c02 collectd[1017750]: collectd: Stopping 5 write 
> threads.
> Oct 21 10:29:40 c02 systemd[1]: Stopped Collectd statistics daemon.
> Oct 21 10:29:53 c02 systemd[1]: Starting Collectd statistics daemon...
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "syslog" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "network" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
> `logfile' plugin, but the plugin isn't loaded or didn't register a 
> configuration callback.
> Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
> `logfile' plugin, but the plugin isn't loaded or didn't register a 
> configuration callback.
> Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
> `logfile' plugin, but the plugin isn't loaded or didn't register a 
> configuration callback.
> Oct 21 10:29:53 c02 collectd[1031939]: network plugin: The 
> `MaxPacketSize' must be between 1024 and 65535.
> Oct 21 10:29:53 c02 collectd[1031939]: network plugin: Option 
> `CacheFlush' is not allowed here.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "interface" 

> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "memory" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "ceph" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "ipmi" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: ipmi plugin: Legacy 
> configuration found! Please update your config file.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "exec" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "disk" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: Systemd detected, trying to 
> signal readyness.
> Oct 21 10:29:53 c02 systemd[1]: Started Collectd statistics daemon.
> Oct 21 10:29:53 c02 collectd[1031939]: Initialization complete, 
> entering read-loop.
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor `PS2 Status power_supply (10.2)` of `main`, because it 
> is discrete (0x8)! Its type:
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor ` management controller firmware (46.4)` of `main`, 
> because it isn't readable! Its Oct 21 10:30:04 c02 collectd[1031939]: 
> ipmi plugin: sensor_list_add:
> Ignore sensor ` management controller firmware (46.3)` of `main`, 
> because it isn't

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
Is there any instruction to install the plugin configuration?

Attach my RHEL/collectd configuration file under /etc/ directory.
On RHEL:
[rdma@rdmarhel0 collectd.d]$ pwd
/etc/collectd.d
[rdma@rdmarhel0 collectd.d]$ tree .
.

0 directories, 0 files
[rdma@rdmarhel0 collectd.d]$

I've also checked the collectd configuration file under Ubuntu; there's no 
11-network.conf or 12-memory.conf etc. However, it could still collect the cpu 
and memory information.
On Ubuntu:
   nstcc1@nstcloudcc1:collectd$ pwd
   /etc/collectd
   nstcc1@nstcloudcc1:collectd$ tree .
   .
   ├── collectd.conf
   ├── collectd.conf.d
   │   ├── filters.conf
   │   └── thresholds.conf
   └── collection.conf

   1 directory, 4 files
   nstcc1@nstcloudcc1:collectd$

B.R.
Changcheng

On 10:56 Mon 21 Oct, Marc Roos wrote:
> 
> Your collectd starts without the ceph plugin ok? 
> 
> I have also your error " didn't register a configuration callback", 
> because I configured debug logging, but did not enable it by loading the 
> plugin 'logfile'. Maybe it is the order in which your configuration 
> files a read (I think this used to be important with collectd)
> 
> I have only in my collectd.conf these two lines:
> Include "/etc/collectd.d"
> LoadPlugin syslog
> 
> And in /etc/collectd.d/ these files:
> 10-custom.conf(with network section for influx)
> 11-network.conf   (ethx)
> 12-memory.conf
> 50-ceph.conf
> 51-ipmi.conf
> 52-ipmi-power.conf
> 53-disk.conf
> 
> 
> 
> 
> [0] journalctl -u collectd.service
> Oct 21 10:29:39 c02 collectd[1017750]: Exiting normally.
> Oct 21 10:29:39 c02 systemd[1]: Stopping Collectd statistics daemon...
> Oct 21 10:29:39 c02 collectd[1017750]: collectd: Stopping 5 read 
> threads.
> Oct 21 10:29:39 c02 collectd[1017750]: collectd: Stopping 5 write 
> threads.
> Oct 21 10:29:40 c02 systemd[1]: Stopped Collectd statistics daemon.
> Oct 21 10:29:53 c02 systemd[1]: Starting Collectd statistics daemon...
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "syslog" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "network" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
> `logfile' plugin, but the plugin isn't loaded or didn't register a 
> configuration callback.
> Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
> `logfile' plugin, but the plugin isn't loaded or didn't register a 
> configuration callback.
> Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
> `logfile' plugin, but the plugin isn't loaded or didn't register a 
> configuration callback.
> Oct 21 10:29:53 c02 collectd[1031939]: network plugin: The 
> `MaxPacketSize' must be between 1024 and 65535.
> Oct 21 10:29:53 c02 collectd[1031939]: network plugin: Option 
> `CacheFlush' is not allowed here.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "interface" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "memory" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "ceph" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "ipmi" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: ipmi plugin: Legacy configuration 
> found! Please update your config file.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "exec" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "disk" 
> successfully loaded.
> Oct 21 10:29:53 c02 collectd[1031939]: Systemd detected, trying to 
> signal readyness.
> Oct 21 10:29:53 c02 systemd[1]: Started Collectd statistics daemon.
> Oct 21 10:29:53 c02 collectd[1031939]: Initialization complete, entering 
> read-loop.
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor `PS2 Status power_supply (10.2)` of `main`, because it is 
> discrete (0x8)! Its type:
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor ` management controller firmware (46.4)` of `main`, 
> because it isn't readable! Its
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor ` management controller firmware (46.3)` of `main`, 
> because it isn't readable! Its
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor ` management controller firmware (46.2)` of `main`, 
> because it isn't readable! Its
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor ` management controller firmware (46.1)` of `main`, 
> because it isn't readable! Its
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor `PS1 Status power_supply (10.1)` of `main`, because it is 
> discrete (0x8)! Its type:
> Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
> Ignore sensor `Chassis Intru system_chassis (23.1)` of `main`, because 
> it is discrete 

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos


Your collectd starts without the ceph plugin ok? 

I have also your error " didn't register a configuration callback", 
because I configured debug logging, but did not enable it by loading the 
plugin 'logfile'. Maybe it is the order in which your configuration 
files a read (I think this used to be important with collectd)

I have only in my collectd.conf these two lines:
Include "/etc/collectd.d"
LoadPlugin syslog

And in /etc/collectd.d/ these files:
10-custom.conf(with network section for influx)
11-network.conf   (ethx)
12-memory.conf
50-ceph.conf
51-ipmi.conf
52-ipmi-power.conf
53-disk.conf




[0] journalctl -u collectd.service
Oct 21 10:29:39 c02 collectd[1017750]: Exiting normally.
Oct 21 10:29:39 c02 systemd[1]: Stopping Collectd statistics daemon...
Oct 21 10:29:39 c02 collectd[1017750]: collectd: Stopping 5 read 
threads.
Oct 21 10:29:39 c02 collectd[1017750]: collectd: Stopping 5 write 
threads.
Oct 21 10:29:40 c02 systemd[1]: Stopped Collectd statistics daemon.
Oct 21 10:29:53 c02 systemd[1]: Starting Collectd statistics daemon...
Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "syslog" 
successfully loaded.
Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "network" 
successfully loaded.
Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
`logfile' plugin, but the plugin isn't loaded or didn't register a 
configuration callback.
Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
`logfile' plugin, but the plugin isn't loaded or didn't register a 
configuration callback.
Oct 21 10:29:53 c02 collectd[1031939]: Found a configuration for the 
`logfile' plugin, but the plugin isn't loaded or didn't register a 
configuration callback.
Oct 21 10:29:53 c02 collectd[1031939]: network plugin: The 
`MaxPacketSize' must be between 1024 and 65535.
Oct 21 10:29:53 c02 collectd[1031939]: network plugin: Option 
`CacheFlush' is not allowed here.
Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "interface" 
successfully loaded.
Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "memory" 
successfully loaded.
Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "ceph" 
successfully loaded.
Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "ipmi" 
successfully loaded.
Oct 21 10:29:53 c02 collectd[1031939]: ipmi plugin: Legacy configuration 
found! Please update your config file.
Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "exec" 
successfully loaded.
Oct 21 10:29:53 c02 collectd[1031939]: plugin_load: plugin "disk" 
successfully loaded.
Oct 21 10:29:53 c02 collectd[1031939]: Systemd detected, trying to 
signal readyness.
Oct 21 10:29:53 c02 systemd[1]: Started Collectd statistics daemon.
Oct 21 10:29:53 c02 collectd[1031939]: Initialization complete, entering 
read-loop.
Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
Ignore sensor `PS2 Status power_supply (10.2)` of `main`, because it is 
discrete (0x8)! Its type:
Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
Ignore sensor ` management controller firmware (46.4)` of `main`, 
because it isn't readable! Its
Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
Ignore sensor ` management controller firmware (46.3)` of `main`, 
because it isn't readable! Its
Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
Ignore sensor ` management controller firmware (46.2)` of `main`, 
because it isn't readable! Its
Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
Ignore sensor ` management controller firmware (46.1)` of `main`, 
because it isn't readable! Its
Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
Ignore sensor `PS1 Status power_supply (10.1)` of `main`, because it is 
discrete (0x8)! Its type:
Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
Ignore sensor `Chassis Intru system_chassis (23.1)` of `main`, because 
it is discrete (0x5)! Its
Oct 21 10:30:04 c02 collectd[1031939]: ipmi plugin: sensor_list_add: 
Ignore sensor `HDD Status disk_drive_bay (26.1)` of `main`, because it 
is discrete (0xd)! Its typ
Oct 21 10:30:55 c02 collectd[1031939]: ipmi plugin: sensor_read_handler: 
sensor `FANB fan_cooling (29.8)` of `main` not present.
Oct 21 10:30:55 c02 collectd[1031939]: ipmi plugin: sensor_read_handler: 
sensor `FANA fan_cooling (29.7)` of `main` not present.
Oct 21 10:30:55 c02 collectd[1031939]: ipmi plugin: sensor_read_handler: 
sensor `FAN6 fan_cooling (29.6)` of `main` not present.
Oct 21 10:30:55 c02 collectd[1031939]: ipmi plugin: sensor_read_handler: 
sensor `FAN5 fan_cooling (29.5)` of `main` not present.
Oct 21 10:30:55 c02 collectd[1031939]: ipmi plugin: sensor_read_handler: 
sensor `FAN1 fan_cooling (29.1)` of `main` not present.
Oct 21 10:30:55 c02 collectd[1031939]: ipmi plugin: sensor_read_handler: 
sensor `P2-DIMMH3 TEMP memory_device (32.94)` of `main` not present.
Oct 21 10:30:55 c02 collectd[1031

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com):

> delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank
> of the crashed mds)

OK, MDS crashed again, restarted. I stopped it, deleted the object and
restarted the MDS. It became active right away.

Any idea why the openfiles list (object) becomes corrupted? As in: is
there a bugfix in place?

Thanks!

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
On 10:16 Mon 21 Oct, Marc Roos wrote:
> I have the same. I do not think ConvertSpecialMetricTypes is necessary. 
> 
> 
> <LoadPlugin ceph>
>   Globals true
> </LoadPlugin>
> 
> <Plugin ceph>
>   LongRunAvgLatency false
>   ConvertSpecialMetricTypes true
>   <Daemon "osd.1">
>     SocketPath "/var/run/ceph/ceph-osd.1.asok"
>   </Daemon>
> </Plugin>
Same configuration, but I get the error below after "systemctl restart collectd".
Have you ever hit this error before?

=Log start===
Oct 21 16:22:52 rdmarhel0 collectd[768000]: Found a configuration for the 
`ceph' plugin, but the plugin isn't loaded or didn't register a configuration 
callback.
Oct 21 16:22:52 rdmarhel0 systemd[1]: Unit collectd.service entered failed 
state.
Oct 21 16:22:52 rdmarhel0 collectd[768000]: Found a configuration for the 
`ceph' plugin, but the plugin isn't loaded or didn't register a configuration 
callback.
Oct 21 16:22:52 rdmarhel0 systemd[1]: collectd.service failed.
Oct 21 16:22:52 rdmarhel0 collectd[768000]: There is a `Daemon' block within 
the configuration for the ceph plugin. The plugin either only expects "simple" 
configuration statements or wasn
Oct 21 16:22:52 rdmarhel0 systemd[1]: collectd.service holdoff time over, 
scheduling restart.
Oct 21 16:22:52 rdmarhel0 systemd[1]: Stopped Collectd statistics daemon.
-- Subject: Unit collectd.service has finished shutting down
=Log end===
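
A few checks that may help narrow this down (a sketch, assuming an RPM-based 
install; adjust paths for your distro):

# is the ceph plugin shared object actually installed?
ls -l /usr/lib64/collectd/ceph.so
# parse-check the config without starting the daemon
collectd -t -C /etc/collectd.conf
# the real failure is usually logged right before the restart
journalctl -u collectd.service -n 100 --no-pager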

B.R.
Changcheng
> 
> 
> -Original Message-
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] collectd Ceph metric
> 
> On 09:50 Mon 21 Oct, Marc Roos wrote:
> > 
> > I am, collectd with luminous, and upgraded to nautilus and collectd
> > 5.8.1-1.el7 this weekend. Maybe increase logging or so. 
> > I had to wait a long time before collectd was supporting the luminous 
> > release, maybe it is the same with octopus (=15?)
> > 
> @Roos: Do you mean that you could run collectd(5.8.1) with 
> Ceph-Nautilus? Below is my collectd configuration with Ceph-Octopus:
> 
> 
> 
>   
> SocketPath "/var/run/ceph/ceph-osd.0.asok"
>   
> 
> 
> Is there anything wrong?
> 
> > 
> >  
> > 
> > -Original Message-
> > From: Liu, Changcheng [mailto:changcheng@intel.com]
> > Sent: maandag 21 oktober 2019 9:41
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] collectd Ceph metric
> > 
> > Hi all,
> >Does anyone succeed to use collectd/ceph plugin to collect ceph
> >cluster data?
> >I'm using collectd(5.8.1) and Ceph-15.0.0. collectd failed to get
> >cluster data with below error:
> >"collectd.service holdoff time over, scheduling restart"
> > 
> > Regards,
> > Changcheng
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
I have the same. I do not think ConvertSpecialMetricTypes is necessary. 


<LoadPlugin ceph>
  Globals true
</LoadPlugin>

<Plugin ceph>
  LongRunAvgLatency false
  ConvertSpecialMetricTypes true
  <Daemon "osd.1">
    SocketPath "/var/run/ceph/ceph-osd.1.asok"
  </Daemon>
</Plugin>


-Original Message-
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] collectd Ceph metric

On 09:50 Mon 21 Oct, Marc Roos wrote:
> 
> I am, collectd with luminous, and upgraded to nautilus and collectd
> 5.8.1-1.el7 this weekend. Maybe increase logging or so. 
> I had to wait a long time before collectd was supporting the luminous 
> release, maybe it is the same with octopus (=15?)
> 
@Roos: Do you mean that you could run collectd(5.8.1) with 
Ceph-Nautilus? Below is my collectd configuration with Ceph-Octopus:



  
SocketPath "/var/run/ceph/ceph-osd.0.asok"
  


Is there anything wrong?

> 
>  
> 
> -Original Message-
> From: Liu, Changcheng [mailto:changcheng@intel.com]
> Sent: maandag 21 oktober 2019 9:41
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] collectd Ceph metric
> 
> Hi all,
>Does anyone succeed to use collectd/ceph plugin to collect ceph
>cluster data?
>I'm using collectd(5.8.1) and Ceph-15.0.0. collectd failed to get
>cluster data with below error:
>"collectd.service holdoff time over, scheduling restart"
> 
> Regards,
> Changcheng
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
On 09:50 Mon 21 Oct, Marc Roos wrote:
> 
> I am, collectd with luminous, and upgraded to nautilus and collectd 
> 5.8.1-1.el7 this weekend. Maybe increase logging or so. 
> I had to wait a long time before collectd was supporting the luminous 
> release, maybe it is the same with octopus (=15?)
> 
@Roos: Do you mean that you could run collectd(5.8.1) with Ceph-Nautilus? Below 
is my collectd configuration with Ceph-Octopus:



  
SocketPath "/var/run/ceph/ceph-osd.0.asok"
  


Is there anything wrong?

> 
>  
> 
> -Original Message-
> From: Liu, Changcheng [mailto:changcheng@intel.com] 
> Sent: maandag 21 oktober 2019 9:41
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] collectd Ceph metric
> 
> Hi all,
>Does anyone succeed to use collectd/ceph plugin to collect ceph
>cluster data?
>I'm using collectd(5.8.1) and Ceph-15.0.0. collectd failed to get
>cluster data with below error:
>"collectd.service holdoff time over, scheduling restart"
> 
> Regards,
> Changcheng
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com):

> delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank
> of the crashed mds)

Just to make sure I understand correctly. Current status is that the MDS
is active (no standby for now) and not in a "crashed" state (although it
has crashed at least 10 times now).

Is the following what you want me to do, and safe to do in this
situation?

1) Stop running (active) MDS
2) delete object 'mdsX_openfiles.0' from cephfs metadata pool

Thanks,

Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos


I am, running collectd with Luminous, and I upgraded to Nautilus and collectd 
5.8.1-1.el7 this weekend. Maybe increase the logging or similar. 
I had to wait a long time before collectd supported the Luminous 
release; maybe it is the same with Octopus (=15?)


 

-Original Message-
From: Liu, Changcheng [mailto:changcheng@intel.com] 
Sent: maandag 21 oktober 2019 9:41
To: ceph-users@lists.ceph.com
Subject: [ceph-users] collectd Ceph metric

Hi all,
   Does anyone succeed to use collectd/ceph plugin to collect ceph
   cluster data?
   I'm using collectd(5.8.1) and Ceph-15.0.0. collectd failed to get
   cluster data with below error:
   "collectd.service holdoff time over, scheduling restart"

Regards,
Changcheng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Sun, Oct 20, 2019 at 1:53 PM Stefan Kooman  wrote:
>
> Dear list,
>
> Quoting Stefan Kooman (ste...@bit.nl):
>
> > I wonder if this situation is more likely to be hit on Mimic 13.2.6 than
> > on any other system.
> >
> > Any hints / help to prevent this from happening?
>
> We have had this happening another two times now. In both cases the MDS
> recovers, becomes active (for a few seconds), and crashes again. It won't
> come out of this loop by itself. When put in debug mode (debug_mds =
> 10/10) we won't hit the bug and it stays active. After a few minutes we
> disable debug (live, ceph tell mds.* config set debug_mds 0/0) and it
> keeps running (Heisenbug)... until hours later when it crashes again and the 
> story
> repeats itself.
>
> So unfortunately no more debug information available, but at least a
> workaround to get it active again.
>

delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank
of the crashed mds)
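
For example, a sketch assuming rank 0 and a metadata pool named
"cephfs_metadata" (substitute your own rank and pool name):

rados -p cephfs_metadata rm mds0_openfiles.0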


> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
Hi all,
   Has anyone succeeded in using the collectd ceph plugin to collect Ceph
   cluster data?
   I'm using collectd(5.8.1) and Ceph-15.0.0. collectd fails to get
   cluster data with the error below:
   "collectd.service holdoff time over, scheduling restart"

Regards,
Changcheng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashed MDS (segfault)

2019-10-21 Thread Yan, Zheng
On Fri, Oct 18, 2019 at 9:10 AM Gustavo Tonini  wrote:
>
> Hi Zheng,
> the cluster is running ceph mimic. This warning about network only appears 
> when using nautilus' cephfs-journal-tool.
>
> "cephfs-data-scan scan_links" does not report any issue.
>
> How could variable "newparent" be NULL at 
> https://github.com/ceph/ceph/blob/master/src/mds/SnapRealm.cc#L599 ? Is there 
> a way to fix this?
>


try 'cephfs-data-scan init'. It will set up the root inode's snaprealm.
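
A sketch of that step (typically run during recovery with the MDS stopped /
the filesystem offline; check the disaster-recovery documentation for your
release before running it):

# recreates the root and MDS directory inodes (including their snaprealm
# metadata) in the metadata pool if they are missing
cephfs-data-scan init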

> On Thu, Oct 17, 2019 at 9:58 PM Yan, Zheng  wrote:
>>
>> On Thu, Oct 17, 2019 at 10:19 PM Gustavo Tonini  
>> wrote:
>> >
>> > No. The cluster was just rebalancing.
>> >
>> > The journal seems damaged:
>> >
>> > ceph@deployer:~$ cephfs-journal-tool --rank=fs_padrao:0 journal inspect
>> > 2019-10-16 17:46:29.596 7fcd34cbf700 -1 NetHandler create_socket couldn't 
>> > create socket (97) Address family not supported by protocol
>>
>> a corrupted journal shouldn't cause an error like this. This is more like a
>> network issue. please double check the network config of your cluster.
>>
>> > Overall journal integrity: DAMAGED
>> > Corrupt regions:
>> > 0x1c5e4d904ab-1c5e4d9ddbc
>> > ceph@deployer:~$
>> >
>> > Could a journal reset help with this?
>> >
>> > I could snapshot all FS pools and export the journal beforehand to guarantee a 
>> > rollback to this state if something goes wrong with the journal reset.
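
A hedged sketch of the export step, reusing the rank syntax quoted above (the
output file name is just an example):

cephfs-journal-tool --rank=fs_padrao:0 journal export /root/fs_padrao.0.journal.bin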
>> >
>> > On Thu, Oct 17, 2019, 09:07 Yan, Zheng  wrote:
>> >>
>> >> On Tue, Oct 15, 2019 at 12:03 PM Gustavo Tonini  
>> >> wrote:
>> >> >
>> >> > Dear ceph users,
>> >> > we're experiencing a segfault during MDS startup (replay process) which 
>> >> > is making our FS inaccessible.
>> >> >
>> >> > MDS log messages:
>> >> >
>> >> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c08f49700  1 -- 192.168.8.195:6800/3181891717 <== osd.26 
>> >> > 192.168.8.209:6821/2419345 3  osd_op_reply(21 1. [getxattr] 
>> >> > v0'0 uv0 ondisk = -61 ((61) No data available)) v8  154+0+0 
>> >> > (3715233608 0 0) 0x2776340 con 0x18bd500
>> >> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c00589700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
>> >> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched got 0 and 544
>> >> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c00589700 10 mds.0.cache.ino(0x100)  magic is 'ceph fs volume v011' 
>> >> > (expecting 'ceph fs volume v011')
>> >> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c00589700 10  mds.0.cache.snaprealm(0x100 seq 1 0x1799c00) 
>> >> > open_parents [1,head]
>> >> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched [inode 0x100 
>> >> > [...2,head] ~mds0/ auth v275131 snaprealm=0x1799c00 f(v0 1=1+0) 
>> >> > n(v76166 rc2020-07-17 15:29:27.00 b41838692297 -3184=-3168+-16)/n() 
>> >> > (iversion lock) 0x18bf800]
>> >> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c00589700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
>> >> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c00589700 10 mds.0.cache.ino(0x1) _fetched got 0 and 482
>> >> > Oct 15 03:41:39.894891 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
>> >> > 7f3c00589700 10 mds.0.cache.ino(0x1)  magic is 'ceph fs volume v011' 
>> >> > (expecting 'ceph fs volume v011')
>> >> > Oct 15 03:41:39.894958 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.205 
>> >> > 7f3c00589700 -1 *** Caught signal (Segmentation fault) **#012 in thread 
>> >> > 7f3c00589700 thread_name:fn_anonymous#012#012 ceph version 13.2.6 
>> >> > (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)#012 1: 
>> >> > (()+0x11390) [0x7f3c0e48a390]#012 2: (operator<<(std::ostream&, 
>> >> > SnapRealm const&)+0x42) [0x72cb92]#012 3: 
>> >> > (SnapRealm::merge_to(SnapRealm*)+0x308) [0x72f488]#012 4: 
>> >> > (CInode::decode_snap_blob(ceph::buffer::list&)+0x53) [0x6e1f63]#012 5: 
>> >> > (CInode::decode_store(ceph::buffer::list::iterator&)+0x76) 
>> >> > [0x702b86]#012 6: (CInode::_fetched(ceph::buffer::list&, 
>> >> > ceph::buffer::list&, Context*)+0x1b2) [0x702da2]#012 7: 
>> >> > (MDSIOContextBase::complete(int)+0x119) [0x74fcc9]#012 8: 
>> >> > (Finisher::finisher_thread_entry()+0x12e) [0x7f3c0ebffece]#012 9: 
>> >> > (()+0x76ba) [0x7f3c0e4806ba]#012 10: (clone()+0x6d) 
>> >> > [0x7f3c0dca941d]#012 NOTE: a copy of the executable, or `objdump -rdS 
>> >> > ` is needed to interpret this.
>> >> > Oct 15 03:41:39.895400 mds1 ceph-mds: --- logging levels ---
>> >> > Oct 15 03:41:39.895473 mds1 ceph-mds:0/ 5 none
>> >> > Oct 15 03:41:39.895473 mds1 ceph-mds:0/ 1 lockdep
>> >> >
>> >>
>> >> looks like snap info for the root inode is corrupted. did you do any
>> >> unusual operation before this happened?
>> >>
>>