Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
Thanks Yan! I did this for the bug ticket and missed these replies. I hope I did it correctly. Here are the pastes of the dumps:

https://pastebin.com/kw4bZVZT -- primary
https://pastebin.com/sYZQx0ER -- secondary

They are not that long; here is the output of one:

Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe3b100a700 (LWP 120481)]
0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
3065    /build/ceph-12.2.5/src/mds/Server.cc: No such file or directory.
(gdb) t
[Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))]
(gdb) bt
#0  0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
#1  0x5617aacfc98b in Server::dispatch_client_request (this=this@entry=0x5617b5acbcd0, mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:1802
#2  0x5617aacfce9b in Server::handle_client_request (this=this@entry=0x5617b5acbcd0, req=req@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:1716
#3  0x5617aad017b6 in Server::dispatch (this=0x5617b5acbcd0, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:258
#4  0x5617aac6afac in MDSRank::handle_deferrable_message (this=this@entry=0x5617b5d22000, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/MDSRank.cc:716
#5  0x5617aac795cb in MDSRank::_dispatch (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700, new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551
#6  0x5617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000, m=) at /build/ceph-12.2.5/src/mds/MDSRank.cc:998
#7  0x5617aaf0207b in Context::complete (r=0, this=0x5617bd568080) at /build/ceph-12.2.5/src/include/Context.h:70
#8  MDSInternalContextBase::complete (this=0x5617bd568080, r=0) at /build/ceph-12.2.5/src/mds/MDSContext.cc:30
#9  0x5617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000) at /build/ceph-12.2.5/src/mds/MDSRank.cc:776
#10 0x5617aac7921a in MDSRank::ProgressThread::entry (this=0x5617b5d22d40) at /build/ceph-12.2.5/src/mds/MDSRank.cc:502
#11 0x7fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at pthread_create.c:333
#12 0x7fe3ba37241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I:
* set the debug level to mds=20 mon=1,
* attached gdb prior to trying to mount aufs from a separate client,
* typed continue, attempted the mount,
* then backtraced after it seg faulted.

I hope this is more helpful. Is there something else I should try to get more info? I was hoping for something closer to a python trace where it says a variable is a different type or a missing delimiter. womp. I am definitely out of my depth but now is a great time to learn! Can anyone shed some more light as to what may be wrong?

On Fri, May 4, 2018 at 7:49 PM, Yan, Zheng wrote:
> On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan wrote:
> > Forgot to reply to all:
> >
> > Sure thing!
> >
> > I couldn't install the ceph-mds-dbg packages without upgrading. I just
> > finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.
> >
> > From here I'm not really sure how to generate the backtrace so I hope I
> > did it right. For others on Ubuntu this is what I did:
> >
> > * firstly up the debug_mds to 20 and debug_ms to 1:
> > ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
> >
> > * install the debug packages
> > ceph-mds-dbg in my case
> >
> > * I also added these options to /etc/ceph/ceph.conf just in case they
> > restart.
> >
> > * Now allow pids to dump (stolen partly from redhat docs and partly from
> > ubuntu)
> > echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
> > sysctl fs.suid_dumpable=2
> > sysctl kernel.core_pattern=/tmp/core
> > systemctl daemon-reload
> > systemctl restart ceph-mds@$(hostname -s)
> >
> > * A crash was created in /var/crash by apport but gdb can't read it. I used
> > apport-unpack and then ran GDB on what is inside:
>
> core dump should be in /tmp/core
>
> > apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> > cd /root/crash_dump/
> > gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace
> >
> > * This left me with the attached backtraces (which I think are wrong as I
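For reference, attaching gdb ahead of the crash (as described above) amounts to something like the following; a minimal sketch, assuming a single ceph-mds process on the host and the ceph-mds-dbg symbols already installed:

```
# attach to the live mds and let it run until the client triggers the segfault
pid=$(pidof ceph-mds)
gdb -p "$pid" -ex 'set pagination off' -ex 'continue'

# once gdb reports the SIGSEGV, capture the stack from the gdb prompt:
#   (gdb) thread apply all bt
#   (gdb) detach
#   (gdb) quit
```

Attaching before the crash avoids the apport/core-dump dance entirely, since the faulting thread is still live when the backtrace is taken.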
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
Most of this is over my head but the last line of the logs on both mds servers show something similar to: 0> 2018-05-01 15:37:46.871932 7fd10163b700 -1 *** Caught signal (Segmentation fault) ** in thread 7fd10163b700 thread_name:mds_rank_progr When I search for this in ceph user and devel mailing list the only mention I can see is from 12.0.3: https://marc.info/?l=ceph-devel&m=149726392820648&w=2 -- ceph-devel I don't see any mention of journal.cc in my logs however so I hope they are not related. I also have not experienced any major loss in my cluster as of yet and cephfs-journal-tool shows my journals as healthy. To trigger this bug I created a cephfs directory and user called aufstest. Here is the part of the log with the crash mentioning aufstest. https://pastebin.com/EL5ALLuE I created a new bug ticket on ceph.com with all of the current info as I believe this isn't a problem with my setup specifically and anyone else trying this will have the same issue. https://tracker.ceph.com/issues/23972 I hope this is the correct path. If anyone can guide me in the right direction for troubleshooting this further I would be grateful. On Tue, May 1, 2018 at 6:19 PM, Sean Sullivan wrote: > Forgot to reply to all: > > > Sure thing! > > I couldn't install the ceph-mds-dbg packages without upgrading. I just > finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5 > > From here I'm not really sure how to do generate the backtrace so I hope I > did it right. For others on Ubuntu this is what I did: > > * firstly up the debug_mds to 20 and debug_ms to 1: > ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1' > > * install the debug packages > ceph-mds-dbg in my case > > * I also added these options to /etc/ceph/ceph.conf just in case they > restart. > > * Now allow pids to dump (stolen partly from redhat docs and partly from > ubuntu) > echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a > /etc/systemd/system.conf > sysctl fs.suid_dumpable=2 > sysctl kernel.core_pattern=/tmp/core > systemctl daemon-reload > systemctl restart ceph-mds@$(hostname -s) > > * A crash was created in /var/crash by apport but gdb cant read it. I used > apport-unpack and then ran GDB on what is inside: > > apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/ > cd /root/crash_dump/ > gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee > /root/ceph_mds_$(hostname -s)_backtrace > > * This left me with the attached backtraces (which I think are wrong as I > see a lot of ?? yet gdb says /usr/lib/debug/.build-id/1d/ > 23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded) > > kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD > kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY > > > The log files are pretty large (one 4.1G and the other 200MB) > > kh10-8 (200MB) mds log -- https://griffin-objstore.op > ensciencedatacloud.org/logs/ceph-mds.kh10-8.log > kh09-8 (4.1GB) mds log -- https://griffin-objstore.op > ensciencedatacloud.org/logs/ceph-mds.kh09-8.log > > On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly > wrote: > >> Hello Sean, >> >> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan >> wrote: >> > I was creating a new user and mount point. On another hardware node I >> > mounted CephFS as admin to mount as root. I created /aufstest and then >> > unmounted. From there it seems that both of my mds nodes crashed for >> some >> > reason and I can't start them any more. 
>> > >> > https://pastebin.com/1ZgkL9fa -- my mds log >> > >> > I have never had this happen in my tests so now I have live data here. >> If >> > anyone can lend a hand or point me in the right direction while >> > troubleshooting that would be a godsend! >> >> Thanks for keeping the list apprised of your efforts. Since this is so >> easily reproduced for you, I would suggest that you next get higher >> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is >> a segmentation fault, a backtrace with debug symbols from gdb would >> also be helpful. >> >> -- >> Patrick Donnelly >> > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
Forgot to reply to all:

Sure thing!

I couldn't install the ceph-mds-dbg packages without upgrading. I just finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.

From here I'm not really sure how to generate the backtrace so I hope I did it right. For others on Ubuntu this is what I did:

* firstly up the debug_mds to 20 and debug_ms to 1:
ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'

* install the debug packages
ceph-mds-dbg in my case

* I also added these options to /etc/ceph/ceph.conf just in case they restart.

* Now allow pids to dump (stolen partly from redhat docs and partly from ubuntu)
echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
sysctl fs.suid_dumpable=2
sysctl kernel.core_pattern=/tmp/core
systemctl daemon-reload
systemctl restart ceph-mds@$(hostname -s)

* A crash was created in /var/crash by apport but gdb can't read it. I used apport-unpack and then ran GDB on what is inside:

apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
cd /root/crash_dump/
gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace

* This left me with the attached backtraces (which I think are wrong as I see a lot of ?? yet gdb says /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)

kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY

The log files are pretty large (one 4.1G and the other 200MB):

kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log

On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly wrote:
> Hello Sean,
>
> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan wrote:
> > I was creating a new user and mount point. On another hardware node I
> > mounted CephFS as admin to mount as root. I created /aufstest and then
> > unmounted. From there it seems that both of my mds nodes crashed for some
> > reason and I can't start them any more.
> >
> > https://pastebin.com/1ZgkL9fa -- my mds log
> >
> > I have never had this happen in my tests so now I have live data here. If
> > anyone can lend a hand or point me in the right direction while
> > troubleshooting that would be a godsend!
>
> Thanks for keeping the list apprised of your efforts. Since this is so
> easily reproduced for you, I would suggest that you next get higher
> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
> a segmentation fault, a backtrace with debug symbols from gdb would
> also be helpful.
>
> --
> Patrick Donnelly
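The ceph.conf change isn't spelled out above; a minimal sketch of what it could look like, using the same debug values from the injectargs step (section name is an assumption):

```
# append to /etc/ceph/ceph.conf on the mds hosts so the debug levels
# survive a daemon restart
cat >> /etc/ceph/ceph.conf <<'EOF'
[mds]
    debug mds = 20
    debug ms = 1
EOF
```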
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
I forgot that I left my VM mount command running. It hangs my VM, but more alarmingly it crashes my MDS servers on the ceph cluster. The ceph cluster is all hardware nodes and the openstack VM does not have an admin keyring (although the cephx keyring generated for cephfs does have write permissions to the ec42 pool).

The layout, for reference -- Luminous CephFS cluster, version 12.2.4, Ubuntu 16.04, 4.10.0-38-generic on all hardware nodes:

* kh08-8 -- Ceph monitor A (mon server)
* kh09-8 -- Ceph monitor B, Ceph MDS A
* kh10-8 -- Ceph monitor C, Ceph MDS failover
* client -- OpenStack VM, Ubuntu 16.04, 4.13.0-39-generic, CephFS via the kernel client
* ec42 -- CephFS data pool, 16384 PGs, erasure coded with a 4/2 profile
* cephfs_metadata -- CephFS metadata pool, 4096 PGs, replicated (n=3)

As far as I am aware this shouldn't happen. I will try upgrading as soon as I can, but I didn't see anything like this mentioned in the change log and am worried this will still exist in 12.2.5. Has anyone seen this before?

On Mon, Apr 30, 2018 at 7:24 PM, Sean Sullivan wrote:
> So I think I can reliably reproduce this crash from a ceph client.
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
>     id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
>     mgr: kh08-8(active)
>     mds: cephfs-1/1/1 up {0=kh09-8=up:active}, 1 up:standby
>     osd: 570 osds: 570 up, 570 in
> ```
>
> then from a client try to mount aufs over cephfs:
> ```
> mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
> ```
>
> Now watch as your ceph mds servers fail:
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
>     id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
>     health: HEALTH_WARN
>             insufficient standby MDS daemons available
>
>   services:
>     mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
>     mgr: kh08-8(active)
>     mds: cephfs-1/1/1 up {0=kh10-8=up:active(laggy or crashed)}
> ```
>
> I am now stuck in a degraded state and I can't seem to get them to start again.
>
> On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan wrote:
>
>> I had 2 MDS servers (one active, one standby) and both were down. I took a
>> dumb chance and marked the active as down (it said it was up but laggy).
>> Then started the primary again and now both are back up. I have never seen
>> this before and I am also not sure of what I just did.
>>
>> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan wrote:
>>
>>> I was creating a new user and mount point. On another hardware node I
>>> mounted CephFS as admin to mount as root. I created /aufstest and then
>>> unmounted. From there it seems that both of my mds nodes crashed for some
>>> reason and I can't start them any more.
>>>
>>> https://pastebin.c
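Since the question of what the VM's cephfs key is allowed to do keeps coming up, here is roughly how a path-restricted key for the aufstest user could be created and inspected on Luminous. This is a sketch, not the exact commands used on this cluster; on pre-Luminous releases the caps would have to be written out by hand with `ceph auth get-or-create`:

```
# create a key limited to /aufstest rather than the whole filesystem
ceph fs authorize cephfs client.aufstest /aufstest rw

# confirm exactly which mds/mon/osd caps the key actually carries
ceph auth get client.aufstest
```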
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
So I think I can reliably reproduce this crash from a ceph client.

```
root@kh08-8:~# ceph -s
  cluster:
    id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
    mgr: kh08-8(active)
    mds: cephfs-1/1/1 up {0=kh09-8=up:active}, 1 up:standby
    osd: 570 osds: 570 up, 570 in
```

then from a client try to mount aufs over cephfs:
```
mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
```

Now watch as your ceph mds servers fail:

```
root@kh08-8:~# ceph -s
  cluster:
    id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
    mgr: kh08-8(active)
    mds: cephfs-1/1/1 up {0=kh10-8=up:active(laggy or crashed)}
```

I am now stuck in a degraded state and I can't seem to get them to start again.

On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan wrote:
> I had 2 MDS servers (one active, one standby) and both were down. I took a
> dumb chance and marked the active as down (it said it was up but laggy).
> Then started the primary again and now both are back up. I have never seen
> this before and I am also not sure of what I just did.
>
> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan wrote:
>
>> I was creating a new user and mount point. On another hardware node I
>> mounted CephFS as admin to mount as root. I created /aufstest and then
>> unmounted. From there it seems that both of my mds nodes crashed for some
>> reason and I can't start them any more.
>>
>> https://pastebin.com/1ZgkL9fa -- my mds log
>>
>> I have never had this happen in my tests so now I have live data here. If
>> anyone can lend a hand or point me in the right direction while
>> troubleshooting that would be a godsend!
>>
>> I tried cephfs-journal-tool inspect and it reports that the journal
>> should be fine. I am not sure why it's crashing:
>>
>> /home/lacadmin# cephfs-journal-tool journal inspect
>> Overall journal integrity: OK
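One thing worth checking while it is wedged like this is whether the aufs client still holds a session (and therefore keeps re-sending the request that kills whichever MDS takes over the rank). A hedged sketch of what could be looked at, assuming rank 0 and the Luminous tell interface; the session id below is hypothetical and comes from whatever `client ls` reports for the VM:

```
# list the client sessions held by rank 0
ceph tell mds.0 client ls

# evict the session belonging to the aufs-mounting VM so it stops
# re-sending the offending request (id is a placeholder)
ceph tell mds.0 client evict id=4305
```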
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
I had 2 MDS servers (one active, one standby) and both were down. I took a dumb chance and marked the active as down (it said it was up but laggy). Then started the primary again and now both are back up. I have never seen this before and I am also not sure of what I just did.

On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan wrote:
> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting that would be a godsend!
>
> I tried cephfs-journal-tool inspect and it reports that the journal should
> be fine. I am not sure why it's crashing:
>
> /home/lacadmin# cephfs-journal-tool journal inspect
> Overall journal integrity: OK
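For anyone landing in the same state, the "mark it down and start it again" dance described above corresponds to something like the following (a sketch; rank 0 and the local hostname are assumptions):

```
ceph mds stat                            # see which daemon is active / laggy
ceph mds fail 0                          # mark rank 0 failed so a standby (or the restarted daemon) can claim it
systemctl restart ceph-mds@$(hostname -s)
ceph -s                                  # confirm the rank comes back as up:active
```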
[ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
I was creating a new user and mount point. On another hardware node I mounted CephFS as admin to mount as root. I created /aufstest and then unmounted. From there it seems that both of my mds nodes crashed for some reason and I can't start them any more.

https://pastebin.com/1ZgkL9fa -- my mds log

I have never had this happen in my tests so now I have live data here. If anyone can lend a hand or point me in the right direction while troubleshooting that would be a godsend!

I tried cephfs-journal-tool inspect and it reports that the journal should be fine. I am not sure why it's crashing:

/home/lacadmin# cephfs-journal-tool journal inspect
Overall journal integrity: OK
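Beyond `journal inspect`, a few other read-only checks of the on-disk metadata state can help rule things out before anyone suggests resetting anything. A sketch; the exact command spellings should be double-checked against the installed version:

```
cephfs-journal-tool journal inspect        # what was run above
cephfs-journal-tool header get             # journal header (expire/write positions)
cephfs-journal-tool event get summary      # counts of journal events by type
cephfs-table-tool all show session         # client sessions recorded in the session table
```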
[ceph-users] luminous ubuntu 16.04 HWE (4.10 kernel). ceph-disk can't prepare a disk
On freshly installed Ubuntu 16.04 servers with the HWE kernel selected (4.10), I cannot use ceph-deploy or ceph-disk to provision OSDs. Whenever I try I get the following:

ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --bluestore --cluster ceph --fs-type xfs -- /dev/sdy

command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
set_type: Will colocate block with data on /dev/sdy
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_db_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_wal_size
get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 9, in
    load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5704, in run
    main(sys.argv[1:])
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5655, in main
    args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2091, in main
    Prepare.factory(args).prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2080, in prepare
    self._prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2154, in _prepare
    self.lockbox.prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2842, in prepare
    verify_not_in_use(self.args.lockbox, check_partitions=True)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 950, in verify_not_in_use
    raise Error('Device is mounted', partition)
ceph_disk.main.Error: Error: Device is mounted: /dev/sdy5

Unmounting the disk does not seem to help either. I'm assuming something is triggering too early, but I'm not sure how to delay it or figure that out. Has anyone deployed on Xenial with the 4.10 kernel? Am I missing something important?
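One thing worth trying before re-running prepare is making sure the auto-mounted lockbox partition from the previous attempt is gone. A hedged sketch, using the device names from the example above; there is no guarantee udev doesn't race in and remount it:

```
# unmount whatever udev/ceph-disk auto-mounted from the previous attempt
umount /dev/sdy5 2>/dev/null
umount /var/lib/ceph/osd-lockbox/* 2>/dev/null

# wipe the half-created partitions and retry
ceph-disk zap /dev/sdy
sgdisk --zap-all /dev/sdy
partprobe /dev/sdy
ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys \
    --bluestore --cluster ceph --fs-type xfs -- /dev/sdy
```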
[ceph-users] zombie partitions, ceph-disk failure.
I am trying to stand up ceph (luminous) on 3 72 disk supermicro servers running ubuntu 16.04 with HWE enabled (for a 4.10 kernel for cephfs). I am not sure how this is possible but even though I am running the following line to wipe all disks of their partitions, once I run ceph-disk to partition the drive udev or device mapper automatically mounts a lockbox partition and ceph-disk fails:: wipe line:: for disk in $(lsblk --output MODEL,NAME | grep -iE "HGST|SSDSC2BA40" | awk '{print $NF}'); do sgdisk -Z /dev/${disk}; dd if=/dev/zero of=/dev/${disk} bs=1024 count=1; ceph-disk zap /dev/${disk}; sgdisk -o /dev/${disk}; sgdisk -G /dev/${disk}; done ceph-disk line: cephcmd="ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --block.db /dev/${pssd} --block.wal /dev/${pssd} --bluestore --cluster ceph --fs-type xfs -- /dev/${phdd}" prior to running that on a single disk all of the drives are empty except the OS drives root@kg15-1:/home/ceph-admin# lsblk --fs NAMEFSTYPELABELUUID MOUNTPOINT sdbu sdy sdam sdbb sdf sdau sdab sdbk sdo sdbs sdw sdak sdd sdas sdbi sdm sdbq sdu sdai sdb sdaq sdbg sdk sdaz sds sdag sdbe sdi sdax sdq sdae sdbn sdbv ├─sdbv3 linux_raid_member kg15-1:2 664f69b7-2dd7-7012-75e3-a920ba7416b8 │ └─md2 ext4 6696d9f5-3385-47cb-8e8b-058637f8a1b8 / ├─sdbv1 linux_raid_member kg15-1:0 c4c78d8b-5c0b-6d51-d0a4-ecd40432f98c │ └─md0 ext4 44f76d8d-0333-49a7-ab89-dafe70f6f12d /boot └─sdbv2 linux_raid_member kg15-1:1 e3a74474-502c-098c-9415-7b99abcbd2e1 └─md1 swap 37e071a9-9361-456b-a740-87ddc99a8260 [SWAP] sdz sdan sdbc sdg sdav sdac sdbl sdbt sdx sdal sdba sde sdat sdaa sdbj sdn sdbr sdv sdaj sdc sdar sdbh sdl sdbp sdt sdah sda ├─sda2 linux_raid_member kg15-1:1 e3a74474-502c-098c-9415-7b99abcbd2e1 │ └─md1 swap 37e071a9-9361-456b-a740-87ddc99a8260 [SWAP] ├─sda3 linux_raid_member kg15-1:2 664f69b7-2dd7-7012-75e3-a920ba7416b8 │ └─md2 ext4 6696d9f5-3385-47cb-8e8b-058637f8a1b8 / └─sda1 linux_raid_member kg15-1:0 c4c78d8b-5c0b-6d51-d0a4-ecd40432f98c └─md0 ext4 44f76d8d-0333-49a7-ab89-dafe70f6f12d /boot sdap sdbf sdj sday sdr sdaf sdbo sdao sdbd sdh sdaw sdp sdad sdbm - but as soon as I run that cephcmd (which worked prior to upgrading to the 4.10 kernel: ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --block.db /dev/sdd --block.wal /dev/sdd --bluestore --cluster ceph --fs-type xfs -- /dev/sdbu command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid get_dm_uuid: get_dm_uuid /dev/sdbu uuid path is /sys/dev/block/68:128/dm/uuid set_type: Will colocate block with data on /dev/sdbu command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_db_size command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup bluestore_block_wal_size get_dm_uuid: get_dm_uuid /dev/sdbu uuid path is /sys/dev/block/68:128/dm/uuid get_dm_uuid: get_dm_uuid /dev/sdbu uuid path is /sys/dev/block/68:128/dm/uuid get_dm_uuid: get_dm_uuid /dev/sdbu uuid path is /sys/dev/block/68:128/dm/uuid Traceback (most recent call last): File "/usr/sbin/ceph-disk", line 9, in load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')() File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5704, in run main(sys.argv[1:]) File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5655, in main args.func(args) File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2091, in main Prepare.factory(args).prepare() File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2080, in prepare self._prepare() File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2154, in _prepare self.lockbox.prepare() File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2842, in prepare verify_not_in_use(self.args.lockbox, check_partitions=True) File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 950, in verify_not_in_use raise Error('Device is mounted', partition) ceph_disk.main.Error: Error: Device is mounted: /dev/sdbu5 So it says sdbu is mounted. I unmount it and again it errors saying it can't create the partition it just tried to create. root@kg15-1:/# mount | grep sdbu /dev/sdbu5 on /var/lib/ceph/osd-lockbox/0e3baee9-a5dd-46f0-ae53-0e7dd2b0b257 type ext4 (rw,relatime,stripe=4,
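For what it's worth, the wipe loop at the top of this mail only clears partition tables and signatures; it does not unmount the lockbox or tear down any dm-crypt mappings left over from earlier attempts, which is what ceph-disk is complaining about. A more aggressive per-disk cleanup would look roughly like this (a sketch; it assumes nothing else on the box uses dm-crypt, so treat the mapper loop with care):

```
disk=sdbu

# unmount anything that was auto-mounted from this disk
for part in /dev/${disk}?*; do umount "$part" 2>/dev/null; done

# close any dm-crypt mappings sitting on top of one of its partitions
for dm in $(ls /dev/mapper | grep -v '^control$'); do
    cryptsetup status "$dm" 2>/dev/null | grep -q "/dev/${disk}" && cryptsetup luksClose "$dm"
done

# now wipe signatures and the partition table, then re-read it
wipefs --all /dev/${disk}
sgdisk -Z /dev/${disk}
partprobe /dev/${disk}
```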
Re: [ceph-users] Luminous can't seem to provision more than 32 OSDs per server
I have tried using ceph-disk directly and i'm running into all sorts of trouble but I'm trying my best. Currently I am using the following cobbled script which seems to be working: https://github.com/seapasulli/CephScripts/blob/master/provision_storage.sh I'm at 11 right now. I hope this works. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Luminous can't seem to provision more than 32 OSDs per server
I am trying to install Ceph luminous (ceph version 12.2.1) on 4 Ubuntu 16.04 servers, each with 74 disks, 60 of which are HGST 7200rpm sas drives:

HGST HUS724040AL sdbv sas
root@kg15-2:~# lsblk --output MODEL,KNAME,TRAN | grep HGST | wc -l
60

I am trying to deploy them all with a line like the following:

ceph-deploy osd zap kg15-2:(sas_disk)
ceph-deploy osd create --dmcrypt --bluestore --block-db (ssd_partition) kg15-2:(sas_disk)

This didn't seem to work at all, so I am now trying to troubleshoot by just provisioning the sas disks:

ceph-deploy osd create --dmcrypt --bluestore kg15-2:(sas_disk)

Across all 4 hosts I can only seem to get 32 OSDs up and after that the rest fail:

root@kg15-1:~# ps faux | grep '[c]eph-osd' | wc -l
32
root@kg15-2:~# ps faux | grep '[c]eph-osd' | wc -l
32
root@kg15-3:~# ps faux | grep '[c]eph-osd' | wc -l
32

The ceph-deploy tool doesn't seem to log or notice any failure, but the host itself shows the following in the osd log:

2017-10-17 23:05:43.121016 7f8ca75c9e00 0 set uid:gid to 64045:64045 (ceph:ceph)
2017-10-17 23:05:43.121040 7f8ca75c9e00 0 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable), process (unknown), pid 69926
2017-10-17 23:05:43.123939 7f8ca75c9e00 1 bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) mkfs path /var/lib/ceph/tmp/mnt.8oIc5b
2017-10-17 23:05:43.124037 7f8ca75c9e00 1 bdev create path /var/lib/ceph/tmp/mnt.8oIc5b/block type kernel
2017-10-17 23:05:43.124045 7f8ca75c9e00 1 bdev(0x564b7a05e900 /var/lib/ceph/tmp/mnt.8oIc5b/block) open path /var/lib/ceph/tmp/mnt.8oIc5b/block
2017-10-17 23:05:43.124231 7f8ca75c9e00 1 bdev(0x564b7a05e900 /var/lib/ceph/tmp/mnt.8oIc5b/block) open size 4000668520448 (0x3a37a6d1000, 3725 GB) block_size 4096 (4096 B) rotational
2017-10-17 23:05:43.124296 7f8ca75c9e00 1 bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) _set_cache_sizes max 0.5 < ratio 0.99
2017-10-17 23:05:43.124313 7f8ca75c9e00 1 bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) _set_cache_sizes cache_size 1073741824 meta 0.5 kv 0.5 data 0
2017-10-17 23:05:43.124349 7f8ca75c9e00 -1 bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) _open_db /var/lib/ceph/tmp/mnt.8oIc5b/block.db link target doesn't exist
2017-10-17 23:05:43.124368 7f8ca75c9e00 1 bdev(0x564b7a05e900 /var/lib/ceph/tmp/mnt.8oIc5b/block) close
2017-10-17 23:05:43.402165 7f8ca75c9e00 -1 bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) mkfs failed, (2) No such file or directory
2017-10-17 23:05:43.402185 7f8ca75c9e00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (2) No such file or directory
2017-10-17 23:05:43.402258 7f8ca75c9e00 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.8oIc5b: (2) No such file or directory

I am not sure where to start troubleshooting, so I have a few questions.

1.) Anyone have any idea on why 32?

2.) Is there a good guide / outline on how to get the benefit of storing the keys in the monitor while still having ceph more or less manage the drives, but provisioning the drives without ceph-deploy? I looked at the manual deployment long and short form and it doesn't mention dmcrypt or bluestore at all. I know I can use crypttab and cryptsetup to do this and then give ceph-disk the path to the mapped device, but I would prefer to keep as much management in ceph as possible if I could. (mailing list thread: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg38575.html )

3.) Ideally I would like to provision the drives with the DB on the SSD. (Or would it be better to make a cache tier? I read on a reddit thread that the tiering in ceph isn't being developed anymore; is it still worth it?)

Sorry for the bother and thanks for all the help!!!
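On question 1, one kernel limit worth ruling out is the async-IO context limit: each bluestore OSD reserves AIO contexts, and the stock fs.aio-max-nr (65536) can be exhausted on dense boxes, at which point further OSDs fail to start. I can't say this is the cause of the hard stop at 32 here, but it is cheap to check (a sketch):

```
sysctl fs.aio-max-nr          # the limit (default 65536)
cat /proc/sys/fs/aio-nr       # what is currently allocated

# raise it and make it persistent if the two numbers are close together
sysctl -w fs.aio-max-nr=1048576
echo 'fs.aio-max-nr = 1048576' > /etc/sysctl.d/90-ceph-aio.conf
```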
[ceph-users] ceph-monstore-tool rebuild assert error
I have a hammer cluster that died a bit ago (hammer 94.9) consisting of 3 monitors and 630 osds spread across 21 storage hosts. The cluster's monitors all died due to leveldb corruption and the cluster was shut down. I was finally given word that I could try to revive the cluster this week!

https://github.com/ceph/ceph/blob/hammer/doc/rados/troubleshooting/troubleshooting-mon.rst#recovery-using-osds

I see that the latest hammer code in github has the ceph-monstore-tool rebuild backport and that is what I am running on the cluster now (ceph version 0.94.9-4530-g83af8cd (83af8cdaaa6d94404e6146b68e532a784e3cc99c)). I was able to scrape all 630 of the osds and am left with a 1.1G store.db directory. Using python I was successfully able to list all of the keys and values, which was very promising.

That said, I cannot run the final command in the recovery-using-osds article (ceph-monstore-tool rebuild) successfully. Whenever I run the tool (with the newly created admin keyring or with my existing one) it errors with the following:

0> 2017-02-17 15:00:47.516901 7f8b4d7408c0 -1 ./mon/MonitorDBStore.h: In function 'KeyValueDB::Iterator MonitorDBStore::get_iterator(const string&)' thread 7f8b4d7408c0 time 2017-02-07 15:00:47.516319

The complete trace is here: http://pastebin.com/NQE8uYiG

Can anyone lend a hand and tell me what may be wrong? I am able to iterate over the leveldb database in python so the structure should be somewhat okay? Am I SOL at this point? The cluster isn't production any longer, and while I don't have months of time I would really like to recover this cluster just to see if it is at all possible.

--
- Sean: I wrote this. -
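For context, the rebuild step from the linked recovery-using-osds doc is the one that asserts; the invocation looks roughly like this (the store path and keyring path here are assumptions, substitute the scraped store.db and whichever keyring is in use):

```
# store.db scraped from all 630 osds lives under /root/mon-store
ceph-monstore-tool /root/mon-store rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring

# the rebuilt store then gets copied over a monitor's data dir before starting it
```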
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
cd6d88c0 5 asok(0x355a000) register_command log flush hook 0x350a0d0 -3> 2017-02-06 17:35:54.362215 7f10cd6d88c0 5 asok(0x355a000) register_command log dump hook 0x350a0d0 -2> 2017-02-06 17:35:54.362220 7f10cd6d88c0 5 asok(0x355a000) register_command log reopen hook 0x350a0d0 -1> 2017-02-06 17:35:54.379684 7f10cd6d88c0 2 auth: KeyRing::load: loaded key file /home/lacadmin/admin.keyring 0> 2017-02-06 17:35:59.885651 7f10cd6d88c0 -1 *** Caught signal (Segmentation fault) ** in thread 7f10cd6d88c0 ceph version 0.94.9-4530-g83af8cd (83af8cdaaa6d94404e6146b68e532a784e3cc99c) 1: ceph-monstore-tool() [0x5e960a] 2: (()+0x10330) [0x7f10cc5c8330] 3: (strlen()+0x2a) [0x7f10cac629da] 4: (std::basic_string, std::allocator >::basic_string(char const*, std::allocator const&)+0x25) [0x7f10cb576d75] 5: (rebuild_monstore(char const*, std::vector >&, MonitorDBStore&)+0x878) [0x544958] 6: (main()+0x3e05) [0x52c035] 7: (__libc_start_main()+0xf5) [0x7f10cabfbf45] 8: ceph-monstore-tool() [0x540347] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 1/ 1 ms 10/10 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio -2/-2 (syslog threshold) 99/99 (stderr threshold) max_recent 500 max_new 1000 log_file --- end dump of recent events --- Segmentation fault (core dumped) -- I have tried copying my monitor and admin keyring into the admin.keyring used to try to rebuild and it still fails. I am not sure whether this is due to my packages or if something else is wrong. Is there a way to test or see what may be happening? On Sat, Aug 13, 2016 at 10:36 PM, Sean Sullivan wrote: > So with a patched leveldb to skip errors I now have a store.db that I can > extract the pg,mon,and osd map from. 
That said when I try to start kh10-8 > it bombs out:: > > --- > --- > root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8# ceph-mon -i $(hostname) -d > 2016-08-13 22:30:54.596039 7fa8b9e088c0 0 ceph version 0.94.7 ( > d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 708653 > starting mon.kh10-8 rank 2 at 10.64.64.125:6789/0 mon_data > /var/lib/ceph/mon/ceph-kh10-8 fsid e452874b-cb29-4468-ac7f-f8901dfccebf > 2016-08-13 22:30:54.608150 7fa8b9e088c0 0 starting mon.kh10-8 rank 2 at > 10.64.64.125:6789/0 mon_data /var/lib/ceph/mon/ceph-kh10-8 fsid > e452874b-cb29-4468-ac7f-f8901dfccebf > 2016-08-13 22:30:54.608395 7fa8b9e088c0 1 mon.kh10-8@-1(probing) e1 > preinit fsid e452874b-cb29-4468-ac7f-f8901dfccebf > 2016-08-13 22:30:54.608617 7fa8b9e088c0 1 > mon.kh10-8@-1(probing).paxosservice(pgmap > 0..35606392) refresh upgraded, format 0 -> 1 > 2016-08-13 22:30:54.608629 7fa8b9e088c0 1 mon.kh10-8@-1(probing).pg v0 > on_upgrade discarding in-core PGMap > terminate called after throwing an instance of > 'ceph::buffer::end_of_buffer' > what(): buffer::end_of_buffer > *** Caught signal (Aborted) ** > in thread 7fa8b9e088c0 > ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432) > 1: ceph-mon() [0x9b25ea] > 2: (()+0x10330) [0x7fa8b8f0b330] > 3: (gsignal()+0x37) [0x7fa8b73a8c37] > 4: (abort()+0x148) [0x7fa8b73ac028] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535] > 6: (()+0x5e6d6) [0x7fa8b7cb16d6] > 7: (()+0x5e703) [0x7fa8b7cb1703] > 8: (()+0x5e922) [0x7fa8b7cb1922] > 9: ceph-mon() [0x853c39] > 10: (object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167) > [0x894227] > 11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf] > 12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3] > 13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8] > 14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7] > 15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a] > 16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb] > 17: (Monitor::init_paxos()+0x85) [0x5b2365] > 18: (Monitor::preinit()+0x7d7) [0x5b6f87] > 19: (main()+0x230c) [0x57853c] > 20: (__libc_start_main()+0xf5) [0x7fa8b7393f45] > 21: ceph-m
[ceph-users] ceph radosgw - 500 errors -- odd
I am sorry for posting this if this has been addressed already. I am not sure on how to search through old ceph-users mailing list posts. I used to use gmane.org but that seems to be down. My setup:: I have a moderate ceph cluster (ceph hammer 94.9 - fe6d859066244b97b24f09d46552afc2071e6f90 ). The cluster is running ubuntu but the gateways are running centos7 due to an odd memory issue we had across all of our gateways. Outside of that the cluster is pretty standard and healthy: [root@kh11-9 ~]# ceph -s cluster XXX-XXX-XXX-XXX health HEALTH_OK monmap e4: 3 mons at {kh11-8=X.X.X.X:6789/0,kh12-8=X.X.X.X:6789/0,kh13-8=X.X.X.X:6789/0} election epoch 150, quorum 0,1,2 kh11-8,kh12-8,kh13-8 osdmap e69678: 627 osds: 627 up, 627 in Here is my radosgw config in ceph:: [client.rgw.kh09-10] log_file = /var/log/radosgw/client.radosgw.log rgw_frontends = "civetweb port=80 access_log_file=/var/log/radosgw/rgw.access error_log_file=/var/log/radosgw/rgw.error" rgw_enable_ops_log = true rgw_ops_log_rados = true rgw_thread_pool_size = 1000 rgw_override_bucket_index_max_shards = 23 error_log_file = /var/log/radosgw/civetweb.error.log access_log_file = /var/log/radosgw/civetweb.access.log objecter_inflight_op_bytes = 1073741824 objecter_inflight_ops = 20480 ms_dispatch_throttle_bytes = 209715200 The gateways are sitting behind haproxy for ssl termination. Here is my haproxy config: global log /dev/loglocal0 log /dev/loglocal1 notice chroot /var/lib/haproxy stats socket /var/lib/haproxy/admin.sock mode 660 level admin stats timeout 30s user haproxy group haproxy daemon ca-base /etc/ssl/certs crt-base /etc/ssl/private tune.ssl.default-dh-param 2048 tune.ssl.maxrecord 2048 ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256 ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets ssl-default-server-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256 ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets defaults log global modehttp option httplog option dontlognull timeout connect 5000 timeout client 5 timeout server 5 errorfile 400 /etc/haproxy/errors/400.http errorfile 403 /etc/haproxy/errors/403.http errorfile 408 /etc/haproxy/errors/408.http errorfile 500 /etc/haproxy/errors/500.http errorfile 502 /etc/haproxy/errors/502.http errorfile 503 /etc/haproxy/errors/503.http errorfile 504 /etc/haproxy/errors/504.http option forwardfor option http-server-close frontend fourfourthree bind :443 ssl crt /etc/ssl/STAR.opensciencedatacloud.org.pem reqadd X-Forwarded-Proto:\ https default_backend radosgw backend radosgw cookie RADOSGWLB insert indirect nocache server primary 127.0.0.1:80 check cookie primary I am seeing sporadic 500 errors in my access logs on all of my radosgws: /var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.635645 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12607029248 stripe_ofs=12607029248 part_ofs=12598640640 rule->part_size=15728640 /var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.637559 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12611223552 
stripe_ofs=12611223552 part_ofs=12598640640 rule->part_size=15728640 /var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.642630 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12614369280 stripe_ofs=12614369280 part_ofs=12614369280 rule->part_size=15728640 /var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.644368 7feadf6e6700 1 == req done req=0x7fed00053a50 http_status=500 == /var/log/radosgw/client.radosgw.log:2017-01-13 11:30:41.644475 7feadf6e6700 1 civetweb: 0x7fed9340: 10.64.0.124 - - [13/Jan/2017:11:28:24 -0600] "GET /BUCKET/306d4fe1-1515-44e0-b527-eee0e83412bf/306d4fe1-1515-44e0-b527-eee0e83412bf_gdc_realn_rehead.bam HTTP/1.1" 500 0 - Boto/2.36.0 Python/2.7.6 Linux/3.13.0-95-generic /var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.645611 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12618563584 stripe_ofs=12618563584 part_ofs=12614369280 rule->part_size=15728640 /var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.647998 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12622757888 stripe_ofs=12622757888 pa
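The 500s show up part-way through GETs of large multipart objects, so two cheap things to look at are the object's manifest and a higher-verbosity rgw log around one failing request. A sketch; the bucket placeholder and object name are taken from the log excerpt above, and the gateway name comes from the config section:

```
# dump the manifest / multipart layout of the object that 500s
radosgw-admin object stat --bucket=BUCKET \
    --object='306d4fe1-1515-44e0-b527-eee0e83412bf/306d4fe1-1515-44e0-b527-eee0e83412bf_gdc_realn_rehead.bam'

# crank rgw logging on the gateway while reproducing a single failing GET
ceph daemon client.rgw.kh09-10 config set debug_rgw 20
```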
Re: [ceph-users] Filling up ceph past 75%
I've seen it in the past in the ML but I don't remember seeing it lately. We recently had an ceph engineer come out from RH and he mentioned he hasn't seen this kind of disparity either which made me jump on here to double check as I thought it was a well known thing. So I'm not crazy and the roughly 30% difference is normal? I've tried the osd by utilization function before (with other clusters) and have been left with broken pgs(ones that seem to be stuck back filling) before so I've stayed away from it. I saw that it has been redone but with past exposure I've been hesitant. I'll give it another shot in a test instance and see how it goes. Thanks for your help as always Mr. Balzer. On Aug 28, 2016 8:59 PM, "Christian Balzer" wrote: > > Hello, > > On Sun, 28 Aug 2016 14:34:25 -0500 Sean Sullivan wrote: > > > I was curious if anyone has filled ceph storage beyond 75%. > > If you (re-)search the ML archives, you will find plenty of cases like > this, albeit most of them involuntary. > Same goes for uneven distribution. > > > Admitedly we > > lost a single host due to power failure and are down 1 host until the > > replacement parts arrive but outside of that I am seeing disparity > between > > the most and least full osd:: > > > > ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR > > MIN/MAX VAR: 0/1.26 STDDEV: 7.12 > >TOTAL 2178T 1625T 552T 74.63 > > > > 559 4.54955 1.0 3724G 2327G 1396G 62.50 0.84 > > 193 2.48537 1.0 3724G 3406G 317G 91.47 1.23 > > > Those extremes, especially with the weights they have, look odd indeed. > Unless OSD 193 is in the rack which lost a node. > > > The crush weights are really off right now but even with a default crush > > map I am seeing a similar spread:: > > > > # osdmaptool --test-map-pgs --pool 1 /tmp/osdmap > > avg 82 stddev 10.54 (0.128537x) (expected 9.05095 0.110377x)) > > min osd.336 55 > > max osd.54 115 > > > > That's with a default weight of 3.000 across all osds. I was wondering if > > anyone can give me any tips on how to reach closer to 80% full. > > > > We have 630 osds (down one host right now but it will be back in in a > week > > or so) spread across 3 racks of 7 hosts (30 osds each). Our data > > replication scheme is by rack and we only use S3 (so 98% of our data is > in > > .rgw.buckets pool). We are on hammer (94.7) and using the hammer > tunables. > > > What comes to mind here is that probably your split into 3 buckets (racks) > and then into 7 (hosts) is probably not helping the already rather fuzzy > CRUSH algorithm to come up with an even distribution. > Meaning that imbalances are likely to be amplified. > > And dense (30 OSDs) storage servers amplify things of course when one goes > down. > > So how many PGs in the bucket pool then? > > With jewel (backport exists, check the ML archives) there's an improved > reweight-by-utilization script that can help with these things. > And I prefer to do this manually by using the (persistent) crush-reweight > to achieve a more even distribution. > > For example on one cluster here I got the 18 HDD OSDs all within 100GB of > each other. > > However having lost 3 of those OSDs 2 days ago the spread is now 300GB, > most likely NOT helped by the manual adjustments done earlier. > So your nice and evenly distributed cluster during normal state may be > worse off using custom weights when there is a significant OSD loss. 
> > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
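For the record, the reworked reweight-by-utilization that was backported takes a dry-run form first, which is what makes it less scary than the old one-shot command. As I understand it the invocation is roughly (overload threshold in percent, max weight change per osd, max number of osds touched), but check the exact signature on the installed release:

```
# show what would change without applying anything
ceph osd test-reweight-by-utilization 110 0.05 20

# apply the same adjustment for real once the dry run looks sane
ceph osd reweight-by-utilization 110 0.05 20
```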
[ceph-users] Filling up ceph past 75%
I was curious if anyone has filled ceph storage beyond 75%. Admittedly we lost a single host due to power failure and are down 1 host until the replacement parts arrive, but outside of that I am seeing disparity between the most and least full osd:

ID  WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR
MIN/MAX VAR: 0/1.26  STDDEV: 7.12
    TOTAL            2178T 1625T  552T 74.63
559 4.54955 1.0      3724G 2327G 1396G 62.50 0.84
193 2.48537 1.0      3724G 3406G  317G 91.47 1.23

The crush weights are really off right now, but even with a default crush map I am seeing a similar spread:

# osdmaptool --test-map-pgs --pool 1 /tmp/osdmap
avg 82 stddev 10.54 (0.128537x) (expected 9.05095 0.110377x))
min osd.336 55
max osd.54 115

That's with a default weight of 3.000 across all osds. I was wondering if anyone can give me any tips on how to reach closer to 80% full.

We have 630 osds (down one host right now but it will be back in in a week or so) spread across 3 racks of 7 hosts (30 osds each). Our data replication scheme is by rack and we only use S3 (so 98% of our data is in the .rgw.buckets pool). We are on hammer (94.7) and using the hammer tunables.

--
- Sean: I wrote this. -
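For anyone wanting to nudge this by hand, the loop looks roughly like this, using the two extreme osds from above and illustrative target weights (not a recommendation of those exact numbers):

```
# grab the current osdmap and see how pgs map across osds
ceph osd getmap -o /tmp/osdmap
osdmaptool --test-map-pgs --pool 1 /tmp/osdmap

# push the overfull osd down and the underfull one up, then watch ceph osd df
ceph osd crush reweight osd.193 2.3
ceph osd crush reweight osd.559 4.7
```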
Re: [ceph-users] How can we repair OSD leveldb?
We have a hammer cluster that experienced a similar power failure and ended up corrupting our monitors leveldb stores. I am still trying to repair ours but I can give you a few tips that seem to help. 1.) I would copy the database off to somewhere safe right away. Just opening it seems to change it. 2.) check out ceph-test tools (ceph-objectstore-tool, ceph-kvstore-tool, ceph-osdmap-tool, etc). It lets you list the keys/data in your osd leveldb, possibly export them and get some barings on what you need to do to recover your map. 3.) I am making a few assumptions here. a.) You are using replication for your pools. b.) you are using either S3 or rbd, not cephFS. >From here worse case chances are your data is recoverable sans the osd and monitor leveldb store so long as the rest of the data is okay. (The actual rados objects spread across each osd in '/var/lib/ceph/osd/ceph-*/ current/blah_head) If you use RBD there is a tool out there that lets you recover your RBD images:: https://github.com/ceph/ceph/tree/master/src/tools/rbd_recover_tool We only use S3 but this seems to be doable as well: As an example we have a 9MB file that was stored in ceph:: I ran a find across all of the osds in my cluster and compiled a list of files:: find /var/lib/ceph/osd/ceph-*/current/ -type f -iname \*this_is_my_File\. gzip\* >From here I resulted in a list that looks like the following:: This is the head. It's usually the bucket.id\file__head__ default.20283.1\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\ sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam__head_CA57D598__1 [__A]\[_B___ _].[__C__] default.20283.1\u\umultipart\ud975ef9e-c7b1-42c5-938b- d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\ sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1__head_C338075C__1 [__A]\[_D___]\[__B_ _].[__C_ _] And for each of those you'll have matching shadow files:: default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b- d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\ sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u1__head_02F05634__1 [__A]\[_E__]\[__B___ ].[__C__ __] Here is another part of the multipart (this file only had 1 multipart and we use multipart for all files larger than 5MB irrespective of size):: default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b- d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\ sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u2__head_1EA07BDF__1 [__A]\[_E__]\[__B___ ].[__C__ __] ^^ notice the different part number here. A is the bucket.id and is the same for every object in the same bucket. Even if you don't know what the bucket id for your bucket is, you should be able to assume with good certainty after you review your list which is which B is our object name. We generate uuids for each object so I can not be certain how much of this is ceph or us but the tail of your object name should exist and be the same across all of your parts. C.) Is their suffix for each object. From here you may have suffix' like the above D.) Is your upload chunks E.) Is your shadow chunks for each part of the multipart (i think) I'm sure it's much more complicated than that but that's what worked for me. From here I just scanned through all of my osds and slowly pulled all of the individual parts via ssh and concatinated them all to their respective files. So far the md5 sums match our md5 of the file prior to uploading them to ceph in the first place. We have a python tool to do this but it's kind of specific to us. I can ask the author and see if I can post a gist of the code if that helps. Please let me know. 
I can't speak for CephFS unfortunately as we do not use it but I wouldn't be surprised if it is similar. So if you set up ssh-keys across all of your osd nodes you should be able to export all of the data to another server/cluster/etc. I am working on trying to rebuild leveldb for our monitors with the correct keys/values but I have a feeling this is going to be a long way off. I wouldn't be surprised if the leveldb structure for the mon databse is similar to the osd omap database. On Wed, Aug 17, 2016 at 4:54 PM, Dan Jakubiec wrote: > Hi Wido, > > Thank you for the response: > > > On Aug 17, 2016, at 16:25, Wido den Hollander wrote: > > > > > >> Op 17 augustus 2016 om 17:44 schreef Dan Jakubiec < > dan.jakub...@gmail.com>: > >> > >> > >> Hello, we have a Ceph cluster with 8 OSD that recently lost power to > all 8 machines. We've managed to recover the XFS filesystems on 7 of the > machines, but the O
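To make the reassembly step above concrete, here is a hedged sketch of gathering the piece lists from every osd host and stitching one object back together. The host names, the local piece file names and the scp step are placeholders; the concatenation order is exactly the one described above (head first, then each part's shadow chunks in numeric order):

```
# collect candidate piece paths from every osd host
for host in osd-host-01 osd-host-02; do    # one entry per storage host
    ssh "$host" "find /var/lib/ceph/osd/ceph-*/current/ -type f -iname '*TCGA-DJ-A3UP*'" \
        > "pieces_${host}.txt"
done

# after copying the pieces down, concatenate them in order and verify
cat head_piece part1_shadow_1 part1_shadow_2 > C1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.recovered
md5sum C1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.recovered
```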
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
13 22:30:54.593557 7fa8b9e088c0 5 asok(0x36a20f0) register_command log dump hook 0x365a050 -20> 2016-08-13 22:30:54.593561 7fa8b9e088c0 5 asok(0x36a20f0) register_command log reopen hook 0x365a050 -19> 2016-08-13 22:30:54.596039 7fa8b9e088c0 0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 708653 -18> 2016-08-13 22:30:54.597587 7fa8b9e088c0 5 asok(0x36a20f0) init /var/run/ceph/ceph-mon.kh10-8.asok -17> 2016-08-13 22:30:54.597601 7fa8b9e088c0 5 asok(0x36a20f0) bind_and_listen /var/run/ceph/ceph-mon.kh10-8.asok -16> 2016-08-13 22:30:54.597767 7fa8b9e088c0 5 asok(0x36a20f0) register_command 0 hook 0x36560c0 -15> 2016-08-13 22:30:54.597775 7fa8b9e088c0 5 asok(0x36a20f0) register_command version hook 0x36560c0 -14> 2016-08-13 22:30:54.597778 7fa8b9e088c0 5 asok(0x36a20f0) register_command git_version hook 0x36560c0 -13> 2016-08-13 22:30:54.597781 7fa8b9e088c0 5 asok(0x36a20f0) register_command help hook 0x365a150 -12> 2016-08-13 22:30:54.597783 7fa8b9e088c0 5 asok(0x36a20f0) register_command get_command_descriptions hook 0x365a140 -11> 2016-08-13 22:30:54.597860 7fa8b5181700 5 asok(0x36a20f0) entry start -10> 2016-08-13 22:30:54.608150 7fa8b9e088c0 0 starting mon.kh10-8 rank 2 at 10.64.64.125:6789/0 mon_data /var/lib/ceph/mon/ceph-kh10-8 fsid e452874b-cb29-4468-ac7f-f8901dfccebf -9> 2016-08-13 22:30:54.608210 7fa8b9e088c0 1 -- 10.64.64.125:6789/0 learned my addr 10.64.64.125:6789/0 -8> 2016-08-13 22:30:54.608214 7fa8b9e088c0 1 accepter.accepter.bind my_inst.addr is 10.64.64.125:6789/0 need_addr=0 -7> 2016-08-13 22:30:54.608279 7fa8b9e088c0 5 adding auth protocol: cephx -6> 2016-08-13 22:30:54.608282 7fa8b9e088c0 5 adding auth protocol: cephx -5> 2016-08-13 22:30:54.608311 7fa8b9e088c0 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: daemon prio: info) -4> 2016-08-13 22:30:54.608317 7fa8b9e088c0 10 log_channel(audit) update_config to_monitors: true to_syslog: false syslog_facility: local0 prio: info) -3> 2016-08-13 22:30:54.608395 7fa8b9e088c0 1 mon.kh10-8@-1(probing) e1 preinit fsid e452874b-cb29-4468-ac7f-f8901dfccebf -2> 2016-08-13 22:30:54.608617 7fa8b9e088c0 1 mon.kh10-8@-1(probing).paxosservice(pgmap 0..35606392) refresh upgraded, format 0 -> 1 -1> 2016-08-13 22:30:54.608629 7fa8b9e088c0 1 mon.kh10-8@-1(probing).pg v0 on_upgrade discarding in-core PGMap 0> 2016-08-13 22:30:54.611791 7fa8b9e088c0 -1 *** Caught signal (Aborted) ** in thread 7fa8b9e088c0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432) 1: ceph-mon() [0x9b25ea] 2: (()+0x10330) [0x7fa8b8f0b330] 3: (gsignal()+0x37) [0x7fa8b73a8c37] 4: (abort()+0x148) [0x7fa8b73ac028] 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535] 6: (()+0x5e6d6) [0x7fa8b7cb16d6] 7: (()+0x5e703) [0x7fa8b7cb1703] 8: (()+0x5e922) [0x7fa8b7cb1922] 9: ceph-mon() [0x853c39] 10: (object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167) [0x894227] 11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf] 12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3] 13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8] 14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7] 15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a] 16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb] 17: (Monitor::init_paxos()+0x85) [0x5b2365] 18: (Monitor::preinit()+0x7d7) [0x5b6f87] 19: (main()+0x230c) [0x57853c] 20: (__libc_start_main()+0xf5) [0x7fa8b7393f45] 21: ceph-mon() [0x59a3c7] NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio -2/-2 (syslog threshold) 99/99 (stderr threshold) max_recent 1 max_new 1000 log_file --- end dump of recent events --- Aborted (core dumped) --- --- I feel like I am so close but so far. Can anyone give me a nudge as to what I can do next? it looks like it is bombing out on trying to get an updated paxos. On Fri, Aug 12, 2016 at 1:09 PM, Sean Sullivan wrote: > A coworker patched leveldb and w
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
A coworker patched leveldb and we were able to export quite a bit of data from kh08's leveldb database. At this point I think I need to re-construct a new leveldb with whatever values I can. Is it the same leveldb database across all 3 montiors? IE will keys exported from one work in the other? All should have the same keys/values although constructed differently right? I can't blindly copy /var/lib/ceph/mon/ceph-$(hostname)/store.db/ from one host to another right? But can I copy the keys/values from one to another? On Fri, Aug 12, 2016 at 12:45 PM, Sean Sullivan wrote: > ceph-monstore-tool? Is that the same as monmaptool? oops! NM found it in > ceph-test package:: > > I can't seem to get it working :-( dump monmap or any of the commands. > They all bomb out with the same message: > > root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool > /var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace > Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/ > store.db/10882319.ldb > root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool > /var/lib/ceph/mon/ceph-kh10-8 dump-keys > Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/ > store.db/10882319.ldb > > > I need to clarify as I originally had 2 clusters with this issue and now I > have 1 with all 3 monitors dead and 1 that I was successfully able to > repair. I am about to recap everything I know about the issue and the issue > at hand. Should I start a new email thread about this instead? > > The cluster that is currently having issues is on hammer (94.7), and the > monitor stats are the same:: > root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c > 24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz > ext4 volume comprised of 4x300GB 10k drives in raid 10. 
> Ubuntu 14.04 > > root@kh08-8:~# uname -a > Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC > 2016 x86_64 x86_64 x86_64 GNU/Linux > root@kh08-8:~# ceph --version > ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432) > > > From here: Here are the errors I am getting when starting each of the > monitors:: > > > --- > root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d > 2016-08-11 22:15:23.731550 7fe5ad3e98c0 0 ceph version 0.94.7 > (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309 > Corruption: error in middle of record > 2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data > directory at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument > -- > root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d > 2016-08-11 22:14:28.252370 7f7eaab908c0 0 ceph version 0.94.7 > (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 30 > Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/store.db/10845998.ldb > 2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data > directory at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument > -- > root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon > --cluster=ceph -i kh10-8 -d > 2016-08-11 22:17:54.632762 7f80bf34d8c0 0 ceph version 0.94.7 > (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620 > Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb > 2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data > directory at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument > --- > > > for kh08, a coworker patched leveldb to print and skip on the first error > and that one is also missing a bunch of files. As such I think kh10-8 is my > most likely candidate to recover, but either way recovery is probably not an > option. I see leveldb has a repair.cc (https://github.com/google/leveldb/blob/master/db/repair.cc) but I do not see repair mentioned in > the monitor code with respect to the dbstore. I tried using the leveldb python module > (plyvel) to attempt a repair but my repl just ends up dying. > > I understand two things:: 1.) Without rebuilding the monitor backend > leveldb store (the cluster map, as I understand it), all of the data in the > cluster is essentially lost (right?) > 2.) it is possible to rebuild > this database via some form of magic or (source)ry as all of this data is > essentially held throughout the cluster as well. > > We only use radosgw / S3 for this cluster. If there is a way to recover my > data that is easier/more likely than rebuilding the leveldb of a monitor > and starting a single monitor cluster up, I would like to switch gears and > focus on that. > > Looking
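For anyone who wants to poke at a store the same way: a rough sketch of the plyvel approach follows, meant to be run only against a copy of a store.db, never the live one. The path is an example, there is no guarantee repair_db gets any further than leveldb's own repair.cc on this kind of corruption, and the NUL-byte prefix/key split is an assumption to verify against your own dump.
'''
#!/usr/bin/env python
# Rough sketch only: run leveldb's built-in repair (the same code path as
# db/repair.cc) on a COPY of a monitor store, then count surviving keys per
# prefix so the three mons' stores can be compared. The path is an example
# and the '\x00' prefix/key separator is an assumption.
import collections
import plyvel

STORE_COPY = '/root/mon-recovery/kh10-8-store.db'  # work on a copy only

plyvel.repair_db(STORE_COPY)

db = plyvel.DB(STORE_COPY)
counts = collections.Counter()
for key, value in db.iterator():
    prefix = key.split('\x00', 1)[0]   # assumed prefix/key separator
    counts[prefix] += 1
db.close()

for prefix, n in counts.most_common():
    print '%s %d' % (prefix, n)
'''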
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
iew a CRUSH map, execute ceph osd getcrushmap -o {filename}; then, decompile it by executing crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}. You can view the decompiled map in a text editor or with cat. The MDS Map: Contains the current MDS map epoch, when the map was created, and the last time it changed. It also contains the pool for storing metadata, a list of metadata servers, and which metadata servers are up and in. To view an MDS map, execute ceph mds dump. ``` As we don't use cephfs mds can essentially be blank(right) so I am left with 4 valid maps needed to get a working cluster again. I don't see auth mentioned in there but that too. Then I just need to rebuild the leveldb database somehow with the right information and I should be good. So long long long journey ahead. I don't think that the data is stored in strings or json, right? Am I going down the wrong path here? Is there a shorter/simpler path to retrieve the data from a cluster that lost all 3 monitors in power falure? If I am going down the right path is there any advice on how I can assemble/repair the database? I see that there is a rbd recovery from a dead cluster tool. Is it possible to do the same with s3 objects? On Thu, Aug 11, 2016 at 11:15 AM, Wido den Hollander wrote: > > > Op 11 augustus 2016 om 15:17 schreef Sean Sullivan < > seapasu...@uchicago.edu>: > > > > > > Hello Wido, > > > > Thanks for the advice. While the data center has a/b circuits and > > redundant power, etc if a ground fault happens it travels outside and > > fails causing the whole building to fail (apparently). > > > > The monitors are each the same with > > 2x e5 cpus > > 64gb of ram > > 4x 300gb 10k SAS drives in raid 10 (write through mode). > > Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 > - > > 3am CST) > > Ceph hammer LTS 0.94.7 > > > > (we are still working on our jewel test cluster so it is planned but not > in > > place yet) > > > > The only thing that seems to be corrupt is the monitors leveldb store. I > > see multiple issues on Google leveldb github from March 2016 about fsync > > and power failure so I assume this is an issue with leveldb. > > > > I have backed up /var/lib/ceph/Mon on all of my monitors before trying to > > proceed with any form of recovery. > > > > Is there any way to reconstruct the leveldb or replace the monitors and > > recover the data? > > > I don't know. I have never done it. Other people might know this better > than me. > > Maybe 'ceph-monstore-tool' can help you? > > Wido > > > I found the following post in which sage says it is tedious but > possible. ( > > http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine > if > > I have any chance of doing it. I have the fsid, the Mon key map and all > of > > the osds look to be fine so all of the previous osd maps are there. > > > > I just don't understand what key/values I need inside. > > > > On Aug 11, 2016 1:33 AM, "Wido den Hollander" wrote: > > > > > > > > > Op 11 augustus 2016 om 0:10 schreef Sean Sullivan < > > > seapasu...@uchicago.edu>: > > > > > > > > > > > > I think it just got worse:: > > > > > > > > all three monitors on my other cluster say that ceph-mon can't open > > > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you > lose > > > all > > > > 3 monitors? I saw a post by Sage saying that the data can be > recovered as > > > > all of the data is held on other servers. Is this possible? If so has > > > > anyone had any experience doing so? 
> > > > > > I have never done so, so I couldn't tell you. > > > > > > However, it is weird that on all three it got corrupted. What hardware > are > > > you using? Was it properly protected against power failure? > > > > > > If you mon store is corrupted I'm not sure what might happen. > > > > > > However, make a backup of ALL monitors right now before doing anything. > > > > > > Wido > > > > > > > ___ > > > > ceph-users mailing list > > > > ceph-users@lists.ceph.com > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- - Sean: I wrote this. - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
Hello Wido, Thanks for the advice. While the data center has a/b circuits and redundant power, etc if a ground fault happens it travels outside and fails causing the whole building to fail (apparently). The monitors are each the same with 2x e5 cpus 64gb of ram 4x 300gb 10k SAS drives in raid 10 (write through mode). Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 - 3am CST) Ceph hammer LTS 0.94.7 (we are still working on our jewel test cluster so it is planned but not in place yet) The only thing that seems to be corrupt is the monitors leveldb store. I see multiple issues on Google leveldb github from March 2016 about fsync and power failure so I assume this is an issue with leveldb. I have backed up /var/lib/ceph/Mon on all of my monitors before trying to proceed with any form of recovery. Is there any way to reconstruct the leveldb or replace the monitors and recover the data? I found the following post in which sage says it is tedious but possible. ( http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if I have any chance of doing it. I have the fsid, the Mon key map and all of the osds look to be fine so all of the previous osd maps are there. I just don't understand what key/values I need inside. On Aug 11, 2016 1:33 AM, "Wido den Hollander" wrote: > > > Op 11 augustus 2016 om 0:10 schreef Sean Sullivan < > seapasu...@uchicago.edu>: > > > > > > I think it just got worse:: > > > > all three monitors on my other cluster say that ceph-mon can't open > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose > all > > 3 monitors? I saw a post by Sage saying that the data can be recovered as > > all of the data is held on other servers. Is this possible? If so has > > anyone had any experience doing so? > > I have never done so, so I couldn't tell you. > > However, it is weird that on all three it got corrupted. What hardware are > you using? Was it properly protected against power failure? > > If you mon store is corrupted I'm not sure what might happen. > > However, make a backup of ALL monitors right now before doing anything. > > Wido > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fwd: lost power. monitors died. Cephx errors now
I think it just got worse:: all three monitors on my other cluster say that ceph-mon can't open /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all 3 monitors? I saw a post by Sage saying that the data can be recovered as all of the data is held on other servers. Is this possible? If so has anyone had any experience doing so? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] lost power. monitors died. Cephx errors now
So our datacenter lost power and 2/3 of our monitors died with FS corruption. I tried fixing it but it looks like the store.db didn't make it. I rebuilt the failed monitors and copied over the working monitor's map via 1. sudo mv /var/lib/ceph/mon/ceph-$(hostname){,.BAK} 2. sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} 3. ceph-mon -i `hostname` --extract-monmap /tmp/monmap 4. ceph-mon -i {mon-id} --inject-monmap {map-path} and for a brief moment I had a quorum, but any ceph cli commands would result in cephx errors. Now the two failed monitors have elected a quorum and the monitor that was working keeps getting kicked out of the cluster:: ''' { "election_epoch": 402, "quorum": [ 0, 1 ], "quorum_names": [ "kh11-8", "kh12-8" ], "quorum_leader_name": "kh11-8", "monmap": { "epoch": 1, "fsid": "a6ae50db-5c71-4ef8-885e-8137c7793da8", "modified": "0.00", "created": "0.00", "mons": [ { "rank": 0, "name": "kh11-8", "addr": "10.64.64.134:6789\/0" }, { "rank": 1, "name": "kh12-8", "addr": "10.64.64.143:6789\/0" }, { "rank": 2, "name": "kh13-8", "addr": "10.64.64.151:6789\/0" } ] } } ''' At this point I am not sure what to do, as any ceph commands return cephx errors and I can't seem to verify if the new "quorum" is actually valid. Is there any way to regenerate a cephx authentication key or recover it with hardware access to the nodes, or any advice on how to recover from what seems to be complete monitor failure? -- - Sean: I wrote this. - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Power Outage! Oh No!
So we recently had a power outage and I seem to have lost 2 of 3 of my monitors. I have since copied /var/lib/ceph/mon/ceph-$(hostname){,.BAK} and then created a new cephfs and finally generated a new filesystem via ''' sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} ''' After this I copied the monmap from the working monitor to the other two. via:: ''' ceph-mon -i {mon-id} --inject-monmap {map-path} ''' At this point I was left with a working monitor map (afaik) but ceph cli commands return :: ''' root@kh11-8:/var/run/ceph# ceph -s 2016-08-10 14:13:58.563241 7fdd719b3700 0 librados: client.admin authentication error (1) Operation not permitted Error connecting to cluster: PermissionError ''' Now after waiting a little while it looks like the quorum kicked out the only working monitor:: ''' { "election_epoch": 358, "quorum": [ 0, 1 ], "quorum_names": [ "kh11-8", "kh12-8" ], "quorum_leader_name": "kh11-8", "monmap": { "epoch": 1, "fsid": "a6ae50db-5c71-4ef8-885e-8137c7793da8", "modified": "0.00", "created": "0.00", "mons": [ { "rank": 0, "name": "kh11-8", "addr": "10.64.64.134:6789\/0" }, { "rank": 1, "name": "kh12-8", "addr": "10.64.64.143:6789\/0" }, { "rank": 2, "name": "kh13-8", "addr": "10.64.64.151:6789\/0" } ] } } ''' kh13-8 was the original working node and kh11-8 and kh12-8 were the ones that had fs issues. Currently I am at a loss as to what to do as ceph -w and -s commands do not work due to permissions/cephx errors and the original working monitor was kicked out. Is there any way to regenerate the cephx authentication and recover the monitor map? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Radosgw (civetweb) hangs once around 850 established connections
Hi Ben! I'm using ubuntu 14.04 I have restarted the gateways with the numthreads line you suggested. I hope this helps. I would think I would get some kind of throttle log or something. 500 seems really strange as well. Do you have a thread for this? RGW still has a weird race condition with multipart uploads where it garbage collects the parts but I think I get a 404 for those which makes sense. I hope you're not seeing something similar. Thanks for the tip and good luck! I'll bump this thread when it happens again. Sent from my pocket typo cannon. On March 16, 2016 8:30:46 PM Ben Hines wrote: What OS are you using? I have a lot more open connections than that. (though i have some other issues, where rgw sometimes returns 500 errors, it doesn't stop like yours) You might try tuning civetweb's num_threads and 'rgw num rados handles': rgw frontends = civetweb num_threads=125 error_log_file=/var/log/radosgw/civetweb.error.log access_log_file=/var/log/radosgw/civetweb.access.log rgw num rados handles = 32 You can also up civetweb loglevel: debug civetweb = 20 -Ben On Wed, Mar 16, 2016 at 5:03 PM, seapasu...@uchicago.edu < seapasu...@uchicago.edu> wrote: I have a cluster of around 630 OSDs with 3 dedicated monitors and 2 dedicated gateways. The entire cluster is running hammer (0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)). (Both of my gateways have stopped responding to curl right now. root@host:~# timeout 5 curl localhost ; echo $? 124 From here I checked and it looks like radosgw has over 1 million open files: root@host:~# grep -i rados whatisopen.files.list | wc -l 1151753 And around 750 open connections: root@host:~# netstat -planet | grep radosgw | wc -l 752 root@host:~# ss -tnlap | grep rados | wc -l 752 I don't think that the backend storage is hanging based on the following dump: root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok objecter_requests | grep -i mtime "mtime": "0.00", "mtime": "0.00", "mtime": "0.00", "mtime": "0.00", "mtime": "0.00", "mtime": "0.00", [...] "mtime": "0.00", The radosgw log is still showing lots of activity and so does strace which makes me think this is a config issue or limit of some kind that is not triggering a log. Of what I am not sure as the log doesn't seem to show any open file limit being hit and I don't see any big errors showing up in the logs. (last 500 lines of /var/log/radosgw/client.radosgw.log) http://pastebin.com/jmM1GFSA Perf dump of radosgw http://pastebin.com/rjfqkxzE Radosgw objecter requests: http://pastebin.com/skDJiyHb After restarting the gateway with '/etc/init.d/radosgw restart' the old process remains, no error is sent, and then I get connection refused via curl or netcat:: root@kh11-9:~# curl localhost curl: (7) Failed to connect to localhost port 80: Connection refused Once I kill the old radosgw via sigkill the new radosgw instance restarts automatically and starts responding:: root@kh11-9:~# curl localhost http://s3.amazonaws.com/doc/2006-03-01/ ">anonymoushttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
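One crude way to poke at the thread-exhaustion theory from the client side is to open a burst of concurrent requests against a gateway and count how many complete before a timeout; if completions flatten out near the civetweb num_threads value, the thread pool is a likely suspect. This is only a sketch, with the endpoint and counts as placeholders:
'''
#!/usr/bin/env python
# Rough probe for the "hangs around ~850 connections" symptom: fire N
# concurrent GETs at the gateway and report how many complete vs time out.
# URL, N, and TIMEOUT are placeholders to adjust for your setup.
import threading
import urllib2

URL = 'http://localhost/'   # gateway endpoint (placeholder)
N = 200                     # concurrent requests to attempt
TIMEOUT = 10                # seconds

results = {'ok': 0, 'fail': 0}
lock = threading.Lock()

def hit():
    try:
        urllib2.urlopen(URL, timeout=TIMEOUT).read(64)
        with lock:
            results['ok'] += 1
    except Exception:
        with lock:
            results['fail'] += 1

threads = [threading.Thread(target=hit) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print results
'''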
[ceph-users] Ceph-deploy won't write journal if partition exists and using -- dmcrypt
Some context. I have a small cluster running Ubuntu 14.04 and giant (now hammer). I ran some updates and everything was fine. Rebooted a node and a drive must have failed as it no longer shows up. I use --dmcrypt with ceph-deploy and 5 osds per ssd journal. To do this I created the ssd partitions already and pointed ceph-deploy towards the partition for the journal. This worked in giant without issue (I was able to zap the osd and redeploy using the same journal all of the time). Now it seems to fail in hammer stating that the partition exists and I'm using --dmcrypt. This raises a few questions. 1.) ceph osd start scripts must have a list of dm-crypt keys and uuids somewhere as the init mounts the drives. Is this accessible? Normally outside of ceph I've used crypttab, how is ceph doing it? 2.) my ceph-deploy line is: ceph-deploy osd --dmcrypt create ${host}:/dev/drive:/dev/journal_partition I see that a variable in ceph-disk exists and is set to false. Is this what I would need to change to get this working again? Or is this set to false for a reason? 3.) I see multiple references to journal_uuid in Sebastien Han's blog as well as the mailing list when replacing a disk. I don't have this file, and I'm assuming it's due to the --dmcrypt flag. I also see 60 dmcrypt keys in /etc/ceph/dmcrypt-keys but only 30 mapped devices. Are the journals not using these keys at all? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RGW - Can't download complete object
Thank you so much Yahuda! I look forward to testing these. Is there a way for me to pull this code in? Is it in master? On May 13, 2015 7:08:44 PM Yehuda Sadeh-Weinraub wrote: Ok, I dug a bit more, and it seems to me that the problem is with the manifest that was created. I was able to reproduce a similar issue (opened ceph bug #11622), for which I also have a fix. I created new tests to cover this issue, and we'll get those recent fixes as soon as we can, after we test for any regressions. Thanks, Yehuda - Original Message - > From: "Yehuda Sadeh-Weinraub" > To: "Sean Sullivan" > Cc: ceph-users@lists.ceph.com > Sent: Wednesday, May 13, 2015 2:33:07 PM > Subject: Re: [ceph-users] RGW - Can't download complete object > > That's another interesting issue. Note that for part 12_80 the manifest > specifies (I assume, by the messenger log) this part: > > default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80 > (note the 'tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14') > > whereas it seems that you do have the original part: > default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.12_80 > (note the '2/...') > > The part that the manifest specifies does not exist, which makes me think > that there is some weird upload sequence, something like: > > - client uploads part, upload finishes but client does not get ack for it > - client retries (second upload) > - client gets ack for the first upload and gives up on the second one > > But I'm not sure if it would explain the manifest, I'll need to take a look > at the code. Could such a sequence happen with the client that you're using > to upload? > > Yehuda > > - Original Message - > > From: "Sean Sullivan" > > To: "Yehuda Sadeh-Weinraub" > > Cc: ceph-users@lists.ceph.com > > Sent: Wednesday, May 13, 2015 2:07:22 PM > > Subject: Re: [ceph-users] RGW - Can't download complete object > > > > Sorry for the delay. It took me a while to figure out how to do a range > > request and append the data to a single file. The good news is that the end > > file seems to be 14G in size which matches the files manifest size. The bad > > news is that the file is completely corrupt and the radosgw log has errors. 
> > I am using the following code to perform the download:: > > > > https://raw.githubusercontent.com/mumrah/s3-multipart/master/s3-mp-download.py > > > > Here is a clip of the log file:: > > -- > > 2015-05-11 15:28:52.313742 7f570db7d700 1 -- 10.64.64.126:0/108 <== > > osd.11 10.64.64.101:6809/942707 5 osd_op_reply(74566287 > > default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_12 > > [read 0~858004] v0'0 uv41308 ondisk = 0) v6 304+0+858004 (1180387808 0 > > 2445559038) 0x7f53d005b1a0 con 0x7f56f8119240 > > 2015-05-11 15:28:52.313797 7f57067fc700 20 get_obj_aio_completion_cb: io > > completion ofs=12934184960 len=858004 > > 2015-05-11 15:28:52.372453 7f570db7d700 1 -- 10.64.64.126:0/108 <== > > osd.45 10.64.64.101:6845/944590 2 osd_op_reply(74566142 > > default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80 > > [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 > > 302+0+0 (3754425489 0 0) 0x7f53d005b1a0 con 0x7f56f81b1f30 > > 2015-05-11 15:28:52.372494 7f57067fc700 20 get_obj_aio_completion_cb: io > > completion ofs=12145655808 len=4194304 > > > > 2015-05-11 15:28:52.372501 7f57067fc700 0 ERROR: got unexpected error when > > trying to read object: -2 > > > > 2015-05-11 15:28:52.426079 7f570db7d700 1 -- 10.64.64.126:0/108 <== > > osd.21 10.64.64.102:6856/1133473 16 osd_op_reply(74566144 > > default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.11_12 > > [read 0~3671316] v0'0 uv41395 ondisk = 0) v6 304+0+3671316 (1695485150 > > 0 3933234139) 0x7f53d005b1a0 con 0x7f56f81e17d0 > > 2015-05-11 15:28:52.426123 7f57067fc700 20 get_obj_aio_completion_cb: io > > completion ofs=10786701312 len=3671316 > > 2015-05-11 15:28:52.504072 7f570db7d700 1 -- 10.64.64.126:0/108 <== > > osd.82 10.64.64.103:6857/88524 2 osd_op_reply(74566283 > > default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de6
Re: [ceph-users] RGW - Can't download complete object
Sorry for the delay. It took me a while to figure out how to do a range request and append the data to a single file. The good news is that the end file seems to be 14G in size which matches the files manifest size. The bad news is that the file is completely corrupt and the radosgw log has errors. I am using the following code to perform the download:: https://raw.githubusercontent.com/mumrah/s3-multipart/master/s3-mp-download.py Here is a clip of the log file:: -- 2015-05-11 15:28:52.313742 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.11 10.64.64.101:6809/942707 5 osd_op_reply(74566287 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_12 [read 0~858004] v0'0 uv41308 ondisk = 0) v6 304+0+858004 (1180387808 0 2445559038) 0x7f53d005b1a0 con 0x7f56f8119240 2015-05-11 15:28:52.313797 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12934184960 len=858004 2015-05-11 15:28:52.372453 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.45 10.64.64.101:6845/944590 2 osd_op_reply(74566142 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 302+0+0 (3754425489 0 0) 0x7f53d005b1a0 con 0x7f56f81b1f30 2015-05-11 15:28:52.372494 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12145655808 len=4194304 2015-05-11 15:28:52.372501 7f57067fc700 0 ERROR: got unexpected error when trying to read object: -2 2015-05-11 15:28:52.426079 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.21 10.64.64.102:6856/1133473 16 osd_op_reply(74566144 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.11_12 [read 0~3671316] v0'0 uv41395 ondisk = 0) v6 304+0+3671316 (1695485150 0 3933234139) 0x7f53d005b1a0 con 0x7f56f81e17d0 2015-05-11 15:28:52.426123 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=10786701312 len=3671316 2015-05-11 15:28:52.504072 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.82 10.64.64.103:6857/88524 2 osd_op_reply(74566283 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_8 [read 0~4194304] v0'0 uv41566 ondisk = 0) v6 303+0+4194304 (1474509283 0 3209869954) 0x7f53d005b1a0 con 0x7f56f81b1420 2015-05-11 15:28:52.504118 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12917407744 len=4194304 I couldn't really find any good documentation on how fragments/files are layed out on the object file system so I am not sure on where the file will be. How could the 4mb object have issues but the cluster be completely health okay? I did do the rados stat of each object inside ceph and they all appear to be there:: http://paste.ubuntu.com/8561/ The sum of all of the objects :: 14584887282 The stat of the object inside ceph:: 14577056082 So for some reason I have more data in objects than the key manifest. 
We easily identified this object via the same method as the other thread I have::

for key in keys:
    if key.name == 'b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam':
        implicit = key.size
        explicit = conn.get_bucket(bucket).get_key(key.name).size
        absolute = abs(implicit - explicit)
        print key.name
        print implicit
        print explicit

b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam
14578628946
14577056082

So it looks like I have 3 different sizes. I figure this may be the network issue that was mentioned in the other thread, but seeing as this is not the first 512k, the overall size still matches, and given the errors I am seeing in the gateway, I feel that this may be a bigger issue. Has anyone seen this before? The only mention of the "got unexpected error when trying to read object" is here (http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-May/021688.html) but my google skills are pretty poor. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
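For reference, the ranged download itself is simple enough to sketch without the s3-mp-download.py script; this is only an outline of the approach, with the endpoint, credentials, bucket, and key names as placeholders and no error handling:
'''
#!/usr/bin/env python
# Rough sketch: fetch an RGW/S3 object in fixed-size ranges, append them to
# one file, and compare the byte count against the key's reported size.
# Endpoint, credentials, bucket, and key names are placeholders.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

bucket = conn.get_bucket('mybucket')
key = bucket.get_key('path/to/object.bam')
total = key.size                       # capture before ranged reads
chunk = 64 * 1024 * 1024               # 64 MB ranges

written = 0
with open('object.bam', 'wb') as out:
    while written < total:
        end = min(written + chunk, total) - 1
        key.get_contents_to_file(
            out, headers={'Range': 'bytes=%d-%d' % (written, end)})
        written = end + 1

print written, total   # should match if every range came back complete
'''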
Re: [ceph-users] Civet RadosGW S3 not storing complete objects; civetweb logs stop after rotation
Will do. The reason for the partial request is that the total size of the file is close to 1TB so attempting a download would take quite some time on our 10Gb connection. What is odd is that if I request the last bit received to the end of the file we get a 406 can not be satisfied response while if I request one byte less to the end of the file we are only given 1byte but not the whole file. I will bump it up and attempt a partial then full download. Thanks for the reply!! On April 28, 2015 5:03:12 PM Yehuda Sadeh-Weinraub wrote: - Original Message - > From: "Sean" > To: ceph-users@lists.ceph.com > Sent: Tuesday, April 28, 2015 2:52:35 PM > Subject: [ceph-users] Civet RadosGW S3 not storing complete obects; civetweb logs stop after rotation > > Hey yall! > > I have a weird issue and I am not sure where to look so any help would > be appreciated. I have a large ceph giant cluster that has been stable > and healthy almost entirely since its inception. We have stored over > 1.5PB into the cluster currently through RGW and everything seems to be > functioning great. We have downloaded smaller objects without issue but > last night we did a test on our largest file (almost 1 terabyte) and it > continuously times out at almost the exact same place. Investigating > further it looks like Civetweb/RGW is returning that the uploads > completed even though the objects are truncated. At least when we > download the objects they seem to be truncated. > > I have tried searching through the mailing list archives to see what may > be going on but it looks like the mailing list DB may be going through > some mainenance: > > > Unable to read word database file > '/dh/mailman/dap/archives/private/ceph-users-ceph.com/htdig/db.words.db' > > > After checking through the gzipped logs I see that civetweb just stops > logging after a rotation for some reason as well and my last log is from > the 28th of march. I tried manually running /etc/init.d/radosgw reload > but this didn't seem to work. As running the download again could take > all day to error out we instead use the range request to try and pull > the missing bites. > > https://gist.github.com/MurphyMarkW/8e356823cfe00de86a48 -- there is the > code we are using to download via S3 / boto as well as the returned size > report and overview of our issue. > http://pastebin.com/cVLdQBMF-- Here is some of the log from the civetweb > server they are hitting. > > Here is our current config :: > http://pastebin.com/2SGfSDYG > > Current output of ceph health:: > http://pastebin.com/3f6iJEbu > > I am thinking that this must be a civetweb/radosgw bug of somekind. My > question is 1.) is there a way to try and download the object via rados > directly I am guessing I will need to find the prefix and then just cat > all of them together and hope I get it right? 2.) Why would ceph say the > upload went fine but then return a smaller object? > > Note that the returned http resonse returns 206 (partial content): /var/log/radosgw/client.radosgw.log:2015-04-28 16:08:26.525268 7f6e93fff700 2 req 0:1.067030:s3:GET /tcga_cghub_protected/ff9b730c-d303-4d49-b28f-e0bf9d8f1c84/759366461d2bf8bb0583d5b9566ce947.bam:get_obj:http status=206 It'll only return that if partial content is requested (through the http Range header). It's really hard to tell from these logs whether there's any actual problem. I suggest bumping up the log level (debug ms = 1, debug rgw = 20), and take a look at an entire request (one that include all the request http headers). 
Yehuda > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Can not list objects in large bucket
I have a single radosgw user with 2 s3 keys and 1 swift key. I have created a few buckets and I can list all of the contents of bucket A and C but not B with either S3 (boto) or python-swiftclient. I am able to list the first 1000 entries using radosgw-admin 'bucket list --bucket=bucketB' without any issues but this doesn't really help. The odd thing is I can still upload and download objects in the bucket. I just can't list them. I tried setting the bucket canned_acl to private and public but I still can't list the objects inside. I'm using ceph .87 (Giant) Here is some info about the cluster:: http://pastebin.com/LvQYnXem -- ceph.conf http://pastebin.com/efBBPCwa -- ceph -s http://pastebin.com/tF62WMU9 -- radosgw-admin bucket list http://pastebin.com/CZ8TkyNG -- python list bucket objects script http://pastebin.com/TUCyxhMD -- radosgw-admin bucket stats --bucketB http://pastebin.com/uHbEtGHs -- rados -p .rgw.buckets ls | grep default.20283.2 (bucketB marker) http://pastebin.com/WYwfQndV -- Python Error when trying to list BucketB via boto I have no idea why this could be happening outside of the acl. Has anyone seen this before? Any idea on how I can get access to this bucket again via s3/swift? Also is there a way to list the full list of a bucket via radosgw-admin and not the first 9000 lines / 1000 entries, or a way to page through them? EDIT:: I just fixed it (I hope) but the fix doesn't make any sense: radosgw-admin bucket unlink --uid=user --bucket=bucketB radosgw-admin bucket link --uid=user --bucket=bucketB --bucket-id=default.20283.2 Now with swift or s3 (boto) I am able to list the bucket contents without issue ^_^ Can someone elaborate on why this works and how it broken in the first place when ceph was health_ok the entire time? With 3 replicas how did this happen? Could this be a bug? sorry for the rambling. I am confused and tired ;p ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
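In case it helps anyone else hitting the listing problem: on the client side, boto can walk a bucket in pages with an explicit marker (the same thing bucket.list() does under the hood). A rough sketch, with the endpoint and credentials as placeholders and 'bucketB' standing in for the affected bucket:
'''
#!/usr/bin/env python
# Rough sketch: page through a large bucket 1000 keys at a time using an
# explicit marker. Endpoint, credentials, and bucket name are placeholders.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

bucket = conn.get_bucket('bucketB')
marker = ''
total = 0
while True:
    page = bucket.get_all_keys(max_keys=1000, marker=marker)
    total += len(page)
    # for key in page: print key.name   # uncomment to dump every object name
    if not page.is_truncated:
        break
    marker = page[-1].name

print 'objects listed:', total
'''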
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
I am trying to understand these drive throttle markers that were mentioned to get an idea of why these drives are marked as slow:: here is the iostat of the drive /dev/sdbm http://paste.ubuntu.com/9607168/ an IO wait of .79 doesn't seem bad but a write wait of 21.52 seems really high. Looking at the ops in flight:: http://paste.ubuntu.com/9607253/ If we check against all of the osds on this node, this seems strange:: http://paste.ubuntu.com/9607331/ I do not understand why this node has ops in flight while the remainder seem to be performing without issue. The load on the node is pretty light as well with an average CPU at 16 and an average iowait of .79::
---
/var/run/ceph# iostat -xm /dev/sdbm
Linux 3.13.0-40-generic (kh10-4) 12/23/2014 _x86_64_ (40 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
          3.94  0.00   23.30    0.79   0.00 71.97
Device: rrqm/s wrqm/s  r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdbm      0.09   0.25 5.03 3.42  0.55  0.63   288.02     0.09 10.56    2.55   22.32  2.54  2.15
---
I am still trying to understand the osd throttle perfdump so if anyone can help shed some light on this that would be rad. From what I can tell from the perfdump, 4 osds stand out (the last one, 228, being the slow one currently). I ended up pulling .228 from the cluster and I have yet to see another slow/blocked osd in the output of ceph -s. It is still rebuilding as I just pulled .228 out but I am still getting at least 200MB/s via bonnie while the rebuild is occurring. Finally, if this helps anyone: one 1x1GB upload takes around 2.0 - 2.5 minutes, but if we split a 10G file into 100 x 100MB parts we get a completion time of about 1 minute, which works out to a 10G file in about 1-1.5 minutes, or 166.66MB/s versus the 8MB/s I was getting before with sequential uploads. All of these are coming from a single client via boto. This leads me to think that this is a radosgw issue specifically. This again makes me think that this is not a slow disk issue but an overall radosgw issue. If this were structural in any way I would think that all of rados/ceph's faculties would be hit and the 8MBps limit per client would be due to client throttling because a ceiling was being hit. As it turns out I am not hitting the ceiling but some other aspect of the radosgw or boto is limiting my throughput. Is this logic not correct? I feel like I am missing something. Thanks for the help everyone! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
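For anyone who wants to reproduce the split-upload numbers against a single object rather than 100 separate files, the S3 way to do it is a multipart upload. Below is a rough boto sketch: it sends the parts sequentially for clarity (the s3-mp-upload style tools issue the same calls from several processes, which is where the speedup comes from), and the endpoint, credentials, file name, and sizes are all placeholders.
'''
#!/usr/bin/env python
# Rough sketch: upload one large file to RGW as an S3 multipart upload.
# Parts go up sequentially here; parallelizing the same calls across
# processes is what gives the big speedup. Names and sizes are placeholders.
from cStringIO import StringIO

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

bucket = conn.get_bucket('mybucket')
path = '/tmp/bigfile.bin'
part_size = 100 * 1024 * 1024          # 100 MB parts, as in the test above

mp = bucket.initiate_multipart_upload('bigfile.bin')
with open(path, 'rb') as src:
    part_num = 0
    while True:
        data = src.read(part_size)
        if not data:
            break
        part_num += 1
        mp.upload_part_from_file(StringIO(data), part_num)
mp.complete_upload()
print 'uploaded %s in %d parts' % (path, part_num)
'''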
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Awesome! I have yet to hear of any zfs in ceph chat nor have I seen it on the mailing lists that I have caught. I would assume it would function pretty well considering how long it has been in use along some production systems I have seen. I have little to no experience with it personally though. I thought the rados issue was weird as well. Even with a degraded cluster I feel like I should be getting better throughput unless I hit an object with a bunch of bad PGs or something. We are using 2x 2x10G cards in LACP to get over 10G on average and have separate gateway nodes (Went with the Supermicro kit after all) so CPU on those nodes shouldn't be an issue. It is extremely low as it is currently which is again surprising. I honestly think that this is some kind of radosgw bug in giant as I have another giant cluster with the exact same config that is performing much better with much less hardware. Hopefully it is indeed a bug of somesort and not yet another screw up on my end. Furthermore hopefully I find the bug and fix it for others to find and profit from ^_^. Thanks for all of your help! On 12/22/2014 05:26 PM, Craig Lewis wrote: > > > On Mon, Dec 22, 2014 at 2:57 PM, Sean Sullivan > mailto:seapasu...@uchicago.edu>> wrote: > > Thanks Craig! > > I think that this may very well be my issue with osds dropping out > but I am still not certain as I had the cluster up for a small > period while running rados bench for a few days without any status > changes. > > > Mine were fine for a while too, through several benchmarks and a large > RadosGW import. My problems were memory pressure plus an XFS bug, so > it took a while to manifest. When it did, all of the ceph-osd > processes on that node would have periods of ~30 seconds with 100% > CPU. Some OSDs would get kicked out. Once that started, it was a > downward spiral of recovery causing increasing load causing more OSDs > to get kicked out... > > Once I found the memory problem, I cronned a buffer flush, and that > usually kept things from getting too bad. > > I was able to see on the CPU graphs that CPU was increasing before the > problems started. Once CPU got close to 100% usage on all cores, > that's when the OSDs started dropping out. Hard to say if it was the > CPU itself, or if the CPU was just a symptom of the memory pressure > plus XFS bug. > > > > > The real big issue that I have is the radosgw one currently. After > I figure out the root cause of the slow radosgw performance and > correct that, it should hopefully buy me enough time to figure out > the osd slow issue. > > It just doesn't make sense that I am getting 8mbps per client no > matter 1 or 60 clients while rbd and rados shoot well above 600MBs > (above 1000 as well). > > > That is strange. I was able to get >300 Mbps per client, on a 3 node > cluster with GigE. I expected that each client would saturate the > GigE on their own, but 300 Mbps is more than enough for now. > > I am using the Ceph apache and fastcgi module, but otherwise it's a > pretty standard apache setup. My RadosGW processes are using a fair > amount of CPU, but as long as you have some idle CPU, that shouldn't > be the bottleneck. > > > > > > May I ask how you are monitoring your clusters logs? Are you just > using rsyslog or do you have a logstash type system set up? Load > wise I do not see a spike until I pull an osd out of the cluster > or stop then start an osd without marking nodown. > > > I'm monitoring the cluster with Zabbix, and that gives me pretty much > the same info that I'd get in the logs. 
I am planning to start > pushing the logs to Logstash soon, as soon as I get my logstash is > able to handle the extra load. > > > > I do think that CPU is probably the cause of the osd slow issue > though as it makes the most logical sense. Did you end up dropping > ceph and moving to zfs or did you stick with it and try to > mitigate it via file flusher/ other tweaks? > > > I'm still on Ceph. I worked around the memory pressure by > reformatting my XFS filesystems to use regular sized inodes. It was a > rough couple of months, but everything has been stable for the last > two months. > > I do still want to use ZFS on my OSDs. It's got all the features of > BtrFS, with the extra feature of being production ready. It's just > not production ready in Ceph yet. It's coming along nicely though, > and I hope to reformat one node to be all ZFS sometime next year. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Hello Christian, Sorry for the long wait. Actually I have done a rados bench earlier on in the cluster without any failure but it did take a while. That and there is actually a lot of data being downloaded to the cluster now. Here are the rados results for 100 seconds:: http://pastebin.com/q5E6JjkG On 12/19/2014 08:10 PM, Christian Balzer wrote > Hello Sean, > > On Fri, 19 Dec 2014 02:47:41 -0600 Sean Sullivan wrote: > >> Hello Christian, >> >> Thanks again for all of your help! I started a bonnie test using the >> following:: >> bonnie -d /mnt/rbd/scratch2/ -m $(hostname) -f -b >> > While that gives you a decent idea of what the limitations of kernelspace e - > mounted RBD images are, it won't tell you what your cluster is actually > capable of in raw power. Indeed I agree here, and I am not interested in raw power at this point as I am a bit past this. I performed a rados test prior and it seemed to do pretty well, or as expected. What I have noticed in rados bench tests is that the test can only go as fast as the client network can allow. The above seems to demonstrate this as well. If I were to start two rados bench tests from two different hosts I am confident I can push above 1100 Mbps without any issue. > > For that use rados bench, however if your cluster is as brittle as it > seems, this may very well cause OSDs to flop, so look out for that. > Observe your nodes (a bit tricky with 21, but try) while this is going on. > > To test the write throughput, do something like this: > "rados -p rbd bench 60 write -t 64" > > To see your CPUs melt and get an idea of the IOPS capability with 4k > blocks, do this: > > "rados -p rbd bench 60 write -t 64 -b 4096" > I will try with 4k blocks next to see how this works out. I honestly think that the cluster will stress but should be able to handle it. A rebuild on failure will be scary however. >> Hopefully it completes in the next hour or so. A reboot of the slow OSDs >> clears the slow marker for now >> >> kh10-9$ ceph -w >> cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec >> health HEALTH_OK >> monmap e1: 3 mons at > 3 monitors, another recommendation/default that isn't really adequate for > a cluster of this size and magnitude. Because it means you can only loose > ONE monitor before the whole thing seizes up. I'd get 2 more (with DC > S3700, 100 or 200GB will do fine) and spread them among the racks. The plan is to scale out the monitors to have two more, they have not arrived yet but that is in the plan. I agree about the number of monitors. When I talked to Inktank/redhat about this when I was testing the 36 disk storage node cluster. though they said something along the lines of we shouldn't need 2 more until we have a much larger cluster. Just know that two more monitors are indeed on the way and that this is a known issue. > >> {kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0}, >> >> election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8 >> osdmap e15356: 1256 osds: 1256 up, 1256 in >>pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects >> 566 TB used, 4001 TB / 4567 TB avail > That's a lot of objects and data, was your cluster that full when before > it started to have problems? This is due to the rados benches I ran as well as the massive amount of data we are transferring to the current cluster. 
We have 20 pools currently: 1 data,2 rbd,3 .rgw,4 .rgw.root,5 .rgw.control,6 .rgw.gc,7 .rgw.buckets,8 .rgw.buckets.index,9 .log,10 .intent-log,11 .usage,12 .users,13 .users.email,14 .users.swift,15 .users.uid,16 volumes,18 vms,19 .rgw.buckets.extra,20 images, data and rbd will be removed once I am done testing. these pools were my test pools I created. The rest are the standard s3/swift // openstack pools. >> 87560 active+clean > Odd number of PGs, it makes for 71 per OSD, a bit on the low side. OTOH > you're already having scaling issues of sorts, so probably leave it be for > now. How many pools? 20 pools, but we will only have 18 once I delete data and rbd (these were just testing pools to begin with). > >>client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s >> > Is that a typical, idle, steady state example or is this while you're > running bonnie and pushing things into radosgw? I am doing both actually. The downloads into radosgw can't be stopped right now but I can stop the bonnie tests. > >> 2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 >> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 >> MB/s rd, 1090 MB/s wr, 5774 op/s >> 2014-12-19 01:27:29.581955 mon.0 [INF] pgmap
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Hello Christian, Thanks again for all of your help! I started a bonnie test using the following:: bonnie -d /mnt/rbd/scratch2/ -m $(hostname) -f -b Hopefully it completes in the next hour or so. A reboot of the slow OSDs clears the slow marker for now kh10-9$ ceph -w cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec health HEALTH_OK monmap e1: 3 mons at {kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0}, election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8 osdmap e15356: 1256 osds: 1256 up, 1256 in pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects 566 TB used, 4001 TB / 4567 TB avail 87560 active+clean client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s 2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 MB/s rd, 1090 MB/s wr, 5774 op/s 2014-12-19 01:27:29.581955 mon.0 [INF] pgmap v788798: 87560 pgs: 87560 active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 542 MB/s rd, 1548 MB/s wr, 7552 op/s 2014-12-19 01:27:30.638744 mon.0 [INF] pgmap v788799: 87560 pgs: 87560 active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 726 MB/s rd, 2284 MB/s wr, 10451 op/s Once the next slow osd comes up I guess I can tell it to bump it's log up to 5 and see what may be going on. That said I didn't see much last time. On 12/19/2014 12:17 AM, Christian Balzer wrote: Hello, On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote: Wow Christian, Sorry I missed these in line replies. Give me a minute to gather some data. Thanks a million for the in depth responses! No worries. I thought about raiding it but I needed the space unfortunately. I had a 3x60 osd node test cluster that we tried before this and it didn't have this flopping issue or rgw issue I am seeing . I think I remember that... I hope not. I don't think I posted about it at all. I only had it for a short period before it was re purposed. I did post about a cluster before that with 32 osds per node though. That one had tons of issues but now seems to be running relatively smoothly. You do realize that the RAID6 configuration option I mentioned would actually give you MORE space (replication of 2 is sufficient with reliable OSDs) than what you have now? Albeit probably at reduced performance, how much would also depend on the controllers used, but at worst the RAID6 OSD performance would be equivalent to that of single disk. So a Cluster (performance wise) with 21 nodes and 8 disks each. Ah I must have misread, I thought you said raid 10 which would half the storage and a small write penalty. For a raid 6 of 4 drives I would get something like 160 iops (assuming each drive is 75) which may be worth it. I would just hate to have 2+ failures and lose 4-5 drives as opposed to 2 and the rebuild for a raid 6 always left a sour taste in my mouth. Still 4 slow drives is better than 4TB of data over the network slowing down the whole cluster. I knew about the 40 cores being low but I thought at 2.7 we may be fine as the docs recommend 1 X 1G xeons per osd. The cluster hovers around 15-18 CPU but with the constant flipping disks I am seeing it bump up as high as 120 when a disk is marked as out of the cluster. kh10-3$ cat /proc/loadavg 14.35 29.50 66.06 14/109434 724476 No need, now that strange monitor configuration makes sense, you (or whoever spec'ed this) went for the Supermicro Ceph solution, right? indeed. 
In my not so humble opinion, this the worst storage chassis ever designed by a long shot and totally unsuitable for Ceph. I told the Supermicro GM for Japan as much. ^o^ Well it looks like I done goofed. I thought it was odd that they went against most of what ceph documentation says about recommended hardware. I read/heard from them that they worked with intank on this though so I was swayed. Besides that we really needed the density per rack due to limited floor space. As I said in capable hands this cluster would work but by stroke of luck.. Every time a HDD dies, you will have to go and shut down the other OSD that resides on the same tray (and set the cluster to noout). Even worse of course if a SSD should fail. And if somebody should just go and hotswap things w/o that step first, hello data movement storm (2 or 10 OSDs instead of 1 or 5 respectively). Christian Thanks for your help and insight on this! I am going to take a nap and hope the cluster doesn't set fire before I wake up o_o ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Wow Christian, Sorry I missed these inline replies. Give me a minute to gather some data. Thanks a million for the in-depth responses! I thought about raiding it but I needed the space unfortunately. I had a 3x60 osd node test cluster that we tried before this and it didn't have this flopping issue or rgw issue I am seeing. I can quickly answer the case/make questions, the model will need to wait till I get home :) Case is a 72 disk supermicro chassis, I'll grab the exact model in my next reply. Drives are HGST 4TB drives, I'll grab the model once I get home as well. The 300 MB/s figure was completely incorrect and it can push more, it was just meant for a quick comparison but I agree it should be higher. Thank you so much. Please hold up and I'll grab the extra info ^~^ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
thanks! It would be really great in the right hands. Through some stroke of luck it's in mine. The flapping osd is becoming a real issue at this point as it is the only possible lead I have to why the gateways are transferring so slowly. The weird issue is that I can have 8 or 60 transfers going to the radosgw and they are all at roughly 8mbps. To work around this right now I am starting 60+ clients across 10 boxes to get roughly 1gbps per gateway across gw1 and gw2. I heve been staring at logs for hours trying to get a handle at what the issue may be with no luck. The third gateway was made last minute to test and rule out the hardware. On December 18, 2014 10:57:41 PM Christian Balzer wrote: Hello, Nice cluster, I wouldn't mind getting my hand or her ample nacelles, er, wrong movie. ^o^ On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote: > Hello Yall! > > I can't figure out why my gateways are performing so poorly and I am not > sure where to start looking. My RBD mounts seem to be performing fine > (over 300 MB/s) > I wouldn't call 300MB/s writes fine with a cluster of this size. How are you testing this (which tool, settings, from where)? > while uploading a 5G file to Swift/S3 takes 2m32s > (32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing > with nuttcp shows that I can transfer from a client with 10G interface > to any node on the ceph cluster at the full 10G and ceph can transfer > close to 20G between itself. I am not really sure where to start looking > as outside of another issue which I will mention below I am clueless. > I know nuttin about radosgw, but I wouldn't be surprised that the difference you see here is based how that is eventually written to the storage (smaller chunks than what you're using to test RBD performance). > I have a weird setup I'm always interested in monster storage nodes, care to share what case this is? > [osd nodes] > 60 x 4TB 7200 RPM SATA Drives What maker/model? > 12 x 400GB s3700 SSD drives Journals, one assumes. > 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across > the 3 cards) I smell a port-expander or 3 on your backplane. And while making sure that your SSDs get undivided 6Gb/s love would probably help, you still have plenty of bandwidth here (4.5Gb/s per drive), so no real issue. > 512 GB of RAM Sufficient. > 2 x CPU E5-2670 v2 @ 2.50GHz Vastly, and I mean VASTLY insufficient. It would still be 10GHz short of the (optimistic IMHO) recommendation of 1GHz per OSD w/o SSD journals. With SSD journals my experience shows that with certain write patterns even 3.5GHz per OSD isn't sufficient. (there are several threads about this here) > 2 x 10G interfaces LACP bonded for cluster traffic > 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G > ports) > Your journals could handle 5.5GB/s, so you're limiting yourself here a bit, but not too horribly. If I had been given this hardware, I would have RAIDed things (different controller) to keep the number of OSDs per node to something the CPUs (any CPU really!) can handle. Something like 16 x 4HDD RAID10 + SSDs +spares (if possible) for performance and 8 x 8HDD RAID6 + SSDs +spares for capacity. That still gives you 336 or 168 OSDs, allows for a replication size of 2 and as bonus you'll probably never have to deal with a failed OSD. 
^o^ > [monitor nodes and gateway nodes] > 4 x 300G 1500RPM SAS drives in raid 10 I would have used Intel DC S3700s here as well, mons love their leveldb to be fast but > 1 x SAS 2208 combined with this it should be fine. > 64G of RAM > 2 x CPU E5-2630 v2 > 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports) > > > Here is a pastebin dump of my details, I am running ceph giant 0.87 > (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic > across the entire cluster. > > http://pastebin.com/XQ7USGUz -- ceph health detail That looks positively scary, blocked requests for hours... > http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf > http://pastebin.com/BC3gzWhT -- ceph osd tree scroll, scroll, woah! ^o^ > http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log > http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me) > > > We ran into a few issues with density (conntrack limits, pid limit, and > number of open files) all of which I adjusted by bumping the ulimits in > /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any > signs of these limits being hit so I have not included my limits or > sysctl conf. If you like this as well let me know and I can include it. > > One of the issues I am seeing is that OSDs have started to flop/ be > marked as slow. The cluster was HEALTH_OK with all of the d
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Thanks for the reply Gegory, Sorry if this is in the wrong direction or something. Maybe I do not understand To test uploads I either use bash time and either python-swiftclient or boto key.set_contents_from_filename to the radosgw. I was unaware that radosgw had any type of throttle settings in the configuration (I can't seem to find any either). As for rbd mounts I test by creating a 1TB mount and writing a file to it through time+cp or dd. Not the most accurate test but I think should be good enough as a quick functionality test. So for writes, it's more for functionality than performance. I would think a basic functionality test should yield more than 8mb/s though. As for checking admin sockets: I have actually, I set the 3rd gateways debug_civetweb to 10 as well as debug_rgw to 5 but I still do not see anything that stands out. The snippet of the log I pasted has these values set. I did the same for an osd that is marked as slow (1112). All I can see in the log for the osd are ticks and heartbeat responses though, nothing that shows any issues. Finally I did it for the primary monitor node to see if I would see anything there with debug_mon set to 5 (http://pastebin.com/hhnaFac1). I do not really see anything that would stand out as a failure (like a fault or timeout error). What kind of throttler limits do you mean? I didn't/don't see any mention of rgw throttler limits in the ceph.com docs or admin socket just osd/ filesystem throttle like inode/flusher limits, do you mean these? I have not messed with these limits yet on this cluster, do you think it would help? On 12/18/2014 10:24 PM, Gregory Farnum wrote: > What kind of uploads are you performing? How are you testing? > Have you looked at the admin sockets on any daemons yet? Examining the > OSDs to see if they're behaving differently on the different requests > is one angle of attack. The other is look into is if the RGW daemons > are hitting throttler limits or something that the RBD clients aren't. > -Greg > On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan <mailto:seapasu...@uchicago.edu>> wrote: > > Hello Yall! > > I can't figure out why my gateways are performing so poorly and I > am not > sure where to start looking. My RBD mounts seem to be performing fine > (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s > (32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing > with nuttcp shows that I can transfer from a client with 10G interface > to any node on the ceph cluster at the full 10G and ceph can transfer > close to 20G between itself. I am not really sure where to start > looking > as outside of another issue which I will mention below I am clueless. > > I have a weird setup > [osd nodes] > 60 x 4TB 7200 RPM SATA Drives > 12 x 400GB s3700 SSD drives > 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly > across > the 3 cards) > 512 GB of RAM > 2 x CPU E5-2670 v2 @ 2.50GHz > 2 x 10G interfaces LACP bonded for cluster traffic > 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G > ports) > > [monitor nodes and gateway nodes] > 4 x 300G 1500RPM SAS drives in raid 10 > 1 x SAS 2208 > 64G of RAM > 2 x CPU E5-2630 v2 > 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G > ports) > > > Here is a pastebin dump of my details, I am running ceph giant 0.87 > (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel > 3.13.0-40-generic > across the entire cluster. 
> >
> > http://pastebin.com/XQ7USGUz -- ceph health detail
> > http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> > http://pastebin.com/BC3gzWhT -- ceph osd tree
> > http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
> > http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)
> >
> > We ran into a few issues with density (conntrack limits, pid limit, and
> > number of open files) all of which I adjusted by bumping the ulimits in
> > /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
> > signs of these limits being hit so I have not included my limits or
> > sysctl conf. If you'd like this as well let me know and I can include it.
> >
> > One of the issues I am seeing is that OSDs have started to flop / be
> > marked as slow. The cluster was HEALTH_OK with all of the disks added
> > for over 3 weeks before this behaviour started. RBD transfers seem to be
> > fine for the most part, which makes me think that this has little bearing
> > on the gateway issue, but it may be related.
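As mentioned above, this is roughly what I am running against the admin sockets. The daemon names (osd.1112, client.radosgw.rgw03) are the ones from this thread; the rgw socket path is just the stock Ubuntu location, so adjust it if yours differs:

# what the slow osd is actually doing right now / recently
ceph daemon osd.1112 dump_ops_in_flight
ceph daemon osd.1112 dump_historic_ops

# perf/throttle counters on the osd and on the 3rd gateway
ceph daemon osd.1112 perf dump
ceph daemon /var/run/ceph/ceph-client.radosgw.rgw03.asok perf dump

# outstanding rados requests from the gateway's point of view
ceph daemon /var/run/ceph/ceph-client.radosgw.rgw03.asok objecter_requests

Nothing in any of that output jumps out at me so far, but maybe I am looking at the wrong counters.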
[ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Hello Yall!

I can't figure out why my gateways are performing so poorly and I am not sure where to start looking. My RBD mounts seem to be performing fine (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s (32MBps, I believe). If we try a 1G file it's closer to 8MBps. Testing with nuttcp shows that I can transfer from a client with a 10G interface to any node on the ceph cluster at the full 10G, and ceph can transfer close to 20G between itself. I am not really sure where to start looking as, outside of another issue which I will mention below, I am clueless.

I have a weird setup:

[osd nodes]
60 x 4TB 7200 RPM SATA Drives
12 x 400GB s3700 SSD drives
3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across the 3 cards)
512 GB of RAM
2 x CPU E5-2670 v2 @ 2.50GHz
2 x 10G interfaces LACP bonded for cluster traffic
2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G ports)

[monitor nodes and gateway nodes]
4 x 300G 1500RPM SAS drives in raid 10
1 x SAS 2208
64G of RAM
2 x CPU E5-2630 v2
2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)

Here is a pastebin dump of my details, I am running ceph giant 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic across the entire cluster.

http://pastebin.com/XQ7USGUz -- ceph health detail
http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
http://pastebin.com/BC3gzWhT -- ceph osd tree
http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)

We ran into a few issues with density (conntrack limits, pid limit, and number of open files), all of which I adjusted by bumping the ulimits in /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any signs of these limits being hit, so I have not included my limits or sysctl conf. If you'd like these as well let me know and I can include them.

One of the issues I am seeing is that OSDs have started to flop / be marked as slow. The cluster was HEALTH_OK with all of the disks added for over 3 weeks before this behaviour started. RBD transfers seem to be fine for the most part, which makes me think that this has little bearing on the gateway issue, but it may be related. Rebooting the OSD seems to fix this issue. I would like to figure out the root cause of both of these issues and post the results back here if possible (perhaps it can help other people).

I am really looking for a place to start looking, as the gateway just outputs that it is posting data and all of the logs (outside of the monitors reporting down osds) seem to show a fully functioning cluster. Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as well.
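For reference, this is roughly how I am timing the two paths. The auth endpoint, credentials, container, and RBD mount point below are placeholders, so substitute your own:

# make a 5G test file
dd if=/dev/urandom of=/tmp/5G.bin bs=4M count=1280

# time an upload through radosgw with the swift CLI (python-swiftclient)
time swift -A http://rgw03:80/auth/v1.0 -U testacct:testuser -K testkey \
    upload testcontainer /tmp/5G.bin

# time the same amount of data onto the mounted RBD image
time dd if=/tmp/5G.bin of=/mnt/rbd/5G.bin bs=4M oflag=direct

The dd onto the RBD mount finishes in a fraction of the time the swift upload takes, which is the gap I can't explain.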
Re: [ceph-users] ceph health related message
I had this happen to me as well. It turned out to be a connlimit thing for me. I would check dmesg / the kernel log and see if there are any conntrack "table full, dropping packet" messages, then increase connlimit. Odd, as I connected over ssh for this, but I can't deny syslog.
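For what it's worth, this is roughly what I checked and bumped; the 1048576 value is just what I picked, size it to your expected connection count:

# look for dropped connections from conntrack in the kernel log
dmesg | grep -i conntrack
# typically shows: nf_conntrack: table full, dropping packet

# check the current limit and how close you are to it
sysctl net.netfilter.nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count

# raise it now and persist it across reboots
sysctl -w net.netfilter.nf_conntrack_max=1048576
echo 'net.netfilter.nf_conntrack_max = 1048576' >> /etc/sysctl.d/60-conntrack.conf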
[ceph-users] Swift can upload, list, and delete, but not download
So this was working a moment ago, and I was running rados benchmarks as well as swift benchmarks to try to see how my install was doing. Now when I try to download an object I get this read_length error:

http://pastebin.com/R4CW8Cgj

To try to poke at this I wiped all of the .rgw pools, removed the rados gateways, and re-installed them. I am still getting the same error. Here is my config if that helps: http://pastebin.com/q9DRHaQr

Uploads work; downloads fail with the same error (the content_length changes depending on the object):

Error downloading 1G: 'read_length != content_length, 0 != 1073741824'

I have tried removing all of the .rgw pools and re-creating a user, but this has not had any change in behaviour. Any help would be greatly appreciated.
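For completeness, this is roughly how I reproduce it (the auth URL, account, and key are placeholders). The upload and the stat behave, and only the download throws the read_length error:

# upload and stat both succeed and report the expected size
swift -A http://rgw01:80/auth/v1.0 -U testacct:testuser -K testkey upload testcontainer 1G
swift -A http://rgw01:80/auth/v1.0 -U testacct:testuser -K testkey stat testcontainer 1G

# download fails with: read_length != content_length, 0 != 1073741824
swift -A http://rgw01:80/auth/v1.0 -U testacct:testuser -K testkey download testcontainer 1G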
[ceph-users] Ceph can't seem to forget
I think I have a split issue, or at least I can't seem to get rid of these objects. How can I tell ceph to forget the objects and revert? How this happened: due to the python 2.7.8/ceph bug, a whole rack of ceph went down (it had ubuntu 14.10, which seemed to have 2.7.8 before 14.04 did). I didn't know what was going on and tried re-installing, which killed the vast majority of the data (about 2/3 of it). The drives are gone and the data on them is lost now. I tried deleting the objects via rados but that didn't seem to work either and just froze there. Any help would be much appreciated. Pastebin data below: http://pastebin.com/HU8yZ1ae

cephuser@host:~/CephPDC$ ceph --version
ceph version 0.82-524-gbf04897 (bf048976f50bd0142f291414ea893ef0f205b51a)

cephuser@host:~/CephPDC$ ceph -s
    cluster 9e0a4a8e-91fa-4643-887a-c7464aa3fd14
     health HEALTH_WARN 2 pgs recovering; 2 pgs stuck unclean; 5 requests are blocked > 32 sec; recovery 478/15386946 objects degraded (0.003%); 23/5128982 unfound (0.000%)
     monmap e9: 5 mons at {kg37-12=10.16.0.124:6789/0,kg37-17=10.16.0.129:6789/0,kg37-23=10.16.0.135:6789/0,kg37-28=10.16.0.140:6789/0,kg37-5=10.16.0.117:6789/0}, election epoch 1450, quorum 0,1,2,3,4 kg37-5,kg37-12,kg37-17,kg37-23,kg37-28
     mdsmap e100: 1/1/1 up {0=kg37-5=up:active}
     osdmap e46061: 245 osds: 245 up, 245 in
      pgmap v3268915: 22560 pgs, 19 pools, 20020 GB data, 5008 kobjects
            61956 GB used, 830 TB / 890 TB avail
            478/15386946 objects degraded (0.003%); 23/5128982 unfound (0.000%)
                22558 active+clean
                    2 active+recovering
  client io 95939 kB/s rd, 80854 B/s wr, 795 op/s

cephuser@host:~/CephPDC$ ceph health detail
HEALTH_WARN 2 pgs recovering; 2 pgs stuck unclean; 5 requests are blocked > 32 sec; 1 osds have slow requests; recovery 478/15386946 objects degraded (0.003%); 23/5128982 unfound (0.000%)
pg 5.f4f is stuck unclean since forever, current state active+recovering, last acting [279,115,78]
pg 5.27f is stuck unclean since forever, current state active+recovering, last acting [213,0,258]
pg 5.f4f is active+recovering, acting [279,115,78], 10 unfound
pg 5.27f is active+recovering, acting [213,0,258], 13 unfound
5 ops are blocked > 67108.9 sec
5 ops are blocked > 67108.9 sec on osd.279
1 osds have slow requests
recovery 478/15386946 objects degraded (0.003%); 23/5128982 unfound (0.000%)

cephuser@host:~/CephPDC$ ceph pg 5.f4f mark_unfound_lost revert
2014-08-06 12:59:42.282672 7f7d4a6fb700  0 -- 10.16.0.117:0/1005129 >> 10.16.64.29:6844/718 pipe(0x7f7d4005c120 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7d4005c3b0).fault
2014-08-06 12:59:51.890574 7f7d4a4f9700  0 -- 10.16.0.117:0/1005129 >> 10.16.64.29:6806/7875 pipe(0x7f7d4005f180 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7d4005fae0).fault
pg has no unfound objects
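For the record, this is roughly the sequence I have been trying. The pg ids come from the health detail above; the osd id is a placeholder for the ones whose drives are physically gone, and marking an osd lost is destructive and irreversible, so this is only what I think the procedure is, not a recommendation:

# see which objects are unfound and which osds the pg is still trying to query
ceph pg 5.f4f list_missing
ceph pg 5.f4f query

# the drives are gone, so tell ceph the dead osds are not coming back
# (substitute the id of each removed osd; this cannot be undone)
ceph osd lost <osd-id> --yes-i-really-mean-it

# then retry reverting the unfound objects
ceph pg 5.f4f mark_unfound_lost revert
ceph pg 5.27f mark_unfound_lost revert

The fault lines in the paste above make me think the revert is stalling because the pg is still waiting on osds that no longer exist, which is why I suspect the 'osd lost' step is what I am missing.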