Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous
Hi,

Back to work, and I'm facing my problem again.

@Alexandre: an AMD Turion, in an HP MicroServer N54L. This server runs OSDs and LXC only, no mon on it.

After rebooting the whole cluster and attempting to add the same disk for a third time:

ceph osd tree
ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 7.47226 root default
-2 3.65898     host jon
 1 2.29999         osd.1          up  1.00000          1.00000
 3 1.35899         osd.3          up  1.00000          1.00000
-3 0.34999     host daenerys
 0 0.34999         osd.0          up  1.00000          1.00000
-4 1.64969     host tyrion
 2 0.44969         osd.2          up  1.00000          1.00000
 4 1.20000         osd.4          up  1.00000          1.00000
-5 1.81360     host jaime
 5 1.81360         osd.5          up  1.00000          1.00000
 6       0         osd.6        down        0          1.00000
 7       0         osd.7        down        0          1.00000
 8       0         osd.8        down        0          1.00000

osd.6, osd.7 and osd.8 are all the same issue, three attempts with the same disk (which isn't faulty). Any clue? I'm going to try soon to create the OSD on this disk in another server.

Thanks.
Best regards

On 26/07/2017 at 15:53, Alexandre DERUMIER wrote:
> Hi Phil,
> It's possible that rocksdb currently has a bug with some old CPUs (old Xeons and some Opterons).
> I have the same behaviour with a new cluster when creating mons: http://tracker.ceph.com/issues/20529
> What is your CPU model?
> [...]
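For reference, each of the three failed attempts appears to have left a stale entry behind (osd.6, osd.7 and osd.8, down with weight 0). A possible cleanup before retrying, assuming those IDs belong only to the failed attempts and hold no data, might be:

  # Remove each stale id from the CRUSH map, delete its auth key, and drop it
  # from the OSD map so the next ceph-disk / pveceph run can reuse the slot.
  for id in 6 7 8; do
      ceph osd crush remove osd.$id
      ceph auth del osd.$id
      ceph osd rm osd.$id
  done
  # On Luminous, "ceph osd purge $id --yes-i-really-mean-it" should combine the three steps.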
Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous
Hi Phil,

It's possible that rocksdb currently has a bug with some old CPUs (old Xeons and some Opterons). I have the same behaviour with a new cluster when creating mons: http://tracker.ceph.com/issues/20529

What is your CPU model?

In your log:

sh[1869]: in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]: ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]: 1: (()+0x9bc562) [0x558561169562]
sh[1869]: 2: (()+0x110c0) [0x7f6d835cb0c0]
sh[1869]: 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5585615788b1]
sh[1869]: 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator > const&, bool)+0x26bc) [0x55856145ca4c]
sh[1869]: 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator > const&, bool, bool, bool)+0x11f) [0x558561423e6f]
sh[1869]: 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std:
sh[1869]: 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, rocksdb:
sh[1869]: 8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5585610af76e]
sh[1869]: 9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5585610b0d27]
sh[1869]: 10: (BlueStore::_open_db(bool)+0x326) [0x55856103c6d6]
sh[1869]: 11: (BlueStore::mkfs()+0x856) [0x55856106d406]
sh[1869]: 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, uuid_d, int)+0x348) [0x558560bc98f8]
sh[1869]: 13: (main()+0xe58) [0x558560b1da78]
sh[1869]: 14: (__libc_start_main()+0xf1) [0x7f6d825802b1]
sh[1869]: 15: (_start()+0x2a) [0x558560ba4dfa]
sh[1869]: 2017-07-16 14:46:00.763521 7f6d85db3c80 -1 *** Caught signal (Illegal instruction) **
sh[1869]: in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]: ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]: 1: (()+0x9bc562) [0x558561169562]

----- Original mail -----
From: "Phil Schwarz" <infol...@schwarz-fr.net>
To: "Udo Lembke" <ulem...@polarzone.de>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Sunday, 16 July 2017 15:04:16
Subject: Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous
[...]
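If the Illegal instruction really does come from RocksDB code built for a newer instruction set than these CPUs provide (as the tracker issue suggests), a quick sanity check on the affected node might be to look at the CPU flags, for example:

  # Old Turion/Opteron parts typically lack SSE4.2, which newer RocksDB/crc32c
  # code paths may assume; list whichever of these extensions the CPU advertises.
  grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse4_2|avx)$'
  # An empty result for sse4_2 would be consistent with the SIGILL in the trace above.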
Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous
On 16/07/2017 at 17:02, Udo Lembke wrote:
> Hi,
> On 16.07.2017 15:04, Phil Schwarz wrote:
>> ...
>> Same result, the OSD is known by the node, but not by the cluster.
>> ...
> Firewall? Or mismatch in /etc/hosts or DNS??
>
> Udo

OK,
- No firewall issue,
- No DNS issue at this point,
- Same procedure followed as with the last node, except for a full cluster update before adding the new node and new OSD.

Only the strange behaviour of the 'pveceph createosd' command, which was shown in the previous mail:

...
systemd[1]: ceph-disk@dev-sdc1.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Ceph disk activation: /dev/sdc1.
systemd[1]: ceph-disk@dev-sdc1.service: Unit entered failed state.
systemd[1]: ceph-disk@dev-sdc1.service: Failed with result 'exit-code'

What consequences should I expect when switching /etc/hosts from public IPs to private IPs? (Apart from a time-travel paradox or a bursting black hole...)

Thanks.
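To see why the activation unit actually dies, rather than just that it exited with status 1, something along these lines (using the device names from the log above) might help:

  # Full journal of the failing activation unit for the data partition
  journalctl -u ceph-disk@dev-sdc1.service --no-pager
  # Re-run the activation by hand with verbose output to see exactly where it stops
  ceph-disk --verbose activate /dev/sdc1
  # Show how ceph-disk classifies the partitions it created on the disk
  ceph-disk list /dev/sdc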
Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous
Hi,

On 16.07.2017 15:04, Phil Schwarz wrote:
> ...
> Same result, the OSD is known by the node, but not by the cluster.
> ...

Firewall? Or mismatch in /etc/hosts or DNS??

Udo
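A couple of quick checks along those lines, with the mon address below being a made-up example, might be:

  # Does the node's own hostname resolve to the address the rest of the cluster expects?
  getent hosts "$(hostname)"
  # Is the mon port reachable from the new node (if netcat is installed)?
  nc -zv 192.168.0.11 6789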
Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous
On 15/07/2017 at 23:09, Udo Lembke wrote:
> Hi,
>
> On 15.07.2017 16:01, Phil Schwarz wrote:
>> Hi,
>> ...
>>
>> While investigating, I wondered about my config:
>> Question relative to the /etc/hosts file:
>> Should I use the private_replication_LAN IPs or the public ones?
> private_replication_LAN!! And the pve-cluster should use another network
> (NICs) if possible.
>
> Udo

OK, thanks Udo.

After investigation, I did:
- set noout on the OSDs
- stopped the CPU-pegging LXC
- checked the cabling
- restarted the whole cluster

Everything went fine!

But when I tried to add a new OSD:

fdisk /dev/sdc    --> deleted the partition table
parted /dev/sdc   --> mklabel msdos (the disk came from a ZFS FreeBSD system)
dd if=/dev/null of=/dev/sdc
ceph-disk zap /dev/sdc
dd if=/dev/zero of=/dev/sdc bs=10M count=1000

and recreated the OSD via the web GUI.

Same result: the OSD is known by the node, but not by the cluster.

The logs seem to show an issue with this BlueStore OSD, have a look at the file.

I'm going to give OSD creation with Filestore a try.

Thanks

pvedaemon[3077]: starting task UPID:varys:7E7D:0004F489:596B5FCE:cephcreateosd:sdc:root@pam:
kernel: [ 3267.263313]  sdc:
systemd[1]: Created slice system-ceph\x2ddisk.slice.
systemd[1]: Starting Ceph disk activation: /dev/sdc2...
sh[1074]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdc2', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True,
sh[1074]: command: Running command: /sbin/init --version
sh[1074]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdc2
sh[1074]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1074]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1074]: main_trigger: trigger /dev/sdc2 parttype cafecafe-9b03-4f30-b4c6-b4b80ceff106 uuid 7a6d7546-b93a-452b-9bbc-f660f9a8416c
sh[1074]: command: Running command: /usr/sbin/ceph-disk --verbose activate-block /dev/sdc2
systemd[1]: Stopped Ceph disk activation: /dev/sdc2.
systemd[1]: Starting Ceph disk activation: /dev/sdc2...
sh[1074]: main_trigger:
sh[1074]: main_trigger: get_dm_uuid: get_dm_uuid /dev/sdc2 uuid path is /sys/dev/block/8:34/dm/uuid
sh[1074]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1074]: command: Running command: /usr/bin/ceph-osd --get-device-fsid /dev/sdc2
sh[1074]: get_space_osd_uuid: Block /dev/sdc2 has OSD UUID ----
sh[1074]: main_activate_space: activate: OSD device not present, not starting, yet
systemd[1]: Stopped Ceph disk activation: /dev/sdc2.
systemd[1]: Starting Ceph disk activation: /dev/sdc2...
sh[1475]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdc2', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True,
sh[1475]: command: Running command: /sbin/init --version
sh[1475]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdc2
sh[1475]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1475]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1475]: main_trigger: trigger /dev/sdc2 parttype cafecafe-9b03-4f30-b4c6-b4b80ceff664 uuid 7a6d7546-b93a-452b-9bbc-f660f9a84664
sh[1475]: command: Running command: /usr/sbin/ceph-disk --verbose activate-block /dev/sdc2
kernel: [ 3291.171474]  sdc: sdc1 sdc2
sh[1475]: main_trigger:
sh[1475]: main_trigger: get_dm_uuid: get_dm_uuid /dev/sdc2 uuid path is /sys/dev/block/8:34/dm/uuid
sh[1475]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1475]: command: Running command: /usr/bin/ceph-osd --get-device-fsid /dev/sdc2
sh[1475]: get_space_osd_uuid: Block /dev/sdc2 has OSD UUID ----
sh[1475]: main_activate_space: activate: OSD device not present, not starting, yet
systemd[1]: Stopped Ceph disk activation: /dev/sdc2.
systemd[1]: Starting Ceph disk activation: /dev/sdc2...
sh[1492]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdc2', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True,
sh[1492]: command: Running command: /sbin/init --version
sh[1492]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdc2
sh[1492]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1492]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1492]: main_trigger: trigger /dev/sdc2 parttype cafecafe-9b03-4f30-b4c6-b4b80ceff664 uuid 7a6d7546-b93a-452b-9bbc-f660f9a84664
sh[1492]: command: Running command: /usr/sbin/ceph-disk --verbose activate-block /dev/sdc2
sh[1492]: main_trigger:
sh[1492]: main_trigger: get_dm_uuid: get_dm_uuid /dev/sdc2 uuid path is /sys/dev/block/8:34/dm/uuid
sh[1492]:
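The repeated "OSD UUID ----" and "OSD device not present, not starting, yet" lines suggest the prepare step never got far enough to label the block partition (which would match the ceph-osd mkfs crash shown elsewhere in this thread). One way to check what, if anything, actually ended up on /dev/sdc2 might be:

  # Dump the BlueStore label that the OSD mkfs step normally writes to the block partition
  ceph-bluestore-tool show-label --dev /dev/sdc2
  # Cross-check the partition GUIDs / type codes that ceph-disk set up on the disk
  sgdisk -i 1 /dev/sdc
  sgdisk -i 2 /dev/sdc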
Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous
Hi,

On 15.07.2017 16:01, Phil Schwarz wrote:
> Hi,
> ...
>
> While investigating, I wondered about my config:
> Question relative to the /etc/hosts file:
> Should I use the private_replication_LAN IPs or the public ones?

private_replication_LAN!! And the pve-cluster should use another network (NICs) if possible.

Udo
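On the public vs. private question more generally: Ceph itself can be told about both networks in ceph.conf rather than relying on /etc/hosts ordering, with the cluster network carrying replication and heartbeat traffic so it doesn't compete with client I/O. A minimal sketch, with the two subnets as purely made-up examples, might look like:

  # /etc/pve/ceph.conf (hypothetical subnets -- adjust to the real LANs)
  [global]
      # network that clients and mons are reached on
      public network = 192.168.0.0/24
      # dedicated OSD replication / heartbeat network
      cluster network = 10.10.10.0/24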
[ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous
Hi,

Short version: I broke my cluster!

Long version, with context:

A 4-node Proxmox cluster; the nodes are all Proxmox 5.05 + Ceph Luminous with Filestore:
- 3× mon + OSD
- 1× LXC + OSD

It was working fine. I added a fifth node (Proxmox + Ceph) today and broke everything.

Though every node can ping each other, the web GUI is full of red-crossed nodes. No LXC container is shown, though they are all up and alive. However, every other Proxmox node is manageable through the web GUI.

In the logs I have tons of the same message on 2 of the 3 mons:
"failed to decode message of type 80 v6: buffer::malformed_input: void pg_history_t::decode(ceph::buffer::list::iterator&) unknown encoding version > 7"

Thanks for your answers.
Best regards

While investigating, I wondered about my config:
Question relative to the /etc/hosts file:
Should I use the private_replication_LAN IPs or the public ones?
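A side note on the decode error: "unknown encoding version > 7" for pg_history_t usually means one daemon is sending a newer on-wire encoding than the receiving mon understands, which can happen when nodes run different (pre-)release builds. A quick way to confirm which builds are actually running might be:

  # Ask every mon and OSD which Ceph build it is running
  ceph tell mon.* version
  ceph tell osd.* version
  # Once everything is on Luminous, "ceph versions" summarises the same information in one call
  ceph versions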