Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-08-29 Thread Phil Schwarz

Hi, back at work, I'm facing my problem again.

@Alexandre: an AMD Turion, in an HP MicroServer N54L.
This server runs OSDs and LXC only; no mon runs on it.

After rebooting the whole cluster and attempting to add the same disk for a third time:


ceph osd tree
ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 7.47226 root default
-2 3.65898     host jon
 1 2.29999         osd.1           up  1.00000          1.00000
 3 1.35899         osd.3           up  1.00000          1.00000
-3 0.34999     host daenerys
 0 0.34999         osd.0           up  1.00000          1.00000
-4 1.64969     host tyrion
 2 0.44969         osd.2           up  1.00000          1.00000
 4 1.20000         osd.4           up  1.00000          1.00000
-5 1.81360     host jaime
 5 1.81360         osd.5           up  1.00000          1.00000
 6       0         osd.6         down        0          1.00000
 7       0         osd.7         down        0          1.00000
 8       0         osd.8         down        0          1.00000

osd.6, osd.7 and osd.8 are three attempts with the same disk (which isn't faulty); each attempt hits the same issue.


Any clue?
I'm soon going to try creating the OSD on this disk in another server.
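
For reference, cleaning out the leftover osd.6/7/8 entries before the next attempt would look roughly like this (a sketch, assuming the default cluster name; repeat for each ID):

# drop the phantom OSD from the CRUSH map, delete its auth key, then remove it from the OSD map
ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm osd.6

On Luminous, 'ceph osd purge osd.6 --yes-i-really-mean-it' should collapse these three steps into one.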

Thanks.

Best regards
On 26/07/2017 at 15:53, Alexandre DERUMIER wrote:

Hi Phil,


It's possible that rocksdb currently has a bug with some old CPUs (old Xeons and some Opterons).
I have the same behaviour with a new cluster when creating mons:
http://tracker.ceph.com/issues/20529

What is your CPU model?
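
A quick way to check on the OSD node (standard Linux tooling, not from the original mail):

grep -m1 'model name' /proc/cpuinfo                     # exact CPU model
grep -o 'sse4_1\|sse4_2\|avx' /proc/cpuinfo | sort -u   # instruction-set extensions that older Opterons often lack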

in your log:

sh[1869]:  in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]:  ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]:  1: (()+0x9bc562) [0x558561169562]
sh[1869]:  2: (()+0x110c0) [0x7f6d835cb0c0]
sh[1869]:  3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5585615788b1]
sh[1869]:  4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator > const&, bool)+0x26bc) [0x55856145ca4c]
sh[1869]:  5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator > const&, bool, bool, bool)+0x11f) [0x558561423e6f]
sh[1869]:  6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std:
sh[1869]:  7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, rocksdb:
sh[1869]:  8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5585610af76e]
sh[1869]:  9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5585610b0d27]
sh[1869]:  10: (BlueStore::_open_db(bool)+0x326) [0x55856103c6d6]
sh[1869]:  11: (BlueStore::mkfs()+0x856) [0x55856106d406]
sh[1869]:  12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, uuid_d, int)+0x348) [0x558560bc98f8]
sh[1869]:  13: (main()+0xe58) [0x558560b1da78]
sh[1869]:  14: (__libc_start_main()+0xf1) [0x7f6d825802b1]
sh[1869]:  15: (_start()+0x2a) [0x558560ba4dfa]
sh[1869]: 2017-07-16 14:46:00.763521 7f6d85db3c80 -1 *** Caught signal (Illegal instruction) **
sh[1869]:  in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]:  ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]:  1: (()+0x9bc562) [0x558561169562]

- Original message -
From: "Phil Schwarz" <infol...@schwarz-fr.net>
To: "Udo Lembke" <ulem...@polarzone.de>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Sunday, 16 July 2017 15:04:16
Subject: Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

On 15/07/2017 at 23:09, Udo Lembke wrote:

Hi,

On 15.07.2017 16:01, Phil Schwarz wrote:

Hi,
...

While investigating, I wondered about my config.
A question about the /etc/hosts file:
should I use the private replication LAN IPs or the public ones?

The private replication LAN!! And the pve-cluster should use another network
(NICs) if possible.

Udo


OK, thanks Udo.

After investigation, I did the following:
- set noout on the OSDs
- stopped the CPU-pegging LXC
- checked the cabling
- restarted the whole cluster

Everything went fine!

But when I tried to add a new OSD:

fdisk /dev/sdc --> Deleted the partition table
parted /dev/sdc --> mklabel msdos (Disk came from a ZFS FreeBSD system)
dd if=/dev/null of=/dev/sdc
ceph-disk zap /dev/sdc
dd if=/dev/zero of=/dev/sdc bs=10M count=1000

And I recreated the OSD via the web GUI.
Same result: the OSD is known by the node, but not by the cluster.

The logs seem to show an issue with this BlueStore OSD; have a look at the attached file.

I'm going to try recreating the OSD using Filestore.
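
A sketch of the manual Filestore route with ceph-disk (assuming /dev/sdc and the default cluster; the Proxmox GUI wraps roughly the same steps, and the exact flags may differ per ceph-disk version):

ceph-disk zap /dev/sdc
ceph-disk prepare --filestore --fs-type xfs /dev/sdc   # partition and format the data partition
ceph-disk activate /dev/sdc1                           # register and start the new OSD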

Thanks




Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-26 Thread Alexandre DERUMIER
Hi Phil,


It's possible that rocksdb currently has a bug with some old CPUs (old Xeons and some Opterons).
I have the same behaviour with a new cluster when creating mons:
http://tracker.ceph.com/issues/20529

What is your CPU model?

in your log: 

sh[1869]:  in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]:  ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]:  1: (()+0x9bc562) [0x558561169562]
sh[1869]:  2: (()+0x110c0) [0x7f6d835cb0c0]
sh[1869]:  3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5585615788b1]
sh[1869]:  4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator > const&, bool)+0x26bc) [0x55856145ca4c]
sh[1869]:  5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator > const&, bool, bool, bool)+0x11f) [0x558561423e6f]
sh[1869]:  6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std:
sh[1869]:  7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, rocksdb:
sh[1869]:  8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5585610af76e]
sh[1869]:  9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5585610b0d27]
sh[1869]:  10: (BlueStore::_open_db(bool)+0x326) [0x55856103c6d6]
sh[1869]:  11: (BlueStore::mkfs()+0x856) [0x55856106d406]
sh[1869]:  12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, uuid_d, int)+0x348) [0x558560bc98f8]
sh[1869]:  13: (main()+0xe58) [0x558560b1da78]
sh[1869]:  14: (__libc_start_main()+0xf1) [0x7f6d825802b1]
sh[1869]:  15: (_start()+0x2a) [0x558560ba4dfa]
sh[1869]: 2017-07-16 14:46:00.763521 7f6d85db3c80 -1 *** Caught signal (Illegal instruction) **
sh[1869]:  in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]:  ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]:  1: (()+0x9bc562) [0x558561169562]

- Original message -
From: "Phil Schwarz" <infol...@schwarz-fr.net>
To: "Udo Lembke" <ulem...@polarzone.de>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Sunday, 16 July 2017 15:04:16
Subject: Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

On 15/07/2017 at 23:09, Udo Lembke wrote:
> Hi,
>
> On 15.07.2017 16:01, Phil Schwarz wrote:
>> Hi,
>> ...
>>
>> While investigating, I wondered about my config.
>> A question about the /etc/hosts file:
>> should I use the private replication LAN IPs or the public ones?
> The private replication LAN!! And the pve-cluster should use another network
> (NICs) if possible.
>
> Udo
>
OK, thanks Udo.

After investigation, I did the following:
- set noout on the OSDs
- stopped the CPU-pegging LXC
- checked the cabling
- restarted the whole cluster

Everything went fine!

But when I tried to add a new OSD:

fdisk /dev/sdc --> Deleted the partition table
parted /dev/sdc --> mklabel msdos (Disk came from a ZFS FreeBSD system)
dd if=/dev/null of=/dev/sdc
ceph-disk zap /dev/sdc
dd if=/dev/zero of=/dev/sdc bs=10M count=1000

And I recreated the OSD via the web GUI.
Same result: the OSD is known by the node, but not by the cluster.

The logs seem to show an issue with this BlueStore OSD; have a look at the attached file.

I'm going to try recreating the OSD using Filestore.

Thanks




Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-16 Thread Phil Schwarz

On 16/07/2017 at 17:02, Udo Lembke wrote:

Hi,

On 16.07.2017 15:04, Phil Schwarz wrote:

...
Same result, the OSD is known by the node, but not by the cluster.
...

Firewall? Or a mismatch in /etc/hosts or DNS??

Udo


OK,
- No firewall,
- No DNS issue at this point.
- Same procedure as with the last node, except for a full cluster update before adding the new node and new OSD.



Only the strange behavior of the 'pveceph createosd' command, which was shown in the previous mail.

...
systemd[1]: ceph-disk@dev-sdc1.service: Main process exited, code=exited, status=1/FAILURE

systemd[1]: Failed to start Ceph disk activation: /dev/sdc1.
systemd[1]: ceph-disk@dev-sdc1.service: Unit entered failed state.
systemd[1]: ceph-disk@dev-sdc1.service: Failed with result 'exit-code'
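
To pull the full error behind that unit failure, something along these lines should work (standard systemd/ceph-disk tooling, not from the original mail):

systemctl status ceph-disk@dev-sdc1.service
journalctl -u ceph-disk@dev-sdc1.service -n 100 --no-pager
/usr/sbin/ceph-disk --verbose activate /dev/sdc1    # re-run the activation by hand to get the full traceback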

What consequences should I expect when switching /etc/hosts from public IPs to private IPs? (Apart from a time travel paradox or a black hole bursting..)


Thanks.


Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-16 Thread Udo Lembke
Hi,

On 16.07.2017 15:04, Phil Schwarz wrote:
> ...
> Same result, the OSD is known by the node, but not by the cluster.
> ...
Firewall? Or a mismatch in /etc/hosts or DNS??

Udo


Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-16 Thread Phil Schwarz

On 15/07/2017 at 23:09, Udo Lembke wrote:

Hi,

On 15.07.2017 16:01, Phil Schwarz wrote:

Hi,
...

While investigating, I wondered about my config.
A question about the /etc/hosts file:
should I use the private replication LAN IPs or the public ones?

The private replication LAN!! And the pve-cluster should use another network
(NICs) if possible.

Udo


OK, thanks Udo.

After investigation, I did the following:
- set noout on the OSDs
- stopped the CPU-pegging LXC
- checked the cabling
- restarted the whole cluster

Everything went fine!

But when I tried to add a new OSD:

fdisk /dev/sdc --> Deleted the partition table
parted /dev/sdc --> mklabel msdos (Disk came from a ZFS FreeBSD system)
dd if=/dev/null of=/dev/sdc
ceph-disk zap /dev/sdc
dd if=/dev/zero  of=/dev/sdc bs=10M count=1000

And I recreated the OSD via the web GUI.
Same result: the OSD is known by the node, but not by the cluster.

The logs seem to show an issue with this BlueStore OSD; have a look at the attached file.

I'm going to try recreating the OSD using Filestore.

Thanks

pvedaemon[3077]:  starting task UPID:varys:7E7D:0004F489:596B5FCE:cephcreateosd:sdc:root@pam:
kernel: [ 3267.263313]  sdc:
systemd[1]: Created slice system-ceph\x2ddisk.slice.
systemd[1]: Starting Ceph disk activation: /dev/sdc2...
sh[1074]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdc2', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True,
sh[1074]: command: Running command: /sbin/init --version
sh[1074]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdc2
sh[1074]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1074]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1074]: main_trigger: trigger /dev/sdc2 parttype cafecafe-9b03-4f30-b4c6-b4b80ceff106 uuid 7a6d7546-b93a-452b-9bbc-f660f9a8416c
sh[1074]: command: Running command: /usr/sbin/ceph-disk --verbose activate-block /dev/sdc2
systemd[1]: Stopped Ceph disk activation: /dev/sdc2.
systemd[1]: Starting Ceph disk activation: /dev/sdc2...
sh[1074]: main_trigger:
sh[1074]: main_trigger: get_dm_uuid: get_dm_uuid /dev/sdc2 uuid path is /sys/dev/block/8:34/dm/uuid
sh[1074]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1074]: command: Running command: /usr/bin/ceph-osd --get-device-fsid /dev/sdc2
sh[1074]: get_space_osd_uuid: Block /dev/sdc2 has OSD UUID ----
sh[1074]: main_activate_space: activate: OSD device not present, not starting, yet
systemd[1]: Stopped Ceph disk activation: /dev/sdc2.
systemd[1]: Starting Ceph disk activation: /dev/sdc2...
sh[1475]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdc2', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True,
sh[1475]: command: Running command: /sbin/init --version
sh[1475]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdc2
sh[1475]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1475]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1475]: main_trigger: trigger /dev/sdc2 parttype cafecafe-9b03-4f30-b4c6-b4b80ceff664 uuid 7a6d7546-b93a-452b-9bbc-f660f9a84664
sh[1475]: command: Running command: /usr/sbin/ceph-disk --verbose activate-block /dev/sdc2
kernel: [ 3291.171474]  sdc: sdc1 sdc2
sh[1475]: main_trigger:
sh[1475]: main_trigger: get_dm_uuid: get_dm_uuid /dev/sdc2 uuid path is /sys/dev/block/8:34/dm/uuid
sh[1475]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1475]: command: Running command: /usr/bin/ceph-osd --get-device-fsid /dev/sdc2
sh[1475]: get_space_osd_uuid: Block /dev/sdc2 has OSD UUID ----
sh[1475]: main_activate_space: activate: OSD device not present, not starting, yet
systemd[1]: Stopped Ceph disk activation: /dev/sdc2.
systemd[1]: Starting Ceph disk activation: /dev/sdc2...
sh[1492]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdc2', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True,
sh[1492]: command: Running command: /sbin/init --version
sh[1492]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdc2
sh[1492]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1492]: command: Running command: /sbin/blkid -o udev -p /dev/sdc2
sh[1492]: main_trigger: trigger /dev/sdc2 parttype cafecafe-9b03-4f30-b4c6-b4b80ceff664 uuid 7a6d7546-b93a-452b-9bbc-f660f9a84664
sh[1492]: command: Running command: /usr/sbin/ceph-disk --verbose activate-block /dev/sdc2
sh[1492]: main_trigger:
sh[1492]: main_trigger: get_dm_uuid: get_dm_uuid /dev/sdc2 uuid path is /sys/dev/block/8:34/dm/uuid
sh[1492]: 
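
For reference, how ceph-disk tagged the partitions it created can be checked with the following (a diagnostic sketch, not part of the original log):

ceph-disk list /dev/sdc       # show how ceph-disk classifies sdc1/sdc2 (data, block, journal...)
sgdisk -p /dev/sdc            # print the GPT partition table and type GUIDs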

Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-15 Thread Udo Lembke
Hi,

On 15.07.2017 16:01, Phil Schwarz wrote:
> Hi,
> ...
>
> While investigating, I wondered about my config.
> A question about the /etc/hosts file:
> should I use the private replication LAN IPs or the public ones?
The private replication LAN!! And the pve-cluster should use another network
(NICs) if possible.
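
A hedged sketch of what that split looks like in ceph.conf (on Proxmox, /etc/pve/ceph.conf; the subnets below are placeholders):

[global]
    public network  = 192.168.1.0/24    # mon / client traffic
    cluster network = 10.10.10.0/24     # OSD replication traffic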

Udo


[ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-15 Thread Phil Schwarz

Hi,

Short version:
I broke my cluster!

Long version, with context:
I have a 4-node Proxmox cluster. The nodes all run Proxmox 5.05 + Ceph Luminous with Filestore:
- 3x mon+OSD
- 1x LXC+OSD

It was working fine. I added a fifth node (Proxmox + Ceph) today and broke everything..

Though every node can ping each other, the web GUI is full of red-crossed nodes. No LXC is shown, though they are up and alive.

However, every other Proxmox node is manageable through the web GUI.

In the logs, I have tons of the same message on 2 of the 3 mons:

"failed to decode message of type 80 v6: buffer::malformed_input: void pg_history_t::decode(ceph::buffer::list::iterator&) unknown encoding version > 7"
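
That kind of decode error typically shows up when daemons of different Ceph versions talk to each other. A quick, hedged way to compare what each node is actually running:

ceph --version                            # package/binary version, run on every node
ceph daemon mon.$(hostname -s) version    # version of the running mon, run on each mon host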


Thanks for your answers.
Best regards

While investigating, I wondered about my config.
A question about the /etc/hosts file:
should I use the private replication LAN IPs or the public ones?