Re: [ceph-users] Crash and question

2015-08-03 Thread Christian Balzer

Hello,

On Thu, 30 Jul 2015 11:39:29 +0200 Khalid Ahsein wrote:

> Good morning Christian,
> 
> Thank you for your quick response.
> So I need to upgrade to 64 GB or 96 GB to be safer?
> 
32GB would be sufficient; 64GB will give you read performance benefits
with hot objects (a larger pagecache).
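
If you want a rough view of how much of that RAM would actually be
serving reads, plain free/meminfo on an OSD node is enough (nothing
Ceph-specific here):

# free -h
# grep -E '^(MemTotal|MemFree|Cached)' /proc/meminfo

The "Cached" figure is what the kernel can use to answer hot reads
without touching the disks.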

> And sorry, I thought that 2 monitors was the minimum. We will work to add
> a new host quickly.
> 
Good. I can't really help you with your key problems, though.
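
The only generic suggestion I have (assuming a ceph-deploy managed
cluster with cephx enabled) is to check whether your local client.admin
key still matches what the monitors have:

# cat /etc/ceph/ceph.client.admin.keyring
# ceph auth get client.admin     (on a node where ceph still answers)

If they differ, re-collecting the keyrings from a reachable monitor may
help, e.g.:

# ceph-deploy gatherkeys drt-marco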

> About osd_pool_default_min_size: should I change something for the
> future?
> 
It's fine for your setup; 2 is the norm with a replication of 3.
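
If you want to verify what your pool is actually set to (pool name
"rbd" assumed here, adjust to yours):

# ceph osd pool get rbd size
# ceph osd pool get rbd min_size

With only 2 nodes and a replicated size of 2, setting min_size to 1
would let I/O continue with one node down, at the price of a window
where only a single copy exists:

# ceph osd pool set rbd min_size 1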

Christian
> Thank you again
> K
> 
> > On 30 Jul 2015 at 11:12, Christian Balzer wrote:
> > 
> > 
> > Hello,
> > 
> > On Thu, 30 Jul 2015 10:55:30 +0200 Khalid Ahsein wrote:
> > 
> >> Hello everybody,
> >> 
> >> For 4 months I've been running a Ceph cluster configured with two
> >> monitors:
> >> 
> >> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
> >> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
> >> 
> > Too little RAM, just 2 monitors, just 2 nodes...
> > 
> >> Last night I ran into an issue: the first host crashed.
> >> 
> >> My first question is why, with 1 host down, my whole cluster was down
> >> (ceph status just hung) and all my RBD volumes were stuck with no
> >> possibility to read or write.
> > 
> > Re-read the documentation: you need at least 3 monitors to survive the
> > loss of one (monitor) node.
> > 
> > Your osd_pool_default_min_size would have left you in a usable situation;
> > 2 nodes is really a minimal setup.
> > 
> >> I rebooted the first host, and 2 hours later
> >> the second went down with the same issue (all RBD volumes down and ceph hanging).
> >> 
> >> After reboot, here is ceph status:
> >> 
> >> # ceph status
> >>     cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
> >>      health HEALTH_ERR
> >>             3 pgs inconsistent
> >>             1 pgs peering
> >>             1 pgs stuck inactive
> >>             1 pgs stuck unclean
> >>             36 requests are blocked > 32 sec
> >>             928 scrub errors
> >>             clock skew detected on mon.drt-becks
> >>      monmap e1: 2 mons at {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
> >>             election epoch 26, quorum 0,1 drt-marco,drt-becks
> >>      osdmap e961: 24 osds: 24 up, 24 in
> >>       pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
> >>             1039 GB used, 88092 GB / 89177 GB avail
> >>                  393 active+clean
> >>                    3 active+clean+scrubbing+deep
> >>                    3 active+clean+inconsistent
> >>                    1 peering
> >>   client io 57290 B/s wr, 7 op/s
> >> 
> > You will want to:
> > a) fix your NTP, clock skew.
> > b) check your logs about the scrub errors
> > c) same for the stuck requests
> > 
> >> I also found this error in dmesg about the crash:
> >> 
> >> Message from syslogd@drt-marco at Jul 30 04:03:57 ...
> >> kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s! [btrfs-cleaner:32713]
> >> 
> >> All my volumes are on BTRFS; maybe that was not a good idea?
> >> 
> > Depending on your OS and kernel version, most definitely.
> > Plenty of BTRFS problems are to be found in the ML archives.
> > 
> > Christian
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Crash and question

2015-07-30 Thread Khalid Ahsein
Hi,

I tried to add a new monitor, but now I am unable to use the ceph command.

After running ceph-deploy mon create myhostname, I get:
# ceph status
2015-07-30 10:42:39.682038 7f7b16d90700  0 librados: client.admin authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError

Could you help me fix this and explain how to change the keys, please?

Thanks a lot in advance, and sorry, I'm a newbie on this topic.
K

> On 30 Jul 2015 at 11:39, Khalid Ahsein wrote:
> 
> Good morning Christian,
> 
> Thank you for your quick response.
> So I need to upgrade to 64 GB or 96 GB to be safer?
> 
> And sorry, I thought that 2 monitors was the minimum. We will work to add a new
> host quickly.
> 
> About osd_pool_default_min_size: should I change something for the future?
> 
> Thank you again
> K
> 
>> On 30 Jul 2015 at 11:12, Christian Balzer wrote:
>> 
>> 
>> Hello,
>> 
>> On Thu, 30 Jul 2015 10:55:30 +0200 Khalid Ahsein wrote:
>> 
>>> Hello everybody,
>>> 
>>> For 4 months I've been running a Ceph cluster configured with two monitors:
>>> 
>>> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
>>> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
>>> 
>> Too little RAM, just 2 monitors, just 2 nodes...
>> 
>>> Last night I ran into an issue: the first host crashed.
>>> 
>>> My first question is why, with 1 host down, my whole cluster was down
>>> (ceph status just hung) and all my RBD volumes were stuck with no
>>> possibility to read or write.
>> 
>> Re-read the documentation: you need at least 3 monitors to survive the
>> loss of one (monitor) node.
>> 
>> Your osd_pool_default_min_size would have left you in a usable situation;
>> 2 nodes is really a minimal setup.
>> 
>>> I rebooted the first host, and 2 hours later
>>> the second went down with the same issue (all RBD volumes down and ceph hanging).
>>> 
>>> After reboot, here is ceph status:
>>> 
>>> # ceph status
>>>     cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
>>>      health HEALTH_ERR
>>>             3 pgs inconsistent
>>>             1 pgs peering
>>>             1 pgs stuck inactive
>>>             1 pgs stuck unclean
>>>             36 requests are blocked > 32 sec
>>>             928 scrub errors
>>>             clock skew detected on mon.drt-becks
>>>      monmap e1: 2 mons at {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
>>>             election epoch 26, quorum 0,1 drt-marco,drt-becks
>>>      osdmap e961: 24 osds: 24 up, 24 in
>>>       pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
>>>             1039 GB used, 88092 GB / 89177 GB avail
>>>                  393 active+clean
>>>                    3 active+clean+scrubbing+deep
>>>                    3 active+clean+inconsistent
>>>                    1 peering
>>>   client io 57290 B/s wr, 7 op/s
>>> 
>> You will want to:
>> a) fix your NTP, clock skew.
>> b) check your logs about the scrub errors
>> c) same for the stuck requests
>> 
>>> I also found this error in dmesg about the crash:
>>> 
>>> Message from syslogd@drt-marco at Jul 30 04:03:57 ...
>>> kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s! [btrfs-cleaner:32713]
>>> 
>>> All my volumes are on BTRFS; maybe that was not a good idea?
>>> 
>> Depending on your OS and kernel version, most definitely.
>> Plenty of BTRFS problems are to be found in the ML archives.
>> 
>> Christian
>> 
>> -- 
>> Christian Balzer        Network/Systems Engineer
>> ch...@gol.com           Global OnLine Japan/Fusion Communications
>> http://www.gol.com/



Re: [ceph-users] Crash and question

2015-07-30 Thread Khalid Ahsein
Good morning Christian,

Thank you for your quick response.
So I need to upgrade to 64 GB or 96 GB to be safer?

And sorry, I thought that 2 monitors was the minimum. We will work to add a new
host quickly.

About osd_pool_default_min_size: should I change something for the future?

Thank you again
K

> On 30 Jul 2015 at 11:12, Christian Balzer wrote:
> 
> 
> Hello,
> 
> On Thu, 30 Jul 2015 10:55:30 +0200 Khalid Ahsein wrote:
> 
>> Hello everybody,
>> 
>> For 4 months I've been running a Ceph cluster configured with two monitors:
>> 
>> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
>> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
>> 
> Too little RAM, just 2 monitors, just 2 nodes...
> 
>> Last night I ran into an issue: the first host crashed.
>> 
>> My first question is why, with 1 host down, my whole cluster was down
>> (ceph status just hung) and all my RBD volumes were stuck with no
>> possibility to read or write.
> 
> Re-read the documentation: you need at least 3 monitors to survive the
> loss of one (monitor) node.
> 
> Your osd_pool_default_min_size would have left you in a usable situation;
> 2 nodes is really a minimal setup.
> 
>> I rebooted the first host, and 2 hours later
>> the second went down with the same issue (all RBD volumes down and ceph hanging).
>> 
>> After reboot, here is ceph status:
>> 
>> # ceph status
>>     cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
>>      health HEALTH_ERR
>>             3 pgs inconsistent
>>             1 pgs peering
>>             1 pgs stuck inactive
>>             1 pgs stuck unclean
>>             36 requests are blocked > 32 sec
>>             928 scrub errors
>>             clock skew detected on mon.drt-becks
>>      monmap e1: 2 mons at {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
>>             election epoch 26, quorum 0,1 drt-marco,drt-becks
>>      osdmap e961: 24 osds: 24 up, 24 in
>>       pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
>>             1039 GB used, 88092 GB / 89177 GB avail
>>                  393 active+clean
>>                    3 active+clean+scrubbing+deep
>>                    3 active+clean+inconsistent
>>                    1 peering
>>   client io 57290 B/s wr, 7 op/s
>> 
> You will want to:
> a) fix your NTP, clock skew.
> b) check your logs about the scrub errors
> c) same for the stuck requests
> 
>> I also found this error in dmesg about the crash:
>> 
>> Message from syslogd@drt-marco at Jul 30 04:03:57 ...
>> kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s!
>> [btrfs-cleaner:32713]
>> 
>> All my volumes are on BTRFS; maybe that was not a good idea?
>> 
> Depending on your OS and kernel version, most definitely.
> Plenty of BTRFS problems are to be found in the ML archives.
> 
> Christian
> 
> -- 
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/


Re: [ceph-users] Crash and question

2015-07-30 Thread Christian Balzer

Hello,

On Thu, 30 Jul 2015 10:55:30 +0200 Khalid Ahsein wrote:

> Hello everybody,
> 
> For 4 months I've been running a Ceph cluster configured with two monitors:
> 
> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
> 
Too little RAM, just 2 monitors, just 2 nodes...

> Last night I ran into an issue: the first host crashed.
> 
> My first question is why, with 1 host down, my whole cluster was down
> (ceph status just hung) and all my RBD volumes were stuck with no
> possibility to read or write.

Re-read the documentation: you need at least 3 monitors to survive the
loss of one (monitor) node.
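
As a quick sanity check on the quorum math (the monitors form a Paxos
quorum, so a majority of them must be up):

  needed for quorum = floor(N/2) + 1
  N=2 -> 2 needed: any single mon down and the cluster hangs (your case)
  N=3 -> 2 needed: survives the loss of one mon
  N=5 -> 3 needed: survives the loss of two mons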

Your osd_pool_default_min_size would have left you in a usable situation;
2 nodes is really a minimal setup.

> I rebooted the first host, and 2 hours later
> the second went down with the same issue (all RBD volumes down and ceph hanging).
> 
> After reboot, here is ceph status:
> 
> # ceph status
>     cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
>      health HEALTH_ERR
>             3 pgs inconsistent
>             1 pgs peering
>             1 pgs stuck inactive
>             1 pgs stuck unclean
>             36 requests are blocked > 32 sec
>             928 scrub errors
>             clock skew detected on mon.drt-becks
>      monmap e1: 2 mons at {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
>             election epoch 26, quorum 0,1 drt-marco,drt-becks
>      osdmap e961: 24 osds: 24 up, 24 in
>       pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
>             1039 GB used, 88092 GB / 89177 GB avail
>                  393 active+clean
>                    3 active+clean+scrubbing+deep
>                    3 active+clean+inconsistent
>                    1 peering
>   client io 57290 B/s wr, 7 op/s
> 
You will want to:
a) fix your NTP, clock skew.
b) check your logs about the scrub errors
c) same for the stuck requests
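
For a), b) and c), a few starting points (log paths assume a default
installation, and the pgid is a placeholder):

# ntpq -p                                  (are the mons actually syncing?)
# ceph health detail                       (lists the inconsistent/stuck PGs)
# grep ERR /var/log/ceph/ceph-osd.*.log    (details on the 928 scrub errors)
# ceph pg repair <pgid>                    (only once the cause is understood)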

> I also found this error in dmesg about the crash:
> 
> Message from syslogd@drt-marco at Jul 30 04:03:57 ...
>  kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s! [btrfs-cleaner:32713]
> 
> All my volumes are on BTRFS; maybe that was not a good idea?
> 
Depending on your OS and kernel version, most definitely.
Plenty of BTRFS problems are to be found in the ML archives.
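
When you search those archives, it helps to know exactly what you are
running, e.g.:

# uname -r
# mount | grep btrfs

Older kernels in particular had btrfs-cleaner and snapshot trouble; a
lot of people ended up redeploying their OSDs on XFS instead.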

Christian

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/


[ceph-users] Crash and question

2015-07-30 Thread Khalid Ahsein
Hello everybody,

For 4 months I've been running a Ceph cluster configured with two monitors:

1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system
1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for system

Last night I ran into an issue: the first host crashed.

My first question is why, with 1 host down, my whole cluster was down
(ceph status just hung) and all my RBD volumes were stuck with no
possibility to read or write.
I rebooted the first host, and 2 hours later the second went down with the
same issue (all RBD volumes down and ceph hanging).

After reboot, here is ceph status:

# ceph status
    cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
     health HEALTH_ERR
            3 pgs inconsistent
            1 pgs peering
            1 pgs stuck inactive
            1 pgs stuck unclean
            36 requests are blocked > 32 sec
            928 scrub errors
            clock skew detected on mon.drt-becks
     monmap e1: 2 mons at {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
            election epoch 26, quorum 0,1 drt-marco,drt-becks
     osdmap e961: 24 osds: 24 up, 24 in
      pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
            1039 GB used, 88092 GB / 89177 GB avail
                 393 active+clean
                   3 active+clean+scrubbing+deep
                   3 active+clean+inconsistent
                   1 peering
  client io 57290 B/s wr, 7 op/s

I also found this error in dmesg about the crash:

Message from syslogd@drt-marco at Jul 30 04:03:57 ...
 kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s! [btrfs-cleaner:32713]

All my volumes are on BTRFS; maybe that was not a good idea?

Thanks a lot for your help; more hardware information is at the bottom.

K

# cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Xeon(R) CPU   E5506  @ 2.13GHz
stepping: 5
microcode   : 0x19
cpu MHz : 2133.433
cache size  : 4096 KB
physical id : 1
siblings: 4
core id : 0
cpu cores   : 4
apicid  : 16
initial apicid  : 16
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm 
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca 
sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips: 4266.86
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Xeon(R) CPU   E5506  @ 2.13GHz
stepping: 5
microcode   : 0x19
cpu MHz : 2133.433
cache size  : 4096 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 4
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm 
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca 
sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips: 4266.74
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Xeon(R) CPU   E5506  @ 2.13GHz
stepping: 5
microcode   : 0x19
cpu MHz : 2133.433
cache size  : 4096 KB
physical id : 1
siblings: 4
core id : 1
cpu cores   : 4
apicid  : 18
initial apicid  : 18
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm 
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca 
sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips: 4266.86
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Xeon(R) CPU   E5506  @ 2.13GHz
stepping: 5
microcode   : 0x19
cpu MHz : 2133.433
cache size  : 4096 KB