Re: [ceph-users] New Ceph cluster design

2018-03-09 Thread Tristan Le Toullec

Hi,
    same experience here: we had trouble with the OOM killer terminating OSD 
processes on nodes with ten 8 TB disks. After an upgrade to 128 GB of RAM 
these troubles disappeared.


Recommendations on memory aren't overestimated.

Regards,
Tristan


On 09/03/2018 11:31, Eino Tuominen wrote:

On 09/03/2018 12.16, Ján Senko wrote:

I am planning a new Ceph deployment and I have a few questions that I 
could not find good answers to yet.


Our nodes will be using Xeon-D machines with 12 HDDs and 64 GB of RAM each.
Our target is to use 10 TB drives for 120 TB of capacity per node.

We ran into problems with 20 x 6 TB drives and 64 GB of memory, which we 
then increased to 128 GB. In my experience, the recommendation of 1 GB of 
memory per 1 TB of disk space has to be taken seriously.
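
As a rough sanity check of that rule of thumb (back-of-the-envelope only; 
the real headroom needed also depends on recovery and backfill activity):

    # 12 OSDs x 10 TB per node => ~120 GB of RAM for the OSDs alone,
    # before the OS and any co-located daemons:
    echo "$((12 * 10)) GB"   # -> 120 GB
    # 20 OSDs x 6 TB (the setup above) gives the same figure, which is
    # why 64 GB was not enough and 128 GB is comfortable:
    echo "$((20 * 6)) GB"    # -> 120 GB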




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to troubleshoot "heartbeat_check: no reply" in OSD log

2017-12-14 Thread Tristan Le Toullec

Hi Jared,
    did you ever find a solution to your problem? It appears that I 
have the same OSD problem, and tcpdump captures haven't shown me a solution.


All OSD nodes produced logs like

2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.155:6817 osd.46 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)
2017-12-14 11:25:11.756558 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.155:6815 osd.48 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)
2017-12-14 11:25:11.756564 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.156:6805 osd.50 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)


Sometimes the OSD process was shut down and respawned, sometimes it was just shut down.

We are running Ubuntu 14.04 (one node is on 16.04) and Ceph version 10.2.10.
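
For reference, a basic reachability check along the lines of the nc test in 
the quoted thread below (addresses and ports taken from the heartbeat log 
above) would be:

    # from the host running osd.49, towards a peer reporting "no reply":
    nc -vz 172.16.5.155 6817
    nc -vz 172.16.5.155 6815
    # a successful TCP connect here still does not rule out loss on the
    # heartbeat traffic itself, hence the packet captures.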

Thanks
Tristan





On Fri, Jul 28, 2017 at 6:06 AM, Jared Watts wrote:
> I've got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).
>
> ceph status and ceph osd tree output can be found at:
> https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12
>
> In osd.4 log, I see many of these:
>
> 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
> 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> From osd.4, those endpoints look reachable:
>
> # nc -vz 10.32.0.3 6807
> 10.32.0.3 (10.32.0.3:6807) open
> # nc -vz 10.32.0.3 6811
> 10.32.0.3 (10.32.0.3:6811) open
>
> What else can I look at to determine why most of the OSDs cannot
> communicate? http://tracker.ceph.com/issues/16092 indicates this behavior
> is a networking or hardware issue, what else can I check there? I can turn
> on extra logging as needed. Thanks!

Do a packet capture on both machines at the same time and verify the
packets are arriving as expected.
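
For example (a sketch only; the interface name is a placeholder and the 
address/port pair is taken from the log above):

    # run on both the OSD host reporting "no reply" and on the peer,
    # filtering on the heartbeat address/port from the log:
    tcpdump -i eth0 -nn host 10.32.0.3 and port 6807
    # if the heartbeat pings leave one host but never arrive on the other,
    # the problem is in the network path (firewall, MTU, bonding) rather
    # than in Ceph.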


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com