Re: [ceph-users] PG stuck peering after host reboot

Gregory Farnum Wed, 08 Feb 2017 10:30:06 -0800

On Wed, Feb 8, 2017 at 10:25 AM,  <george.vasilaka...@stfc.ac.uk> wrote:
> Hi Greg,
>
>> Yes, "bad crc" indicates that the checksums on an incoming message did
>> not match what was provided — ie, the message got corrupted. You
>> shouldn't try and fix that by playing around with the peering settings
>> as it's not a peering bug.
>> Unless there's a bug in the messaging layer causing this (very
>> unlikely), you have bad hardware or a bad network configuration
>> (people occasionally talk about MTU settings?). Fix that and things
>> will work; don't and the only software tweaks you could apply are more
>> likely to result in lost data than a happy cluster.
>> -Greg
>
>
> I thought of the network initially but I didn't observe packet loss between 
> the two hosts and neither host is having trouble talking to the rest of its 
> peers. It's these two OSDs that can't talk to each other so I figured it's 
> not likely to be a network issue. Network monitoring does show virtually 
> non-existent inbound traffic over those links compared to the other ports on 
> the switch but no other peerings fail.
>
> Is there something you can suggest to do to drill down deeper?


Sadly no. It being a single route is indeed weird and hopefully
somebody with more networking background can suggest a cause. :)
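
Since MTU settings came up above, here is a minimal sketch of one way to
rule an MTU mismatch in or out on just that path. It only builds the
command line for a don't-fragment ping sized to fill a given MTU. It
assumes IPv4, Linux iputils ping (the -M do, -s and -c flags), and a
hypothetical host name "osd-host-b"; none of those come from this thread.

```python
import shlex

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - IPV4_HEADER - ICMP_HEADER


def df_ping_command(host: str, mtu: int, count: int = 3) -> str:
    """Build a ping command that sets the don't-fragment bit (iputils syntax)."""
    size = max_icmp_payload(mtu)
    return f"ping -M do -s {size} -c {count} {shlex.quote(host)}"


if __name__ == "__main__":
    # "osd-host-b" is a placeholder for the peer OSD host's cluster address.
    for mtu in (1500, 9000):
        print(df_ping_command("osd-host-b", mtu))
```

If the command built for the MTU you believe is configured fails between
these two hosts, in either direction, while the 1500-byte one succeeds,
something on that route is not passing full-size frames. A clean result
does not rule out other hardware faults.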

> Also, am I correct in assuming that I can pull one of these OSDs from the
> cluster as a last resort, to cause a remapping to a different OSD, as a
> quick/temp fix to get the cluster serving I/O properly again?

I'd expect so!
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
