[ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-02-28 Thread Marco Baldini - H.S. Amiata
Hello. I have a little Ceph cluster with 3 nodes, each with 3x 1TB HDD and 1x 240GB SSD. I created this cluster after the Luminous release, so all OSDs are BlueStore. In my CRUSH map I have two rules, one targeting the SSDs and one targeting the HDDs. I have 4 pools, one using the SSD rule and the o…
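[For reference, a rough sketch of how such an SSD/HDD split can be built on Luminous with device classes; the rule and pool names below are illustrative, not taken from the original post:]

    # create one replicated rule per device class
    ceph osd crush rule create-replicated hdd_rule default host hdd
    ceph osd crush rule create-replicated ssd_rule default host ssd
    # point each pool at the matching rule
    ceph osd pool set mypool_hdd crush_rule hdd_rule
    ceph osd pool set mypool_ssd crush_rule ssd_rule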

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-02-28 Thread Paul Emmerich
Hi, this might be http://tracker.ceph.com/issues/22464. Can you check the OSD log file to see whether the reported checksum is 0x6706be76? Paul > On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata wrote: > > Hello > > I have a little Ceph cluster with 3 nodes, each with 3x 1TB HDD and 1x 240GB > S…
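[A quick way to look for that checksum, assuming the default log location /var/log/ceph; adjust the path if your logs go elsewhere:]

    # list OSD logs that mention the checksum from the tracker issue
    grep -l '6706be76' /var/log/ceph/ceph-osd.*.log
    # or print the matching lines with line numbers for context
    grep -n '6706be76' /var/log/ceph/ceph-osd.*.log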

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-02-28 Thread Marco Baldini - H.S. Amiata
Hi, I read the bug tracker issue and it seems a lot like my problem, even if I can't check the reported checksum because it isn't in my logs; perhaps that's because of debug osd = 0/0 in ceph.conf. I just raised the OSD log level with ceph tell osd.* injectargs --debug-osd 5/5. I'll check OSD l…
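[For completeness, a sketch of how to confirm the injected level took effect and how to revert it later; osd.5 is just an example daemon, and the config get has to run on the node hosting that OSD:]

    # raise OSD logging at runtime on all OSDs
    ceph tell osd.* injectargs --debug-osd 5/5
    # confirm the running value on one daemon (run on its host)
    ceph daemon osd.5 config get debug_osd
    # put it back once enough logs have been collected
    ceph tell osd.* injectargs --debug-osd 0/0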

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Marco Baldini - H.S. Amiata
Hi. After some days with debug_osd 5/5 I found [ERR] entries on different days, in different PGs, on different OSDs, on different hosts. This is what I get in the OSD logs: *OSD.5 (host 3)* 2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 16486'51798 (16431'50251,16486'51798] local-…
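[When an [ERR] like this shows up, the inconsistent PG can be inspected before repairing it; a hedged example using PG 9.1c from the log excerpt above:]

    # list PGs currently flagged inconsistent
    ceph health detail
    # show which objects/shards in that PG are damaged
    rados list-inconsistent-obj 9.1c --format=json-pretty
    # then repair as before
    ceph pg repair 9.1c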

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Paul Emmerich
Hi, yeah, the cluster that I'm seeing this on also has only one host that reports that specific checksum. Two other hosts only report the same error that you are seeing. Could you post to the tracker issue that you are also seeing this? Paul 2018-03-05 12:21 GMT+01:00 Marco Baldini - H.S. Amiat

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Vladimir Prokofev
> candidate had a read error speaks for itself - while scrubbing it couldn't read data. I had a similar issue, and it was just the OSD's disk dying - errors and reallocated sectors in SMART; we just replaced the disk. But in your case it seems that the errors are on different OSDs? Are your OSDs all healthy? You can use…
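[One way to check disk health behind an OSD; a sketch assuming BlueStore OSDs and smartmontools installed, with example device names:]

    # find the block device backing osd.5
    ceph osd metadata 5 | grep dev_node
    # check the drive's overall SMART health and full attribute/error report
    smartctl -H /dev/sdd
    smartctl -a /dev/sdd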

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Marco Baldini - H.S. Amiata
Hi, I just posted to the Ceph tracker with my logs and my issue. Let's hope this will be fixed. Thanks. On 05/03/2018 13:36, Paul Emmerich wrote: Hi, yeah, the cluster that I'm seeing this on also has only one host that reports that specific checksum. Two other hosts only report the same…

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Marco Baldini - H.S. Amiata
Hi, and thanks for the reply. The OSDs are all healthy; in fact, after a ceph pg repair the ceph health is back to OK, and in the OSD log I see repair ok, 0 fixed. The SMART data of the 3 OSDs seems fine. *OSD.5* # ceph-disk list | grep osd.5  /dev/sdd1 ceph data, active, cluster ceph, osd.5, block…
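[To see the scrub/repair history that produced the "repair ok, 0 fixed" line, the per-OSD log can be filtered directly; a sketch assuming the default log path on the OSD's host:]

    # recent scrub- and repair-related lines for osd.5
    grep -E 'scrub|repair' /var/log/ceph/ceph-osd.5.log | tail -n 50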

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Vladimir Prokofev
> always solved by ceph pg repair That doesn't necessarily mean that there's no hardware issue. In my case repair also worked fine and returned the cluster to the OK state every time, but in time the faulty disk failed another scrub operation, and this repeated multiple times before we replaced that disk. One…
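[A simple way to catch a disk that is slowly going bad between scrubs is to track the relevant SMART counters over time; a sketch, and attribute names vary by vendor:]

    # run periodically (e.g. from cron) and compare the raw values over time
    smartctl -A /dev/sdd | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'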

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Marco Baldini - H.S. Amiata
Hi, I monitor dmesg on each of the 3 nodes; no hardware issue is reported. And the problem happens with various different OSDs on different nodes, so to me it is clear it's not a hardware problem. Thanks for the reply. On 05/03/2018 21:45, Vladimir Prokofev wrote: > always solved by ceph pg re…
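[For the record, a sketch of the kind of kernel-log check being described; the patterns are illustrative, and a healthy disk should produce no matches:]

    # human-readable timestamps, filtered for typical block/ATA error messages
    dmesg -T | egrep -i 'i/o error|blk_update_request|medium error'
    # or follow the kernel log live on each node
    journalctl -k -f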

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-06 Thread Brad Hubbard
On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata <mbald...@hsamiata.it> wrote: > Hi > > I monitor dmesg on each of the 3 nodes; no hardware issue is reported. And > the problem happens with various different OSDs on different nodes, so to me > it is clear it's not a hardware problem. > If…

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-06 Thread Brad Hubbard
debug_osd that is... :) On Tue, Mar 6, 2018 at 7:10 PM, Brad Hubbard wrote: > > > On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata < > mbald...@hsamiata.it> wrote: > >> Hi >> >> I monitor dmesg in each of the 3 nodes, no hardware issue reported. And >> the problem happens with various
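[If the goal is to catch the error with more context in the logs, one option, a sketch and not necessarily what the truncated message above goes on to suggest, is to raise debug_osd on the OSDs holding an affected PG and then trigger a deep scrub of it manually:]

    # find the acting set of the PG, raise logging on one of its OSDs, then deep-scrub it
    ceph pg map 9.1c
    ceph tell osd.5 injectargs --debug-osd 20/20
    ceph pg deep-scrub 9.1c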