Hi

I monitor dmesg on each of the 3 nodes, and no hardware issue is reported. The problem also happens with various OSDs on different nodes, so to me it is clear that it's not a hardware problem.
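
For reference, this is roughly what I check on each node (just a sketch; the exact patterns depend on the controller and driver):

# kernel log since boot, with human-readable timestamps
dmesg -T | egrep -i 'i/o error|medium error|ata[0-9].*error|blk_update_request'
# same check limited to last night's scrub window
journalctl -k --since yesterday | egrep -i 'i/o error|medium error'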

Thanks for the reply



On 05/03/2018 21:45, Vladimir Prokofev wrote:
> always solved by ceph pg repair <PG>
That doesn't necessarily mean that there's no hardware issue. In my case repair also worked fine and returned the cluster to the OK state every time, but over time the faulty disk failed another scrub operation, and this repeated several times before we replaced that disk. One last thing to look into is dmesg on your OSD nodes. If there's a hardware read error it will be logged in dmesg.

2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. Amiata <mbald...@hsamiata.it <mailto:mbald...@hsamiata.it>>:

    Hi, and thanks for the reply

    The OSDs are all healthy; in fact, after a ceph pg repair <PG> the
    ceph health is back to OK and in the OSD log I see "<PG> repair ok, 0 fixed".

    The SMART data of the 3 OSDs seems fine
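
    (To keep the noise down, I usually pull just the media-error counters; a quick sketch, run on each node for the disks backing its OSDs:)

    # only the attributes that matter for media health
    for dev in /dev/sd?; do
        echo "== $dev =="
        smartctl -A "$dev" | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Reported_Uncorrect'
    done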

    *OSD.5*

    # ceph-disk list | grep osd.5
      /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2

    # smartctl -a /dev/sdd
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.14 (AF)
    Device Model:     ST1000DM003-1SB10C
    Serial Number:    Z9A1MA1V
    LU WWN Device Id: 5 000c50 090c7028b
    Firmware Version: CC43
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Mon Mar  5 16:17:22 2018 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82)     Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)     The previous self-test routine 
completed
                                        without error or no self-test has ever
                                        been run.
    Total time to complete Offline
    data collection:            (    0) seconds.
    Offline data collection
    capabilities:                        (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off 
support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
    SMART capabilities:            (0x0003)     Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
    Error logging capability:        (0x01)     Error logging supported.
                                        General Purpose Logging supported.
    Short self-test routine
    recommended polling time:    (   1) minutes.
    Extended self-test routine
    recommended polling time:    ( 109) minutes.
    Conveyance self-test routine
    recommended polling time:    (   2) minutes.
    SCT capabilities:          (0x1085) SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   082   063   006    Pre-fail  Always       -       193297722
      3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       60
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1451132477
      9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13283
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       61
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
    189 High_Fly_Writes         0x003a   086   086   000    Old_age   Always       -       14
    190 Airflow_Temperature_Cel 0x0022   071   055   040    Old_age   Always       -       29 (Min/Max 23/32)
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       607
    194 Temperature_Celsius     0x0022   029   014   000    Old_age   Always       -       29 (0 14 0 0 0)
    195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       193297722
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       13211h+23m+08.363s
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       53042120064
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       170788993187


    *OSD.4*

    # ceph-disk list | grep osd.4
      /dev/sdc1 ceph data, active, cluster ceph, osd.4, block /dev/sdc2

    # smartctl -a /dev/sdc
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.14 (AF)
    Device Model:     ST1000DM003-1SB10C
    Serial Number:    Z9A1M1BW
    LU WWN Device Id: 5 000c50 090c78d27
    Firmware Version: CC43
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Mon Mar  5 16:20:46 2018 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82)     Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)     The previous self-test routine 
completed
                                        without error or no self-test has ever
                                        been run.
    Total time to complete Offline
    data collection:            (    0) seconds.
    Offline data collection
    capabilities:                        (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off 
support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
    SMART capabilities:            (0x0003)     Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
    Error logging capability:        (0x01)     Error logging supported.
                                        General Purpose Logging supported.
    Short self-test routine
    recommended polling time:    (   1) minutes.
    Extended self-test routine
    recommended polling time:    ( 109) minutes.
    Conveyance self-test routine
    recommended polling time:    (   2) minutes.
    SCT capabilities:          (0x1085) SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   082   063   006    Pre-fail  Always       -       194906537
      3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       64
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1485899434
      9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13390
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       65
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
    189 High_Fly_Writes         0x003a   095   095   000    Old_age   Always       -       5
    190 Airflow_Temperature_Cel 0x0022   074   051   040    Old_age   Always       -       26 (Min/Max 19/29)
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       616
    194 Temperature_Celsius     0x0022   026   014   000    Old_age   Always       -       26 (0 14 0 0 0)
    195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       194906537
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       13315h+20m+30.974s
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       52137467719
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       177227508503



    *OSD.8*

    # ceph-disk list | grep osd.8
      /dev/sda1 ceph data, active, cluster ceph, osd.8, block /dev/sda2

    # smartctl -a /dev/sda
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.14 (AF)
    Device Model:     ST1000DM003-1SB10C
    Serial Number:    Z9A2BEF2
    LU WWN Device Id: 5 000c50 0910f5427
    Firmware Version: CC43
    User Capacity:    1,000,203,804,160 bytes [1.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Mon Mar  5 16:22:47 2018 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82)     Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)     The previous self-test routine 
completed
                                        without error or no self-test has ever
                                        been run.
    Total time to complete Offline
    data collection:            (    0) seconds.
    Offline data collection
    capabilities:                        (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off 
support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
    SMART capabilities:            (0x0003)     Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
    Error logging capability:        (0x01)     Error logging supported.
                                        General Purpose Logging supported.
    Short self-test routine
    recommended polling time:    (   1) minutes.
    Extended self-test routine
    recommended polling time:    ( 110) minutes.
    Conveyance self-test routine
    recommended polling time:    (   2) minutes.
    SCT capabilities:          (0x1085) SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   083   063   006    Pre-fail  Always       -       224621855
      3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       275
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   081   060   045    Pre-fail  Always       -       149383284
      9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6210
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       265
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
    189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
    190 Airflow_Temperature_Cel 0x0022   069   058   040    Old_age   Always       -       31 (Min/Max 21/35)
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       516
    194 Temperature_Celsius     0x0022   031   017   000    Old_age   Always       -       31 (0 17 0 0 0)
    195 Hardware_ECC_Recovered  0x001a   005   001   000    Old_age   Always       -       224621855
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6154h+03m+35.126s
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       24333847321
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       50261005553


    However, these 3 OSDs are not the only ones with PG errors; they are
    only the most recent. In the last 3 months I have often had
    OSD_SCRUB_ERRORS on various OSDs, always solved by ceph pg repair
    <PG>, so I don't think it's a hardware issue.





    On 05/03/2018 13:40, Vladimir Prokofev wrote:
    > candidate had a read error
    speaks for itself - while scrubbing it couldn't read the data.
    I had a similar issue, and it was just an OSD dying - errors and
    reallocated sectors in SMART, so I just replaced the disk. But in
    your case it seems that the errors are on different OSDs? Are your
    OSDs all healthy?
    You can use this command to see some details:
    rados list-inconsistent-obj <pg.id> --format=json-pretty
    pg.id is the PG that is reporting as inconsistent.
    My guess is that you'll see read errors in this output, with the
    OSD number that encountered the error. After that you have to check
    that OSD's health - SMART details, etc.
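
    (A rough way to pull only the failing shards out of that JSON, assuming jq is available; the field names here are from memory, so adjust to the actual output:)

    rados list-inconsistent-obj 13.65 --format=json-pretty \
        | jq '.inconsistents[] | {object: .object.name, shards: [.shards[] | select((.errors | length) > 0) | {osd, errors}]}'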
    It's not always the disk itself that causes problems - for
    example, we had read errors because of a faulty backplane
    interface in a server; replacing the chassis resolved the issue.


    2018-03-05 14:21 GMT+03:00 Marco Baldini - H.S. Amiata
    <mbald...@hsamiata.it <mailto:mbald...@hsamiata.it>>:

        Hi

        After some days with debug_osd 5/5 I found [ERR] entries on
        different days, in different PGs, on different OSDs, on
        different hosts. This is what I get in the OSD logs:

        *OSD.5 (host 3)*
        2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 16486'51798 (16431'50251,16486'51798] local-lis/les=16474/16475 n=3629 ec=1477/1477 lis/c 16474/16474 les/c/f 16475/16477/0 16474/16474/16474) [5,6] r=0 lpr=16474 crt=16486'51798 lcod 16486'51797 mlcod 16486'51797 active+clean+scrubbing+deep] 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error
        2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error

        *OSD.4 (host 3)*
        2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 13.65 shard 2: soid 13:a719ecdf:::rbd_data.5f65056b8b4567.000000000000f8eb:head candidate had a read error

        *OSD.8 (host 2)*
        2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 14.31 shard 1: soid 14:8cc6cd37:::rbd_data.30b15b6b8b4567.00000000000081a1:head candidate had a read error

        I don't know what this error means, and as always a ceph
        pg repair fixes it. I don't think this is normal.
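
        (To see whether a single OSD keeps showing up, I also grep all the OSD logs on each node; just a sketch:)

        # count of scrub read errors per OSD log, run on every node
        grep -c "candidate had a read error" /var/log/ceph/ceph-osd.*.log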

        Ideas?

        Thanks


        On 28/02/2018 14:48, Marco Baldini - H.S. Amiata wrote:

        Hi

        I read the bug tracker issue and it looks a lot like my
        problem, even if I can't check the reported checksum because
        I don't have it in my logs; perhaps that's because of debug
        osd = 0/0 in ceph.conf.

        I just raised the OSD log level

        ceph tell osd.* injectargs --debug-osd 5/5

        I'll check OSD logs in the next days...
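
        (Since injectargs only changes the running daemons, I checked on one node that the new level took effect, roughly like this:)

        # run on the node hosting the OSD; should report 5/5
        ceph daemon osd.4 config get debug_osd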

        Thanks



        On 28/02/2018 11:59, Paul Emmerich wrote:
        Hi,

        might be http://tracker.ceph.com/issues/22464

        Can you check the OSD log file to see if the reported
        checksum is 0x6706be76?


        Paul

        On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata
        <mbald...@hsamiata.it> wrote:

        Hello

        I have a small Ceph cluster with 3 nodes, each with 3x1TB
        HDD and 1x240GB SSD. I created this cluster after the Luminous
        release, so all OSDs are Bluestore. In my crush map I have
        two rules, one targeting the SSDs and one targeting the
        HDDs. I have 4 pools, one using the SSD rule and the
        others using the HDD rule; three pools are size=3
        min_size=2, one is size=2 min_size=1 (this one holds
        content that is OK to lose).

        In the last 3 months I have been having a strange random
        problem. I scheduled my OSD scrubs during the night (osd scrub
        begin hour = 20, osd scrub end hour = 7) when the office is
        closed, so there is low impact on the users. Some mornings,
        when I check the cluster health, I find:

        HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
        OSD_SCRUB_ERRORS X scrub errors
        PG_DAMAGED Possible data damage: Y pg inconsistent

        X and Y sometimes are 1, sometimes 2.
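
        (To confirm that the deep scrubs really run inside that window, I sometimes look at a PG's last scrub timestamps; a sketch, with the field names from memory:)

        # prints last_scrub_stamp / last_deep_scrub_stamp for the PG
        ceph pg <PG> query | grep scrub_stamp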

        I issue ceph health detail, check the damaged PGs, and
        run ceph pg repair for each damaged PG; I get

        instructing pg PG on osd.N to repair

        The PGs are different, the OSDs that have to repair them are
        different, even the nodes hosting the OSDs are different; I made
        a list of all PGs and OSDs. This morning is the most recent case:

        > ceph health detail
        HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
        OSD_SCRUB_ERRORS 2 scrub errors
        PG_DAMAGED Possible data damage: 2 pgs inconsistent
        pg 13.65 is active+clean+inconsistent, acting [4,2,6]
        pg 14.31 is active+clean+inconsistent, acting [8,3,1]
        > ceph pg repair 13.65
        instructing pg 13.65 on osd.4 to repair

        (node-2)> tail /var/log/ceph/ceph-osd.4.log
        2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair starts
        2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair ok, 0 fixed
        > ceph pg repair 14.31
        instructing pg 14.31 on osd.8 to repair

        (node-3)> tail /var/log/ceph/ceph-osd.8.log
        2018-02-28 08:52:37.297490 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair starts
        2018-02-28 08:53:00.704020 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair ok, 0 fixed
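
        (When more than one PG is inconsistent I repair them in one go; a small sketch that just parses the ceph health detail output shown above:)

        # extract PG ids from the "pg X.Y is ...inconsistent" lines and repair each
        ceph health detail | awk '$1 == "pg" && /inconsistent/ {print $2}' | xargs -r -n1 ceph pg repair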


        I made a list of when I got OSD_SCRUB_ERRORS, which PG was
        affected, and which OSD had to repair it. Dates are dd/mm/yyyy.

        21/12/2017   --  pg 14.29 is active+clean+inconsistent, acting [6,2,4]

        18/01/2018   --  pg 14.5a is active+clean+inconsistent, acting [6,4,1]

        22/01/2018   --  pg 9.3a is active+clean+inconsistent, acting [2,7]

        29/01/2018   --  pg 13.3e is active+clean+inconsistent, acting [4,6,1]
                          instructing pg 13.3e on osd.4 to repair

        07/02/2018   --  pg 13.7e is active+clean+inconsistent, acting [8,2,5]
                          instructing pg 13.7e on osd.8 to repair

        09/02/2018   --  pg 13.30 is active+clean+inconsistent, acting [7,3,2]
                          instructing pg 13.30 on osd.7 to repair

        15/02/2018   --  pg 9.35 is active+clean+inconsistent, acting [1,8]
                          instructing pg 9.35 on osd.1 to repair

                          pg 13.3e is active+clean+inconsistent, acting [4,6,1]
                          instructing pg 13.3e on osd.4 to repair

        17/02/2018   --  pg 9.2d is active+clean+inconsistent, acting [7,5]
                          instructing pg 9.2d on osd.7 to repair

        22/02/2018   --  pg 9.24 is active+clean+inconsistent, acting [5,8]
                          instructing pg 9.24 on osd.5 to repair

        28/02/2018   --  pg 13.65 is active+clean+inconsistent, acting [4,2,6]
                          instructing pg 13.65 on osd.4 to repair

                          pg 14.31 is active+clean+inconsistent, acting [8,3,1]
                          instructing pg 14.31 on osd.8 to repair



        If it can be useful, my ceph.conf is here:

        [global]
        auth client required = none
        auth cluster required = none
        auth service required = none
        fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
        cluster network = 10.10.10.0/24
        public network = 10.10.10.0/24
        keyring = /etc/pve/priv/$cluster.$name.keyring
        mon allow pool delete = true
        osd journal size = 5120
        osd pool default min size = 2
        osd pool default size = 3
        bluestore_block_db_size = 64424509440

        debug asok = 0/0
        debug auth = 0/0
        debug buffer = 0/0
        debug client = 0/0
        debug context = 0/0
        debug crush = 0/0
        debug filer = 0/0
        debug filestore = 0/0
        debug finisher = 0/0
        debug heartbeatmap = 0/0
        debug journal = 0/0
        debug journaler = 0/0
        debug lockdep = 0/0
        debug mds = 0/0
        debug mds balancer = 0/0
        debug mds locker = 0/0
        debug mds log = 0/0
        debug mds log expire = 0/0
        debug mds migrator = 0/0
        debug mon = 0/0
        debug monc = 0/0
        debug ms = 0/0
        debug objclass = 0/0
        debug objectcacher = 0/0
        debug objecter = 0/0
        debug optracker = 0/0
        debug osd = 0/0
        debug paxos = 0/0
        debug perfcounter = 0/0
        debug rados = 0/0
        debug rbd = 0/0
        debug rgw = 0/0
        debug throttle = 0/0
        debug timer = 0/0
        debug tp = 0/0


        [osd]
        keyring = /var/lib/ceph/osd/ceph-$id/keyring
        osd max backfills = 1
        osd recovery max active = 1

        osd scrub begin hour = 20
        osd scrub end hour = 7
        osd scrub during recovery = false
        osd scrub load threshold = 0.3

        [client]
        rbd cache = true
        rbd cache size = 268435456      # 256MB
        rbd cache max dirty = 201326592    # 192MB
        rbd cache max dirty age = 2
        rbd cache target dirty = 33554432    # 32MB
        rbd cache writethrough until flush = true


        #[mgr]
        #debug_mgr = 20


        [mon.pve-hs-main]
        host = pve-hs-main
        mon addr = 10.10.10.251:6789

        [mon.pve-hs-2]
        host = pve-hs-2
        mon addr = 10.10.10.252:6789

        [mon.pve-hs-3]
        host = pve-hs-3
        mon addr = 10.10.10.253:6789


        My ceph versions:

        {
             "mon": {
                 "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
             },
             "mgr": {
                 "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
             },
             "osd": {
                 "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 12
             },
             "mds": {},
             "overall": {
                 "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 18
             }
        }



        My ceph osd tree:

        ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
        -1       8.93686 root default
        -6       2.94696     host pve-hs-2
          3   hdd 0.90959         osd.3            up  1.00000 1.00000
          4   hdd 0.90959         osd.4            up  1.00000 1.00000
          5   hdd 0.90959         osd.5            up  1.00000 1.00000
        10   ssd 0.21819         osd.10           up  1.00000 1.00000
        -3       2.86716     host pve-hs-3
          6   hdd 0.85599         osd.6            up  1.00000 1.00000
          7   hdd 0.85599         osd.7            up  1.00000 1.00000
          8   hdd 0.93700         osd.8            up  1.00000 1.00000
        11   ssd 0.21819         osd.11           up  1.00000 1.00000
        -7       3.12274     host pve-hs-main
          0   hdd 0.96819         osd.0            up  1.00000 1.00000
          1   hdd 0.96819         osd.1            up  1.00000 1.00000
          2   hdd 0.96819         osd.2            up  1.00000 1.00000
          9   ssd 0.21819         osd.9            up  1.00000 1.00000

        My pools:

        pool 9 'cephbackup' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5665 flags hashpspool stripe_width 0 application rbd
                 removed_snaps [1~3]
        pool 13 'cephwin' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16454 flags hashpspool stripe_width 0 application rbd
                 removed_snaps [1~5]
        pool 14 'cephnix' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16482 flags hashpspool stripe_width 0 application rbd
                 removed_snaps [1~227]
        pool 17 'cephssd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 8601 flags hashpspool stripe_width 0 application rbd
                 removed_snaps [1~3]


        I can't understand where the problem comes from. I don't
        think it's hardware: if I had a failed disk, I should
        always have problems on the same OSD. Any ideas?

        Thanks



-- *Marco Baldini*
        *H.S. Amiata Srl*
        Ufficio:        0577-779396
        Cellulare:      335-8765169
        WEB:    www.hsamiata.it <https://www.hsamiata.it/>
        EMAIL:  mbald...@hsamiata.it <mailto:mbald...@hsamiata.it>

        _______________________________________________
        ceph-users mailing list
        ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
        <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>

-- Mit freundlichen Grüßen / Best Regards
        Paul Emmerich

        croit GmbH
        Freseniusstr. 31h
        81247 München
        www.croit.io <http://www.croit.io>
        Tel: +49 89 1896585 90

        Geschäftsführer: Martin Verges
        Handelsregister: Amtsgericht München
        USt-IdNr: DE310638492


-- *Marco Baldini*
        *H.S. Amiata Srl*
        Ufficio:        0577-779396
        Cellulare:      335-8765169
        WEB:    www.hsamiata.it <https://www.hsamiata.it>
        EMAIL:  mbald...@hsamiata.it <mailto:mbald...@hsamiata.it>



        _______________________________________________
        ceph-users mailing list
        ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
        <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>

-- *Marco Baldini*
        *H.S. Amiata Srl*
        Ufficio:        0577-779396
        Cellulare:      335-8765169
        WEB:    www.hsamiata.it <https://www.hsamiata.it>
        EMAIL:  mbald...@hsamiata.it <mailto:mbald...@hsamiata.it>


        _______________________________________________
        ceph-users mailing list
        ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
        <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>




    _______________________________________________
    ceph-users mailing list
    ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>

-- *Marco Baldini*
    *H.S. Amiata Srl*
    Ufficio:    0577-779396
    Cellulare:  335-8765169
    WEB:        www.hsamiata.it <https://www.hsamiata.it>
    EMAIL:      mbald...@hsamiata.it <mailto:mbald...@hsamiata.it>


    _______________________________________________
    ceph-users mailing list
    ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
*Marco Baldini*
*H.S. Amiata Srl*
Ufficio:        0577-779396
Cellulare:      335-8765169
WEB:    www.hsamiata.it <https://www.hsamiata.it>
EMAIL:  mbald...@hsamiata.it <mailto:mbald...@hsamiata.it>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
