On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata <mbald...@hsamiata.it> wrote:
> Hi
>
> I monitor dmesg in each of the 3 nodes, no hardware issue reported. And
> the problem happens with various different OSDs in different nodes, so for
> me it is clear it's not a hardware problem.

If you have debug_osd set to 25 or greater when you run the deep scrub you
should get more information about the nature of the read error in the
ReplicatedBackend::be_deep_scrub() function (assuming this is a replicated
pool). This may create large logs, so watch that they don't exhaust storage.
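For example (a sketch only; the OSD and PG ids are taken from the logs
quoted further down -- substitute the ones from your own health output):

  # raise logging on the OSD shard that reported the read error
  ceph tell osd.6 injectargs '--debug_osd 25'
  # re-trigger the deep scrub on the inconsistent PG, then watch that OSD's log
  ceph pg deep-scrub 9.1c
  # drop the log level back down once the scrub has finished
  ceph tell osd.6 injectargs '--debug_osd 0/0'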
> Thanks for reply
>
> On 05/03/2018 21:45, Vladimir Prokofev wrote:
>
> > always solved by ceph pg repair <PG>
>
> That doesn't necessarily mean that there's no hardware issue. In my case
> repair also worked fine and returned the cluster to OK state every time,
> but in time the faulty disk failed another scrub operation, and this
> repeated multiple times before we replaced that disk.
> One last thing to look into is dmesg on your OSD nodes. If there's a
> hardware read error it will be logged in dmesg.
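For reference, a quick way to scan for that (a sketch; the pattern list is
only a starting point, extend it for your controller/driver):

  dmesg -T | egrep -i 'ata[0-9]+|i/o error|medium error|uncorrect'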
> 2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. Amiata <mbald...@hsamiata.it>:
>
>> Hi and thanks for reply
>>
>> The OSDs are all healthy; in fact, after a ceph pg repair <PG> the
>> cluster health is back to OK, and in the OSD log I see <PG> repair ok, 0 fixed.
>>
>> The SMART data of the 3 OSDs seems fine.
>>
>> *OSD.5*
>>
>> # ceph-disk list | grep osd.5
>> /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2
>>
>> # smartctl -a /dev/sdd
>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>
>> === START OF INFORMATION SECTION ===
>> Model Family:     Seagate Barracuda 7200.14 (AF)
>> Device Model:     ST1000DM003-1SB10C
>> Serial Number:    Z9A1MA1V
>> LU WWN Device Id: 5 000c50 090c7028b
>> Firmware Version: CC43
>> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>> Rotation Rate:    7200 rpm
>> Form Factor:      3.5 inches
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   ATA8-ACS T13/1699-D revision 4
>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>> Local Time is:    Mon Mar  5 16:17:22 2018 CET
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status:  (0x82) Offline data collection activity
>>                                         was completed without error.
>>                                         Auto Offline Data Collection: Enabled.
>> Self-test execution status:      (   0) The previous self-test routine completed
>>                                         without error or no self-test has ever
>>                                         been run.
>> Total time to complete Offline
>> data collection:                 (   0) seconds.
>> Offline data collection
>> capabilities:                    (0x7b) SMART execute Offline immediate.
>>                                         Auto Offline data collection on/off support.
>>                                         Suspend Offline collection upon new command.
>>                                         Offline surface scan supported.
>>                                         Self-test supported.
>>                                         Conveyance Self-test supported.
>>                                         Selective Self-test supported.
>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>                                         power-saving mode.
>>                                         Supports SMART auto save timer.
>> Error logging capability:        (0x01) Error logging supported.
>>                                         General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time:        (   1) minutes.
>> Extended self-test routine
>> recommended polling time:        ( 109) minutes.
>> Conveyance self-test routine
>> recommended polling time:        (   2) minutes.
>> SCT capabilities:              (0x1085) SCT Status supported.
>>
>> SMART Attributes Data Structure revision number: 10
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f  082   063   006    Pre-fail Always  -           193297722
>>   3 Spin_Up_Time            0x0003  097   097   000    Pre-fail Always  -           0
>>   4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always  -           60
>>   5 Reallocated_Sector_Ct   0x0033  100   100   010    Pre-fail Always  -           0
>>   7 Seek_Error_Rate         0x000f  091   060   045    Pre-fail Always  -           1451132477
>>   9 Power_On_Hours          0x0032  085   085   000    Old_age  Always  -           13283
>>  10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always  -           0
>>  12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always  -           61
>> 183 Runtime_Bad_Block       0x0032  100   100   000    Old_age  Always  -           0
>> 184 End-to-End_Error        0x0032  100   100   099    Old_age  Always  -           0
>> 187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always  -           0
>> 188 Command_Timeout         0x0032  100   100   000    Old_age  Always  -           0 0 0
>> 189 High_Fly_Writes         0x003a  086   086   000    Old_age  Always  -           14
>> 190 Airflow_Temperature_Cel 0x0022  071   055   040    Old_age  Always  -           29 (Min/Max 23/32)
>> 193 Load_Cycle_Count        0x0032  100   100   000    Old_age  Always  -           607
>> 194 Temperature_Celsius     0x0022  029   014   000    Old_age  Always  -           29 (0 14 0 0 0)
>> 195 Hardware_ECC_Recovered  0x001a  004   001   000    Old_age  Always  -           193297722
>> 197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
>> 198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
>> 199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
>> 240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline -           13211h+23m+08.363s
>> 241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline -           53042120064
>> 242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline -           170788993187
>>
>> *OSD.4*
>>
>> # ceph-disk list | grep osd.4
>> /dev/sdc1 ceph data, active, cluster ceph, osd.4, block /dev/sdc2
>>
>> # smartctl -a /dev/sdc
>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>
>> === START OF INFORMATION SECTION ===
>> Model Family:     Seagate Barracuda 7200.14 (AF)
>> Device Model:     ST1000DM003-1SB10C
>> Serial Number:    Z9A1M1BW
>> LU WWN Device Id: 5 000c50 090c78d27
>> Firmware Version: CC43
>> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>> Rotation Rate:    7200 rpm
>> Form Factor:      3.5 inches
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   ATA8-ACS T13/1699-D revision 4
>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>> Local Time is:    Mon Mar  5 16:20:46 2018 CET
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status:  (0x82) Offline data collection activity
>>                                         was completed without error.
>>                                         Auto Offline Data Collection: Enabled.
>> Self-test execution status:      (   0) The previous self-test routine completed
>>                                         without error or no self-test has ever
>>                                         been run.
>> Total time to complete Offline
>> data collection:                 (   0) seconds.
>> Offline data collection
>> capabilities:                    (0x7b) SMART execute Offline immediate.
>>                                         Auto Offline data collection on/off support.
>>                                         Suspend Offline collection upon new command.
>>                                         Offline surface scan supported.
>>                                         Self-test supported.
>>                                         Conveyance Self-test supported.
>>                                         Selective Self-test supported.
>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>                                         power-saving mode.
>>                                         Supports SMART auto save timer.
>> Error logging capability:        (0x01) Error logging supported.
>>                                         General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time:        (   1) minutes.
>> Extended self-test routine
>> recommended polling time:        ( 109) minutes.
>> Conveyance self-test routine
>> recommended polling time:        (   2) minutes.
>> SCT capabilities:              (0x1085) SCT Status supported.
>>
>> SMART Attributes Data Structure revision number: 10
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f  082   063   006    Pre-fail Always  -           194906537
>>   3 Spin_Up_Time            0x0003  097   097   000    Pre-fail Always  -           0
>>   4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always  -           64
>>   5 Reallocated_Sector_Ct   0x0033  100   100   010    Pre-fail Always  -           0
>>   7 Seek_Error_Rate         0x000f  091   060   045    Pre-fail Always  -           1485899434
>>   9 Power_On_Hours          0x0032  085   085   000    Old_age  Always  -           13390
>>  10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always  -           0
>>  12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always  -           65
>> 183 Runtime_Bad_Block       0x0032  100   100   000    Old_age  Always  -           0
>> 184 End-to-End_Error        0x0032  100   100   099    Old_age  Always  -           0
>> 187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always  -           0
>> 188 Command_Timeout         0x0032  100   100   000    Old_age  Always  -           0 0 0
>> 189 High_Fly_Writes         0x003a  095   095   000    Old_age  Always  -           5
>> 190 Airflow_Temperature_Cel 0x0022  074   051   040    Old_age  Always  -           26 (Min/Max 19/29)
>> 193 Load_Cycle_Count        0x0032  100   100   000    Old_age  Always  -           616
>> 194 Temperature_Celsius     0x0022  026   014   000    Old_age  Always  -           26 (0 14 0 0 0)
>> 195 Hardware_ECC_Recovered  0x001a  004   001   000    Old_age  Always  -           194906537
>> 197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
>> 198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
>> 199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
>> 240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline -           13315h+20m+30.974s
>> 241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline -           52137467719
>> 242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline -           177227508503
>> *OSD.8*
>>
>> # ceph-disk list | grep osd.8
>> /dev/sda1 ceph data, active, cluster ceph, osd.8, block /dev/sda2
>>
>> # smartctl -a /dev/sda
>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>
>> === START OF INFORMATION SECTION ===
>> Model Family:     Seagate Barracuda 7200.14 (AF)
>> Device Model:     ST1000DM003-1SB10C
>> Serial Number:    Z9A2BEF2
>> LU WWN Device Id: 5 000c50 0910f5427
>> Firmware Version: CC43
>> User Capacity:    1,000,203,804,160 bytes [1.00 TB]
>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>> Rotation Rate:    7200 rpm
>> Form Factor:      3.5 inches
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   ATA8-ACS T13/1699-D revision 4
>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>> Local Time is:    Mon Mar  5 16:22:47 2018 CET
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status:  (0x82) Offline data collection activity
>>                                         was completed without error.
>>                                         Auto Offline Data Collection: Enabled.
>> Self-test execution status:      (   0) The previous self-test routine completed
>>                                         without error or no self-test has ever
>>                                         been run.
>> Total time to complete Offline
>> data collection:                 (   0) seconds.
>> Offline data collection
>> capabilities:                    (0x7b) SMART execute Offline immediate.
>>                                         Auto Offline data collection on/off support.
>>                                         Suspend Offline collection upon new command.
>>                                         Offline surface scan supported.
>>                                         Self-test supported.
>>                                         Conveyance Self-test supported.
>>                                         Selective Self-test supported.
>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>                                         power-saving mode.
>>                                         Supports SMART auto save timer.
>> Error logging capability:        (0x01) Error logging supported.
>>                                         General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time:        (   1) minutes.
>> Extended self-test routine
>> recommended polling time:        ( 110) minutes.
>> Conveyance self-test routine
>> recommended polling time:        (   2) minutes.
>> SCT capabilities:              (0x1085) SCT Status supported.
>>
>> SMART Attributes Data Structure revision number: 10
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f  083   063   006    Pre-fail Always  -           224621855
>>   3 Spin_Up_Time            0x0003  097   097   000    Pre-fail Always  -           0
>>   4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always  -           275
>>   5 Reallocated_Sector_Ct   0x0033  100   100   010    Pre-fail Always  -           0
>>   7 Seek_Error_Rate         0x000f  081   060   045    Pre-fail Always  -           149383284
>>   9 Power_On_Hours          0x0032  093   093   000    Old_age  Always  -           6210
>>  10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always  -           0
>>  12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always  -           265
>> 183 Runtime_Bad_Block       0x0032  100   100   000    Old_age  Always  -           0
>> 184 End-to-End_Error        0x0032  100   100   099    Old_age  Always  -           0
>> 187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always  -           0
>> 188 Command_Timeout         0x0032  100   100   000    Old_age  Always  -           0 0 0
>> 189 High_Fly_Writes         0x003a  098   098   000    Old_age  Always  -           2
>> 190 Airflow_Temperature_Cel 0x0022  069   058   040    Old_age  Always  -           31 (Min/Max 21/35)
>> 193 Load_Cycle_Count        0x0032  100   100   000    Old_age  Always  -           516
>> 194 Temperature_Celsius     0x0022  031   017   000    Old_age  Always  -           31 (0 17 0 0 0)
>> 195 Hardware_ECC_Recovered  0x001a  005   001   000    Old_age  Always  -           224621855
>> 197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
>> 198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
>> 199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
>> 240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline -           6154h+03m+35.126s
>> 241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline -           24333847321
>> 242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline -           50261005553
>>
>> However, it's not only these 3 OSDs that have PGs with errors; these are
>> only the most recent. In the last 3 months I have often had
>> OSD_SCRUB_ERRORS on various OSDs, always solved by ceph pg repair <PG>.
>> I don't think it's a hardware issue.
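Since all three attribute tables look clean, one way to actually exercise
the whole disk surface is an extended self-test (a sketch, using one of the
devices above; repeat for each suspect disk):

  # start the extended self-test (~109-110 minutes on these drives)
  smartctl -t long /dev/sdd
  # check the result once it has finished
  smartctl -l selftest /dev/sdd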
>> On 05/03/2018 13:40, Vladimir Prokofev wrote:
>>
>> > candidate had a read error
>>
>> speaks for itself - while scrubbing it couldn't read the data.
>> I had a similar issue, and it was just an OSD dying - errors and
>> relocated sectors in SMART - so I just replaced the disk. But in your
>> case it seems that the errors are on different OSDs? Are your OSDs all
>> healthy? You can use this command to see some details:
>>
>> rados list-inconsistent-obj <pg.id> --format=json-pretty
>>
>> pg.id is the PG that's reporting as inconsistent. My guess is that
>> you'll see read errors in this output, with the number of the OSD that
>> encountered the error. After that you have to check that OSD's health -
>> SMART details, etc.
>> It's not always the disk itself causing problems - for example, we had
>> read errors because of a faulty backplane interface in a server;
>> changing the chassis resolved the issue.
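For example (a sketch; the pool and PG names are ones that appear later in
this thread -- substitute your own):

  # list the PGs currently flagged inconsistent in a given pool
  rados list-inconsistent-pg cephnix
  # then inspect one of them; shards with a "read_error" entry name the OSD
  rados list-inconsistent-obj 14.31 --format=json-pretty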
>> 2018-03-05 14:21 GMT+03:00 Marco Baldini - H.S. Amiata <mbald...@hsamiata.it>:
>>
>>> Hi
>>>
>>> After some days with debug_osd 5/5 I found [ERR] entries on different
>>> days, different PGs, different OSDs, different hosts. This is what I
>>> get in the OSD logs:
>>>
>>> *OSD.5 (host 3)*
>>> 2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 16486'51798 (16431'50251,16486'51798] local-lis/les=16474/16475 n=3629 ec=1477/1477 lis/c 16474/16474 les/c/f 16475/16477/0 16474/16474/16474) [5,6] r=0 lpr=16474 crt=16486'51798 lcod 16486'51797 mlcod 16486'51797 active+clean+scrubbing+deep] 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error
>>> 2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error
>>>
>>> *OSD.4 (host 3)*
>>> 2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 13.65 shard 2: soid 13:a719ecdf:::rbd_data.5f65056b8b4567.000000000000f8eb:head candidate had a read error
>>>
>>> *OSD.8 (host 2)*
>>> 2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 14.31 shard 1: soid 14:8cc6cd37:::rbd_data.30b15b6b8b4567.00000000000081a1:head candidate had a read error
>>>
>>> I don't know what this error means, and as always a ceph pg repair
>>> fixes it. I don't think this is normal.
>>>
>>> Ideas?
>>>
>>> Thanks
>>>
>>> On 28/02/2018 14:48, Marco Baldini - H.S. Amiata wrote:
>>>
>>> Hi
>>>
>>> I read the bug tracker issue and it looks a lot like my problem, even
>>> if I can't check the reported checksum because I don't have it in my
>>> logs - perhaps because of debug osd = 0/0 in ceph.conf.
>>>
>>> I just raised the OSD log level:
>>>
>>> ceph tell osd.* injectargs --debug-osd 5/5
>>>
>>> I'll check the OSD logs in the next days...
>>>
>>> Thanks
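A quick way to confirm the injected level actually took effect is to ask
the daemon itself (a sketch; osd.5 is just an example, and the command must
be run on the node hosting that OSD):

  ceph daemon osd.5 config get debug_osd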
>>> On 28/02/2018 11:59, Paul Emmerich wrote:
>>>
>>> Hi,
>>>
>>> might be http://tracker.ceph.com/issues/22464
>>>
>>> Can you check the OSD log file to see if the reported checksum
>>> is 0x6706be76?
>>>
>>> Paul
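For example, a sketch of such a check (adjust the path if your logs live
elsewhere; rotated logs would need zgrep):

  grep -H '0x6706be76' /var/log/ceph/ceph-osd.*.log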
>>> On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata <mbald...@hsamiata.it> wrote:
>>>
>>> Hello
>>>
>>> I have a little ceph cluster with 3 nodes, each with 3x1TB HDD and
>>> 1x240GB SSD. I created this cluster after the Luminous release, so all
>>> OSDs are BlueStore. In my crush map I have two rules, one targeting the
>>> SSDs and one targeting the HDDs. I have 4 pools: one uses the SSD rule
>>> and the others use the HDD rule; three pools are size=3 min_size=2, and
>>> one is size=2 min_size=1 (this one has content that is OK to lose).
>>>
>>> In the last 3 months I have been having a strange random problem. I
>>> scheduled my OSD scrubs during the night (osd scrub begin hour = 20,
>>> osd scrub end hour = 7) when the office is closed, so there is low
>>> impact on the users. Some mornings, when I check the cluster health, I
>>> find:
>>>
>>> HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
>>> OSD_SCRUB_ERRORS X scrub errors
>>> PG_DAMAGED Possible data damage: Y pgs inconsistent
>>>
>>> X and Y are sometimes 1, sometimes 2.
>>>
>>> I issue a ceph health detail, check the damaged PGs, and run a ceph pg
>>> repair for each damaged PG; I get
>>>
>>> instructing pg PG on osd.N to repair
>>>
>>> The PGs are different, the OSD that has to repair the PG is different,
>>> even the node hosting the OSD is different, so I made a list of all the
>>> PGs and OSDs. This morning is the most recent case:
>>>
>>> > ceph health detail
>>> HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
>>> OSD_SCRUB_ERRORS 2 scrub errors
>>> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>>>     pg 13.65 is active+clean+inconsistent, acting [4,2,6]
>>>     pg 14.31 is active+clean+inconsistent, acting [8,3,1]
>>>
>>> > ceph pg repair 13.65
>>> instructing pg 13.65 on osd.4 to repair
>>>
>>> (node-2)> tail /var/log/ceph/ceph-osd.4.log
>>> 2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair starts
>>> 2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair ok, 0 fixed
>>>
>>> > ceph pg repair 14.31
>>> instructing pg 14.31 on osd.8 to repair
>>>
>>> (node-3)> tail /var/log/ceph/ceph-osd.8.log
>>> 2018-02-28 08:52:37.297490 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair starts
>>> 2018-02-28 08:53:00.704020 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair ok, 0 fixed
>>>
>>> I made a list of when I got OSD_SCRUB_ERRORS, which PG was involved,
>>> and which OSD had to repair it. Dates are dd/mm/yyyy:
>>>
>>> 21/12/2017 -- pg 14.29 is active+clean+inconsistent, acting [6,2,4]
>>>
>>> 18/01/2018 -- pg 14.5a is active+clean+inconsistent, acting [6,4,1]
>>>
>>> 22/01/2018 -- pg 9.3a is active+clean+inconsistent, acting [2,7]
>>>
>>> 29/01/2018 -- pg 13.3e is active+clean+inconsistent, acting [4,6,1]
>>>               instructing pg 13.3e on osd.4 to repair
>>>
>>> 07/02/2018 -- pg 13.7e is active+clean+inconsistent, acting [8,2,5]
>>>               instructing pg 13.7e on osd.8 to repair
>>>
>>> 09/02/2018 -- pg 13.30 is active+clean+inconsistent, acting [7,3,2]
>>>               instructing pg 13.30 on osd.7 to repair
>>>
>>> 15/02/2018 -- pg 9.35 is active+clean+inconsistent, acting [1,8]
>>>               instructing pg 9.35 on osd.1 to repair
>>>               pg 13.3e is active+clean+inconsistent, acting [4,6,1]
>>>               instructing pg 13.3e on osd.4 to repair
>>>
>>> 17/02/2018 -- pg 9.2d is active+clean+inconsistent, acting [7,5]
>>>               instructing pg 9.2d on osd.7 to repair
>>>
>>> 22/02/2018 -- pg 9.24 is active+clean+inconsistent, acting [5,8]
>>>               instructing pg 9.24 on osd.5 to repair
>>>
>>> 28/02/2018 -- pg 13.65 is active+clean+inconsistent, acting [4,2,6]
>>>               instructing pg 13.65 on osd.4 to repair
>>>               pg 14.31 is active+clean+inconsistent, acting [8,3,1]
>>>               instructing pg 14.31 on osd.8 to repair
>>>
>>> In case it's useful, my ceph.conf is here:
>>>
>>> [global]
>>> auth client required = none
>>> auth cluster required = none
>>> auth service required = none
>>> fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
>>> cluster network = 10.10.10.0/24
>>> public network = 10.10.10.0/24
>>> keyring = /etc/pve/priv/$cluster.$name.keyring
>>> mon allow pool delete = true
>>> osd journal size = 5120
>>> osd pool default min size = 2
>>> osd pool default size = 3
>>> bluestore_block_db_size = 64424509440
>>>
>>> debug asok = 0/0
>>> debug auth = 0/0
>>> debug buffer = 0/0
>>> debug client = 0/0
>>> debug context = 0/0
>>> debug crush = 0/0
>>> debug filer = 0/0
>>> debug filestore = 0/0
>>> debug finisher = 0/0
>>> debug heartbeatmap = 0/0
>>> debug journal = 0/0
>>> debug journaler = 0/0
>>> debug lockdep = 0/0
>>> debug mds = 0/0
>>> debug mds balancer = 0/0
>>> debug mds locker = 0/0
>>> debug mds log = 0/0
>>> debug mds log expire = 0/0
>>> debug mds migrator = 0/0
>>> debug mon = 0/0
>>> debug monc = 0/0
>>> debug ms = 0/0
>>> debug objclass = 0/0
>>> debug objectcacher = 0/0
>>> debug objecter = 0/0
>>> debug optracker = 0/0
>>> debug osd = 0/0
>>> debug paxos = 0/0
>>> debug perfcounter = 0/0
>>> debug rados = 0/0
>>> debug rbd = 0/0
>>> debug rgw = 0/0
>>> debug throttle = 0/0
>>> debug timer = 0/0
>>> debug tp = 0/0
>>>
>>> [osd]
>>> keyring = /var/lib/ceph/osd/ceph-$id/keyring
>>> osd max backfills = 1
>>> osd recovery max active = 1
>>>
>>> osd scrub begin hour = 20
>>> osd scrub end hour = 7
>>> osd scrub during recovery = false
>>> osd scrub load threshold = 0.3
>>>
>>> [client]
>>> rbd cache = true
>>> rbd cache size = 268435456          # 256MB
>>> rbd cache max dirty = 201326592     # 192MB
>>> rbd cache max dirty age = 2
>>> rbd cache target dirty = 33554432   # 32MB
>>> rbd cache writethrough until flush = true
>>>
>>> #[mgr]
>>> #debug_mgr = 20
>>>
>>> [mon.pve-hs-main]
>>> host = pve-hs-main
>>> mon addr = 10.10.10.251:6789
>>>
>>> [mon.pve-hs-2]
>>> host = pve-hs-2
>>> mon addr = 10.10.10.252:6789
>>>
>>> [mon.pve-hs-3]
>>> host = pve-hs-3
>>> mon addr = 10.10.10.253:6789
>>>
>>> My ceph versions:
>>>
>>> {
>>>     "mon": {
>>>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
>>>     },
>>>     "mgr": {
>>>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
>>>     },
>>>     "osd": {
>>>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 12
>>>     },
>>>     "mds": {},
>>>     "overall": {
>>>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 18
>>>     }
>>> }
>>>
>>> My ceph osd tree:
>>>
>>> ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
>>> -1       8.93686 root default
>>> -6       2.94696     host pve-hs-2
>>>  3   hdd 0.90959         osd.3            up  1.00000 1.00000
>>>  4   hdd 0.90959         osd.4            up  1.00000 1.00000
>>>  5   hdd 0.90959         osd.5            up  1.00000 1.00000
>>> 10   ssd 0.21819         osd.10           up  1.00000 1.00000
>>> -3       2.86716     host pve-hs-3
>>>  6   hdd 0.85599         osd.6            up  1.00000 1.00000
>>>  7   hdd 0.85599         osd.7            up  1.00000 1.00000
>>>  8   hdd 0.93700         osd.8            up  1.00000 1.00000
>>> 11   ssd 0.21819         osd.11           up  1.00000 1.00000
>>> -7       3.12274     host pve-hs-main
>>>  0   hdd 0.96819         osd.0            up  1.00000 1.00000
>>>  1   hdd 0.96819         osd.1            up  1.00000 1.00000
>>>  2   hdd 0.96819         osd.2            up  1.00000 1.00000
>>>  9   ssd 0.21819         osd.9            up  1.00000 1.00000
>>>
>>> My pools:
>>>
>>> pool 9 'cephbackup' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5665 flags hashpspool stripe_width 0 application rbd
>>>     removed_snaps [1~3]
>>> pool 13 'cephwin' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16454 flags hashpspool stripe_width 0 application rbd
>>>     removed_snaps [1~5]
>>> pool 14 'cephnix' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16482 flags hashpspool stripe_width 0 application rbd
>>>     removed_snaps [1~227]
>>> pool 17 'cephssd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 8601 flags hashpspool stripe_width 0 application rbd
>>>     removed_snaps [1~3]
>>>
>>> I can't understand where the problem comes from. I don't think it's
>>> hardware: if I had a failed disk, I should see problems always on the
>>> same OSD. Any ideas?
>>>
>>> Thanks
>>>
>>> --
>>> *Marco Baldini*
>>> *H.S. Amiata Srl*
>>> Ufficio: 0577-779396
>>> Cellulare: 335-8765169
>>> WEB: www.hsamiata.it
>>> EMAIL: mbald...@hsamiata.it
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> --
>>> Mit freundlichen Grüßen / Best Regards
>>> Paul Emmerich
>>>
>>> croit GmbH
>>> Freseniusstr. 31h
>>> 81247 München
>>> www.croit.io
>>> Tel: +49 89 1896585 90
>>>
>>> Geschäftsführer: Martin Verges
>>> Handelsregister: Amtsgericht München
>>> USt-IdNr: DE310638492
--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com