Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-28 Thread Brad Hubbard
On Fri, Jun 29, 2018 at 2:38 AM, Andrei Mikhailovsky  wrote:
> Hi Brad,
>
> This has helped to repair the issue. Many thanks for your help on this!!!

No problem.

>
> I had so many objects with broken omap checksums that I spent at least a few
> hours identifying them and repairing them with the commands you listed. They
> were all related to one pool, .rgw.buckets.index. All other pools look
> okay so far.

So originally you said you were having trouble with "one inconsistent
and stubborn PG". When did that become "so many objects"?

>
> I am wondering what could have gone horribly wrong with the above pool?

Is that pool 18? I notice it seems to be size 2; what is min_size on that pool?
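
For reference, the replication settings can be checked with something like this
(substitute the real pool name if it is not .rgw.buckets.index):

# ceph osd pool ls detail | grep buckets.index
# ceph osd pool get .rgw.buckets.index size
# ceph osd pool get .rgw.buckets.index min_size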

As to working out what went wrong: what event(s) coincided with or
preceded the problem? What history can you provide? What data can you
provide from the time leading up to when the issue was first seen?

>
> Cheers
>
> Andrei
> - Original Message -
>> From: "Brad Hubbard" 
>> To: "Andrei Mikhailovsky" 
>> Cc: "ceph-users" 
>> Sent: Thursday, 28 June, 2018 01:08:34
>> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG
>
>> Try the following. You can do this with all osds up and running.
>>
>> # rados -p [name_of_pool_18] setomapval .dir.default.80018061.2
>> temporary-key anything
>> # ceph pg deep-scrub 18.2
>>
>> Once you are sure the scrub has completed and the pg is no longer
>> inconsistent you can remove the temporary key.
>>
>> # rados -p [name_of_pool_18] rmomapkey .dir.default.80018061.2 temporary-key
>>
>>
>> On Wed, Jun 27, 2018 at 9:42 PM, Andrei Mikhailovsky  
>> wrote:
>>> Here is one more thing:
>>>
>>> rados list-inconsistent-obj 18.2
>>> {
>>>"inconsistents" : [
>>>   {
>>>  "object" : {
>>> "locator" : "",
>>> "version" : 632942,
>>> "nspace" : "",
>>> "name" : ".dir.default.80018061.2",
>>> "snap" : "head"
>>>  },
>>>  "union_shard_errors" : [
>>> "omap_digest_mismatch_info"
>>>  ],
>>>  "shards" : [
>>> {
>>>"osd" : 21,
>>>"primary" : true,
>>>"data_digest" : "0x",
>>>"omap_digest" : "0x25e8a1da",
>>>"errors" : [
>>>   "omap_digest_mismatch_info"
>>>],
>>>"size" : 0
>>> },
>>> {
>>>"data_digest" : "0x",
>>>"primary" : false,
>>>"osd" : 28,
>>>"errors" : [
>>>   "omap_digest_mismatch_info"
>>>],
>>>"omap_digest" : "0x25e8a1da",
>>>"size" : 0
>>> }
>>>  ],
>>>  "errors" : [],
>>>  "selected_object_info" : {
>>> "mtime" : "2018-06-19 16:31:44.759717",
>>> "alloc_hint_flags" : 0,
>>> "size" : 0,
>>> "last_reqid" : "client.410876514.0:1",
>>> "local_mtime" : "2018-06-19 16:31:44.760139",
>>> "data_digest" : "0x",
>>> "truncate_seq" : 0,
>>> "legacy_snaps" : [],
>>> "expected_write_size" : 0,
>>> "watchers" : {},
>>> "flags" : [
>>>"dirty",
>>>"data_digest",
>>>"omap_digest"
>>> ],
>>> "oid" : {
>>>    "pool" : 18,
>>>"hash" : 1156456354,
>>>"key" : "",
>>>"oid" : ".dir.default.80018061.2",
>>>    "namespace" : "",

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-28 Thread Andrei Mikhailovsky
Hi Brad,

This has helped to repair the issue. Many thanks for your help on this!!!

I had so many objects with broken omap checksums that I spent at least a few
hours identifying them and repairing them with the commands you listed. They
were all related to one pool, .rgw.buckets.index. All other pools look
okay so far.
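
For anyone hitting the same thing, here is a rough sketch of how the affected
objects could be enumerated and fixed in bulk rather than by hand. It assumes
jq is installed, that the pool really is .rgw.buckets.index, and that the
temporary-key trick quoted below is appropriate for every listed object, so
treat it as a starting point rather than a tested script:

pool=.rgw.buckets.index
for pg in $(rados list-inconsistent-pg $pool | jq -r '.[]'); do
    for obj in $(rados list-inconsistent-obj $pg | jq -r '.inconsistents[].object.name'); do
        rados -p $pool setomapval "$obj" temporary-key anything
    done
    ceph pg deep-scrub $pg
done

Once the deep scrubs have finished and the PGs are clean again, the marker key
can be removed from each object with rados -p $pool rmomapkey <object> temporary-key.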

I am wondering what could have gone horribly wrong with the above pool?

Cheers

Andrei
- Original Message -
> From: "Brad Hubbard" 
> To: "Andrei Mikhailovsky" 
> Cc: "ceph-users" 
> Sent: Thursday, 28 June, 2018 01:08:34
> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG

> Try the following. You can do this with all osds up and running.
> 
> # rados -p [name_of_pool_18] setomapval .dir.default.80018061.2
> temporary-key anything
> # ceph pg deep-scrub 18.2
> 
> Once you are sure the scrub has completed and the pg is no longer
> inconsistent you can remove the temporary key.
> 
> # rados -p [name_of_pool_18] rmomapkey .dir.default.80018061.2 temporary-key
> 
> 
> On Wed, Jun 27, 2018 at 9:42 PM, Andrei Mikhailovsky  
> wrote:
>> Here is one more thing:
>>
>> rados list-inconsistent-obj 18.2
>> {
>>"inconsistents" : [
>>   {
>>  "object" : {
>> "locator" : "",
>> "version" : 632942,
>> "nspace" : "",
>> "name" : ".dir.default.80018061.2",
>> "snap" : "head"
>>  },
>>  "union_shard_errors" : [
>> "omap_digest_mismatch_info"
>>  ],
>>  "shards" : [
>> {
>>"osd" : 21,
>>"primary" : true,
>>"data_digest" : "0x",
>>"omap_digest" : "0x25e8a1da",
>>"errors" : [
>>   "omap_digest_mismatch_info"
>>],
>>"size" : 0
>> },
>> {
>>"data_digest" : "0x",
>>"primary" : false,
>>"osd" : 28,
>>"errors" : [
>>   "omap_digest_mismatch_info"
>>],
>>"omap_digest" : "0x25e8a1da",
>>"size" : 0
>> }
>>  ],
>>  "errors" : [],
>>  "selected_object_info" : {
>> "mtime" : "2018-06-19 16:31:44.759717",
>> "alloc_hint_flags" : 0,
>> "size" : 0,
>> "last_reqid" : "client.410876514.0:1",
>> "local_mtime" : "2018-06-19 16:31:44.760139",
>> "data_digest" : "0x",
>> "truncate_seq" : 0,
>> "legacy_snaps" : [],
>> "expected_write_size" : 0,
>> "watchers" : {},
>> "flags" : [
>>"dirty",
>>"data_digest",
>>"omap_digest"
>> ],
>> "oid" : {
>>"pool" : 18,
>>"hash" : 1156456354,
>>"key" : "",
>>"oid" : ".dir.default.80018061.2",
>>    "namespace" : "",
>>"snapid" : -2,
>>"max" : 0
>> },
>> "truncate_size" : 0,
>> "version" : "120985'632942",
>> "expected_object_size" : 0,
>> "omap_digest" : "0x",
>> "lost" : 0,
>> "manifest" : {
>>"redirect_target" : {
>>   "namespace" : "",
>>   "snapid" : 0,
>>   "max" : 0,
>>   "pool" : -9223372036854775808,
>>   "hash" : 0,
>>   "oid" : "",
>>

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-27 Thread Brad Hubbard
Try the following. You can do this with all osds up and running.

# rados -p [name_of_pool_18] setomapval .dir.default.80018061.2
temporary-key anything
# ceph pg deep-scrub 18.2

Once you are sure the scrub has completed and the pg is no longer
inconsistent you can remove the temporary key.
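
To check that the deep scrub has finished and the pg really is clean, something
along these lines should do (a sketch; the exact output wording varies a little
between releases):

# ceph health detail | grep 18.2
# ceph pg ls inconsistent
# rados list-inconsistent-obj 18.2 | jq '.inconsistents | length'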

# rados -p [name_of_pool_18] rmomapkey .dir.default.80018061.2 temporary-key


On Wed, Jun 27, 2018 at 9:42 PM, Andrei Mikhailovsky  wrote:
> Here is one more thing:
>
> rados list-inconsistent-obj 18.2
> {
>"inconsistents" : [
>   {
>  "object" : {
> "locator" : "",
> "version" : 632942,
> "nspace" : "",
> "name" : ".dir.default.80018061.2",
> "snap" : "head"
>  },
>  "union_shard_errors" : [
> "omap_digest_mismatch_info"
>  ],
>  "shards" : [
> {
>"osd" : 21,
>"primary" : true,
>"data_digest" : "0x",
>"omap_digest" : "0x25e8a1da",
>"errors" : [
>   "omap_digest_mismatch_info"
>],
>"size" : 0
> },
> {
>"data_digest" : "0x",
>"primary" : false,
>"osd" : 28,
>"errors" : [
>   "omap_digest_mismatch_info"
>],
>"omap_digest" : "0x25e8a1da",
>"size" : 0
> }
>  ],
>  "errors" : [],
>  "selected_object_info" : {
> "mtime" : "2018-06-19 16:31:44.759717",
> "alloc_hint_flags" : 0,
> "size" : 0,
> "last_reqid" : "client.410876514.0:1",
> "local_mtime" : "2018-06-19 16:31:44.760139",
> "data_digest" : "0x",
> "truncate_seq" : 0,
> "legacy_snaps" : [],
> "expected_write_size" : 0,
> "watchers" : {},
> "flags" : [
>"dirty",
>"data_digest",
>"omap_digest"
> ],
> "oid" : {
>"pool" : 18,
>"hash" : 1156456354,
>"key" : "",
>"oid" : ".dir.default.80018061.2",
>"namespace" : "",
>"snapid" : -2,
>"max" : 0
> },
> "truncate_size" : 0,
> "version" : "120985'632942",
> "expected_object_size" : 0,
> "omap_digest" : "0x",
> "lost" : 0,
> "manifest" : {
>"redirect_target" : {
>   "namespace" : "",
>   "snapid" : 0,
>   "max" : 0,
>   "pool" : -9223372036854775808,
>   "hash" : 0,
>   "oid" : "",
>   "key" : ""
>},
>"type" : 0
> },
> "prior_version" : "0'0",
> "user_version" : 632942
>  }
>   }
>],
>"epoch" : 121151
> }
>
> Cheers
>
> - Original Message -
>> From: "Andrei Mikhailovsky" 
>> To: "Brad Hubbard" 
>> Cc: "ceph-users" 
>> Sent: Wednesday, 27 June, 2018 09:10:07
>> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG
>
>> Hi Brad,
>>
>> Thanks, that helped to get the query info on the inconsistent PG 18.2:
>>
>> {
>>"state": "active+clean+inconsistent",
>>"snap_trimq": "[]",
>>"snap_trimq_len": 0,
>>"epoch": 121293,
>>"up": [
>>21,
>>28
>>],
>>"a

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-27 Thread Andrei Mikhailovsky
Here is one more thing:

rados list-inconsistent-obj 18.2
{
   "inconsistents" : [
  {
 "object" : {
"locator" : "",
"version" : 632942,
"nspace" : "",
"name" : ".dir.default.80018061.2",
"snap" : "head"
 },
 "union_shard_errors" : [
"omap_digest_mismatch_info"
 ],
 "shards" : [
{
   "osd" : 21,
   "primary" : true,
   "data_digest" : "0x",
   "omap_digest" : "0x25e8a1da",
   "errors" : [
  "omap_digest_mismatch_info"
   ],
   "size" : 0
},
{
   "data_digest" : "0x",
   "primary" : false,
   "osd" : 28,
   "errors" : [
  "omap_digest_mismatch_info"
   ],
   "omap_digest" : "0x25e8a1da",
   "size" : 0
}
 ],
 "errors" : [],
 "selected_object_info" : {
"mtime" : "2018-06-19 16:31:44.759717",
"alloc_hint_flags" : 0,
"size" : 0,
"last_reqid" : "client.410876514.0:1",
"local_mtime" : "2018-06-19 16:31:44.760139",
"data_digest" : "0x",
"truncate_seq" : 0,
"legacy_snaps" : [],
"expected_write_size" : 0,
"watchers" : {},
"flags" : [
   "dirty",
   "data_digest",
   "omap_digest"
],
"oid" : {
   "pool" : 18,
   "hash" : 1156456354,
   "key" : "",
   "oid" : ".dir.default.80018061.2",
   "namespace" : "",
   "snapid" : -2,
   "max" : 0
},
"truncate_size" : 0,
"version" : "120985'632942",
"expected_object_size" : 0,
"omap_digest" : "0xffff",
"lost" : 0,
"manifest" : {
   "redirect_target" : {
  "namespace" : "",
  "snapid" : 0,
  "max" : 0,
  "pool" : -9223372036854775808,
  "hash" : 0,
  "oid" : "",
  "key" : ""
   },
   "type" : 0
},
"prior_version" : "0'0",
"user_version" : 632942
 }
  }
   ],
   "epoch" : 121151
}
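
For what it is worth, the relevant parts of that output can be summarised with
something like this (a sketch, assuming jq is available):

rados list-inconsistent-obj 18.2 --format=json-pretty | \
    jq '.inconsistents[] | {object: .object.name, union_shard_errors,
        shards: [.shards[] | {osd, omap_digest, errors}]}'

Both shards report omap_digest 0x25e8a1da; it is the digest recorded in the
object info that no longer matches, which is what omap_digest_mismatch_info
indicates.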

Cheers

- Original Message -
> From: "Andrei Mikhailovsky" 
> To: "Brad Hubbard" 
> Cc: "ceph-users" 
> Sent: Wednesday, 27 June, 2018 09:10:07
> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG

> Hi Brad,
> 
> Thanks, that helped to get the query info on the inconsistent PG 18.2:
> 
> {
>"state": "active+clean+inconsistent",
>"snap_trimq": "[]",
>"snap_trimq_len": 0,
>"epoch": 121293,
>"up": [
>21,
>28
>],
>"acting": [
>21,
>28
>],
>"actingbackfill": [
>"21",
>"28"
>],
>"info": {
>"pgid": "18.2",
>"last_update": "121290'698339",
>"last_complete": "121290'698339",
>"log_tail": "121272'696825",
>"last_user_version": 698319,
>"last_backfill": "MAX",
>"last_backfill_bitwise": 0,
>"purged_snaps": [],
>"history": {
>"epoch_created": 24431,
>"epoch_pool_created": 24431,
>"last_epoch_started": 121152,
>"last_interval_started": 121151,
>"last_epoch_clean": 121152,
>"last_interval_clean&q

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-27 Thread Andrei Mikhailovsky
15:28:20.335739",
"log_size": 1526,
"ondisk_log_size": 1526,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": true,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 0,
"num_objects": 69,
"num_object_clones": 0,
"num_object_copies": 138,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 1,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 64,
"num_whiteouts": 0,
"num_read": 14057,
"num_read_kb": 454200,
"num_write": 797911,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 207,
"num_bytes_recovered": 0,
"num_keys_recovered": 9482826,
"num_objects_omap": 60,
"num_objects_hit_set_archive": 0,
        "num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
21,
28
],
"acting": [
21,
28
],
"blocked_by": [],
"up_primary": 21,
"acting_primary": 21
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 121152,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-06-21 16:35:46.478007",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
},
"scrub": {
"scrubber.epoch_start": "121151",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2018-06-21 16:35:45.052939"
}
],
"agent_state": {}
}




Thanks for trying to 

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-26 Thread Brad Hubbard
; caps: [osd] allow rwx
> client.bootstrap-mds
> caps: [mgr] allow r
> caps: [mon] allow profile bootstrap-mds
> client.bootstrap-mgr
> caps: [mon] allow profile bootstrap-mgr
> client.bootstrap-osd
> caps: [mgr] allow r
> caps: [mon] allow profile bootstrap-osd
> client.bootstrap-rgw
> caps: [mgr] allow r
> caps: [mon] allow profile bootstrap-rgw
> client.ceph-monitors
> caps: [mgr] allow r
> caps: [mon] allow r
> client.libvirt
> caps: [mgr] allow r
> caps: [mon] allow r
> caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
> pool=libvirt-pool
> client.primary-ubuntu-1
> caps: [mgr] allow r
> caps: [mon] allow r
> caps: [osd] allow rwx pool=Primary-ubuntu-1
> client.radosgw1.gateway
> caps: [mgr] allow r
> caps: [mon] allow rwx
> caps: [osd] allow rwx
> client.radosgw2.gateway
> caps: [mgr] allow r
> caps: [mon] allow rw
> caps: [osd] allow rwx
> client.ssdcs
> caps: [mgr] allow r
> caps: [mon] allow r
> caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
> pool=ssdcs
>
> mgr.arh-ibstorage1-ib
> caps: [mds] allow *
> caps: [mon] allow profile mgr
>     caps: [osd] allow *
> mgr.arh-ibstorage2-ib
> caps: [mds] allow *
> caps: [mon] allow profile mgr
> caps: [osd] allow *
>
>
>
>
>
>
>
> I have run this command on all pgs in the cluster and it shows the same error
> message for all of them. For example:
>
> Error EPERM: problem getting command descriptions from pg.5.1c9
>
> Andrei
>
>
> - Original Message -
>> From: "Brad Hubbard" 
>> To: "Andrei Mikhailovsky" 
>> Cc: "ceph-users" 
>> Sent: Tuesday, 26 June, 2018 01:10:34
>> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG
>
>> Interesting...
>>
>> Can I see the output of "ceph auth list" and can you test whether you
>> can query any other pg that has osd.21 as its primary?
>>
>> On Mon, Jun 25, 2018 at 8:04 PM, Andrei Mikhailovsky  
>> wrote:
>>> Hi Brad,
>>>
>>> here is the output:
>>>
>>> --
>>>
>>> root@arh-ibstorage1-ib:/home/andrei# ceph --debug_ms 5 --debug_auth 20 pg 
>>> 18.2
>>> query
>>> 2018-06-25 10:59:12.100302 7fe23eaa1700  2 Event(0x7fe2400e0140 nevent=5000
>>> time_id=1).set_owner idx=0 owner=140609690670848
>>> 2018-06-25 10:59:12.100398 7fe23e2a0700  2 Event(0x7fe24010d030 nevent=5000
>>> time_id=1).set_owner idx=1 owner=140609682278144
>>> 2018-06-25 10:59:12.100445 7fe23da9f700  2 Event(0x7fe240139ec0 nevent=5000
>>> time_id=1).set_owner idx=2 owner=140609673885440
>>> 2018-06-25 10:59:12.100793 7fe244b28700  1  Processor -- start
>>> 2018-06-25 10:59:12.100869 7fe244b28700  1 -- - start start
>>> 2018-06-25 10:59:12.100882 7fe244b28700  5 adding auth protocol: cephx
>>> 2018-06-25 10:59:12.101046 7fe244b28700  2 auth: KeyRing::load: loaded key 
>>> file
>>> /etc/ceph/ceph.client.admin.keyring
>>> 2018-06-25 10:59:12.101244 7fe244b28700  1 -- - --> 192.168.168.201:6789/0 
>>> --
>>> auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240174b80 con 0
>>> 2018-06-25 10:59:12.101264 7fe244b28700  1 -- - --> 192.168.168.202:6789/0 
>>> --
>>> auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240175010 con 0
>>> 2018-06-25 10:59:12.101690 7fe23e2a0700  1 -- 192.168.168.201:0/3046734987
>>> learned_addr learned my addr 192.168.168.201:0/3046734987
>>> 2018-06-25 10:59:12.101890 7fe23e2a0700  2 -- 192.168.168.201:0/3046734987 
>>> >>
>>> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 
>>> s=STATE_CONNECTING_WAIT_ACK_SEQ
>>> pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0
>>> 2018-06-25 10:59:12.102030 7fe23da9f700  2 -- 192.168.168.201:0/3046734987 
>>> >>
>>> 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 
>>> s=STATE_CONNECTING_WAIT_ACK_SEQ
>>> pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0
>>> 2018-06-25 10:59:12.102450 7fe23e2a0700  5 -- 192.168.168.201:0/3046734987 
>>> >>
>>> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1
>>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1
>>> seq 1 0x7fe234002670 mon_map magic: 0 v1
>>> 2018-06-25 10:59:12.102494 7fe23e2a07

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-26 Thread Andrei Mikhailovsky
  caps: [mon] allow rw
caps: [osd] allow rwx
client.ssdcs
caps: [mgr] allow r
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool=ssdcs

mgr.arh-ibstorage1-ib
caps: [mds] allow *
caps: [mon] allow profile mgr
caps: [osd] allow *
mgr.arh-ibstorage2-ib
caps: [mds] allow *
caps: [mon] allow profile mgr
caps: [osd] allow *







I have run this command on all pgs in the cluster and it shows the same error
message for all of them. For example:

Error EPERM: problem getting command descriptions from pg.5.1c9
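
Getting EPERM on every single pg query may point at the client's caps rather
than at the PGs themselves. That is only a guess on my part, but the
client.admin caps can be checked, and if necessary reset to the usual defaults,
with something like:

# ceph auth get client.admin
# ceph auth caps client.admin mds 'allow *' mgr 'allow *' mon 'allow *' osd 'allow *'

And candidate PGs for the "any other pg with osd.21 as its primary" test can be
listed with:

# ceph pg ls-by-primary osd.21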

Andrei


- Original Message -
> From: "Brad Hubbard" 
> To: "Andrei Mikhailovsky" 
> Cc: "ceph-users" 
> Sent: Tuesday, 26 June, 2018 01:10:34
> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG

> Interesting...
> 
> Can I see the output of "ceph auth list" and can you test whether you
> can query any other pg that has osd.21 as its primary?
> 
> On Mon, Jun 25, 2018 at 8:04 PM, Andrei Mikhailovsky  
> wrote:
>> Hi Brad,
>>
>> here is the output:
>>
>> --
>>
>> root@arh-ibstorage1-ib:/home/andrei# ceph --debug_ms 5 --debug_auth 20 pg 
>> 18.2
>> query
>> 2018-06-25 10:59:12.100302 7fe23eaa1700  2 Event(0x7fe2400e0140 nevent=5000
>> time_id=1).set_owner idx=0 owner=140609690670848
>> 2018-06-25 10:59:12.100398 7fe23e2a0700  2 Event(0x7fe24010d030 nevent=5000
>> time_id=1).set_owner idx=1 owner=140609682278144
>> 2018-06-25 10:59:12.100445 7fe23da9f700  2 Event(0x7fe240139ec0 nevent=5000
>> time_id=1).set_owner idx=2 owner=140609673885440
>> 2018-06-25 10:59:12.100793 7fe244b28700  1  Processor -- start
>> 2018-06-25 10:59:12.100869 7fe244b28700  1 -- - start start
>> 2018-06-25 10:59:12.100882 7fe244b28700  5 adding auth protocol: cephx
>> 2018-06-25 10:59:12.101046 7fe244b28700  2 auth: KeyRing::load: loaded key 
>> file
>> /etc/ceph/ceph.client.admin.keyring
>> 2018-06-25 10:59:12.101244 7fe244b28700  1 -- - --> 192.168.168.201:6789/0 --
>> auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240174b80 con 0
>> 2018-06-25 10:59:12.101264 7fe244b28700  1 -- - --> 192.168.168.202:6789/0 --
>> auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240175010 con 0
>> 2018-06-25 10:59:12.101690 7fe23e2a0700  1 -- 192.168.168.201:0/3046734987
>> learned_addr learned my addr 192.168.168.201:0/3046734987
>> 2018-06-25 10:59:12.101890 7fe23e2a0700  2 -- 192.168.168.201:0/3046734987 >>
>> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 
>> s=STATE_CONNECTING_WAIT_ACK_SEQ
>> pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0
>> 2018-06-25 10:59:12.102030 7fe23da9f700  2 -- 192.168.168.201:0/3046734987 >>
>> 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 
>> s=STATE_CONNECTING_WAIT_ACK_SEQ
>> pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0
>> 2018-06-25 10:59:12.102450 7fe23e2a0700  5 -- 192.168.168.201:0/3046734987 >>
>> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1
>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1
>> seq 1 0x7fe234002670 mon_map magic: 0 v1
>> 2018-06-25 10:59:12.102494 7fe23e2a0700  5 -- 192.168.168.201:0/3046734987 >>
>> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1
>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1
>> seq 2 0x7fe234002b70 auth_reply(proto 2 0 (0) Success) v1
>> 2018-06-25 10:59:12.102542 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 
>> <==
>> mon.1 192.168.168.202:6789/0 1  mon_map magic: 0 v1  505+0+0
>> (2386987630 0 0) 0x7fe234002670 con 0x7fe240176dc0
>> 2018-06-25 10:59:12.102629 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 
>> <==
>> mon.1 192.168.168.202:6789/0 2  auth_reply(proto 2 0 (0) Success) v1 
>> 33+0+0 (1469975654 0 0) 0x7fe234002b70 con 0x7fe240176dc0
>> 2018-06-25 10:59:12.102655 7fe23ca9d700 10 cephx: set_have_need_key no 
>> handler
>> for service mon
>> 2018-06-25 10:59:12.102657 7fe23ca9d700 10 cephx: set_have_need_key no 
>> handler
>> for service osd
>> 2018-06-25 10:59:12.102658 7fe23ca9d700 10 cephx: set_have_need_key no 
>> handler
>> for service mgr
>> 2018-06-25 10:59:12.102661 7fe23ca9d700 10 cephx: set_have_need_key no 
>> handler
>> for service auth
>> 2018-06-25 10:59:12.102662 7fe23ca9d700 10 cephx: validate_tickets want 53 
>> have
>> 0 need 53
>> 2018-06-25 10:59:12.102666 7fe23ca9d700 10 cephx client: handle_response ret 
>> = 0
>> 2018-06-25 10:59:12.102671 7fe23ca9d700 1

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-25 Thread Andrei Mikhailovsky
46734987 
shutdown_connections mark down 192.168.168.201:6789/0 0x7fe24017a420
2018-06-25 10:59:12.112543 7fe244b28700  5 -- 192.168.168.201:0/3046734987 
shutdown_connections mark down 192.168.168.203:6828/43673 0x7fe240180f20
2018-06-25 10:59:12.112549 7fe244b28700  5 -- 192.168.168.201:0/3046734987 
shutdown_connections mark down 192.168.168.202:6789/0 0x7fe240176dc0
2018-06-25 10:59:12.112554 7fe244b28700  5 -- 192.168.168.201:0/3046734987 
shutdown_connections delete 0x7fe2240127b0
2018-06-25 10:59:12.112570 7fe244b28700  5 -- 192.168.168.201:0/3046734987 
shutdown_connections delete 0x7fe240176dc0
2018-06-25 10:59:12.112577 7fe244b28700  5 -- 192.168.168.201:0/3046734987 
shutdown_connections delete 0x7fe24017a420
2018-06-25 10:59:12.112582 7fe244b28700  5 -- 192.168.168.201:0/3046734987 
shutdown_connections delete 0x7fe240180f20
2018-06-25 10:59:12.112701 7fe244b28700  1 -- 192.168.168.201:0/3046734987 
shutdown_connections
2018-06-25 10:59:12.112752 7fe244b28700  1 -- 192.168.168.201:0/3046734987 wait 
complete.
2018-06-25 10:59:12.112764 7fe244b28700  1 -- 192.168.168.201:0/3046734987 >> 
192.168.168.201:0/3046734987 conn(0x7fe240167220 :-1 s=STATE_NONE pgs=0 cs=0 
l=0).mark_down
2018-06-25 10:59:12.112770 7fe244b28700  2 -- 192.168.168.201:0/3046734987 >> 
192.168.168.201:0/3046734987 conn(0x7fe240167220 :-1 s=STATE_NONE pgs=0 cs=0 
l=0)._stop


--


Thanks

- Original Message -
> From: "Brad Hubbard" 
> To: "Andrei Mikhailovsky" 
> Cc: "ceph-users" 
> Sent: Monday, 25 June, 2018 02:28:55
> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG

> Can you try the following?
> 
> $ ceph --debug_ms 5 --debug_auth 20 pg 18.2 query
> 
> On Fri, Jun 22, 2018 at 7:54 PM, Andrei Mikhailovsky  
> wrote:
>> Hi Brad,
>>
>> here is the output of the command (replaced the real auth key with [KEY]):
>>
>>
>> 
>>
>> 2018-06-22 10:47:27.659895 7f70ef9e6700 10 monclient: build_initial_monmap
>> 2018-06-22 10:47:27.661995 7f70ef9e6700 10 monclient: init
>> 2018-06-22 10:47:27.662002 7f70ef9e6700  5 adding auth protocol: cephx
>> 2018-06-22 10:47:27.662004 7f70ef9e6700 10 monclient: auth_supported 2 method
>> cephx
>> 2018-06-22 10:47:27.662221 7f70ef9e6700  2 auth: KeyRing::load: loaded key 
>> file
>> /etc/ceph/ceph.client.admin.keyring
>> 2018-06-22 10:47:27.662338 7f70ef9e6700 10 monclient: _reopen_session rank -1
>> 2018-06-22 10:47:27.662425 7f70ef9e6700 10 monclient(hunting): picked
>> mon.noname-b con 0x7f70e8176c80 addr 192.168.168.202:6789/0
>> 2018-06-22 10:47:27.662484 7f70ef9e6700 10 monclient(hunting): picked
>> mon.noname-a con 0x7f70e817a2e0 addr 192.168.168.201:6789/0
>> 2018-06-22 10:47:27.662534 7f70ef9e6700 10 monclient(hunting): _renew_subs
>> 2018-06-22 10:47:27.662544 7f70ef9e6700 10 monclient(hunting): authenticate 
>> will
>> time out at 2018-06-22 10:52:27.662543
>> 2018-06-22 10:47:27.663831 7f70d77fe700 10 monclient(hunting): handle_monmap
>> mon_map magic: 0 v1
>> 2018-06-22 10:47:27.663885 7f70d77fe700 10 monclient(hunting):  got monmap 
>> 20,
>> mon.noname-b is now rank -1
>> 2018-06-22 10:47:27.663889 7f70d77fe700 10 monclient(hunting): dump:
>> epoch 20
>> fsid 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>> last_changed 2018-06-16 23:14:48.936175
>> created 0.00
>> 0: 192.168.168.201:6789/0 mon.arh-ibstorage1-ib
>> 1: 192.168.168.202:6789/0 mon.arh-ibstorage2-ib
>> 2: 192.168.168.203:6789/0 mon.arh-ibstorage3-ib
>>
>> 2018-06-22 10:47:27.664005 7f70d77fe700 10 cephx: set_have_need_key no 
>> handler
>> for service mon
>> 2018-06-22 10:47:27.664020 7f70d77fe700 10 cephx: set_have_need_key no 
>> handler
>> for service osd
>> 2018-06-22 10:47:27.664021 7f70d77fe700 10 cephx: set_have_need_key no 
>> handler
>> for service mgr
>> 2018-06-22 10:47:27.664025 7f70d77fe700 10 cephx: set_have_need_key no 
>> handler
>> for service auth
>> 2018-06-22 10:47:27.664026 7f70d77fe700 10 cephx: validate_tickets want 53 
>> have
>> 0 need 53
>> 2018-06-22 10:47:27.664032 7f70d77fe700 10 monclient(hunting): my global_id 
>> is
>> 411322261
>> 2018-06-22 10:47:27.664035 7f70d77fe700 10 cephx client: handle_response ret 
>> = 0
>> 2018-06-22 10:47:27.664046 7f70d77fe700 10 cephx client:  got initial server
>> challenge d66f2dffc2113d43
>> 2018-06-22 10:47:27.664049 7f70d77fe700 10 cephx client: validate_tickets:
>> want=53 need=53 have=0
>>
>> 2018-06-22 10:47:27.664052 7f70d77fe700 10 cephx: set_have_need_key no 
>> handler
>> for service mon
>> 2018-06-22 10:47:27.664053 7f70d77fe700 10

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-24 Thread Brad Hubbard
: build_request
> 2018-06-22 10:47:27.665039 7f70d77fe700 10 cephx client: get service keys: 
> want=53 need=21 have=32
> 2018-06-22 10:47:27.665354 7f70d77fe700 10 cephx client: handle_response ret 
> = 0
> 2018-06-22 10:47:27.665365 7f70d77fe700 10 cephx client:  
> get_principal_session_key session_key [KEY]
> 2018-06-22 10:47:27.665377 7f70d77fe700 10 cephx: verify_service_ticket_reply 
> got 3 keys
> 2018-06-22 10:47:27.665379 7f70d77fe700 10 cephx: got key for service_id mon
> 2018-06-22 10:47:27.665419 7f70d77fe700 10 cephx:  ticket.secret_id=44133
> 2018-06-22 10:47:27.665425 7f70d77fe700 10 cephx: verify_service_ticket_reply 
> service mon secret_id 44133 session_key [KEY] validity=3600.00
> 2018-06-22 10:47:27.665437 7f70d77fe700 10 cephx: ticket expires=2018-06-22 
> 11:47:27.665436 renew_after=2018-06-22 11:32:27.665436
> 2018-06-22 10:47:27.665443 7f70d77fe700 10 cephx: got key for service_id osd
> 2018-06-22 10:47:27.665476 7f70d77fe700 10 cephx:  ticket.secret_id=44133
> 2018-06-22 10:47:27.665478 7f70d77fe700 10 cephx: verify_service_ticket_reply 
> service osd secret_id 44133 session_key [KEY] validity=3600.00
> 2018-06-22 10:47:27.665497 7f70d77fe700 10 cephx: ticket expires=2018-06-22 
> 11:47:27.665496 renew_after=2018-06-22 11:32:27.665496
> 2018-06-22 10:47:27.665506 7f70d77fe700 10 cephx: got key for service_id mgr
> 2018-06-22 10:47:27.665539 7f70d77fe700 10 cephx:  ticket.secret_id=132
> 2018-06-22 10:47:27.665546 7f70d77fe700 10 cephx: verify_service_ticket_reply 
> service mgr secret_id 132 session_key [KEY] validity=3600.00
> 2018-06-22 10:47:27.665564 7f70d77fe700 10 cephx: ticket expires=2018-06-22 
> 11:47:27.665564 renew_after=2018-06-22 11:32:27.665564
> 2018-06-22 10:47:27.665573 7f70d77fe700 10 cephx: validate_tickets want 53 
> have 53 need 0
> 2018-06-22 10:47:27.665602 7f70d77fe700  1 monclient: found 
> mon.arh-ibstorage2-ib
> 2018-06-22 10:47:27.665617 7f70d77fe700 20 monclient: _un_backoff 
> reopen_interval_multipler now 1
> 2018-06-22 10:47:27.665636 7f70d77fe700 10 monclient: _send_mon_message to 
> mon.arh-ibstorage2-ib at 192.168.168.202:6789/0
> 2018-06-22 10:47:27.665656 7f70d77fe700 10 cephx: validate_tickets want 53 
> have 53 need 0
> 2018-06-22 10:47:27.665658 7f70d77fe700 20 cephx client: need_tickets: 
> want=53 have=53 need=0
> 2018-06-22 10:47:27.665661 7f70d77fe700 20 monclient: _check_auth_rotating 
> not needed by client.admin
> 2018-06-22 10:47:27.665678 7f70ef9e6700  5 monclient: authenticate success, 
> global_id 411322261
> 2018-06-22 10:47:27.665694 7f70ef9e6700 10 monclient: _renew_subs
> 2018-06-22 10:47:27.665698 7f70ef9e6700 10 monclient: _send_mon_message to 
> mon.arh-ibstorage2-ib at 192.168.168.202:6789/0
> 2018-06-22 10:47:27.665817 7f70ef9e6700 10 monclient: _renew_subs
> 2018-06-22 10:47:27.665828 7f70ef9e6700 10 monclient: _send_mon_message to 
> mon.arh-ibstorage2-ib at 192.168.168.202:6789/0
> 2018-06-22 10:47:27.666069 7f70d77fe700 10 monclient: handle_monmap mon_map 
> magic: 0 v1
> 2018-06-22 10:47:27.666102 7f70d77fe700 10 monclient:  got monmap 20, 
> mon.arh-ibstorage2-ib is now rank 1
> 2018-06-22 10:47:27.666110 7f70d77fe700 10 monclient: dump:
>
> epoch 20
> fsid 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
> last_changed 2018-06-16 23:14:48.936175
> created 0.00
> 0: 192.168.168.201:6789/0 mon.arh-ibstorage1-ib
> 1: 192.168.168.202:6789/0 mon.arh-ibstorage2-ib
> 2: 192.168.168.203:6789/0 mon.arh-ibstorage3-ib
>
> 2018-06-22 10:47:27.17 7f70eca43700 10 cephx client: build_authorizer for 
> service mgr
> 2018-06-22 10:47:27.667043 7f70eca43700 10 In get_auth_session_handler for 
> protocol 2
> 2018-06-22 10:47:27.678417 7f70eda45700 10 cephx client: build_authorizer for 
> service osd
> 2018-06-22 10:47:27.678914 7f70eda45700 10 In get_auth_session_handler for 
> protocol 2
> 2018-06-22 10:47:27.679003 7f70eda45700 10 _calc_signature seq 1 front_crc_ = 
> 2696387361 middle_crc = 0 data_crc = 0 sig = 929021353460216573
> 2018-06-22 10:47:27.679026 7f70eda45700 20 Putting signature in client 
> message(seq # 1): sig = 929021353460216573
> 2018-06-22 10:47:27.679520 7f70eda45700 10 _calc_signature seq 1 front_crc_ = 
> 1943489909 middle_crc = 0 data_crc = 0 sig = 10026640535487722288
> Error EPERM: problem getting command descriptions from pg.18.2
> 2018-06-22 10:47:27.681798 7f70ef9e6700 10 monclient: shutdown
>
>
> -
>
>
> From what I can see the auth works:
>
> 2018-06-22 10:47:27.665678 7f70ef9e6700  5 monclient: authenticate success, 
> global_id 411322261
>
>
>
>
> - Original Message -
>> From: "Brad Hubbard" 
>> To: "Andrei" 
>> Cc: "ceph-users" 
>> Sent: Frida

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-21 Thread Brad Hubbard
That seems like an authentication issue?

Try running it like so...

$ ceph --debug_monc 20 --debug_auth 20 pg 18.2 query
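
Capturing both stdout and stderr makes the output easier to share, e.g.
something like:

$ ceph --debug_monc 20 --debug_auth 20 pg 18.2 query 2>&1 | tee /tmp/pg-query.log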

On Thu, Jun 21, 2018 at 12:18 AM, Andrei Mikhailovsky  wrote:
> Hi Brad,
>
> Yes, but it doesn't show much:
>
> ceph pg 18.2 query
> Error EPERM: problem getting command descriptions from pg.18.2
>
> Cheers
>
>
>
> - Original Message -
>> From: "Brad Hubbard" 
>> To: "andrei" 
>> Cc: "ceph-users" 
>> Sent: Wednesday, 20 June, 2018 00:02:07
>> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG
>
>> Can you post the output of a pg query?
>>
>> On Tue, Jun 19, 2018 at 11:44 PM, Andrei Mikhailovsky  
>> wrote:
>>> A quick update on my issue. I have noticed that while I was trying to move
>>> the problem object on osds, the file attributes got lost on one of the osds,
>>> which I guess is why the error messages showed the missing attribute errors.
>>>
>>> I then copied the attributes metadata to the problematic object and
>>> restarted the osds in question. Following a pg repair I got a different
>>> error:
>>>
>>> 2018-06-19 13:51:05.846033 osd.21 osd.21 192.168.168.203:6828/24339 2 :
>>> cluster [ERR] 18.2 shard 21: soid 18:45f87722:::.dir.default.80018061.2:head
>>> omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi
>>> 18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910
>>> dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871
>>> alloc_hint [0 0 0])
>>> 2018-06-19 13:51:05.846042 osd.21 osd.21 192.168.168.203:6828/24339 3 :
>>> cluster [ERR] 18.2 shard 28: soid 18:45f87722:::.dir.default.80018061.2:head
>>> omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi
>>> 18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910
>>> dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871
>>> alloc_hint [0 0 0])
>>> 2018-06-19 13:51:05.846046 osd.21 osd.21 192.168.168.203:6828/24339 4 :
>>> cluster [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed
>>> to pick suitable auth object
>>> 2018-06-19 13:51:05.846118 osd.21 osd.21 192.168.168.203:6828/24339 5 :
>>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_'
>>> attr
>>> 2018-06-19 13:51:05.846129 osd.21 osd.21 192.168.168.203:6828/24339 6 :
>>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no
>>> 'snapset' attr
>>> 2018-06-19 13:51:09.810878 osd.21 osd.21 192.168.168.203:6828/24339 7 :
>>> cluster [ERR] 18.2 repair 4 errors, 0 fixed
>>>
>>> It mentions that there is an incorrect omap_digest. How do I go about
>>> fixing this?
>>>
>>> Cheers
>>>
>>> 
>>>
>>> From: "andrei" 
>>> To: "ceph-users" 
>>> Sent: Tuesday, 19 June, 2018 11:16:22
>>> Subject: [ceph-users] fixing unrepairable inconsistent PG
>>>
>>> Hello everyone
>>>
>>> I am having trouble repairing one inconsistent and stubborn PG. I get the
>>> following error in ceph.log:
>>>
>>>
>>>
>>> 2018-06-19 11:00:00.000225 mon.arh-ibstorage1-ib mon.0
>>> 192.168.168.201:6789/0 675 : cluster [ERR] overall HEALTH_ERR noout flag(s)
>>> set; 4 scrub errors; Possible data damage: 1 pg inconsistent; application
>>> not enabled on 4 pool(s)
>>> 2018-06-19 11:09:24.586392 mon.arh-ibstorage1-ib mon.0
>>> 192.168.168.201:6789/0 841 : cluster [ERR] Health check update: Possible
>>> data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
>>> 2018-06-19 11:09:27.139504 osd.21 osd.21 192.168.168.203:6828/4003 2 :
>>> cluster [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed
>>> to pick suitable object info
>>> 2018-06-19 11:09:27.139545 osd.21 osd.21 192.168.168.203:6828/4003 3 :
>>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_'
>>> attr
>>> 2018-06-19 11:09:27.139550 osd.21 osd.21 192.168.168.203:6828/4003 4 :
>>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no
>>> 'snapset' attr
>>>
>>> 2018-06-19 11:09:35.484402 osd.21 osd.21 192.168.168.203:6828/4003 5 :
>>> cluster [ERR] 18.2 repair 4 errors, 0 fixed
>>> 2018-06-19 11:09:40.601657 mon.arh-ibstorage1-ib mon.0
>>> 192.168.168.201:6789/0 844 : cluster [ERR] Health check update: Possible

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-20 Thread Andrei Mikhailovsky
Hi Brad,

Yes, but it doesn't show much:

ceph pg 18.2 query
Error EPERM: problem getting command descriptions from pg.18.2

Cheers



- Original Message -
> From: "Brad Hubbard" 
> To: "andrei" 
> Cc: "ceph-users" 
> Sent: Wednesday, 20 June, 2018 00:02:07
> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG

> Can you post the output of a pg query?
> 
> On Tue, Jun 19, 2018 at 11:44 PM, Andrei Mikhailovsky  
> wrote:
>> A quick update on my issue. I have noticed that while I was trying to move
>> the problem object on osds, the file attributes got lost on one of the osds,
>> which I guess is why the error messages showed the missing attribute errors.
>>
>> I then copied the attributes metadata to the problematic object and
>> restarted the osds in question. Following a pg repair I got a different
>> error:
>>
>> 2018-06-19 13:51:05.846033 osd.21 osd.21 192.168.168.203:6828/24339 2 :
>> cluster [ERR] 18.2 shard 21: soid 18:45f87722:::.dir.default.80018061.2:head
>> omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi
>> 18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910
>> dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871
>> alloc_hint [0 0 0])
>> 2018-06-19 13:51:05.846042 osd.21 osd.21 192.168.168.203:6828/24339 3 :
>> cluster [ERR] 18.2 shard 28: soid 18:45f87722:::.dir.default.80018061.2:head
>> omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi
>> 18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910
>> dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871
>> alloc_hint [0 0 0])
>> 2018-06-19 13:51:05.846046 osd.21 osd.21 192.168.168.203:6828/24339 4 :
>> cluster [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed
>> to pick suitable auth object
>> 2018-06-19 13:51:05.846118 osd.21 osd.21 192.168.168.203:6828/24339 5 :
>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_'
>> attr
>> 2018-06-19 13:51:05.846129 osd.21 osd.21 192.168.168.203:6828/24339 6 :
>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no
>> 'snapset' attr
>> 2018-06-19 13:51:09.810878 osd.21 osd.21 192.168.168.203:6828/24339 7 :
>> cluster [ERR] 18.2 repair 4 errors, 0 fixed
>>
>> It mentions that there is an incorrect omap_digest. How do I go about
>> fixing this?
>>
>> Cheers
>>
>> 
>>
>> From: "andrei" 
>> To: "ceph-users" 
>> Sent: Tuesday, 19 June, 2018 11:16:22
>> Subject: [ceph-users] fixing unrepairable inconsistent PG
>>
>> Hello everyone
>>
>> I am having trouble repairing one inconsistent and stubborn PG. I get the
>> following error in ceph.log:
>>
>>
>>
>> 2018-06-19 11:00:00.000225 mon.arh-ibstorage1-ib mon.0
>> 192.168.168.201:6789/0 675 : cluster [ERR] overall HEALTH_ERR noout flag(s)
>> set; 4 scrub errors; Possible data damage: 1 pg inconsistent; application
>> not enabled on 4 pool(s)
>> 2018-06-19 11:09:24.586392 mon.arh-ibstorage1-ib mon.0
>> 192.168.168.201:6789/0 841 : cluster [ERR] Health check update: Possible
>> data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
>> 2018-06-19 11:09:27.139504 osd.21 osd.21 192.168.168.203:6828/4003 2 :
>> cluster [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed
>> to pick suitable object info
>> 2018-06-19 11:09:27.139545 osd.21 osd.21 192.168.168.203:6828/4003 3 :
>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_'
>> attr
>> 2018-06-19 11:09:27.139550 osd.21 osd.21 192.168.168.203:6828/4003 4 :
>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no
>> 'snapset' attr
>>
>> 2018-06-19 11:09:35.484402 osd.21 osd.21 192.168.168.203:6828/4003 5 :
>> cluster [ERR] 18.2 repair 4 errors, 0 fixed
>> 2018-06-19 11:09:40.601657 mon.arh-ibstorage1-ib mon.0
>> 192.168.168.201:6789/0 844 : cluster [ERR] Health check update: Possible
>> data damage: 1 pg inconsistent (PG_DAMAGED)
>>
>>
>> I have tried to follow a few instructions on the PG repair, including
>> removal of the 'broken' object .dir.default.80018061.2
>> from the primary osd, followed by the pg repair. After that didn't work, I've
>> done the same for the secondary osd. Still the same issue.
>>
>> Looking at the actual object on the file system, the file size is 0 for both
>> primary and secondary objects. The md5sum is the same too. The broken PG
>> belongs to the rado

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-19 Thread Brad Hubbard
Can you post the output of a pg query?

On Tue, Jun 19, 2018 at 11:44 PM, Andrei Mikhailovsky  wrote:
> A quick update on my issue. I have noticed that while I was trying to move
> the problem object on osds, the file attributes got lost on one of the osds,
> which I guess is why the error messages showed the missing attribute errors.
>
> I then copied the attributes metadata to the problematic object and
> restarted the osds in question. Following a pg repair I got a different
> error:
>
> 2018-06-19 13:51:05.846033 osd.21 osd.21 192.168.168.203:6828/24339 2 :
> cluster [ERR] 18.2 shard 21: soid 18:45f87722:::.dir.default.80018061.2:head
> omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi
> 18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910
> dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871
> alloc_hint [0 0 0])
> 2018-06-19 13:51:05.846042 osd.21 osd.21 192.168.168.203:6828/24339 3 :
> cluster [ERR] 18.2 shard 28: soid 18:45f87722:::.dir.default.80018061.2:head
> omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi
> 18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910
> dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871
> alloc_hint [0 0 0])
> 2018-06-19 13:51:05.846046 osd.21 osd.21 192.168.168.203:6828/24339 4 :
> cluster [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed
> to pick suitable auth object
> 2018-06-19 13:51:05.846118 osd.21 osd.21 192.168.168.203:6828/24339 5 :
> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_'
> attr
> 2018-06-19 13:51:05.846129 osd.21 osd.21 192.168.168.203:6828/24339 6 :
> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no
> 'snapset' attr
> 2018-06-19 13:51:09.810878 osd.21 osd.21 192.168.168.203:6828/24339 7 :
> cluster [ERR] 18.2 repair 4 errors, 0 fixed
>
> It mentions that there is an incorrect omap_digest. How do I go about
> fixing this?
>
> Cheers
>
> ____________
>
> From: "andrei" 
> To: "ceph-users" 
> Sent: Tuesday, 19 June, 2018 11:16:22
> Subject: [ceph-users] fixing unrepairable inconsistent PG
>
> Hello everyone
>
> I am having trouble repairing one inconsistent and stubborn PG. I get the
> following error in ceph.log:
>
>
>
> 2018-06-19 11:00:00.000225 mon.arh-ibstorage1-ib mon.0
> 192.168.168.201:6789/0 675 : cluster [ERR] overall HEALTH_ERR noout flag(s)
> set; 4 scrub errors; Possible data damage: 1 pg inconsistent; application
> not enabled on 4 pool(s)
> 2018-06-19 11:09:24.586392 mon.arh-ibstorage1-ib mon.0
> 192.168.168.201:6789/0 841 : cluster [ERR] Health check update: Possible
> data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
> 2018-06-19 11:09:27.139504 osd.21 osd.21 192.168.168.203:6828/4003 2 :
> cluster [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed
> to pick suitable object info
> 2018-06-19 11:09:27.139545 osd.21 osd.21 192.168.168.203:6828/4003 3 :
> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_'
> attr
> 2018-06-19 11:09:27.139550 osd.21 osd.21 192.168.168.203:6828/4003 4 :
> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no
> 'snapset' attr
>
> 2018-06-19 11:09:35.484402 osd.21 osd.21 192.168.168.203:6828/4003 5 :
> cluster [ERR] 18.2 repair 4 errors, 0 fixed
> 2018-06-19 11:09:40.601657 mon.arh-ibstorage1-ib mon.0
> 192.168.168.201:6789/0 844 : cluster [ERR] Health check update: Possible
> data damage: 1 pg inconsistent (PG_DAMAGED)
>
>
> I have tried to follow a few instructions on the PG repair, including
> removal of the 'broken' object .dir.default.80018061.2
> from the primary osd, followed by the pg repair. After that didn't work, I've
> done the same for the secondary osd. Still the same issue.
>
> Looking at the actual object on the file system, the file size is 0 for both
> primary and secondary objects. The md5sum is the same too. The broken PG
> belongs to the radosgw bucket called .rgw.buckets.index
>
> What else can I try to get the thing fixed?
>
> Cheers
>



-- 
Cheers,
Brad


Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-19 Thread Andrei Mikhailovsky
A quick update on my issue. I have noticed that while I was trying to move the 
problem object on osds, the file attributes got lost on one of the osds, which
I guess is why the error messages showed the missing attribute errors.

I then copied the attributes metadata to the problematic object and restarted 
the osds in question. Following a pg repair I got a different error: 

2018-06-19 13:51:05.846033 osd.21 osd.21 192.168.168.203:6828/24339 2 : cluster 
[ERR] 18.2 shard 21: soid 18:45f87722:::.dir.default.80018061.2:head 
omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi 
18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910 
dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871 
alloc_hint [0 0 0]) 
2018-06-19 13:51:05.846042 osd.21 osd.21 192.168.168.203:6828/24339 3 : cluster 
[ERR] 18.2 shard 28: soid 18:45f87722:::.dir.default.80018061.2:head 
omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi 
18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910 
dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871 
alloc_hint [0 0 0]) 
2018-06-19 13:51:05.846046 osd.21 osd.21 192.168.168.203:6828/24339 4 : cluster 
[ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed to pick 
suitable auth object 
2018-06-19 13:51:05.846118 osd.21 osd.21 192.168.168.203:6828/24339 5 : cluster 
[ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_' attr 
2018-06-19 13:51:05.846129 osd.21 osd.21 192.168.168.203:6828/24339 6 : cluster 
[ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no 'snapset' attr 
2018-06-19 13:51:09.810878 osd.21 osd.21 192.168.168.203:6828/24339 7 : cluster 
[ERR] 18.2 repair 4 errors, 0 fixed 

It mentions that there is an incorrect omap_digest. How do I go about fixing
this?
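
For reference, the omap of the index object itself can be inspected from the
client side with something like the commands below, though listomapkeys and
listomapvals only go through the acting primary, so on their own they will not
show a difference between the two replicas:

rados -p .rgw.buckets.index listomapkeys .dir.default.80018061.2 | wc -l
rados -p .rgw.buckets.index listomapvals .dir.default.80018061.2 | head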

Cheers 

> From: "andrei" 
> To: "ceph-users" 
> Sent: Tuesday, 19 June, 2018 11:16:22
> Subject: [ceph-users] fixing unrepairable inconsistent PG

> Hello everyone

> I am having trouble repairing one inconsistent and stubborn PG. I get the
> following error in ceph.log:

> 2018-06-19 11:00:00.000225 mon.arh-ibstorage1-ib mon.0 192.168.168.201:6789/0
> 675 : cluster [ERR] overall HEALTH_ERR noout flag(s) set; 4 scrub errors;
> Possible data damage: 1 pg inconsistent; application not enabled on 4 pool(s)
> 2018-06-19 11:09:24.586392 mon.arh-ibstorage1-ib mon.0 192.168.168.201:6789/0
> 841 : cluster [ERR] Health check update: Possible data damage: 1 pg
> inconsistent, 1 pg repair (PG_DAMAGED)
> 2018-06-19 11:09:27.139504 osd.21 osd.21 192.168.168.203:6828/4003 2 : cluster
> [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed to pick
> suitable object info
> 2018-06-19 11:09:27.139545 osd.21 osd.21 192.168.168.203:6828/4003 3 : cluster
> [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_' attr
> 2018-06-19 11:09:27.139550 osd.21 osd.21 192.168.168.203:6828/4003 4 : cluster
> [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no 'snapset' attr

> 2018-06-19 11:09:35.484402 osd.21 osd.21 192.168.168.203:6828/4003 5 : cluster
> [ERR] 18.2 repair 4 errors, 0 fixed
> 2018-06-19 11:09:40.601657 mon.arh-ibstorage1-ib mon.0 192.168.168.201:6789/0
> 844 : cluster [ERR] Health check update: Possible data damage: 1 pg
> inconsistent (PG_DAMAGED)

> I have tried to follow a few instructions on the PG repair, including removal 
> of
> the 'broken' object .dir.default.80018061.2
> from the primary osd, followed by the pg repair. After that didn't work, I've done
> the same for the secondary osd. Still the same issue.

> Looking at the actual object on the file system, the file size is 0 for both
> primary and secondary objects. The md5sum is the same too. The broken PG
> belongs to the radosgw bucket called .rgw.buckets.index

> What else can I try to get the thing fixed?

> Cheers



[ceph-users] fixing unrepairable inconsistent PG

2018-06-19 Thread Andrei Mikhailovsky
Hello everyone 

I am having trouble repairing one inconsistent and stubborn PG. I get the 
following error in ceph.log: 



2018-06-19 11:00:00.000225 mon.arh-ibstorage1-ib mon.0 192.168.168.201:6789/0 
675 : cluster [ERR] overall HEALTH_ERR noout flag(s) set; 4 scrub errors; 
Possible data damage: 1 pg inconsistent; application not enabled on 4 pool(s) 
2018-06-19 11:09:24.586392 mon.arh-ibstorage1-ib mon.0 192.168.168.201:6789/0 
841 : cluster [ERR] Health check update: Possible data damage: 1 pg 
inconsistent, 1 pg repair (PG_DAMAGED) 
2018-06-19 11:09:27.139504 osd.21 osd.21 192.168.168.203:6828/4003 2 : cluster 
[ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed to pick 
suitable object info 
2018-06-19 11:09:27.139545 osd.21 osd.21 192.168.168.203:6828/4003 3 : cluster 
[ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_' attr 
2018-06-19 11:09:27.139550 osd.21 osd.21 192.168.168.203:6828/4003 4 : cluster 
[ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no 'snapset' attr 

2018-06-19 11:09:35.484402 osd.21 osd.21 192.168.168.203:6828/4003 5 : cluster 
[ERR] 18.2 repair 4 errors, 0 fixed 
2018-06-19 11:09:40.601657 mon.arh-ibstorage1-ib mon.0 192.168.168.201:6789/0 
844 : cluster [ERR] Health check update: Possible data damage: 1 pg 
inconsistent (PG_DAMAGED) 


I have tried to follow a few instructions on the PG repair, including removal 
of the 'broken' object .dir.default.80018061.2 
from the primary osd, followed by the pg repair. After that didn't work, I've done
the same for the secondary osd. Still the same issue. 
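
For reference, removing an object copy from a single OSD is normally done with
ceph-objectstore-tool, roughly as below. This is a sketch rather than the exact
commands I used: the OSD has to be stopped first, filestore OSDs may also need
--journal-path, and the data path and object name must be double-checked before
running anything destructive.

systemctl stop ceph-osd@21
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 --pgid 18.2 \
    '.dir.default.80018061.2' remove
systemctl start ceph-osd@21
ceph pg repair 18.2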

Looking at the actual object on the file system, the file size is 0 for both 
primary and secondary objects. The md5sum is the same too. The broken PG
belongs to the radosgw bucket index pool, .rgw.buckets.index.
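
(I realise a zero file size is expected for bucket index objects, since the
index data lives in the object's omap and xattrs rather than in the file
contents, so matching md5sums of the on-disk files do not say much. One way to
compare what actually differs between the replicas, with the relevant OSDs
stopped, run on whichever host carries each OSD, and assuming the default
/var/lib/ceph/osd/ceph-N data paths, might be the sketch below.)

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
    '.dir.default.80018061.2' list-omap | md5sum
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-28 \
    '.dir.default.80018061.2' list-omap | md5sum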

What else can I try to get the thing fixed? 

Cheers 