Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-01 Thread Haomai Wang
Hmm, could you please list the steps you ran, including how long the cluster
has existed and all relevant ops? I want to reproduce it.


On Mon, Sep 1, 2014 at 4:45 PM, Kenneth Waegeman 
wrote:

> Hi,
>
> I reinstalled the cluster with 0.84 and again ran rados bench
> against an EC-coded pool on keyvaluestore.
> Nothing crashed this time, but when I check the status:
>
>  health HEALTH_ERR 128 pgs inconsistent; 128 scrub errors; too few pgs
> per osd (15 < min 20)
>  monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,
> ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch
> 8, quorum 0,1,2 ceph001,ceph002,ceph003
>  osdmap e174: 78 osds: 78 up, 78 in
>   pgmap v147680: 1216 pgs, 3 pools, 14758 GB data, 3690 kobjects
> 1753 GB used, 129 TB / 131 TB avail
> 1088 active+clean
>  128 active+clean+inconsistent
>
> The 128 inconsistent pgs are ALL the pgs of the EC pool on the KV store
> (the others are on FileStore).
>
> The only thing I can see in the logs is that after the rados tests, it
> starts scrubbing, and for each KV pg I get something like this:
>
> 2014-08-31 11:14:09.050747 osd.11 10.141.8.180:6833/61098 4 : [ERR] 2.3s0
> scrub stat mismatch, got 28164/29291 objects, 0/0 clones, 28164/29291
> dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> 118128377856/122855358464 bytes.
>
> What could the problem be here?
> Thanks again!!
>
> Kenneth
>
>
> - Message from Haomai Wang  -
>Date: Tue, 26 Aug 2014 17:11:43 +0800
>From: Haomai Wang 
> Subject: Re: [ceph-users] ceph cluster inconsistency?
>  To: Kenneth Waegeman 
>  Cc: ceph-users@lists.ceph.com
>
>
>> Hmm, it looks like you hit this bug (http://tracker.ceph.com/issues/9223).
>>
>> Sorry for the late message; I forgot that this fix was already merged into 0.84.
>>
>> Thanks for your patience :-)
>>
>> On Tue, Aug 26, 2014 at 4:39 PM, Kenneth Waegeman
>>  wrote:
>>
>>>
>>> Hi,
>>>
>>> In the meantime I already tried upgrading the cluster to 0.84 to see
>>> if that made a difference, and it seems it does.
>>> I can't reproduce the crashing OSDs by doing a 'rados -p ecdata ls'
>>> anymore.
>>>
>>> But now the cluster detect it is inconsistent:
>>>
>>>   cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too few
>>> pgs
>>> per osd (4 < min 20); mon.ceph002 low disk space
>>>monmap e3: 3 mons at
>>> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,
>>> ceph003=10.141.8.182:6789/0},
>>> election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
>>>mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3
>>> up:standby
>>>osdmap e145384: 78 osds: 78 up, 78 in
>>> pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
>>>   1502 GB used, 129 TB / 131 TB avail
>>>279 active+clean
>>> 40 active+clean+inconsistent
>>>  1 active+clean+scrubbing+deep
>>>
>>>
>>> I tried to do ceph pg repair for all the inconsistent pgs:
>>>
>>>   cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub
>>> errors;
>>> too few pgs per osd (4 < min 20); mon.ceph002 low disk space
>>>monmap e3: 3 mons at
>>> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,
>>> ceph003=10.141.8.182:6789/0},
>>> election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
>>>mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3
>>> up:standby
>>>osdmap e146452: 78 osds: 78 up, 78 in
>>> pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
>>>   1503 GB used, 129 TB / 131 TB avail
>>>279 active+clean
>>> 39 active+clean+inconsistent
>>>  1 active+clean+scrubbing+deep
>>>  1 active+clean+scrubbing+deep+inconsistent+repair
>>>
>>> I let it recover through the night, but this morning the mons were all
>>> gone, with nothing to see in the log files. The OSDs were all still up!
>>>
>>> cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>  health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub
>>> errors;
>>> too few pgs per osd (4 < min 20)
>>>  monmap e7: 3 mons at
>>> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,
>>> ceph003=10.141.8.182:6789/0},
>>> election epoch 44, quorum 0,1,2 ceph001,ceph002,ceph003
>>>  mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3
>>> up:standby
>>>  osdmap e203410: 78 osds: 78 up, 78 in
>>>   pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
>>> 1547 GB used, 129 TB / 131 TB avail
>>>1 active+clean+scrubbing+deep+inconsistent+repair
>>>  284 active+clean
>>>   35 active+clean+inconsistent
>>>
>>> I have restarted the monitors now; I will let you know when I see
>>> something more.
>>>
>>

Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-01 Thread Kenneth Waegeman

Hi,


The cluster was installed with quattor, which uses ceph-deploy to
install the daemons, writes the config file and installs the crushmap.
I have 3 hosts, each with 12 disks; each disk has a large KV partition
(3.6T) for the ECdata pool and a small cache partition (50G) for the
cache pool.


I manually did this:

ceph osd pool create cache 1024 1024
ceph osd pool set cache size 2
ceph osd pool set cache min_size 1
ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd
ceph osd pool create ecdata 128 128 erasure profile11
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))

(But the previous time I already had the problem without the cache part.)
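
For reference, the resulting layout can be checked afterwards with the
standard ceph CLI (a quick sketch; pool and profile names as above):

ceph osd erasure-code-profile get profile11   # k=8 m=3, failure domain osd
ceph osd dump | grep -E 'pool|tier'           # cache tier / overlay wiring
ceph osd pool get cache hit_set_type          # bloom
ceph osd pool get cache target_max_bytes
ceph df                                       # per-pool usage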



Cluster live since 2014-08-29 15:34:16

Config file on host ceph001:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.143.8.0/24
filestore_xattr_use_omap = 1
fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
mon_cluster_log_to_syslog = 1
mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
mon_initial_members = ceph001, ceph002, ceph003
osd_crush_update_on_start = 0
osd_journal_size = 10240
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_pool_default_size = 3
public_network = 10.141.8.0/24

[osd.11]
osd_objectstore = keyvaluestore-dev

[osd.13]
osd_objectstore = keyvaluestore-dev

[osd.15]
osd_objectstore = keyvaluestore-dev

[osd.17]
osd_objectstore = keyvaluestore-dev

[osd.19]
osd_objectstore = keyvaluestore-dev

[osd.21]
osd_objectstore = keyvaluestore-dev

[osd.23]
osd_objectstore = keyvaluestore-dev

[osd.25]
osd_objectstore = keyvaluestore-dev

[osd.3]
osd_objectstore = keyvaluestore-dev

[osd.5]
osd_objectstore = keyvaluestore-dev

[osd.7]
osd_objectstore = keyvaluestore-dev

[osd.9]
osd_objectstore = keyvaluestore-dev
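
(As a sanity check, the effective backend of a running OSD can be read
back over its admin socket; a minimal example, assuming the default
socket location used by ceph-deploy:)

ceph daemon osd.11 config show | grep osd_objectstore   # should print keyvaluestore-dev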


OSDs:
# id    weight  type name               up/down reweight
-12 140.6   root default-cache
-9  46.87   host ceph001-cache
2   3.906   osd.2   up  1
4   3.906   osd.4   up  1
6   3.906   osd.6   up  1
8   3.906   osd.8   up  1
10  3.906   osd.10  up  1
12  3.906   osd.12  up  1
14  3.906   osd.14  up  1
16  3.906   osd.16  up  1
18  3.906   osd.18  up  1
20  3.906   osd.20  up  1
22  3.906   osd.22  up  1
24  3.906   osd.24  up  1
-10 46.87   host ceph002-cache
28  3.906   osd.28  up  1
30  3.906   osd.30  up  1
32  3.906   osd.32  up  1
34  3.906   osd.34  up  1
36  3.906   osd.36  up  1
38  3.906   osd.38  up  1
40  3.906   osd.40  up  1
42  3.906   osd.42  up  1
44  3.906   osd.44  up  1
46  3.906   osd.46  up  1
48  3.906   osd.48  up  1
50  3.906   osd.50  up  1
-11 46.87   host ceph003-cache
54  3.906   osd.54  up  1
56  3.906   osd.56  up  1
58  3.906   osd.58  up  1
60  3.906   osd.60  up  1
62  3.906   osd.62  up  1
64  3.906   osd.64  up  1
66  3.906   osd.66  up  1
68  3.906   osd.68  up  1
70  3.906   osd.70  up  1
72  3.906   osd.72  up  1
74  3.906   osd.74  up  1
76  3.906   osd.76  up  1
-8  140.6   root default-ec
-5  46.87   host ceph001-ec
3   3.906   osd.3   up  1
5   3.906   osd.5   up  1
7   3.906   osd.7   up  1
9   3.906   osd.9   up  1
11  3.906   osd.11  up  1
13  3.906   osd.13  up  1
15  3.906   osd.15  up  1
17  3.906   osd.17  up  1
19  3.906   osd.19  up  1
21  3.906   osd.21  up  1
23  3.906   osd.23  up  1
25  3.906   osd.25  up  1
-6  46.87   host ceph002-ec
29  3.906   osd.29  up  1
31  3.906   osd.31  up  1
33  3.906   osd.33  up  1
35  3.906 

Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-03 Thread Kenneth Waegeman
I can also reproduce it on a new, slightly different setup (also EC on
KV plus a cache tier) by running ceph pg scrub on a KV pg: that pg then
gets the 'inconsistent' status.
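
Roughly what I run to trigger it (a sketch only; the pool id '2' and pg
'2.3' below are placeholders for whatever 'ceph pg dump' shows for the
EC pool on the cluster):

ceph pg dump pgs_brief | grep '^2\.'     # list the pgs of the EC pool, pick one
ceph pg scrub 2.3                        # scrub the chosen KV pg
ceph health detail | grep inconsistent   # the pg now shows up as inconsistent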




- Message from Kenneth Waegeman  -
   Date: Mon, 01 Sep 2014 16:28:31 +0200
   From: Kenneth Waegeman 
Subject: Re: ceph cluster inconsistency keyvaluestore
 To: Haomai Wang 
 Cc: ceph-users@lists.ceph.com



Hi,


The cluster got installed with quattor, which uses ceph-deploy for  
installation of daemons, writes the config file and installs the  
crushmap.
I have 3 hosts, each 12 disks, having a large KV partition (3.6T)  
for the ECdata pool and a small cache partition (50G) for the cache


I manually did this:

ceph osd pool create cache 1024 1024
ceph osd pool set cache size 2
ceph osd pool set cache min_size 1
ceph osd erasure-code-profile set profile11 k=8 m=3  
ruleset-failure-domain=osd

ceph osd pool create ecdata 128 128 erasure profile11
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))

(But the previous time I had the problem already without the cache part)



Cluster live since 2014-08-29 15:34:16

Config file on host ceph001:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.143.8.0/24
filestore_xattr_use_omap = 1
fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
mon_cluster_log_to_syslog = 1
mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
mon_initial_members = ceph001, ceph002, ceph003
osd_crush_update_on_start = 0
osd_journal_size = 10240
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_pool_default_size = 3
public_network = 10.141.8.0/24

[osd.11]
osd_objectstore = keyvaluestore-dev

[osd.13]
osd_objectstore = keyvaluestore-dev

[osd.15]
osd_objectstore = keyvaluestore-dev

[osd.17]
osd_objectstore = keyvaluestore-dev

[osd.19]
osd_objectstore = keyvaluestore-dev

[osd.21]
osd_objectstore = keyvaluestore-dev

[osd.23]
osd_objectstore = keyvaluestore-dev

[osd.25]
osd_objectstore = keyvaluestore-dev

[osd.3]
osd_objectstore = keyvaluestore-dev

[osd.5]
osd_objectstore = keyvaluestore-dev

[osd.7]
osd_objectstore = keyvaluestore-dev

[osd.9]
osd_objectstore = keyvaluestore-dev


OSDs:
# id    weight  type name               up/down reweight
-12 140.6   root default-cache
-9  46.87   host ceph001-cache
2   3.906   osd.2   up  1
4   3.906   osd.4   up  1
6   3.906   osd.6   up  1
8   3.906   osd.8   up  1
10  3.906   osd.10  up  1
12  3.906   osd.12  up  1
14  3.906   osd.14  up  1
16  3.906   osd.16  up  1
18  3.906   osd.18  up  1
20  3.906   osd.20  up  1
22  3.906   osd.22  up  1
24  3.906   osd.24  up  1
-10 46.87   host ceph002-cache
28  3.906   osd.28  up  1
30  3.906   osd.30  up  1
32  3.906   osd.32  up  1
34  3.906   osd.34  up  1
36  3.906   osd.36  up  1
38  3.906   osd.38  up  1
40  3.906   osd.40  up  1
42  3.906   osd.42  up  1
44  3.906   osd.44  up  1
46  3.906   osd.46  up  1
48  3.906   osd.48  up  1
50  3.906   osd.50  up  1
-11 46.87   host ceph003-cache
54  3.906   osd.54  up  1
56  3.906   osd.56  up  1
58  3.906   osd.58  up  1
60  3.906   osd.60  up  1
62  3.906   osd.62  up  1
64  3.906   osd.64  up  1
66  3.906   osd.66  up  1
68  3.906   osd.68  up  1
70  3.906   osd.70  up  1
72  3.906   osd.72  up  1
74  3.906   osd.74  up  1
76  3.906   osd.76  up  1
-8  140.6   root default-ec
-5  46.87   host ceph001-ec
3   3.906   osd.3   up  1
5   3.906   osd.5   up  1
7   3.906   osd.7   up  1
9   3.906   osd.9   up  1
11  3.906   osd.11  up  1
13  3.906   osd.13  up  1
15  3.906   osd.15  up  1
17  3.906   osd.17  up  1

Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-06 Thread Haomai Wang
Sorry for the late reply, I'm back from a short vacation. I would
like to try it this weekend. Thanks for your patience :-)

On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
 wrote:
> I also can reproduce it on a new slightly different set up (also EC on KV
> and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
> 'inconsistent' status
>
>
>
> - Message from Kenneth Waegeman  -
>Date: Mon, 01 Sep 2014 16:28:31 +0200
>From: Kenneth Waegeman 
> Subject: Re: ceph cluster inconsistency keyvaluestore
>  To: Haomai Wang 
>  Cc: ceph-users@lists.ceph.com
>
>
>
>> Hi,
>>
>>
>> The cluster got installed with quattor, which uses ceph-deploy for
>> installation of daemons, writes the config file and installs the crushmap.
>> I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for the
>> ECdata pool and a small cache partition (50G) for the cache
>>
>> I manually did this:
>>
>> ceph osd pool create cache 1024 1024
>> ceph osd pool set cache size 2
>> ceph osd pool set cache min_size 1
>> ceph osd erasure-code-profile set profile11 k=8 m=3
>> ruleset-failure-domain=osd
>> ceph osd pool create ecdata 128 128 erasure profile11
>> ceph osd tier add ecdata cache
>> ceph osd tier cache-mode cache writeback
>> ceph osd tier set-overlay ecdata cache
>> ceph osd pool set cache hit_set_type bloom
>> ceph osd pool set cache hit_set_count 1
>> ceph osd pool set cache hit_set_period 3600
>> ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
>>
>> (But the previous time I had the problem already without the cache part)
>>
>>
>>
>> Cluster live since 2014-08-29 15:34:16
>>
>> Config file on host ceph001:
>>
>> [global]
>> auth_client_required = cephx
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> cluster_network = 10.143.8.0/24
>> filestore_xattr_use_omap = 1
>> fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>> mon_cluster_log_to_syslog = 1
>> mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
>> mon_initial_members = ceph001, ceph002, ceph003
>> osd_crush_update_on_start = 0
>> osd_journal_size = 10240
>> osd_pool_default_min_size = 2
>> osd_pool_default_pg_num = 512
>> osd_pool_default_pgp_num = 512
>> osd_pool_default_size = 3
>> public_network = 10.141.8.0/24
>>
>> [osd.11]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.13]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.15]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.17]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.19]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.21]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.23]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.25]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.3]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.5]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.7]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.9]
>> osd_objectstore = keyvaluestore-dev
>>
>>
>> OSDs:
>> # idweight  type name   up/down reweight
>> -12 140.6   root default-cache
>> -9  46.87   host ceph001-cache
>> 2   3.906   osd.2   up  1
>> 4   3.906   osd.4   up  1
>> 6   3.906   osd.6   up  1
>> 8   3.906   osd.8   up  1
>> 10  3.906   osd.10  up  1
>> 12  3.906   osd.12  up  1
>> 14  3.906   osd.14  up  1
>> 16  3.906   osd.16  up  1
>> 18  3.906   osd.18  up  1
>> 20  3.906   osd.20  up  1
>> 22  3.906   osd.22  up  1
>> 24  3.906   osd.24  up  1
>> -10 46.87   host ceph002-cache
>> 28  3.906   osd.28  up  1
>> 30  3.906   osd.30  up  1
>> 32  3.906   osd.32  up  1
>> 34  3.906   osd.34  up  1
>> 36  3.906   osd.36  up  1
>> 38  3.906   osd.38  up  1
>> 40  3.906   osd.40  up  1
>> 42  3.906   osd.42  up  1
>> 44  3.906   osd.44  up  1
>> 46  3.906   osd.46  up  1
>> 48  3.906   osd.48  up  1
>> 50  3.906   osd.50  up  1
>> -11 46.87   host ceph003-cache
>> 54  3.906   osd.54  up  1
>> 56  3.906   osd.56  up  1
>> 58  3.906   osd.58  up  1
>> 60  3.906   osd.60  up  1
>> 62  3.906   osd.62  up  1
>> 64  3.906   osd.64  up  1
>> 66  3.906   osd.66  up  1
>> 68  3.906   osd.68  up  1
>> 70  3.906   osd.70  up  1
>> 72  3.906   osd.72  up  1
>> 74  3.906 

Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-07 Thread Haomai Wang
I have found the root cause. It's a bug.

When a chunky scrub happens, it iterates over the whole PG's objects,
and each iteration scans only a few objects.

osd/PG.cc:3758
ret = get_pgbackend()-> objects_list_partial(
  start,
  cct->_conf->osd_scrub_chunk_min,
  cct->_conf->osd_scrub_chunk_max,
  0,
  &objects,
  &candidate_end);

candidate_end is the end of the object set and is used as the start
position for the next scrub chunk. But it gets truncated:

osd/PG.cc:3777
while (!boundary_found && objects.size() > 1) {
  hobject_t end = objects.back().get_boundary();
  objects.pop_back();

  if (objects.back().get_filestore_key() !=
end.get_filestore_key()) {
candidate_end = end;
boundary_found = true;
  }
}
The 'end' here is an hobject_t that contains only the "hash" field, and
it gets assigned to candidate_end. So for the next scrub chunk, an
hobject_t containing only the "hash" field is passed into
get_pgbackend()->objects_list_partial.

This gives incorrect results for the KeyValueStore backend, because it
uses strict key ordering for the "collection_list_partial" method. An
hobject_t that only contains the "hash" field is encoded as:

1%e79s0_head!972F1B5D!!none!!!!0!0

while the actual object is

1%e79s0_head!972F1B5D!!1!!!object-name!head

In other words, a key that contains only the "hash" field cannot be
used to look up a real object with the same "hash" field, because it
does not sort before it.

@sage The simple way is to modify the obj->key function, which changes
the storage format. Because it's an experimental backend, I would like
to provide an external format-conversion program to help users do the
migration. Is that OK?
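
To illustrate the ordering problem outside of Ceph (plain byte-wise
sorting of the two keys above, not Ceph code): the real object key
sorts BEFORE the hash-only boundary key, so using that boundary as the
start of the next chunk skips the objects sharing the same hash.

printf '%s\n' \
  '1%e79s0_head!972F1B5D!!none!!!!0!0' \
  '1%e79s0_head!972F1B5D!!1!!!object-name!head' | LC_ALL=C sort
# -> the "!1!!!object-name!head" key is listed first, the "!none!" key last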


On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
 wrote:
> I also can reproduce it on a new slightly different set up (also EC on KV
> and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
> 'inconsistent' status
>
>
>
> - Message from Kenneth Waegeman  -
>Date: Mon, 01 Sep 2014 16:28:31 +0200
>From: Kenneth Waegeman 
> Subject: Re: ceph cluster inconsistency keyvaluestore
>  To: Haomai Wang 
>  Cc: ceph-users@lists.ceph.com
>
>
>
>> Hi,
>>
>>
>> The cluster got installed with quattor, which uses ceph-deploy for
>> installation of daemons, writes the config file and installs the crushmap.
>> I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for the
>> ECdata pool and a small cache partition (50G) for the cache
>>
>> I manually did this:
>>
>> ceph osd pool create cache 1024 1024
>> ceph osd pool set cache size 2
>> ceph osd pool set cache min_size 1
>> ceph osd erasure-code-profile set profile11 k=8 m=3
>> ruleset-failure-domain=osd
>> ceph osd pool create ecdata 128 128 erasure profile11
>> ceph osd tier add ecdata cache
>> ceph osd tier cache-mode cache writeback
>> ceph osd tier set-overlay ecdata cache
>> ceph osd pool set cache hit_set_type bloom
>> ceph osd pool set cache hit_set_count 1
>> ceph osd pool set cache hit_set_period 3600
>> ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
>>
>> (But the previous time I had the problem already without the cache part)
>>
>>
>>
>> Cluster live since 2014-08-29 15:34:16
>>
>> Config file on host ceph001:
>>
>> [global]
>> auth_client_required = cephx
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> cluster_network = 10.143.8.0/24
>> filestore_xattr_use_omap = 1
>> fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>> mon_cluster_log_to_syslog = 1
>> mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
>> mon_initial_members = ceph001, ceph002, ceph003
>> osd_crush_update_on_start = 0
>> osd_journal_size = 10240
>> osd_pool_default_min_size = 2
>> osd_pool_default_pg_num = 512
>> osd_pool_default_pgp_num = 512
>> osd_pool_default_size = 3
>> public_network = 10.141.8.0/24
>>
>> [osd.11]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.13]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.15]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.17]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.19]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.21]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.23]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.25]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.3]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.5]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.7]
>> osd_objectstore = keyvaluestore-dev
>>
>> [osd.9]
>> osd_objectstore = keyvaluestore-dev
>>
>>
>> OSDs:
>> # idweight  type name   up/down reweight
>> -12 140.6   root default-cache
>> -9  46.87   host ceph001-cache
>> 2   3.906   osd.2   up  1
>> 4   3.906   osd.4   up  1
>> 6   3.906   osd.6   up  1
>> 8   3.906   osd.8   up  1
>> 10  3.906   osd.10  up  1
>> 12  3.906   osd.12  up  1
>> 14

Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-08 Thread Kenneth Waegeman


Thank you very much !

Is this problem then related to the weird sizes I see:
  pgmap v55220: 1216 pgs, 3 pools, 3406 GB data, 852 kobjects
418 GB used, 88130 GB / 88549 GB avail

A calculation with df indeed shows that only about 400 GB is used on
the disks, but the tests I ran should have generated 3.5 TB, as also
seen in rados df:


pool name   category KB  objects   clones   
   degraded  unfound   rdrd KB   wr
 wr KB
cache   -   59150443154660  
   0   0  1388365   5686734850  3665984
4709621763
ecdata  - 3512807425   8576200  
   0   0  1109938312332288   857621
3512807426


I thought it was related to the inconsistency?
Or could this be a sparse-objects thing? (But I don't seem to find
anything in the docs about that.)


Thanks again!

Kenneth



- Message from Haomai Wang  -
   Date: Sun, 7 Sep 2014 20:34:39 +0800
   From: Haomai Wang 
Subject: Re: ceph cluster inconsistency keyvaluestore
 To: Kenneth Waegeman 
 Cc: ceph-users@lists.ceph.com



I have found the root cause. It's a bug.

When chunky scrub happen, it will iterate the who pg's objects and
each iterator only a few objects will be scan.

osd/PG.cc:3758
ret = get_pgbackend()-> objects_list_partial(
  start,
  cct->_conf->osd_scrub_chunk_min,
  cct->_conf->osd_scrub_chunk_max,
  0,
  &objects,
  &candidate_end);

candidate_end is the end of object set and it's used to indicate the
next scrub process's start position. But it will be truncated:

osd/PG.cc:3777
while (!boundary_found && objects.size() > 1) {
  hobject_t end = objects.back().get_boundary();
  objects.pop_back();

  if (objects.back().get_filestore_key() !=
end.get_filestore_key()) {
candidate_end = end;
boundary_found = true;
  }
}
end which only contain "hash" field as hobject_t will be assign to
candidate_end.  So the next scrub process a hobject_t only contains
"hash" field will be passed in to get_pgbackend()->
objects_list_partial.

It will cause incorrect results for KeyValueStore backend. Because it
will use strict key ordering for "collection_list_paritial" method. A
hobject_t only contains "hash" field will be:

1%e79s0_head!972F1B5D!!none!!!!0!0

and the actual object is
1%e79s0_head!972F1B5D!!1!!!object-name!head

In other word, a object only contain "hash" field can't used by to
search a absolute object has the same "hash" field.

@sage The simply way is modify obj->key function which will change
storage format. Because it's a experiment backend I would like to
provide with a external format change program help users do it. Is it
OK?


On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
 wrote:

I also can reproduce it on a new slightly different set up (also EC on KV
and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
'inconsistent' status



- Message from Kenneth Waegeman  -
   Date: Mon, 01 Sep 2014 16:28:31 +0200
   From: Kenneth Waegeman 
Subject: Re: ceph cluster inconsistency keyvaluestore
 To: Haomai Wang 
 Cc: ceph-users@lists.ceph.com




Hi,


The cluster got installed with quattor, which uses ceph-deploy for
installation of daemons, writes the config file and installs the crushmap.
I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for the
ECdata pool and a small cache partition (50G) for the cache

I manually did this:

ceph osd pool create cache 1024 1024
ceph osd pool set cache size 2
ceph osd pool set cache min_size 1
ceph osd erasure-code-profile set profile11 k=8 m=3
ruleset-failure-domain=osd
ceph osd pool create ecdata 128 128 erasure profile11
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))

(But the previous time I had the problem already without the cache part)



Cluster live since 2014-08-29 15:34:16

Config file on host ceph001:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.143.8.0/24
filestore_xattr_use_omap = 1
fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
mon_cluster_log_to_syslog = 1
mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
mon_initial_members = ceph001, ceph002, ceph003
osd_crush_update_on_start = 0
osd_journal_size = 10240
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_pool_default_size = 3
public_network = 10.141.8.0/24

[osd.11]
osd_objectstore = keyvaluestore-dev

[osd.13]
osd_o

Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-08 Thread Haomai Wang
I'm not very sure; it's possible that keyvaluestore uses sparse
writes, which can make a big difference in Ceph's space statistics.
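
One way to check is to compare what the cluster reports with what is
actually on disk (a rough sketch, assuming the default ceph-deploy data
paths; osd.11 is just one of the KV OSDs):

rados df                              # logical data per pool
ceph df                               # raw usage as the OSDs report it
du -sh /var/lib/ceph/osd/ceph-11      # actual on-disk footprint of one KV OSD

If the objects written by the benchmark are largely sparse, the du
numbers will stay far below the logical pool size.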

On Mon, Sep 8, 2014 at 6:35 PM, Kenneth Waegeman
 wrote:
>
> Thank you very much !
>
> Is this problem then related to the weird sizes I see:
>   pgmap v55220: 1216 pgs, 3 pools, 3406 GB data, 852 kobjects
> 418 GB used, 88130 GB / 88549 GB avail
>
> a calculation with df shows indeed that there is about 400GB used on disks,
> but the tests I ran should indeed have generated 3,5 TB, as also seen in
> rados df:
>
> pool name   category KB  objects   clones
> degraded  unfound   rdrd KB   wrwr KB
> cache   -   59150443154660
> 0   0  1388365   5686734850  3665984   4709621763
> ecdata  - 3512807425   8576200
> 0   0  1109938312332288   857621   3512807426
>
> I thought it was related to the inconsistency?
> Or can this be a sparse objects thing? (But I don't seem to found anything
> in the docs about that)
>
> Thanks again!
>
> Kenneth
>
>
>
> - Message from Haomai Wang  -
>Date: Sun, 7 Sep 2014 20:34:39 +0800
>
>From: Haomai Wang 
> Subject: Re: ceph cluster inconsistency keyvaluestore
>  To: Kenneth Waegeman 
>  Cc: ceph-users@lists.ceph.com
>
>
>> I have found the root cause. It's a bug.
>>
>> When chunky scrub happen, it will iterate the who pg's objects and
>> each iterator only a few objects will be scan.
>>
>> osd/PG.cc:3758
>> ret = get_pgbackend()-> objects_list_partial(
>>   start,
>>   cct->_conf->osd_scrub_chunk_min,
>>   cct->_conf->osd_scrub_chunk_max,
>>   0,
>>   &objects,
>>   &candidate_end);
>>
>> candidate_end is the end of object set and it's used to indicate the
>> next scrub process's start position. But it will be truncated:
>>
>> osd/PG.cc:3777
>> while (!boundary_found && objects.size() > 1) {
>>   hobject_t end = objects.back().get_boundary();
>>   objects.pop_back();
>>
>>   if (objects.back().get_filestore_key() !=
>> end.get_filestore_key()) {
>> candidate_end = end;
>> boundary_found = true;
>>   }
>> }
>> end which only contain "hash" field as hobject_t will be assign to
>> candidate_end.  So the next scrub process a hobject_t only contains
>> "hash" field will be passed in to get_pgbackend()->
>> objects_list_partial.
>>
>> It will cause incorrect results for KeyValueStore backend. Because it
>> will use strict key ordering for "collection_list_paritial" method. A
>> hobject_t only contains "hash" field will be:
>>
>> 1%e79s0_head!972F1B5D!!none!!!!0!0
>>
>> and the actual object is
>> 1%e79s0_head!972F1B5D!!1!!!object-name!head
>>
>> In other word, a object only contain "hash" field can't used by to
>> search a absolute object has the same "hash" field.
>>
>> @sage The simply way is modify obj->key function which will change
>> storage format. Because it's a experiment backend I would like to
>> provide with a external format change program help users do it. Is it
>> OK?
>>
>>
>> On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
>>  wrote:
>>>
>>> I also can reproduce it on a new slightly different set up (also EC on KV
>>> and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
>>> 'inconsistent' status
>>>
>>>
>>>
>>> - Message from Kenneth Waegeman  -
>>>Date: Mon, 01 Sep 2014 16:28:31 +0200
>>>From: Kenneth Waegeman 
>>> Subject: Re: ceph cluster inconsistency keyvaluestore
>>>  To: Haomai Wang 
>>>  Cc: ceph-users@lists.ceph.com
>>>
>>>
>>>
 Hi,


 The cluster got installed with quattor, which uses ceph-deploy for
 installation of daemons, writes the config file and installs the
 crushmap.
 I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for
 the
 ECdata pool and a small cache partition (50G) for the cache

 I manually did this:

 ceph osd pool create cache 1024 1024
 ceph osd pool set cache size 2
 ceph osd pool set cache min_size 1
 ceph osd erasure-code-profile set profile11 k=8 m=3
 ruleset-failure-domain=osd
 ceph osd pool create ecdata 128 128 erasure profile11
 ceph osd tier add ecdata cache
 ceph osd tier cache-mode cache writeback
 ceph osd tier set-overlay ecdata cache
 ceph osd pool set cache hit_set_type bloom
 ceph osd pool set cache hit_set_count 1
 ceph osd pool set cache hit_set_period 3600
 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))

 (But the previous time I had the problem already without the cache part)



 Cluster live since 2014-08-29 15:34:16

 Config file on host ceph001:

 [global]
 auth_client_required = cephx
 

Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-08 Thread Sage Weil
On Sun, 7 Sep 2014, Haomai Wang wrote:
> I have found the root cause. It's a bug.
> 
> When chunky scrub happen, it will iterate the who pg's objects and
> each iterator only a few objects will be scan.
> 
> osd/PG.cc:3758
> ret = get_pgbackend()-> objects_list_partial(
>   start,
>   cct->_conf->osd_scrub_chunk_min,
>   cct->_conf->osd_scrub_chunk_max,
>   0,
>   &objects,
>   &candidate_end);
> 
> candidate_end is the end of object set and it's used to indicate the
> next scrub process's start position. But it will be truncated:
> 
> osd/PG.cc:3777
> while (!boundary_found && objects.size() > 1) {
>   hobject_t end = objects.back().get_boundary();
>   objects.pop_back();
> 
>   if (objects.back().get_filestore_key() !=
> end.get_filestore_key()) {
> candidate_end = end;
> boundary_found = true;
>   }
> }
> end which only contain "hash" field as hobject_t will be assign to
> candidate_end.  So the next scrub process a hobject_t only contains
> "hash" field will be passed in to get_pgbackend()->
> objects_list_partial.
> 
> It will cause incorrect results for KeyValueStore backend. Because it
> will use strict key ordering for "collection_list_paritial" method. A
> hobject_t only contains "hash" field will be:
> 
> 1%e79s0_head!972F1B5D!!none!!!!0!0
> 
> and the actual object is
> 1%e79s0_head!972F1B5D!!1!!!object-name!head
> 
> In other word, a object only contain "hash" field can't used by to
> search a absolute object has the same "hash" field.

You mean the problem is that the sort order is wrong and the hash-only 
hobject_t key doesn't sort before the other objects, right?

> @sage The simply way is modify obj->key function which will change
> storage format. Because it's a experiment backend I would like to
> provide with a external format change program help users do it. Is it
> OK?

Yeah, I think it's okay to just go ahead and make an incompatible change.

If it is easy to do an upgrade converter, it might be worthwhile, but this 
is an experimental backend so you are certainly not required to.  :)

sage



> 
> 
> On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
>  wrote:
> > I also can reproduce it on a new slightly different set up (also EC on KV
> > and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
> > 'inconsistent' status
> >
> >
> >
> > - Message from Kenneth Waegeman  -
> >Date: Mon, 01 Sep 2014 16:28:31 +0200
> >From: Kenneth Waegeman 
> > Subject: Re: ceph cluster inconsistency keyvaluestore
> >  To: Haomai Wang 
> >  Cc: ceph-users@lists.ceph.com
> >
> >
> >
> >> Hi,
> >>
> >>
> >> The cluster got installed with quattor, which uses ceph-deploy for
> >> installation of daemons, writes the config file and installs the crushmap.
> >> I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for the
> >> ECdata pool and a small cache partition (50G) for the cache
> >>
> >> I manually did this:
> >>
> >> ceph osd pool create cache 1024 1024
> >> ceph osd pool set cache size 2
> >> ceph osd pool set cache min_size 1
> >> ceph osd erasure-code-profile set profile11 k=8 m=3
> >> ruleset-failure-domain=osd
> >> ceph osd pool create ecdata 128 128 erasure profile11
> >> ceph osd tier add ecdata cache
> >> ceph osd tier cache-mode cache writeback
> >> ceph osd tier set-overlay ecdata cache
> >> ceph osd pool set cache hit_set_type bloom
> >> ceph osd pool set cache hit_set_count 1
> >> ceph osd pool set cache hit_set_period 3600
> >> ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
> >>
> >> (But the previous time I had the problem already without the cache part)
> >>
> >>
> >>
> >> Cluster live since 2014-08-29 15:34:16
> >>
> >> Config file on host ceph001:
> >>
> >> [global]
> >> auth_client_required = cephx
> >> auth_cluster_required = cephx
> >> auth_service_required = cephx
> >> cluster_network = 10.143.8.0/24
> >> filestore_xattr_use_omap = 1
> >> fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
> >> mon_cluster_log_to_syslog = 1
> >> mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
> >> mon_initial_members = ceph001, ceph002, ceph003
> >> osd_crush_update_on_start = 0
> >> osd_journal_size = 10240
> >> osd_pool_default_min_size = 2
> >> osd_pool_default_pg_num = 512
> >> osd_pool_default_pgp_num = 512
> >> osd_pool_default_size = 3
> >> public_network = 10.141.8.0/24
> >>
> >> [osd.11]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.13]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.15]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.17]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.19]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.21]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.23]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.25]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >