[ceph-users] ERROR: flush_read_list(): d->client_c->handle_data() returned -5

2016-11-23 Thread Riederer, Michael
Hello,

we have 4 Ceph radosgw instances behind an haproxy, running Ceph version 10.2.3.

ceph.conf:
[global]
osd_pool_default_pgp_num = 4096
auth_service_required = cephx
mon_initial_members = 
ceph-203-1-public,ceph-203-2-public,ceph-203-3-public,ceph-203-4-public,ceph-203-7-public
fsid = 69876022-f6fb-4eef-af47-5527cfa1e33a
cluster_network = 10.65.204.0/24
auth_supported = cephx
auth_cluster_required = cephx
mon_host = 
10.65.203.17:6789,10.65.203.18:6789,10.65.203.19:6789,10.65.203.95:6789,10.65.203.98:6789
auth_client_required = cephx
public_network = 10.65.203.0/24

[client.radosgw.ceph-203-rgw-1]
host = ceph-203-rgw-1
keyring = /etc/ceph/ceph.client.radosgw.ceph-203-rgw-1.keyring
rgw_frontends = civetweb port=80
rgw dns name = ceph-203-rgw-1.mm.br.de
rgw print continue = False
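
A note for debugging this: err=-5 is EIO, and in the flush_read_list()/handle_data() path it often just means that the downstream side (the client or the haproxy in front) closed the connection mid-transfer, so haproxy's "timeout client"/"timeout server" values are worth checking. Raising the gateway's log level also helps to see where the transfer breaks off; a minimal sketch, added to the section above and followed by a gateway restart (subsystem names as in Jewel):

debug rgw = 20
debug civetweb = 10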

Can someone help me find the cause of this ERROR:

2016-11-23 13:53:29.801313 7ff1487f0700  1 == starting new request 
req=0x7ff1487ea710 =
2016-11-23 13:53:29.802850 7ff1487f0700  1 == req done req=0x7ff1487ea710 
op status=0 http_status=200 ==
2016-11-23 13:53:29.802901 7ff1487f0700  1 civetweb: 0x7ff244002e90: 
10.65.163.49 - - [23/Nov/2016:13:53:29 +0100] "HEAD 
/mir-Live/3c14cda9-1f5c-49df-92d2-d8cc5ca03472_P.mp4 HTTP/1.1" 200 0 - 
aws-sdk-java/1.11.14 Linux/2.6.32-573.18.1.el6.x86_64 
OpenJDK_64-Bit_Server_VM/25.71-b15/1.8.0_71
2016-11-23 13:53:30.367801 7ff139fd3700  1 == starting new request 
req=0x7ff139fcd710 =
2016-11-23 13:53:30.382091 7ff139fd3700  0 ERROR: flush_read_list(): 
d->client_c->handle_data() returned -5
2016-11-23 13:53:30.382328 7ff139fd3700  0 WARNING: set_req_state_err err_no=5 
resorting to 500
2016-11-23 13:53:30.382389 7ff139fd3700  0 ERROR: s->cio->send_content_length() 
returned err=-5
2016-11-23 13:53:30.382394 7ff139fd3700  0 ERROR: s->cio->print() returned 
err=-5
2016-11-23 13:53:30.382396 7ff139fd3700  0 ERROR: STREAM_IO(s)->print() 
returned err=-5
2016-11-23 13:53:30.382414 7ff139fd3700  0 ERROR: 
STREAM_IO(s)->complete_header() returned err=-5
2016-11-23 13:53:30.382459 7ff139fd3700  1 == req done req=0x7ff139fcd710 
op status=-5 http_status=500 ==
2016-11-23 13:53:30.382541 7ff139fd3700  1 civetweb: 0x7ff2040008e0: 
10.65.163.49 - - [23/Nov/2016:13:53:30 +0100] "GET 
/mir-Live/3c14cda9-1f5c-49df-92d2-d8cc5ca03472_2.mp4 HTTP/1.1" 500 0 - 
aws-sdk-java/1.11.14 Linux/2.6.32-573.18.1.el6.x86_64 
OpenJDK_64-Bit_Server_VM/25.71-b15/1.8.0_71

Regards
Michael
--
Bayerischer Rundfunk; Rundfunkplatz 1; 80335 München
Telefon: +49 89 590001; E-Mail: i...@br.de; Website: http://www.BR.de


Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-22 Thread Riederer, Michael
Hi Craig,

many thanks for your help. I decided to reinstall ceph.

Regards,
Mike

From: Craig Lewis [cle...@centraldesktop.com]
Sent: Tuesday, August 19, 2014 22:24
To: Riederer, Michael
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean


On Tue, Aug 19, 2014 at 1:22 AM, Riederer, Michael 
<michael.riede...@br.de> wrote:


root@ceph-admin-storage:~# ceph pg force_create_pg 2.587
pg 2.587 now creating, ok
root@ceph-admin-storage:~# ceph pg 2.587 query
...
  "probing_osds": [
"5",
"8",
"10",
"13",
"20",
"35",
"46",
"56"],
...

All OSDs mentioned under "probing_osds" are up and in, but the cluster cannot 
create the pg, nor scrub, deep-scrub, or repair it.


My experience is that as long as you have down_osds_we_would_probe in the pg 
query, ceph pg force_create_pg won't do anything. ceph osd lost didn't help. 
The PGs would go into the creating state, then revert to incomplete. The only 
way I was able to get them to stay in the creating state was to re-create all 
of the OSD IDs listed in down_osds_we_would_probe.
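
A quick way to check for that (pg 2.587 as the example id): dump the query output and look for the key:

# ceph pg 2.587 query > pg-2.587.json
# grep -A 10 '"down_osds_we_would_probe"' pg-2.587.json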

Even then, it wasn't deterministic. I issued the ceph pg force_create_pg, and 
it actually took effect sometime in the middle of the night, after an unrelated 
OSD went down and up.

It was a very frustrating experience.


Just to be sure that I did it the right way:
# stop ceph-osd id=x
# ceph osd out x
# ceph osd crush remove osd.x
# ceph auth del osd.x
# ceph osd rm x



My procedure was the same as yours, with the addition of a ceph osd lost x 
before ceph osd rm.
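
Spelled out with the ceph osd lost step included (x stands for the OSD id, same Upstart syntax as above):

# stop ceph-osd id=x
# ceph osd out x
# ceph osd crush remove osd.x
# ceph auth del osd.x
# ceph osd lost x --yes-i-really-mean-it
# ceph osd rm x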
--
Bayerischer Rundfunk; Rundfunkplatz 1; 80335 München
Telefon: +49 89 590001; E-Mail: i...@br.de; Website: http://www.BR.de


Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-19 Thread Riederer, Michael
Hi Craig,

I forgot to send the output of "ceph osd tree":
root@ceph-admin-storage:~# ceph osd tree
# id    weight  type name                       up/down reweight
-1      88.24   root default
-8      44.12       room room0
-2      15.92           host ceph-1-storage
4       1.82                osd.4               up      1
9       1.36                osd.9               up      1
6       3.64                osd.6               up      1
5       1.82                osd.5               up      1
7       3.64                osd.7               up      1
8       3.64                osd.8               up      1
-3      17.28           host ceph-2-storage
14      3.64                osd.14              up      1
18      1.36                osd.18              up      1
19      1.36                osd.19              up      1
15      3.64                osd.15              up      1
1       3.64                osd.1               up      1
12      3.64                osd.12              up      1
-4      10.92           host ceph-5-storage
43      1.82                osd.43              up      1
44      1.82                osd.44              up      1
45      1.82                osd.45              up      1
46      1.82                osd.46              up      1
47      1.82                osd.47              up      1
48      1.82                osd.48              up      1
-9      44.12       room room1
-5      15.92           host ceph-3-storage
24      1.82                osd.24              up      1
25      1.82                osd.25              up      1
29      1.36                osd.29              up      1
10      3.64                osd.10              up      1
13      3.64                osd.13              up      1
20      3.64                osd.20              up      1
-6      17.28           host ceph-4-storage
34      3.64                osd.34              up      1
38      1.36                osd.38              up      1
39      1.36                osd.39              up      1
16      3.64                osd.16              up      1
35      3.64                osd.35              up      1
17      3.64                osd.17              up      1
-7      10.92           host ceph-6-storage
53      1.82                osd.53              up      1
54      1.82                osd.54              up      1
55      1.82                osd.55              up      1
56      1.82                osd.56              up      1
57      1.82                osd.57              up      1
58      1.82                osd.58              up      1

OSD 8, 13 and 20 are up and in.

root@ceph-admin-storage:~# ceph pg force_create_pg 2.587
pg 2.587 now creating, ok
root@ceph-admin-storage:~# ceph pg 2.587 query
...
  "probing_osds": [
"5",
"8",
"10",
"13",
"20",
"35",
"46",
"56"],
...

All OSDs mentioned under "probing_osds" are up and in, but the cluster cannot 
create the pg, nor scrub, deep-scrub, or repair it.

Sorry, I did not mean to say that I rule out the slow OSDs as the problem. They 
may have contributed to the cluster no longer being healthy, because the slow 
OSDs have gone in and out under high system load.

Just to be sure that I did it the right way:
# stop ceph-osd id=x
# ceph osd out x
# ceph osd crush remove osd.x
# ceph auth del osd.x
# ceph osd rm x

I'm pretty sure that the cluster is stable now. Stability problems should not 
be the reason why I can no longer bring the cluster back to a healthy state.

The new output of ceph pg x query is available at: 
http://server.riederer.org/ceph-user/

Regards,
Mike

________
From: Craig Lewis [cle...@centraldesktop.com]
Sent: Monday, August 18, 2014 19:22
To: Riederer, Michael
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean

I take it that OSD 8, 13, and 20 are some of the stopped OSDs.

I wasn't able to get ceph to execute ceph pg force_create until the OSDs in 
[recovery_state][probing_osds] from ceph pg query were online.  I ended up 
reformatting most of them, and re-adding them to the cluster.

What's wrong with those OSDs?  How slow are they?  If the problem is just that 
they're really slow, try starting them up, and manually marking them UP and 
OUT.  That way Ceph will read from them, but not write to them.  If they won't 
stay up, I'd replace them, and get the replacements back in the cluster.  I'd 
leave the replacements UP and OUT.  You can rebalance later, after the cluster 
is healthy again.
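
For reference, keeping a flaky OSD readable without writing new data to it comes down to starting the daemon and marking it OUT (osd.8 as a stand-in id):

# start ceph-osd id=8
# ceph osd out 8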



I've never seen the replay state, I'm not sure what to do with that.



On Mon, Aug 18, 2014 at 5:05 AM, Riederer, Michael 
<michael.riede...@br.de> wrote:
What has changed in the cluster compared to my first mail: the cluster was able 
to repair one pg, but now has a different pg in status "active+clean+replay".

root@ceph-admin-storage:~# ceph pg dump | grep "^2.92"
dumped all in format plain
2.92    0 0 0 0 0 0 0   active+clean    2014-08-18 10:37:20.962858      0'0     36830:577       [8,13]  8       [8,13]  8       0'0     2014-08-18 10:37:20.962728      13503'1390419   2014-08-14 10:37:12.49

Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-18 Thread Riederer, Michael
What has changed in the cluster compared to my first mail: the cluster was able 
to repair one pg, but now has a different pg in status "active+clean+replay".

root@ceph-admin-storage:~# ceph pg dump | grep "^2.92"
dumped all in format plain
2.92    0 0 0 0 0 0 0   active+clean    2014-08-18 10:37:20.962858      0'0     36830:577       [8,13]  8       [8,13]  8       0'0     2014-08-18 10:37:20.962728      13503'1390419   2014-08-14 10:37:12.497492
root@ceph-admin-storage:~# ceph pg dump | grep replay
dumped all in format plain
0.49a   0 0 0 0 0 0 0   active+clean+replay     2014-08-18 13:09:15.317221      0'0     36830:1704      [12,10] 12      [12,10] 12      0'0     2014-08-18 13:09:15.317131      0'0     2014-08-18 13:09:15.317131

Mike


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Riederer, 
Michael [michael.riede...@br.de]
Sent: Monday, August 18, 2014 13:40
To: Craig Lewis
Cc: ceph-users@lists.ceph.com; Karan Singh
Subject: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean

Hi Craig,

I have brought the cluster into a stable condition. All slow OSDs are no longer 
in the cluster. All remaining 36 OSDs are writable at more than 100 MB/sec (dd 
if=/dev/zero of=testfile-2.txt bs=1024 count=4096000). No Ceph client is 
connected to the cluster. The Ceph nodes are idle. The state now looks as 
follows:

root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 3 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 
pgs stuck unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5018, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e36830: 36 osds: 36 up, 36 in
  pgmap v10907190: 6144 pgs, 3 pools, 10997 GB data, 2760 kobjects
22051 GB used, 68206 GB / 90258 GB avail
6140 active+clean
   3 down+incomplete
   1 active+clean+replay

root@ceph-admin-storage:~# ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck 
unclean
pg 2.c1 is stuck inactive since forever, current state down+incomplete, last 
acting [13,8]
pg 2.e3 is stuck inactive since forever, current state down+incomplete, last 
acting [20,8]
pg 2.587 is stuck inactive since forever, current state down+incomplete, last 
acting [13,8]
pg 2.c1 is stuck unclean since forever, current state down+incomplete, last 
acting [13,8]
pg 2.e3 is stuck unclean since forever, current state down+incomplete, last 
acting [20,8]
pg 2.587 is stuck unclean since forever, current state down+incomplete, last 
acting [13,8]
pg 2.587 is down+incomplete, acting [13,8]
pg 2.e3 is down+incomplete, acting [20,8]
pg 2.c1 is down+incomplete, acting [13,8]

I have tried the following:

root@ceph-admin-storage:~# ceph pg scrub 2.587
instructing pg 2.587 on osd.13 to scrub
root@ceph-admin-storage:~# ceph pg scrub 2.e3
instructing pg 2.e3 on osd.20 to scrub
root@ceph-admin-storage:~# ceph pg scrub 2.c1
instructing pg 2.c1 on osd.13 to scrub

root@ceph-admin-storage:~# ceph pg deep-scrub 2.587
instructing pg 2.587 on osd.13 to deep-scrub
root@ceph-admin-storage:~# ceph pg deep-scrub 2.e3
instructing pg 2.e3 on osd.20 to deep-scrub
root@ceph-admin-storage:~# ceph pg deep-scrub 2.c1
instructing pg 2.c1 on osd.13 to deep-scrub

root@ceph-admin-storage:~# ceph pg repair 2.587
instructing pg 2.587 on osd.13 to repair
root@ceph-admin-storage:~# ceph pg repair 2.e3
instructing pg 2.e3 on osd.20 to repair
root@ceph-admin-storage:~# ceph pg repair 2.c1
instructing pg 2.c1 on osd.13 to repair

In the monitor logfiles (ceph-mon.ceph-1/2/3-storage.log) I see the pg scrub, 
pg deep-scrub and pg repair commands, but I do not see anything in ceph.log and 
nothing in the ceph-osd.13/20/8.log.
(2014-08-18 13:24:49.337954 7f24ac111700  0 mon.ceph-1-storage@0(leader) e2 
handle_command mon_command({"prefix": "pg repair", "pgid": "2.587"} v 0) v1)
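
One way to see whether the OSDs ever receive those scrub/repair requests is to raise their debug level temporarily and watch the log (osd.13 as the example; the levels are just a suggestion):

# ceph tell osd.13 injectargs '--debug-osd 10 --debug-ms 1'
# tail -f /var/log/ceph/ceph-osd.13.log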

Is it possible to repair the ceph-cluster?

root@ceph-admin-storage:~# ceph pg force_create_pg 2.587
pg 2.587 now creating, ok

But nothing happens; the pg does not get created.

root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 2 pgs down; 2 pgs incomplete; 3 pgs stuck inactive; 3 
pgs stuck unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5018, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e36830: 36 osds: 36 up, 36 in
  pgmap v10907191: 6144 pgs, 3 pools, 10997 GB data, 2760 kobjects
22051 GB used, 68206 GB / 90258 GB avail

Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-18 Thread Riederer, Michael
Hi Craig,

I have brought the cluster into a stable condition. All slow OSDs are no longer 
in the cluster. All remaining 36 OSDs are writable at more than 100 MB/sec (dd 
if=/dev/zero of=testfile-2.txt bs=1024 count=4096000). No Ceph client is 
connected to the cluster. The Ceph nodes are idle. The state now looks as 
follows:

root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 3 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 
pgs stuck unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5018, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e36830: 36 osds: 36 up, 36 in
  pgmap v10907190: 6144 pgs, 3 pools, 10997 GB data, 2760 kobjects
22051 GB used, 68206 GB / 90258 GB avail
6140 active+clean
   3 down+incomplete
   1 active+clean+replay

root@ceph-admin-storage:~# ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck 
unclean
pg 2.c1 is stuck inactive since forever, current state down+incomplete, last 
acting [13,8]
pg 2.e3 is stuck inactive since forever, current state down+incomplete, last 
acting [20,8]
pg 2.587 is stuck inactive since forever, current state down+incomplete, last 
acting [13,8]
pg 2.c1 is stuck unclean since forever, current state down+incomplete, last 
acting [13,8]
pg 2.e3 is stuck unclean since forever, current state down+incomplete, last 
acting [20,8]
pg 2.587 is stuck unclean since forever, current state down+incomplete, last 
acting [13,8]
pg 2.587 is down+incomplete, acting [13,8]
pg 2.e3 is down+incomplete, acting [20,8]
pg 2.c1 is down+incomplete, acting [13,8]

I have tried the following:

root@ceph-admin-storage:~# ceph pg scrub 2.587
instructing pg 2.587 on osd.13 to scrub
root@ceph-admin-storage:~# ceph pg scrub 2.e3
instructing pg 2.e3 on osd.20 to scrub
root@ceph-admin-storage:~# ceph pg scrub 2.c1
instructing pg 2.c1 on osd.13 to scrub

root@ceph-admin-storage:~# ceph pg deep-scrub 2.587
instructing pg 2.587 on osd.13 to deep-scrub
root@ceph-admin-storage:~# ceph pg deep-scrub 2.e3
instructing pg 2.e3 on osd.20 to deep-scrub
root@ceph-admin-storage:~# ceph pg deep-scrub 2.c1
instructing pg 2.c1 on osd.13 to deep-scrub

root@ceph-admin-storage:~# ceph pg repair 2.587
instructing pg 2.587 on osd.13 to repair
root@ceph-admin-storage:~# ceph pg repair 2.e3
instructing pg 2.e3 on osd.20 to repair
root@ceph-admin-storage:~# ceph pg repair 2.c1
instructing pg 2.c1 on osd.13 to repair

In the monitor logfiles (ceph-mon.ceph-1/2/3-storage.log) I see the pg scrub, 
pg deep-scrub and pg repair commands, but I do not see anything in ceph.log and 
nothing in the ceph-osd.13/20/8.log.
(2014-08-18 13:24:49.337954 7f24ac111700  0 mon.ceph-1-storage@0(leader) e2 
handle_command mon_command({"prefix": "pg repair", "pgid": "2.587"} v 0) v1)

Is it possible to repair the ceph-cluster?

root@ceph-admin-storage:~# ceph pg force_create_pg 2.587
pg 2.587 now creating, ok

But nothing happens; the pg does not get created.

root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 2 pgs down; 2 pgs incomplete; 3 pgs stuck inactive; 3 
pgs stuck unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5018, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e36830: 36 osds: 36 up, 36 in
  pgmap v10907191: 6144 pgs, 3 pools, 10997 GB data, 2760 kobjects
22051 GB used, 68206 GB / 90258 GB avail
   1 creating
6140 active+clean
   2 down+incomplete
   1 active+clean+replay
root@ceph-admin-storage:~# ceph health detail
HEALTH_WARN 2 pgs down; 2 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck 
unclean
pg 2.c1 is stuck inactive since forever, current state down+incomplete, last 
acting [13,8]
pg 2.e3 is stuck inactive since forever, current state down+incomplete, last 
acting [20,8]
pg 2.587 is stuck inactive since forever, current state creating, last acting []
pg 2.c1 is stuck unclean since forever, current state down+incomplete, last 
acting [13,8]
pg 2.e3 is stuck unclean since forever, current state down+incomplete, last 
acting [20,8]
pg 2.587 is stuck unclean since forever, current state creating, last acting []
pg 2.e3 is down+incomplete, acting [20,8]
pg 2.c1 is down+incomplete, acting [13,8]

What can I do to get rid of the "incomplete" or "creating" pg?

Regards,
Mike



From: Craig Lewis [cle...@centraldesktop.com]
Sent: Thursday, August 14, 2014 19:56
To: Riederer, Michael
Cc: Karan Singh; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEAL

Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-14 Thread Riederer, Michael
Hi Craig,

Yes, we have stability problems. The cluster is definitely not suitable for a 
production environment; I will not describe the details here. I want to get to 
know Ceph, and that is possible with this test cluster. Some OSDs are very slow, 
less than 15 MB/sec writable. The load on the Ceph nodes also rises to over 30 
when an OSD is removed and a reorganization of the data becomes necessary. When 
the load is very high (over 30) I have seen exactly what you describe: OSDs go 
down and out and come back up and in.
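
One way to keep that flapping from compounding while the load is high is to set the noout flag for the duration of the maintenance, roughly:

# ceph osd set noout
(remove/replace the slow OSDs, wait for the load to drop)
# ceph osd unset noout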

OK. I'll try to remove the slow OSDs and then scrub and deep-scrub the PGs.

Many thanks for your help.

Regards,
Mike


From: Craig Lewis [cle...@centraldesktop.com]
Sent: Wednesday, August 13, 2014 19:48
To: Riederer, Michael
Cc: Karan Singh; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean

Yes, ceph pg  query, not dump.  Sorry about that.

Are you having problems with OSD stability?  There's a lot of history in the 
[recovery_state][past_intervals]. That's normal when OSDs go down, and out, and 
come back up and in. You have a lot of history there. You might even be getting 
into the point that you have so much failover history, the OSDs can't process 
it all before they hit the suicide timeout.

[recovery_state][probing_osds] lists a lot of OSDs that have recently owned 
these PGs. If the OSDs are crashing frequently, you need to get that under 
control before proceeding.

Once the OSDs are stable, I think Ceph just needs to scrub and deep-scrub those 
PGs.


Until Ceph clears out the [recovery_state][probing_osds] section in the pg 
query, it's not going to do anything.  ceph osd lost hears you, but doesn't 
trust you.  Ceph won't do anything until it's actually checked those OSDs 
itself.  Scrubbing and Deep scrubbing should convince it.

Once that [recovery_state][probing_osds] section is gone, you should see the 
[recovery_state][past_intervals] section shrink or disappear. I don't have 
either section in my pg query. Once that happens, your ceph pg repair or ceph 
pg force_create_pg should finally have some effect.  You may or may not need to 
re-issue those commands.
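
In practice that check can be as simple as re-running the query and grepping for the two sections (pg 2.587 again as the example):

# ceph pg 2.587 query | grep -c '"probing_osds"'
# ceph pg 2.587 query | grep -c '"past_intervals"'

Once both counts drop to 0, re-issuing the repair or force_create_pg is worth a try.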




On Tue, Aug 12, 2014 at 9:32 PM, Riederer, Michael 
<michael.riede...@br.de> wrote:
Hi Craig,

# ceph pg 2.587 query
# ceph pg 2.c1 query
# ceph pg 2.92 query
# ceph pg 2.e3 query

Please download the output form here:
http://server.riederer.org/ceph-user/

#


It is not possible to map a rbd:

# rbd map testshareone --pool rbd --name client.admin
rbd: add failed: (5) Input/output error

I found that: http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/11405
# ceph osd getcrushmap -o crushmap.bin
got crush map from osdmap epoch 3741
# crushtool -i crushmap.bin --set-chooseleaf_vary_r 0 -o crushmap-new.bin
# ceph osd setcrushmap -i crushmap-new.bin
set crush map

The cluster then had some work to do. Now it looks a bit different.

It is still not possible to map a rbd.
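
When a kernel rbd map fails with EIO like this, the reason usually shows up in the kernel log on the client right after the attempt; a quick check:

# dmesg | tail -n 20

A "feature set mismatch" line there typically points at CRUSH tunables or features newer than the client kernel supports.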

root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck 
unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5010, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e34206: 55 osds: 55 up, 55 in
  pgmap v10838368: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
22078 GB used, 79932 GB / 102010 GB avail
6140 active+clean
   4 incomplete

root@ceph-admin-storage:~# ceph health detail
HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
pg 2.92 is stuck inactive since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck inactive since forever, current state incomplete, last acting 
[13,8]
pg 2.e3 is stuck inactive since forever, current state incomplete, last acting 
[20,8]
pg 2.587 is stuck inactive since forever, current state incomplete, last acting 
[13,8]

pg 2.92 is stuck unclean since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck unclean since forever, current state incomplete, last acting 
[13,8]
pg 2.e3 is stuck unclean since forever, current state incomplete, last acting 
[20,8]
pg 2.587 is stuck unclean since forever, current state incomplete, last acting 
[13,8]
pg 2.587 is incomplete, acting [13,8]
pg 2.e3 is incomplete, acting [20,8]
pg 2.c1 is incomplete, acting [13,8]

pg 2.92 is incomplete, acting [8,13]

###

After updating to firefly, I did the following:

# ceph health detail
HEALTH_WARN crush map has legacy tunables

Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-12 Thread Riederer, Michael
Hi Craig,

# ceph pg 2.587 query
# ceph pg 2.c1 query
# ceph pg 2.92 query
# ceph pg 2.e3 query

Please download the output form here:
http://server.riederer.org/ceph-user/

#

It is not possible to map a rbd:

# rbd map testshareone --pool rbd --name client.admin
rbd: add failed: (5) Input/output error

I found that: http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/11405
# ceph osd getcrushmap -o crushmap.bin
got crush map from osdmap epoch 3741
# crushtool -i crushmap.bin --set-chooseleaf_vary_r 0 -o crushmap-new.bin
# ceph osd setcrushmap -i crushmap-new.bin
set crush map

The cluster then had some work to do. Now it looks a bit different.

It is still not possible to map a rbd.

root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck 
unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5010, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e34206: 55 osds: 55 up, 55 in
  pgmap v10838368: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
22078 GB used, 79932 GB / 102010 GB avail
6140 active+clean
   4 incomplete
root@ceph-admin-storage:~# ceph health detail
HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
pg 2.92 is stuck inactive since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck inactive since forever, current state incomplete, last acting 
[13,8]
pg 2.e3 is stuck inactive since forever, current state incomplete, last acting 
[20,8]
pg 2.587 is stuck inactive since forever, current state incomplete, last acting 
[13,8]
pg 2.92 is stuck unclean since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck unclean since forever, current state incomplete, last acting 
[13,8]
pg 2.e3 is stuck unclean since forever, current state incomplete, last acting 
[20,8]
pg 2.587 is stuck unclean since forever, current state incomplete, last acting 
[13,8]
pg 2.587 is incomplete, acting [13,8]
pg 2.e3 is incomplete, acting [20,8]
pg 2.c1 is incomplete, acting [13,8]
pg 2.92 is incomplete, acting [8,13]

###

After updating to firefly, I did the following:

# ceph health detail
HEALTH_WARN crush map has legacy tunables
crush map has legacy tunables; see 
http://ceph.com/docs/master/rados/operations/crush-map/#tunables

# ceph osd crush tunables optimal
adjusted tunables profile to optimal
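
If old kernel clients still need to map RBD images, the tunables profile can also be dialed back to something they support ("legacy" being the most conservative); a sketch, profile name to be chosen per the oldest client kernel:

# ceph osd crush tunables bobtail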

Mike

From: Craig Lewis [cle...@centraldesktop.com]
Sent: Tuesday, August 12, 2014 20:02
To: Riederer, Michael
Cc: Karan Singh; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean

For the incomplete PGs, can you give me the output of
ceph pg  dump

I'm interested in the recovery_state key of that JSON data.



On Tue, Aug 12, 2014 at 5:29 AM, Riederer, Michael 
<michael.riede...@br.de> wrote:
Sorry, but I think that does not help me. I forgot to mention something about 
the operating system:

root@ceph-1-storage:~# dpkg -l | grep libleveldb1
ii  libleveldb1   1.12.0-1precise.ceph  fast 
key-value storage library
root@ceph-1-storage:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.5 LTS
Release:        12.04
Codename:       precise
root@ceph-1-storage:~# uname -a
Linux ceph-1-storage 3.5.0-52-generic #79~precise1-Ubuntu SMP Fri Jul 4 
21:03:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

libleveldb1 is newer than the mentioned version 1.9.0-1~bpo70+1.

All Ceph nodes are IBM x3650 servers with Intel Xeon CPUs at 2.00 GHz and 8 GB 
RAM; OK, all very old, about eight years, but they are still running.

Mike




From: Karan Singh [karan.si...@csc.fi]
Sent: Tuesday, August 12, 2014 13:00

To: Riederer, Michael
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean

I am not sure if this helps, but have a look:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10078.html

- Karan -

On 12 Aug 2014, at 12:04, Riederer, Michael 
<michael.riede...@br.de> wrote:

Hi Karan,

root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# ceph osd 
getcrushmap -o crushmap.bin
got crush map from osdmap epoch 30748
root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# crushtool -d 
crushmap.bin -o crushmap.txt
root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# cat 
crushmap.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable 

Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-12 Thread Riederer, Michael
Sorry, but I think that does not help me. I forgot to mention something about 
the operating system:

root@ceph-1-storage:~# dpkg -l | grep libleveldb1
ii  libleveldb1   1.12.0-1precise.ceph  fast 
key-value storage library
root@ceph-1-storage:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.5 LTS
Release:        12.04
Codename:       precise
root@ceph-1-storage:~# uname -a
Linux ceph-1-storage 3.5.0-52-generic #79~precise1-Ubuntu SMP Fri Jul 4 
21:03:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

libleveldb1 is newer than the mentioned version 1.9.0-1~bpo70+1.

All Ceph nodes are IBM x3650 servers with Intel Xeon CPUs at 2.00 GHz and 8 GB 
RAM; OK, all very old, about eight years, but they are still running.

Mike




From: Karan Singh [karan.si...@csc.fi]
Sent: Tuesday, August 12, 2014 13:00
To: Riederer, Michael
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean

I am not sure if this helps, but have a look:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10078.html

- Karan -

On 12 Aug 2014, at 12:04, Riederer, Michael 
<michael.riede...@br.de> wrote:

Hi Karan,

root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# ceph osd 
getcrushmap -o crushmap.bin
got crush map from osdmap epoch 30748
root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# crushtool -d 
crushmap.bin -o crushmap.txt
root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# cat 
crushmap.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 device21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 device27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
device 40 device40
device 41 device41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host ceph-1-storage {
id -2        # do not change unnecessarily
# weight 19.330
alg straw
hash 0       # rjenkins1
item osd.0 weight 0.910
item osd.2 weight 0.910
item osd.3 weight 0.910
item osd.4 weight 1.820
item osd.9 weight 1.360
item osd.11 weight 0.680
item osd.6 weight 3.640
item osd.5 weight 1.820
item osd.7 weight 3.640
item osd.8 weight 3.640
}
host ceph-2-storage {
id -3        # do not change unnecessarily
# weight 20.000
alg straw
hash 0       # rjenkins1
item osd.14 weight 3.640
item osd.18 weight 1.360
item osd.19 weight 1.360
item osd.15 weight 3.640
item osd.1 weight 3.640
item osd.12 weight 3.640
item osd.22 weight 0.680
item osd.23 weight 0.680
item osd.26 weight 0.680
item osd.36 weight 0.680
}
host ceph-5-storage {
id -4        # do not change unnecessarily
# weight 11.730
alg straw
hash 0       # rjenkins1
item osd.32 weight 0.270
item osd.37 weight 0.270
item osd.42 weight 0.270
item osd.43 weight 1.820
item osd.44 weight 1.820
item osd.45 weight 1.820
item osd.46 weight 1.820
item osd.47 weight 1.820
item osd.48 weight 1.820
}
room room0 {
id -8        # do not change unnecessarily
# weight 51.060
alg straw
hash 0       # rjenkins1
item ceph-1-storage weight 19.330
item ceph-2-storage weight 20.000
item ceph-5-storage weight 11.730
}
host ceph-3-storage {
id -5        # do not change unnecessarily
# weight 15.920
alg straw
hash 0       # rjenkins1
item osd.24 weight 1.820
item osd.25 weight 1.820
item osd.29 weight 1.360
item osd.10 weight 3.640
item osd.13 weight 3.640
item osd.20 weight 3.640
}
host ceph-4-storage {
id -6        # do not change unnecessarily
# weight 20.000
alg straw
hash 0       # rjenkins1
item osd.34 weight 3.640
item osd.38 weight 1.360
item osd.39 weight 1.360
item osd.16 weight 3.640
item osd.30 weight 0.680
item osd.35 weight 3.640
item 

Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-12 Thread Riederer, Michael
 {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

root@ceph-admin-storage:~# ceph osd dump | grep -i pool
pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 2048 pgp_num 2048 last_change 4623 crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash 
rjenkins pg_num 2048 pgp_num 2048 last_change 4627 stripe_width 0
pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins 
pg_num 2048 pgp_num 2048 last_change 4632 stripe_width 0
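
To double-check which CRUSH rule the affected pool actually uses (the stuck PGs are all 2.xxx, i.e. pool 2, 'rbd'), something like this works:

root@ceph-admin-storage:~# ceph osd pool get rbd crush_ruleset
root@ceph-admin-storage:~# ceph osd crush rule dump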


Mike

From: Karan Singh [karan.si...@csc.fi]
Sent: Tuesday, August 12, 2014 10:35
To: Riederer, Michael
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean

Can you provide your cluster’s "ceph osd dump | grep -i pool" and crush map 
output?


- Karan -

On 12 Aug 2014, at 10:40, Riederer, Michael 
<michael.riede...@br.de> wrote:

Hi all,

How do I get my Ceph Cluster back to a healthy state?

root@ceph-admin-storage:~# ceph -v
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck 
unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5010, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e30748: 55 osds: 55 up, 55 in
  pgmap v10800465: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
22077 GB used, 79933 GB / 102010 GB avail
6138 active+clean
   4 incomplete
   2 active+clean+replay
root@ceph-admin-storage:~# ceph health detail
HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
pg 2.92 is stuck inactive since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck inactive since forever, current state incomplete, last acting 
[13,7]
pg 2.e3 is stuck inactive since forever, current state incomplete, last acting 
[20,7]
pg 2.587 is stuck inactive since forever, current state incomplete, last acting 
[13,5]
pg 2.92 is stuck unclean since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck unclean since forever, current state incomplete, last acting 
[13,7]
pg 2.e3 is stuck unclean since forever, current state incomplete, last acting 
[20,7]
pg 2.587 is stuck unclean since forever, current state incomplete, last acting 
[13,5]
pg 2.587 is incomplete, acting [13,5]
pg 2.e3 is incomplete, acting [20,7]
pg 2.c1 is incomplete, acting [13,7]
pg 2.92 is incomplete, acting [8,13]
root@ceph-admin-storage:~# ceph pg dump_stuck inactive
ok
pg_stat objects mip degr unf bytes log disklog  state   state_stamp     v       reported        up      up_primary      acting  acting_primary  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
2.92    0 0 0 0 0 0 0   incomplete      2014-08-08 12:39:20.204592      0'0     30748:7729      [8,13]  8       [8,13]  8       13503'1390419   2014-06-26 01:57:48.727625      13503'1390419   2014-06-22 01:57:30.114186
2.c1    0 0 0 0 0 0 0   incomplete      2014-08-08 12:39:18.846542      0'0     30748:7117      [13,7]  13      [13,7]  13      13503'1687017   2014-06-26 20:52:51.249864      13503'1687017   2014-06-22 14:24:22.633554
2.e3    0 0 0 0 0 0 0   incomplete      2014-08-08 12:39:29.311552      0'0     30748:8027      [20,7]  20      [20,7]  20      13503'1398727   2014-06-26 07:03:25.899254      13503'1398727   2014-06-21 07:02:31.393053
2.587   0 0 0 0 0 0 0   incomplete      2014-08-08 12:39:19.715724      0'0     30748:7060      [13,5]  13      [13,5]  13      13646'1542934   2014-06-26 07:48:42.089935      13646'1542934   2014-06-22 07:46:20.363695
root@ceph-admin-storage:~# ceph osd tree
# id    weight  type name                       up/down reweight
-1      99.7    root default
-8      51.06       room room0
-2      19.33           host ceph-1-storage
0       0.91                osd.0               up      1
2       0.91                osd.2               up      1
3       0.91                osd.3               up      1
4       1.82                osd.4               up      1
9       1.36                osd.9               up      1
11      0.68                osd.11              up      1
6       3.64                osd.6               up      1
5       1.82                osd.5               up      1
7       3.64                osd.7               up      1
8       3.64                osd.8               up      1
-3      20              host ceph-2-storage
14      3.64                osd.14              up      1
18      1.36                osd.18              up      1
19      1.36                osd.19              up      1

[ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-12 Thread Riederer, Michael
Hi all,

How do I get my Ceph Cluster back to a healthy state?

root@ceph-admin-storage:~# ceph -v
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck 
unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5010, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e30748: 55 osds: 55 up, 55 in
  pgmap v10800465: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
22077 GB used, 79933 GB / 102010 GB avail
6138 active+clean
   4 incomplete
   2 active+clean+replay
root@ceph-admin-storage:~# ceph health detail
HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
pg 2.92 is stuck inactive since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck inactive since forever, current state incomplete, last acting 
[13,7]
pg 2.e3 is stuck inactive since forever, current state incomplete, last acting 
[20,7]
pg 2.587 is stuck inactive since forever, current state incomplete, last acting 
[13,5]
pg 2.92 is stuck unclean since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck unclean since forever, current state incomplete, last acting 
[13,7]
pg 2.e3 is stuck unclean since forever, current state incomplete, last acting 
[20,7]
pg 2.587 is stuck unclean since forever, current state incomplete, last acting 
[13,5]
pg 2.587 is incomplete, acting [13,5]
pg 2.e3 is incomplete, acting [20,7]
pg 2.c1 is incomplete, acting [13,7]
pg 2.92 is incomplete, acting [8,13]
root@ceph-admin-storage:~# ceph pg dump_stuck inactive
ok
pg_stat objects mip degr unf bytes log disklog  state   state_stamp     v       reported        up      up_primary      acting  acting_primary  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
2.92    0 0 0 0 0 0 0   incomplete      2014-08-08 12:39:20.204592      0'0     30748:7729      [8,13]  8       [8,13]  8       13503'1390419   2014-06-26 01:57:48.727625      13503'1390419   2014-06-22 01:57:30.114186
2.c1    0 0 0 0 0 0 0   incomplete      2014-08-08 12:39:18.846542      0'0     30748:7117      [13,7]  13      [13,7]  13      13503'1687017   2014-06-26 20:52:51.249864      13503'1687017   2014-06-22 14:24:22.633554
2.e3    0 0 0 0 0 0 0   incomplete      2014-08-08 12:39:29.311552      0'0     30748:8027      [20,7]  20      [20,7]  20      13503'1398727   2014-06-26 07:03:25.899254      13503'1398727   2014-06-21 07:02:31.393053
2.587   0 0 0 0 0 0 0   incomplete      2014-08-08 12:39:19.715724      0'0     30748:7060      [13,5]  13      [13,5]  13      13646'1542934   2014-06-26 07:48:42.089935      13646'1542934   2014-06-22 07:46:20.363695
root@ceph-admin-storage:~# ceph osd tree
# id    weight  type name                       up/down reweight
-1      99.7    root default
-8      51.06       room room0
-2      19.33           host ceph-1-storage
0       0.91                osd.0               up      1
2       0.91                osd.2               up      1
3       0.91                osd.3               up      1
4       1.82                osd.4               up      1
9       1.36                osd.9               up      1
11      0.68                osd.11              up      1
6       3.64                osd.6               up      1
5       1.82                osd.5               up      1
7       3.64                osd.7               up      1
8       3.64                osd.8               up      1
-3      20              host ceph-2-storage
14      3.64                osd.14              up      1
18      1.36                osd.18              up      1
19      1.36                osd.19              up      1
15      3.64                osd.15              up      1
1       3.64                osd.1               up      1
12      3.64                osd.12              up      1
22      0.68                osd.22              up      1
23      0.68                osd.23              up      1
26      0.68                osd.26              up      1
36      0.68                osd.36              up      1
-4      11.73           host ceph-5-storage
32      0.27                osd.32              up      1
37      0.27                osd.37              up      1
42      0.27                osd.42              up      1
43      1.82                osd.43              up      1
44      1.82                osd.44              up      1
45      1.82                osd.45              up      1
46      1.82                osd.46              up      1
47      1.82                osd.47              up      1
48      1.82                osd.48              up      1
-9      48.64       room room1
-5      15.92           host ceph-3-storage
24      1.82                osd.24              up      1
25      1.82                osd.25              up      1
29      1.36                osd.29              up      1
10      3.64                osd.10              up      1
13      3.64                osd.13              up      1
20      3.64                osd.20              up      1
-6      20              host ceph-4-storage
34      3.64                osd.34              up      1
38      1.36                osd.38              up      1