My lowest level (other than OSD) is 'disktype' (based on the crushmaps at http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ ) since I have SSDs and HDDs on the same host.
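In case it helps to picture it, here is the rough shape in the decompiled crushmap. This is heavily trimmed, with made-up names and weights (my real map is in the dpaste link quoted further down), but it shows the idea: a custom 'disktype' bucket type between osd and host, and the EC rule choosing leaves of that type.

    # custom bucket type between osd and host (root/rack buckets omitted here)
    type 0 osd
    type 1 disktype
    type 2 host
    type 3 rack
    type 4 root

    # one bucket per media type per host (ssd bucket omitted, same shape)
    disktype hobbit01-hdd {
            id -10
            alg straw
            hash 0  # rjenkins1
            item osd.0 weight 3.640
            item osd.1 weight 3.640
    }

    host hobbit01 {
            id -2
            alg straw
            hash 0  # rjenkins1
            item hobbit01-hdd weight 7.280
            item hobbit01-ssd weight 0.800
    }

    # EC rule generated from the profile with ruleset-failure-domain=disktype
    rule ec44pool {
            ruleset 2
            type erasure
            min_size 3
            max_size 20
            step set_chooseleaf_tries 5
            step take default
            step chooseleaf indep 0 type disktype
            step emit
    }

So for the EC pool the failure domain ends up being one of those per-host, per-media buckets rather than a host or a rack.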
I just made that change (deleted the pool, deleted the profile, deleted the crush ruleset), then re-created using ruleset-failure-domain=disktype. Very similar results:

health HEALTH_WARN 3 pgs degraded; 3 pgs stuck unclean; 3 pgs undersized

'ceph pg dump_stuck' looks very similar to the last one I posted.

On Wed, Mar 4, 2015 at 2:48 PM, Don Doerner <don.doer...@quantum.com> wrote:

> Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller, i.e. maybe host? That, at least, would allow you to debug the situation…
>
> -don-
>
> *From:* Kyle Hutson [mailto:kylehut...@ksu.edu]
> *Sent:* 04 March, 2015 12:43
> *To:* Don Doerner
> *Cc:* Ceph Users
> *Subject:* Re: [ceph-users] New EC pool undersized
>
> It wouldn't let me simply change the pg_num, giving
>
> Error EEXIST: specified pg_num 2048 <= current 8192
>
> But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile'
>
> ...and the result is quite similar: 'ceph status' is now
>
> cluster 196e5eb8-d6a7-4435-907e-ea028e946923
> health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized
> monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14
> osdmap e412: 144 osds: 144 up, 144 in
> pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects
> 90590 MB used, 640 TB / 640 TB avail
> 4 active+undersized+degraded
> 6140 active+clean
>
> 'ceph pg dump_stuck' results in
>
> ok
> pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
> 2.296 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911
> 2.69c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747
> 2.36d 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480
> 2.5f7 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695
>
> I do have questions for you, even at this point, though.
>
> 1) Where did you find the formula (14400/(k+m))?
>
> 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs?
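(Partly answering my own question 2 while I'm here: my guess, and it is only a guess rather than something Don said, is that the 14400 is simply the usual ~100 PGs per OSD guideline times my 144 OSDs. If that's right, the same arithmetic for the eventual production size works out as:)

    (144 * 100) / (4 + 4) = 1800  -> round up to pg_num 2048   (today, 144 OSDs)
    (384 * 100) / (4 + 4) = 4800  -> round up to pg_num 8192   (production, 384 OSDs)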
>
> On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner <don.doer...@quantum.com> wrote:
>
> Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048.
>
> -don-
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner
> *Sent:* 04 March, 2015 12:14
> *To:* Kyle Hutson; Ceph Users
> *Subject:* Re: [ceph-users] New EC pool undersized
>
> In this case, that number means that there is not an OSD that can be assigned. What's your k, m from your erasure-coded pool? You'll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2…
>
> -don-
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Kyle Hutson
> *Sent:* 04 March, 2015 12:06
> *To:* Ceph Users
> *Subject:* [ceph-users] New EC pool undersized
>
> Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it now looks like https://dpaste.de/OLEa
>
> I currently have 144 OSDs on 8 nodes.
>
> After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy.
>
> So, now I'm trying to play with an erasure-coded pool. I did:
>
> ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack
> ceph osd pool create ec44pool 8192 8192 erasure ec44profile
>
> After settling for a bit, 'ceph status' gives
>
> cluster 196e5eb8-d6a7-4435-907e-ea028e946923
> health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck unclean; 7 pgs stuck undersized; 7 pgs undersized
> monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14
> osdmap e409: 144 osds: 144 up, 144 in
> pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects
> 90598 MB used, 640 TB / 640 TB avail
> 7 active+undersized+degraded
> 12281 active+clean
>
> So to troubleshoot the undersized pgs, I issued 'ceph pg dump_stuck'
>
> ok
> pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
> 1.d77 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:57.502849 0'0 408:12 [15,95,58,73,52,31,116,2147483647] 15 [15,95,58,73,52,31,116,2147483647] 15 0'0 2015-03-04 11:33:42.100752 0'0 2015-03-04 11:33:42.100752
> 1.10fa 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:29.362554 0'0 408:12 [23,12,99,114,132,53,56,2147483647] 23 [23,12,99,114,132,53,56,2147483647] 23 0'0 2015-03-04 11:33:42.168571 0'0 2015-03-04 11:33:42.168571
> 1.1271 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:48.795742 0'0 408:12 [135,112,69,4,22,95,2147483647,83] 135 [135,112,69,4,22,95,2147483647,83] 135 0'0 2015-03-04 11:33:42.139555 0'0 2015-03-04 11:33:42.139555
> 1.2b5 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:32.189738 0'0 408:12 [11,115,139,19,76,52,94,2147483647] 11 [11,115,139,19,76,52,94,2147483647] 11 0'0 2015-03-04 11:33:42.079673 0'0 2015-03-04 11:33:42.079673
> 1.7ae 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:26.848344 0'0 408:12 [27,5,132,119,94,56,52,2147483647] 27 [27,5,132,119,94,56,52,2147483647] 27 0'0 2015-03-04 11:33:42.109832 0'0 2015-03-04 11:33:42.109832
> 1.1a97 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:25.457454 0'0 408:12 [20,53,14,54,102,118,2147483647,72] 20 [20,53,14,54,102,118,2147483647,72] 20 0'0 2015-03-04 11:33:42.833850 0'0 2015-03-04 11:33:42.833850
> 1.10a6 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:30.059936 0'0 408:12 [136,22,4,2147483647,72,52,101,55] 136 [136,22,4,2147483647,72,52,101,55] 136 0'0 2015-03-04 11:33:42.125871 0'0 2015-03-04 11:33:42.125871
>
> All of these 'up' and 'acting' sets contain a number (2147483647) that is way out of line from what I would expect.
>
> Thoughts?
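P.S. For anyone who finds this thread later: 2147483647 is just 2^31 - 1, the placeholder CRUSH uses when it cannot find an OSD for a slot, which is exactly what Don describes above. A generic way to reproduce that offline (standard crushtool usage, not something specific to this thread) is to test the rule against the live map:

    # dump the in-use crushmap and make a human-readable copy to find the ruleset number
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # ask CRUSH for 8 OSDs per PG using the EC pool's rule ('--rule 2' is just an example;
    # use the ruleset id from crushmap.txt); mappings that come back short are listed
    crushtool -i crushmap.bin --test --rule 2 --num-rep 8 --show-bad-mappings

Each line that --show-bad-mappings prints is an input where CRUSH found fewer than the requested 8 OSDs, i.e. exactly the PGs that show up as undersized above.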