On 8/18/19 6:43 PM, Brad Hubbard wrote:
That's this code:

  switch (alg) {
  case CRUSH_BUCKET_UNIFORM:
    size = sizeof(crush_bucket_uniform);
    break;
  case CRUSH_BUCKET_LIST:
    size = sizeof(crush_bucket_list);
    break;
  case CRUSH_BUCKET_TREE:
    size = sizeof(crush_bucket_tree);
    break;
  case CRUSH_BUCKET_STRAW:
    size = sizeof(crush_bucket_straw);
    break;
  case CRUSH_BUCKET_STRAW2:
    size = sizeof(crush_bucket_straw2);
    break;
  default:
    {
      char str[128];
      snprintf(str, sizeof(str), "unsupported bucket algorithm: %d", alg);
      throw buffer::malformed_input(str);
    }
  }

CRUSH_BUCKET_UNIFORM = 1
CRUSH_BUCKET_LIST = 2
CRUSH_BUCKET_TREE = 3
CRUSH_BUCKET_STRAW = 4
CRUSH_BUCKET_STRAW2 = 5

So valid values for bucket algorithms are 1 through 5, but, for whatever
reason, at least one of yours is being interpreted as -1.
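For anyone following along, here is a minimal standalone sketch of that check
(the CRUSH_BUCKET_* values are the real constants quoted above, but the
exception type and the surrounding decode machinery are simplified stand-ins,
not the actual Ceph code): any algorithm value outside 1 through 5, including
a -1, falls through to the default branch and aborts the decode.

  // Minimal sketch only -- illustrates why an alg of -1 fails to decode.
  #include <cstdio>
  #include <cstdint>
  #include <stdexcept>

  enum {
    CRUSH_BUCKET_UNIFORM = 1,
    CRUSH_BUCKET_LIST    = 2,
    CRUSH_BUCKET_TREE    = 3,
    CRUSH_BUCKET_STRAW   = 4,
    CRUSH_BUCKET_STRAW2  = 5,
  };

  static void check_bucket_alg(int32_t alg) {
    switch (alg) {
    case CRUSH_BUCKET_UNIFORM:
    case CRUSH_BUCKET_LIST:
    case CRUSH_BUCKET_TREE:
    case CRUSH_BUCKET_STRAW:
    case CRUSH_BUCKET_STRAW2:
      return;  // known algorithm; decode would proceed
    default: {
      // Anything else, including -1, lands here and aborts the decode.
      char str[128];
      std::snprintf(str, sizeof(str), "unsupported bucket algorithm: %d", alg);
      throw std::runtime_error(str);  // Ceph throws buffer::malformed_input here
    }
    }
  }

  int main() {
    try {
      check_bucket_alg(-1);  // e.g. a damaged/garbage value read back from the map
    } catch (const std::exception& e) {
      std::printf("decode fails: %s\n", e.what());
    }
    return 0;
  }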

This doesn't seem like something that would just happen spontaneously
with no changes to the cluster.

What recent changes have you made to the osdmap? What recent changes
have you made to the crushmap? Have you recently upgraded?


Brad,

There were no recent changes to the cluster or OSD config that I know of; the only person who would have made such changes is me. A few weeks ago we added 90 new HDD OSDs all at once, and the cluster was still backfilling onto those, but none of the pools on the now-affected OSDs were involved in that.

It seems that all of the SSDs are likely to be in this same state, but I haven't checked every single one.

I sent a complete image of one of the 1TB OSDs (compressed to about 41GB) via ceph-post-file and put the ID in the tracker issue I opened for this: https://tracker.ceph.com/issues/41240

I don't know if you or any other devs could use that for further insight, but I'm hopeful.

Thanks,

-Troy