Smalyshev added a comment.
> I don't think it makes sense to perpetuate a vertical scaling model

With regard to disk space, I don't think we have a choice, at least for the next FY. Even if we find an alternative backend that can be a drop-in replacement for Blazegraph (or somehow miraculously discover an easy way to shard the data without killing performance) - and my current estimate for this optimistic scenario is "very low probability" - implementing it is likely going to take time. In all that time we'd have to continue serving queries on the existing platform, which means each node has to store the whole query dataset.

We can discuss future sharding solutions, and that's totally fine, but realistically I personally do not see any scenario where such a solution is implemented and ready, to the point that we can migrate 100% of our Wikidata query load to it, within a year. Right now we are no further along this road than "we should think about trying some options", and we aren't even clear on what those options would be. It's a long road from here to a working sharding solution, and I don't see a way to get through it in a shorter time. If somebody sees a way, please tell me.

For me this means that, at least for the next FY, we need to plan for the current model - which implies hosting all the data on each server - to be what we have. Right now our data size is about 840G, and it keeps growing. I know Wikidata has growth projections, so I'll add them here later, but you can look at the DB size graph <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-90d&to=now&panelId=7&fullscreen> and draw your own conclusions. To me, it says we'll start running out of disk space before the end of this calendar year. So we need to find a solution for that - be it new disks, RAID0, or whatever magic there is.

So when we discuss the budget for the next FY, I think we should base it on the current model and the growth projections we have now being the facts on the ground.
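To make the "before the end of this calendar year" claim concrete, here is a minimal back-of-envelope sketch. The ~840G current size is from the comment above; the growth rate and per-node usable capacity below are placeholders I made up for illustration - the real slope should be read off the Grafana DB-size panel linked above.

```python
def months_until_full(current_gb: float, capacity_gb: float,
                      growth_gb_per_month: float) -> float:
    """Months of headroom left on a node, assuming linear growth."""
    if current_gb >= capacity_gb:
        return 0.0
    return (capacity_gb - current_gb) / growth_gb_per_month

# Assumed numbers: 840 GB today (from this comment), a hypothetical
# 1000 GB usable per node, and a placeholder growth of 25 GB/month.
print(round(months_until_full(840, 1000, 25), 1))
```

With those assumed numbers the headroom comes out to roughly half a year, which is the kind of arithmetic behind the "end of this calendar year" estimate; plug in the actual graph slope and disk sizes to get a real date.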
If we find a better model (and we do allocate time and resources to look for one), great, but we should not base our planning on miracles happening.

> Taking machines offline and rebuilding them from scratch just because a disk failed or because we need more storage is really something that we need to avoid

So far, in my experience, this has happened every time we did a capacity upgrade. Has this changed (regardless of the RAID0 move - I'm talking about the current situation), or do we still need to take the host offline when we add disk capacity? If it hasn't changed, is there a model (not involving changing the WDQS platform software, etc.) that would allow us to upgrade capacity without reimaging?

TASK DETAIL
https://phabricator.wikimedia.org/T221632

To: Gehel, Smalyshev
Cc: faidon, Smalyshev, Aklapper, Gehel, alaa_wmde, Nandana, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs