Smalyshev added a comment.

  > I don't think it makes sense to perpetuate a vertical scaling model
  
  With regard to disk space, I don't think we have a choice, at least for the 
next FY. Even if we find an alternative backend that can be a drop-in 
replacement for Blazegraph (or somehow miraculously discover an easy way to 
shard the data without killing performance) - and my current estimate for this 
optimistic scenario is "very low probability" - implementing it will likely 
take time. And all that time we'd have to keep serving queries on the existing 
platform, which means each node has to store the whole query dataset.
  
  We can discuss future sharding solutions, and that's totally fine, but 
realistically I do not see any scenario where such a solution is implemented 
and ready to the point that we can migrate 100% of our Wikidata query load to 
it within a year. Right now we are no further along this road than "we should 
think about trying some options", and we aren't even clear on what those 
options would be. It's a long road from here to a working sharding solution, 
and I don't see a way to get through it in a shorter time. If somebody sees a 
way, please tell me. For me this means that at least for the next FY we need to 
plan for the current model - which implies hosting all the data on each server 
- to be what we have.
  
  Right now our data size is about 840G, and it keeps growing. I know Wikidata 
has growth projections, so I'll add them here later, but you can look at the DB 
size graph 
<https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-90d&to=now&panelId=7&fullscreen>
 and draw your own conclusions. To me, this says we'll start running out of 
disk space before the end of this calendar year. So we need to find a solution 
for that - be it new disks, RAID0, or whatever magic there is.
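  
  To make the "before the end of this calendar year" estimate concrete, here is 
a rough back-of-the-envelope sketch in Python. The growth rate and usable 
capacity below are illustrative assumptions of mine, not numbers taken from the 
graph - plug in the real figures from Grafana to get the actual runway:

    # Rough disk-runway estimate for a single WDQS host (illustrative only).
    current_gb = 840            # approximate journal size today
    growth_gb_per_month = 30    # ASSUMED average growth; read the real slope off Grafana
    usable_gb = 1050            # ASSUMED usable per-host capacity for the journal

    months_left = (usable_gb - current_gb) / growth_gb_per_month
    print(f"Approximate runway: {months_left:.1f} months")
    # With these assumed numbers this prints ~7 months; the real answer
    # depends entirely on the inputs you plug in.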
  
  So when we discuss the budget for the next FY, I think we should base it on 
the current model and the growth projections we have now as the facts on the 
ground. If we find a better model (and we do allocate time and resources to 
look for one), great, but we should not base our planning on miracles 
happening.
  
  > Taking machines offline and rebuilding them from scratch just because a 
disk failed or because we need more storage is really something that we need to 
avoid
  
  So far, in my experience, this has happened every time we did a capacity 
upgrade. Has this changed (regardless of the RAID0 move, I'm talking about the 
current situation), or do we still need to take the host offline when we add 
disk capacity? If it hasn't changed, is there a model (not involving changing 
the WDQS platform software, etc.) that would allow us to upgrade capacity 
without reimaging?

TASK DETAIL
  https://phabricator.wikimedia.org/T221632
