Smalyshev added a comment.

  We've been around this topic a number of times, so I'll write a summary of where we're at so far. I'm sorry it's going to be long; there are a bunch of issues at play. Also, if after reading this you think it's utter nonsense and I'm missing an obvious solution, please feel free to explain it.
  
  Why are we using non-caching URLs?
  
  Because we want to have the latest data for the item. An item can be edited many times in short bursts (bots are especially known for this, but people do it all the time too). This is peculiar to Wikidata - Wikipedia articles are usually edited in big chunks, but on Wikidata each property/value change is usually a separate edit, which means there can be dozens of edits in a relatively short period.
  
  If we use a static URL and we get a change for Q1234, the data for one of the edits gets stuck in the cache as "data for Q1234", and we have no way of getting the most recent data until the cache expires. This is bad (more on that later).
  
  If we use a URL keyed by revision number, then if we have 20 edits in a row we'd have to download the RDF data for the page 20 times instead of just once (or maybe twice). This is somewhat mitigated by the batch aggregation we do, but our batches are not that big, so a big edit burst completely kills performance (and edit bursts are exactly where we need every last bit of performance).
  
  > can't updaters just be ok with up to 1H of stale data and not cache bust at 
all?
  
  So, in general, we can go one of two ways:
  A. Use revision-based URLs (described above) - these can be cached forever, since they never change.
  B. Use a general Qid-based URL without a revision marker. A long cache for this would be very bad, for the following reasons:
  
  1. People expect to see the data they edit on Wikidata. If somebody edits a value and then has to wait an hour for it to show up on WDQS, they would be quite upset. We can have somewhat stale data even now, but an hour-long delay is rare, and when it happens, people do complain.
  
  2. The updater is event-driven, so if it gets an update for Q1234 revision X, it should be able to load data for Q1234 that is at least as new as revision X. If, due to the cache, it loads any older data, that data is stuck in the database forever unless there's a new update, since nothing will cause it to re-check Q1234 again (see the sketch after this list).
  
  3. Data in Wikidata is highly interconnected. Unlike Wikipedia articles, which link to each other but are largely consumed independently, most Wikidata queries involve multiple items that link to each other. Caching means that each of these items will be seen by WDQS as being in the state it was in at some random moment during the past hour (note that these moments can also differ between servers, due to cache expirations that happen in between server updates) - with those moments differing from item to item. That means you basically can't reliably run any query involving data edited in the past hour, as the results can be completely nonsensical - some items would be seconds-fresh while items they refer to may be an hour old, producing completely incoherent results. And since it's not easy to tell from a query which of the results may have been freshly edited, this would reduce the reliability of the service's data a lot. That may be fine for a relatively static database, but Wikidata is not one.
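  To illustrate point 2, here is a minimal sketch of the trap; the handler, fetch_rdf and store are hypothetical stand-ins, not the actual updater code:

    from dataclasses import dataclass

    @dataclass
    class EntityData:
        revision: int
        triples: list

    def handle_change_event(store, fetch_rdf, qid, event_revision):
        # Hypothetical event handler, for illustration only.
        data = fetch_rdf(qid)  # behind a long-lived cache this may predate the event
        if data.revision < event_revision:
            # The cache returned data older than the edit that triggered us.
            # Nothing else will re-check this item, so whatever we write now
            # stays in the database until the item happens to be edited again.
            pass
        store.write(qid, data)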
  
  I am not sure we can get around this even if we delay updates: even if we only process hour-old updates (and give up completely on the freshness we have now), we can't know where the hourly caching window for each item started, since that depends on when the edits happened. One item may be an hour old and another two hours old. Stale data would be bad enough; randomly, inconsistently stale data would be a disaster.
  
  So I consider a static URL with long caching a complete non-starter, unless somebody explains to me how to get around the problems described above.
  
  The only feasible way I can see is to pre-process the update stream to aggregate multiple edits to the same item over a longer period of time, and then do revision-based loads. Revision-based caching is safe with regard to consistency, and aggregation would mostly solve the performance issue. However, this means introducing an artificial delay into the process (otherwise the aggregation is useless), and the delay would have to be long enough to capture any edit burst on a typical item. And, of course, we'd need development effort to actually implement the aggregator service in a way that can serve all WDQS scenarios. We've talked about it a bit, but we don't have a work plan for this yet.
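  As a very rough sketch of what such an aggregator could do (the class, names and flush logic are made up for illustration, this is not a design):

    import time
    from collections import OrderedDict

    class EditAggregator:
        # Sketch only: collapse all edits to the same item seen within a delay
        # window into a single revision-based load.

        def __init__(self, delay_seconds=600):
            self.delay = delay_seconds
            self.pending = OrderedDict()  # qid -> (latest revision, first-seen time)

        def observe(self, qid, revision):
            now = time.time()
            latest, first_seen = self.pending.get(qid, (0, now))
            # Keep only the newest revision seen during the window.
            self.pending[qid] = (max(latest, revision), first_seen)

        def flush(self):
            # Emit (qid, revision) pairs whose window has elapsed; these can be
            # fetched with revision-based URLs and cached forever.
            now = time.time()
            ready = [q for q, (_, seen) in self.pending.items()
                     if now - seen >= self.delay]
            return [(q, self.pending.pop(q)[0]) for q in ready]

  The artificial delay is the per-item window: further edits within it only bump the stored revision, so a whole burst turns into a single revision-based fetch.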

TASK DETAIL
  https://phabricator.wikimedia.org/T217897
