Smalyshev added a comment.
We've been around this topic a number of times, so I'll write a summary of where we're at so far. I'm sorry it's going to be long; there are a bunch of issues at play. Also, if after reading this you think it's utter nonsense and I'm missing an obvious solution, please feel free to explain it.

Why are we using non-caching URLs? Because we want to have the latest data for the item. The item can be edited many times in short bursts (bots are especially known for this, but people do it all the time too). This is peculiar to Wikidata: Wikipedia articles are usually edited in big chunks, but on Wikidata each property/value change is usually a separate edit, which means there can be dozens of edits in a relatively short period.

If we use a static URL and we get a change for Q1234, we'd get the data for one of the edits stuck in the cache as "data for Q1234", and we'd have no way of getting the most recent data until the cache expires. This is bad (more on that later).

If we use a URL keyed by revision number, then if we have 20 edits in a row, we'd have to download the RDF data for the page 20 times instead of just downloading it once (or maybe twice). This is somewhat mitigated by the batch aggregation we do, but our batches are not that big, so a big edit burst completely kills performance (and edit bursts are exactly where we need every last bit of performance).

> can't updaters just be ok with up to 1H of stale data and not cache bust at all?

So, in general we can use one of two ways:

A. Use revision-based URLs (described above) - these we can cache forever, since they don't change.

B. Use a general Qid-based URL without a revision marker. A long cache for this would be very bad, for the following reasons:

1. People expect to see the data they edit on Wikidata. If somebody edits a value and has to wait an hour for it to show up on WDQS, people would be quite upset. We can have somewhat stale data even now, but an hour-long delay is rare.
And when it happens, people do complain.

2. The updater is event-driven, so if it gets an update for Q1234 revision X, it should be able to load data for Q1234 at least as new as revision X. If, due to the cache, it loads any older data, that data is stuck in the database forever unless there's a new update - since nothing will cause it to re-check Q1234 again.

3. Data in Wikidata is highly interconnected. Unlike Wikipedia articles, which link to each other but are largely consumed independently, most Wikidata queries involve multiple items that interlink with each other. Caching means that each of these items will be seen by WDQS as being in the state it was at some random moment in the past hour, with those moments being different for different items (note that the moments can also differ between servers, due to cache expirations that happen in between server updates). That means you basically can't reliably run any query involving data edited in the past hour, as its results can be completely nonsensical - some items would be seconds-fresh while items they refer to may be an hour old, producing completely incoherent results. And since it's not easy to see from a query which of the results may have been freshly edited, this would greatly reduce the reliability of the service's data. That may be fine on a relatively static database, but Wikidata is not one.

I am not sure we can get around this even if we delay updates. Even if we process only hour-old updates (and give up completely on the freshness we have now), we can't know where the hourly caching window for each item started - that depends on when the edits happened. One item may be an hour old and another two hours old. Stale data would be bad enough; randomly, inconsistently stale data would be a disaster.

So I consider a static URL with long caching a complete non-starter, unless somebody explains to me how to get around the problems described above.
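To make the trade-off between options A and B concrete, here is a minimal sketch in Python. The URL shapes are modeled on Special:EntityData and are an assumption for illustration; the real updater's request format may differ in detail.

```python
# Sketch of the two URL schemes discussed above (assumption: URL shapes
# modeled on Special:EntityData; the real updater may differ in detail).

def revision_url(qid: str, revision: int) -> str:
    # Option A: the content at a (qid, revision) pair never changes,
    # so a cache may keep this response forever.
    return (f"https://www.wikidata.org/wiki/Special:EntityData/"
            f"{qid}.ttl?revision={revision}")

def latest_url(qid: str) -> str:
    # Option B: the same URL for every revision of the item, so whatever
    # revision the cache captured first is what everyone gets until expiry.
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.ttl"

# A burst of 20 edits to Q1234 yields 20 distinct cache keys under option A
# (20 fetches, but every response is immutable and safely cacheable) and a
# single key under option B (1 fetch, but possibly stale for every consumer
# until the cache expires).
urls_a = {revision_url("Q1234", rev) for rev in range(100, 120)}
urls_b = {latest_url("Q1234") for _ in range(20)}
print(len(urls_a), len(urls_b))  # 20 distinct keys vs. 1 shared key
```

This is exactly the dilemma above: option A trades bandwidth for correctness, option B trades correctness for bandwidth.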
The only feasible way I can see is to pre-process the update stream to aggregate multiple edits to the same item over a long period of time, and then do revision-based loads. Revision-based caching is safe with regard to consistency, and aggregation would mostly solve the performance issue. However, this means introducing an artificial delay into the process (otherwise the aggregation is useless), which should be long enough to capture any edit burst on a typical item. And, of course, we'd need development effort to actually implement the aggregator service in a way that can serve all WDQS scenarios. We've talked about it a bit, but we don't currently have a work plan for this yet.

TASK DETAIL
https://phabricator.wikimedia.org/T217897
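The aggregation idea described above can be sketched roughly as follows. This is a minimal illustration under an assumed event shape of `(qid, revision, timestamp)` tuples and a fixed window; a real aggregator service would consume the live event stream with proper windowing, not an in-memory list.

```python
# Hypothetical sketch of the proposed aggregator: collapse an edit burst
# inside a delay window so that one revision-based (cache-safe) load
# replaces N loads per item.
from collections import OrderedDict

def aggregate(events, window_start, window_end):
    """Keep only the newest revision seen for each item within the window,
    preserving first-seen order."""
    latest = OrderedDict()
    for qid, revision, ts in events:
        if window_start <= ts < window_end:
            if qid not in latest or revision > latest[qid]:
                latest[qid] = revision
    return list(latest.items())

# A burst of edits to Q1234 interleaved with one edit to Q42; the edit at
# ts=10 falls outside the window and waits for the next one.
events = [("Q1234", 100, 0), ("Q42", 7, 1), ("Q1234", 101, 2),
          ("Q1234", 102, 3), ("Q1234", 103, 10)]
print(aggregate(events, 0, 10))  # [('Q1234', 102), ('Q42', 7)]
```

The window length is the artificial delay mentioned above: the longer it is, the more of a burst it absorbs, and the staler the freshest possible data becomes.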