[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-30 Thread gerritbot
gerritbot added a comment.

  Change 504990 **merged** by Gehel:
  [operations/puppet@production] Enable revision fetches in production



To: Smalyshev, gerritbot
Cc: Lucas_Werkmeister_WMDE, Addshore, Smalyshev, BBlack, Aklapper, Gehel, 
alaa_wmde, Legado_Shulgin, Nandana, thifranc, AndyTan, Davinaclare77, Qtn1293, 
Lahi, Gq86, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, EBjune, merbst, 
LawExplorer, Zppix, _jensen, rosalieper, Jonas, Xmlizer, Wong128hk, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, 
Wikidata-bugs mailing list


2019-05-13 Thread Addshore
Addshore added a comment.

  So, I did some really crappy analysis of the hit rate in varnish before and 
after this change, looking at the 5th of april and the 5th of may (one before 
and one after as far as I can tell).
  | Cache status | April 5th | May 5th  |
  | hit-front    | 1132      | 2986499  |
  | hit-local    | 3689      | 6068535  |
  | hit-remote   | 1         | 593755   |
  | int-local    | 6         | 0        |
  | int-remote   | 2         | 0        |
  | miss         | 14126950  | 1053539  |
  | pass         | 4360      | 8624     |
  | TOTAL        | 14136140  | 10710952 |
  These include requests from IPs that start with 10.*.
  **For the 2 days I looked at, the hit rate for varnish has gone from 0.03% up to 
90%, woo!**
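  As a quick check of those two percentages, here is a minimal sketch that recomputes the hit rate from the table. It assumes that only the three `hit-*` statuses count as hits and that `int-*`, `miss` and `pass` do not; the dictionary layout and the `hit_rate` helper are illustrative names, not part of the original analysis.

```python
# Request counts per cache status, copied from the table above.
counts = {
    "april_5": {"hit-front": 1132, "hit-local": 3689, "hit-remote": 1,
                "int-local": 6, "int-remote": 2, "miss": 14126950, "pass": 4360},
    "may_5": {"hit-front": 2986499, "hit-local": 6068535, "hit-remote": 593755,
              "int-local": 0, "int-remote": 0, "miss": 1053539, "pass": 8624},
}


def hit_rate(day_counts):
    """Share of requests answered from cache (assumption: only 'hit-*' statuses)."""
    hits = sum(n for status, n in day_counts.items() if status.startswith("hit-"))
    return hits / sum(day_counts.values())


for day, day_counts in counts.items():
    print(f"{day}: {hit_rate(day_counts):.2%}")
```

  Under that assumption the April figure comes out at roughly 0.03% and the May figure at roughly 90%, matching the summary above.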
  For reference I got the raw data with:
SELECT month, cache_status, COUNT(*) AS requests, ip
FROM wmf.webrequest
WHERE uri_host = 'www.wikidata.org'
AND year = 2019
AND ( month = 04 OR month = 05 )
AND day = 05
AND user_agent = 'Wikidata Query Service Updater'
AND uri_path RLIKE '^/wiki/Special:EntityData'
GROUP BY month, cache_status, ip
ORDER BY month, cache_status, requests DESC




2019-06-25 Thread Addshore
Addshore added a comment.

  Will this change also get rolled out to 3rd parties using the updater? / Is 
it in a certain release?




2019-06-25 Thread Smalyshev
Smalyshev added a comment.

  No release yet, but if you check out Updater or WDQS build, you get the same 




2019-06-30 Thread Addshore
Addshore added a comment.

  I guess this will eventually be in wdqs 0.3.3 ?




2019-06-30 Thread Smalyshev
Smalyshev added a comment.

  Eventually, yes.




2019-03-08 Thread BBlack
BBlack added a comment.

  Looking at an internal version of the flavor=dump outputs of an entity, 
related observations:
  Test request from the inside: `curl -v 
--resolve www.wikidata.org:443:`
  - There is LM data, for this QID it currently says:  `last-modified: Fri, 08 
Mar 2019 06:24:59 GMT`
  - This could be used with standard HTTP conditional requests for 
`If-Modified-Since`.  This would still cause a ping through to the applayer, 
but would not transfer the body if no change.
  - Or alternatively, use the same data that's informing the LM/IMS conditional 
stuff to set metadata in the dump output as well, so that your queries can use 
this as a datestamp that's shared among more clients (this is basically the 
`use event date` idea from the summary), so that it doesn't even need an LM/IMS 
roundtrip and can be a true cache hit.
  - The CC header is: `cache-control: public, s-maxage=3600, max-age=3600`
  - 1H seems short in general.  We prefer 1d+ for the actual CC times 
advertised by major cacheable production endpoints so that everything doesn't 
go stale too quickly during minor maintenance work on a cache or a site.  Is 
there a reason for the short value (often it's set low because of other issues 
around purging, or because this kind of update traffic is not well-engineered yet)?
  - However, assuming the 1H is staying for now, can't updaters just be ok with 
up to 1H of stale data and not cache bust at all?  There's no such thing as 
async+realtime; there's always a staleness, it's just a question of how much is 
tolerable for the use-case.
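
  To make the two header mechanisms above concrete, here is a small sketch under stated assumptions. It is not Varnish or MediaWiki code: `conditional_get_status` and `shared_cache_ttl` are hypothetical helpers that only mimic the standard HTTP semantics of `If-Modified-Since` and of the `s-maxage` / `max-age` directives, using the header values quoted above.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_get_status(last_modified: str, if_modified_since: Optional[str]) -> int:
    """Status an origin would send: 304 if unchanged since the client's copy, else 200."""
    if if_modified_since is None:
        return 200  # unconditional request, the full body is transferred
    resource_time = parsedate_to_datetime(last_modified)
    client_time = parsedate_to_datetime(if_modified_since)
    return 200 if resource_time > client_time else 304


def shared_cache_ttl(cache_control: str) -> Optional[int]:
    """Seconds a shared cache may reuse a response: s-maxage, falling back to max-age."""
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        directives[name.lower()] = value
    for name in ("s-maxage", "max-age"):
        if directives.get(name, "").isdigit():
            return int(directives[name])
    return None


LAST_MODIFIED = "Fri, 08 Mar 2019 06:24:59 GMT"
print(conditional_get_status(LAST_MODIFIED, LAST_MODIFIED))
print(shared_cache_ttl("public, s-maxage=3600, max-age=3600"))
```

  A 304 still costs a round trip to the applayer, which is why the shared datestamp idea is attractive: it turns the request into a plain cache hit.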




2019-03-08 Thread Smalyshev
Smalyshev added a comment.

  We've been around this topic a number of times, so I'll write a summary where 
we're at so far. I'm sorry it's going to be long, there's a bunch of issues at 
play. Also, if after reading this you think it's utter nonsense and I'm missing 
an obvious solution to this please feel free to explain it.
  Why are we using non-caching URLs?
  Because we want to have the latest data for the item. The item can be edited 
many times in short bursts (bots are especially known for this, but people do 
it too all the time). This is peculiar to Wikidata - Wikipedia articles are 
usually edited in big chunks, but on Wikidata each property/value change is a 
separate edit usually, which means there can be dozens of edits in a relatively 
short period.
  If we use a static URL, and we get a change for Q1234, we'd get the data for one 
of the edits stuck in the cache as "data for Q1234", and we have no way of 
getting the most recent data until the cache expires. This is bad (more on that later).
  If we use a URL keyed by revision number, that means if we have 20 edits in a 
row, we'd have to download the RDF data for the page 20 times instead of just 
downloading it once (or maybe twice). This is somewhat mitigated by the batch 
aggregation we do, but our batches are not that big, so if there is a big edit 
burst, this completely kills performance (and edit bursts are exactly the place 
where we need every last bit of performance).
  > can't updaters just be ok with up to 1H of stale data and not cache bust at all?
  So, we can use one of two ways in general:
  A. Use revision-based URLs (described above) - for these we can cache them 
forever, since they don't change
  B. Use general Qid-based URL without revision marker. Long cache for this 
would be very bad, for the following reasons:
  1. People expect to see the data they edit on Wikidata. If somebody edits a 
value and would have to wait for an hour for it to show up on WDQS, people 
would be quite upset. We can have somewhat stale data even now, but hour-long 
delay is rare. And when it happens, people do complain.
  2. Updater is event-driven, so if it gets an update for Q1234 revision X, it 
should be able to load data for Q1234 at least as new as revision X. If it 
loads, due to cache, any older data, this data is stuck in the database 
forever, unless there's a new update - since nothing will cause it to re-check 
Q1234 again.
  3. Data in Wikidata is highly interconnected. Unlike Wikipedia articles, 
which link to each other but largely are consumed independently, most Wikidata 
queries involve multiple items that interlink to each other. Caching means that 
each of these items will be seen by WDQS as being in the state it was in at some 
random moment in the past hour (note that these moments may also be 
different for different servers due to cache expirations that can happen in 
between server updates) - with these moments being different for different 
items. That means you basically can't do any query that involves any data 
edited in the past hour reliably, as the result for any of these can be 
completely nonsensical - some items would be seconds-fresh and some items they 
refer to may be hour-old, producing completely incoherent results. And since 
it's not easy to see from a query which of the results may be freshly edited, 
this would reduce reliability of service data a lot. It may be fine on a 
relatively static database, but Wikidata is not one.
  I am not sure we can get around this even if we delay updates - even if we 
process only hour old updates (and give up completely on freshness we have now) 
we can't know where the hourly caching window for each item started - that 
would depend on when the edits happened. One item may be hour old and another 
two hours old. Stale data would be bad enough, randomly inconsistently stale 
data would be a disaster.
  So I consider static URL with long caching a complete non-starter unless 
somebody explains to me how to get around the problems described above.
  The only feasible way I can see is to pre-process update stream to aggregate 
multiple edits to the same item over a long period of time and then do 
revision-based loads. Revision-based caching is safe with regard to 
consistency, and aggregation would mostly solve the performance issue. However, 
this means introducing an artificial delay into the process (otherwise the 
aggregation is useless) - which should be long enough to capture any edit burst 
on a typical item. And, of course, we'd need development effort to actually 
implement the aggregator service in a way that can serve all WDQS scenarios. 
We've talked about it a bit but we don't have a work plan for this yet.
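
  The aggregation idea can be sketched roughly as follows. This is only an illustration under assumptions, not the planned aggregator service: `ChangeEvent`, `aggregate` and the fixed `window_seconds` bucketing are hypothetical, and a real service would also have to handle deletes and events without revision IDs.

```python
from typing import Iterable, List, NamedTuple, Tuple


class ChangeEvent(NamedTuple):
    entity_id: str
    revision: int
    timestamp: float  # seconds since some epoch


def aggregate(events: Iterable[ChangeEvent], window_seconds: float) -> List[Tuple[str, int]]:
    """Collapse an edit burst into one revision-based fetch per entity per window."""
    latest = {}  # (window index, entity ID) -> highest revision seen in that window
    for event in events:
        key = (int(event.timestamp // window_seconds), event.entity_id)
        latest[key] = max(latest.get(key, 0), event.revision)
    # Emit fetches in window order; each one names an exact, cacheable revision.
    return [(entity_id, revision) for (_, entity_id), revision in sorted(latest.items())]


# A burst of 20 edits to Q1234 within 20 seconds, plus one edit to Q5.
burst = [ChangeEvent("Q1234", 100 + i, float(i)) for i in range(20)]
burst.append(ChangeEvent("Q5", 7, 3.0))
print(aggregate(burst, window_seconds=60))
```

  With a 60 second window the 20 edits to Q1234 collapse into a single fetch of revision 119, at the price of delaying the update until the window closes.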




2019-03-08 Thread Smalyshev
Smalyshev added a comment.

  > don't do cache busting on events older than X
  This however gave me an idea. If we kept a map of all latest revision IDs for 
all items we've recently updated, we could eliminate a lot of stale updates - 
especially when we're catching up after the lag. The first mention of the item 
would fetch the latest rev, and then all the following events would basically 
be ignored.
  Right now we do something like that within the batch, and again match the 
revision IDs against the database after the fetches - but this way we can do it 
cross-batch and eliminate the unnecessary fetches. Basically that'd solve the 
problem of lots of fetches (while the cache is active) since each item will be 
fetched only once per backlog. I think with proper data structure (like 
SparseArray maybe?) we could keep a lot of history there relatively cheaply (we 
just need one 64-bit int per item). Also probably won't work for changes that 
lack revision ID - like deletes - but we could either ignore those (they are 
relatively rare) or also use timestamps (dangerous).
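
  A minimal sketch of that cross-batch map, assuming a plain dictionary instead of a sparse array and ignoring deletes. `SeenRevisions`, `replay` and the `latest_on_server` lookup are hypothetical names used only for illustration; they are not taken from the Updater code.

```python
from typing import Dict, Iterable, Tuple


class SeenRevisions:
    """Remembers the highest revision already fetched for each entity."""

    def __init__(self) -> None:
        self._latest: Dict[str, int] = {}

    def needs_fetch(self, entity_id: str, event_revision: int) -> bool:
        # An event is stale if we already fetched this revision or a newer one.
        return event_revision > self._latest.get(entity_id, -1)

    def record_fetch(self, entity_id: str, fetched_revision: int) -> None:
        if fetched_revision > self._latest.get(entity_id, -1):
            self._latest[entity_id] = fetched_revision


def replay(events: Iterable[Tuple[str, int]], latest_on_server: Dict[str, int]) -> int:
    """Process a backlog of (entity, revision) events; return the number of fetches made."""
    seen = SeenRevisions()
    fetches = 0
    for entity_id, revision in events:
        if not seen.needs_fetch(entity_id, revision):
            continue  # covered by an earlier fetch of the latest revision
        fetches += 1
        seen.record_fetch(entity_id, latest_on_server[entity_id])
    return fetches


backlog = [("Q1234", 100 + i) for i in range(20)] + [("Q5", 7)]
print(replay(backlog, {"Q1234": 119, "Q5": 7}))
```

  When catching up after lag, the first mention of Q1234 fetches revision 119 and the remaining 19 events are skipped, so the backlog above costs two fetches rather than 21.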




2019-03-12 Thread Gehel
Gehel added a comment.

  Given the discussion above, I'm not sure I understand why a 
`If-Modified-Since:` would not work. What am I missing?




2019-03-12 Thread Smalyshev
Smalyshev added a comment.

  >   I'm not sure I understand why a If-Modified-Since: would not work. 
  How would you see it working? In Varnish, it would be useless since Varnish 
has no way of knowing if Wikidata item changed since being cached. If we go to 
the backend, first we are already incurring all the costs of PHP setup, 
database connection and loading the data just to know whether the data has been 
edited or not. But there are more complications than that:
  1. If-Modified-Since requires a timestamp. Timestamps (in seconds) do not 
have enough granularity to track changes in Wikidata - there can be many edits 
within one second.
  2. The timestamp we have in the change event is not necessarily the timestamp 
on the database edit (we could probably ensure it's the same but due to the 
above it's useless anyway and we have to use revision IDs)
  3. In most cases, we can already know if we have certain revision without 
calling Wikidata - revisions are monotonic, which means if we have revision X 
in the database and change comes with revision Y with Y
https://phabricator.wikimedia.org/T217897



2019-03-12 Thread Smalyshev
Smalyshev added a comment.

  > it at least make sense to raise the issue so that the WDQS use case is 
addressed if it can be addressed
  For that, we'd better define "WDQS case". The best I have right now is the 
event aggregation idea described above. In theory, it could also be combined 
with data loading (so that Wikidata data is loaded only once per stream) though 
I am not sure how well it would work given that data sizes for some items are 
pretty large and combining huge data arrays with small events may be 
counter-productive. Some intermediate storage may solve the issue.




2019-03-12 Thread BBlack
BBlack added a comment.

  I think it would be better, from my perspective, to really understand the 
use-cases better (which I don't).  Why do these remote clients need "realtime" 
(no staleness) fetches of Q items?  What I hear is it sounds like all clients 
expect everything to be perfectly synchronous, but I don't understand why they 
need to be perfectly synchronous.  In the case that led to this ticket, it was 
a remote client at Orange issuing a very high rate of these uncacheable 
queries, which seems like a bulk data load/update process, not an "I just 
edited this thing and need to see my own edits reflected" sort of case.




2019-03-12 Thread Smalyshev
Smalyshev added a comment.

  > Why do these remote clients need "realtime" (no staleness) fetches of Q items?
  Because that's what Query Service is - realtime (well, near-realtime, given 
update times) queryable representation of Wikidata content in RDF form.
  > What I hear is it sounds like all clients expect everything to be perfectly synchronous
  Not sure what you mean by "synchronous" here, could you explain?
  >   In the case that lead to this ticket, it was a remote client at Orange 
issuing a very high rate of these uncacheable queries
  If you start with an old dump/starting point, yes, this is essentially a bulk 
data load. It is not meant for external clients, so there's no actual rate 
controls built in, except for the server itself rejecting the queries. Of 
course, rate limiting would mean the endpoint may take a very long time to 
catch up... 
  BTW in this situation I don't really see how caching would help any, as 
Updater is not supposed to ever ask for the same content (or, in a situation 
with large backlog, for the same item) twice. If we need more detailed look 
into it, maybe set up a meeting so we could do an interactive dive-in there?




2019-03-13 Thread Addshore
Addshore added a comment.

  >>   In the case that lead to this ticket, it was a remote client at Orange 
issuing a very high rate of these uncacheable queries
  It's not just Orange it would seem..
  I took a quick look at the webrequest data for the WDQS updater UA and there 
are other locations also running the updater probably to keep their own copy of 
the query service up to date.
  Looking at March 1st 2019, 58% of the uncacheable requests to 
Special:EntityData for wikidata came from internal wdqs systems, the other 42% 
came from what seems to be another 9 or so external copies of the wdqs that are 
kept up to date with the live data using the updater.
  On 1st March there were 21,461,124 cache misses for the WDQS updater UA.
  To put that in perspective the total number of cache misses for wikidata on 
that day was 23,857,783 (passes were 54 million).
  Comparing this with the total number of edits on wikidata in that day, ~1.1 
million, I see quite some uncached requests that could be cached.
  I understand that there are some issues with the wdqs updating for every 
single revision due to the performance of writing to SPARQL however.
  Taking a look at our internal hosts on the 1st March, they seemed to make 
between 826,878 and 926,848 requests to Special:EntityData, so they avoided 
200k - 300k requests to get data and thus also probably sparql writes each.
  Thinking purely from a varnish hit rate perspective it would make sense to 
remove the "random cache busting" (I guess not actually random, as it is a ts 
based nocache=) from the request and switch to asking for the 
specific revision id that is required.
  This would likely go from 21 million misses per day to just 1 million? (the 
initial 1 million requests to populate the cache for each revision being 
  After talking with Stas this apparently makes updating within the updater 
harder etc as it might result in more writes to sparql? (I'd let stas talk more 
on that topic).
  Adding nocache=X doesn't actually mean the request will not be cached, it is 
still cached, just unlikely to be requested by other users (probably wasting quite 
some varnish space?)
  It looks like we probably get ~10k cache hits even with the cache buster from 
the WDQS UA, maybe if the servers happen to be requesting the same entities 
during the same second.
  If we don't want to explicitly ask for a revision from the page, can we not 
use the latest revision id we know exists for the entity that we have, or some 
hash of it? to make for a nicer cache buster that could actually be shared 
among updaters both internal and external? The updating pattern within the wdqs 
itself could stay the same then? I guess this depends on if the wdqs updater 
knows what the latest revid is for the entity it is getting updates for?
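
  The difference between the two kinds of cache buster can be sketched like this. The helper names are hypothetical, and using `revision` as the query parameter is an assumption made for illustration; only `nocache=` and `flavor=dump` are taken from the discussion above.

```python
from urllib.parse import urlencode

BASE = "https://www.wikidata.org/wiki/Special:EntityData"


def timestamp_busted_url(entity_id: str, now_seconds: int) -> str:
    """Timestamp-based buster: the cache key changes every second."""
    query = urlencode({"flavor": "dump", "nocache": now_seconds})
    return f"{BASE}/{entity_id}.ttl?{query}"


def revision_keyed_url(entity_id: str, revision: int) -> str:
    """Revision-based key (parameter name assumed): same URL for every updater."""
    query = urlencode({"flavor": "dump", "revision": revision})
    return f"{BASE}/{entity_id}.ttl?{query}"


# Two updaters reacting to the same edit one second apart.
print(timestamp_busted_url("Q64", 1551400000) == timestamp_busted_url("Q64", 1551400001))
print(revision_keyed_url("Q64", 123456) == revision_keyed_url("Q64", 123456))
```

  Because a cache keys on the full URL, the first comparison is false (two distinct cache objects) and the second is true (one object shared by internal and external updaters).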
  Another thing to consider here is in theory even when using the cache buster 
method the data the wdqs updater currently gets when passing nocache=ts may not 
be up to date due to maxlag, not sure if that has been considered in the 
updater process at all?
  It's not often that there is significant maxlag, but in the last months it has 
occasionally gone up to 5s or 20s. Not sure if the wdqs updater normally requests 
data for an entity that quickly after an edit has been made, but if it does it could be 
getting out of date data even with the cache busting. But perhaps the 
Last-Modified header is checked in the wdqs updater? If not, maybe it should 
be? (Grepping through the code I couldn't find it.)
  On the Wikibase side of things, this is a relatively cheap request to make as 
the revision look up is done from the big shared cache of wikidata entity 
revisions and the flavour=dump so wikibase itself in most cases will not make 
any expensive sql queries etc (but anything mediawiki does on start up will 
still happen).




2019-03-13 Thread Addshore
Addshore added a comment.

  The only other thing I was going to add (forgot before i hit submit on the 
last post)
  Within the cluster varnish cached results for entities return much faster 
than the php returned results (of course)
  | entity                      | varnish result | php result | page selection |
  | Q1.ttl?flavour=dump         | ~0.06-0.07s    | ~0.6-0.7s  | randomish      |
  | Q64.ttl?flavour=dump        | ~0.15-0.16s    | ~2.3-2.5s  | randomish      |
  | Q100.ttl?flavour=dump       | ~0.13-0.14s    | ~2s        | randomish      |
  | Q55886027.ttl?flavour=dump  | ~0.14s         | ~7-17s?    | LongPages      |
  | Q2911127.ttl?flavour=dump   | ~0.02s         | 0.06s      | ShortPages     |
  Data was gathered from a prod mw host with requests like the following
cat curl-format.txt
    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
                    --\n
         time_total:  %{time_total}\n

curl -w "@curl-format.txt" -o /dev/null -s 
  I guess the wdqs internal machines would have comparable response times?
  It's hard to really figure anything concrete out from this but the wdqs 
updater / updaters would potentially spend a lot less time waiting for 
responses (maybe they already do them async?) if they hit varnish more?
  Doing some terrible maths and looking at the smallest possible time saving 
for a short page, so 0.04s saved by hitting the cache and assuming 1 million 
edits in a day (based on the comment above, even though right now the wdqs 
updater does a small amount of batching so makes fewer requests): 
1,000,000 * 0.04s = 40,000s =~ 11 hours per host?
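  The back-of-the-envelope figure can be reproduced as below, reading the estimate as 1 million requests per day with 0.04s saved on each (0.06s from PHP versus ~0.02s from varnish for the short page in the table). The variable names are only illustrative.

```python
requests_per_day = 1_000_000   # one entity data request per edit, ignoring batching
saving_per_request_s = 0.04    # 0.06s (php) - 0.02s (varnish) for a short page

total_saving_s = requests_per_day * saving_per_request_s
total_saving_h = total_saving_s / 3600

print(f"{total_saving_s:,.0f}s, about {total_saving_h:.1f} hours per host per day")
```

  This is a lower bound on time spent waiting for responses rather than on wall-clock time, since it says nothing about how much of the updater's day is spent in SPARQL updates.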
  This doesn't really help if the slowest part of the process is actually 
writing the data to blazegraph, but 11 hours in a 24 hour period is still 
pretty significant. I hope the Java updater does some amount of async work 
(writing to blazegraph while getting the next data ready?)




2019-03-22 Thread Addshore
Addshore added a comment.

  In T217897#5026499, @Smalyshev wrote:
  > > I guess the wdqs internal machines would have comparable response times?
  > You can see response times for RDF loading in the dashboard: 
  So, p95 is around 80ms, which lines up pretty well with the data in 
T217897#5020225  where a 
short page took 60ms.
  I guess most entities are toward the smaller end of the scale as the mean 
seems to be closer to ~55-60ms.
  Still, a cache hit for something that takes 80ms in php would likely only 
take ~25ms if hitting varnish.
  >> but 11 hours in a 24 hour period is still pretty significant
  > I'm not sure I understand how this figure was obtained but there's 
absolutely no way Updater spends half time in waiting for RDF loading. In 
reality, it spends most of its time in SPARQL Update.
  I'm not sure if the figure is totally accurate, it is based on multiple 
  Let me try and refine it slightly.
  Again this ignores batching, but the actual edit count on the day was ~1.1 
million, which resulted in between 926848 and 826878 requests to load entity 
data, depending on which wdqs host you look at, so 84-75% of edits end up 
triggering an entity data load with the current batching methods.
  But the fact stands that varnish will always respond faster than php, and 
looking at even the smallest entity, a varnish hit shaves around 50% off the 
request time.
  If we say a single wdqs host is making 800k requests to special entity data 
right now with an average load time of 55ms, that's 44,000,000ms (44,000s) or 
about 12 hours spent loading data.
  If we pretend we are loading every single revision (so 1.1 million) and 
actually hit the varnish cache (well sometimes not if we are the first server 
to ask) then we have ((1,100,000/12*11) * 25ms) + ((1,100,000/12*1) * 55ms) = 
30,250,000ms (30,250s) or about 8.4 hours.
  So probably a saving of closer to 4 hours each day per instance of loading 
time. But if the updater were to actually then write to blazegraph for each of 
the retrieved revisions then of course that would be 300k more updates, but 
IMO the wdqs updater can still request revisions like this, and choose not to 
actually write every single revision.
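
  For reference, the loading-time comparison can be recomputed as follows. This reads the figures as 800k requests at 55ms today versus 1.1 million revision fetches of which 11 in 12 are varnish hits at 25ms. Treating the 12 as the number of hosts sharing the cache is an assumption, as are the variable names.

```python
MS_PER_HOUR = 3_600_000

# Today: every request goes to PHP.
current_ms = 800_000 * 55

# Proposed: fetch every revision; assume 1 in 12 requests is the first to ask (a miss).
revisions = 1_100_000
sharing_hosts = 12  # assumption: the divisor in the estimate is the number of hosts
miss_share = 1 / sharing_hosts
proposed_ms = revisions * (1 - miss_share) * 25 + revisions * miss_share * 55

print(f"current:  {current_ms / MS_PER_HOUR:.1f} h")
print(f"proposed: {proposed_ms / MS_PER_HOUR:.1f} h")
print(f"saving:   {(current_ms - proposed_ms) / MS_PER_HOUR:.1f} h")
```

  The result is roughly 12.2 hours today against 8.4 hours with revision-keyed fetches, so a saving of a little under 4 hours per instance per day.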
  This is all generally meant to just highlight that hitting varnish is 
obviously going to be faster, even if the updater itself thinks that entity data 
retrieval is fast enough.
  >> writing to blazegraph while getting the next data ready?
  > That could be possible but doesn't happen now. May be a good idea to try. 
However, since SPARQL Update dominates the timings pretty heavily it's unlikely 
we'd save too much. And since we need to validate IDs against database (to 
ensure we don't already have the revision we're about to fetch) we can not 
fetch RDF before previous update has finished, thus reducing the 
parallelizeable part to essentially only Kafka data loading, which doesn't seem 
to be worth it.
  I'm still a bit confused about this logic inside the updater, especially with 
this id validation checking if we have the revision already etc?
  The fastest way for this to work in the distributed fashion that it is 
currently laid out in is to just retrieve every revision of entity data, using 
a varnish cacheable query string, hold the latest revision of an entity in some 
internal queue in the updater for a few seconds while waiting for more updates, 
and then just commit that to blazegraph for storage after a few seconds.
  This means reducing the php calls dramatically, increasing varnish hits, 
decreasing overall time spent waiting for special:entitydata responses, and 
still allowing for batching.
  A few other comments that we might want to think about.
  PHP is being hit very roughly with 12.5 million requests to turn some PHP 
object into RDF output for special entity data, we might want to just consider 
caching that in its own memcached key inside wikibase so we only have to do 
that conversion once per revision, reducing this logic from running 12.5 
million times to just around 1.1 million times.
  This hasn't been considered before because special entity data is cacheable, 
and these should all be varnish cache hits anyway, but if the updater behaviour 
does not change then maybe we should add this?
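
  A rough sketch of what caching the conversion per revision could look like. It is written in Python with a dictionary standing in for memcached, so it is not Wikibase code: `RdfCache`, the `rdf:<entity>:<revision>` key format and the `render` callback are all hypothetical.

```python
from typing import Callable, Dict


class RdfCache:
    """Memoizes the entity-to-RDF conversion so it runs once per (entity, revision)."""

    def __init__(self, render: Callable[[str, int], str]) -> None:
        self._render = render
        self._store: Dict[str, str] = {}  # stand-in for a memcached client
        self.render_calls = 0

    def get(self, entity_id: str, revision: int) -> str:
        key = f"rdf:{entity_id}:{revision}"  # a revision never changes, so no expiry needed
        if key not in self._store:
            self.render_calls += 1
            self._store[key] = self._render(entity_id, revision)
        return self._store[key]


def fake_render(entity_id: str, revision: int) -> str:
    # Placeholder for the expensive PHP-object-to-RDF conversion.
    return f"# RDF for {entity_id} at revision {revision}"


cache = RdfCache(fake_render)
for _ in range(12):  # e.g. a dozen updaters asking for the same revision
    cache.get("Q64", 123456)
print(cache.render_calls)
```

  Keying on the revision is what makes this safe: the cached value can never go stale, only unused.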
  Also regarding 3rd parties using the updater, perhaps the revid based 
approach needs to be developed anyway to reduce the load, which is likely to 
continue to increase. These requests should be hitting the cache, but they 
should also not be getting out of date data; revid is the solution to that.
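  To make the difference concrete, here is a minimal Python sketch of the two 
URL styles being discussed. The query parameter names (`nocache`, `revision`, 
`flavor`) and the `.ttl` suffix are assumptions for illustration and are not 
taken from the updater source; the point is only that one form is unique per 
request while the other is identical for every updater.

```python
from urllib.parse import urlencode

# Assumed base URL for illustration.
BASE = "https://www.wikidata.org/wiki/Special:EntityData"


def cache_busting_url(entity_id: str, now_seconds: int) -> str:
    # The timestamp makes the URL differ from second to second,
    # so a shared HTTP cache almost never sees the same key twice.
    query = urlencode({"nocache": now_seconds, "flavor": "dump"})
    return f"{BASE}/{entity_id}.ttl?{query}"


def revision_url(entity_id: str, revision_id: int) -> str:
    # Every updater asking for the same revision builds the same URL,
    # so the first response can be reused by all the others.
    query = urlencode({"revision": revision_id, "flavor": "dump"})
    return f"{BASE}/{entity_id}.ttl?{query}"
```

  Two updaters calling `revision_url("Q42", 123)` get an identical string, 
whereas `cache_busting_url` changes whenever the clock ticks.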
  https://grafana.wikimedia.org/d/00188/wikidata-special-entitydata shows 
the issue pretty well with the number of requests for uncached ttl data to 
special entity d

[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-25 Thread Smalyshev
Smalyshev added a comment.

  > I'm still a bit confused about this logic inside the updater, especially 
with this id validation checking if we have the revision already etc?
  Not sure what you mean by "already". You can have a revision ID in the 
change, and a revision ID in Wikidata, but you still have to check against the 
revision ID in Blazegraph, so that you do not replace newer data with older 
data.
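  A minimal sketch of that guard, in Python, assuming the stored revisions can 
be modelled as a plain dictionary (the real check runs against Blazegraph; the 
function and variable names here are hypothetical):

```python
from typing import Dict, List, Optional, Tuple


def should_apply(stored_revision: Optional[int], incoming_revision: int) -> bool:
    # Apply only if nothing is stored yet or the incoming revision is newer.
    return stored_revision is None or incoming_revision > stored_revision


def apply_changes(
    store: Dict[str, int], changes: List[Tuple[str, int]]
) -> List[Tuple[str, int]]:
    # `store` stands in for the revision IDs already held by the triple store.
    applied = []
    for entity_id, revision_id in changes:
        if should_apply(store.get(entity_id), revision_id):
            store[entity_id] = revision_id
            applied.append((entity_id, revision_id))
    return applied
```

  With `store = {"Q1": 10}`, a change for revision 9 of Q1 is dropped while 
revision 12 is applied, which is the "do not replace newer data with older 
data" rule.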
  > hold the latest revision of an entity in some internal queue in the updater 
for a few seconds while waiting for more updates, and then just commit that to 
blazegraph for storage after a few seconds
  Not sure how holding it in the queue for a few seconds would help anything. 
You'd just time-shift the whole process several seconds to the past, but 
otherwise nothing would change. If you mean batching the updates, we already do 
that. But the batch for the updates covering several seconds would be huge 
(some bots do hundreds of updates per second) and putting them into SPARQL 
queries would make them very slow. If we split them, we slow the process down, 
and take the risk the whole update was useless since new data already arrived. 
I am not sure how waiting for a few seconds helps anything beyond what the 
current process is already doing (and it introduces additional complexity, as 
now we can no longer assume we're working with the latest data but always have 
to track which delayed update this data relates to). Maybe I misunderstand 
something in your proposal.
  > This means reducing the PHP calls dramatically, increasing varnish 
  It may raise varnish hits (since everything would be varnish hit), but as for 
reducing PHP calls, I am not sure about that, because instead of fetching only 
newest edit, if the entry is edited 100 times, you now need to fetch 100 edits 
instead. That's 100x PHP calls.
  > PHP is being hit very roughly with 12.5 million requests to turn some PHP 
object into RDF output for special entity data, we might want to just consider 
caching that in its own memcached key inside wikibase so we only have to do 
that conversion once per revision
  May be worth considering, but we have tons of revisions, do we have enough 
memory for such cache? some entries are huge, and if one letter changes in 30M 
RDF, we'd be storing two 30M revisions differing in one byte. Of course, we 
could limit the size of the cacheable RDF - not sure how many resources are 




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-26 Thread Smalyshev
Smalyshev added a comment.

  Looking at the distribution of Special:EntityData fetches, if we cache 
entities under 10K, we will capture about 90% of them. Most frequent sizes are 
1 to 4K. So caching is probably worth trying.
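  As a rough illustration of such a size-limited cache, here is a Python 
sketch. It uses an in-process dictionary as a stand-in for memcached, and the 
class name, the `render` callback and the exact threshold are assumptions made 
for the example, not the Wikibase implementation.

```python
from typing import Callable, Dict, Tuple

MAX_CACHEABLE_BYTES = 10 * 1024  # assumed cut-off, from "entities under 10K"


class SizeLimitedRdfCache:
    def __init__(
        self,
        render: Callable[[str, int], bytes],
        max_bytes: int = MAX_CACHEABLE_BYTES,
    ) -> None:
        self._render = render          # expensive entity -> RDF conversion
        self._max_bytes = max_bytes
        self._store: Dict[Tuple[str, int], bytes] = {}
        self.render_calls = 0          # counts how often the conversion ran

    def get(self, entity_id: str, revision_id: int) -> bytes:
        key = (entity_id, revision_id)
        cached = self._store.get(key)
        if cached is not None:
            return cached
        self.render_calls += 1
        rdf = self._render(entity_id, revision_id)
        # Only small outputs are kept, so huge entities do not fill the cache.
        if len(rdf) < self._max_bytes:
            self._store[key] = rdf
        return rdf
```

  Small entities are rendered once per revision; anything at or above the 
threshold is rendered on every request, trading CPU for bounded memory use.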




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-26 Thread gerritbot
gerritbot added a comment.

  Change 499363 had a related patch set uploaded (by Smalyshev; owner: 
  [mediawiki/extensions/Wikibase@master] Add caching of Special:EntityData 




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-27 Thread Addshore
Addshore added a comment.

  In T217897#5056213, @Smalyshev wrote:
  > > I'm still a bit confused about this logic inside the updater, especially 
with this id validation checking if we have the revision already etc?
  > Not sure what you mean by "already". You can have a revision ID in the 
change, and a revision ID in Wikidata, but you still have to check against the 
revision ID in Blazegraph, so that you do not replace newer data with older 
data.
  I'm not quite sure how you would get to this situation in the first place 
  Is the stream of events suddenly going to start sending old events?
  Or is this mainly a situation after data has been bulk loaded?
  >> hold the latest revision of an entity in some internal queue in the 
updater for a few seconds while waiting for more updates, and then just commit 
that to blazegraph for storage after a few seconds
  > Not sure how holding it in the queue for a few seconds would help anything. 
You'd just time-shift the whole process several seconds to the past, but 
otherwise nothing would change. If you mean batching the updates, we already do 
that. But the batch for the updates covering several seconds would be huge 
(some bots do hundreds of updates per second) and putting them into SPARQL 
queries would make them very slow. If we split them, we slow the process down, 
and take the risk the whole update was useless since new data already arrived. 
I am not sure how waiting for a few seconds helps anything beyond what the 
current process is already doing (and it introduces additional complexity, as 
now we can no longer assume we're working with the latest data but always have 
to track which delayed update this data relates to). Maybe I misunderstand 
something in your proposal.
  Yes, the waiting for a few seconds would be for batching changes to the same 
entity. But this would be waiting on the stream of events for entity changes: 
wait until the entity has not been touched for 10 seconds (or something), then 
request from Special:EntityData, by revid, the last revision that the updater 
received, create the SPARQL and do the update.
  I'm thinking about batched updates per entity, not batched updates of all 
changes in a set period of time.
  Again, I'm mainly proposing this to try and get revid to be used. I still 
don't understand why revid can't be used if the above is essentially what the 
updater is doing. If I were going to write something to do updates to the query 
service from the ground up, with no knowledge of what has already been 
attempted, the above is what it would do.
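  The per-entity waiting described above could be sketched roughly as follows. 
This is a Python illustration only: the class and method names are made up, 
time is passed in explicitly to keep the example deterministic, and nothing 
here reflects how the actual updater is structured.

```python
from typing import Dict, List, Tuple


class EntityDebouncer:
    def __init__(self, quiet_seconds: float = 10.0) -> None:
        self.quiet_seconds = quiet_seconds
        # entity_id -> (latest revision seen, time of the last event)
        self._pending: Dict[str, Tuple[int, float]] = {}

    def on_change(self, entity_id: str, revision_id: int, now: float) -> None:
        # Keep only the highest revision and restart the quiet period.
        latest, _ = self._pending.get(entity_id, (revision_id, now))
        self._pending[entity_id] = (max(latest, revision_id), now)

    def pop_ready(self, now: float) -> List[Tuple[str, int]]:
        # Entities untouched for at least `quiet_seconds` are ready to fetch.
        ready = [
            (entity_id, revision_id)
            for entity_id, (revision_id, last_seen) in self._pending.items()
            if now - last_seen >= self.quiet_seconds
        ]
        for entity_id, _ in ready:
            del self._pending[entity_id]
        return ready
```

  Each `(entity_id, revision_id)` pair returned by `pop_ready` would then be 
fetched by revision ID and turned into one update, so an entity edited many 
times in a burst costs a single fetch.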
  >> This means reducing the PHP calls dramatically, increasing varnish 
  > It may raise varnish hits (since everything would be varnish hit), but as 
for reducing PHP calls, I am not sure about that, because instead of fetching 
only newest edit, if the entry is edited 100 times, you now need to fetch 100 
edits instead. That's 100x PHP calls.
  Well, the underlying PHP calls that would happen as a result of hitting 
varnish would decrease dramatically even if every single revision was 
requested using revid for ttl format, due to the current distributed nature of 
the updater.
  If edits on wikidata were slower, 1 edit on wikidata would result in 12 PHP 
runs using the cache busting (so ignoring the batching).
  Hitting revid, 1 edit would result in 1, maybe 2 PHP hits, depending on how 
fast varnish was to cache the result.
  Again ignoring the batching here, as it definitely does not give us the 12x 
decrease in requests to PHP that we would get with using a cacheable URL.
  This is briefly backed up by data in T217897#5048178:
  "but the actual edit count on the day was ~1.1 million, which resulted in 
between 926848 and 826878 requests to load entity data"
  That is per host.
  So 1.1 million edits, but around 10 million PHP code executions (at least) to 
update the query services, when in my eyes that should really be no more than 
the number of edits.
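  The arithmetic behind that estimate can be written out as a short Python 
calculation. The host count of 12 is an assumption read off the "12 PHP runs" 
remark above; the other figures are the ones quoted.

```python
EDITS_PER_DAY = 1_100_000          # "~1.1 million" edits on the day
UPDATER_HOSTS = 12                 # assumed from "12 PHP runs" per edit
FETCHES_PER_HOST_LOW = 826_878     # quoted per-host range
FETCHES_PER_HOST_HIGH = 926_848

# Every host fetches independently, so the totals multiply.
low_total = UPDATER_HOSTS * FETCHES_PER_HOST_LOW
high_total = UPDATER_HOSTS * FETCHES_PER_HOST_HIGH

# How many fetches each edit causes across the whole cluster.
low_amplification = low_total / EDITS_PER_DAY
high_amplification = high_total / EDITS_PER_DAY

print(f"total fetches: {low_total:,} to {high_total:,}")
print(f"fetches per edit: {low_amplification:.1f} to {high_amplification:.1f}")
```

  Under that assumption the totals land between roughly 9.9 and 11.1 million 
fetches, i.e. about 9 to 10 fetches per edit.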
  >> PHP is being hit very roughly with 12.5 million requests to turn some PHP 
object into RDF output for special entity data, we might want to just consider 
caching that in its own memcached key inside wikibase so we only have to do 
that conversion once per revision
  > May be worth considering, but we have tons of revisions, do we have enough 
memory for such cache? some entries are huge, and if one letter changes in 30M 
RDF, we'd be storing two 30M revisions differing in one byte. Of course, we 
could limit the size of the cacheable RDF - not sure how many resources are 
  So the shared cache for entity revisions inside wikibase exists per entity, 
not per revision, but it is updated during save and can be assumed to be the 
latest revision.
  It is shared between wikidata.org, and all client sites and used for 
essentially all entity revision retrieval (we 

[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-27 Thread Addshore
Addshore added a comment.

  In T217897#5060748, @Smalyshev wrote:
  > Looking at the distribution of Special:EntityData fetches, if we cache 
entities under 10K, we will capture about 90% of them. Most frequent sizes are 
1 to 4K. So caching is probably worth trying.
  I left some comments on the patch, but still think that the cache we are 
talking about there would be unnecessary if the wdqs just hit varnish.




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-27 Thread Smalyshev
Smalyshev added a comment.

  > the cache we are talking about there would be unnecessary if the wdqs just 
hit varnish.
  It is problematic for WDQS to "just hit varnish", because varnish does not 
know if a certain revision is the latest one available or not. Wikidata on the 
other hand does.




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-28 Thread Addshore
Addshore added a comment.

  In T217897#5062728, @Smalyshev wrote:
  > > the cache we are talking about there would be unnecessary if the wdqs 
just hit varnish.
  > It is problematic for WDQS to "just hit varnish", because varnish does not 
know if a certain revision is the latest one available or not. Wikidata on the 
other hand does.
  Again, my flipped way of looking at that is that WDQS does know what the 
latest version of the entity that it is trying to get updates for is; 
therefore, it should probably just ask for it.




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-28 Thread Smalyshev
Smalyshev added a comment.

  > WDQS does know what the latest version of the entity that it is trying to 
get updates for is,
  But "last version that WDQS knows of" can be very different from "last 
version that Wikidata has". That's the whole issue.
  I had an idea recently though. Maybe we could work in two modes - if the 
stream is lagged sufficiently, we use the "latest available" mode - to jump to 
the front, but if we're more or less current, the probability of our change 
being current is high, so we could use "by revision ID" mode. Need to look at 
edit timings to see if it's workable, but maybe splitting the two cases - 
catching up from large lag and keeping current - would be more efficient and 
allow us to use the cache for the most frequent case (which is "keeping 
current"). That would be easy to implement and test - just a couple of ifs in 
the proper places.
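  A minimal Python sketch of that two-mode decision follows. The 30-second 
threshold, the mode names and the function name are all hypothetical; the 
actual cut-off would have to come from looking at edit timings, as noted above.

```python
from datetime import datetime, timedelta, timezone

LAG_THRESHOLD = timedelta(seconds=30)  # hypothetical cut-off, not measured


def choose_fetch_mode(
    event_time: datetime,
    now: datetime,
    threshold: timedelta = LAG_THRESHOLD,
) -> str:
    # Lag is how far behind the stream the updater currently is.
    lag = now - event_time
    if lag > threshold:
        # Far behind: ask for whatever is newest to jump to the front.
        return "latest"
    # Roughly current: the change's revision is probably still the newest,
    # so a cacheable by-revision fetch is likely to be sufficient.
    return "revision"
```

  For example, an event that is 5 seconds old selects `"revision"`, while one 
that is 10 minutes old selects `"latest"`.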




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-28 Thread Smalyshev
Smalyshev added a comment.

  @Addshore btw do I understand right that constraints cannot be fetched 
per-revision? In this case, do we still need cache-busting there? Or do 
constraints manage their own caches? I am not sure what to do here.




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-28 Thread gerritbot
gerritbot added a comment.

  Change 499951 had a related patch set uploaded (by Smalyshev; owner: 
  [wikidata/query/rdf@master] Implement more cache-friendly Wikibase fetch 




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-29 Thread Addshore
Addshore added a comment.

  In T217897#5066900, @Smalyshev wrote:
  > > WDQS does know what the latest version of the entity that it is trying to 
get updates for is,
  > But "last version that WDQS knows of" can be very different from "last 
version that Wikidata has". That's the whole issue.
  > I had an idea recently though. Maybe we could work in two modes - if the 
stream is lagged sufficiently, we use the "latest available" mode - to jump to 
the front, but if we're more or less current, the probability of our change 
being current is high, so we could use "by revision ID" mode. Need to look at 
edit timings to see if it's workable, but maybe splitting the two cases - 
catching up from large lag and keeping current - would be more efficient and 
allow us to use the cache for the most frequent case (which is "keeping 
current"). That would be easy to implement and test - just a couple of ifs in 
the proper places.
  That sounds like a pretty good idea!
  How often do the updaters get lagged behind the stream?
  Another thing that we could also tweak would be the cache busting method. 
Right now a timestamp is used all the way down to a second.
  If the cache buster had slightly less granularity (such as only timestamps 
ending in even seconds, or 0 and 5) the probability of hits in the same few 
seconds between different updaters within the cluster would be greatly 
  But this would be a cherry on top, and if we already have the majority of 
requests using revid this probably isn't too bad.
  I guess whether this would work or not again depends on what the internals of 
the updater do, and whether this means things might be out of date in places or 
whether it would be able to handle this.
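  The coarser cache buster could look something like this in Python. It is a 
sketch only: the function name and the default 5-second bucket are assumptions 
chosen to match the "0 and 5" example.

```python
def bucketed_cache_buster(epoch_seconds: int, bucket_seconds: int = 5) -> int:
    # Round the timestamp down to the start of its bucket, so every updater
    # that asks within the same bucket produces the same cache key.
    return epoch_seconds - (epoch_seconds % bucket_seconds)
```

  With a 5-second bucket, requests made at seconds 1553850003 and 1553850004 
both use 1553850000 as the cache buster. The trade-off is the one raised 
above: a response may be served from cache for the rest of its bucket, so the 
updater has to tolerate data that is slightly out of date.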
  Another option that would be more involved would be have a single consumer of 
the stream do the hard work (generating SPARQL) and spit that back into another 
stream, so that wikibase is only hit once for each change rather than by each 
  In T217897#5068290, @Smalyshev wrote:
  > @Addshore btw do I understand right that constraints cannot be fetched 
per-revision? In this case, do we still need cache-busting there? Or do 
constraints manage their own caches? I am not sure what to do here.
  Constraints can not be fetched per revision, you can only get the latest 
  The constraint check results for a single revision can change, so there is 
little point in tying them to a revid.
  When we get finished with the work in this area the results will be 
persistently stored, so retrieving them will be cheap: they will be calculated 
after each edit, persisted, and then an event will be added to a stream saying 
that new constraint check data for X now exists.




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-03-31 Thread gerritbot
gerritbot added a comment.

  Change 500359 had a related patch set uploaded (by Smalyshev; owner: 
  [operations/puppet@production] Enable using revision-fetch mechanism for test 
& internal clusters




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-02 Thread gerritbot
gerritbot added a comment.

  Change 499951 **merged** by jenkins-bot:
  [wikidata/query/rdf@master] Implement more cache-friendly Wikibase fetch 




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-03 Thread gerritbot
gerritbot added a comment.

  Change 500359 **merged** by Gehel:
  [operations/puppet@production] Enable using revision-fetch mechanism for test 
& internal clusters




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-03 Thread gerritbot
gerritbot added a comment.

  Change 501056 had a related patch set uploaded (by Gehel; owner: Smalyshev):
  [operations/puppet@production] wdqs: expose revision-fetch mechanism




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-03 Thread gerritbot
gerritbot added a comment.

  Change 501056 **merged** by Gehel:
  [operations/puppet@production] wdqs: expose revision-fetch mechanism




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-03 Thread Smalyshev
Smalyshev added a comment.

  I've also made a counter to check how many "forward skips" - i.e. loading a 
revision newer than the one we asked for in the change - we get. The averages 
are between 0.1 and 0.5, sometimes going to 1 - i.e. we're saving up to one 
item fetch/update per second, or since we're processing about 10 updates per 
second, it's from 1% to 10% speed improvement. 1% is low, but 10% is not, so we 
may not want to give up on skip-ahead just yet.
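  The percentages follow directly from the two rates; as a small Python check 
(the function name is made up for the example, the figures are the ones above):

```python
UPDATES_PER_SECOND = 10  # "about 10 updates per second"


def skip_saving_percent(
    skips_per_second: float, updates_per_second: float = UPDATES_PER_SECOND
) -> float:
    # Each forward skip avoids one fetch/update out of those processed.
    return 100.0 * skips_per_second / updates_per_second


for skips in (0.1, 0.5, 1.0):
    print(f"{skips} skips/s -> {skip_saving_percent(skips):.0f}% saved")
```

  That gives 1% at 0.1 skips per second, 5% at 0.5 and 10% at 1.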




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-04 Thread gerritbot
gerritbot added a comment.

  Change 501450 had a related patch set uploaded (by Smalyshev; owner: 
  [wikidata/query/rdf@master] Work around status 400 on redirect revision fetch




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-05 Thread gerritbot
gerritbot added a comment.

  Change 501450 **merged** by jenkins-bot:
  [wikidata/query/rdf@master] Work around status 400 on redirect revision fetch




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-09 Thread gerritbot
gerritbot added a comment.

  Change 502655 had a related patch set uploaded (by Smalyshev; owner: 
  [mediawiki/extensions/Wikibase@master] Allow revision dump for redirects




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-10 Thread gerritbot
gerritbot added a comment.

  Change 502909 had a related patch set uploaded (by Smalyshev; owner: 
  [operations/puppet@production] Enable revisions support on internal clusters




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-11 Thread gerritbot
gerritbot added a comment.

  Change 502909 **merged** by Gehel:
  [operations/puppet@production] Enable revisions support on internal clusters




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-18 Thread gerritbot
gerritbot added a comment.

  Change 504990 had a related patch set uploaded (by Smalyshev; owner: 
  [operations/puppet@production] Enable revision fetches in production




[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-19 Thread Smalyshev
Smalyshev added a comment.

  Results of caching can be seen here:
  The deploy date was Apr 11; fetch time drops from 135/195 ms (eqiad/codfw) to 
90/150 ms when requests are cached.



To: Smalyshev
Cc: Lucas_Werkmeister_WMDE, Addshore, Smalyshev, BBlack, Aklapper, Gehel, 
alaa_wmde, joker88john, Legado_Shulgin, CucyNoiD, Nandana, NebulousIris, 
thifranc, AndyTan, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, 
Davinaclare77, Adrian1985, Qtn1293, Cpaulf30, Lahi, Gq86, Baloch007, 
Darkminds3113, Bsandipan, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, 
Hfbn0, Ramalepe, Liugev6, QZanden, EBjune, merbst, LawExplorer, WSH1906, 
Lewizho99, Zppix, Maathavan, _jensen, rosalieper, Jonas, Xmlizer, Wong128hk, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, 
Mbch331, Jay8g, fgiunchedi
Wikidata-bugs mailing list

[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2019-04-23 Thread gerritbot
gerritbot added a comment.

  Change 502655 **merged** by jenkins-bot:
  [mediawiki/extensions/Wikibase@master] Allow revision dump for redirects



To: gerritbot
Cc: Lucas_Werkmeister_WMDE, Addshore, Smalyshev, BBlack, Aklapper, Gehel, 
alaa_wmde, joker88john, Legado_Shulgin, CucyNoiD, Nandana, NebulousIris, 
thifranc, AndyTan, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, 
Davinaclare77, Adrian1985, Qtn1293, Cpaulf30, Lahi, Gq86, Baloch007, 
Darkminds3113, Bsandipan, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, 
Hfbn0, Ramalepe, Liugev6, QZanden, EBjune, merbst, LawExplorer, WSH1906, 
Lewizho99, Zppix, Maathavan, _jensen, rosalieper, Jonas, Xmlizer, Wong128hk, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, 
Mbch331, Jay8g, fgiunchedi
Wikidata-bugs mailing list

[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater

2020-01-15 Thread gerritbot
gerritbot added a comment.

  Change 499363 abandoned by Addshore:
  Add caching of Special:EntityData results



To: Smalyshev, gerritbot
Cc: Lucas_Werkmeister_WMDE, Addshore, Smalyshev, BBlack, Aklapper, Gehel, 
darthmon_wmde, Legado_Shulgin, Nandana, Davinaclare77, Qtn1293, Techguru.pc, 
Lahi, Gq86, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, EBjune, merbst, 
LawExplorer, Zppix, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, Wong128hk, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, 
Mbch331, Rxy, Jay8g, fgiunchedi
Wikidata-bugs mailing list