I agree, we should look at some actual traffic to see how many queries /could/ be cached in a 2/5/10/60 min window. Maybe exclude the example queries from those numbers, to separate "production" from testing usage. Also, look at query runtime; if only "cheap" queries would be cached, there is no point in caching.
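To make that analysis concrete, here is a rough sketch of how one could hash each query and count, per window, how many requests could have been served from cache. The log format (timestamp, query text, runtime) and the normalization step are assumptions for illustration, not an existing tool:

```python
import hashlib

WINDOWS = [120, 300, 600, 3600]  # seconds: the 2/5/10/60 min windows

def query_hash(query: str) -> str:
    """Collapse whitespace and hash, so textually 'identical' queries collide."""
    return hashlib.sha256(" ".join(query.split()).encode()).hexdigest()

def cache_hit_counts(requests, windows=WINDOWS):
    """requests: iterable of (timestamp_sec, query_text, runtime_sec),
    assumed sorted by timestamp. Returns {window: (hits, runtime_saved)}:
    how many requests repeat an identical query last seen within the
    window, and how much query runtime those repeats would have saved."""
    last_seen = {}
    hits = {w: 0 for w in windows}
    saved = {w: 0.0 for w in windows}
    for ts, query, runtime in requests:
        h = query_hash(query)
        prev = last_seen.get(h)
        if prev is not None:
            for w in windows:
                if ts - prev <= w:
                    hits[w] += 1
                    saved[w] += runtime
        last_seen[h] = ts
    return {w: (hits[w], saved[w]) for w in windows}
```

Weighting hits by runtime answers the "cheap queries" question directly: if most repeats are sub-second queries, the cache buys little even at a high hit rate.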
If caching would lead to significant savings, option 2 sounds sensible. Some people will get upset if their results aren't up-to-the-second, and being able to shift the blame to "server defaults" would be convenient ;-) Option 3 sounds bad, because everyone and their cousin will just add an override to their tools to prevent hours-old data from being served to surprised users. WDQ has a ~10-15 min lag; that's about as much as people can stomach.

Once you run a query, you know both the runtime and the result size. Maybe expensive queries with a huge result set could be cached longer by default, and cheap/small queries not at all? If you expect your recent Wikidata edit to change the results from 3 to 4, you should see that ASAP; if the change would be 50.000 to 50.001, it seems less critical somehow.

On Tue, Feb 16, 2016 at 11:19 PM James Heald <j.he...@ucl.ac.uk> wrote:

> I have to say that I am dubious.
>
> How often does *exactly* the same query get run within 2 minutes?
>
> Does the same query ever get run?
>
> The first thing to do, surely, is to create a hash for each query (or
> better, perhaps, something like a tinyurl, so the lookup is reversible),
> record a timestamp for that hash each time the query is run, and then
> see, even over a period of a month, how many (if any) queries are being
> re-run, and if so how often.
>
> I can imagine it's possible that particular tracking queries might be
> re-run (but probably (a) not every two minutes; and (b) not wanting the
> same result as last time).
>
> Also perhaps queries with a published link might get re-run -- e.g. if
> somebody posts the link for a query-generated graph on Twitter that gets
> a lot of re-tweets. (Or even just if Lydia posts it in the news of the
> week.)
>
> For queries like that, caching might well make sense (and save the
> server a potential slashdotting).
>
> I'd guess there's probably only a very few queries like that, though.
>
> Possibly it's only worth caching a set of results if the same query has
> *already* been requested within the last n minutes?
>
> -- James
>
>
> On 16/02/2016 22:47, Stas Malyshev wrote:
> > Hi!
> >
> > With Wikidata Query Service usage rising and more use cases being
> > found, it is time to consider caching infrastructure for results, since
> > queries are expensive. One of the questions I would like to solicit
> > feedback on is the following:
> >
> > Should we have the default SPARQL endpoint cached or uncached? If
> > cached, which default cache duration would be good for most users? The
> > cache, of course, applies to the results of the same (identical) query
> > only. Please also note the following is not an implementation plan, but
> > rather an opinion poll; whatever we end up deciding, we will have an
> > announcement with the actual plan before we do it.
> >
> > Also, whichever default we choose, there should be a possibility to get
> > both cached and uncached results. The question is, when you access the
> > endpoint with no options, which one would it be. So possible variants
> > are:
> >
> > 1. query.wikidata.org/sparql is uncached; to get a cached result you
> > use something like query.wikidata.org/sparql?cached=120 to get a result
> > no older than 120 seconds.
> > PRO: least surprise for default users.
> > CON: relies on the goodwill of tool writers; if somebody doesn't know
> > about the cache option and uses the same query heavily, we would have
> > to ask them to use the parameter.
> >
> > 2. query.wikidata.org/sparql is cached for a short duration (e.g. 1
> > minute) by default; if you'd like a fresh result, you do something like
> > query.wikidata.org/sparql?cached=0. If you're fine with an older
> > result, you can use query.wikidata.org/sparql?cached=3600 and get a
> > cached result if it's still in cache, but by default you never get a
> > result older than 1 minute. This of course assumes Varnish magic can do
> > this; if not, the scheme has to be amended.
> > PRO: performance improvement while keeping default results reasonably
> > fresh.
> > CON: it is not obvious that the result may not be the freshest data but
> > can be stale, so if you update something in Wikidata and query again
> > within a minute, you can be surprised.
> >
> > 3. query.wikidata.org/sparql is cached for a long duration (e.g. hours)
> > by default; if you'd like a fresher result you do something like
> > query.wikidata.org/sparql?cache=120 to get a result no older than 2
> > minutes, or cache=0 if you want an uncached one.
> > PRO: best performance improvement for most queries; works well with
> > queries that display data that rarely changes, such as lists, etc.
> > CON: for people not knowing about the cache option, it may be rather
> > confusing to not be able to get up-to-date results.
> >
> > So we'd like to hear - especially from current SPARQL endpoint users -
> > what you think about these and which would work for you.
> >
> > Also, for the users of the WDQS GUI - provided we have cached and
> > uncached options, which one should the GUI return by default? Should it
> > always be uncached? Performance there is not a major question - the
> > traffic to the GUI is pretty low - but rather convenience. Of course,
> > if you run a cached query from the GUI and the data is in cache, you
> > can get results much faster for some queries. OTOH, it may be important
> > in many cases to be able to access actual up-to-date content, not the
> > cached version.
> >
> > I also created a poll: https://phabricator.wikimedia.org/V8
> > so please feel free to vote for your favorite option.
> >
> > OK, this letter is long enough already, so I'll stop here and wait to
> > hear what everybody's thinking.
> >
> > Thanks in advance,
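The idea upthread of tying the default cache lifetime to query cost (expensive queries with huge result sets cached longer, cheap/small ones served fresh) could be sketched like this. All thresholds and the function name are illustrative guesses, not a proposal from the thread:

```python
def default_cache_ttl(runtime_sec: float, result_rows: int) -> int:
    """Pick a default cache TTL in seconds from the observed cost of a
    query; both runtime and result size are known once it has run.
    Thresholds are made-up placeholders for illustration."""
    if runtime_sec < 0.5 and result_rows < 100:
        return 0       # cheap and small: always serve fresh
    if runtime_sec < 5.0:
        return 60      # moderate: option-2-style one-minute default
    return 600         # expensive: cache longer by default
```

A client that knows about the proposed `cached=` parameter could still override whichever default wins, e.g. by requesting query.wikidata.org/sparql?cached=0 for a guaranteed-fresh result.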
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata