I agree, we should look at some actual traffic to see how many queries
/could/ be cached in a 2/5/10/60 min window. Maybe exclude the example
queries from those numbers, to separate "production" usage from testing.
Also, look at query runtime; if only "cheap" queries would be cached,
there is no point in caching.
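Something like this, for example -- just a sketch; the log file, its CSV
layout (timestamp, runtime, query) and the field names are my own
assumptions, not an existing format we actually have:

    import csv
    from collections import defaultdict
    from hashlib import sha256

    WINDOWS = [120, 300, 600, 3600]  # 2/5/10/60 min, in seconds

    def repeat_stats(log_path, min_runtime=0.0):
        """Count queries that are exact repeats within each window.

        Expects a headerless CSV with columns: unix_ts, runtime_sec, query
        (hypothetical format -- whatever the real logs end up providing).
        """
        last_seen = {}            # query hash -> timestamp of last run
        hits = defaultdict(int)   # window -> number of repeats inside it
        total = 0
        with open(log_path, newline="") as f:
            for ts, runtime, query in csv.reader(f):
                ts, runtime = float(ts), float(runtime)
                if runtime < min_runtime:   # optionally skip "cheap" queries
                    continue
                total += 1
                h = sha256(query.strip().encode("utf-8")).hexdigest()
                if h in last_seen:
                    for w in WINDOWS:
                        if ts - last_seen[h] <= w:
                            hits[w] += 1
                last_seen[h] = ts
        return total, dict(hits)

Run that over a month of logs once with min_runtime=0 and once with, say,
min_runtime=1.0, and we'd see whether caching would only ever help the
cheap queries.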

If caching would lead to significant savings, option 2 sounds sensible.
Some people will get upset if their results aren't up-to-the-second, and
being able to shift the blame to "server defaults" would be convenient ;-)

Option 3 sounds bad, because everyone and their cousin will just add an
override to their tools, to prevent hours-old data from being served to
surprised users. WDQ has a ~10-15 min lag; that's about as much as people
can stomach.
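To be concrete, that "override" would be a single extra request parameter.
A sketch, assuming the ?cached= parameter from Stas's proposal below and
Python's requests library (none of this interface exists yet, of course):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    def run_query(sparql, max_age=0):
        """Fetch results, forcing fresh data via the proposed cached= knob."""
        resp = requests.get(
            ENDPOINT,
            params={"query": sparql, "cached": max_age},  # hypothetical param
            headers={
                "Accept": "application/sparql-results+json",
                "User-Agent": "example-tool/0.1",
            },
        )
        resp.raise_for_status()
        return resp.json()

Five lines of boilerplate, copied from tool to tool, and the long default
cache is effectively gone.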

Once you run a query, you know both the runtime and the result size. Maybe
expensive queries with a huge result set could be cached longer by default,
and cheap/small queries not at all? If you expect your recent Wikidata edit
to change the results from 3 to 4, you should see that ASAP; if the change
would be 50,000 to 50,001, it seems less critical somehow.
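As a straw man -- the function and the thresholds are made up purely to
illustrate the idea, not a proposal for actual numbers:

    def default_cache_ttl(runtime_sec, result_rows):
        """Pick a default cache TTL from observed query cost and result size.

        Thresholds are invented for illustration only.
        """
        if runtime_sec < 0.5 and result_rows < 100:
            return 0        # cheap & small: always serve fresh results
        if runtime_sec > 10 or result_rows > 10_000:
            return 3600     # expensive or huge: cache for an hour
        return 60           # everything in between: one minute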

On Tue, Feb 16, 2016 at 11:19 PM James Heald <j.he...@ucl.ac.uk> wrote:

> I have to say that I am dubious.
>
> How often does *exactly* the same query get run within 2 minutes?
>
> Does the same query ever get re-run at all?
>
> The first thing to do, surely, is to create a hash for each query (or
> better, perhaps, something like a tinyurl, so that the lookup is
> reversible), record a timestamp for that hash each time the query is run,
> and then see, even over a period of a month, how many (if any) queries
> are being re-run, and if so how often.
>
> I can imagine it's possible that particular tracking queries might be
> re-run (but probably (a) not every two minutes; and (b) not wanting the
> same result as last time).
>
> Also perhaps queries with a published link might get re-run -- e.g. if
> somebody posts the link for a query-generated graph on Twitter and it
> gets a lot of re-tweets.  (Or even just if Lydia posts it in the news of
> the week).
>
> For queries like that, caching might well make sense (and save the
> server a potential slashdotting).
>
> I'd guess there's probably only a very few queries like that though.
>
> Possibly it's only worth caching a set of results if the same query has
> *already* been requested within the last n minutes?
>
>    -- James
>
>
> On 16/02/2016 22:47, Stas Malyshev wrote:
> > Hi!
> >
> > With Wikidata Query Service usage rising and more use cases being
> > found, it is time to consider caching infrastructure for results, since
> > queries are expensive. One of the questions I would like to solicit
> > feedback on is the following:
> >
> > Should we have the default SPARQL endpoint cached or uncached? If
> > cached, which default cache duration would be good for most users? The
> > cache, of course, applies only to the results of the same (identical)
> > query. Please also note the following is not an implementation plan,
> > but rather an opinion poll; whatever we end up deciding, we will make
> > an announcement with the actual plan before we do it.
> >
> > Also, whichever default we choose, there should be a possibility to get
> > both cached and uncached results. The question is which one you get
> > when you access the endpoint with no options. So the possible variants
> > are:
> >
> > 1. query.wikidata.org/sparql is uncached; to get a cached result you
> > use something like query.wikidata.org/sparql?cached=120 to get a result
> > no older than 120 seconds.
> > PRO: least surprise for default users.
> > CON: relies on the goodwill of tool writers; if somebody doesn't know
> > about the cache option and uses the same query heavily, we would have
> > to ask them to use the parameter.
> >
> > 2. query.wikidata.org/sparql is cached for a short duration (e.g. 1
> > minute) by default; if you'd like a fresh result, you do something like
> > query.wikidata.org/sparql?cached=0. If you're fine with an older
> > result, you can use query.wikidata.org/sparql?cached=3600 and get a
> > cached result if it's still in the cache, but by default you never get
> > a result older than 1 minute. This of course assumes Varnish magic can
> > do this; if not, the scheme has to be amended.
> > PRO: performance improvement while keeping default results reasonably
> > fresh.
> > CON: it is not obvious that the result may be stale rather than the
> > freshest data, so if you update something in Wikidata and query again
> > within a minute, you can be surprised.
> >
> > 3. query.wikidata.org/sparql is cached for a long duration (e.g. hours)
> > by default; if you'd like a fresher result you do something like
> > query.wikidata.org/sparql?cached=120 to get a result no older than 2
> > minutes, or cached=0 if you want an uncached one.
> > PRO: best performance improvement for most queries; works well with
> > queries that display data that rarely changes, such as lists, etc.
> > CON: for people not knowing about the cache option, it may be rather
> > confusing not to be able to get up-to-date results.
> >
> > So we'd like to hear - especially from current SPARQL endpoint users -
> > what you think about these and which would work for you?
> >
> > Also, for the users of the WDQS GUI - provided we have cached and
> > uncached options, which one should the GUI return by default? Should it
> > always be uncached? Performance there is not a major question - the
> > traffic to the GUI is pretty low - but rather convenience. Of course,
> > if you run a cached query from the GUI and the data is in the cache,
> > you can get results much faster for some queries. OTOH, it may be
> > important in many cases to be able to access the actual up-to-date
> > content, not the cached version.
> >
> > I also created a poll: https://phabricator.wikimedia.org/V8
> > so please feel free to vote for your favorite option.
> >
> > OK, this letter is long enough already so I'll stop here and wait to
> > hear what everybody's thinking.
> >
> > Thanks in advance,
> >
>
>
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
