BBlack added a comment.

In https://phabricator.wikimedia.org/T126730#2034900, @Christopher wrote:

> I may be wrong, but the headers that are returned from a request to the nginx 
> server wdqs1002 say that varnish 1.1 is already being used there.


It's varnish 3.0.6 currently (4.x is coming down the road).

> And, for whatever reason, **it misses**, because repeating the same query 
> gives the same response time.

It misses because the response is sent with `Transfer-Encoding: chunked`.  If 
it were sent un-chunked with a Content-Length, the varnish would have a chance 
at caching it.  However, the next thing you'd run into is that the response 
doesn't contain any caching-relevant headers (e.g. `Expires`, `Cache-Control`, 
`Age`).  Lacking these, varnish would cache it with our configured default_ttl, 
which on the misc cluster where `query.wikidata.org` is currently hosted, is 
only 120 seconds.
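To illustrate, here's a rough Python sketch of that TTL decision, under the simplifying assumption that only `Cache-Control: max-age` and `Expires`/`Date` are consulted (real Varnish also honors `s-maxage` and per-VCL overrides; the 120 is `cache_misc`'s `default_ttl` mentioned above):

```python
from email.utils import parsedate_to_datetime

DEFAULT_TTL = 120  # cache_misc default_ttl, per the discussion above

def effective_ttl(headers):
    """Roughly how a Varnish-like cache picks a TTL from response headers.

    Simplified sketch: checks Cache-Control max-age, then Expires minus
    Date, then falls back to the configured default_ttl.
    """
    cc = headers.get("Cache-Control", "")
    for token in cc.split(","):
        token = token.strip()
        if token.startswith("max-age="):
            return int(token.split("=", 1)[1])
    if "Expires" in headers and "Date" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        date = parsedate_to_datetime(headers["Date"])
        return max(0, int((expires - date).total_seconds()))
    return DEFAULT_TTL  # no caching-relevant headers at all

# A WDQS response with no caching headers falls through to 120s:
print(effective_ttl({"Content-Type": "application/sparql-results+json"}))  # 120
print(effective_ttl({"Cache-Control": "public, max-age=3600"}))            # 3600
```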

> Even though Varnish cache **should work** to proxy nginx for optimizing 
> delivery of static query results, it lacks several important features of an 
> object broker.  Namely, client control of object expiration (TTL) and 
> retrieval of "named query results" from persistent storage.
> 
>   A WDQS service use case may in fact be to compare results from several days 
> ago with current results.   Thus, assuming the latest results state is what 
> the client wants may actually not be true.

I think all of this is doable.  Named query results is something we talked 
about in the previous discussion re `GET` length restrictions: `POST` (and/or 
server-side configure, either way!) a complex query and save it as a named 
query through a separate query-setup interface, then execute the query for 
results with a `GET` on just the query name.
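To sketch that two-step flow (all names, endpoints, and functions here are hypothetical illustrations, not an existing WDQS API):

```python
# Hypothetical sketch of the "named query" idea: a separate setup step
# saves a complex SPARQL query under a short name, and later GETs run
# it by name alone, keeping the GET URL short and cacheable.

SAVED_QUERIES = {}  # name -> SPARQL text (would be persistent storage)

def save_named_query(name, sparql):
    """The query-setup interface: register a query under a name (via POST
    or server-side configuration)."""
    SAVED_QUERIES[name] = sparql
    return name

def execute_named_query(name):
    """The equivalent of GET /sparql?saved_query=<name>."""
    sparql = SAVED_QUERIES[name]
    # ... here the service would run `sparql` against the triple store ...
    return {"query": sparql, "results": []}

save_named_query("fooquery",
                 "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10")
print(execute_named_query("fooquery")["query"])
```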

I don't think we really want client control of object expiration (at least, not 
"varnish cache object expiration"), but what we want is the ability to 
parameterize named queries based on time, right?  e.g. a named query that gives 
a time-series graph might have parameters for start time and duration.  You 
might initially post the complex SPARQL template and save it as `fooquery`, 
then later have a client get it as 
`/sparql?saved_query=fooquery&start=201601011234&duration=1w`.  Varnish would 
have the chance to cache those based on the query args as separate results, and 
you could limit the time resolution if you want to enhance cacheability.
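To make "limit the time resolution" concrete: if clients can pass arbitrary start times, nearly every URL is unique and Varnish caches nothing useful; snapping start times down to a coarser grid makes slightly-different requests share one cached object.  A hypothetical sketch (the `YYYYMMDDHHMM` format follows the example URL above; the function name and default grid are invented):

```python
from datetime import datetime, timedelta

def normalize_start(start, resolution=timedelta(hours=1)):
    """Round a start=YYYYMMDDHHMM parameter down to a resolution boundary,
    so the cache sees one URL (one cached result) per time bucket."""
    dt = datetime.strptime(start, "%Y%m%d%H%M")
    epoch = datetime(1970, 1, 1)
    step = int(resolution.total_seconds())
    elapsed = int((dt - epoch).total_seconds())
    snapped = epoch + timedelta(seconds=(elapsed // step) * step)
    return snapped.strftime("%Y%m%d%H%M")

# Two requests 25 minutes apart collapse onto the same cache key:
print(normalize_start("201601011234"))  # 201601011200
print(normalize_start("201601011259"))  # 201601011200
```

The service (or a VCL rewrite in front of it) would apply this before the URL becomes the cache key, so the coarsening is invisible to clients beyond the chosen resolution.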

If it's for inclusion from a page that wants to graph that data and always show 
a "current" graph rather than hardcoded start/duration (and I could see 
use-cases for both in articles), you could support a start time of `now` with 
an optional resolution specifier that defaults to 1 day, like `&start=now/1d`.  
The response to such a query would set cache-control headers that allow caching 
at varnish up to 24H (based on `now/1d` resolution), which means everyone 
executing that query gets new results about once a day, and they all share a 
single cached result per day.
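A sketch of resolving such a parameter: `now` snapped down to the start of its resolution bucket, with a `max-age` that expires the object exactly when the bucket rolls over, so everyone shares one result per bucket.  The `now/<resolution>` syntax is the hypothetical one proposed above, not an existing WDQS parameter:

```python
from datetime import datetime, timedelta, timezone

RESOLUTIONS = {"1h": timedelta(hours=1), "1d": timedelta(days=1)}

def resolve_now(param, now=None):
    """Turn 'now/1d' into (bucket_start, max_age_seconds)."""
    _, _, res = param.partition("/")
    step = int(RESOLUTIONS.get(res, timedelta(days=1)).total_seconds())
    now = now or datetime.now(timezone.utc)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    elapsed = int((now - epoch).total_seconds())
    bucket_start = epoch + timedelta(seconds=(elapsed // step) * step)
    # Expire when the bucket rolls over, so all clients flip to the new
    # result at the same moment (rather than a fixed 24h from first fetch):
    max_age = step - (elapsed % step)
    return bucket_start, max_age

start, max_age = resolve_now(
    "now/1d", now=datetime(2016, 2, 15, 18, 0, tzinfo=timezone.utc))
print(start.isoformat(), max_age)  # 2016-02-15T00:00:00+00:00 21600
```

Tying `max-age` to the bucket boundary rather than a fixed duration is a design choice: it trades slightly shorter average cache lifetimes for everyone agreeing on which day's result they see.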

The important thing here is there's no need for a client to have control over 
result object expiration if the query encodes everything that's relevant to 
expiration and the maximum cache lifetime is set small enough that other 
effects (e.g. data updates to existing historical data) are negligible in the 
big picture.

> Possibly, the optimal solution would use the varnish-api-engine 
> (http://info.varnish-software.com/blog/introducing-varnish-api-engine) in 
> conjunction with a WDQS REST API (provided with a modified RESTBase?).   Is 
> the varnish-api-engine being used anywhere in WMF?  Also, delegating query 
> requests to an API could allow POSTs.  Simply with Varnish cache, the POST 
> problem would remain unresolved.

We're not using the Varnish API Engine, and I don't see us pursuing that 
anytime soon.  Most of what it does can be done other ways, and more 
importantly it's commercial software.

There seems to be some confusion as to whether `POST` is or isn't still an 
issue here...

Also, a whole separate issue is that WDQS is currently mapped through our 
`cache_misc` cluster.  That cluster is for small lightweight miscellaneous 
infrastructure.  WDQS was probably always a poor match for that, but we put it 
there because at the time it was seen as being a lightweight / low-rate service 
that would mostly be used directly by humans to execute one-off complicated 
queries.  The plans in this ticket sound nothing like that, and `cache_misc` 
probably isn't an appropriate home for a complex query service that's going to 
backend serious query load from wikis and the rest of the world...


TASK DETAIL
  https://phabricator.wikimedia.org/T126730



