Basically I have two use cases for the SPARQL endpoint: 1. concept finding
for bot activities, 2. example/tutorial/showcase queries.
Starting with the second: especially for prototyping, an (extensive)
caching time is totally acceptable to me, and definitely worth it if it
improves the overall performance of the endpoint.

In bot activities, the bot currently freezes when the last update of the
WDQS is more than 5 minutes old. The main reason for using the WDQS in our
bot efforts is concept resolution (i.e. does a concept with one or more of
its properties already exist on WD). The chances that a duplicate item is
created within that 5-minute window are slim, and if it happens, it is
easily fixed manually.
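
For illustration, a minimal Python sketch of both checks. The lag query
(schema:dateModified on <http://www.wikidata.org>) is one common way to ask
WDQS for its last update time; the ASK pattern is just a placeholder for
whatever concept-resolution check a bot needs (P351 = NCBI gene ID, with an
arbitrary example value):

    import requests
    from datetime import datetime, timezone, timedelta

    ENDPOINT = "https://query.wikidata.org/sparql"
    HEADERS = {"Accept": "application/sparql-results+json"}

    # schema: and wdt: prefixes are predefined on the WDQS endpoint.
    LAG_QUERY = 'SELECT ?t WHERE { <http://www.wikidata.org> schema:dateModified ?t }'
    # Example concept-resolution check; swap in the property/value you need.
    EXISTS_QUERY = 'ASK { ?item wdt:P351 "7157" }'

    def wdqs(query):
        r = requests.get(ENDPOINT, params={"query": query}, headers=HEADERS)
        r.raise_for_status()
        return r.json()

    def lag_ok(max_lag=timedelta(minutes=5)):
        """True if WDQS was updated within the last 5 minutes."""
        t = wdqs(LAG_QUERY)["results"]["bindings"][0]["t"]["value"]
        updated = datetime.fromisoformat(t.replace("Z", "+00:00"))
        return datetime.now(timezone.utc) - updated <= max_lag

    def concept_exists():
        """True if an item with the example property/value already exists."""
        return wdqs(EXISTS_QUERY)["boolean"]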

So if it would improve the performance or stability of the WDQS, I would
certainly vote for option 3.

Would it be possible to implement a select box in the GUI where users can
select their preferred caching time? Such a feature would also make new
users aware that different caching times exist.
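
Purely as a sketch of what the GUI could do with that selection (the
Cache-Control header is standard HTTP, but whether the WDQS front-end cache
would honor it is an assumption here, not a documented feature):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    def run_query(sparql, max_age_seconds):
        """Send a query with the caching time chosen in the select box."""
        return requests.get(
            ENDPOINT,
            params={"query": sparql},
            headers={
                "Accept": "application/sparql-results+json",
                # Hypothetical: ask any intermediate cache for a result no
                # older than the user's chosen caching time.
                "Cache-Control": "max-age=%d" % max_age_seconds,
            },
        ).json()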

Just my 2cts

Andra

On Wed, Feb 17, 2016 at 9:54 AM, Katie Filbert <katie.filb...@wikimedia.de>
wrote:

> On Wed, Feb 17, 2016 at 9:39 AM, Markus Krötzsch <
> mar...@semantic-mediawiki.org> wrote:
>
>> On 17.02.2016 08:16, Stas Malyshev wrote:
>>
>>> Hi!
>>>
>>>> (2) Shouldn't BlazeGraph do the caching (too)? It knows how much a query
>>>> costs to re-run and it could even know if a query is affected by a data
>>>>
>>>
>>> BlazeGraph does a lot of caching, but it's limited by memory, and AFAIK
>>> it does not do whole-query caching (like MySQL does, for example) --
>>> which means that if you run two big queries one after another, the
>>> latter could evict from the cache what the former put there. Its
>>> caching, AFAIK, is at a much lower level. That is helpful too, since
>>> different queries share a lot of underlying data, but it's not exactly
>>> our case here.
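
To illustrate Stas's point about whole-query caching: a toy cache keyed on
the exact query text, the way MySQL's query cache was -- not how Blazegraph
works internally, just the idea in a few lines of Python:

    import time

    class QueryCache:
        """Toy whole-query cache: exact query string -> (result, stored_at)."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self.store = {}

        def get(self, query, run):
            """Return a fresh cached result, or run the query and cache it."""
            hit = self.store.get(query)
            if hit is not None and time.time() - hit[1] < self.ttl:
                return hit[0]
            result = run(query)
            self.store[query] = (result, time.time())
            return result

Two textually different queries never share an entry here, which is exactly
why lower-level caching of the shared underlying data can be more effective.
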
>>>
>>>> update (a cached result might still be the same as the current result
>>>> even after many data changes). Having several caching layers is useful,
>>>> but the more elaborate (query-structure dependent) caching strategies
>>>> should maybe be left to the database.
>>>>
>>>
>>> I don't think Blazegraph does anything like resolving changes to see if
>>> query results changed; that sounds like a pretty hard thing to do in a
>>> triple store. You can manually store a specific query result AFAIK, but
>>> that's just a form of writing data as I understand it, and may not be
>>> very scalable.
>>>
>>
>> Yes, in general this would be extremely hard. There are some easy cases
>> one could catch, but it is not clear how effective this would be for our
>> load. I am just saying we should not try to build a query-aware caching
>> strategy that would better be done on a lower level.
>>
>>
>>>> The points (3)-(5) are based on guessing. As Magnus said, some analysis
>>>> could help to confirm or refute this. On the other hand, caching should
>>>> not just focus on current usage patterns only, but consider a bit what
>>>> could happen in the future.
>>>>
>>>
>>> Well, again the problem is that the one use case that I think absolutely
>>> needs caching -- namely, exporting data to graphs, maps, etc. deployed on
>>> wiki pages -- is also the one not implemented yet, because we don't have
>>> a cache (not the only thing we need, but one of them), so we've got a
>>> chicken-and-egg problem here :) Of course, we can just choose something
>>> now based on an educated guess and change it later if it works badly.
>>> That's probably what we'll do.
>>>
>>
>> Yes, it is hard to predict what load this will create. The caching levels
>> around Wikipedia prevent re-computation of the page on most page views, so
>> maybe there would not actually be very many repeated requests for the same
>> query coming from there. One option could be a dedicated caching layer just
>> for such wiki uses. On the one hand, the set of all embedded queries is
>> known upfront (so, in contrast to other uses, you already know which
>> queries will be asked). On the other hand, users may wish to do a forced
>> refresh on their side. The main danger again seems to be bursts of activity
>> (a page getting a lot of edits in a short time, where each edit invalidates
>> the ParserCache and requires refetching query results). On the positive
>> side, this specific usage of WDQS can pass its own caching parameters
>> (which we can control), so if there is a caching layer in place, one could
>> react to issues on short notice by being more conservative there than for
>> other queries.
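
Since the set of embedded queries is known upfront, such a layer could even
be pre-warmed. A minimal sketch, reusing the toy QueryCache from above (the
names and the refresh interval are hypothetical):

    import time

    def prewarm(embedded_queries, cache, run, interval_seconds=300):
        """Periodically refresh every known embedded query so that page
        views always hit a warm cache."""
        while True:
            for query in embedded_queries:
                cache.store[query] = (run(query), time.time())
            time.sleep(interval_seconds)
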
>>
>> The interesting thing about the wiki-embedding usage is that it requires
>> quick propagation of changes. Scenario: a user visits a Wikipedia page with
>> a map created from a query; the user finds an outdated item on the map; she
>> goes to Wikidata to fix it, and refreshes (edits) the page to see the
>> change. Now if she is too quick, the change will not have made it into the
>> query result yet -- she could try in a minute or so. However, if we have a
>> long caching period, her first query will have populated the cache,
>> preventing the update from showing for the maximum amount of time (the
>> whole cache period). This seems like a case where long caching would be
>> rather bad for the user experience.
>
>
> I think it would be nice if having a graph with a query on a page did not
> too adversely affect the time it takes to save the page (e.g. if running
> the query takes 20 seconds, we could instead reuse cached query results).
> Not having such usage kill / overwhelm the query service is also
> important.
>
> If we incorporate entity usage or something like that, then maybe that
> could be used to handle cache invalidation in cases where something used
> in a query changed.
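
A minimal sketch of that idea (the names are hypothetical, and real
entity-usage tracking in Wikibase works per page rather than per query):
record which entity IDs a cached result used, and drop exactly those
entries when one of the entities is edited:

    from collections import defaultdict

    class EntityAwareCache:
        """Toy cache invalidated by entity edits instead of a fixed TTL."""

        def __init__(self):
            self.results = {}                 # query -> cached result
            self.used_by = defaultdict(set)   # entity id -> queries using it

        def put(self, query, result, entity_ids):
            self.results[query] = result
            for eid in entity_ids:
                self.used_by[eid].add(query)

        def get(self, query):
            return self.results.get(query)

        def entity_changed(self, entity_id):
            """Invalidate every cached query whose result used this entity."""
            for query in self.used_by.pop(entity_id, set()):
                self.results.pop(query, None)
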
>
> Cheers,
> Katie
>
>
>
>>
>>
>> Markus
>>
>>
>>
>
>
>
> --
> Katie Filbert
> Wikidata Developer
>
> Wikimedia Germany e.V. | Tempelhofer Ufer 23-24, 10963 Berlin
> Phone (030) 219 158 26-0
>
> http://wikimedia.de
>
> Wikimedia Germany - Society for the Promotion of Free Knowledge e.V.
> Registered in the register of associations of Amtsgericht
> Berlin-Charlottenburg under number 23 855, recognized as charitable by the
> tax office for corporations I Berlin, tax number 27/681/51985.
>
>
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
