[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

Jheald Tue, 14 Feb 2017 04:08:10 -0800

Jheald added a comment.

Hi Smalyshev, thanks for taking the time to get back to me.

The LDF suggestion is a good one -- I was thinking of looking into it for investigating Commons sitelinks and P373s, both of which are now getting close to the borderline of what a query can cope with without timing out, with little time left for filtering. For these it might be quite useful to have all million of them downloaded locally. But it seems like overkill if all one wants is a quick query to check what is referencing a particular site, with maybe only a couple of hundred hits expected, or in the low thousands, while still being able to do quite general analyses on the results (and share them with other people) -- e.g. what classes of items are showing references to the given website. I haven't used LDF, but it seems it would turn what would be quite a simple query into quite a substantial scripting job. But it's a good suggestion.

Regarding "contains", I do take the point. Full text indexing is (I believe) technically possible using suffix trees, and I think Blazegraph even offer it as an option. As I understand it though, it could increase the storage requirement for text fields by a factor of anything up to 20. I don't know how much of the database is accounted for by text fields, but I could understand a reluctance to go down that route.

On the other hand, I was a bit disappointed that the query wasn't helped by switching to "strstarts". As I understand it, Blazegraph does do basic indexing on all fields, which is essential for it to be able to rapidly find and retrieve a fully specified value. This is what makes it possible for a look-up of a particular URL to be very, very fast, e.g. this query:

SELECT ?ref  WHERE {
  ?ref pr:P854 <http://artuk.org/discover/artists/velazquez-diego-15991660>
}

What I was hoping was that that indexing might survive (at least until any joins are done) when one binds the values of pr:P854 to a variable,

?ref pr:P854 ?url

If the indexing did survive, then (at least in principle) one might hope that it might then be accessible by the implementation of STRSTARTS, so that when presented with

STRSTARTS( ?url, 'http://artuk.org/discover/artists')

Blazegraph would be able to rapidly identify the first match from this index on ?url; the first non-match after those matches; and then rapidly extract everything in between for its solution set.
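
To make the idea concrete, here is a minimal Python sketch of that kind of range scan over a sorted list. It is only an illustration of the principle and says nothing about how Blazegraph's indexes actually work; the function names and the sample URLs (apart from the artuk.org one quoted above) are made up for the example.

```python
from bisect import bisect_left


def prefix_upper_bound(prefix):
    """Smallest string that sorts after every string starting with `prefix`.

    Assumes a non-empty prefix whose last character is not the highest
    Unicode code point.
    """
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_range(sorted_values, prefix):
    """Return all values starting with `prefix`, using two binary searches."""
    # Position of the first value that could match the prefix.
    first_match = bisect_left(sorted_values, prefix)
    # Position of the first value after the block of matches.
    first_non_match = bisect_left(sorted_values, prefix_upper_bound(prefix))
    # Everything in between is the solution set; nothing else is inspected.
    return sorted_values[first_match:first_non_match]


# Stand-in for an alphabetically ordered index on ?url (illustrative data).
urls = sorted([
    "http://artuk.org/discover/artists/velazquez-diego-15991660",
    "http://artuk.org/discover/artists/example-artist",
    "http://artuk.org/discover/artworks/example-work",
    "http://example.org/page",
    "http://www.figaro.fr/example",
])

hits = prefix_range(urls, "http://artuk.org/discover/artists")
```

The cost is two binary searches plus the size of the result, rather than a string test against every value, which is why the match could be fast if an ordered index were usable at that point in the query.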

Of course, with string matching on URLs there is a complication: STRSTARTS only works on strings, so one also has to do a str() cast, and to make it work the line above has to read

STRSTARTS( str(?url),  'http://artuk.org/discover/artists')

But if the URLs were already indexed using e.g. a basically alphabetical B-tree, it might be possible to do the cast to string and still preserve the index ordering, so it might still be possible to execute the line above using the index, and for it therefore still to be very fast.

So that's what I had in mind, wondering whether it might be possible for a degree of indexing to somehow be preserved through the query execution, so that the STRSTARTS match might sometimes be very fast.

Of course you're right that for URLs, the domain part of the URL eg figaro.fr or artuk.org would almost always be part of what one would want to filter for, so if it could be possible to filter rapidly on that, then the remaining solution set would very likely be small enough to use conventional string operators. So this could be a great functionality to have, if such a thing could be implemented easily, perhaps as an extra triple on reference objects.
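
As a rough illustration of the "extra triple" idea, the Python sketch below derives a host key from each reference URL and groups references by it, so that an exact look-up on the host narrows things down before any string matching. The helper names, the "ref1"-style identifiers, the choice to strip a leading "www." and the sample URLs other than the artuk.org one are all assumptions for the example, not anything WDQS or Wikibase currently does.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_key(url):
    """Host part of a URL, with a leading 'www.' dropped (an assumption)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_host_index(ref_urls):
    """Group (ref, url) pairs by host, mimicking an extra indexed value."""
    index = defaultdict(list)
    for ref, url in ref_urls:
        index[host_key(url)].append((ref, url))
    return index


# Hypothetical reference nodes and their URLs.
ref_urls = [
    ("ref1", "http://artuk.org/discover/artists/velazquez-diego-15991660"),
    ("ref2", "http://artuk.org/discover/artworks/example-work"),
    ("ref3", "http://www.figaro.fr/example"),
]

index = build_host_index(ref_urls)

# Exact look-up on the host first, then an ordinary string test on the
# (much smaller) remaining set.
artists = [
    ref for ref, url in index["artuk.org"]
    if url.startswith("http://artuk.org/discover/artists")
]
```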

But I can't help thinking that this kind of STRSTARTS request must be something that comes up often enough, in enough different production contexts, that it might be worth Blazegraph's time to see whether some kind of indexing-based acceleration like the above thought-outline could sometimes be possible.

TASK DETAIL

https://phabricator.wikimedia.org/T157811

To: Jheald
Cc: Smalyshev, Aklapper, Jheald, EBjune, merbst, Avner, debt, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, Izno, jkroll, Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331