[Virtuoso-users] Best way to query a random sample?

Konrad Höffner Mon, 11 Apr 2011 08:52:26 +0000

Hello,

I would like to know the best method to get a random sample of all triples for a subset of all the resources of a SPARQL endpoint, e.g.:

select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s a dbpedia-owl:Settlement} limit 5}}

The problem here is that the choice of resources returned by the subquery "select ?s where {?s a dbpedia-owl:Settlement} limit 5" depends on an arbitrary ordering and may thus return only cities in the same country, for example (which is not impossible but improbable in a random sample), making it unsuitable for some purposes, for instance learning tasks.

There is a similar question in the Jena mailing list, http://tech.groups.yahoo.com/group/jena-dev/message/36776, but the solution proposed there is the creation of a custom filter function in ARQ, which to my knowledge works client side and thus is infeasible on large knowledge bases.

Now the solution that we thought about was to do it in several steps, namely:


1. Counting the number of relevant resources

select count(?s) as ?count where {?s a dbpedia-owl:Settlement}

2. Generating a random subset S of the set {0,..,count-1}, with a size of n. This is easily doable in Java; however, the most efficient algorithm depends on the relation of count and n, because there are several ways to prevent duplicates. This is discussed in the following thread (in German, however):

http://www.java-forum.org/mathematik/116289-random-sample.html
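
A minimal sketch of step 2, assuming count fits into an int and n <= count. It uses Floyd's sampling algorithm, which is one way to avoid duplicates without shuffling the whole range and is not necessarily the approach recommended in the thread above. The class and method names (OffsetSampler, sampleOffsets) are made up for illustration:

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

class OffsetSampler {
    /**
     * Draws n distinct offsets from {0, ..., count-1} with Floyd's algorithm.
     * Every n-element subset is equally likely; the iteration order of the
     * returned set is not a uniformly random permutation.
     */
    static Set<Integer> sampleOffsets(int count, int n, Random rng) {
        if (n < 0 || n > count) {
            throw new IllegalArgumentException("need 0 <= n <= count");
        }
        Set<Integer> chosen = new LinkedHashSet<>();
        for (int j = count - n; j < count; j++) {
            int t = rng.nextInt(j + 1); // uniform in [0, j]
            // If t was already drawn, take j instead; j cannot be in the set yet.
            if (!chosen.add(t)) {
                chosen.add(j);
            }
        }
        return chosen;
    }
}
```

This needs exactly n random draws and O(n) memory regardless of how large count is, so it should behave reasonably both when n is tiny compared to count and when n is close to count.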

3 a. Querying the resources in a loop (Java pseudo code):

for (int s : S)
{
 // query(...) stands for whatever client call sends the query and returns the result triples
 String queryString = "select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s a dbpedia-owl:Settlement} offset " + s + " limit 1}}";
 sample.addAll(query(queryString));
}

3 b. Because the execution of 3a is very slow for large enough n, a possible optimisation would be the merge of several such queries in a union:


select ?s ?p ?o where
{?s ?p ?o.
{select ?s where {?s a dbpedia-owl:Settlement} offset "+s1+" limit 1} UNION
{select ?s where {?s a dbpedia-owl:Settlement} offset "+s2+" limit 1} UNION
...
{select ?s where {?s a dbpedia-owl:Settlement} offset "+sn+" limit 1}

}
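
Building the query text of 3b from a list of offsets could look roughly like the sketch below. The names UnionQueryBuilder and buildUnionQuery are hypothetical, and the dbpedia-owl prefix is assumed to be predefined on the endpoint, as in the queries above:

```java
import java.util.List;
import java.util.stream.Collectors;

class UnionQueryBuilder {
    // Pattern selecting the resources to sample from (taken from the queries above).
    static final String CLASS_PATTERN = "?s a dbpedia-owl:Settlement";

    /** Builds one query with a "offset X limit 1" subselect per offset, joined by UNION. */
    static String buildUnionQuery(List<Integer> offsets) {
        if (offsets.isEmpty()) {
            throw new IllegalArgumentException("at least one offset is required");
        }
        String branches = offsets.stream()
            .map(o -> "{select ?s where {" + CLASS_PATTERN + "} offset " + o + " limit 1}")
            .collect(Collectors.joining(" UNION\n"));
        return "select ?s ?p ?o where\n{?s ?p ?o.\n" + branches + "\n}";
    }
}
```

For example, buildUnionQuery(Arrays.asList(3, 7)) yields a query with two subselects, one with "offset 3 limit 1" and one with "offset 7 limit 1", joined by a single UNION.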

The problem with 3b is that a Virtuoso SPARQL query can only be a certain size; doing this with n = 100 on the DBpedia SPARQL endpoint resulted in the error "414 Request-URI Too Large".

Now it would be possible to send the queries in batches of 50 and merge them afterwards, but this still seems to be an awkward and slow solution to the problem. Also, as far as I understand it, leaving out the "ORDER BY" statement makes a different ordering for each sub query possible, which may result in some resources being duplicates. On the other hand, using "ORDER BY ?s" in every sub query should also heavily degrade the performance.

Thus I would like to know if there are better known solutions to get random samples from a (Virtuoso) SPARQL endpoint, and if there are optimized solutions if the randomness does not have to stand up to strict standards.
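
For reference, the batching workaround mentioned above (splitting the offsets into groups of, say, 50 before sending one union query per group) can be sketched as follows. OffsetBatcher and partition are illustrative names only, and the batch size that actually avoids the 414 error on a given endpoint would have to be found by trial:

```java
import java.util.ArrayList;
import java.util.List;

class OffsetBatcher {
    /** Splits the offsets into consecutive batches of at most batchSize elements. */
    static List<List<Integer>> partition(List<Integer> offsets, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < offsets.size(); i += batchSize) {
            int end = Math.min(i + batchSize, offsets.size());
            // Copy the sublist so each batch is independent of the input list.
            batches.add(new ArrayList<>(offsets.subList(i, end)));
        }
        return batches;
    }
}
```

Each batch would then be turned into one union query, and the results merged on the client; collecting them in a set would also drop duplicate triples caused by inconsistent orderings between sub queries.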


Two ideas we had were:

- implementing a random ordering function, but on the server side (so that it can be used by any SPARQL query on the endpoint)
- shuffling the internal order in which resources are returned when using no "ORDER BY" clause


Thanks,
Konrad
