Hello,

I would like to know the best way to retrieve all triples for a random sample of the resources of a given class on a SPARQL endpoint. A naive attempt would be:

select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s a dbpedia-owl:Settlement} limit 5}}

The problem here is that the resources returned by the subquery "select ?s where {?s a dbpedia-owl:Settlement} limit 5" depend on an arbitrary ordering. The subquery may, for example, return only cities in the same country, which is possible but improbable in a truly random sample. This makes the result unsuitable for some purposes, for instance learning tasks.

There is a similar question on the Jena mailing list, http://tech.groups.yahoo.com/group/jena-dev/message/36776, but the solution proposed there is a custom filter function in ARQ, which to my knowledge runs on the client side and is thus infeasible for large knowledge bases.

The solution we came up with works in several steps:

1. Counting the number of relevant resources

select count(?s) as ?count where {?s a dbpedia-owl:Settlement}


2. Generating a random subset S of size n from the set {0,..,count-1}. This is easy to do in Java; however, the most efficient algorithm depends on the ratio of count to n, because there are several ways to prevent duplicates. This is discussed in the following thread (in German):
http://www.java-forum.org/mathematik/116289-random-sample.html
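
For step 2, here is a minimal sketch in Java that uses Floyd's sampling algorithm, which draws exactly n distinct values without retries regardless of the ratio of count to n. This is only one option and not necessarily what the linked thread recommends. The names OffsetSampler and sampleOffsets are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

class OffsetSampler {

    // Floyd's algorithm: returns n distinct integers from {0, ..., count-1}.
    // Every subset of size n is equally likely; needs exactly n random draws.
    static Set<Integer> sampleOffsets(int count, int n, Random rnd) {
        if (n < 0 || n > count) {
            throw new IllegalArgumentException("need 0 <= n <= count");
        }
        Set<Integer> chosen = new LinkedHashSet<>();
        for (int j = count - n; j < count; j++) {
            int t = rnd.nextInt(j + 1);   // uniform in [0, j]
            // If t was already picked, j itself cannot be in the set yet,
            // so adding j keeps all values distinct.
            if (!chosen.add(t)) {
                chosen.add(j);
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Example: 5 distinct offsets out of an assumed count of 1000.
        System.out.println(sampleOffsets(1000, 5, new Random()));
    }
}
```

The returned set can be used directly as S in step 3. Note that the iteration order of the set is not itself a uniformly random permutation, which does not matter when the values are only used as offsets.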

3 a. Querying the resources in a loop (Java pseudo code):

for (int s : S) {
    // fetch all triples of the resource at offset s (one request per resource)
    String queryString = "select ?s ?p ?o where {?s ?p ?o. "
        + "{select ?s where {?s a dbpedia-owl:Settlement} offset " + s + " limit 1}}";
    sample.addAll(query(queryString));
}


3 b. Because 3a is very slow for large n, a possible optimisation is to merge several such queries into a single union:

select ?s ?p ?o where
{?s ?p ?o.
{select ?s where {?s a dbpedia-owl:Settlement} offset "+s1+" limit 1} UNION
{select ?s where {?s a dbpedia-owl:Settlement} offset "+s2+" limit 1} UNION
...
{select ?s where {?s a dbpedia-owl:Settlement} offset "+sn+" limit 1}

}


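The problem with 3b is that the size of a query sent to a Virtuoso SPARQL endpoint is limited: doing this with n = 100 on the DBpedia SPARQL endpoint resulted in the error "414 Request-URI Too Large" (an HTTP status indicating that the request URL, which carries the query, was too long).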

It would be possible to send the queries in batches of 50 and merge the results afterwards, but this still seems an awkward and slow solution. Also, as far as I understand, leaving out the "ORDER BY" clause allows each subquery to use a different ordering, so the same resource may be returned for more than one offset. On the other hand, using "ORDER BY ?s" in every subquery would probably degrade performance heavily. Thus I would like to know whether there are better known solutions for getting random samples from a (Virtuoso) SPARQL endpoint, and whether there are optimized solutions if the randomness does not have to meet strict standards.
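
To make the batching workaround concrete, here is a minimal Java sketch that splits the sampled offsets into batches and builds one union query per batch, following the query shape from 3b. The names UnionQueryBuilder, unionQuery and batchedQueries are made up for illustration, and the batch size of 50 is just the value mentioned above, not a tested limit. Sending the queries and merging the results is left out.

```java
import java.util.ArrayList;
import java.util.List;

class UnionQueryBuilder {

    // Builds one query that fetches all triples for the resources at the given offsets.
    static String unionQuery(List<Integer> offsets) {
        StringBuilder sb = new StringBuilder("select ?s ?p ?o where {?s ?p ?o. ");
        for (int i = 0; i < offsets.size(); i++) {
            if (i > 0) {
                sb.append(" UNION ");
            }
            sb.append("{select ?s where {?s a dbpedia-owl:Settlement} offset ")
              .append(offsets.get(i))
              .append(" limit 1}");
        }
        return sb.append("}").toString();
    }

    // Splits the offsets into chunks of at most batchSize and builds one query per chunk.
    static List<String> batchedQueries(List<Integer> offsets, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> queries = new ArrayList<>();
        for (int from = 0; from < offsets.size(); from += batchSize) {
            int to = Math.min(from + batchSize, offsets.size());
            queries.add(unionQuery(offsets.subList(from, to)));
        }
        return queries;
    }

    public static void main(String[] args) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            offsets.add(i * 7);   // placeholder offsets; in practice use the random subset S
        }
        for (String q : batchedQueries(offsets, 50)) {
            System.out.println(q.length() + " characters");
        }
    }
}
```

This does not address the ordering concern: without "ORDER BY", nothing in the sketch prevents two offsets from resolving to the same resource.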

Two ideas we had were:

- implementing a random ordering function, but on the server side (so that it can be used by any SPARQL query on the endpoint)
- shuffling the internal order in which resources are returned when no "ORDER BY" clause is used

Thanks,
Konrad
