Hello,
I would like to know the best method to get a random sample of all
triples for a subset of all the resources of a SPARQL endpoint, e.g.:
select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s a
dbpedia-owl:Settlement} limit 5}}
The problem here is that the choice of resources returned by the sub
query "select ?s where {?s a dbpedia-owl:Settlement} limit 5" depends on
an arbitrary ordering and may thus return only Cities in the same
Country for example (which is not impossible but improbable in a random
sample), making it unsuitable for some purposes, for instance learning
tasks.
There is a similar question in the jena mailing list
http://tech.groups.yahoo.com/group/jena-dev/message/36776 but the
solution proposed there is the creation of a custom filter function in
ARQ, which to my knowledge works client side and thus is infeasible on
large knowledge bases.
Now the solution that we thought about was to do it in several steps,
namely:
1. Counting the number of relevant resources
select count(?s) as ?count where {?s a dbpedia-owl:Settlement}
2. Generating a random subset S of the set {0,..,count-1}, with a size
of n. This is easily doable in java, however the most efficient
algorithm depends on the relation of count and n because of several ways
to prevent duplicates and is discussed in this thread (German language,
however):
http://www.java-forum.org/mathematik/116289-random-sample.html
3 a. Querying the resources in a loop (Java pseudo code):
for(int s: S)
{
String queryString = "select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s a
dbpedia-owl:Settlement} offset "+s+" limit 1}}";
sample.addAll(query(queryString));
}
3 b. Because the execution of 3a is very slow for large enough n, a
possible optimisation would be the merge of several such queries in a union:
select ?s ?p ?o where
{?s ?p ?o.
{select ?s where {?s a dbpedia-owl:Settlement} offset "+s1+" limit 1} UNION
{select ?s where {?s a dbpedia-owl:Settlement} offset "+s2+" limit 1} UNION
...
{select ?s where {?s a dbpedia-owl:Settlement} offset "+sn+" limit 1}
}
The problem with 3b is that a Virtuoso SPARQL query can only be a
certain size, doing this with n = 100 on the DBpedia SPARQL endpoint
resulted in the error "414 Request-URI Too Large".
Now it would be possible to send the queries in batches of 50 and merge
them afterwards but this still seems to be an awkward and slow solution
to the problem. Also as far as I understand it, leaving out the "ORDER
BY" - statement makes a different ordering for each sub query possible
which may result in some resources being duplicates. On the other hand,
using "ORDER BY ?s" in every sub query should also heavily degrade the
performance.
Thus I would like to know if there are better known solutions to get
random samples from a (Virtuoso) SPARQL endpoint and if there are
optimized solutions if the randomness does not have to stand up to
strict standards.
Two ideas we had were:
- implementing a random ordering function, but on the server side (so
that it can be used by any SPARQL query on the endpoint)
- shuffling the internal order of resources how they are returned when
using no "ORDER BY" clause
Thanks,
Konrad