Hello Konrad,

The best approach, as usual, is decimation in its original style.

E.g.

select ?s ?p ?o from <some-graph> where {
  ?s ?p ?o .
  FILTER (1 > bif:rnd (10, ?s, ?p, ?o))
}

By tweaking the first argument of bif:rnd() and the left-hand side of
the inequality you can adjust the decimation ratio from 1/10 to any
desired value. What is important to know is that the SQL optimizer is
allowed to evaluate bif:rnd (10) only once, at the beginning of the
query, so we had to pass three additional arguments that become known
only when a table row is fetched. As a result, bif:rnd (10, ?s, ?p, ?o)
is calculated for every row, and any given row is either returned or
skipped independently of the others.
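The per-row call amounts to an independent Bernoulli trial for each
fetched row. A minimal Java sketch of that effect outside the database
(the class and method names here are made up for illustration only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Decimate {
    // Keep each row independently with probability numerator/denominator,
    // mimicking FILTER (numerator > rnd (denominator, ?s, ?p, ?o))
    // being evaluated once per fetched row.
    static List<Integer> decimate(List<Integer> rows, int numerator,
                                  int denominator, long seed) {
        Random rng = new Random(seed);
        List<Integer> kept = new ArrayList<>();
        for (int row : rows) {
            if (rng.nextInt(denominator) < numerator) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100000; i++) rows.add(i);
        double ratio = (double) decimate(rows, 1, 10, 42L).size()
                       / rows.size();
        System.out.println(ratio); // roughly 1/10 of the rows survive
    }
}
```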

However, bif:rnd (10, ?s, ?p, ?o) contains a subtle inefficiency. In
the RDF store, graph nodes are stored as numeric IRI IDs, and literal
objects can be stored in a separate table. A call to an SQL function
needs arguments of traditional SQL datatypes, so the query processor
will extract the text of the IRI for each node and the full value for
each literal object. That is a significant waste of time. The
workaround is

sparql select ?s ?p ?o from <some-graph> where { ?s ?p ?o .
  FILTER (1 > <SHORT_OR_LONG::bif:rnd> (10, ?s, ?p, ?o))  }

that tells the SPARQL front-end to omit the redundant conversions of
values.
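Applied to the Settlement case from the question below, a roughly 1%
sample of all triples of Settlement instances could look like this
(a sketch: the graph IRI is illustrative, and the dbpedia-owl prefix is
assumed to be declared on the endpoint):

```
sparql select ?s ?p ?o from <http://dbpedia.org> where {
  ?s a dbpedia-owl:Settlement .
  ?s ?p ?o .
  FILTER (1 > <SHORT_OR_LONG::bif:rnd> (100, ?s, ?p, ?o))
}
```

Raising the left side to, say, 3 would give a roughly 3% sample.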


Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com


On Mon, 2011-04-11 at 10:52 +0200, Konrad Höffner wrote:
> Hello,
> 
> I would like to know the best method to get a random sample of all 
> triples for a subset of all the resources of a SPARQL endpoint, e.g.:
> 
> select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s a 
> dbpedia-owl:Settlement} limit 5}}
> 
> The problem here is that the choice of resources returned by the
> subquery "select ?s where {?s a dbpedia-owl:Settlement} limit 5"
> depends on an arbitrary ordering and may thus return, for example, only
> cities in the same country (not impossible, but improbable in a random
> sample), making it unsuitable for some purposes, for instance learning
> tasks.
> 
> There is a similar question in the jena mailing list 
> http://tech.groups.yahoo.com/group/jena-dev/message/36776 but the 
> solution proposed there is the creation of a custom filter function in 
> ARQ, which to my knowledge works client side and thus is infeasible on 
> large knowledge bases.
> 
> Now the solution that we thought about was to do it in several steps, 
> namely:
> 
> 1. Counting the number of relevant resources
> 
> select count(?s) as ?count where {?s a dbpedia-owl:Settlement}
> 
> 
> 2. Generating a random subset S of the set {0,..,count-1}, with a size 
> of n. This is easily doable in java, however the most efficient 
> algorithm depends on the relation of count and n because of several ways 
> to prevent duplicates and is discussed in this thread (German language, 
> however):
> http://www.java-forum.org/mathematik/116289-random-sample.html
> 
> 3 a. Querying the resources in a loop (Java pseudo code):
> 
> for(int s: S)
> 
> {
>   String queryString = "select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s 
> a dbpedia-owl:Settlement} offset "+s+" limit 1}}";
>   sample.addAll(query(queryString));
> }
> 
> 
> 3 b. Because the execution of 3a is very slow for large enough n, a 
> possible optimisation would be the merge of several such queries in a union:
> 
> select ?s ?p ?o where
> {?s ?p ?o.
> {select ?s where {?s a dbpedia-owl:Settlement} offset "+s1+" limit 1} UNION
> {select ?s where {?s a dbpedia-owl:Settlement} offset "+s2+" limit 1} UNION
> ...
> {select ?s where {?s a dbpedia-owl:Settlement} offset "+sn+" limit 1}
> 
> }
> 
> 
> The problem with 3b is that a Virtuoso SPARQL query can only be a 
> certain size, doing this with n = 100 on the DBpedia SPARQL endpoint 
> resulted in the error "414 Request-URI Too Large".
> 
> Now it would be possible to send the queries in batches of 50 and merge 
> them afterwards, but this still seems an awkward and slow solution to
> the problem. Also, as far as I understand it, leaving out the "ORDER
> BY" clause makes a different ordering possible for each subquery, which
> may result in some resources being duplicated. On the other hand, using
> "ORDER BY ?s" in every subquery should also heavily degrade the
> performance.
> Thus I would like to know if there are better known solutions to get 
> random samples from a (Virtuoso) SPARQL endpoint and if there are 
> optimized solutions if the randomness does not have to stand up to 
> strict standards.
> 
> Two ideas we had were:
> 
> - implementing a random ordering function, but on the server side (so 
> that it can be used by any SPARQL query on the endpoint)
> - shuffling the internal order of resources how they are returned when 
> using no "ORDER BY" clause
> 
> Thanks,
> Konrad
> 
> _______________________________________________
> Virtuoso-users mailing list
> Virtuoso-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/virtuoso-users


