Hi,
we have all seen SPARQL queries like these, right?
1) SELECT (COUNT(*) AS ?c) WHERE { ?s ?p ?o }
2) SELECT DISTINCT ?o WHERE { ?s a ?o }
3) SELECT DISTINCT ?p WHERE { ?s ?p ?o }
4) SELECT ?o (COUNT(?o) AS ?c) WHERE { ?s a ?o } GROUP BY ?o
1) and 3) scan the entire SPO index. 2) and 4) do pretty much the same,
assuming all your unique subjects have at least one rdf:type property.
There is little I can think of to make those queries run faster. These are very
simple analytics queries and the results could be easily computed and updated
as someone adds data to an RDF dataset.
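To make the "computed and updated as someone adds data" idea concrete, here is
a minimal Python sketch, not tied to ARQ or any real store. The class name
DatasetStats and its methods are hypothetical, and a plain Python set stands in
for the store's knowledge of whether a triple is new (an RDF graph is a set, so
a repeated triple must not be counted twice).

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class DatasetStats:
    """Running answers to queries 1) to 4), updated one triple at a time.

    Hypothetical helper for illustration; not part of ARQ or any library.
    """

    def __init__(self):
        self.triples = set()              # stand-in for the store itself
        self.property_counts = Counter()  # keys answer 3)
        self.type_counts = Counter()      # keys answer 2), values answer 4)

    def add(self, s, p, o):
        triple = (s, p, o)
        if triple in self.triples:
            return  # re-adding an existing triple changes nothing
        self.triples.add(triple)
        self.property_counts[p] += 1
        if p == RDF_TYPE:
            self.type_counts[o] += 1

    def remove(self, s, p, o):
        triple = (s, p, o)
        if triple not in self.triples:
            return  # nothing to undo
        self.triples.remove(triple)
        self._decrement(self.property_counts, p)
        if p == RDF_TYPE:
            self._decrement(self.type_counts, o)

    @staticmethod
    def _decrement(counter, key):
        counter[key] -= 1
        if counter[key] == 0:
            del counter[key]  # keep the key set equal to the DISTINCT result

    def triple_count(self):
        return len(self.triples)  # answers 1)
```

Each add or remove costs a constant number of dictionary operations, so the
four answers are always available without scanning the SPO index. In a real
store the set of triples would not be duplicated; the store would only need to
report whether an add or delete actually changed the graph.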
This reminded me of http://sqlstream.com/ (is there something similar for
SPARQL/ARQ?).
Those queries also reminded me of what I do to roughly and quickly understand
what an RDF dataset is about when a new one is thrown at me.
When I do not know anything about an RDF dataset, the first thing I want to
know is how many classes and how many instances per class.
Query 1:
--------
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?type ( COUNT(?s) as ?c )
{
?s rdf:type ?type .
}
GROUP BY ?type
ORDER BY DESC (?c)
Then I want to know how many properties and statistics about their usage.
Query 2:
--------
SELECT ?property ( COUNT(?o) as ?c )
{
?s ?property ?o .
}
GROUP BY ?property
ORDER BY DESC (?c)
Then I want to see how properties are used for each type.
Query 3:
--------
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?type ?property ( COUNT(?o) as ?c )
{
?s rdf:type ?type .
?s ?property ?o .
}
GROUP BY ?type ?property
ORDER BY ?type DESC (?c) ?property
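Keeping Query 3 up to date as triples arrive takes a bit more state than the
single-pattern queries, because it is a join on ?s: a new rdf:type triple
retroactively affects every property the subject already has. The following
Python sketch shows one way to do it for additions only. All names
(TypePropertyStats, query3_from_scratch) are hypothetical, and, as in the
query, rdf:type itself is counted as a property.

```python
from collections import Counter, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TypePropertyStats:
    """Incrementally maintained result of Query 3 (additions only).

    Hypothetical helper for illustration; not part of ARQ or any library.
    """

    def __init__(self):
        self.seen = set()                     # stand-in for the store
        self.types_of = defaultdict(set)      # subject -> its rdf:type values
        self.props_of = defaultdict(Counter)  # subject -> property -> triples
        self.counts = Counter()               # (type, property) -> ?c

    def add(self, s, p, o):
        if (s, p, o) in self.seen:
            return  # duplicate triple, the graph did not change
        self.seen.add((s, p, o))
        # The new triple joins with every type the subject already has.
        for t in self.types_of[s]:
            self.counts[(t, p)] += 1
        self.props_of[s][p] += 1
        if p == RDF_TYPE:
            # A new type joins with every triple of this subject,
            # including the rdf:type triple that was just added.
            for q, n in self.props_of[s].items():
                self.counts[(o, q)] += n
            self.types_of[s].add(o)


def query3_from_scratch(triples):
    """Reference answer: evaluate the join naively over all triples."""
    triples = set(triples)
    counts = Counter()
    for s, p, t in triples:
        if p != RDF_TYPE:
            continue
        for s2, q, _ in triples:
            if s2 == s:
                counts[(t, q)] += 1
    return counts
```

The per-subject bookkeeping is the price of the join: memory grows with the
number of subjects, whereas queries 1) to 4) only need one counter per
property or class. Handling deletions would mean reversing the same two steps.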
Sometimes I want to focus just on (or exclude) a specific namespace.
For example:
Query 4:
--------
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX afn: <http://jena.hpl.hp.com/ARQ/function#>
SELECT ?type ?property ( COUNT(?o) as ?c )
{
?s rdf:type ?type .
?s ?property ?o .
FILTER ( afn:namespace ( ?property ) = "http://xmlns.com/foaf/0.1/" )
}
GROUP BY ?type ?property
ORDER BY ?type DESC (?c) ?property
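If the per-type/per-property counts are already at hand on the client side,
the same focus-or-exclude step can be approximated without re-running the
query. This is only a sketch: filter_by_namespace is a hypothetical helper,
and it uses plain prefix matching, which is not the same rule afn:namespace
applies (a prefix test also matches properties from any longer namespace that
happens to start with the same string).

```python
FOAF = "http://xmlns.com/foaf/0.1/"


def filter_by_namespace(counts, namespace, exclude=False):
    """Keep (or, with exclude=True, drop) rows whose property IRI starts
    with `namespace`.

    `counts` maps (type, property) IRI pairs to numbers, like the rows
    returned by Query 3. Hypothetical helper for illustration only.
    """
    return {
        (t, p): c
        for (t, p), c in counts.items()
        if p.startswith(namespace) != exclude
    }
```

Calling filter_by_namespace(rows, FOAF) keeps only the FOAF properties, and
passing exclude=True gives the complementary view.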
Once I have an idea of the terms in the vocabulary and how frequently they get
used, I want to understand how they get used and what their "meaning" or "role"
is. To do this I need to look at the values/objects, and the first thing I want
to do is to focus on the most used properties and see the values of just one
property. How many distinct values are there?
Query 5:
--------
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?o ( COUNT(?o) as ?c )
{
?s bibo:pmid ?o .
}
GROUP BY ?o
ORDER BY DESC(?o)
This way, I might find identifiers of things...
Query 6:
--------
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?o ( COUNT(?o) as ?c )
{
?s dc:date ?o .
}
GROUP BY ?o
ORDER BY DESC(?o)
Or how things are distributed.
Then, I want to understand more about the structure/shape of the RDF entities
that I have in the dataset. So, I use DESCRIBE to look at examples of "things",
for example a list of 10 articles:
Query 7:
--------
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
DESCRIBE ?s
{
?s rdf:type bibo:Article
}
LIMIT 10
It is difficult to decide where to draw the line: whether it is better to have
accurate real-time results, or simply a cache layer with results updated
nightly. In any case, I'd like to explore the idea of stream querying (not new,
I know) for very simple SPARQL queries a little further and see what it would
take, in practice, to implement something like this.
Do you have interesting links or papers about stream querying systems?
Paolo