Hi,
we have all seen SPARQL queries like these, right?

 1) SELECT (COUNT(*) AS ?c) WHERE { ?s ?p ?o }
 2) SELECT DISTINCT ?o WHERE { ?s a ?o }
 3) SELECT DISTINCT ?p WHERE { ?s ?p ?o }
 4) SELECT ?o (COUNT(?o) AS ?c) WHERE { ?s a ?o } GROUP BY ?o

1) and 3) scan the entire SPO index. 2) and 4) do pretty much the same, 
assuming all your unique subjects have at least one rdf:type property.
There is little I can think of to make those queries run faster. These are very 
simple analytics queries, and the results could easily be computed and kept up 
to date as someone adds data to an RDF dataset.
This reminded me of http://sqlstream.com/ (is there something similar for 
SPARQL/ARQ?).
Those queries also reminded me of what I do to get a rough, quick understanding 
of what an RDF dataset is about when a new one is thrown at me.
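
To make the "computed and updated as someone adds data" idea concrete, here is a
minimal Python sketch. It is not an ARQ feature: the class name RunningCounts and
its added/removed hooks are hypothetical, and it assumes the store only reports
triples that actually changed the dataset (no duplicates).

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class RunningCounts:
    """Hypothetical helper (not part of ARQ): counts kept current on every update."""

    def __init__(self):
        self.triples = 0              # what query 1) returns
        self.by_property = Counter()  # property -> number of triples, covers 3)
        self.by_class = Counter()     # class -> number of instances, covers 2) and 4)

    def _apply(self, p, o, delta):
        self.triples += delta
        self.by_property[p] += delta
        if p == RDF_TYPE:
            self.by_class[o] += delta

    def added(self, s, p, o):
        # Assumption: called only for triples that were not already in the store.
        self._apply(p, o, +1)

    def removed(self, s, p, o):
        # Assumption: called only for triples that really were in the store.
        self._apply(p, o, -1)


counts = RunningCounts()
counts.added("ex:a1", RDF_TYPE, "ex:Article")
counts.added("ex:a1", "ex:title", "Some title")
counts.added("ex:a2", RDF_TYPE, "ex:Article")
print(counts.triples, counts.by_class.most_common())
```

Each update costs a constant amount of work, so none of the four queries would
need an index scan. After removals a counter can hold zero entries, which the
DISTINCT variants would have to skip.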

When I do not know anything about an RDF dataset, the first thing I want to 
know is how many classes there are and how many instances each class has.

Query 1:
--------

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?type ( COUNT(?s) as ?c )
{
  ?s rdf:type ?type .
}
GROUP BY ?type
ORDER BY DESC (?c)


Then I want to know how many properties there are and how often each one is used.

Query 2:
--------

SELECT ?property ( COUNT(?o) as ?c )
{
  ?s ?property ?o .
}
GROUP BY ?property
ORDER BY DESC (?c)


Then I want to see how properties are used for each type.

Query 3:
--------

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?type ?property ( COUNT(?o) as ?c )
{
  ?s rdf:type ?type .
  ?s ?property ?o .
}
GROUP BY ?type ?property
ORDER BY ?type DESC (?c) ?property
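
Keeping the result of Query 3 up to date as triples arrive is less trivial than
a plain counter, because the query is a join: a triple can arrive before or after
the rdf:type triple of its subject. A rough Python sketch follows, assuming only
additions, no duplicate triples, and enough memory for per-subject state. The
class name TypePropertyCounts is hypothetical.

```python
from collections import Counter, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TypePropertyCounts:
    """Hypothetical helper: the (type, property) counts of Query 3, updated per triple."""

    def __init__(self):
        self.types = defaultdict(set)      # subject -> classes seen so far
        self.props = defaultdict(Counter)  # subject -> property -> number of triples
        self.counts = Counter()            # (type, property) -> ?c

    def added(self, s, p, o):
        # Assumption: called only for triples that were not already in the store.
        for t in self.types[s]:
            self.counts[(t, p)] += 1       # classes s already has see the new triple
        self.props[s][p] += 1
        if p == RDF_TYPE:
            self.types[s].add(o)
            for q, n in self.props[s].items():
                self.counts[(o, q)] += n   # the new class sees every triple of s


tp = TypePropertyCounts()
tp.added("ex:a1", "ex:title", "Some title")   # arrives before the type is known
tp.added("ex:a1", RDF_TYPE, "ex:Article")
tp.added("ex:a1", "ex:title", "Another title")
print(tp.counts.most_common())
```

The per-subject state is the price of the join: unlike Query 1 and Query 2, this
one cannot be maintained from the incoming triple alone.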


Sometimes I want to focus just on (or exclude) a specific namespace.

For example:

Query 4:
--------

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX afn: <http://jena.hpl.hp.com/ARQ/function#>
SELECT ?type ?property ( COUNT(?o) as ?c )
{
  ?s rdf:type ?type .
  ?s ?property ?o .
  FILTER ( afn:namespace ( ?property ) = "http://xmlns.com/foaf/0.1/" )
}
GROUP BY ?type ?property
ORDER BY ?type DESC (?c) ?property


Once I have an idea of the terms in the vocabulary and how frequently they get 
used, I want to understand how they get used and what their "meaning" or "role" is.
To do this I need to look at the values/objects, and the first thing I want to 
do is to focus on the most used properties and see the values of just that 
property. How many distinct values are there?

Query 5:
--------

PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?o ( COUNT(?o) as ?c )
{
  ?s bibo:pmid ?o .
}
GROUP BY ?o
ORDER BY DESC(?o)


This way, I might find identifiers of things...


Query 6:
--------

PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?o ( COUNT(?o) as ?c )
{
  ?s dc:date ?o .
}
GROUP BY ?o
ORDER BY DESC(?o)


Or how things are distributed.


Then, I want to understand more about the structure/shape of the RDF entities 
that I have in the dataset. So, I use DESCRIBE to look at examples of "things", 
for example a list of 10 articles:

Query 7:
--------

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
DESCRIBE ?s
{
  ?s rdf:type bibo:Article
}
LIMIT 10


It is difficult to decide where to draw the line: whether it is better to have 
accurate real-time results or simply a cache layer with results updated nightly.
In any case, I'd like to explore the idea of stream querying (not new, I know) 
for very simple SPARQL queries a little bit further and see what it would take, 
in practice, to implement something like this.
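
For comparison, the cache-layer option could be as small as the Python sketch
below. Everything here is an assumption for illustration: CachedResult is a
hypothetical name, and run_query stands in for whatever actually executes the
SPARQL query.

```python
import time


class CachedResult:
    """Hypothetical cache layer: recompute a query result at most once per max_age."""

    def __init__(self, compute, max_age_seconds, clock=time.monotonic):
        self._compute = compute          # zero-argument callable that runs the query
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the behaviour can be tested
        self._value = None
        self._computed_at = None

    def get(self):
        now = self._clock()
        stale = self._computed_at is None or now - self._computed_at >= self._max_age
        if stale:
            self._value = self._compute()
            self._computed_at = now
        return self._value


def run_query():
    # Placeholder for a slow analytics query such as Query 1.
    return {"ex:Article": 2}


nightly = CachedResult(run_query, max_age_seconds=24 * 60 * 60)
print(nightly.get())
```

Readers may see results up to max_age_seconds old, which is exactly the
trade-off against the streaming approach.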

Do you have any interesting links or papers about stream querying systems?

Paolo
