On 27/09/11 21:14, Stephen Allen wrote:
All,
I am working on removing memory limits for CONSTRUCT queries. Since the
spec requires that the triples created by the graph template be combined
by a set union [1] (section 16.2 in the 1.1 WD), we need to remove
duplicates.
Sort of - something has to remove duplicates, but if the triples are
being inserted into a store that does duplicate suppression anyway, then
the CONSTRUCT need not do so perfectly.
DISTINCT on the WHERE part would be effective in reducing the number of
triples, even if there is still the possibility of duplicates.
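For example (untested, just to show the shape - the inner SELECT
DISTINCT removes duplicate bindings before the template is
instantiated, although in general the constructed triples can still
contain duplicates):

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.Syntax;

public class DistinctWhereDemo
{
    public static void main(String[] args)
    {
        String qs =
            "CONSTRUCT { ?s ?p ?o } WHERE {" +
            " SELECT DISTINCT ?s ?p ?o WHERE { ?s ?p ?o }" +
            " }" ;
        // Parse with SPARQL 1.1 syntax (sub-SELECT is a 1.1 feature).
        Query q = QueryFactory.create(qs, Syntax.syntaxSPARQL_11) ;
        System.out.println(q) ;
    }
}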
Other simple techniques, like a sliding window of previously generated
triples, might also be a lot easier than perfect duplicate removal.
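Something along these lines, say (untested sketch; the class name and
window size are made up for illustration):

import java.util.LinkedHashMap;
import java.util.Map;

import com.hp.hpl.jena.graph.Triple;

/** Suppress a triple if it was seen within the last WINDOW_SIZE triples.
 *  Imperfect by design: duplicates further apart than the window get
 *  through, but memory use stays bounded. */
public class RecentTripleFilter
{
    private static final int WINDOW_SIZE = 10000 ;

    // Insertion-ordered map; removeEldestEntry turns it into a
    // fixed-size window over the stream of generated triples.
    private final Map<Triple, Boolean> window =
        new LinkedHashMap<Triple, Boolean>()
        {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Triple, Boolean> eldest)
            { return size() > WINDOW_SIZE ; }
        } ;

    /** True if the triple was not seen recently and should be emitted. */
    public boolean accept(Triple t)
    {
        if ( window.containsKey(t) )
            return false ;
        window.put(t, Boolean.TRUE) ;
        return true ;
    }
}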
As the triples do not need to be ordered (section 16.2.3), it
seems like DistinctDataBag<Triple> will fit the bill as a temporary storage
container/duplicate remover.
DistinctDataBag is currently implemented as an in-memory HashMap that
spills sorted data to disk. When the data is read back using a
merge-sort, all duplicates are adjacent and can thus be easily removed.
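(The adjacent-duplicates trick in isolation, without the spill-to-disk
machinery - a sketch over any iterator already known to be sorted:)

import java.util.Iterator;
import java.util.NoSuchElementException;

/** Drops adjacent duplicates from an iterator whose input is sorted,
 *  so equal elements are guaranteed to sit next to each other. */
public class AdjacentDistinctIterator<T> implements Iterator<T>
{
    private final Iterator<T> sorted ;
    private T pending = null ;
    private T last = null ;

    public AdjacentDistinctIterator(Iterator<T> sorted)
    { this.sorted = sorted ; }

    public boolean hasNext()
    {
        // Pull until we find an element different from the last one
        // emitted, or the underlying iterator runs out.
        while ( pending == null && sorted.hasNext() )
        {
            T t = sorted.next() ;
            if ( last == null || !t.equals(last) )
                pending = t ;
        }
        return pending != null ;
    }

    public T next()
    {
        if ( !hasNext() )
            throw new NoSuchElementException() ;
        last = pending ;
        pending = null ;
        return last ;
    }

    public void remove() { throw new UnsupportedOperationException() ; }
}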
Anyway, to make a long story short, I need a Comparator<Triple> to pass to
the DistinctDataBag. Do we have such a thing? If not, what is the best way
to implement this? I have an attempt below, but I need some help on the
NodeComparator class.
Not directly but there is:
NodeUtils.compareRDFTerms
NodeValue.compareAlways
NodeValue.compare
I think you want NodeValue.compareAlways.
NodeValue.compare backs =, <, > in SPARQL.
This is value-based:
   123 > 45
   123 < 456
(a plain comparison of the lexical forms would instead give 123 < 45,
since "123" sorts before "45" as a string).
Not everything is comparable. ARQ sorting does impose a total order on
bindings by using NodeValue.compare, then the SPARQL rules for unlike
objects, then doing a guaranteed-to-have-an-answer
NodeUtils.compareRDFTerms.
The end result is arbitrary - e.g. bNode labels are compared to order
bNodes.
So it's value-comparison, then SPARQL type ordering, then a total
ordering of Nodes.
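To make that concrete (a rough sketch, not from the codebase; the
literal values are arbitrary):

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.expr.NodeValue;
import com.hp.hpl.jena.sparql.util.NodeUtils;

public class CompareDemo
{
    public static void main(String[] args)
    {
        NodeValue a = NodeValue.makeInteger(123) ;
        NodeValue b = NodeValue.makeInteger(45) ;

        // Value-based: 123 > 45 as integers, so this is positive.
        System.out.println(NodeValue.compare(a, b)) ;

        // compareAlways adds the fallback layers, so it is safe to use
        // when sorting mixed, possibly non-comparable terms.
        System.out.println(NodeValue.compareAlways(a, b)) ;

        // Purely syntactic total ordering over arbitrary RDF terms.
        Node n1 = a.asNode() ;
        Node n2 = b.asNode() ;
        System.out.println(NodeUtils.compareRDFTerms(n1, n2)) ;
    }
}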
-Stephen
[1] It would be nice if it were a bag union and we had something like
CONSTRUCT DISTINCT for set semantics. I'm guessing it is not possible to
change the spec for backwards-compatibility reasons with SPARQL 1.0?
Correct. And also, an RDF graph is a set of triples, so it's not just
SPARQL.
Other systems (Sesame, Redland) have, or had, duplicated triple results,
arguing it was then up to the app to sort it out if it cared (an
efficiency argument), but these have all gone now, I think. Users wanted
no duplicates.
Andy
==
import java.util.Comparator;

import com.hp.hpl.jena.graph.Triple;

public class TripleComparator implements Comparator<Triple>
{
    private final NodeComparator nc = new NodeComparator();

    public int compare(Triple o1, Triple o2)
    {
        // Compare by subject, then predicate, then object.
        int toReturn = nc.compare(o1.getSubject(), o2.getSubject());
        if (toReturn == 0)
        {
            toReturn = nc.compare(o1.getPredicate(), o2.getPredicate());
            if (toReturn == 0)
            {
                toReturn = nc.compare(o1.getObject(), o2.getObject());
            }
        }
        return toReturn;
    }
}
import java.util.Comparator;

import com.hp.hpl.jena.graph.Node;

public class NodeComparator implements Comparator<Node>
{
    public int compare(Node o1, Node o2)
    {
        // TODO: What should I be doing here?
        // Comparing the string form of the indexing value gives *an*
        // ordering, but probably not a sensible one.
        return o1.getIndexingValue().toString()
                 .compareTo(o2.getIndexingValue().toString());
    }
}
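Following up on Andy's suggestion, the TODO could perhaps become
something like this (an untested sketch; it assumes
NodeValue.makeNode/NodeValue.compareAlways cope with any concrete node,
including bNodes):

import java.util.Comparator;

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.expr.NodeValue;

public class NodeComparator implements Comparator<Node>
{
    public int compare(Node o1, Node o2)
    {
        // Value comparison where possible, falling back to a syntactic
        // total ordering of RDF terms - the layering described above.
        return NodeValue.compareAlways(NodeValue.makeNode(o1),
                                       NodeValue.makeNode(o2));
    }
}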