Hi Paulo,

A probabilistic DISTINCT operator could be useful for some cases I can think
of.  But you'd probably want to make it an extension to SPARQL since a lot
(most?) of the time you need correct results.  Unfortunately I don't think
bloom filters would help even with REDUCED queries since we pretty much need
the opposite behavior, false negatives being OK but no false positives.

In terms of DISTINCT implementations, the two choices are a hashset or
sorting.  Both of those cases could be rewritten with spill-to-disk backing
stores if the set got too large.  In this respect it seems to be similar to
JENA-44 and JENA-45.  A hashset approach provides lower latency to the first
result, but the sorting approach could be faster in total execution time
(especially when you have to start spilling to disk).  In fact, a number of
RDBMSs offer two options for constructing query plans, one minimizes total
execution time and the other minimizes the time to retrieve the first
result.

I agree it would be cool if the query operators were aware of sorted inputs
so that DISTINCT could be implemented cheaply if there were an underlying
ORDER BY (or even if the triple/quad index provided the bindings in the
correct sort order).  In this same vein it also opens up possibilities for
cheap merge joins instead of nested loop joins.

-Stephen


-----Original Message-----
From: Paolo Castagna [mailto:[email protected]] 
Sent: Friday, August 05, 2011 11:59 AM
To: [email protected]
Subject: SPARQL queries with DISTINCT 

Hi,
in ARQ there is a QueryIterDistinct which physically implements the
OpDistinct (i.e. the logical operator form the SPARQL algebra package).

QueryIterDistinct [1] uses an HashSet<Binding> to keep the already seen
bindings and generate a sequence of unique bindings.

For 'large' result sets the HashSet can use a 'lot' of RAM.

Could we use a bloom filter [2] instead?

For each binding, QueryIterDistinct will ask the bloom filter: "have
you seen this binding"? If the answer is no, we can be absolutely sure
the binding is part of the solution. However, if the answer is yes
(i.e. "I've seen this binding already") we cannot be absolutely sure
and we will not have a way to check for sure. We would suppress that
binding, incurring the risk of not including it in the final solution.
(i.e. false positives are possible, although with a low probability).

Would the use of a bloom filter (large enough to make the probability
of false positives small enough) be acceptable in a QueryIterDistinct
implementation?

Do you have other ideas on how we could reduce the memory consumption
for DISTINCT iterators over large result sets?

Another (interesting) opportunity of optimization is when DISTINCT is
used with ORDER BY, but I've not looked into it yet.

Cheers,
Paolo

 [1]
https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/h
pl/jena/sparql/engine/iterator/QueryIterDistinct.java
 [2] http://en.wikipedia.org/wiki/Bloom_filter

Reply via email to