Hi,

I want to be able to tell whether SPARQL queries are interchangeable, meaning that they have the same semantics and produce the same results. I'd appreciate any tips on detecting interchangeable queries using Jena.
Here's what I have tried so far. I start with computationally cheaper comparisons and move on to more expensive ones only if the compared queries aren't detected as interchangeable.

First, I compare the queries as strings to catch verbatim matches. If they don't match, I parse them into instances of org.apache.jena.query.Query and compare them using org.apache.jena.sparql.core.QueryCompare. This detects interchangeable queries that have minor syntactic differences, such as different letter case in SPARQL keywords (e.g., "SELECT" vs. "select"). However, QueryCompare treats queries as different when, for example, one writes an IRI in absolute form and the other writes the same IRI in compact form.

Therefore, if the queries aren't detected as interchangeable in this step, I convert them to SPARQL algebra using the compile() method of org.apache.jena.sparql.algebra.Algebra and compare the resulting algebra objects. This way, queries using the absolute and compact forms of the same IRI are treated as equal.

However, there are other interchangeable queries that produce unequal algebra. For example (the queries I mentioned in https://mail-archives.apache.org/mod_mbox/jena-users/201607.mbox/browser):

    # Query 1
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    SELECT *
    WHERE {
      ?concept skos:broader [ skos:prefLabel ?broaderLabel ] .
    }

    # Query 2
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    SELECT *
    WHERE {
      ?concept skos:broader/skos:prefLabel ?broaderLabel .
    }

This is why I try another step to detect interchangeable queries, which is to perform algebra optimization. Simply calling Algebra.optimize() on the example queries doesn't make their algebra equal, but there is a workaround using a custom NodeIsomorphismMap (see https://mail-archives.apache.org/mod_mbox/jena-users/201607.mbox/browser) that makes the queries compare as equal.

Nevertheless, even with these provisions, there are other kinds of interchangeable queries that are treated as distinct.
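To make the cheapest-first ordering concrete, here is a minimal, Jena-free sketch of the cascade. The class and method names (QueryComparisonCascade, addStep, firstMatch, crudeNormalise) are placeholders I made up for illustration, and the "syntax" step is only a crude stand-in for the QueryCompare check, not how Jena does it. The algebra-based steps are left out; they would be appended as further steps in the same way.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiPredicate;

// Sketch of a cheapest-first comparison cascade. All names here are
// hypothetical; nothing in this class is part of the Jena API.
class QueryComparisonCascade {

    // LinkedHashMap keeps insertion order, which is the order steps are tried in.
    private final Map<String, BiPredicate<String, String>> steps = new LinkedHashMap<>();

    QueryComparisonCascade addStep(String name, BiPredicate<String, String> test) {
        steps.put(name, test);
        return this;
    }

    // Returns the name of the first step that reports the two queries as
    // equal, or empty if none does. Costlier steps run only after a miss.
    Optional<String> firstMatch(String q1, String q2) {
        for (Map.Entry<String, BiPredicate<String, String>> step : steps.entrySet()) {
            if (step.getValue().test(q1, q2)) {
                return Optional.of(step.getKey());
            }
        }
        return Optional.empty();
    }

    // Toy stand-in for a syntax-level comparison: collapses whitespace and
    // ignores letter case. Not safe for real SPARQL, where IRIs and
    // literals are case-sensitive; a real step would compare parsed queries.
    static String crudeNormalise(String query) {
        return query.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    static QueryComparisonCascade example() {
        // Steps comparing compiled and optimized algebra would be added
        // after "syntax", in increasing order of cost.
        return new QueryComparisonCascade()
            .addStep("string", String::equals)
            .addStep("syntax", (a, b) -> crudeNormalise(a).equals(crudeNormalise(b)));
    }

    public static void main(String[] args) {
        QueryComparisonCascade cascade = example();
        System.out.println(cascade.firstMatch(
            "SELECT * WHERE { ?s ?p ?o }",
            "select *  where { ?s ?p ?o }"));
    }
}
```

Returning the name of the matching step, rather than a plain boolean, makes it easy to log how much work each pair of queries needed. An empty result means only that no step matched, not that the queries differ.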
For example, the following kinds of interchangeable queries are still treated as distinct:

- Queries using blank nodes or unprojected variables
- Queries with different order of UNION clauses
- Queries expressing the same disjunction using UNION, VALUES, or a property path with alternatives

I suspect there is a way to make the algebra optimization more "aggressive", so that it produces equal algebra for the above kinds of interchangeable queries. I read Rob Vesse's excellent slides on query optimization in Jena (http://events.linuxfoundation.org/sites/events/files/slides/SPARQL%20Optimisation%20101%20Tutorial.pdf) and it seems to me that much of what I need is already possible in Jena. I see there are many algebra transformers (https://github.com/apache/jena/tree/master/jena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize) that can be enabled in the org.apache.jena.sparql.util.Context passed to the Algebra.optimize() method. Would you recommend enabling some optimizations that are not enabled in the default Context (i.e. ARQ.getContext())?

I also found some unreachable code in org.apache.jena.sparql.algebra.optimize.Optimize (https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/Optimize.java#L136-L142). Was it left in the code for documentation?

Overall, would you say that detecting interchangeable queries via algebra optimizations is a good approach? Would you suggest a different one?

- Jindřich

--
Jindřich Mynarz
http://mynarz.net/#jindrich