Re: Re: Semantics of SERVICE w.r.t. slicing

Claus Stadler Sat, 04 Jun 2022 02:54:02 -0700

Hi Andy,

> Are you going to be making improvements to query transformation/optimization as part of your work on the enhanced SERVICE handling on the active PR?

To summarize the PR (https://github.com/apache/jena/issues/1314) for readers here: it's about (a) improving the extension system for custom service executors and (b) creating a plugin that allows for bulk retrieval and caching with SERVICE.

Actually I am trying to avoid touching transformation/optimization, but as part of my work on SERVICE extensions I added a little 'correlate' option. Together with a 'self' flag for referring back to the active dataset this allows for doing:



# For each department fetch 5 employees
# (the empty prefix is a placeholder namespace for illustration)
PREFIX : <http://example.org/>
SELECT * {
  ?d a :Department .
  SERVICE <correlate:self> { # self could also be a URI such as urn:x-arq:self
    SELECT ?e { ?d :hasEmployee ?e } LIMIT 5
  }
}

Strictly speaking, the variable ?d in the SERVICE clause has a different scope, but if 'correlate' is seen, my plugin just applies Rename.reverseVarRename to the OpService.

This could be restricted to only the variables that join with the input binding. This means the scope of (some of) the variables in the SERVICE clause is lost and a naive substitution with the input bindings becomes possible.


For example the following query


SELECT * {
  BIND(<urn:foo> AS ?s)
  SERVICE <correlate:> { # self is implied if no other URL is mentioned
    SELECT ?x ?y { # Important: do not project ?s, otherwise VarFinder will prevent the OpJoin -> OpSequence optimization
      { BIND(?s AS ?x) } UNION { BIND(?s AS ?y) }
    }
  }
}


Yields:

-------------------------------------
| s         | x         | y         |
=====================================
| <urn:foo> | <urn:foo> |           |
| <urn:foo> |           | <urn:foo> |
-------------------------------------


For completeness, without 'correlate' one gets:

SELECT * {
  BIND(<urn:foo> AS ?s)
  { SELECT ?x ?y { { BIND(?s AS ?x) } UNION { BIND(?s AS ?y) } } }
}
---------------------
| s         | x | y |
=====================
| <urn:foo> |   |   |
| <urn:foo> |   |   |
---------------------
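
To make the difference between the two result tables easier to follow, here is a small Python sketch of the two evaluation strategies for the BIND/UNION example. It is only a toy model, not Jena code: the function names (eval_union_of_binds, bottom_up, correlated) are made up for illustration, and bindings are modelled as plain dicts.

```python
from typing import Dict, List

Binding = Dict[str, str]


def eval_union_of_binds(scope: Binding) -> List[Binding]:
    """Toy evaluation of
         SELECT ?x ?y { { BIND(?s AS ?x) } UNION { BIND(?s AS ?y) } }
    `scope` holds the variables visible inside the pattern."""
    rows: List[Binding] = []
    for target in ("x", "y"):
        row: Binding = {}
        # BIND of an unbound variable leaves the target unbound.
        if "s" in scope:
            row[target] = scope["s"]
        rows.append(row)
    return rows


def compatible(a: Binding, b: Binding) -> bool:
    """Two bindings are compatible if they agree on all shared variables."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())


def join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


def bottom_up(outer: List[Binding]) -> List[Binding]:
    """Without 'correlate': the subquery is evaluated on its own, then joined."""
    return join(outer, eval_union_of_binds({}))


def correlated(outer: List[Binding]) -> List[Binding]:
    """With 'correlate': each input binding is substituted into the subquery."""
    out: List[Binding] = []
    for b in outer:
        out.extend(join([b], eval_union_of_binds(b)))
    return out


outer = [{"s": "<urn:foo>"}]
print(bottom_up(outer))   # two rows with only ?s bound
print(correlated(outer))  # one row with ?x bound, one row with ?y bound
```

The bottom-up variant reproduces the table with empty ?x and ?y, while the correlated variant reproduces the table where ?x and ?y carry <urn:foo>.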

So far, it was possible to trick Jena into optimizing OpJoin into OpSequence as long as there were no joining variables.

The need for the extra projection of ?x ?y (and not ?s) is not super nice, but it used to be a good tradeoff for not having to touch optimizers and having this feature escalate into the core of ARQ.

I guess with my recent (bug) report I shot myself somewhat in the foot now :D

Because I am not sure if it's still possible to write a query syntactically in a way such that OpJoin turns into OpSequence if LIMIT/OFFSET appears in the service clause!

Consequently, it's actually the optimizer that would have to be aware of the 'correlate' flag on service clauses and base its decision on it.

It just turns out that the SPARQL 1.1 SERVICE syntax is the easiest way to have a syntax for it until hopefully SPARQL 1.2 standardizes it (corresponding issue: https://github.com/w3c/sparql-12/issues/100).

Andy recently also raised the option to extend the ARQ parser with custom syntax such as:

  SERVICE <http://my.endpoint/sparql> ARGS "cache" { ... }

https://github.com/apache/jena/pull/1315#issuecomment-1146350174

Something along these lines would be very powerful when fleshed out, but from my side I think for this work it's not necessary to add custom syntax (yet).

But of course the larger picture is how to extend SERVICE with, e.g., HTTP options and other custom options.

(I think there was some discussion on the SPARQL 1.2 issue tracker but I can't find it right now).



Cheers,

Claus




On 03.06.22 22:41, Andy Seaborne wrote:

JENA-2332 and PR 1364.

    Andy

https://issues.apache.org/jira/browse/JENA-2332

https://github.com/apache/jena/pull/1364

On 03/06/2022 18:29, Andy Seaborne wrote:
Probably a bug then.
Are you going to be making improvements to query transformation/optimization as part of your work on the enhanced SERVICE handling on the active PR?
     Andy

On 03/06/2022 10:39, Claus Stadler wrote:
Hi again,
I think the point was missed; what I was actually after is that in the following query a "join" is optimized into a "sequence"
and I wonder whether this is the correct behavior if a LIMIT/OFFSET is present.
So running the following query with optimize enabled/disabled gives different results:
SELECT * {
  SERVICE <https://dbpedia.org/sparql> { SELECT * { ?s a <http://dbpedia.org/ontology/MusicalArtist> } LIMIT 5 }
  SERVICE <https://dbpedia.org/sparql> { SELECT * { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?x } LIMIT 1 }
}


➜  bin ./arq --query service-query.rq

   (sequence !!!!!
     (service <https://dbpedia.org/sparql>
       (slice _ 5
         (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MusicalArtist>))))
     (service <https://dbpedia.org/sparql>
       (slice _ 1
         (bgp (triple ?s <http://www.w3.org/2000/01/rdf-schema#label> ?x)))))
-------------------------------------------------------------------------------
| s                                                   | x                     |
===============================================================================
| <http://dbpedia.org/resource/Aarti_Mukherjee>       | "Aarti Mukherjee"@en  |
| <http://dbpedia.org/resource/Abatte_Barihun>        | "Abatte Barihun"@en   |
| <http://dbpedia.org/resource/Abby_Abadi>            | "Abby Abadi"@en       |
| <http://dbpedia.org/resource/Abd_al_Malik_(rapper)> | "Abd al Malik"@de     |
| <http://dbpedia.org/resource/Abdul_Wahid_Khan>      | "Abdul Wahid Khan"@en |
-------------------------------------------------------------------------------

./arq --explain --optimize=no --query service-query.rq
   (join !!!!!
     (service <https://dbpedia.org/sparql>
       (slice _ 5
         (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MusicalArtist>))))
     (service <https://dbpedia.org/sparql>
       (slice _ 1
         (bgp (triple ?s <http://www.w3.org/2000/01/rdf-schema#label> ?x)))))
---------
| s | x |
=========
---------
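
The two outcomes above can be reproduced with a toy model in Python. This is only a sketch under assumptions: the "endpoint" is an in-memory list with made-up resources (artist1, city1, ...), the helper names are hypothetical, and the row order of the label lookup is chosen so that the first label does not belong to an artist.

```python
from typing import Dict, List, Optional

Binding = Dict[str, str]

# Made-up stand-in for the remote endpoint's data.
TYPES = [("artist1", "MusicalArtist"), ("artist2", "MusicalArtist"), ("city1", "City")]
LABELS = [("city1", "City One"), ("artist1", "Artist One"), ("artist2", "Artist Two")]


def service_artists(limit: int) -> List[Binding]:
    """SERVICE { SELECT * { ?s a MusicalArtist } LIMIT n }"""
    return [{"s": s} for s, t in TYPES if t == "MusicalArtist"][:limit]


def service_labels(limit: int, given: Optional[Binding] = None) -> List[Binding]:
    """SERVICE { SELECT * { ?s rdfs:label ?x } LIMIT n }
    If `given` binds ?s, it is substituted into the pattern before the
    slice is applied (this is what a correlated evaluation does)."""
    given = given or {}
    rows = [{"s": s, "x": x} for s, x in LABELS if given.get("s", s) == s]
    return rows[:limit]


def compatible(a: Binding, b: Binding) -> bool:
    return all(a[k] == b[k] for k in a.keys() & b.keys())


def join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


def eval_join() -> List[Binding]:
    """(join ...): both sides are evaluated independently, then joined."""
    return join(service_artists(5), service_labels(1))


def eval_sequence() -> List[Binding]:
    """(sequence ...): the right side is evaluated once per left binding."""
    out: List[Binding] = []
    for b in service_artists(5):
        out.extend(service_labels(1, given=b))
    return out


print(eval_join())      # empty: the single label row does not join with an artist
print(eval_sequence())  # one label per artist
```

In this model the slice is applied after substitution in the sequence case, so LIMIT 1 becomes "one label per artist" instead of "one label overall", which is exactly the discrepancy between the two runs.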


Cheers,

Claus


On 03.06.22 10:22, Andy Seaborne wrote:
On 02/06/2022 21:19, Claus Stadler wrote:
Hi,
I noticed some interesting results when using SERVICE with a subquery with a slice (limit / offset).
Preliminary Remark:
Because SPARQL semantics is bottom up, a query such as the following will not yield bindings for ?x:
SELECT * {
  SERVICE <https://dbpedia.org/sparql> { SELECT * { ?s a <http://dbpedia.org/ontology/MusicalArtist> } LIMIT 5 }
  SERVICE <https://dbpedia.org/sparql> { BIND(?s AS ?x) }
}
The query plan for that is:

(join
  (service <https://dbpedia.org/sparql>
    (slice _ 5
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MusicalArtist>))))
  (service <https://dbpedia.org/sparql>
    (extend ((?x ?s))
      (table unit))))
which has not had any optimization applied. ARQ checks scopes before doing any transformation.
Change BIND(?s AS ?x) to BIND(?s1 AS ?x)

and it will have (join) replaced by (sequence)

-----------------------------------------------------------
| s                                                   | x |
===========================================================
| <http://dbpedia.org/resource/Aarti_Mukherjee>       |   |
| <http://dbpedia.org/resource/Abatte_Barihun>        |   |
| <http://dbpedia.org/resource/Abby_Abadi>            |   |
| <http://dbpedia.org/resource/Abd_al_Malik_(rapper)> |   |
| <http://dbpedia.org/resource/Abdul_Wahid_Khan>      |   |
-----------------------------------------------------------
LIMIT 1 is a no-op - the second SERVICE always evals to one row of no columns. Which makes the second SERVICE the join identity and the result is the first SERVICE.
Column ?x is only in the display because it is in "SELECT *".
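
The "join identity" remark can be checked with a small Python sketch. It is a toy model, not ARQ code: bindings are dicts, the artist URNs are placeholders, and eval_bind_s_as_x is a hypothetical helper standing in for (extend ((?x ?s)) (table unit)).

```python
from typing import Dict, List

Binding = Dict[str, str]


def compatible(a: Binding, b: Binding) -> bool:
    return all(a[k] == b[k] for k in a.keys() & b.keys())


def join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


def eval_bind_s_as_x(scope: Binding) -> List[Binding]:
    """BIND(?s AS ?x) over the unit table: exactly one row comes back.
    If ?s is not in scope, ?x stays unbound and the row has no columns."""
    row: Binding = {}
    if "s" in scope:
        row["x"] = scope["s"]
    return [row]


artists = [{"s": "<urn:example:artist1>"}, {"s": "<urn:example:artist2>"}]

# Evaluated bottom-up, the second SERVICE sees no binding for ?s.
second_service = eval_bind_s_as_x({})
print(second_service)                             # [{}]
print(join(artists, second_service) == artists)   # True
```

A row with no columns is compatible with every binding and adds nothing to it, so joining with it returns the left side unchanged.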
Query engines, such as Jena, attempt to optimize execution. For instance, in the following query,
instead of retrieving all labels, Jena uses each binding for a Musical Artist to perform a lookup at the service.
The result is semantically equivalent to bottom-up evaluation (without result set limits) - just much faster.
SELECT * {
  SERVICE <https://dbpedia.org/sparql> { SELECT * { ?s a <http://dbpedia.org/ontology/MusicalArtist> } LIMIT 5 }
  SERVICE <https://dbpedia.org/sparql> { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?x }
}


The main point:
However, the following query with ARQ interestingly yields one binding for every musical artist - which contradicts the bottom-up paradigm:
SELECT * {
  SERVICE <https://dbpedia.org/sparql> { SELECT * { ?s a <http://dbpedia.org/ontology/MusicalArtist> } LIMIT 5 }
  SERVICE <https://dbpedia.org/sparql> { SELECT * { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?x } LIMIT 1 }
}


<http://dbpedia.org/resource/Aarti_Mukherjee> "Aarti Mukherjee"@en
<http://dbpedia.org/resource/Abatte_Barihun> "Abatte Barihun"@en
... 3 more results ...
With bottom-up semantics, the second service clause would only fetch a single binding, so in the unlikely event that it happens to join with a musical artist I'd expect at most one binding
in the overall result set.

Now I wonder whether this is a bug or a feature.
I know that Jena's VarFinder is used to decide whether to perform a bottom-up evaluation using OpJoin or a correlated join using OpSequence, which results in the different outcomes.
The SPARQL spec doesn't say much about the semantics of SERVICE (https://www.w3.org/TR/sparql11-query/#sparqlAlgebraEval)
It isn't about the semantics of SERVICE.  It's the (join) local-side.
So I wonder which behavior is expected when using SERVICE with SLICE'd queries.
"SERVICE { pattern }" executes "SELECT * { pattern }" at the far end, LIMITs and all.
    Andy
Cheers,

Claus

--
Dipl. Inf. Claus Stadler
Institute of Applied Informatics (InfAI) / University of Leipzig
Workpage & WebID: http://aksw.org/ClausStadler
