On Mon, Jan 30, 2012 at 7:08 AM, Andy Seaborne <[email protected]> wrote:
> On 30/01/12 12:52, William Waites wrote:
>>
>> Hello all,
>
> Hi William,
>
>>
>> My collegue Paolo has been suggesting that I join this list for a
>> while, and since I have a couple of questions stemming from my use of
>> the SPARQL-FED stuff in ARQ, I thought that now might be a good time.
>>
>> What I'm doing is as follows. I have some information about
>> airports. It's accurate and complete, but pretty skeletal. dbpedia on
>> the other hand is less complete but richer in terms of text
>> descriptions and additional information. There also happens to be a
>> text field (ICAO code) that can be used to join the two.
>>
>> Though I know there are ways to do this more efficiently, I think a
>> single CONSTRUCT query with some SERVICE blocks in the WHERE clause is
>> a very clean way to do it, and will only become more efficient as the
>> implementation gets better.
>>
>> So an abbreviated version of the query might be something like,
>>
>> CONSTRUCT {
>>     ?my_uri dct:description ?description
>> } WHERE {
>>     ?my_uri transit:icaoCode ?icao.
>>     SERVICE<http://dbpedia.org/sparql> {
>>         ?dbp_uri dbpprop:icao ?icao;
>>                  rdfs:comment ?description
>>     }
>> }
>>
>> This mail is about two ways the implementation might get
>> better.
>>
>> Firstly it is brittle. It expands into doing one remote query for each
>> ?icao, which is what one would expect. If any sub-query fails due to
>> transient network events or server flakiness (almost inevitable with
>> more than a trivially small set of things to be queried) the whole
>> query fails. I would rather like the process to continue, and perhaps
>> log a warning. The web is unreliable and the semantic web contains a
>> funny open-world assumption of incomplete results being acceptable,
>> it's just the nature of the beast. Incomplete results are better than
>> no results in this case, but that they are known to be possibly
>> incomplete should be flagged in some way in case the user cares.
>
> SERVICE SILENT may be what you are looking for. Strictly, this is continue
> (with no results) if any part fails but in ARQ, in normal usage, it is
> applied to each service request.
>
> See QueryIterService.
>
>>
>> Secondly, I understand from Paolo that the client in ARQ does not use
>> persistent HTTP connections. For iterations like this, the HTTP
>> set-up/tear-down is quite costly and it would be much better if
>> persistent connections were supported here. Possibly even better
>> (potentially the server could take advantage of this, executing
>> queries in parallel for example) if the queries were pipelined to some
>> extent.
>
> The real problem is that the correct query to send to the far end is
>
> SELECT * {
>
>     ?dbp_uri dbpprop:icao ?icao;
>              rdfs:comment ?description
> } BINDINGS ... fro the first part ...
>
> then it is one request that still does not ask an ungrounded
>
> {
>     ?dbp_uri dbpprop:icao ?icao;
>              rdfs:comment ?description
> }
>
> but DBpedia does not support all of SPARQL 1.1 and in particular it does not
> support BINDINGS (yet?).
>
> The implementation of service requests is in
> com.hp.hpl.jena.sparql.engine.http.HttpQuery. It might be better to use the
> Apache HTTP client. Currently it use java.net.
>
> Patches welcome.
>
>> doesn't cause the whole thing to fail and lose the work already done.
>> "doesn't consume a lot of RAM"
>
> ARQ streams the results out (unless you ask something that can't like
> wanting the text output form - in which case send to a file as a streamable
> format and read the file back in.) .
>
> CONSTRUCT isn't streamable - can you use a SELECT and generate the triples
> for the CONSTRUCT as it streams?
>

I'd been thinking that an additional API that did stream CONSTRUCT
queries might be useful. It would have to return an Iterator<Triple>
instead of a Model. This would work well for Fuseki, as it is only
streaming RDF back to the client. Combined with
org.openjena.atlas.data.DistinctDataNet it would have spill-to-disk
capability as well.

-Stephen
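
To make the idea of a streaming CONSTRUCT a little more concrete, here is a rough sketch in plain Java. It does not use the ARQ API at all: `Triple` and `StreamingConstruct` below are hypothetical stand-ins written for this example, terms are plain strings, and a term starting with `?` is treated as a variable. The point is only to show how a template could be instantiated lazily against SELECT-style solutions, one triple at a time, without building a Model first.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a triple; terms are plain strings here. */
final class Triple {
    final String s, p, o;
    Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    @Override public String toString() { return s + " " + p + " " + o + " ."; }
}

/** Lazily instantiates a CONSTRUCT-style template against a stream of solutions. */
final class StreamingConstruct implements Iterator<Triple> {
    private final List<Triple> template;               // terms starting with '?' are variables
    private final Iterator<Map<String, String>> rows;  // e.g. solutions of a SELECT
    private Map<String, String> row;                   // solution currently being expanded
    private int pos;                                   // next template index for that solution
    private Triple next;                               // look-ahead slot

    StreamingConstruct(List<Triple> template, Iterator<Map<String, String>> rows) {
        this.template = template;
        this.rows = rows;
    }

    /** Returns the bound value for a variable (null if unbound), or the term itself. */
    private static String bind(String term, Map<String, String> row) {
        return term.startsWith("?") ? row.get(term) : term;
    }

    @Override public boolean hasNext() {
        while (next == null) {
            if (row == null || pos >= template.size()) {
                if (!rows.hasNext()) return false;
                row = rows.next();   // pull one more solution, only when needed
                pos = 0;
                continue;            // re-check, in case the template is empty
            }
            Triple t = template.get(pos++);
            String s = bind(t.s, row), p = bind(t.p, row), o = bind(t.o, row);
            // As with CONSTRUCT, a template triple with an unbound variable is skipped.
            if (s != null && p != null && o != null) next = new Triple(s, p, o);
        }
        return true;
    }

    @Override public Triple next() {
        if (!hasNext()) throw new NoSuchElementException();
        Triple t = next;
        next = null;
        return t;
    }

    public static void main(String[] args) {
        // Made-up example data, loosely following the airport query in this thread.
        List<Triple> template =
            List.of(new Triple("?my_uri", "dct:description", "?description"));
        List<Map<String, String>> solutions = List.of(
            Map.of("?my_uri", "<http://example.org/airport/1>",
                   "?description", "\"An example airport\""));
        Iterator<Triple> it = new StreamingConstruct(template, solutions.iterator());
        while (it.hasNext()) System.out.println(it.next());
    }
}
```

Note that a stream produced this way can repeat a triple when several solutions yield the same one, whereas a Model would hold it once. A de-duplicating step in front of the consumer would be needed for set semantics, and that is presumably where a spill-to-disk structure would come in.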
