On Mon, 30 Jan 2012 14:26:00 +0000, Paolo Castagna <[email protected]> said:
paolo> Welcome William.
Thank you.
paolo> When possible, I do this sort of thing locally. I get a
paolo> copy of the data I need or small slices of it, I load
paolo> everything into TDB and run my SPARQL queries locally.
Right. However, for my applications (!!!) I don't want to do this,
because:
1. I cannot count on the remote data being available in bulk, since
some publishers habitually make only SPARQL endpoints available,
not dumps.
2. I don't know beforehand which slices of the data I will need;
if I knew this, I wouldn't need to run the query.
3. I cannot count on having my own temporary local store to put
intermediate results into.
paolo> Looking at HttpQuery.java [1] that seems to me to be the
paolo> case (and it is probably ok for the majority of use cases).
Perhaps. Fixing this, though, would improve performance
significantly, should not break anything existing, and ought to be
simple.
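To be concrete about the client side, here is a minimal sketch of
how I consume a remote result set with ARQ via
QueryExecutionFactory.sparqlService; the endpoint and query are
invented for illustration. Pulling one row at a time in application
code only keeps memory flat if HttpQuery underneath streams the
response rather than buffering it whole, which is exactly the fix
in question:

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;

    public class RemoteSelect {
        public static void main(String[] args) {
            // Invented endpoint and query, purely for illustration.
            String endpoint = "http://example.org/sparql";
            String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 100";

            QueryExecution qe =
                QueryExecutionFactory.sparqlService(endpoint, query);
            try {
                ResultSet rs = qe.execSelect();
                // The application pulls one solution at a time; memory
                // use stays flat only if the HTTP layer streams too.
                while (rs.hasNext()) {
                    QuerySolution row = rs.next();
                    System.out.println(row);
                }
            } finally {
                qe.close();
            }
        }
    }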
paolo> See also/related:
paolo> "This feature is a basic building block to allow remote
paolo> access in the middle of a query, not a general solution to
paolo> the issues in distributed query evaluation...
Yes, I realise this and have read that caveat. I understand
pattern selectivity and the like quite well.
I am perfectly happy for the query to take a long time to run as a
batch job, as long as it doesn't consume a lot of RAM and a
recoverable failure (e.g. of the HTTP response code 5XX kind, not
4XX) doesn't cause the whole thing to fail and lose the work
already done.
"doesn't consume a lot of RAM" probably means "write results to
persistant storage or a file descriptor incrementally". That would
make Jena/ARQ useable in my application.
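By way of illustration, a rough sketch of the batch shape I mean,
assuming ARQ's QueryExceptionHTTP exposes the status via
getResponseCode(); the endpoint, query, output file, and retry
policy are all invented. A real version would checkpoint and resume
(e.g. OFFSET-based paging) rather than restart, so as not to lose
work already done:

    import java.io.FileWriter;
    import java.io.PrintWriter;

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.sparql.engine.http.QueryExceptionHTTP;

    public class BatchSelect {
        public static void main(String[] args) throws Exception {
            // Invented endpoint, query, output file and retry policy.
            String endpoint = "http://example.org/sparql";
            String query = "SELECT * WHERE { ?s ?p ?o }";
            int maxRetries = 5;

            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                QueryExecution qe =
                    QueryExecutionFactory.sparqlService(endpoint, query);
                try {
                    ResultSet rs = qe.execSelect();
                    PrintWriter out =
                        new PrintWriter(new FileWriter("results.txt"));
                    // Write each solution out as it arrives instead of
                    // accumulating the whole result set in RAM.
                    while (rs.hasNext()) {
                        QuerySolution row = rs.next();
                        out.println(row);
                    }
                    out.close();
                    return; // done
                } catch (QueryExceptionHTTP e) {
                    // Retry only on server-side (5XX) failures; a 4XX
                    // means the request itself is wrong and retrying
                    // is futile. A real version would resume rather
                    // than restart from scratch.
                    if (e.getResponseCode() >= 500 && attempt < maxRetries) {
                        Thread.sleep(1000L * attempt); // crude backoff
                    } else {
                        throw e;
                    }
                } finally {
                    qe.close();
                }
            }
        }
    }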
Cheers,
-w