Hi Andy!
Thanks for your excellent answers.
Andy Seaborne wrote on 16.11.2017 at 22:02:
>> But I'm worried about the state of the hdt-java project,
>> it is not being actively maintained and it's still based on Fuseki1.
> You don't need to use their Fuseki integration.
I need a SPARQL endpoint... AFAICT the hdt-java Fuseki integration is
the only available way to set up a SPARQL endpoint on top of HDT files.

Well, there's the LDF stack, which can do SPARQL-over-LDF-over-HDT, but
it doesn't make sense to use that within a single machine; it would just
create huge overhead. Over the network LDF makes sense in some
scenarios, e.g. if you want to provide data that others can compute on
without putting a huge load on the server.
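To illustrate: if an in-process solution were enough, SPARQL over HDT
works directly via ARQ. A minimal sketch, assuming the hdt-java
HDTManager API and the hdt-jena HDTGraph adapter (the class and file
names are made up):

  import org.apache.jena.query.*;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.rdfhdt.hdt.hdt.HDT;
  import org.rdfhdt.hdt.hdt.HDTManager;
  import org.rdfhdt.hdtjena.HDTGraph;

  public class HdtSparqlSketch {
      public static void main(String[] args) throws Exception {
          // Memory-map the HDT file and its index rather than
          // loading everything into RAM.
          HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", null);
          Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));
          String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
          try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
              ResultSetFormatter.out(qe.execSelect());
          }
          hdt.close();
      }
  }

But that still isn't an endpoint; for a network-facing SPARQL service,
the Fuseki integration seems to be the only packaged option.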
> Those are low figures for 40M. Lack of free RAM? (It's more acute with
> TDB2 ATM as it does random I/O.) RDF syntax? A lot of long literals?
> Today: TDB2:
> INFO Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)
My laptop is hardly top of the line; it's a cheap model from 2011. I
gave the specs above. It has an SSD, but it's limited by the SATA bus
speed to around 300 MB/s under ideal conditions. I'm sure a more modern
machine can do much better, as your figures indicate. But I use this one
for comparison benchmarks because it's easy to guarantee that nothing
else is running on the system, unlike on a VM with shared resources,
which is generally faster but less predictable.
The syntax is N-Triples. One thing I forgot to mention (I even forgot
about it myself) is that there is some duplication of triples within
that file, as it's created by concatenating several files. So it's more
like 50M triples of which 40M are distinct.
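For what it's worth, the distinct count is easy to check after loading.
A minimal sketch, assuming a TDB2 database at a made-up location "DB2":

  import org.apache.jena.query.*;
  import org.apache.jena.system.Txn;
  import org.apache.jena.tdb2.TDB2Factory;

  public class CountDistinct {
      public static void main(String[] args) {
          Dataset ds = TDB2Factory.connectDataset("DB2");
          // All TDB2 access has to happen inside a transaction.
          Txn.executeRead(ds, () -> {
              String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
              try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
                  ResultSetFormatter.out(qe.execSelect());
              }
          });
      }
  }

Since the indexes only store distinct triples, the duplicates collapse
on load, so the count comes out at roughly 40M rather than 50M.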
>> My impressions so far: It works, but it's slower than TDB and needs
>> more disk space. Compaction seems to work, but initially it will just
>> increase disk usage. The stale data has to be manually removed to
>> actually reclaim any space.
> The user can archive it or delete it.
Yes, I understand. But in this case I want to have a SPARQL endpoint
that runs for months at a time, ideally with little supervision. I
*don't* want to be there deleting stale files all the time!
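To make it concrete, this is the chore I mean; a sketch under the
assumption that TDB2 keeps each generation in a Data-NNNN directory
under the database location and only the newest one is live (only safe
to run while the server is stopped; the location name is made up):

  import java.io.IOException;
  import java.nio.file.*;
  import java.util.Comparator;
  import java.util.List;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;

  public class PruneStale {
      public static void main(String[] args) throws IOException {
          Path db = Paths.get("DB2");
          // Collect the Data-NNNN generations, oldest first.
          List<Path> gens;
          try (Stream<Path> s = Files.list(db)) {
              gens = s.filter(p -> p.getFileName().toString().startsWith("Data-"))
                      .sorted()
                      .collect(Collectors.toList());
          }
          // Keep the newest generation; delete the older ones,
          // deepest files first so directories empty before removal.
          for (Path old : gens.subList(0, Math.max(0, gens.size() - 1))) {
              try (Stream<Path> walk = Files.walk(old)) {
                  walk.sorted(Comparator.reverseOrder())
                      .forEach(p -> p.toFile().delete());
              }
          }
      }
  }

An option on compaction to do this automatically would remove the need
for babysitting.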
Please don't get me wrong, I'm not trying to downplay or criticize your
excellent work on TDB2! I'm just trying to figure out whether it would
already be suitable for the use case I have: a public SPARQL endpoint
with a dataset of nontrivial size that updates regularly. I'm kicking
the tyres, so to speak, looking for potential causes of concern and
suggestions for improvements. So far I've been very impressed with what
TDB2 can do, even though it's at an early stage of development.
Especially the way Fuseki with TDB2 can now handle very large
transactions is great, and the performance seems to be roughly on par
with TDB1, which is also a good sign.
>> 1. Is it possible to trigger a TDB2 compaction from within Fuseki?
>> I'd prefer not taking the endpoint down for compaction.
> Not currently. As I've said, there is no Fuseki change except
> integrating the TDB2 jars.
> Adding a template name to the HTTP API would be good, but IMO
> providing UI access is a long way off. TDB1 works for people.
OK.
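In the meantime I can trigger it from a small standalone program run
against the same location while Fuseki is stopped. A minimal sketch,
assuming TDB2's DatabaseMgr.compact entry point (location name made
up):

  import org.apache.jena.query.Dataset;
  import org.apache.jena.tdb2.DatabaseMgr;
  import org.apache.jena.tdb2.TDB2Factory;

  public class CompactDb {
      public static void main(String[] args) {
          Dataset ds = TDB2Factory.connectDataset("DB2");
          // Compaction writes a fresh Data-NNNN generation and switches
          // to it; the old generation stays on disk until removed.
          DatabaseMgr.compact(ds.asDatasetGraph());
      }
  }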
>> 2. Should the stale data be deleted after compaction, at least as an
>> option?
> If you want to make a PR ...
Understood.
>> 3. Should there be a JIRA issue about UI and API support for creating
>> TDB2 datasets?
> Every JIRA is a request for someone to do work or an offer to contribute.
That's why I asked first! When I notice clear bugs, I create JIRA
issues, and likewise when I have something to contribute. But delving
deep into Fuseki/TDB2 integration issues is a bit too far outside my
comfort zone, unfortunately.
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi