Hi Andy!

Thanks for your excellent answers.

Andy Seaborne wrote on 16.11.2017 at 22:02:

But I'm worried about the state of the hdt-java project: it is not actively maintained, and it's still based on Fuseki1.

You don't need to use their Fuseki integration.

I need a SPARQL endpoint... AFAICT the hdt-java Fuseki integration is the only available way to set up a SPARQL endpoint on top of HDT files. Well, there's the LDF stack, which can do SPARQL-over-LDF-over-HDT, but it doesn't make sense to use that within a single machine; it would just create huge overhead. Over the network LDF makes sense in some scenarios, if you want to provide data that others can compute on without causing huge load on the server.
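For what it's worth, the hdt-jena module does let you wrap an HDT file as a plain Jena Graph and query it with ARQ, independently of the Fuseki1 integration. Here's a minimal sketch, assuming the HDTManager/HDTGraph API of hdt-java as I understand it (dataset.hdt is just a placeholder file name); getting such a graph behind an actual Fuseki endpoint is the part that's still missing.

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class HdtSparqlSketch {
    public static void main(String[] args) throws Exception {
        // Memory-map the HDT file and its index instead of loading it all into RAM.
        HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", null);
        // Expose the HDT as a read-only Jena Graph / Model.
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));

        String query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }";
        try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
            ResultSetFormatter.out(qexec.execSelect());
        }
        hdt.close();
    }
}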

Those are low figures for 40M.  Lack of free RAM? (It's more acute with TDB2 ATM as it does random I/O.) RDF syntax? A lot of long literals?

Today: TDB2:

INFO  Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)

My laptop is hardly top of the line; it's a cheap model from 2011. I gave the specs above. It has an SSD, but it's limited by the SATA bus speed to around 300 MB/s in ideal conditions. I'm sure a more modern machine can do much better, as your figures indicate. But I use this one for comparison benchmarks, because it's easy to guarantee that there's nothing else running on the system, unlike on a VM with shared resources, which is generally faster but less predictable.

The syntax is N-Triples. One thing I forgot to mention (I had forgotten about it myself) is that there is some duplication of triples within that file, as it was created by concatenating several files. So it's more like 50M triples, of which 40M are distinct.
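By the way, the same load can also be driven from Java if needed (the normal route being tdb2.tdbloader). A minimal sketch, assuming the early TDB2 API (DatabaseMgr, Txn), with "DB2" and the file name as placeholders; since an RDF graph is a set of triples, the duplicates in the input simply collapse on load:

import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.DatabaseMgr;

public class Tdb2LoadSketch {
    public static void main(String[] args) {
        // Connect to (or create) a TDB2 database at the given location.
        DatasetGraph dsg = DatabaseMgr.connectDatasetGraph("DB2");
        // One write transaction for the whole gzipped N-Triples file;
        // duplicate triples in the input end up as a single triple in the store.
        Txn.executeWrite(dsg, () -> RDFDataMgr.read(dsg, "bsbm-50m.nt.gz"));
        dsg.close();
    }
}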

My impressions so far: it works, but it's slower than TDB1 and needs more disk space. Compaction seems to work, but initially it just increases disk usage; the stale data has to be removed manually to actually reclaim any space.
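As I understand it, compaction writes a new storage generation alongside the old one inside the database directory, which is why disk usage grows until the old generation is removed by hand. For reference, here's a minimal sketch of triggering it programmatically, assuming the TDB2 DatabaseMgr.compact API; whether anything like this is exposed through Fuseki is exactly question 1 below:

import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.tdb2.DatabaseMgr;

public class Tdb2CompactSketch {
    public static void main(String[] args) {
        // Connect to the switchable TDB2 database container.
        DatasetGraph dsg = DatabaseMgr.connectDatasetGraph("DB2");
        // Compact into a fresh storage generation; the old generation stays
        // on disk until it is deleted (or archived) separately.
        DatabaseMgr.compact(dsg);
        dsg.close();
    }
}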

The user can archive it or delete it.

Yes, I understand. But in this case I want to have a SPARQL endpoint that runs for months at a time, ideally with little supervision. I *don't* want to be there deleting stale files all the time!

Please don't get me wrong, I'm not trying to downplay or criticize your excellent work on TDB! I'm just trying to figure out whether it is already suitable for my use case: a public SPARQL endpoint with a dataset of nontrivial size that updates regularly. I'm kicking the tyres, so to speak, looking for potential causes of concern and suggestions for improvements. So far I've been very impressed with what TDB2 can do, even though it's at an early stage of development. In particular, the way Fuseki with TDB2 can now handle very large transactions is great, and the performance seems to be roughly on par with TDB1, which is also a good sign.

1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd prefer not to take the endpoint down for compaction.

Not currently; as I've said, there is no Fuseki change except integrating the TDB2 jars.

Adding a template name to the HTTP API would be good, but providing UI access is IMO a long way off. TDB1 works for people.

OK.

2. Should the stale data be deleted after compaction, at least as an option?

If you want to make a PR ...

Understood.

3. Should there be a JIRA issue about UI and API support for creating TDB2 datasets?

Every JIRA is a request for someone to do work or an offer to contribute.

That's why I asked first! When I notice clear bugs I create JIRA issues. Ditto when I have something to contribute. But delving deep into Fuseki/TDB2 integration issues is a bit too far outside my comfort zone, unfortunately.

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi
