Hi Andy!
Thanks for your excellent answers.
Andy Seaborne wrote on 16.11.2017 at 22:02:
>> But I'm worried about the state of the hdt-java project,
>> it is not being actively maintained and it's still based on Fuseki1.
> You don't need to use their Fuseki integration.
I need a SPARQL endpoint... AFAICT the hdt-java Fuseki integration is
the only available way to set up a SPARQL endpoint on top of HDT files.

Well, there's the LDF stack, which can do SPARQL-over-LDF-over-HDT, but
it doesn't make sense to use that within a single machine; it would just
create huge overhead. Over the network LDF makes sense in some
scenarios, e.g. if you want to provide data that others can compute on
without putting a huge load on the server.
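To illustrate: if an in-process solution were enough, SPARQL over HDT
works directly via ARQ. A minimal sketch, assuming the hdt-java
HDTManager API and the hdt-jena HDTGraph adapter (the class and file
names are made up):

  import org.apache.jena.query.*;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.rdfhdt.hdt.hdt.HDT;
  import org.rdfhdt.hdt.hdt.HDTManager;
  import org.rdfhdt.hdtjena.HDTGraph;

  public class HdtSparqlSketch {
      public static void main(String[] args) throws Exception {
          // Memory-map the HDT file and its index rather than
          // loading everything into RAM.
          HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", null);
          Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));
          String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
          try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
              ResultSetFormatter.out(qe.execSelect());
          }
          hdt.close();
      }
  }

But that still isn't an endpoint; for a network-facing SPARQL service,
the Fuseki integration seems to be the only packaged option.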
> Those are low figures for 40M. Lack of free RAM? (It's more acute with
> TDB2 ATM as it does random I/O.) RDF syntax? A lot of long literals?
> Today: TDB2:
> INFO Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)
My laptop is hardly top of the line; it's a cheap model from 2011. I
gave the specs above. It has an SSD, but it's limited by the SATA bus
speed to around 300 MB/s under ideal conditions. I'm sure a more modern
machine can do much better, as your figures indicate. But I use this one
for comparison benchmarks because it's easy to guarantee that nothing
else is running on the system, unlike on a VM with shared resources,
which is generally faster but less predictable.
The syntax is N-Triples. One thing I forgot to mention (I even forgot
about it myself) is that there is some duplication of triples within
that file, as it's created by concatenating several files. So it's more
like 50M triples of which 40M are distinct.
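For what it's worth, the distinct count is easy to check after loading.
A minimal sketch, assuming a TDB2 database at a made-up location "DB2":

  import org.apache.jena.query.*;
  import org.apache.jena.system.Txn;
  import org.apache.jena.tdb2.TDB2Factory;

  public class CountDistinct {
      public static void main(String[] args) {
          Dataset ds = TDB2Factory.connectDataset("DB2");
          // All TDB2 access has to happen inside a transaction.
          Txn.executeRead(ds, () -> {
              String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
              try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
                  ResultSetFormatter.out(qe.execSelect());
              }
          });
      }
  }

Since the indexes only store distinct triples, the duplicates collapse
on load, so the count comes out at roughly 40M rather than 50M.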
>> My impressions so far: It works, but it's slower than TDB and needs
>> more disk space. Compaction seems to work, but initially it will just
>> increase disk usage. The stale data has to be manually removed to
>> actually reclaim any space.
> The user can archive it or delete it.
Yes, I understand. But in this case I want to have a SPARQL endpoint
that runs for months at a time, ideally with little supervision. I
*don't* want to be there deleting stale files all the time!
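To make it concrete, this is the chore I mean; a sketch under the
assumption that TDB2 keeps each generation in a Data-NNNN directory
under the database location and only the newest one is live (only safe
to run while the server is stopped; the location name is made up):

  import java.io.IOException;
  import java.nio.file.*;
  import java.util.Comparator;
  import java.util.List;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;

  public class PruneStale {
      public static void main(String[] args) throws IOException {
          Path db = Paths.get("DB2");
          // Collect the Data-NNNN generations, oldest first.
          List<Path> gens;
          try (Stream<Path> s = Files.list(db)) {
              gens = s.filter(p -> p.getFileName().toString().startsWith("Data-"))
                      .sorted()
                      .collect(Collectors.toList());
          }
          // Keep the newest generation; delete the older ones,
          // deepest files first so directories empty before removal.
          for (Path old : gens.subList(0, Math.max(0, gens.size() - 1))) {
              try (Stream<Path> walk = Files.walk(old)) {
                  walk.sorted(Comparator.reverseOrder())
                      .forEach(p -> p.toFile().delete());
              }
          }
      }
  }

An option on compaction to do this automatically would remove the need
for babysitting.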
Please don't get me wrong, I'm not trying to downplay or criticize your
excellent work on TDB2! I'm just trying to figure out whether it would
already be suitable for the use case I have: a public SPARQL endpoint
with a dataset of nontrivial size that updates regularly. I'm kicking
the tyres, so to speak, looking for potential causes of concern and
suggestions for improvements. So far I've been very impressed with what
TDB2 can do, even though it's at an early stage of development.
Especially the way Fuseki with TDB2 can now handle very large
transactions is great, and the performance seems to be roughly on par
with TDB1, which is also a good sign.
>> 1. Is it possible to trigger a TDB2 compaction from within Fuseki?
>> I'd prefer not taking the endpoint down for compaction.
> Not currently. As I've said, there is no Fuseki change except
> integrating the TDB2 jars.
> Adding a template name to the HTTP API would be good, but IMO
> providing UI access is a long way off. TDB1 works for people.
OK.
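In the meantime I can trigger it from a small standalone program run
against the same location while Fuseki is stopped. A minimal sketch,
assuming TDB2's DatabaseMgr.compact entry point (location name made
up):

  import org.apache.jena.query.Dataset;
  import org.apache.jena.tdb2.DatabaseMgr;
  import org.apache.jena.tdb2.TDB2Factory;

  public class CompactDb {
      public static void main(String[] args) {
          Dataset ds = TDB2Factory.connectDataset("DB2");
          // Compaction writes a fresh Data-NNNN generation and switches
          // to it; the old generation stays on disk until removed.
          DatabaseMgr.compact(ds.asDatasetGraph());
      }
  }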
>> 2. Should the stale data be deleted after compaction, at least as an
>> option?
> If you want to make a PR ...
Understood.
>> 3. Should there be a JIRA issue about UI and API support for creating
>> TDB2 datasets?
> Every JIRA is a request for someone to do work or an offer to contribute.
That's why I asked first! When I notice clear bugs, I create JIRA
issues, and likewise when I have something to contribute. But delving
deep into Fuseki/TDB2 integration issues is a bit too far outside my
comfort zone, unfortunately.
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi