On 27/10/17 11:44, Osma Suominen wrote:
Hi,

As I promised earlier, I took TDB2 for a little test drive, using the 3.5.0rc1 builds.

I tested two scenarios: A server running Fuseki, and command line tools operating directly on a database directory.

1. Server running Fuseki

First the server (running as a VM). So far I've been using Fuseki with HDT support, from the hdt-java repository. I'm serving a dataset of about 39M triples, which occasionally changes (eventually this will be updated once per month, or perhaps more frequently, even once per day). With HDT, I can simply rebuild the HDT file (less than 10 minutes) and then restart Fuseki. Downtime for the endpoint is only a few seconds. But I'm worried about the state of the hdt-java project: it is not actively maintained and it's still based on Fuseki1.
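
For reference, the monthly rebuild is just the hdt-java conversion tool plus a Fuseki restart, roughly like this (file names are placeholders; rdf2hdt.sh is the converter shipped with hdt-java, and I'm assuming Fuseki runs as a systemd service here):

    # Rebuild the HDT file from the source dump (under 10 minutes here)
    rdf2hdt.sh dataset.nt dataset.hdt

    # Restart Fuseki so it picks up the new file; the endpoint is down
    # only for the few seconds the restart takes
    systemctl restart fuseki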

You don't need to use their Fuseki integration.

So I switched (for now) to Fuseki2 with TDB2. The switch was rather smooth, thanks to the documentation that Andy provided. I usually create Fuseki2 datasets via the API (using curl), but I noticed that, like the UI, the API only supports "mem" and "tdb". So I created a "tdb" dataset first, then edited the configuration file to use tdb2 instead.
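
For the record, the dataset creation went roughly like this (host, port and the dataset name are placeholders):

    # Create a "tdb" dataset via the admin API (it only offers "mem" and "tdb")
    curl --data 'dbName=mydata&dbType=tdb' 'http://localhost:3030/$/datasets'

    # Then edit the generated configuration file to use the TDB2 assembler
    # (tdb2:DatasetTDB2 instead of tdb:DatasetTDB) and restart Fuseki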

Loading the data took about 17 minutes. I used wget for this, following Andy's example. This is a bit slower than regenerating the HDT, but acceptable since I'm only doing it occasionally. I also tested executing queries while reloading the data. This seemed to work fine, even though performance obviously suffered. But at least the endpoint stayed up.
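
The load itself was a plain HTTP POST to the dataset's data endpoint, along these lines (file and dataset names are placeholders):

    # POST the triples into the default graph of the live dataset;
    # queries keep working (more slowly) while the load runs
    wget --post-file=dataset.nt \
         --header='Content-Type: application/n-triples' \
         -O - 'http://localhost:3030/mydata/data?default'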

The TDB2 directory ended up at 4.6GB. In contrast, the HDT file + index for the same data is 560MB.

I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost twice its original size. I understand that a TDB2 database needs to be compacted regularly, otherwise it will keep growing. I'm OK with the large disk usage as long as it stays constant instead of growing over time as with TDB1.

2. Command line tools

For this I used an older version of the same dataset with 30M triples, the same one I used for my HDT vs TDB comparison that I posted on the users mailing list: http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E

This was on my i3-2330M laptop with 8GB RAM and SSD.

Thank you for the figures.

Loading the data using tdb2.tdbloader took about 18 minutes (about 28k triples per second). The TDB2 directory is 3.7GB. In contrast, using tdbloader2 (the TDB1 bulk loader), loading took 11 minutes and the TDB1 directory was 2.7GB. So TDB2 is slower to load and takes more disk space than TDB1.
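
For comparison, the two loader invocations were roughly (directory and file names are placeholders):

    # TDB2 loader: ~18 minutes, 3.7GB on disk
    tdb2.tdbloader --loc=DB2 dataset.nt

    # TDB1 bulk loader: ~11 minutes, 2.7GB on disk
    tdbloader2 --loc=DB1 dataset.nt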

Those are low figures for ~40M triples. Lack of free RAM? (It's more acute with TDB2 at the moment, as it does random I/O.) Which RDF syntax? A lot of long literals?

Today, with TDB2:

INFO  Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)


I ran the same example query I used before against the TDB2 database. The first run was slow (33 seconds), but subsequent runs took 16.1-18.0 seconds.

I also re-ran the same query on TDB1 using tdbquery from Jena 3.5.0rc1. After the first run (24 seconds), the query took 13.7-14.0 seconds.
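
Both timings came from the command line tools, roughly like this (paths are placeholders):

    # TDB2: 33s cold, then 16.1-18.0s per run
    tdb2.tdbquery --loc=DB2 --query=query.rq

    # TDB1: 24s cold, then 13.7-14.0s per run
    tdbquery --loc=DB1 --query=query.rq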

I also reloaded the same data into the TDB2 database to see the effect. Reloading took 11 minutes and the database grew to 5.7GB. Then I compacted it using tdb2.tdbcompact. Compacting took 18 minutes and disk usage only grew further, to 9.7GB. The database directory then contained both Data-0001 and Data-0002 directories. I removed Data-0001 and disk usage fell to 4.0GB. Not quite the original 3.7GB, but close.
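
The compact-and-clean-up sequence, for reference (the directory name is a placeholder, and I'm assuming tdb2.tdbcompact takes the usual --loc option):

    # Compaction writes a fresh generation (Data-0002) next to the old one,
    # so disk usage grows first: 5.7GB -> 9.7GB here
    tdb2.tdbcompact --loc=DB2

    # The stale generation is left in place; removing it reclaims the space
    # (9.7GB -> 4.0GB here)
    rm -rf DB2/Data-0001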

My impressions so far: it works, but it's slower than TDB1 and needs more disk space. Compaction seems to work, but initially it just increases disk usage; the stale data has to be removed manually before any space is actually reclaimed.

The user can archive it or delete it.

I didn't test subsequent load/compact cycles, but I assume there may still be some disk space growth (e.g. due to blank nodes, of which there are some in my dataset) even if the data is regularly compacted.

For me, not growing over time like TDB1 does is really the crucial feature that TDB2 seems to promise. Right now it's not clear whether it entirely fulfills this promise, since compaction has to be triggered manually and doesn't reclaim disk space by itself.

Questions/suggestions:

1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd prefer not to take the endpoint down for compaction.

Not currently. As I've said, there is no Fuseki change except integrating the TDB2 jars.

Adding a template name to the HTTP API would be good, but IMO UI access is a long way off. TDB1 works for people.


2. Should the stale data be deleted after compaction, at least as an option?

If you want to make a PR ...


3. Should there be a JIRA issue about UI and API support for creating TDB2 datasets?

Every JIRA is a request for someone to do work or an offer to contribute.

    Andy


-Osma
