> Adding a template name to the HTTP API would be good but IMO it's a long way
> off to provide UI access. TDB1 works for people.
This is true, but if we can give people an easy way to create TDB2 dbs and
compare them apples-to-apples in their own systems, we will get more feedback
more quickly. That having been said, I honestly do not know anything about
how the Fuseki UI is coded. Is it done with a well-known template library?

ajs6f

> On Nov 16, 2017, at 3:02 PM, Andy Seaborne <[email protected]> wrote:
>
> On 27/10/17 11:44, Osma Suominen wrote:
>> Hi,
>> As I've promised earlier, I took TDB2 for a little test drive, using the
>> 3.5.0rc1 builds.
>>
>> I tested two scenarios: a server running Fuseki, and command-line tools
>> operating directly on a database directory.
>>
>> 1. Server running Fuseki
>>
>> First the server (running as a VM). Currently I've been using Fuseki with
>> HDT support, from the hdt-java repository. I'm serving a dataset of about
>> 39M triples, which occasionally changes (eventually this will be updated
>> once per month, or perhaps more frequently, even once per day). With HDT,
>> I can simply rebuild the HDT file (less than 10 minutes) and then restart
>> Fuseki. Downtime for the endpoint is only a few seconds. But I'm worried
>> about the state of the hdt-java project: it is not being actively
>> maintained, and it's still based on Fuseki1.
>
> You don't need to use their Fuseki integration.
>
>> So I switched (for now) to Fuseki2 with TDB2. It was rather smooth, thanks
>> to the documentation that Andy provided. I usually create Fuseki2 datasets
>> via the API (using curl), but I noticed that, like the UI, the API only
>> supports "mem" and "tdb". So I created a "tdb" dataset first, then edited
>> the configuration file so that it uses TDB2 instead.
>>
>> Loading the data took about 17 minutes. I used wget for this, per Andy's
>> example. This is a bit slower than regenerating the HDT, but acceptable
>> since I'm only doing it occasionally. I also tested executing queries
>> while reloading the data. This seemed to work OK, even though performance
>> obviously did suffer.
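(For reference: the "created a 'tdb' dataset first, then edited the
configuration file" step above amounts to switching the dataset description
to a TDB2 assembler. A minimal sketch follows; the service name "ds" and the
database location are illustrative assumptions, not taken from the message:)

```turtle
PREFIX :       <#>
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX tdb2:   <http://jena.apache.org/2016/tdb#>

# Service name and database location are illustrative.
:service rdf:type fuseki:Service ;
    fuseki:name    "ds" ;
    fuseki:dataset :dataset .

# The key change: the dataset is declared as tdb2:DatasetTDB2
# rather than the TDB1 tdb:DatasetTDB.
:dataset rdf:type tdb2:DatasetTDB2 ;
    tdb2:location  "databases/ds" .
```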
>> But at least the endpoint remained up.
>>
>> The TDB2 directory ended up at 4.6GB. In contrast, the HDT file plus
>> index for the same data is 560MB.
>>
>> I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost
>> twice its original size. I understand that TDB2 needs to be compacted
>> regularly, otherwise it will keep growing. I'm OK with the large disk
>> space usage if it's constant, not growing over time like TDB1.
>>
>> 2. Command-line tools
>>
>> For this I used an older version of the same dataset with 30M triples,
>> the same one I used for my HDT vs. TDB comparison that I posted on the
>> users mailing list:
>> http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E
>> This was on my i3-2330M laptop with 8GB RAM and an SSD.
>
> Thank you for the figures.
>
>> Loading the data using tdb2.tdbloader took about 18 minutes (about 28k
>> triples per second). The TDB2 directory is 3.7GB. In contrast, using
>> tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB. So
>> TDB2 is slower to load and takes more disk space than TDB.
>
> Those are low figures for 40M. Lack of free RAM? (It's more acute with
> TDB2 ATM as it does random I/O.) RDF syntax? A lot of long literals?
>
> Today, TDB2:
>
>   INFO  Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)
>
>> I ran the same example query I used before on the TDB2. The first time
>> was slow (33 seconds), but subsequent queries took 16.1-18.0 seconds.
>> I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1. The
>> query took 13.7-14.0 seconds after the first run (24 seconds).
>>
>> I also reloaded the same data to the TDB2 to see the effect. Reloading
>> took 11 minutes and the database grew to 5.7GB. Then I compacted it using
>> tdb2.tdbcompact. Compacting took 18 minutes and the disk usage just grew
>> further, to 9.7GB. The database directory then contained both Data-0001
>> and Data-0002 directories.
>> I removed Data-0001 and disk usage fell to 4.0GB. Not quite the same as
>> the original 3.7GB, but close.
>>
>> My impressions so far: it works, but it's slower than TDB and needs more
>> disk space. Compaction seems to work, but initially it just increases
>> disk usage. The stale data has to be removed manually to actually reclaim
>> any space.
>
> The user can archive it or delete it.
>
>> I didn't test subsequent load/compact cycles, but I assume there may
>> still be some disk space growth (e.g. due to blank nodes, of which there
>> are some in my dataset) even if the data is regularly compacted.
>> For me, not growing over time like TDB is really the crucial feature that
>> TDB2 seems to promise. Right now it's not clear whether it entirely
>> fulfills this promise, since compaction needs to be done manually and
>> doesn't actually reclaim disk space by itself.
>>
>> Questions/suggestions:
>>
>> 1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd
>> prefer not to take the endpoint down for compaction.
>
> Not currently. As I've said, there is no Fuseki change except integrating
> the TDB2 jars.
>
> Adding a template name to the HTTP API would be good, but IMO it's a long
> way off to provide UI access. TDB1 works for people.
>
>> 2. Should the stale data be deleted after compaction, at least as an
>> option?
>
> If you want to make a PR ...
>
>> 3. Should there be a JIRA issue about UI and API support for creating
>> TDB2 datasets?
>
> Every JIRA is a request for someone to do work or an offer to contribute.
>
> Andy
>
>> -Osma
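For readers following along, the command-line steps Osma describes can be
sketched roughly as below. The database location and file names are
illustrative, and the generation directory to remove (Data-0001, Data-0002,
...) depends on how many compactions have already run:

```shell
# Load into a TDB2 database (tools ship with Apache Jena):
tdb2.tdbloader --loc DB2 dataset.nt

# Query it directly from the command line:
tdb2.tdbquery --loc DB2 --query example.rq

# Compact; this writes a new Data-NNNN generation alongside the old one,
# so disk usage grows before it can shrink:
tdb2.tdbcompact --loc DB2

# Space is only reclaimed once the stale generation is archived or removed
# by hand, as discussed in the thread:
rm -rf DB2/Data-0001
```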
