> Adding a template name to the HTTP API would be good but IMO it's a long way 
> off to provide UI access.  TDB1 works for people.

This is true, but if we can give people an easy way to create TDB2 dbs and 
compare them apples-to-apples in their own systems, we will get more feedback 
more quickly.

That having been said, I honestly do not know anything about how the Fuseki UI 
is coded. Is it done with a well-known template library?

ajs6f

> On Nov 16, 2017, at 3:02 PM, Andy Seaborne <[email protected]> wrote:
> 
> 
> 
> On 27/10/17 11:44, Osma Suominen wrote:
>> Hi,
>> As I've promised earlier I took TDB2 for a little test drive, using the 
>> 3.5.0rc1 builds.
>> I tested two scenarios: A server running Fuseki, and command line tools 
>> operating directly on a database directory.
>> 1. Server running Fuseki
>> First the server (running as a VM). Currently I've been using Fuseki with 
>> HDT support, from the hdt-java repository. I'm serving a dataset of about 
>> 39M triples, which occasionally changes (eventually this will be updated 
>> once per month, or perhaps more frequently, even once per day). With HDT, I 
>> can simply rebuild the HDT file (less than 10 minutes) and then restart 
>> Fuseki. Downtime for the endpoint is only a few seconds. But I'm worried 
>> about the state of the hdt-java project, it is not being actively maintained 
>> and it's still based on Fuseki1.
> 
> You don't need to use their Fuseki integration.
> 
>> So I switched (for now) to Fuseki2 with TDB2. It was rather smooth thanks to 
>> the documentation that Andy provided. I usually create Fuseki2 datasets via 
>> the API (using curl), but I noticed that, like the UI, the API only supports 
>> "mem" and "tdb". So I created a "tdb" dataset first, then edited the 
>> configuration file so it uses tdb2 instead.
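For anyone else taking the same route: the edit amounts to swapping the TDB1 dataset type for the TDB2 one in the service's assembler file. A minimal sketch of the relevant fragment, following the TDB2 assembler vocabulary; the dataset name and path here are placeholders, not the setup described in this thread:

```turtle
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb2:   <http://jena.apache.org/2016/tdb#> .

<#service> rdf:type fuseki:Service ;
    fuseki:name    "ds" ;            # placeholder dataset name
    fuseki:dataset <#dataset> .

<#dataset> rdf:type tdb2:DatasetTDB2 ;
    tdb2:location "/fuseki/databases/ds" .   # placeholder path
```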
>> Loading the data took about 17 minutes. I used wget for this, per Andy's 
>> example. This is a bit slower than regenerating the HDT, but acceptable 
>> since I'm only doing it occasionally. I also tested executing queries while 
>> reloading the data. This seemed to work OK even though performance obviously 
>> did suffer. But at least the endpoint remained up.
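For the record, the load itself is a plain HTTP POST to the dataset's Graph Store Protocol endpoint, so any HTTP client works in place of wget. A sketch in Python; the server URL and dataset name are assumptions for illustration, not the setup above:

```python
# Sketch: push an N-Triples file into the default graph of a Fuseki
# dataset via the SPARQL Graph Store Protocol. BASE and the dataset
# name "ds" are placeholders, not taken from this thread.
import urllib.request

BASE = "http://localhost:3030"

def build_load_request(dataset: str, path: str) -> urllib.request.Request:
    """Build (but do not send) the POST that uploads `path` into the
    default graph of `dataset`."""
    with open(path, "rb") as f:
        body = f.read()
    return urllib.request.Request(
        f"{BASE}/{dataset}/data?default",
        data=body,
        method="POST",
        headers={"Content-Type": "application/n-triples"},
    )

# With a server running, sending is one more line:
# urllib.request.urlopen(build_load_request("ds", "data.nt"))
```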
>> The TDB2 directory ended up at 4.6GB. In contrast, the HDT file + index for 
>> the same data is 560MB.
>> I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost twice 
>> its original size. I understand that TDB2 needs to be compacted 
>> regularly, otherwise it will keep growing. I'm OK with the large disk space 
>> usage if it's constant, not growing over time like TDB1.
>> 2. Command line tools
>> For this I used an older version of the same dataset with 30M triples, the 
>> same one I used for my HDT vs TDB comparison that I posted on the users 
>> mailing list:
>> http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E
>>  This was on my i3-2330M laptop with 8GB RAM and SSD.
> 
> Thank you for the figures.
> 
>> Loading the data using tdb2.tdbloader took about 18 minutes (about 28k 
>> triples per second). The TDB2 directory is 3.7GB. In contrast, using 
>> tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB. So TDB2 
>> is slower to load and takes more disk space than TDB.
> 
> Those are low figures for 40M.  Lack of free RAM? (It's more acute with TDB2 
> ATM as it does random I/O.) RDF syntax? A lot of long literals?
> 
> Today: TDB2:
> 
> INFO  Finished: 50,005,630 bsbm-50m.nt.gz 738.81s (Avg: 67,684)
> 
> 
>> I ran the same example query I used before against the TDB2 database. The 
>> first run was slow (33 seconds), but subsequent runs took 16.1-18.0 seconds.
>> I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1. The 
>> query took 13.7-14.0 seconds after the first run (24 seconds).
>> I also reloaded the same data into the TDB2 database to see the effect. 
>> Reloading took 
>> 11 minutes and the database grew to 5.7GB. Then I compacted it using 
>> tdb2.tdbcompact. Compacting took 18 minutes and the disk usage just grew 
>> further, to 9.7GB. The database directory then contained both Data-0001 and 
>> Data-0002 directories. I removed Data-0001 and disk usage fell to 4.0GB. Not 
>> quite the same as the original 3.7GB, but close.
>> My impressions so far: It works, but it's slower than TDB and needs more 
>> disk space. Compaction seems to work, but initially it will just increase 
>> disk usage. The stale data has to be manually removed to actually reclaim 
>> any space. 
> 
> The user can archive it or delete it.
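Until something automatic exists, the deletion can be scripted: after compaction, keep only the highest-numbered Data-NNNN generation and remove the rest. A sketch under that assumption; the Data-NNNN naming matches what compaction produced above, but verify against your own database layout before deleting anything:

```python
# Sketch: reclaim space after tdb2.tdbcompact by deleting all but the
# newest Data-NNNN generation in a TDB2 database directory.
# Assumes the Data-NNNN naming observed above; check your own layout
# before running anything destructive.
import re
import shutil
from pathlib import Path

def prune_old_generations(db_dir: str) -> list:
    """Remove every Data-NNNN directory except the highest-numbered one.
    Returns the names of the directories that were removed."""
    gens = sorted(
        p for p in Path(db_dir).iterdir()
        if p.is_dir() and re.fullmatch(r"Data-\d{4}", p.name)
    )
    removed = []
    for stale in gens[:-1]:          # keep only the last (newest)
        shutil.rmtree(stale)
        removed.append(stale.name)
    return removed
```

In the scenario above this would drop Data-0001 and keep Data-0002, which is exactly the manual step that brought disk usage back down to 4.0GB.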
> 
>> I didn't test subsequent load/compact cycles, but I assume there may still 
>> be some disk space growth (e.g. due to blank nodes, of which there are some 
>> in my dataset) even if the data is regularly compacted.
>> For me, not growing over time like TDB is really the crucial feature that 
>> TDB2 seems to promise. Right now it's not clear whether it entirely fulfills 
>> this promise, since compaction needs to be done manually and doesn't 
>> actually reclaim disk space by itself.
>> Questions/suggestions:
>> 1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd 
>> prefer not to take the endpoint down for compaction.
> 
> Not currently; as I've said, there is no Fuseki change except integrating the 
> TDB2 jars.
> 
> Adding a template name to the HTTP API would be good but IMO it's a long way 
> off to provide UI access.  TDB1 works for people.
> 
>> 2. Should the stale data be deleted after compaction, at least as an option?
> 
> If you want to make a PR ...
> 
>> 3. Should there be a JIRA issue about UI and API support for creating TDB2 
>> datasets?
> 
> Every JIRA is a request for someone to do work or an offer to contribute.
> 
>    Andy
> 
>> -Osma
