Thank you for the clarification, Rob. I presume this happens at an atomic (triple) level for updates as well?
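For reference, this is the kind of update I have in mind - a minimal sketch against the Jena 4.x TDB2 API (class name and database location are placeholders):

import org.apache.jena.query.Dataset;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;
import org.apache.jena.update.UpdateExecutionFactory;
import org.apache.jena.update.UpdateFactory;

public class UpdateSketch {
    public static void main(String[] args) {
        // Placeholder location; point this at a real TDB2 database directory.
        Dataset ds = TDB2Factory.connectDataset("/path/to/DB2");
        // The whole write transaction commits as one unit: the DELETE and the
        // INSERT below become visible together, while concurrent readers keep
        // seeing the previous snapshot until the commit.
        Txn.executeWrite(ds, () -> {
            UpdateExecutionFactory.create(
                UpdateFactory.create(
                    "DELETE DATA { <urn:ex:s> <urn:ex:p> 'old' } ; "
                  + "INSERT DATA { <urn:ex:s> <urn:ex:p> 'new' }"),
                ds).execute();
        });
    }
}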
On Tue, Nov 23, 2021 at 9:42 AM Rob Vesse <[email protected]> wrote:

> Marco
>
> So there's a couple of things going on.
>
> Firstly the Node Table, the mapping of RDF Terms to the internal Node IDs
> used in the indexes, can only ever grow. TDB2 doesn't do reference counting,
> so it doesn't ever remove entries from the table as it doesn't know when a
> Node ID is no longer needed. Also, for RDF Terms that aren't directly
> interned (e.g. some numerics, booleans, dates etc.), so primarily URIs,
> Blank Nodes and larger/arbitrarily typed literals, the Node ID actually
> encodes the offset into the Node Table to make Node ID to RDF Term decoding
> fast, so you can’t just arbitrarily rewrite the Node Table. And even if
> rewriting the Node Table were supported, it would require rewriting all the
> indexes since those use the Node IDs.
>
> TL;DR the Node Table only grows because the cost of compacting it
> outweighs the benefits. This is also why you may have seen advice in the
> past that if your database has a lot of DELETE operations made against it,
> then periodically dumping all the data and reloading it into a new
> database is recommended, since that generates a fresh Node Table with only
> the RDF Terms currently in use.
>
> Secondly, the indexes are themselves versioned storage, so when you modify
> the database a new state is created (potentially pointing to some/all of
> the existing data) but the old data is still there as well. This is done
> for two reasons:
>
> 1) It allows writes to overlap with ongoing reads to improve concurrency.
> Essentially each read/write transaction operates on a snapshot of the data;
> a write creates a new snapshot, but an ongoing read can continue to read
> the old snapshot it was working against.
> 2) It provides for strong fault tolerance, since a crash/exit during a
> write doesn't affect old data.
>
> Note that you can perform a compact operation on a TDB2 database, which
> essentially discards all but the latest snapshot and should reclaim the
> index data that is no longer needed. This is a blocking exclusive write
> operation, so it doesn't allow for concurrent reads as a normal write would.
>
> Cheers,
>
> Rob
>
> PS. I'm sure Andy will chime in if I've misrepresented/misstated anything
> above
>
> On 22/11/2021, 21:15, "Marco Neumann" <[email protected]> wrote:
>
> Yes, I just had a look at one of my own datasets with 180mt and a
> footprint of 28G. The overhead is not too bad at 10-20% vs raw nt files.
>
> I was surprised that the CLEAR ALL directive doesn't remove/release
> disk memory. Does TDB2 require a commit to release disk space?
>
> Impressed to see that load times went up to 250k/s with 4.2, more than
> twice the speed I have seen with 3.15. Not sure if this is OS
> (Ubuntu 20.04.3 LTS) related.
>
> Maybe we should make a recommendation to the wikidata team to provide
> us with a production environment type machine to run some load and
> query tests.
>
> On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <[email protected]> wrote:
>
> > On 21/11/2021 21:03, Marco Neumann wrote:
> > > What's the disk footprint these days for 1b on tdb2?
> >
> > Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant-sized
> > literals - the nodes themselves are 50G). Obviously for current WD-scale
> > usage a sprinkling of compression would be good!
> >
> > One thing xloader gives us is that it makes it possible to load on a
> > spinning disk.
> > (It also has lower peak intermediate file space and is faster because it
> > does not fall into the slow loading mode for the node table that
> > tdbloader2 sometimes did.)
> >
> > Andy
> >
> > On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <[email protected]> wrote:
> >
> > >> On 20/11/2021 14:21, Andy Seaborne wrote:
> > >>> Wikidata are looking for a replacement for BlazeGraph.
> > >>>
> > >>> About WDQS, current scale and current challenges:
> > >>> https://youtu.be/wn2BrQomvFU?t=9148
> > >>>
> > >>> And in the process of appointing a graph consultant (5 month contract):
> > >>> https://boards.greenhouse.io/wikimedia/jobs/3546920
> > >>>
> > >>> and Apache Jena came up:
> > >>> https://phabricator.wikimedia.org/T206560#7517212
> > >>>
> > >>> Realistically?
> > >>>
> > >>> Full wikidata is 16B triples. Very hard to load - xloader may help,
> > >>> though the goal for that was to make loading the truthy subset (5B)
> > >>> easier. 5B -> 16B is not a trivial step.
> > >>
> > >> And it's growing at about 1B per quarter.
> > >> https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
> > >>
> > >>> Even if wikidata loads, it would be impractically slow as TDB is today.
> > >>> (Yes, that's fixable; not practical in their timescales.)
> > >>>
> > >>> The current discussions feel more like they are looking for a "product"
> > >>> - a triplestore that they can use - rather than a collaboration.
> > >>>
> > >>> Andy
> >
> --
> ---
> Marco Neumann
> KONA

--
---
Marco Neumann
KONA
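PS. Noting it down for my own future reference: the compact operation Rob describes can be run with the tdb2.tdbcompact command-line tool or, if I read the docs right, programmatically - a minimal sketch (Jena 4.x; class name and database location are placeholders):

import org.apache.jena.query.Dataset;
import org.apache.jena.tdb2.DatabaseMgr;
import org.apache.jena.tdb2.TDB2Factory;

public class CompactSketch {
    public static void main(String[] args) {
        // Placeholder location; open the TDB2 database directory to be compacted.
        Dataset ds = TDB2Factory.connectDataset("/path/to/DB2");
        // Discards all but the latest snapshot. This is an exclusive operation,
        // so run it when concurrent readers/writers can be paused.
        DatabaseMgr.compact(ds.asDatasetGraph());
    }
}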
