Re: Wikidata evolution

Andy Seaborne Sun, 06 Mar 2022 14:21:47 -0800



On 06/03/2022 09:08, LB wrote:

Hi Andy,
Yes, I also tried the rewriting, which indeed performed faster. The issue here was that TDB2 didn't use the stats.opt file because it was in the wrong location. I'm still not convinced that it should be located in $TDB2_LOCATION/Data-XXXX instead of $TDB2_LOCATION/ - especially since after a compact or some other operation you will have to move the stats file to the newer data directory.


Compaction could copy it over.

Actually, compact could update it if it is generated by stats.

The reason it's in Data-XXXX is that it is related to the storage, not the switchable overlay. Not immovable, but that is the reason it is where it is. History.
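
A minimal sketch of the "copy it over" idea, done by hand after a compaction. It assumes the storage generations are sibling directories named Data-NNNN under the database location and that the stats file is called stats.opt, as discussed above. The function name carry_stats_forward is made up for illustration and is not part of Jena.

```python
import shutil
from pathlib import Path
from typing import Optional


def carry_stats_forward(tdb2_location: Path) -> Optional[Path]:
    """Copy stats.opt from the most recent Data-NNNN directory that has one
    into the newest Data-NNNN directory. Returns the new file's path, or
    None if there was nothing to do."""
    # Zero-padded generation names sort correctly as plain strings.
    generations = sorted(p for p in tdb2_location.glob("Data-*") if p.is_dir())
    if not generations:
        return None
    target = generations[-1] / "stats.opt"
    if target.exists():
        return None  # the newest generation already has a stats file
    for older in reversed(generations[:-1]):
        source = older / "stats.opt"
        if source.exists():
            shutil.copy2(source, target)
            return target
    return None
```

Calling carry_stats_forward(Path("/path/to/DB")) would be a one-off step after compacting. The copied statistics are only as fresh as the run that generated them.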

I have now moved the TDB2 instance to the faster server: Ryzen 5950X, 16C/32T, 128GB RAM, 3.4GB NVMe RAID 1
I ran a few more queries, which take a lot of time; either my machine is too slow or it is just as it is:
The count query to compute the dataset size

SELECT (count(*) as ?cnt) {
   ?s ?p ?o
}

Runtime: 2489.115 seconds


That will use the SPO.* files.

Massively sensitive to warm up!

I suspect that OS caching could be better informed about access patterns - that would take some native code (a lot more practical in Java 17).
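
A back-of-envelope check, assuming the count reads SPO.dat once from end to end at the rate iotop reports. That is a simplification and is not confirmed anywhere in this thread. The 373G file size and the 70M/s and 150M/s read rates are the figures quoted elsewhere in the thread; decimal units are an assumption.

```python
def scan_seconds(file_bytes: float, bytes_per_second: float) -> float:
    """Time to read a file once, sequentially, at a constant rate."""
    return file_bytes / bytes_per_second


GB = 1e9  # decimal units assumed; du and iotop may report binary units
MB = 1e6

spo_dat = 373 * GB  # size of SPO.dat as listed in the thread

for label, rate in [("NVMe at 150M/s", 150 * MB), ("HDD at 70M/s", 70 * MB)]:
    print(f"{label}: ~{scan_seconds(spo_dat, rate):.0f} s")
```

Under those assumptions the estimates come out near 2,500 s and 5,300 s, in the same range as the measured 2489 s and ~6000 s. That would point at the achieved read rate over the large .dat file, rather than the small .idn files, as the thing to look at.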

Observations are rather hard: iotop showed a read speed of 150M/s - I don't know how to interpret this, it sounds rather slow for an NVMe SSD. I also don't yet understand which files are touched. If those are just the index files, then I don't get why it takes so much time, given that the index files are rather small at ~1G.
Some counting with a join and a filter:

SELECT (count(*) as ?cnt) WHERE {
   ?s wdt:P31 wd:Q5 ;
      rdfs:label ?l
   filter(lang(?l)='en')
}

Runtime: 4817.022 seconds
I compared those queries with the (public) QLever triple store; the latter query takes 2s there - since this is on their public server the comparison is not fair, and maybe their init process does more caching in advance.
I'm also trying to set it up locally on the same server as the TDB2 instance and will compare again - I just learned that in future we should rent servers with way more disk space ... "lesson learned"


Great.

Judging from the public server, QLever doesn't support many SPARQL functions.

        Andy

On 05.03.22 11:57, Andy Seaborne wrote:
Two comments inline:

On 02/03/2022 15:41, LB wrote:
Hm,

coming back to this query

SELECT * WHERE {   ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100

I calculated the triple pattern sizes:

p:P625: ~9M

ps:P625: ~9M

pq:P376: ~1K
Also try rewriting it with pq:P376 first.
SELECT * WHERE
  { _:x ps:P625 ?o; pq:P376 ?body. ?s p:P625 _:x . }
LIMIT 100


which is:


SELECT  *
WHERE
  { ?s    p:P625   _:b0 .
    _:b0  ps:P625  ?o .
    _:b0  pq:P376  ?body
  }
LIMIT   100

==>


SELECT  *
WHERE
  { _:b0  pq:P376  ?body .
    _:b0  ps:P625  ?o .
    ?s    p:P625   _:b0 .
  }
LIMIT   100
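
The idea behind the rewrite can be sketched as a greedy ordering over the estimated pattern sizes quoted above (~9M, ~9M, ~1K). This is a toy illustration only, not how Jena's optimizer or the stats.opt rules actually work. The names reorder and variables are made up for the example, and keying the estimates by predicate is a simplification.

```python
from typing import Dict, List, Set, Tuple

Pattern = Tuple[str, str, str]  # (subject, predicate, object)


def variables(pattern: Pattern) -> Set[str]:
    """Variables and blank-node labels both act as join keys here."""
    return {t for t in pattern if t.startswith("?") or t.startswith("_:")}


def reorder(patterns: List[Pattern], estimates: Dict[str, int]) -> List[Pattern]:
    """Start with the smallest estimate, then keep taking the smallest
    remaining pattern that shares a join key with what is already bound."""
    remaining = sorted(patterns, key=lambda p: estimates[p[1]])
    ordered = [remaining.pop(0)]
    bound = variables(ordered[0])
    while remaining:
        connected = [p for p in remaining if variables(p) & bound]
        nxt = (connected or remaining)[0]  # fall back to a cross product
        remaining.remove(nxt)
        ordered.append(nxt)
        bound |= variables(nxt)
    return ordered


patterns = [
    ("?s", "p:P625", "_:b0"),
    ("_:b0", "ps:P625", "?o"),
    ("_:b0", "pq:P376", "?body"),
]
estimates = {"p:P625": 9_000_000, "ps:P625": 9_000_000, "pq:P376": 1_000}

for subject, predicate, obj in reorder(patterns, estimates):
    print(subject, predicate, obj)
```

The pq:P376 pattern comes out first, so the join starts from roughly a thousand bindings of _:b0 instead of nine million. The order of the two ~9M patterns is a tie in this sketch.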

Even with computing TDB stats it doesn't seem to perform well (not sure if those stats have been taken into account; as usual I put stats.opt into the TDB dir). It took 180s even after I did a full count of all 18.8B triples in advance to warm the cache.
Counting by itself only warms the triple indexes, not the node table nor its indexes.
COUNT(*) or COUNT(?x) does not need the details of the RDF term itself. Term results out of TDB are lazily computed and COUNT, by design, does not trigger pulling from the node table.
    Andy
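
A toy model of the point above, assuming index rows hold integer ids and the terms live in a separate table. The class and function names are invented for illustration and the ids and terms are made-up sample data; this is not TDB's actual code.

```python
class ToyNodeTable:
    """Maps integer ids to RDF term strings and counts how often it is read."""

    def __init__(self, terms):
        self._terms = dict(terms)
        self.lookups = 0

    def term(self, node_id):
        self.lookups += 1
        return self._terms[node_id]


def count_rows(id_rows):
    """COUNT(*) only needs the number of rows, never the terms."""
    return sum(1 for _ in id_rows)


def materialise(id_rows, table):
    """Returning results to a client needs the terms themselves."""
    return [tuple(table.term(i) for i in row) for row in id_rows]


table = ToyNodeTable({1: "wd:Q42", 2: "wdt:P31", 3: "wd:Q5"})
rows = [(1, 2, 3)]

print(count_rows(rows), table.lookups)         # counting: no table reads
print(materialise(rows, table), table.lookups)  # results: one read per term
```

In this model a warm-up count leaves the node table untouched, so a later query that returns terms still pays for cold reads there.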
I guess the files are rather large:
373G    OSP.dat
373G    POS.dat
373G    SPO.dat
186G    nodes-data.obj
85G    nodes.dat
1,3G    OSP.idn
1,3G    POS.idn
1,3G    SPO.idn
720M    nodes.idn
Which files would it touch first for the computation?
By the way, counting all 18.8B triples took ~6000s - HDD read speed was ~70M/s, and given that we have 1.4TB disk size ...
Long story short, with that slow HDD setup it takes ages, or I'm doing something fundamentally wrong. I will copy the TDB image over to another server with SSD to see how things change.
On 02.03.22 14:12, Andy Seaborne wrote:
> iotop showed ~400M/s while executing the last time. Does this
> performance drop really come from HDD vs SSD?

Yes - it could well do.

Try running the queries twice in the same server.

TDB does no pre-work whatsoever so file system caching is significant.

> Especially the last two
> queries just have different limits, so I assume the joins are just too
> heavy?

    Andy

On 02/03/2022 08:22, LB wrote:
Hi all,
just as a follow up I loaded Wikidata latest full into TDB2 via xloader on a different, less powerful server:
- 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per cpu, 2 threads per core, -> 16C/32T)
- 128GB RAM
- non SSD RAID
It took about 93h with --threads 28; again I lost the logs because somebody rebooted the server yesterday. I will restart it soon and keep the logs on disk this time instead of in the terminal.
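
For comparison with the load rates mentioned elsewhere in this thread, a quick average can be worked out, assuming the 18.8B triple count reported later for this dataset and the full 93h of wall-clock time.

```python
def load_rate(triples: float, hours: float) -> float:
    """Average triples per second over a whole load."""
    return triples / (hours * 3600)


rate = load_rate(18.8e9, 93)
print(f"~{rate / 1000:.0f}k triples/s")
```

That is an average of roughly 56k triples/s on the spinning-disk server, below the 87k t/s quoted for a billion triples, which fits the remark that loading rate drops with scale.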
Afterwards I started querying a bit via Fuseki, and surprisingly, for a very common Wikidata query making use of qualifiers, the performance was rather low:
16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET on port 3031
16:24:54 INFO  Fuseki          :: [1] POST http://localhost:3031/ds/sparql
16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s wdt:P625 ?o } LIMIT 10
16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
16:25:57 INFO  Fuseki          :: [2] POST http://localhost:3031/ds/sparql
16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: <http://www.wikidata.org/prop/> PREFIX ps: <http://www.wikidata.org/prop/statement/> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE { ?s p:P625 [ps:P625 ?o] } LIMIT 10
16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
16:26:51 INFO  Fuseki          :: [3] POST http://localhost:3031/ds/sparql
16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: <http://www.wikidata.org/prop/> PREFIX ps: <http://www.wikidata.org/prop/statement/> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE { ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
16:27:21 INFO  Fuseki          :: [4] POST http://localhost:3031/ds/sparql
16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: <http://www.wikidata.org/prop/> PREFIX ps: <http://www.wikidata.org/prop/statement/> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE { ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)
iotop showed ~400M/s while executing the last time. Does this performance drop really come from HDD vs SSD? Especially the last two queries just have different limits, so I assume the joins are just too heavy?
On 23.11.21 13:10, Andy Seaborne wrote:
Try loading truthy:
https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2
(it always has "BETA" in the name)

The current latest is:
https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
    Andy

On 23/11/2021 11:12, Marco Neumann wrote:
that's on commodity hardware

http://www.lotico.com/index.php/JENA_Loader_Benchmarks
Load times are just load times. Including indexing I'm down to 137,217 t/s.
Sure, with a billion triples I am down to 87k t/s

but still reasonable for most of my use cases.
On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <a...@apache.org> wrote:
On 22/11/2021 21:14, Marco Neumann wrote:
Yes, I just had a look at one of my own datasets with 180mt and a footprint of 28G. The overhead is not too bad at 10-20% vs raw nt files.
I was surprised that the CLEAR ALL directive doesn't remove/release disk space. Does TDB2 require a commit to release disk space?
Any active read transactions can still see the old data. You can't
delete it for real.

Run compact.
Impressed to see that load rates went up to 250k/s
What was the hardware?
with 4.2, more than twice the speed I have seen with 3.15. Not sure if this is OS (Ubuntu 20.04.3 LTS) related.
You won't get 250k at scale. Loading rate slows for algorithmic reasons and system reasons.

Now try 500m!
Maybe we should make a recommendation to the wikidata team to provide us with a production environment type machine to run some load and query tests.
On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <a...@apache.org> wrote:
On 21/11/2021 21:03, Marco Neumann wrote:
What's the disk footprint these days for 1b on tdb2?
Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significant-sized literals - the nodes themselves are 50G). Obviously, for current WD scale usage, a sprinkling of compression would be good!
One thing xloader gives us is that it makes it possible to load on a spinning disk. (It also has lower peak intermediate file space and is faster because it does not fall into the slow loading mode for the node table that tdbloader2 sometimes did.)

       Andy
On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <a...@apache.org> wrote:
On 20/11/2021 14:21, Andy Seaborne wrote:
Wikidata are looking for a replacement for BlazeGraph

About WDQS, current scale and current challenges
      https://youtu.be/wn2BrQomvFU?t=9148

And in the process of appointing a graph consultant: (5 month
contract):
https://boards.greenhouse.io/wikimedia/jobs/3546920

and Apache Jena came up:
https://phabricator.wikimedia.org/T206560#7517212

Realistically?
Full wikidata is 16B triples. Very hard to load - xloader may help, though the goal for that was to make loading the truthy subset (5B) easier. 5B -> 16B is not a trivial step.
And it's growing at about 1B per quarter.
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
Even if wikidata loads, it would be impractically slow as TDB is today.
(yes, that's fixable; not practical in their timescales.)

The current discussions feel more like they are looking for a "product" - a triplestore that they can use - rather than a collaboration.
        Andy
