> iotop showed ~400M/s while the last query was executing. Does this
> performance drop really come from HDD vs SSD?

Yes - it could well do.

Try running the queries twice on the same server.

TDB does no pre-work whatsoever, so file-system caching is significant.
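
For example, a quick (untested) sketch using Jena's Java API that times one of your qualifier queries twice from the same process - the second run should show the effect of a warm OS file-system cache. The endpoint URL is taken from your log below:

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFormatter;

    public class WarmCacheCheck {
        public static void main(String[] args) {
            String service = "http://localhost:3031/ds/sparql";  // Fuseki endpoint from the log
            String query =
                "PREFIX p: <http://www.wikidata.org/prop/> "
                + "PREFIX ps: <http://www.wikidata.org/prop/statement/> "
                + "PREFIX pq: <http://www.wikidata.org/prop/qualifier/> "
                + "SELECT * WHERE { ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10";
            for (int run = 1; run <= 2; run++) {                 // run the same query twice
                long start = System.nanoTime();
                try (QueryExecution qexec = QueryExecutionFactory.sparqlService(service, query)) {
                    int rows = ResultSetFormatter.consume(qexec.execSelect());  // drain all rows
                    System.out.printf("run %d: %d rows in %d ms%n",
                            run, rows, (System.nanoTime() - start) / 1_000_000);
                }
            }
        }
    }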

> In particular, the last two queries differ only in their LIMIT, so I
> assume the joins are just too heavy?

    Andy

On 02/03/2022 08:22, LB wrote:
Hi all,

Just as a follow-up: I loaded the latest full Wikidata dump into TDB2 via xloader on a different, less powerful server:

- 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (8 cores per CPU, 2 threads per core -> 16C/32T)
- 128GB RAM
- non-SSD RAID

It took about 93h with --threads 28. Again I lost the logs because somebody rebooted the server yesterday; I will restart it soon and keep the logs on disk this time instead of only in the terminal.

Afterwards I started querying a bit via Fuseki and, surprisingly, the performance for a very common Wikidata query making use of qualifiers was rather low:

16:22:29 INFO  Server          :: Started 2022/03/01 16:22:29 CET on port 3031
16:24:54 INFO  Fuseki          :: [1] POST http://localhost:3031/ds/sparql
16:24:54 INFO  Fuseki          :: [1] Query = PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT * WHERE { ?s wdt:P625 ?o } LIMIT 10
16:24:54 INFO  Fuseki          :: [1] 200 OK (313 ms)
16:25:57 INFO  Fuseki          :: [2] POST http://localhost:3031/ds/sparql
16:25:57 INFO  Fuseki          :: [2] Query = PREFIX p: <http://www.wikidata.org/prop/> PREFIX ps: <http://www.wikidata.org/prop/statement/> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE { ?s p:P625 [ps:P625 ?o] } LIMIT 10
16:25:58 INFO  Fuseki          :: [2] 200 OK (430 ms)
16:26:51 INFO  Fuseki          :: [3] POST http://localhost:3031/ds/sparql
16:26:51 INFO  Fuseki          :: [3] Query = PREFIX p: <http://www.wikidata.org/prop/> PREFIX ps: <http://www.wikidata.org/prop/statement/> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE { ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 10
16:27:10 INFO  Fuseki          :: [3] 200 OK (19.088 s)
16:27:21 INFO  Fuseki          :: [4] POST http://localhost:3031/ds/sparql
16:27:21 INFO  Fuseki          :: [4] Query = PREFIX p: <http://www.wikidata.org/prop/> PREFIX ps: <http://www.wikidata.org/prop/statement/> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> SELECT * WHERE { ?s p:P625 [ps:P625 ?o; pq:P376 ?body] } LIMIT 100
16:40:34 INFO  Fuseki          :: [4] 200 OK (793.675 s)

iotop showed ~400M/s while the last query was executing. Does this performance drop really come from HDD vs SSD? In particular, the last two queries differ only in their LIMIT, so I assume the joins are just too heavy?


On 23.11.21 13:10, Andy Seaborne wrote:
Try loading truthy:


https://dumps.wikimedia.org/wikidatawiki/entities/20211117/wikidata-20211117-truthy-BETA.nt.bz2

(it always has "BETA" in the name)

or the current latest:

https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2

    Andy

On 23/11/2021 11:12, Marco Neumann wrote:
that's on commodity hardware

http://www.lotico.com/index.php/JENA_Loader_Benchmarks

Load times are just load times. Including indexing, I'm down to 137,217 t/s.

Sure, with a billion triples I am down to 87kt/s, but that's still reasonable for most of my use cases.


On Tue, Nov 23, 2021 at 10:44 AM Andy Seaborne <a...@apache.org> wrote:



On 22/11/2021 21:14, Marco Neumann wrote:
Yes, I just had a look at one of my own datasets with 180mt and a footprint of 28G. The overhead is not too bad, at 10-20% vs. the raw nt files.

I was surprised that the CLEAR ALL directive doesn't remove/release disk space. Does TDB2 require a commit to release disk space?

Any active read transactions can still see the old data. You can't
delete it for real.

Run compact.
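
There is a command-line tool for it (tdb2.tdbcompact), or from Java - a minimal sketch, where "DB2" is just a placeholder for your actual database directory:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb2.DatabaseMgr;
    import org.apache.jena.tdb2.TDB2Factory;

    public class CompactDB {
        public static void main(String[] args) {
            // Connect to the TDB2 database directory ("DB2" is a placeholder).
            Dataset dataset = TDB2Factory.connectDataset("DB2");
            // Write a new, compacted generation of the database; data removed by
            // deletes/CLEAR stays behind in the previous generation.
            DatabaseMgr.compact(dataset.asDatasetGraph());
        }
    }

The old generation directory can be deleted afterwards to reclaim the disk space.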

impressed to see that load times went up to 250k/s

What was the hardware?

with 4.2 - more than twice the speed I have seen with 3.15. Not sure if
this is OS (Ubuntu 20.04.3 LTS) related.

You won't get 250k at scale. Loading rate slows for algorithmic reasons
and system reasons.

Now try 500m!

Maybe we should recommend to the Wikidata team that they provide us
with a production-environment-type machine to run some load and query
tests.






On Mon, Nov 22, 2021 at 8:43 PM Andy Seaborne <a...@apache.org> wrote:



On 21/11/2021 21:03, Marco Neumann wrote:
What's the disk footprint these days for 1b on tdb2?

Quite a lot. For 1B BSBM, ~125G (which is a bit heavy on significantly
sized literals - the nodes themselves are 50G). Obviously, for current
WD-scale usage a sprinkling of compression would be good!

One thing xloader gives us is that it makes it possible to load on a
spinning disk. (It also has lower peak intermediate file space and is
faster because it does not fall into the slow loading mode for the node
table that tdbloader2 sometimes did.)

       Andy


On Sun, Nov 21, 2021 at 8:00 PM Andy Seaborne <a...@apache.org> wrote:



On 20/11/2021 14:21, Andy Seaborne wrote:
Wikidata are looking for a replacement for BlazeGraph.

About WDQS, current scale and current challenges
      https://youtu.be/wn2BrQomvFU?t=9148

And they are in the process of appointing a graph consultant (5-month
contract):
https://boards.greenhouse.io/wikimedia/jobs/3546920

and Apache Jena came up:
https://phabricator.wikimedia.org/T206560#7517212

Realistically?

Full Wikidata is 16B triples. Very hard to load - xloader may help,
though the goal for that was to make loading the truthy subset (5B)
easier. 5B -> 16B is not a trivial step.

And it's growing at about 1B per quarter.



https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy


Even if Wikidata loads, it would be impractically slow as TDB is today.
(Yes, that's fixable, but not practical in their timescales.)

The current discussions feel more like they are looking for a "product"
- a triplestore that they can use - rather than a collaboration.

        Andy








