On 09/06/2020 12:18, Wolfgang Fahl wrote:
Marco

thank you for sharing your results. Could you please try making the
sample size 10 and 100 times bigger for the discussion we currently have
at hand? Getting to a billion triples has not been a problem for the
WikiData import. From 1 to 10 billion triples it gets tougher, and
for >10 billion triples there is no success story yet that I know of.

This brings us to the general question - what will we do a few years
from now, when we would like to work with 100 billion triples or more,
and in the coming decades, if data sizes keep growing exponentially?

At several levels, the world is going parallel, both within one "machine" (a modern computer is itself a distributed system) and datacenter-wide.

Scale comes from multiple machines. There is still mileage in larger single-machine architectures and better software, but not in the long term.

At another level - why have all the data in the same place? Convenience.

Search engines are not a feature of WWW architecture. They are an emergent effect because it is convenient (simpler, easier) to have one place to find things - and that also makes it a winner-takes-all market.

Convenience has limits. The search-engine style does not work for all tasks - search within the enterprise, for example, or indeed data itself. And it has consequences in clandestine data analysis and data abuse.

    Andy

Wolfgang


On 09.06.20 at 12:17, Marco Neumann wrote:
same here, I get the best performance on single iron with an SSD and fast
DDR RAM. The cloud datacenters tend to be very selective, and you can
only get the fast dedicated hardware in a few cloud locations.

http://www.lotico.com/index.php/JENA_Loader_Benchmarks

In addition, keep in mind that these are not query benchmarks.

  Marco

On Tue, Jun 9, 2020 at 10:27 AM Andy Seaborne <a...@apache.org> wrote:

It may be that SSD is the important factor.

1/ From a while ago, on truthy:


https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E

before tdb2.tdbloader was a thing.

2/ I did some (not public) testing of tdb2.tdbloader on a mere 800M
triples, with a Dell XPS laptop (2015 model, 16G RAM, 1T M.2 SSD) and a
big AWS server (local NVMe SSD, but virtualized).

The laptop was nearly as fast as a big AWS server.

My assumption was that, as the database grew, RAM caching became less
significant and the speed of I/O became dominant.

FYI: when "tdb2.tdbloader --loader=parallel" gets going, it will saturate
the I/O.
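
For illustration, a run along those lines might look like this - the
database location, heap size and input file name are placeholders to
adjust to the machine at hand:

    # heap for the loader JVM; tune to the available RAM
    export JVM_ARGS="-Xmx16G"
    # parallel loader, writing the TDB2 database onto the SSD
    tdb2.tdbloader --loc /ssd/wikidata-tdb2 --loader=parallel latest-all.ttl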

----

I don't have access to hardware (or ad hoc AWS machines) at the moment,
otherwise I'd give this a try.

Previously, downloading the data to AWS was much faster and much more
reliable than downloading to my local setup. That said, I think
dumps.wikimedia.org does some rate limiting of downloads as well, or my
route to the site ends up on a virtual T3 - I get the magic number of
5 MBytes/s sustained download speed a lot outside working hours.
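
A resumable download helps when the transfer degrades part-way through;
for instance (using the dump URL quoted later in this thread):

    # -c continues a partial download instead of starting over
    wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2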

      Andy

On 09/06/2020 08:04, Wolfgang Fahl wrote:
Hi Johannes,

thank you for bringing the issue to this mailing list again.

At

https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits

there is a question describing the issue, and at

http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena

there is documentation of my own attempts. There has been some feedback
from a few people in the meantime, but I have no report of a success yet.
Also, the only hints for achieving better performance currently concern
RAM and disk: using lots of RAM (up to 2 terabytes) and SSDs (also around
2 terabytes) was mentioned. I asked at my local IT center, and a machine
with that much RAM costs around 30-60 thousand EUR - definitely out of my
budget. I might invest in a 200 EUR 2 terabyte SSD if I could be sure it
would solve the problem. At this time I doubt it, since the software keeps
crashing on me, and there seem to be bugs in the operating system, the
Java Virtual Machine and Jena itself that prevent success. On top of that,
the severe degradation in performance for multi-billion-triple imports
makes testing almost impossible, given an estimated time to finish of half
a year on the (old but sophisticated) hardware that I use daily.

Cheers
    Wolfgang

On 08.06.20 at 17:54, Hoffart, Johannes wrote:
Hi,

I want to load the full Wikidata dump, available at
https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2, to
use in Jena.
I tried loading it with tdb2.tdbloader and $JVM_ARGS set to -Xmx120G.
Initially, the progress (measured by dataset size) is quick. It slows down
considerably after a couple of hundred GB have been written, and finally,
at around 500GB, progress almost comes to a halt.
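
For concreteness, a sketch of the kind of invocation I ran - the database
directory is a placeholder, and decompressing the dump first is an option
if the loader does not take the .bz2 directly:

    export JVM_ARGS="-Xmx120G"
    tdb2.tdbloader --loc /data/wikidata-tdb2 latest-all.ttl.bz2
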
Has anyone ingested Wikidata into Jena before? What are the system
requirements? Is there a specific tdb2.tdbloader configuration that would
speed things up - for example, building an index after the data ingest?
Thanks
Johannes

Johannes Hoffart, Executive Director, Technology Division
Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329
Frankfurt am Main
Email: johannes.hoff...@gs.com | Tel: +49 (0)69 7532 3558
Management Board: Dr. Wolfgang Fink (Chairman) | Thomas Degn-Petersen | Dr.
Matthias Bock
Chairman of the Supervisory Board: Dermot McDonogh
Registered office: Frankfurt am Main | Local Court Frankfurt am Main HRB 114190


