Exactly, Andy. Thank you for the additional context. As a matter of fact,
we already query and manipulate 150bn+ triples in the LOD cloud as
distributed sets every day.

But of course we frequently see practitioners in the community who look at
the Semantic Web, and Jena specifically, primarily as a database technology,
while paying less attention to the Web and to the RDF / SPARQL federation
aspects.

That said, a lot of what we do here on the list with Jena is indeed geared
towards performance, optimization and features, so I will continue to
collect sample data for the lotico benchmarks page. The dataset we have
used so far in the benchmarks simply hits a sweet spot in terms of hardware
requirements and the time it takes to run quick tests, and those tests have
already given me valuable hints on how to scale out clusters for other,
non-public data sets. BTW, if anyone has access to more powerful hardware
configurations, I'd be more than happy to test larger datasets for
benchmarking purposes and would include the results in the page :-) . And
as mentioned by Martynas, a page on the Jena project site might be a good
idea as well.

Wolfgang, I hear you. I've added a dataset with 1 billion triples today and
will continue to add larger datasets over time:
http://www.lotico.com/index.php/JENA_Loader_Benchmarks
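
For anyone who wants to reproduce a run from the benchmarks page, the loads
are plain tdb2.tdbloader invocations. A minimal sketch, assuming a fresh
database directory on SSD (the path and the dataset file name below are
placeholders, not the exact benchmark setup):

    # time a bulk load of an N-Triples file into a fresh TDB2 database
    # /ssd/tdb2-db and dataset.nt.gz are placeholder names
    time tdb2.tdbloader --loc /ssd/tdb2-db --loader=parallel dataset.nt.gz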

If you are specifically interested in the Wikidata dump loading process for
this thread, there is some data available on the wikidata mailing list as
well (no data for Jena yet, though). It took some users 10.2 days to load
the full Wikidata RDF dump (wikidata-20190513-all-BETA.ttl, 379G) with
Blazegraph 2.1.5, and apparently 43 hours with a dev version of Virtuoso.
https://lists.wikimedia.org/pipermail/wikidata/2019-June/013201.html
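
For completeness, here is the rough shape of a full-dump load attempt with
Jena, pulling together the settings mentioned downthread (Andy's
--loader=parallel and the 120G heap Johannes used). A hedged sketch only,
not a configuration known to complete on the full dump:

    # fetch the dump; dumps.wikimedia.org may rate-limit (see Andy's note below)
    wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2

    # give the loader JVM a large heap, as Johannes did
    export JVM_ARGS=-Xmx120G

    # the parallel loader saturates I/O once it gets going, so keep the
    # database directory on SSD/NVMe; decompress the dump first if your
    # setup does not read .bz2 directly
    tdb2.tdbloader --loc /ssd/wikidata-tdb2 --loader=parallel latest-all.ttl.bz2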

Marco



On Wed, Jun 10, 2020 at 9:39 AM Andy Seaborne <a...@apache.org> wrote:

>
>
> On 09/06/2020 12:18, Wolfgang Fahl wrote:
> > Marco
> >
> > thank you for sharing your results. Could you please try to make the
> > sample size 10 and 100 times bigger for the discussion we currently have
> > at hand? Getting to a billion triples has not been a problem for the
> > WikiData import. From 1-10 billion triples it gets tougher, and for
> > >10 billion triples there is no success story yet that I know of.
> >
> > This brings us to the general question: what will we do a few years from
> > now, when we'd like to work with 100 billion triples or more, and in the
> > upcoming decades, where data sizes might keep growing exponentially?
>
> At several levels, the world is going parallel, both within one "machine"
> (a computer is a distributed system) and datacenter-wide.
>
> Scale comes from multiple machines. There is still mileage in larger
> single machine architectures and better software, but not long term.
>
> At another level - why have all the data in the same place? Convenience.
>
> Search engines are not a feature of WWW architecture. They are an
> emergent effect because it is convenient (simpler, easier) to have one
> place to find things - and that also makes it a winner-takes-all market.
>
> Convenience has limits. Search-engine style does not work for all tasks -
> search within the enterprise, for example, or indeed for data. And it has
> consequences in clandestine data analysis and data abuse.
>
>      Andy
>
> > Wolfgang
> >
> >
> > On 09.06.20 at 12:17, Marco Neumann wrote:
> >> same here, I get the best performance on single iron with SSD and fast
> >> DDRAM. The datacenters in the cloud tend to be very selective, and you
> >> can only get the fast dedicated hardware in a few locations in the cloud.
> >>
> >> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
> >>
> >> In addition, keep in mind that these are not query benchmarks.
> >>
> >>   Marco
> >>
> >> On Tue, Jun 9, 2020 at 10:27 AM Andy Seaborne <a...@apache.org> wrote:
> >>
> >>> It may be that SSD is the important factor.
> >>>
> >>> 1/ From a while ago, on truthy:
> >>>
> >>>
> >>> https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E
> >>>
> >>> before tdb2.tdbloader was a thing.
> >>>
> >>> 2/ I did some (not open) testing on a mere 800M with tdb2.tdbloader, on
> >>> a Dell XPS laptop (2015 model, 16G RAM, 1T M.2 SSD) and on a big AWS
> >>> server (local NVMe, but virtualized, SSD).
> >>>
> >>> The laptop was nearly as fast as a big AWS server.
> >>>
> >>> My assumption was that as the database grew, RAM caching became less
> >>> significant and the speed of I/O became dominant.
> >>>
> >>> FYI, when "tdb2.tdbloader --loader=parallel" gets going, it will
> >>> saturate the I/O.
> >>>
> >>> ----
> >>>
> >>> I don't have access to hardware (or ad hoc AWS machines) at the moment,
> >>> otherwise I'd give this a try.
> >>>
> >>> Previously, downloading the data to AWS was much faster and much more
> >>> reliable than to my local setup. That said, I think dumps.wikimedia.org
> >>> does some rate limiting of downloads as well, or my route to the site
> >>> ends up on a virtual T3 - I get the magic number of 5MBytes/s sustained
> >>> download speed a lot outside of working hours.
> >>>
> >>>       Andy
> >>>
> >>> On 09/06/2020 08:04, Wolfgang Fahl wrote:
> >>>> Hi Johannes,
> >>>>
> >>>> thank you for bringing the issue to this mailing list again.
> >>>>
> >>>> At
> >>>> https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
> >>>> there is a question describing the issue, and at
> >>>> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena
> >>>> there is documentation of my own attempts. There has been some feedback
> >>>> by a few people in the meantime, but I have no report of a success yet.
> >>>> Also, the only hints for achieving better performance are currently
> >>>> related to RAM and disk, so using lots of RAM (up to 2 terabytes) and
> >>>> SSDs (also some 2 terabytes) was mentioned. I asked at my local IT
> >>>> center; a machine with that much RAM is around 30-60 thousand EUR and
> >>>> definitely out of my budget. I might invest in a 200 EUR 2-terabyte SSD
> >>>> if I could be sure it would solve the problem. At this time I doubt it,
> >>>> since the software keeps crashing on me, and there seem to be bugs in
> >>>> the operating system, the Java Virtual Machine and Jena itself that
> >>>> prevent success, as well as severe performance degradation for
> >>>> multi-billion-triple imports that makes testing almost impossible,
> >>>> given an estimated time to finish of half a year on the (old but
> >>>> sophisticated) hardware that I use daily.
> >>>>
> >>>> Cheers
> >>>>     Wolfgang
> >>>>
> >>>> On 08.06.20 at 17:54, Hoffart, Johannes wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I want to load the full Wikidata dump, available at
> >>>>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
> >>>>> to use in Jena.
> >>>>> I tried it using tdb2.tdbloader with $JVM_ARGS set to -Xmx120G.
> >>>>> Initially, the progress (measured by dataset size) is quick. It slows
> >>>>> down considerably after a couple of hundred GB written, and finally,
> >>>>> at around 500GB, the progress almost halts.
> >>>>> Did anyone ingest Wikidata into Jena before? What are the system
> >>>>> requirements? Is there a specific tdb2.tdbloader configuration that
> >>>>> would speed things up? For example, building an index after data
> >>>>> ingest?
> >>>>> Thanks
> >>>>> Johannes
> >>>>>
> >>>>> Johannes Hoffart, Executive Director, Technology Division
> >>>>> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 |
> >>>>> D-60329 Frankfurt am Main
> >>>>> Email: johannes.hoff...@gs.com | Tel: +49 (0)69 7532 3558
> >>>>> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen |
> >>>>> Dr. Matthias Bock
> >>>>> Vorsitzender des Aufsichtsrats: Dermot McDonogh
> >>>>> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
> >>
>


-- 
Marco Neumann
KONA
