Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-11 Thread Marco Neumann
Wolfgang, here is another link (one I did not find in your link list yet), this time for setting up Wikidata with Blazegraph on Google Cloud (GCE): https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/ On Thu, Jun 11, 2020 at 7:14 AM Wolfgang Fahl wrote: > > On 10.06.20

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-11 Thread Wolfgang Fahl
On 10.06.20 at 17:46, Marco Neumann wrote: > Wolfgang, I hear you and I've added a dataset today with 1 billion triples > and will continue to try to add larger datasets over time. > http://www.lotico.com/index.php/JENA_Loader_Benchmarks > > If you are only specifically interested in the wikidata

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-10 Thread Marco Neumann
Exactly, Andy, thank you for the additional context, and as a matter of fact we already query and manipulate 150bn+ triples in a LOD cloud as distributed sets every day. But of course we frequently see practitioners in the community who look at the Semantic Web, and Jena specifically, primarily as a

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-10 Thread Andy Seaborne
On 09/06/2020 12:18, Wolfgang Fahl wrote: Marco, thank you for sharing your results. Could you please try to make the sample size 10 and 100 times bigger for the discussion we currently have at hand? Getting to a billion triples has not been a problem for the WikiData import. From 1-10

RE: Resource requirements and configuration for loading a Wikidata dump

2020-06-09 Thread Hoffart, Johannes
Hi Andy, Thanks for the helpful pointers from you and others. I will change the heap settings to see if this at least allows the process to finish. For reference, the machine has 128GB of main memory and a regular HDD attached. I also changed the logging settings to see the progress (would be
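
A minimal sketch of the kind of adjustment described here. It assumes a Jena release whose command-line tools are configured through log4j2 (older releases use log4j 1.x and a log4j.properties file instead); the file name, log pattern, and heap figure are illustrative assumptions, not values taken from the thread.

  # Write a simple log4j2 configuration so loader progress messages
  # (logged at INFO) reach the console.
  printf '%s\n' \
    'status = error' \
    'appender.console.type = Console' \
    'appender.console.name = OUT' \
    'appender.console.layout.type = PatternLayout' \
    'appender.console.layout.pattern = %d{HH:mm:ss} %-5p %c{1} :: %m%n' \
    'rootLogger.level = INFO' \
    'rootLogger.appenderRef.out.ref = OUT' > log4j2.properties

  # Point the loader's JVM at the config; the 8G heap is an illustrative
  # assumption, not the setting reported in the message above.
  export JVM_ARGS="-Xmx8G -Dlog4j.configurationFile=log4j2.properties"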

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-09 Thread Wolfgang Fahl
Marco, thank you for sharing your results. Could you please try to make the sample size 10 and 100 times bigger for the discussion we currently have at hand? Getting to a billion triples has not been a problem for the WikiData import. From 1-10 billion triples it gets tougher and for >10 billion

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-09 Thread Marco Neumann
Same here, I get the best performance on a single bare-metal machine with SSD and fast DDRAM. Cloud data centers tend to be very selective, and you can only get the fast dedicated hardware in a few locations. http://www.lotico.com/index.php/JENA_Loader_Benchmarks In addition, keep in mind

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-09 Thread Andy Seaborne
It may be that SSD is the important factor. 1/ From a while ago, on truthy: https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E before tdb2.tdbloader was a thing. 2/ I did some (not open) testing on a mere 800M and

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-09 Thread Wolfgang Fahl
Hi Johannes, thank you for bringing the issue to this mailing list again. At https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits there is a question describing the issue, and at http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena a

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-08 Thread Martynas Jusevičius
Wouldn't it be a good idea to have a page in the Fuseki/TDB2 documentation with benchmark results and/or user-reported loading statistics, including hardware specs? It would also be useful to map such specs to the AWS instance types: https://aws.amazon.com/ec2/instance-types/ On Mon, Jun 8, 2020

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-08 Thread Andy Seaborne
Hi Johannes, On 08/06/2020 16:54, Hoffart, Johannes wrote: Hi, I want to load the full Wikidata dump, available at https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in Jena. I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, the
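
The reply itself is truncated in this preview. As general TDB2 background (not a paraphrase of the reply): TDB2 keeps its indexes in memory-mapped files, so most of the real caching is done by the operating system's file cache rather than the Java heap, and a heap as large as -Xmx120G on a 128GB machine can crowd that cache out. A hedged sketch with a more modest heap (the exact figure is an illustrative assumption, not a recommendation from the thread):

  # Leave most of the machine's RAM to the OS file cache that backs
  # TDB2's memory-mapped files; 8G here is an illustrative figure.
  export JVM_ARGS="-Xmx8G"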

Re: Resource requirements and configuration for loading a Wikidata dump

2020-06-08 Thread Ahmed El-Roby
Thanks, Johannes, for starting this thread. I am facing the exact same problem with tdb2; for that matter, any significantly large file takes forever to load. I hope this problem has a solution. Thank you. -Ahmed On Mon, Jun 8, 2020 at 11:55 AM Hoffart, Johannes wrote: > Hi, > > I want to

Resource requirements and configuration for loading a Wikidata dump

2020-06-08 Thread Hoffart, Johannes
Hi, I want to load the full Wikidata dump, available at https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in Jena. I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, the progress (measured by dataset size) is quick. It slows down very much
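
For readers following the thread, a sketch of the kind of invocation described above; the database directory is a placeholder, and whether the .bz2 dump can be read directly depends on the Jena version (decompressing first, or recompressing to .gz, is a safe fallback).

  # The heap setting below is the one reported in this message; later
  # replies in the thread discuss whether such a large heap is helpful.
  export JVM_ARGS="-Xmx120G"
  tdb2.tdbloader --loc /data/wikidata-tdb2 latest-all.ttl.bz2

Recent Jena releases also offer a --loader argument on tdb2.tdbloader (for example --loader=parallel); whether it is available and beneficial depends on the installed version and the hardware.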