Ok, here is another successful tdb2 load, this time with the full Wikidata download (20211222_latest-all.nt.gz, 172G).

It counted 16,733,395,878 triples and took a total of "103h 45m 15s" for the entire load. I think with the right hardware this could easily be time-compressed quite a bit. http://www.lotico.com/temp/LOG-45497

Best,
Marco
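
For anyone who wants to reproduce a load like this: the general shape of a tdb2.xloader run is roughly the one below. The database location and tmp directory are placeholders (the actual invocation is not in this mail), so treat it as a sketch and check the options against your Jena version:

    tdb2.xloader --loc /data/wikidata-tdb2 --tmpdir /data/tmp-sort 20211222_latest-all.nt.gz
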
On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <[email protected]> wrote:

> Thank you Andy, found it in revisions somewhere.
>
> Just finished another run with truthy: http://lotico.com/temp/LOG-1214
>
> I will now increase RAM before running an additional load with increased thread count.
>
> Marco
>
> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <[email protected]> wrote:
>
>> Gists are git repos, so the file is there ... somewhere:
>>
>> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>>
>> Andy
>>
>> On 19/12/2021 17:56, Marco Neumann wrote:
>> > Thank you Lorenz, unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on GitHub.
>> >
>> > On Sun, Dec 19, 2021 at 9:46 AM LB <[email protected]> wrote:
>> >
>> >> I edited the Gist [1] and put the default stats there. It takes ~4 min to compute the stats.
>> >>
>> >> Findings:
>> >>
>> >> - For Wikidata we have to extend those stats with stats for the wdt:P31 property, as Wikidata uses this property as its own rdf:type relation. It is indeed trivial, just execute
>> >>
>> >>   select ?c (count(*) as ?cnt) { ?s <http://www.wikidata.org/prop/direct/P31> ?c } group by ?c
>> >>
>> >>   and convert the result into the stats rule language (SSE), putting those rules before the more generic rule
>> >>
>> >>   (<http://www.wikidata.org/prop/direct/P31> 98152611)
>> >>
>> >> - I didn't want to touch the stats script itself, but we could for example also make this type relation generic and allow others like wdt:P31 or skos:subject via a command-line option that takes any URI as the type relation, with the default being rdf:type - but that is probably overkill.
>> >>
>> >> - There is a bug in the stats script or file, I guess because of some overflow? The count value is
>> >>
>> >>   (count -1983667112)
>> >>
>> >>   which indicates this. I opened a ticket: https://issues.apache.org/jira/browse/JENA-2225
>> >>
>> >> [1] https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
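
For anyone who has not generated TDB statistics before: the "default stats" referred to above come from the tdbstats tool. A rough sketch for a TDB2 database follows; the database path is a placeholder, and the detail that the resulting stats.opt belongs in the database's Data-0001 directory is from my reading of the TDB2 docs, so double-check it for your Jena version:

    tdb2.tdbstats --loc /data/wikidata-tdb2 > stats.opt
    cp stats.opt /data/wikidata-tdb2/Data-0001/

The hand-written per-class rules for wdt:P31 would then go into that same file, ahead of the generic (<http://www.wikidata.org/prop/direct/P31> 98152611) entry, as described above.
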
>> >> On 18.12.21 11:35, Marco Neumann wrote:
>> >>> Good morning Lorenz,
>> >>>
>> >>> Maybe time to get a few query benchmark tests? :)
>> >>>
>> >>> What does tdb2.tdbstats report?
>> >>>
>> >>> Marco
>> >>>
>> >>> On Sat, Dec 18, 2021 at 8:09 AM LB <[email protected]> wrote:
>> >>>
>> >>>> Good morning,
>> >>>>
>> >>>> Loading of Wikidata truthy is done, and this time I didn't forget to keep the logs: https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>> >>>>
>> >>>> I'm a bit surprised that this time it was 8h faster than last time, 31h vs 39h. Not sure if a) there was something else running on the server last time (at least I couldn't see any running tasks) or b) this is a consequence of the more parallelized Unix sort now - I set it to --parallel=16.
>> >>>>
>> >>>> I mean, the piped input stream is single-threaded I guess, but maybe the sort merge step can benefit from more threads? I guess I have to clean up everything and run it again with the original setup with 2 Unix sort threads ...
>> >>>>
>> >>>> On 16.12.21 14:48, Andy Seaborne wrote:
>> >>>>>
>> >>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
>> >>>>> ...
>> >>>>>
>> >>>>>> I am getting a slowdown during data ingestion. However, your summary figures don't show that in the ingest phase. The whole logs may have the signal in them, but less pronounced.
>> >>>>>>
>> >>>>>> My working assumption is now that it is random access to the node table. Your results point to it not being a CPU issue but to my setup saturating the I/O path. While the portable has an NVMe SSD, it has probably not got the same I/O bandwidth as a server-class machine.
>> >>>>>>
>> >>>>>> I'm not sure what to do about this other than run with a much bigger node table cache for the ingestion phase. Substituting some file-mapped file area for a bigger cache should be a win. While I hadn't noticed it before, it is probably visible in logs of smaller loads on closer inspection. Experimenting on a small dataset is a lot easier.
>> >>>>>
>> >>>>> I'm more sure of this - not yet definite.
>> >>>>>
>> >>>>> The nodeToNodeId cache is 200k -- this is on the load/update path. Seems rather small for the task.
>> >>>>>
>> >>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL results.
>> >>>>>
>> >>>>> 2 pieces of data will help:
>> >>>>>
>> >>>>> Experimenting with very small cache settings.
>> >>>>>
>> >>>>> Letting my slow load keep going to see if there are the same characteristics at the index stage. There shouldn't be if nodeToNodeId is the cause; it's only an influence in the data ingestion step.
>> >>>>>
>> >>>>> Aside: increasing nodeToNodeId could also help tdbloader=parallel and maybe loader=phased. It falls into the same situation, although the improvement there is going to be less marked. "Parallel" saturates the I/O by other means as well.
>> >>>>>
>> >>>>> Andy
>
> --
> ---
> Marco Neumann
> KONA

--
---
Marco Neumann
KONA
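
Picking up the query benchmark idea from earlier in the thread: once a load has finished, a first smoke test can be run straight from the command line with tdb2.tdbquery. This is only a sketch - the database path is a placeholder and counting a single P31 class is not meant as a serious benchmark:

    tdb2.tdbquery --loc /data/wikidata-tdb2 'SELECT (COUNT(*) AS ?n) WHERE { ?s <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> }'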
