Ok, here is another successful TDB2 load, this time with the full Wikidata
download file (20211222_latest-all.nt.gz, 172G).

It counted 16,733,395,878 triples, with a total of "103h 45m 15s" for the
entire load.

I think that, with the right hardware, this time could easily be compressed
quite a bit.

http://www.lotico.com/temp/LOG-45497
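
For reference, a minimal sketch of the kind of tdb2.xloader invocation behind
a load like this (the paths are placeholders, and the exact options - such as
a thread-count setting - vary by Jena version):

  # assumed locations - adjust to the actual environment
  tdb2.xloader --loc /data/wikidata-db --tmpdir /data/tmp \
      20211222_latest-all.nt.gz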

Best,
Marco


On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <[email protected]>
wrote:

> Thank you, Andy. I found it in the revisions somewhere.
>
> I just finished another run with truthy:
>
> http://lotico.com/temp/LOG-1214
>
> I will now increase the RAM before running an additional load with an
> increased thread count.
>
> Marco
>
> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <[email protected]> wrote:
>
>> Gists are git repos, so the file is there ... somewhere:
>>
>>
>> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>>
>>      Andy
>>
>> On 19/12/2021 17:56, Marco Neumann wrote:
>> > Thank you Lorenz,
>> > unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on
>> > GitHub.
>> >
>> >
>> > On Sun, Dec 19, 2021 at 9:46 AM LB <[email protected]> wrote:
>> >
>> >> I edited the Gist [1] and put the default stats there. Takes ~4min to
>> >> compute the stats.
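>> >>
>> >> (As a hedged sketch of that step - the database location is a
>> >> placeholder, and placing the generated stats.opt in the database's
>> >> Data-0001 directory is my understanding for TDB2, to be verified:)
>> >>
>> >>   tdb2.tdbstats --loc /data/wikidata-truthy > stats.opt
>> >>   cp stats.opt /data/wikidata-truthy/Data-0001/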
>> >>
>> >> Findings:
>> >>
>> >> - For Wikidata we have to extend those stats with per-class stats for
>> >> the wdt:P31 property, as Wikidata uses this property as its own
>> >> rdf:type relation. It is indeed trivial: just execute
>> >>
>> >> select ?c (count(*) as ?cnt)
>> >> { ?s <http://www.wikidata.org/prop/direct/P31> ?c }
>> >> group by ?c
>> >>
>> >> convert the result into the stats rule language (SSE), and put those
>> >> rules before the more generic rule
>> >>
>> >> (<http://www.wikidata.org/prop/direct/P31> 98152611)
>> >>
>> >> (a command-line sketch follows after this list)
>> >>
>> >> - I didn't want to touch the stats script itself, but we could, for
>> >> example, make this type relation generic and allow others like wdt:P31
>> >> or skos:subject via a command-line option that provides any URI as the
>> >> type relation, with the default being rdf:type - but that is probably
>> >> overkill.
>> >>
>> >> - There is a bug in the stats script or file, I guess because of some
>> >> overflow: the count value is
>> >>
>> >> (count -1983667112))
>> >>
>> >> which indicates this. I opened a ticket:
>> >> https://issues.apache.org/jira/browse/JENA-2225
>> >>
>> >>
>> >> [1]
>> >> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
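>> >>
>> >> (A command-line sketch of the above - the database path is a
>> >> placeholder, and the rule form shown is only a guess by analogy with
>> >> the rdf:type class rules:)
>> >>
>> >>   # per-class counts for wdt:P31 from the loaded TDB2 database
>> >>   tdb2.tdbquery --loc /data/wikidata-truthy --results=TSV \
>> >>     'SELECT ?c (COUNT(*) AS ?cnt)
>> >>      { ?s <http://www.wikidata.org/prop/direct/P31> ?c }
>> >>      GROUP BY ?c' > p31-class-counts.tsv
>> >>
>> >>   # each row would then become a rule placed before the generic P31
>> >>   # rule, e.g.
>> >>   #   ((<http://www.wikidata.org/prop/direct/P31> <class-IRI>) <count>)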
>> >>
>> >> On 18.12.21 11:35, Marco Neumann wrote:
>> >>> good morning Lorenz,
>> >>>
>> >>> Maybe it's time to run a few query benchmark tests? :)
>> >>>
>> >>> What does tdb2.tdbstats report?
>> >>>
>> >>> Marco
>> >>>
>> >>>
>> >>> On Sat, Dec 18, 2021 at 8:09 AM LB <[email protected]> wrote:
>> >>>
>> >>>> Good morning,
>> >>>>
>> >>>> Loading of Wikidata truthy is done; this time I didn't forget to keep
>> >>>> the logs:
>> >>>>
>> >>>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>> >>>>
>> >>>> I'm a bit surprised that this time it was 8h faster than last time,
>> >>>> 31h vs 39h. I'm not sure whether a) there was something else running
>> >>>> on the server last time (at least I couldn't see any running tasks),
>> >>>> or b) this is a consequence of the more parallelized Unix sort now -
>> >>>> I set it to --parallel=16.
>> >>>>
>> >>>> I mean, the piped input stream is single-threaded, I guess, but maybe
>> >>>> the sort's merge step can benefit from more threads? I suppose I have
>> >>>> to clean everything up and run it again with the original setup of 2
>> >>>> Unix sort threads ...
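>> >>>>
>> >>>> (For illustration only - the real sort invocation lives inside the
>> >>>> tdb2.xloader scripts; these are just the standard GNU sort knobs that
>> >>>> seem relevant here, with made-up values:)
>> >>>>
>> >>>>   ... | sort --parallel=16 --buffer-size=8G \
>> >>>>              --temporary-directory=/data/tmp \
>> >>>>              --compress-program=gzip | ...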
>> >>>>
>> >>>>
>> >>>> On 16.12.21 14:48, Andy Seaborne wrote:
>> >>>>>
>> >>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
>> >>>>> ...
>> >>>>>
>> >>>>>> I am getting a slowdown during data ingestion. However, your summary
>> >>>>>> figures don't show that in the ingest phase. The whole logs may have
>> >>>>>> the signal in them, but less pronounced.
>> >>>>>>
>> >>>>>> My working assumption is now that it is random access to the node
>> >>>>>> table. Your results point to it not being a CPU issue but rather to
>> >>>>>> my setup saturating the I/O path. While the portable has an NVMe SSD,
>> >>>>>> it has probably not got the same I/O bandwidth as a server-class
>> >>>>>> machine.
>> >>>>>>
>> >>>>>> I'm not sure what to do about this other than run with a much bigger
>> >>>>>> node table cache for the ingestion phase. Substituting some of the
>> >>>>>> memory-mapped file area for a bigger cache should be a win. While I
>> >>>>>> hadn't noticed it before, it is probably visible in logs of smaller
>> >>>>>> loads on closer inspection. Experimenting on a small dataset is a
>> >>>>>> lot easier.
>> >>>>> I'm more sure of this - not yet definite.
>> >>>>>
>> >>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
>> >>>>> Seems rather small for the task.
>> >>>>>
>> >>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
>> >>>>> SPARQL results.
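>> >>>>>
>> >>>>> (For anyone wanting to experiment: a sketch of overriding those two
>> >>>>> cache sizes via a store-parameters file. The JSON key names follow
>> >>>>> the TDB StoreParams codec as far as I recall; whether TDB2 picks up a
>> >>>>> partial tdb.cfg from the Data-0001 directory is an assumption to
>> >>>>> verify, and the values are only examples:)
>> >>>>>
>> >>>>>   cat > /data/DB/Data-0001/tdb.cfg <<'EOF'
>> >>>>>   {
>> >>>>>     "tdb.node2nodeid_cache_size": 2000000,
>> >>>>>     "tdb.nodeid2node_cache_size": 5000000
>> >>>>>   }
>> >>>>>   EOF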
>> >>>>>
>> >>>>> 2 pieces of data will help:
>> >>>>>
>> >>>>> Experimenting with very small cache settings.
>> >>>>>
>> >>>>> Letting my slow load keep going, to see if the same characteristics
>> >>>>> appear at the index stage. They shouldn't if nodeToNodeId is the
>> >>>>> cause; it only has an influence in the data ingestion step.
>> >>>>>
>> >>>>> Aside: increasing nodeToNodeId could also help tdbloader=parallel and
>> >>>>> maybe loader=phased. They fall into the same situation, although the
>> >>>>> improvement there is going to be less marked - "parallel" saturates
>> >>>>> the I/O by other means as well.
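>> >>>>>
>> >>>>> (For reference, those loaders are selected on the standard TDB2 bulk
>> >>>>> loader like this; the database path and input file are placeholders:)
>> >>>>>
>> >>>>>   tdb2.tdbloader --loader=parallel --loc /data/DB latest-truthy.nt.gz
>> >>>>>   tdb2.tdbloader --loader=phased   --loc /data/DB latest-truthy.nt.gz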
>> >>>>>
>> >>>>>       Andy
>> >>>
>> >>
>> >
>> >
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>

-- 


---
Marco Neumann
KONA
