Re: Testing tdb2.xloader

LB Sat, 18 Dec 2021 00:09:50 -0800

Good morning,

loading of Wikidata truthy is done, this time I didn't forget to keep logs: https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3

I'm a bit surprised that this time it was 8h faster than last time, 31h vs 39h. Not sure if a) there was something else running on the server last time (at least I couldn't see any running tasks) or b) this is a consequence of the more parallelized Unix sort now - I set it to --parallel=16

I mean, the piped input stream is single threaded I guess, but maybe the sort merge step can benefit from more threads? I guess I have to clean up everything and run it again with the original setup with 2 Unix sort threads ...
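
One way to check hypothesis b) in isolation, without repeating a 30h load, might be to time GNU sort on a sample file with different --parallel values. A minimal sketch, assuming GNU coreutils sort is on the PATH; the helper names, the buffer size and the sample file are hypothetical, and this is not how xloader itself invokes sort:

```python
import subprocess
import time


def build_sort_cmd(path, parallel, buffer_size="1G"):
    """Build a GNU sort command line (hypothetical helper, not xloader's own call)."""
    return [
        "sort",
        f"--parallel={parallel}",        # number of sorts run concurrently
        f"--buffer-size={buffer_size}",  # main-memory buffer used by sort
        "--output=/dev/null",            # discard the result; only timing matters
        path,
    ]


def time_sort(path, parallel, buffer_size="1G"):
    """Return the wall-clock seconds taken to sort `path` with `parallel` threads."""
    start = time.perf_counter()
    subprocess.run(build_sort_cmd(path, parallel, buffer_size), check=True)
    return time.perf_counter() - start
```

Calling time_sort("sample.txt", 2) and time_sort("sample.txt", 16) on the same file gives a rough comparison. Sorting a file is not the same as sorting the piped stream of the real load, so this only hints at how much the merge step gains from more threads.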



On 16.12.21 14:48, Andy Seaborne wrote:

On 16/12/2021 10:52, Andy Seaborne wrote:
...
I am getting a slow down during data ingestion. However, your summary figures don't show that in the ingest phase. The whole logs may have the signal in it but less pronounced.
My working assumption is now that it is random access to the node table. Your results point to it not being a CPU issue but that my setup is saturating the I/O path. While the portable has a NVMe SSD, it has probably not got the same I/O bandwidth as a server class machine.
I'm not sure what to do about this other than run with a much bigger node table cache for the ingestion phase. Substituting some file mapper file area for bigger cache should be a win. While I hadn't noticed before, it is probably visible in logs of smaller loads on closer inspection. Experimenting on a small dataset is a lot easier.
I'm more sure of this - not yet definite.
The nodeToNodeId cache is 200k -- this is on the load/update path. Seems rather small for the task.
The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL results.
2 pieces of data will help:

Experimenting with very small cache settings.
Letting my slow load keep going to see if there is the same characteristics at the index stage. There shouldn't be if nodeToNodeId is the cause; it's only an influence in the data ingestion step.
Aside : Increasing nodeToNodeId could also help tdbloader=parallel and maybe loader=phased. It falls into the same situation although the improvement there is going to be less marked. "Parallel" saturates the I/O by other means as well.
    Andy
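
The cache-size concern in the quoted message (a 200k-entry nodeToNodeId cache on the load path) can be illustrated with a toy model: replay a stream of node lookups through an LRU cache and count how many would have to fall through to the node table. This is a minimal sketch under stated assumptions. It is not Jena's cache implementation, the eviction policy and the skewed access pattern are guesses, and the sizes are scaled down so it runs quickly:

```python
import random
from collections import OrderedDict


def lru_hit_rate(cache_size, accesses):
    """Replay `accesses` through an LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in accesses:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / total if total else 0.0


def skewed_accesses(n_keys, n_accesses, seed=0):
    """Synthetic key stream (assumption: a few nodes recur far more than the rest)."""
    rng = random.Random(seed)
    # paretovariate returns a heavy-tailed value >= 1; map it onto a key id.
    return [
        min(n_keys - 1, int(rng.paretovariate(1.0)) - 1)
        for _ in range(n_accesses)
    ]
```

Comparing lru_hit_rate(200, stream) with lru_hit_rate(20_000, stream) on the same stream from skewed_accesses shows how many more lookups the larger cache absorbs. Every miss stands in for a random access to the node table, which is the I/O the message suspects of being saturated.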
