On 16/12/2021 10:52, Andy Seaborne wrote:
...
I am getting a slowdown during data ingestion. However, your summary
figures don't show that in the ingest phase. The whole logs may have the
signal in them, just less pronounced.
My working assumption is now that it is random access to the node table.
Your results point to it not being a CPU issue but to my setup
saturating the I/O path. While the portable has an NVMe SSD, it
probably does not have the same I/O bandwidth as a server-class machine.
I'm not sure what to do about this other than run with a much bigger
node table cache for the ingestion phase. Trading some memory-mapped
file space for a bigger cache should be a win. While I hadn't noticed
it before, it is probably visible in logs of smaller loads on closer
inspection. Experimenting on a small dataset is a lot easier.
I'm more sure of this - not yet definite.
The nodeToNodeId cache is 200k -- this is on the load/update path. Seems
rather small for the task.
The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
results.
Two pieces of data will help:

- Experimenting with very small cache settings.
- Letting my slow load keep going to see whether the same
characteristics show up at the index stage. They shouldn't if
nodeToNodeId is the cause; it only has an influence in the data
ingestion step.
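The cache-size intuition above can be sketched with a small simulation
(not Jena code; all names and the workload are made up for
illustration): replay a skewed, Zipf-like pattern of node lookups -- as
IRIs tend to recur during RDF ingestion -- through an LRU cache, and
compare hit rates at two capacities. Every miss stands in for a random
read against the on-disk node table.

```python
# Hypothetical sketch: why a larger nodeToNodeId-style LRU cache
# cuts random node-table reads under skewed node reuse.
from collections import OrderedDict
import random

def lru_hit_rate(accesses, capacity):
    """Replay `accesses` through an LRU cache of `capacity` entries."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)       # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)

random.seed(42)
# Synthetic workload: 200,000 lookups over 50,000 distinct "nodes",
# skewed so a minority of nodes are reused heavily.
nodes = list(range(50_000))
weights = [1.0 / (i + 1) for i in nodes]   # Zipf-like popularity
workload = random.choices(nodes, weights=weights, k=200_000)

small = lru_hit_rate(workload, 2_000)    # stand-in for a "too small" cache
large = lru_hit_rate(workload, 20_000)   # 10x larger
print(f"small cache hit rate: {small:.2%}")
print(f"large cache hit rate: {large:.2%}")
```

The absolute numbers are meaningless; the point is that with a skewed
reuse distribution the hit rate keeps improving as capacity grows, so
every extra cache entry directly removes random I/O from the ingest
path.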
Aside: Increasing nodeToNodeId could also help tdbloader=parallel and
maybe loader=phased. It falls into the same situation, although the
improvement there is going to be less marked. "Parallel" saturates the
I/O by other means as well.
Andy