Hi,

Andy Seaborne writes:

> On 23/02/2021 17:55, Daniel Hernandez wrote:
>> Hi,
>>
>>>> The disk where I was loading the data was a local rotating disk of
>>>> 7200 rpm. The machine also has an SSD, but it is too small to do the
>>>> experiment.
>>>
>>> tdbloader2 may be the right choice for that setup - it was written
>>> with disks in mind. It uses Unix sort(1). What it needs is to tune the
>>> parameters to the runs of "sort"
>> Thanks, this information is very useful.
>>
>>> Wolfgang Fahl has loaded large (several billion triples)
>>>
>>> https://issues.apache.org/jira/browse/JENA-1909
>>>
>>> and his notes are at:
>>>
>>> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
>> I have also loaded Wikidata in a very small virtual machine with a
>> single core and a rotating non-local disk. I remember it took more
>> than a week. I did not save the log, because the machine was running
>> other jobs at the same time. Next time I load a big dataset I will
>> share the machine specification and loading log.
>>
>>>> I wonder if it is better to load the data using a fast disk, a lot of
>>>> RAM, or a lot of cores.
>>>
>>> A few years ago, I ran load tests of two machines, one 32G+SATA SSD,
>>> one 16G+ 1TB M2 SSD. The 16G but faster SSD was quicker overall.
>> That is interesting. I am considering getting a machine with an NVMe
>> SSD for the next load.
>>
>>> Database directories can be copied across machines after they have
>>> been built.
>> tdbloader2 generates some files with the tmp extension. The file
>> data-triples.tmp can be very big. The name suggests that it is a
>> temporary file. Can I delete that file after the loading ends?
>
> Yes.
>
> The files are the triples ids from the parse/load nodes stage.
>
> Then comes the indexing which is multiple passes over the tmp files,
> once per index to sort, using an external sort (in both senses!
> external program and external to disk), then build the indexes in a
> single pass per index.
>
> This is reusing the external sort capability of sorting data much
> larger than RAM. sort(1) needs
>
> I found a previous load script (when wikidata was 2.2 B IIRC)
>
> Setting SORT_ARGS
>
> ------------------
> #!/bin/bash
>
> echo "== $(date)"
>
> export TOOL_DIR="$PWD"
> export JENA_HOME="$HOME/jlib/apache-jena-3.5.0"
> export JVM_ARGS=""
> export GZIP="--fast"
> #export SORT_ARGS="--parallel=2 --compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=75%"
>
> export SORT_ARGS="--temporary-directory=$PWD/tmp"
>
> ## -k : keep work files.
>
> # Logger:org.apache.jena.riot
>
> PHASE="--phase index"
> ARGS="--keep-work $PHASE --loc db2-all"
>
> tdbloader2 $ARGS "$@"
> echo "== $(date)"
> ------------------
> IIRC not all sort(1) had "--parallel" back then.
>
>
> I also found a script replacement for the "sort" command in the scripts:
> ------------------
> #!/bin/bash
> # Special.
> ## mysort $KEYS "$DATA" "$WORK"
>
> KEYS="$1"
> DATA="$2"
> WORK="$3"
>
> SORT_ARGS="--compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=80%"
> gzip -d < "$DATA.gz" | sort $SORT_ARGS -u $KEYS > "$WORK"
> ------------------
>
> HTH
> Andy

Thanks Andy, you helped me a lot!

Best,
Daniel
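
P.S. For anyone who wants to try the sort options quoted above on a small
file before starting a long load, here is a minimal sketch. It assumes GNU
sort. The scratch directory, the file names and the sample rows are made up
for illustration; the rows only stand in for the tmp files, whose real
format I have not checked.

```shell
#!/bin/bash
# Sketch: exercise the sort(1) options from the thread on a tiny sample.
# WORK_DIR, sample.txt and sorted.txt are hypothetical names.

export LC_ALL=C                      # byte-wise ordering, reproducible output

WORK_DIR="$(mktemp -d)"              # scratch area for this test only
mkdir -p "$WORK_DIR/tmp"             # where sort may spill temporary runs

# Four rows of made-up ids, one of them a duplicate.
printf '%s\n' '3 1 2' '1 2 3' '3 1 2' '2 3 1' > "$WORK_DIR/sample.txt"

# Same kind of options as in the scripts above, with a small buffer.
SORT_ARGS="--temporary-directory=$WORK_DIR/tmp --buffer-size=1M --compress-program=gzip"

# -u drops duplicate rows, as in the "mysort" replacement script.
sort $SORT_ARGS -u "$WORK_DIR/sample.txt" > "$WORK_DIR/sorted.txt"

cat "$WORK_DIR/sorted.txt"
```

On a real load the temporary directory should point at the disk with the
most free space, since that is where sort writes its intermediate runs.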
