Hi,

Andy Seaborne writes:

> On 23/02/2021 17:55, Daniel Hernandez wrote:
>> Hi,
>>
>>>> The disk where I was loading the data was a local rotating disk of
>>>> 7200 rpm. The machine has also an SSD but is too small to do the
>>>> experiment.
>>>
>>> tdbloader2 may be the right choice for that setup - it was written
>>> with disks in mind. It uses Unix sort(1). What it needs is tuning of
>>> the parameters passed to the runs of "sort".
>> Thanks, this information is very useful.
>>
>>> Wolfgang Fahl has loaded large datasets (several billion triples):
>>>
>>> https://issues.apache.org/jira/browse/JENA-1909
>>>
>>> and his notes are at:
>>>
>>>   http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
>> I have also loaded Wikidata in a very small virtual machine with a
>> single core and a rotating non-local disk. I remember it took more
>> than a week. I did not save the log, because the machine was running
>> other jobs at the same time. Next time I load a big dataset I will
>> share the machine specification and the loading log.
>>
>>>> I wonder if it is better to load the data using a fast disk, a lot of
>>>> RAM, or a lot of cores.
>>>
>>> A few years ago, I ran load tests of two machines, one 32G+SATA SSD,
>>> one 16G+ 1TB M2 SSD.  The 16G but faster SSD was quicker overall.
>> That is interesting.  I am considering getting a machine with an
>> NVMe SSD for the next load.
>>
>>> Database directories can be copied across machines after they have
>>> been built.
>> tdbloader2 generates some files with the tmp extension. The file
>> data-triples.tmp can be very big. The name suggests that it is a
>> temporary file. Can I delete that file after the loading ends?
>
> Yes.
>
> The files are the triple ids from the parse/load nodes stage.
>
> Then comes the indexing, which is multiple passes over the tmp files,
> once per index to sort, using an external sort (in both senses:
> an external program and external to disk), and then builds the
> indexes in a single pass per index.
>
> This is reusing the external sort capability of sorting data much
> larger than RAM.  sort(1) needs
>
> I found a previous load script (when wikidata was 2.2 B IIRC)
>
> Setting SORT_ARGS
>
> ------------------
> #!/bin/bash
>
> echo "== $(date)"
>
> export TOOL_DIR="$PWD"
> export JENA_HOME="$HOME/jlib/apache-jena-3.5.0"
> export JVM_ARGS=""
> export GZIP="--fast"
> #export SORT_ARGS="--parallel=2 --compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=75%"
>
> export SORT_ARGS="--temporary-directory=$PWD/tmp"
>
> ## -k : keep work files.
>
> # Logger:org.apache.jena.riot
>
> PHASE="--phase index"
> ARGS="--keep-work $PHASE --loc db2-all"
>
> tdbloader2 $ARGS "$@"
> echo "== $(date)"
> ------------------
> IIRC not all sort(1) had "--parallel" back then.
>
>
> I also found a script replacement for the "sort" command in the scripts:
> ------------------
> #!/bin/bash
> # Special.
> ## mysort $KEYS "$DATA" "$WORK"
>
> KEYS="$1"
> DATA="$2"
> WORK="$3"
>
> SORT_ARGS="--compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=80%"
> gzip -d < "$DATA.gz" | sort $SORT_ARGS -u $KEYS > "$WORK"
> ------------------
>
>       HTH
>       Andy

Thanks Andy, you helped me a lot!
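
For the archive, here is how I understand the indexing step you describe:
one sort(1) run per index over the same tmp file, each with its own key
order, with sort spilling to a temporary directory when the data does not
fit in its buffer. This is only a minimal sketch, not the actual tdbloader2
script. It assumes GNU sort (for the long options), and the three-column
row layout, the sample ids, the file names SPO.sorted and POS.sorted and
the 1M buffer are hypothetical, chosen only for illustration.

```shell
#!/bin/bash
# Sketch of "one external sort per index" (assumes GNU coreutils sort).
set -euo pipefail
export LC_ALL=C                       # byte-wise, locale-independent ordering

WORK_DIR="$(mktemp -d)"               # hypothetical scratch location
mkdir -p "$WORK_DIR/tmp"
DATA="$WORK_DIR/data-triples.tmp"

# Hypothetical sample rows: subject, predicate, object ids (one duplicate).
printf '%s\n' '3 1 2' '1 2 3' '2 3 1' '1 2 3' > "$DATA"

# Small buffer so that, on real data, sort spills runs to the temp directory.
SORT_ARGS="--temporary-directory=$WORK_DIR/tmp --buffer-size=1M"

# One pass per index: same input, different key order, duplicates dropped.
# $SORT_ARGS is left unquoted on purpose so it splits into separate options.
sort $SORT_ARGS -t ' ' -u -k1,1 -k2,2 -k3,3 "$DATA" > "$WORK_DIR/SPO.sorted"
sort $SORT_ARGS -t ' ' -u -k2,2 -k3,3 -k1,1 "$DATA" > "$WORK_DIR/POS.sorted"

echo "SPO order:"; cat "$WORK_DIR/SPO.sorted"
echo "POS order:"; cat "$WORK_DIR/POS.sorted"
```

On real data the buffer would be much larger (your scripts use 75% or 80%),
and --compress-program can be added to trade CPU for less temporary disk
traffic, which seems worthwhile on a rotating disk.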

Best,
Daniel
