Øyvind, looks like the above was the wrong log from a prior sharding
experiment.

This is the correct log file for the truthy dataset.

http://www.lotico.com/temp/LOG-98085



On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <marco.neum...@gmail.com>
wrote:

> Thank you Øyvind for sharing, great to see more tests in the wild.
>
> I ran the test with a 1TB SSD / RAID1 / 64GB / Ubuntu and the truthy
> dataset and quickly ran out of disk space. The job finished but did not
> write any of the indexes to disk due to lack of space, and it reported no
> error messages.
>
> http://www.lotico.com/temp/LOG-95239
>
> I have now ordered a new 4TB SSD drive to rerun the test, possibly with the
> full Wikidata dataset.
>
> Personally, I have had the best experience with dedicated hardware so far
> (it can be in the data center); shared or dedicated virtual compute engines
> did not deliver as expected. I have also not seen great benefits from data
> center grade multicore CPUs, but I think they will help at runtime in
> multi-user settings (e.g. Fuseki).
>
> Best,
> Marco
>
> On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oyvin...@gmail.com> wrote:
>
>> I'm trying out tdb2.xloader on an OpenStack VM, loading the Wikidata
>> truthy dump downloaded 2021-12-09.
>>
>> The instance is a VM created on the Norwegian Research and Education
>> Cloud, an OpenStack cloud provider.
>>
>> Instance type:
>> 32 GB memory
>> 4 CPU
>>
>> The storage used for the dump and temp files is a separate 900GB volume
>> mounted on /var/fuseki/databases, formatted with ext4. The type of storage
>> is described as
>> >  *mass-storage-default*: Storage backed by spinning hard drives,
>> available to everybody and is the default type.
>> At the moment I don't have access to the faster volume type
>> mass-storage-ssd. CPU and memory are not dedicated, and can be
>> overcommitted.
>>
>> The OS for the instance is a clean Rocky Linux image, with no services
>> except Jena/Fuseki installed. The systemd service set up for Fuseki is
>> stopped. The Jena and Fuseki version is 4.3.0.
>>
>> openjdk 11.0.13 2021-10-19 LTS
>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>>
>> I'm running from a tmux session to avoid connectivity issues and to
>> capture the output. I think the output is stored in memory and not on disk.
>> On the first run I tried to put the tmpdir on the root partition, to
>> separate the temp dir from the data dir, but with only 19 GB free the
>> tmpdir soon filled the disk. For the second (current) run all directories
>> are under /var/fuseki/databases.
>>
>>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy \
>>    --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz
>>
>> The import is so far at the "ingest data" stage where it has really slowed
>> down.
>>
>> Current output is:
>>
>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
>>
>> See full log so far at
>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>
>> Some notes:
>>
>> * There is a gap in the output log between the end of 'parse' and the
>> start of 'index' for Terms. It is unclear to me what is happening in the
>> roughly 1 hour 12 minutes between these lines:
>>
>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
>>
>> * The import really slows down at the "ingest data" stage: at the current
>> rate, if I calculated correctly, it looks like PKG.CmdxIngestData has
>> 10 days left before it finishes.
>>
>> * When I saw sort running in the background during the first parts of the
>> job, I looked at the `sort` command. Several online sources say that
>> setting the environment variable LC_ALL=C improves speed for `sort`. Could
>> this be set on the ProcessBuilder for the `sort` process? Could it
>> break/change something? I see this warning in the man page for `sort`:
>>
>>        *** WARNING *** The locale specified by the environment affects
>>        sort order.  Set LC_ALL=C to get the traditional sort order that
>>        uses native byte values.
>>
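
For reference, the ingest estimate in the notes above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes the "Add: 502,000,000 Data" counter counts the same triples as the 6,560,468,631 reported at the end of the parse, and that the rate stays constant, which the log suggests it does not. The function name eta_days is made up for illustration.

```python
SECONDS_PER_DAY = 86_400

def eta_days(total, done, rate_per_sec):
    """Days left to process (total - done) items at a constant rate."""
    remaining = total - done
    return remaining / rate_per_sec / SECONDS_PER_DAY

# Figures copied from the log lines quoted in this thread.
TOTAL_TRIPLES = 6_560_468_631   # "== Parse: ... 6,560,468,631 triples/quads"
DONE_TRIPLES = 502_000_000      # "Add: 502,000,000 Data"
BATCH_RATE = 3_356              # "Batch: 3,356" (most recent rate)
AVG_RATE = 7_593                # "Avg: 7,593" (rate since the stage began)

if __name__ == "__main__":
    for label, rate in (("batch", BATCH_RATE), ("average", AVG_RATE)):
        days = eta_days(TOTAL_TRIPLES, DONE_TRIPLES, rate)
        print(f"{label} rate: {days:.1f} days left")
```

At the average rate this gives about 9.2 days, in line with the 10 days mentioned above; at the most recent batch rate it is closer to 20.9 days, so the estimate gets worse if the slowdown continues.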
>> Links:
>> https://access.redhat.com/solutions/445233
>>
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>
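
To make the LC_ALL question concrete, here is a minimal sketch of passing LC_ALL=C to a child `sort` process only, leaving the parent environment untouched. It is a Python stand-in and not how xloader launches `sort`; in Java the same idea would go through the map returned by ProcessBuilder.environment(). The helper name sort_lines is hypothetical, and the sketch assumes a `sort` binary on the PATH.

```python
import os
import subprocess

def sort_lines(lines, locale="C"):
    """Sort lines with the external `sort`, overriding LC_ALL for the child only."""
    # Copy the parent environment and override LC_ALL for this one process.
    env = dict(os.environ, LC_ALL=locale)
    result = subprocess.run(
        ["sort"],
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        env=env,
        check=True,  # raise if sort exits with a non-zero status
    )
    return result.stdout.splitlines()

if __name__ == "__main__":
    # With LC_ALL=C, sort compares raw byte values, so uppercase ASCII
    # letters come before lowercase ones.
    print(sort_lines(["b", "A", "a", "B"]))
```

With LC_ALL=C the order is by byte value, as the man page warning says, so the output can differ from a locale-aware sort. Whether a later step depends on one particular order is exactly the open question above.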
>> Best regards,
>> Øyvind
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>

-- 


---
Marco Neumann
KONA
