Does 4.3.1 already contain the mitigation for the Log4j2 vulnerability?

On Sun, Dec 12, 2021 at 1:24 PM Marco Neumann <marco.neum...@gmail.com> wrote:
> As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB
> SSD drive and an old laptop.
>
> I also have a contact who has just set up a new datacenter in Ireland. I
> may be able to run a few tests on much bigger machines as well. Otherwise I
> am very happy with the iron in Finland, as long as they are dedicated
> machines.
>
> On Sun, Dec 12, 2021 at 12:44 PM Andy Seaborne <a...@apache.org> wrote:
>
>> On 11/12/2021 22:02, Marco Neumann wrote:
>> > Thank you Øyvind for sharing, great to see more tests in the wild.
>> >
>> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
>> > dataset and quickly ran out of disk space. It finished the job but did
>> > not write any of the indexes to disk due to lack of space. No error
>> > messages.
>>
>> The 4.3.1 xloader should hopefully address the space issue.
>>
>> Andy
>>
>> > http://www.lotico.com/temp/LOG-95239
>> >
>> > I have now ordered a new 4TB SSD drive to rerun the test, possibly with
>> > the full wikidata dataset.
>> >
>> > I personally had the best experience with dedicated hardware so far
>> > (can be in the data center); shared or dedicated virtual compute
>> > engines did not deliver as expected. And I have not seen great benefits
>> > from data-center-grade multicore CPUs. But I think they will during
>> > runtime in multi-user settings (e.g. Fuseki).
>> >
>> > Best,
>> > Marco
>> >
>> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oyvin...@gmail.com> wrote:
>> >
>> >> I'm trying out tdb2.xloader on an OpenStack VM, loading the wikidata
>> >> truthy dump downloaded 2021-12-09.
>> >>
>> >> The instance is a VM created on the Norwegian Research and Education
>> >> Cloud, an OpenStack cloud provider.
>> >>
>> >> Instance type:
>> >> 32 GB memory
>> >> 4 CPU
>> >>
>> >> The storage used for dump + temp files is mounted as a separate 900GB
>> >> volume on /var/fuseki/databases, with ext4 configured. The type of
>> >> storage is described as
>> >>> *mass-storage-default*: Storage backed by spinning hard drives,
>> >>> available to everybody and is the default type.
>> >> At the moment I don't have access to the faster volume type
>> >> mass-storage-ssd. CPU and memory are not dedicated, and can be
>> >> overcommitted.
>> >>
>> >> The OS for the instance is a clean Rocky Linux image, with no services
>> >> except Jena/Fuseki installed. The systemd service set up for Fuseki is
>> >> stopped. The Jena and Fuseki version is 4.3.0.
>> >>
>> >> openjdk 11.0.13 2021-10-19 LTS
>> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>> >>
>> >> I'm running from a tmux session to avoid connectivity issues and to
>> >> capture the output. I think the output is stored in memory and not on
>> >> disk.
>> >>
>> >> On the first run I tried to have the tmpdir on the root partition, to
>> >> separate the temp dir and the data dir, but with only 19 GB free, the
>> >> tmpdir soon filled the disk. For the second (current) run, all
>> >> directories are under /var/fuseki/databases.
>> >>
>> >> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>> >> --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz
>> >>
>> >> The import is so far at the "ingest data" stage, where it has really
>> >> slowed down.
>> >>
>> >> Current output is:
>> >>
>> >> 20:03:43 INFO  Data :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
>> >>
>> >> See the full log so far at
>> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>> >>
>> >> Some notes:
>> >>
>> >> * There is a (time/info) lapse in the output log between the end of
>> >> 'parse' and the start of 'index' for Terms. It is unclear to me what is
>> >> happening in the 1h13m between these lines:
>> >>
>> >> 22:33:46 INFO  Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
>> >> 22:33:52 INFO  Terms :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
>> >> 23:46:13 INFO  Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
>> >>
>> >> * The "ingest data" step really slows down: at the current rate, if I
>> >> calculated correctly, it looks like PKG.CmdxIngestData has 10 days left
>> >> before it finishes.
>> >>
>> >> * When I saw sort running in the background for the first parts of the
>> >> job, I looked at the `sort` command. I noticed from some online sources
>> >> that setting the environment variable LC_ALL=C improves speed for
>> >> `sort`. Could this be set on the ProcessBuilder for the `sort` process?
>> >> Could it break/change something? I see the warning from the man page
>> >> for `sort`:
>> >>
>> >>     *** WARNING *** The locale specified by the environment affects
>> >>     sort order. Set LC_ALL=C to get the traditional sort order that
>> >>     uses native byte values.
>> >>
>> >> Links:
>> >> https://access.redhat.com/solutions/445233
>> >> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>> >> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>> >>
>> >> Best regards,
>> >> Øyvind
>
> --
> ---
> Marco Neumann
> KONA

--
---
Marco Neumann
KONA
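For reference, the LC_ALL=C question from the thread can be set per child process without touching the parent JVM's locale. This is a minimal sketch (a hypothetical helper, not Jena's actual loader code) of launching `sort` from a ProcessBuilder with LC_ALL=C in the child's environment:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SortLauncher {
    // Start an external `sort` with LC_ALL=C in its environment.
    // With LC_ALL=C, sort compares raw byte values instead of applying
    // locale collation rules, which is typically much faster. The caveat
    // from the man page: the output order differs from locale-aware
    // order, so this is only safe if every consumer of the sorted output
    // also treats the lines as opaque byte strings.
    static Process startSort(String... sortArgs) throws IOException {
        List<String> cmd = new ArrayList<>();
        cmd.add("sort");
        cmd.addAll(List.of(sortArgs));
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("LC_ALL", "C"); // byte-value ordering, child only
        return pb.start();
    }
}
```

If both the producer and the consumer of the sorted data compare bytes rather than locale-collated strings, the setting changes only speed, not correctness; whether that holds for the xloader's intermediate files is for the Jena developers to confirm.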