Thank you, Lorenz. I am running this test again myself now with a larger disk. You may want to consider running a full load of Wikidata as well. The timing info and disk space figures you have should be sufficient.
Did we figure out a place to post the parser messages?

Marco

On Thu, Dec 16, 2021 at 10:01 AM LB <conpcompl...@googlemail.com.invalid> wrote:

> Sure
>
> wikidata-tdb/Data-0001:
>
> total 524G
> -rw-r--r-- 1   24 Dez 15 05:41 GOSP.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GPOS.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GPU.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GSPO.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.idn
> -rw-r--r-- 1    0 Dez 15 05:41 journal.jrnl
> -rw-r--r-- 1   24 Dez 15 05:41 nodes.bpt
> -rw-r--r-- 1  36G Dez 15 05:41 nodes.dat
> -rw-r--r-- 1   16 Dez 15 05:41 nodes-data.bdf
> -rw-r--r-- 1  44G Dez 15 05:41 nodes-data.obj
> -rw-r--r-- 1 312M Dez 15 05:41 nodes.idn
> -rw-r--r-- 1   24 Dez 15 05:41 OSP.bpt
> -rw-r--r-- 1 148G Dez 16 04:14 OSP.dat
> -rw-r--r-- 1   24 Dez 15 05:41 OSPG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.idn
> -rw-r--r-- 1 528M Dez 16 04:14 OSP.idn
> -rw-r--r-- 1   24 Dez 15 05:41 POS.bpt
> -rw-r--r-- 1 148G Dez 15 21:17 POS.dat
> -rw-r--r-- 1   24 Dez 15 05:41 POSG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.idn
> -rw-r--r-- 1 528M Dez 15 21:17 POS.idn
> -rw-r--r-- 1   24 Dez 15 05:41 prefixes.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.dat
> -rw-r--r-- 1   16 Dez 15 05:41 prefixes-data.bdf
> -rw-r--r-- 1    0 Dez 14 12:21 prefixes-data.obj
> -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.idn
> -rw-r--r-- 1   24 Dez 15 05:41 SPO.bpt
> -rw-r--r-- 1 148G Dez 15 11:25 SPO.dat
> -rw-r--r-- 1   24 Dez 15 05:41 SPOG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.idn
> -rw-r--r-- 1 528M Dez 15 11:25 SPO.idn
> -rw-r--r-- 1    8 Dez 15 21:17 tdb.lock
>
> On 16.12.21 10:27, Marco Neumann wrote:
> > Thank you Lorenz, can you please post a directory listing for Data-0001
> > with file sizes.
> >
> > On Thu, Dec 16, 2021 at 8:49 AM LB <conpcompl...@googlemail.com.invalid>
> > wrote:
> >
> >> Loading of the latest WD truthy dump (6.6 billion triples), Bzip2
> >> compressed:
> >>
> >> Server:
> >>
> >> AMD Ryzen 9 5950X (16C/32T)
> >> 128 GB DDR4 ECC RAM
> >> 2 x 3.84 TB NVMe SSD
> >>
> >> Environment:
> >>
> >> - Ubuntu 20.04.3 LTS
> >> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> >> - Jena 4.3.1
> >>
> >> Command:
> >>
> >>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
> >>> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
> >>
> >> Log summary:
> >>
> >>> 04:14:28 INFO  Load node table  = 36600 seconds
> >>> 04:14:28 INFO  Load ingest data = 25811 seconds
> >>> 04:14:28 INFO  Build index SPO  = 20688 seconds
> >>> 04:14:28 INFO  Build index POS  = 35466 seconds
> >>> 04:14:28 INFO  Build index OSP  = 25042 seconds
> >>> 04:14:28 INFO  Overall          143607 seconds
> >>> 04:14:28 INFO  Overall          39h 53m 27s
> >>> 04:14:28 INFO  Triples loaded   = 6.610.055.778
> >>> 04:14:28 INFO  Quads loaded     = 0
> >>> 04:14:28 INFO  Overall Rate     46.028 tuples per second
> >>
> >> Disk space usage according to
> >>
> >>> du -sh datasets/wikidata-tdb
> >>
> >> is
> >>
> >>> 524G datasets/wikidata-tdb
> >>
> >> During loading I could see ~90 GB of RAM occupied (50% of total memory
> >> went to sort, and it used 2 threads - is it intended to stick to 2
> >> threads with --parallel 2?)
> >>
> >> Cheers,
> >> Lorenz
> >>
> >> On 12.12.21 13:07, Andy Seaborne wrote:
> >>> Hi, Øyvind,
> >>>
> >>> This is all very helpful feedback. Thank you.
> >>>
> >>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> >>>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >>>> truthy dump downloaded 2021-12-09.
> >>>
> >>> This is the 4.3.0 xloader?
> >>>
> >>> There are improvements in 4.3.1: since that release was going out, the
> >>> development version, which uses less temporary space, got merged in.
> >>> It has had some testing.
> >>>
> >>> It compresses triples.tmp and the intermediate sort files in the index
> >>> stage, making the peak usage much smaller.
> >>>
> >>>> The instance is a vm created on the Norwegian Research and Education
> >>>> Cloud, an openstack cloud provider.
> >>>>
> >>>> Instance type:
> >>>> 32 GB memory
> >>>> 4 CPU
> >>>
> >>> I'm using similar on a 7-year-old desktop machine, SATA disk.
> >>>
> >>> I haven't got a machine I can dedicate to the multi-day load. I'll try
> >>> to find a way to at least push it through building the node table.
> >>>
> >>> Loading the first 1B of truthy:
> >>>
> >>> 1B triples, 40k TPS, 06h 54m 10s
> >>>
> >>> The database is 81G and building needs an additional 11.6G of
> >>> workspace, for a total of 92G (+ the data file).
> >>>
> >>> While smaller, it seems bz2 files are much slower to decompress, so
> >>> I've been using gz files.
> >>>
> >>> My current best guess for 6.4B truthy is:
> >>>
> >>> Temp      96G
> >>> Database 540G
> >>> Data      48G
> >>> Total:   684G -- peak disk needed
> >>>
> >>> based on scaling up 1B truthy. Personally, I would make sure there was
> >>> more space. Also - I don't know if the shape of the data is
> >>> sufficiently uniform to make scaling predictable. The time doesn't
> >>> scale so simply.
> >>>
> >>> This is the 4.3.1 version - the 4.3.0 version uses a lot more disk
> >>> space.
> >>>
> >>> Compression reduces the size of triples.tmp -- and the related sort
> >>> temporary files, which add up to the same again -- to 1/6 of the size.
> >>>
> >>>> The storage used for dump + temp files is mounted as a separate 900GB
> >>>> volume and is mounted on /var/fuseki/databases. The type of storage
> >>>> is described as
> >>>>
> >>>>> *mass-storage-default*: Storage backed by spinning hard drives,
> >>>>> available to everybody and is the default type.
> >>>> with ext4 configured. At the moment I don't have access to the
> >>>> faster volume type mass-storage-ssd. CPU and memory are not
> >>>> dedicated, and can be overcommitted.
> >>>
> >>> "overcommitted" may be a problem.
> >>>
> >>> While it's not "tdb2 loader parallel", it does use continuous CPU in
> >>> several threads.
> >>>
> >>> For memory - "it's complicated".
> >>>
> >>> The java parts only need, say, 2G. The sort is set to "buffer 50%
> >>> --parallel=2", and the java process pipes into sort - that's another
> >>> thread. I think the effective peak is 3 active threads, and they'll
> >>> all be at 100% for some of the time.
> >>>
> >>> So it's going to need 50% of RAM, + 2G for the java process, + OS.
> >>>
> >>> It does not need space for memory-mapped files (they aren't used at
> >>> all in the loading process, and I/O is sequential).
> >>>
> >>> If that triggers overcommitment swap-out, the performance may go down
> >>> a lot.
> >>>
> >>> For disk - if that is physically remote, it should not be a problem
> >>> (famous last words). I/O is sequential and in large contiguous chunks
> >>> - typical for batch processing jobs.
> >>>
> >>>> OS for the instance is a clean Rocky Linux image, with no services
> >>>> except jena/fuseki installed. The systemd service set up for fuseki
> >>>> is stopped.
> >>>>
> >>>> jena and fuseki version is 4.3.0.
> >>>>
> >>>> openjdk 11.0.13 2021-10-19 LTS
> >>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >>>
> >>> Just FYI: Java 17 is a little faster. Some java improvements have
> >>> improved RDF parsing speed by up to 10%; in xloader that's not
> >>> significant to the overall time.
> >>>
> >>>> I'm running from a tmux session to avoid connectivity issues and to
> >>>> capture the output.
> >>>
> >>> I use
> >>>
> >>>     tdb2.xloader .... |& tee LOG-FILE-NAME
> >>>
> >>> to capture the logs and see them.
">&" and "tail -f" would achieve > >>> much the same effect > >>> > >>>> I think the output is stored in memory and not on disk. > >>>> On First run I tried to have the tmpdir on the root partition, to > >>>> separate > >>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was > >>>> disk > >>>> full. For the second (current run) all directories are under > >>>> /var/fuseki/databases. > >>> Yes - after making that mistake myself, the new version ignores system > >>> TMPDIR. Using --tmpdir is best but otherwise it defaults to the data > >>> directory. > >>> > >>>> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy > >>>> --tmpdir > >>>> /var/fuseki/databases/tmp latest-truthy.nt.gz > >>>> > >>>> The import is so far at the "ingest data" stage where it has really > >>>> slowed > >>>> down. > >>> FYI: The first line of ingest is always very slow. It is not measuring > >>> the start point correctly. > >>> > >>>> Current output is: > >>>> > >>>> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 > / > >>>> Avg: 7,593) > >>>> > >>>> See full log so far at > >>>> > https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab > >>> The earlier first pass also slows down and that should be fairly > >>> constant-ish speed step once everything settles down. > >>> > >>>> Some notes: > >>>> > >>>> * There is a (time/info) lapse in the output log between the end of > >>>> 'parse' and the start of 'index' for Terms. It is unclear to me what > is > >>>> happening in the 1h13 minutes between the lines. > >>> There is "sort" going on. "top" should show it. > >>> > >>> For each index there is also a very long pause for exactly the same > >>> reason. It would be good to have some something go "tick" and log a > >>> message occasionally. 
> >>>
> >>>> 22:33:46 INFO  Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
> >>>> 22:33:52 INFO  Terms :: == Parse: 50726.071 seconds :
> >>>> 6,560,468,631 triples/quads 129,331 TPS
> >>>> 23:46:13 INFO  Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
> >>>>
> >>>> * The ingest data step really slows down: at the current rate, if I
> >>>> calculated correctly, it looks like PKG.CmdxIngestData has 10 days
> >>>> left before it finishes.
> >>>
> >>> Ouch.
> >>>
> >>>> * When I saw sort running in the background for the first parts of
> >>>> the job, I looked at the `sort` command. I noticed from some online
> >>>> sources that setting the environment variable LC_ALL=C improves the
> >>>> speed of `sort`. Could this be set on the ProcessBuilder for the
> >>>> `sort` process? Could it break/change something? I see the warning
> >>>> from the man page for `sort`:
> >>>>
> >>>>     *** WARNING *** The locale specified by the environment affects
> >>>>     sort order. Set LC_ALL=C to get the traditional sort order that
> >>>>     uses native byte values.
> >>>
> >>> It shouldn't matter but, yes, better to set it and export it in the
> >>> control script and propagate it to forked processes.
> >>>
> >>> The sort is doing a binary sort, except, because it is a text sort
> >>> program, the binary is turned into hex (!!). Hex is in the ASCII
> >>> subset and should be locale-safe.
> >>>
> >>> But better to set LC_ALL=C.
> >>>
> >>>     Andy
> >>>
> >>>> Links:
> >>>> https://access.redhat.com/solutions/445233
> >>>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >>>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>>>
> >>>> Best regards,
> >>>> Øyvind

--
---
Marco Neumann
KONA
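[Editor's sketch] The LC_ALL=C discussion in the thread above can be illustrated with a small GNU sort example. This is not xloader's actual invocation; the sample file, its contents, and the /tmp paths are invented, and the flags simply mirror the "buffer 50% --parallel=2" settings Andy describes:

```shell
# Pin collation to byte values so hex-encoded keys sort predictably,
# regardless of the system locale.
export LC_ALL=C

# A made-up file of hex-encoded keys, purely for illustration.
printf '0a\n09\nFF\n' > /tmp/hex-keys.txt

# GNU sort: 2 worker threads, 50% of RAM as buffer, explicit temp dir.
sort --parallel=2 --buffer-size=50% --temporary-directory=/tmp /tmp/hex-keys.txt
# -> 09
#    0a
#    FF
```

Because the keys are hex text in the ASCII subset, byte-value ordering under LC_ALL=C matches the intended binary order; a locale-aware collation could, in principle, reorder or slow down the comparison, which is why setting it explicitly is the safer choice.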