Thank you, Lorenz. I am running this test again myself now with a larger disk. You may want to consider running a full load of Wikidata as well. The timing info and disk space figures you have should be sufficient.
Did we figure out a place to post the parser messages?

Marco

On Thu, Dec 16, 2021 at 10:01 AM LB <conpcompl...@googlemail.com.invalid> wrote:

> Sure
>
> wikidata-tdb/Data-0001:
>
> total 524G
> -rw-r--r-- 1   24 Dez 15 05:41 GOSP.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GPOS.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GPU.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GSPO.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.idn
> -rw-r--r-- 1    0 Dez 15 05:41 journal.jrnl
> -rw-r--r-- 1   24 Dez 15 05:41 nodes.bpt
> -rw-r--r-- 1  36G Dez 15 05:41 nodes.dat
> -rw-r--r-- 1   16 Dez 15 05:41 nodes-data.bdf
> -rw-r--r-- 1  44G Dez 15 05:41 nodes-data.obj
> -rw-r--r-- 1 312M Dez 15 05:41 nodes.idn
> -rw-r--r-- 1   24 Dez 15 05:41 OSP.bpt
> -rw-r--r-- 1 148G Dez 16 04:14 OSP.dat
> -rw-r--r-- 1   24 Dez 15 05:41 OSPG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.idn
> -rw-r--r-- 1 528M Dez 16 04:14 OSP.idn
> -rw-r--r-- 1   24 Dez 15 05:41 POS.bpt
> -rw-r--r-- 1 148G Dez 15 21:17 POS.dat
> -rw-r--r-- 1   24 Dez 15 05:41 POSG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.idn
> -rw-r--r-- 1 528M Dez 15 21:17 POS.idn
> -rw-r--r-- 1   24 Dez 15 05:41 prefixes.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.dat
> -rw-r--r-- 1   16 Dez 15 05:41 prefixes-data.bdf
> -rw-r--r-- 1    0 Dez 14 12:21 prefixes-data.obj
> -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.idn
> -rw-r--r-- 1   24 Dez 15 05:41 SPO.bpt
> -rw-r--r-- 1 148G Dez 15 11:25 SPO.dat
> -rw-r--r-- 1   24 Dez 15 05:41 SPOG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.idn
> -rw-r--r-- 1 528M Dez 15 11:25 SPO.idn
> -rw-r--r-- 1    8 Dez 15 21:17 tdb.lock
>
> On 16.12.21 10:27, Marco Neumann wrote:
> > Thank you Lorenz, can you please post a directory listing for Data-0001
> > with file sizes.
> >
> > On Thu, Dec 16, 2021 at 8:49 AM LB <conpcompl...@googlemail.com.invalid>
> > wrote:
> >
> >> Loading of the latest WD truthy dump (6.6 billion triples), Bzip2
> >> compressed:
> >>
> >> Server:
> >>
> >> AMD Ryzen 9 5950X (16C/32T)
> >> 128 GB DDR4 ECC RAM
> >> 2 x 3.84 TB NVMe SSD
> >>
> >> Environment:
> >>
> >> - Ubuntu 20.04.3 LTS
> >> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> >> - Jena 4.3.1
> >>
> >> Command:
> >>
> >>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
> >>> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
> >>
> >> Log summary:
> >>
> >>> 04:14:28 INFO  Load node table  = 36600 seconds
> >>> 04:14:28 INFO  Load ingest data = 25811 seconds
> >>> 04:14:28 INFO  Build index SPO  = 20688 seconds
> >>> 04:14:28 INFO  Build index POS  = 35466 seconds
> >>> 04:14:28 INFO  Build index OSP  = 25042 seconds
> >>> 04:14:28 INFO  Overall          143607 seconds
> >>> 04:14:28 INFO  Overall          39h 53m 27s
> >>> 04:14:28 INFO  Triples loaded   = 6.610.055.778
> >>> 04:14:28 INFO  Quads loaded     = 0
> >>> 04:14:28 INFO  Overall Rate     46.028 tuples per second
> >>
> >> Disk space usage according to
> >>
> >>> du -sh datasets/wikidata-tdb
> >>
> >> is
> >>
> >>> 524G datasets/wikidata-tdb
> >>
> >> During loading I could see ~90 GB of RAM occupied (50% of total memory
> >> went to sort, and it used 2 threads - is it intended to stick to 2
> >> threads with --parallel 2?)
> >>
> >> Cheers,
> >> Lorenz
> >>
> >> On 12.12.21 13:07, Andy Seaborne wrote:
> >>> Hi, Øyvind,
> >>>
> >>> This is all very helpful feedback. Thank you.
> >>>
> >>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> >>>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >>>> truthy dump downloaded 2021-12-09.
> >>>
> >>> This is the 4.3.0 xloader?
> >>>
> >>> There are improvements in 4.3.1: since that release was going out, the
> >>> development version, which uses less temporary space, got merged in.
> >>> It has had some testing.
> >>>
> >>> It compresses triples.tmp and the intermediate sort files in the index
> >>> stage, making the peak usage much smaller.
> >>>
> >>>> The instance is a vm created on the Norwegian Research and Education
> >>>> Cloud, an openstack cloud provider.
> >>>>
> >>>> Instance type:
> >>>> 32 GB memory
> >>>> 4 CPU
> >>>
> >>> I'm using similar on a 7-year-old desktop machine, SATA disk.
> >>>
> >>> I haven't got a machine I can dedicate to the multi-day load. I'll try
> >>> to find a way to at least push it through building the node table.
> >>>
> >>> Loading the first 1B of truthy:
> >>>
> >>> 1B triples, 40k TPS, 06h 54m 10s
> >>>
> >>> The database is 81G and building needs an additional 11.6G of
> >>> workspace, for a total of 92G (+ the data file).
> >>>
> >>> While smaller, it seems bz2 files are much slower to decompress, so
> >>> I've been using gz files.
> >>>
> >>> My current best guess for 6.4B truthy is:
> >>>
> >>> Temp      96G
> >>> Database 540G
> >>> Data      48G
> >>> Total:   684G -- peak disk needed
> >>>
> >>> based on scaling up 1B truthy. Personally, I would make sure there was
> >>> more space. Also - I don't know if the shape of the data is
> >>> sufficiently uniform to make scaling predictable. The time doesn't
> >>> scale so simply.
> >>>
> >>> This is the 4.3.1 version - the 4.3.0 version uses a lot more disk
> >>> space.
> >>>
> >>> Compression reduces the size of triples.tmp -- and the related sort
> >>> temporary files, which add up to the same again -- to 1/6 of the size.
> >>>
> >>>> The storage used for dump + temp files is mounted as a separate 900GB
> >>>> volume and is mounted on /var/fuseki/databases. The type of storage
> >>>> is described as
> >>>>
> >>>>> *mass-storage-default*: Storage backed by spinning hard drives,
> >>>>> available to everybody and is the default type.
> >>>> with ext4 configured. At the moment I don't have access to the
> >>>> faster volume type mass-storage-ssd. CPU and memory are not
> >>>> dedicated, and can be overcommitted.
> >>>
> >>> "overcommitted" may be a problem.
> >>>
> >>> While it's not "tdb2 loader parallel", it does use continuous CPU in
> >>> several threads.
> >>>
> >>> For memory - "it's complicated".
> >>>
> >>> The java parts only need, say, 2G. The sort is set to "buffer 50%
> >>> --parallel=2", and the java process pipes into sort - that's another
> >>> thread. I think the effective peak is 3 active threads, and they'll
> >>> all be at 100% for some of the time.
> >>>
> >>> So it's going to need 50% of RAM, + 2G for the java process, + OS.
> >>>
> >>> It does not need space for memory-mapped files (they aren't used at
> >>> all in the loading process, and I/O is sequential).
> >>>
> >>> If that triggers overcommitment swap-out, the performance may go down
> >>> a lot.
> >>>
> >>> For disk - if that is physically remote, it should not be a problem
> >>> (famous last words). I/O is sequential and in large contiguous chunks
> >>> - typical for batch processing jobs.
> >>>
> >>>> OS for the instance is a clean Rocky Linux image, with no services
> >>>> except jena/fuseki installed. The systemd service set up for fuseki
> >>>> is stopped.
> >>>>
> >>>> jena and fuseki version is 4.3.0.
> >>>>
> >>>> openjdk 11.0.13 2021-10-19 LTS
> >>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >>>
> >>> Just FYI: Java 17 is a little faster. Some java improvements have
> >>> improved RDF parsing speed by up to 10%; in xloader that's not
> >>> significant to the overall time.
> >>>
> >>>> I'm running from a tmux session to avoid connectivity issues and to
> >>>> capture the output.
> >>>
> >>> I use
> >>>
> >>>     tdb2.xloader .... |& tee LOG-FILE-NAME
> >>>
> >>> to capture the logs and see them.
">&" and "tail -f" would achieve > >>> much the same effect > >>> > >>>> I think the output is stored in memory and not on disk. > >>>> On First run I tried to have the tmpdir on the root partition, to > >>>> separate > >>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was > >>>> disk > >>>> full. For the second (current run) all directories are under > >>>> /var/fuseki/databases. > >>> Yes - after making that mistake myself, the new version ignores system > >>> TMPDIR. Using --tmpdir is best but otherwise it defaults to the data > >>> directory. > >>> > >>>> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy > >>>> --tmpdir > >>>> /var/fuseki/databases/tmp latest-truthy.nt.gz > >>>> > >>>> The import is so far at the "ingest data" stage where it has really > >>>> slowed > >>>> down. > >>> FYI: The first line of ingest is always very slow. It is not measuring > >>> the start point correctly. > >>> > >>>> Current output is: > >>>> > >>>> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 > / > >>>> Avg: 7,593) > >>>> > >>>> See full log so far at > >>>> > https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab > >>> The earlier first pass also slows down and that should be fairly > >>> constant-ish speed step once everything settles down. > >>> > >>>> Some notes: > >>>> > >>>> * There is a (time/info) lapse in the output log between the end of > >>>> 'parse' and the start of 'index' for Terms. It is unclear to me what > is > >>>> happening in the 1h13 minutes between the lines. > >>> There is "sort" going on. "top" should show it. > >>> > >>> For each index there is also a very long pause for exactly the same > >>> reason. It would be good to have some something go "tick" and log a > >>> message occasionally. 
> >>>
> >>>> 22:33:46 INFO  Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
> >>>> 22:33:52 INFO  Terms :: == Parse: 50726.071 seconds :
> >>>> 6,560,468,631 triples/quads 129,331 TPS
> >>>> 23:46:13 INFO  Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
> >>>>
> >>>> * The ingest data step really slows down: at the current rate, if I
> >>>> calculated correctly, it looks like PKG.CmdxIngestData has 10 days
> >>>> left before it finishes.
> >>>
> >>> Ouch.
> >>>
> >>>> * When I saw sort running in the background for the first parts of
> >>>> the job, I looked at the `sort` command. I noticed from some online
> >>>> sources that setting the environment variable LC_ALL=C improves the
> >>>> speed of `sort`. Could this be set on the ProcessBuilder for the
> >>>> `sort` process? Could it break/change something? I see the warning
> >>>> from the man page for `sort`:
> >>>>
> >>>>     *** WARNING *** The locale specified by the environment affects
> >>>>     sort order. Set LC_ALL=C to get the traditional sort order that
> >>>>     uses native byte values.
> >>>
> >>> It shouldn't matter but, yes, better to set it and export it in the
> >>> control script and propagate it to forked processes.
> >>>
> >>> The sort is doing a binary sort, except, because it is a text sort
> >>> program, the binary is turned into hex (!!). Hex is in the ASCII
> >>> subset and should be locale-safe.
> >>>
> >>> But better to set LC_ALL=C.
> >>>
> >>>     Andy
> >>>
> >>>> Links:
> >>>> https://access.redhat.com/solutions/445233
> >>>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >>>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>>>
> >>>> Best regards,
> >>>> Øyvind

--
---
Marco Neumann
KONA
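[Editor's sketch] The LC_ALL=C discussion in the thread above can be illustrated with a small GNU sort example. This is not xloader's actual invocation; the sample file, its contents, and the /tmp paths are invented, and the flags simply mirror the "buffer 50% --parallel=2" settings Andy describes:

```shell
# Pin collation to byte values so hex-encoded keys sort predictably,
# regardless of the system locale.
export LC_ALL=C

# A made-up file of hex-encoded keys, purely for illustration.
printf '0a\n09\nFF\n' > /tmp/hex-keys.txt

# GNU sort: 2 worker threads, 50% of RAM as buffer, explicit temp dir.
sort --parallel=2 --buffer-size=50% --temporary-directory=/tmp /tmp/hex-keys.txt
# -> 09
#    0a
#    FF
```

Because the keys are hex text in the ASCII subset, byte-value ordering under LC_ALL=C matches the intended binary order; a locale-aware collation could, in principle, reorder or slow down the comparison, which is why setting it explicitly is the safer choice.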