Correction, using tdb2.tdbloader --loader=parallel --loc=../../db ../../file.nt
On Thu, 28 Nov, 2019, 3:56 PM Amandeep Srivastava, <[email protected]> wrote:

> Yes, I have the jena-log4j.properties file within the jena repo and the
> TDB2.loader file under bin in the same repo.
>
> For me, when I run tdb2.loader --loc=../../db ../../file.nt, I see no
> logs. The process starts consuming cores and RAM but there's nothing on
> the console. When the loading is finished, the cursor moves on to the
> next line.
>
> On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne, <[email protected]> wrote:
>
>> On 28/11/2019 05:44, Amandeep Srivastava wrote:
>>> Thanks Andy, setting it that way worked.
>>>
>>> Also, can we turn on the verbose logging in TDB2.loader like we have
>>> in tdbloader2?
>>>
>>> Basically, giving an output of how many triples it's loading and how
>>> much time has elapsed so far.
>>
>> It does that by default for the data phase. The report step size is
>> longer (500k) than TDB1's.
>>
>> The index phase is more parallel and not all modes report progress.
>>
>> What are you seeing?
>> (Do you have a log4j.properties in the current directory?)
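An aside on the log4j.properties question: Jena's command-line tools of this vintage configured logging with Log4j 1.2, so a minimal log4j.properties in the current directory along these lines should make the loader's INFO progress lines visible on the console. This is a sketch; the appender name and pattern layout are illustrative, not the exact file Jena ships.

```properties
# Send INFO and above to the console.
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %-20c{1} :: %m%n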
>> Andy
>>
>> tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz
>>
>> INFO Loader = LoaderParallel
>> INFO Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
>> INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
>> INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
>> INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
>> INFO Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
>> INFO Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
>> INFO Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
>> INFO Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
>> INFO Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
>> INFO Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
>> INFO Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
>> INFO Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
>> INFO Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples
>>      in 28.96s (Avg: 172,690)
>> INFO Finish - index POS
>> INFO Finish - index SPO
>> INFO Finish - index OSP
>> INFO Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s
>>
>> though the default may be faster on this small dataset
>>
>>> On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne, <[email protected]> wrote:
>>>
>>>> Firstly - just to be clear - tdb.tdbloader2 is (confusingly) for TDB1.
>>>> Old name, before TDB2 came along, so we're a bit stuck with it.
>>>>
>>>> tdbloader2 respects the $TMPDIR environment variable.
>>>>
>>>> Or set the SORT_ARGS environment variable with --temporary-directory=
>>>> (or -T). See tdbloader2 --help
>>>>
>>>> Andy
>>>>
>>>> On 14/11/2019 02:54, Amandeep Srivastava wrote:
>>>>> I was trying to test the performance of tdb.tdbloader2 by creating a
>>>>> TDB database. The loader failed at the sort SPO step. The failure
>>>>> seems to occur because of insufficient storage in the /tmp folder.
>>>>> Can we point TDB to use another folder as /tmp?
>>>>>
>>>>> Error log:
>>>>> sort: write failed: /tmp/sortxRql3B: No space left on device
>>>>>
>>>>> On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava, <[email protected]> wrote:
>>>>>
>>>>>> Thanks, Andy, for the detailed explanation :)
>>>>>>
>>>>>> On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne, <[email protected]> wrote:
>>>>>>
>>>>>>> On 12/11/2019 15:53, Amandeep Srivastava wrote:
>>>>>>>> Thanks for the heads up, Dan. Will go and check the archives.
>>>>>>>>
>>>>>>>> I think I should find how to decide between TDB1 and TDB2 in the
>>>>>>>> archives itself.
>>>>>>>
>>>>>>> For large bulk loads, the TDB2 loader is faster if you use
>>>>>>> --loader=parallel (NB it can take over your machine's I/O!).
>>>>>>>
>>>>>>> See tdb2.tdbloader --help for the names of the built-in plans.
>>>>>>>
>>>>>>> The only way to know which is best is to try, but:
>>>>>>>
>>>>>>> The order of threading used is:
>>>>>>>
>>>>>>> sequential < light < phased < parallel
>>>>>>>
>>>>>>> (more threads does not always mean faster).
>>>>>>>
>>>>>>> sequential is roughly the same as the TDB1 bulk loader.
>>>>>>>
>>>>>>> parallel usually wins as data gets larger (several 100m) if the
>>>>>>> machine has the I/O to handle it.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>>> On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Look through the list archives for posts from Andy describing the
>>>>>>>>> differences between TDB1 and TDB2. They have different
>>>>>>>>> optimizations; I don't recall the differences.
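The /tmp workaround Andy gives earlier in the thread can be sketched as follows. tdbloader2 (TDB1) shells out to the system sort(1) for its index phase, so either TMPDIR or sort's own flag via SORT_ARGS redirects the temporary files; the directory path here is hypothetical.

```shell
# Point sort's temporary files at a location with enough free space.
export TMPDIR="$HOME/tdb-tmp"      # hypothetical roomy location
mkdir -p "$TMPDIR"

# Equivalent, passing sort's own flag through SORT_ARGS:
export SORT_ARGS="--temporary-directory=$TMPDIR"

# Then rerun the load (paths as in the thread):
# tdbloader2 --loc ../../db ../../file.nt
```

The loader invocation is left commented since it needs a Jena installation and the dataset on disk.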
>>>>>>>>> thanks
>>>>>>>>> danno
>>>>>>>>>
>>>>>>>>> Dan Pritts
>>>>>>>>> ICPSR Computing and Network Services
>>>>>>>>>
>>>>>>>>> On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm trying to create a TDB database from Wikidata's official RDF
>>>>>>>>>> dump, to read the data using the Fuseki service. I need to make
>>>>>>>>>> a few queries for my personal project, and the online service
>>>>>>>>>> times out when running them.
>>>>>>>>>>
>>>>>>>>>> I have a 12-core machine with 36 GB memory.
>>>>>>>>>>
>>>>>>>>>> Can you please advise on the best way of creating the database?
>>>>>>>>>> Since the dump is huge, I cannot try all the approaches.
>>>>>>>>>> Besides, I'm not sure if the tdbloader function works in a
>>>>>>>>>> similar way on data of different sizes.
>>>>>>>>>>
>>>>>>>>>> Questions:
>>>>>>>>>>
>>>>>>>>>> 1. Which one would be better to use - tdb.tdbloader2 (TDB1) or
>>>>>>>>>> tdb2.tdbloader (TDB2) - for creating the database, and why? Any
>>>>>>>>>> specific configurations that I should be aware of?
>>>>>>>>>>
>>>>>>>>>> 2. I'm running a job currently using tdb.tdbloader2 but it is
>>>>>>>>>> using just a single core. Also, its loading speed is decreasing
>>>>>>>>>> slowly. It started at an avg of 120k tuples and is currently at
>>>>>>>>>> 80k tuples. Can you advise how I can utilize all the cores of
>>>>>>>>>> my machine and maintain the loading speed at the same time?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Aman
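Putting the thread's advice together, a load of a large dump with the TDB2 parallel plan can be sketched like this. The paths, dump name, and heap size are illustrative; JVM_ARGS is the variable Jena's launcher scripts read for JVM options.

```shell
# Moderate JVM heap; leave the rest of RAM to the OS file cache,
# which TDB2's memory-mapped files rely on.
export JVM_ARGS=-Xmx8G

# Parallel plan: usually fastest on large inputs if the disks keep up.
tdb2.tdbloader --loader=parallel --loc /data/wikidata-db dump.nt.gz

# If parallel saturates I/O, step down the plan ladder:
#   sequential < light < phased < parallel
# e.g.
#   tdb2.tdbloader --loader=phased --loc /data/wikidata-db dump.nt.gz
```

This is a sketch of one reasonable setup, not a tuning recommendation; as Andy says above, the only way to know the best plan for a given machine is to try.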
