Okay, I'll try that. About the incremental updates using TDB2.tdbloader, how do we do that? Is it possible through CLI?
Also, for inputs like Wikidata's dump, where we don't have an incremental update, do we need to find the delta ourselves, or would TDB2.tdbloader overwrite the old data with the new one?

On Fri, 29 Nov, 2019, 2:01 AM Andy Seaborne, <[email protected]> wrote:

>
>
> On 28/11/2019 16:29, Amandeep Srivastava wrote:
> > For some reason, java can't read the set $LOGGING value.
> >
> > Replacing
> > LOGGING="${LOGGING:--Dlog4j.configuration=file:$JENA_HOME/jena-log4j.properties}"
> >
> > with
> > LOGGING="-Djava.util.logging.config.file=$JENA_HOME/jena-log4j.properties"
> >
> > in tdb2.tdbloader script worked for me. Now it shows me all the logs.
>
> Actually that works by not setting log4j and defaulting to the built-in
> setting! Just not setting -Dlog4j.configuration should work.
>
> > One final thing that I wanted to ask is that database size created by
> > tdb2.tdbloader is much larger than what was created by tdbloader2. From the
> > archives, I see that it is because of full B+ trees implementation. Is
> > there any way I can reduce the index size of tdb2.tdbloader?
>
> Not really and the effect does not last if incremental updates happen to
> the tdbloader2 database.
>
> tdb2.tdbloader will work on an existing database, unlike tdbloader2
>
> It might be possible to write a compressor that rewrites indexes as part
> of the compaction step but it's not there at the moment.
>
> Andy
>
> > Thanks in advance, you guys have been super helpful. Appreciate it.
> >
> > On Thu, Nov 28, 2019 at 6:10 PM Amandeep Srivastava <
> > [email protected]> wrote:
> >
> >> Attaching cmd line details for reference. Also, after creating the
> >> database, it isn't removing the tdb lock which hinders fuseki server from
> >> reading from the database
> >>
> >> aman@DESKTOP-ML2LO1I:~$ cd apache-jena-3.13.1/
> >> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls
> >> LICENSE NOTICE README bat bin jena-log4j.properties lib lib-src
> >> src-examples test
> >> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ./bin/tdb2.tdbloader --loader=parallel --loc=../test ../bsbm-generated-dataset.nt
> >>
> >> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/
> >>
> >> Data-0001 tdb.lock
> >> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/Data-0001/
> >> GOSP.bpt  GPOS.idn  GSPO.dat  OSPG.bpt  POS.idn   SPO.dat   journal.jrnl    nodes.idn          prefixes.idn
> >> GOSP.dat  GPU.bpt   GSPO.idn  OSPG.dat  POSG.bpt  SPO.idn   nodes-data.bdf  prefixes-data.bdf  tdb.lock
> >> GOSP.idn  GPU.dat   OSP.bpt   OSPG.idn  POSG.dat  SPOG.bpt  nodes-data.obj  prefixes-data.obj
> >> GPOS.bpt  GPU.idn   OSP.dat   POS.bpt   POSG.idn  SPOG.dat  nodes.bpt       prefixes.bpt
> >> GPOS.dat  GSPO.bpt  OSP.idn   POS.dat   SPO.bpt   SPOG.idn  nodes.dat       prefixes.dat
> >>
> >>
> >> Thanks,
> >> Aman
> >>
> >> On Thu, Nov 28, 2019 at 3:59 PM Amandeep Srivastava <
> >> [email protected]> wrote:
> >>
> >>> Correction, using
> >>>
> >>> tdb2.tdbloader --loader=parallel --loc=../../db ../../file.nt
> >>>
> >>> On Thu, 28 Nov, 2019, 3:56 PM Amandeep Srivastava, <
> >>> [email protected]> wrote:
> >>>
> >>>> Yes, I have the jena-log4j.properties file within the jena repo and the
> >>>> TDB2.loader file under bin in same repo.
> >>>>
> >>>> For me, when I run tdb2.loader --loc=../../db ../../file.nt, I see
> >>>> no logs. The process starts consuming cores and ram but there's nothing on
> >>>> the console. When the loading is finished, cursor moves on to the next
> >>>> line.
> >>>>
> >>>> On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne, <[email protected]> wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> On 28/11/2019 05:44, Amandeep Srivastava wrote:
> >>>>>> Thanks Andy, setting it that way worked.
> >>>>>>
> >>>>>> Also, can we turn on the verbose logging in TDB2.loader like we have in
> >>>>>> tdbloader2?
> >>>>>>
> >>>>>> Basically, giving an output of how many triples it's loading and how much
> >>>>>> time has elapsed so far.
> >>>>>
> >>>>> It does that by default for the data phase. The report step size is
> >>>>> longer (500k) than TDB1
> >>>>>
> >>>>> The index phase is more parallel and not all mods report progress.
> >>>>>
> >>>>> What are you seeing?
> >>>>> (Do you have a log4j.properties in the current directory?)
> >>>>>
> >>>>> Andy
> >>>>>
> >>>>>
> >>>>> tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz
> >>>>>
> >>>>> INFO Loader = LoaderParallel
> >>>>> INFO Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
> >>>>> INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
> >>>>> INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
> >>>>> INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
> >>>>> INFO Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
> >>>>> INFO Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
> >>>>> INFO Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
> >>>>> INFO Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
> >>>>> INFO Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
> >>>>> INFO Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
> >>>>> INFO Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
> >>>>> INFO Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
> >>>>> INFO Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.96s (Avg: 172,690)
> >>>>> INFO Finish - index POS
> >>>>> INFO Finish - index SPO
> >>>>> INFO Finish - index OSP
> >>>>> INFO Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s
> >>>>>
> >>>>> though the default may be faster on this small dataset
> >>>>>
> >>>>>>
> >>>>>> On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne, <[email protected]> wrote:
> >>>>>>
> >>>>>>> Firstly - just to be clear - tdb.tdbloader2 is (confusingly) for TDB1.
> >>>>>>> Old name, before TDB2 came along so we're a bit stuck with it.
> >>>>>>>
> >>>>>>> tdbloader2 respects the $TMPDIR environment variable.
> >>>>>>>
> >>>>>>> Or set the SORT_ARGS environment variable with --temporary-directory=
> >>>>>>> (or -T). See tdbloader2 --help
> >>>>>>>
> >>>>>>> Andy
> >>>>>>>
> >>>>>>> On 14/11/2019 02:54, Amandeep Srivastava wrote:
> >>>>>>>> I was trying to test the performance of tdb.tdbloader2 by creating a TDB
> >>>>>>>> database. The loader failed at sort SPO step. The failure seems to occur
> >>>>>>>> because of insufficient storage in the /tmp folder. Can we point tdb to use
> >>>>>>>> another folder as /tmp?
> >>>>>>>>
> >>>>>>>> Error log:
> >>>>>>>> sort: write failed: /tmp/sortxRql3B: No space left on device
> >>>>>>>>
> >>>>>>>> On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava, <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks, Andy, for the detailed explanation :)
> >>>>>>>>>
> >>>>>>>>> On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne, <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 12/11/2019 15:53, Amandeep Srivastava wrote:
> >>>>>>>>>>> Thanks for the heads up, Dan. Will go and check the archives.
> >>>>>>>>>>>
> >>>>>>>>>>> I think I should get how to decide between tdb and TDB2 in the archives
> >>>>>>>>>>> itself.
> >>>>>>>>>>
> >>>>>>>>>> For large bulk loaders, the TDB2 loader is faster, if you use
> >>>>>>>>>> --loader=parallel (NB it can take over your machine's I/O!)
> >>>>>>>>>>
> >>>>>>>>>> See tdb2.tdbloader --help for names of plans that are built-in.
> >>>>>>>>>>
> >>>>>>>>>> The only way to know which is best is to try but
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The order of threading used is:
> >>>>>>>>>>
> >>>>>>>>>> sequential < light < phased < parallel
> >>>>>>>>>>
> >>>>>>>>>> (it does not always mean more threads is faster).
> >>>>>>>>>>
> >>>>>>>>>> sequential is roughly the same as the TDB1 bulk loader.
> >>>>>>>>>>
> >>>>>>>>>> parallel usually wins as data gets larger (several 100m) if the machine
> >>>>>>>>>> has the I/O to handle it.
> >>>>>>>>>>
> >>>>>>>>>> Andy
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts, <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Look through the list archives for posts from Andy describing the
> >>>>>>>>>>>> differences between tdb1 and tdb2. They have different optimizations; I
> >>>>>>>>>>>> don't recall the differences.
> >>>>>>>>>>>>
> >>>>>>>>>>>> thanks
> >>>>>>>>>>>> danno
> >>>>>>>>>>>>
> >>>>>>>>>>>> Dan Pritts
> >>>>>>>>>>>> ICPSR Computing and Network Services
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm trying to create a TDB database from Wikidata's official RDF dump to
> >>>>>>>>>>>>> read the data using Fuseki service. I need to make a few queries for my
> >>>>>>>>>>>>> personal project, running which the online service times out.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have a 12 core machine with 36 GB memory.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Can you please advise on the best way for creating the database? Since the
> >>>>>>>>>>>>> dump is huge, I cannot try all the approaches. Besides, I'm not sure if the
> >>>>>>>>>>>>> tdbloader function works in a similar way on data of different sizes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Questions:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. Which one would be better to use - tdb.tdbloader2 (TDB1) or
> >>>>>>>>>>>>> tdb2.tdbloader (TDB2) for creating the database and why? Any specific
> >>>>>>>>>>>>> configurations that I should be aware of?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2. I'm running a job currently using tdb.tdbloader2 but it is using just a
> >>>>>>>>>>>>> single core. Also, its loading speed is decreasing slowly. It started at
> >>>>>>>>>>>>> an avg of 120k tuples and is currently at 80k tuples. Can you advise how
> >>>>>>>>>>>>> I can utilize all the cores of my machine and maintain the loading speed at
> >>>>>>>>>>>>> the same time?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Aman
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> >> --
> >> Regards,
> >> Amandeep Srivastava
> >> Final Year Bachelor of Technology,
> >> Computer Science and Engineering Department,
> >> Indian Institute of Technology (ISM), Dhanbad.
> >>
> >>
> >
>
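
For reference, a minimal dry-run sketch of the command lines discussed in this thread. It only prints what it would execute. The function names, the RUN and JENA_BIN variables, and every path are hypothetical; only the tool names and flags (tdb2.tdbloader, tdbloader2, --loader=parallel, --loc, TMPDIR) come from the messages above. It assumes that a second tdb2.tdbloader run against the same --loc adds the new file's triples to the existing database, as Andy's remark that it "will work on an existing database" suggests. Under that assumption nothing is overwritten, so removing triples that are missing from a newer dump would still need a separately computed delta.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# RUN, JENA_BIN and every path are hypothetical placeholders.
JENA_BIN="${JENA_BIN-./bin}"   # assumed location of the Jena scripts
RUN="${RUN-echo}"              # export RUN="" to really execute

# Initial bulk load into a TDB2 database with the parallel loader.
initial_load() {   # usage: initial_load DB_DIR FILE
    $RUN "$JENA_BIN/tdb2.tdbloader" --loader=parallel --loc="$1" "$2"
}

# Later load into the same location; per the thread, tdb2.tdbloader
# works on an existing database, so this is assumed to add triples.
add_more() {       # usage: add_more DB_DIR FILE
    $RUN "$JENA_BIN/tdb2.tdbloader" --loc="$1" "$2"
}

# TDB1 loader with its sort temp files redirected away from /tmp.
tdb1_load() {      # usage: tdb1_load DB_DIR FILE TMP_DIR
    $RUN env TMPDIR="$3" "$JENA_BIN/tdbloader2" --loc "$1" "$2"
}

initial_load ../test ../bsbm-generated-dataset.nt
add_more ../test ../more-data.nt
tdb1_load ../tdb1 ../bsbm-generated-dataset.nt /data/tmp
```

With the defaults the script is safe to run anywhere, since each call is echoed rather than executed; the TDB1 and TDB2 examples deliberately use different database directories because the two formats are not interchangeable.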
