Re: TDB optimization query

Amandeep Srivastava Thu, 28 Nov 2019 17:23:52 -0800

Okay, I'll try that.

About incremental updates using tdb2.tdbloader: how do we do that? Is it
possible through the CLI?

Also, for inputs like Wikidata's dump, where we don't have an incremental
update, do we need to find the delta ourselves, or would tdb2.tdbloader
overwrite the old data with the new?
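
For the first question, going by Andy's note below that tdb2.tdbloader
works on an existing database, my guess is that an incremental load is just
another run against the same --loc. A rough sketch of what I have in mind;
the LOADER and DB variables, the load_file helper and the delta file name
are placeholders of mine, not anything that ships with Jena:

```shell
#!/bin/sh
# Sketch only. LOADER, DB and load_file are hypothetical names of mine.
# Assumes apache-jena's bin/ directory is on PATH.
LOADER="${LOADER:-tdb2.tdbloader}"
DB="${DB:-../test}"

# Add the triples in file "$1" to the database at $DB.
# The same call is used for the first load and for any later one.
load_file() {
    "$LOADER" --loc="$DB" "$1"
}

# Intended use (commented out so that sourcing this file loads nothing):
#   load_file ../bsbm-generated-dataset.nt   # initial bulk load
#   load_file ../delta.nt                    # later: add more triples
```

With LOADER=echo the helper only prints the arguments it would pass, which
is handy for checking the command line before starting a long run.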

On Fri, 29 Nov, 2019, 2:01 AM Andy Seaborne, <[email protected]> wrote:

>
>
> On 28/11/2019 16:29, Amandeep Srivastava wrote:
> > For some reason, java can't read the set $LOGGING value.
> >
> > Replacing
> > LOGGING="${LOGGING:--Dlog4j.configuration=file:$JENA_HOME/jena-log4j.properties}"
> > with
> > LOGGING="-Djava.util.logging.config.file=$JENA_HOME/jena-log4j.properties"
> > in the tdb2.tdbloader script worked for me. Now it shows me all the logs.
>
> Actually that works by not setting log4j and defaulting to the built-in
> setting! Just not setting -Dlog4j.configuration should work.
>
> > One final thing that I wanted to ask is that the database created by
> > tdb2.tdbloader is much larger than the one created by tdbloader2. From
> > the archives, I see that this is because of the full B+ tree
> > implementation. Is there any way I can reduce the index size of
> > tdb2.tdbloader?
>
> Not really and the effect does not last if incremental updates happen to
> the tdbloader2 database.
>
> tdb2.tdbloader will work on an existing database, unlike tdbloader2.
>
> It might be possible to write a compressor that rewrites indexes as part
> of the compaction step but it's not there at the moment.
>
>      Andy
>
> > Thanks in advance, you guys have been super helpful. Appreciate it.
> >
> > On Thu, Nov 28, 2019 at 6:10 PM Amandeep Srivastava <
> > [email protected]> wrote:
> >
> >> Attaching cmd line details for reference. Also, after creating the
> >> database, it isn't removing the tdb lock, which hinders the Fuseki
> >> server from reading the database.
> >>
> >> aman@DESKTOP-ML2LO1I:~$ cd apache-jena-3.13.1/
> >> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls
> >> LICENSE  NOTICE  README  bat  bin  jena-log4j.properties  lib  lib-src  src-examples  test
> >> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ./bin/tdb2.tdbloader --loader=parallel --loc=../test ../bsbm-generated-dataset.nt
> >> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/
> >> Data-0001  tdb.lock
> >> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/Data-0001/
> >> GOSP.bpt  GPOS.idn  GSPO.dat  OSPG.bpt  POS.idn   SPO.dat   journal.jrnl    nodes.idn          prefixes.idn
> >> GOSP.dat  GPU.bpt   GSPO.idn  OSPG.dat  POSG.bpt  SPO.idn   nodes-data.bdf  prefixes-data.bdf  tdb.lock
> >> GOSP.idn  GPU.dat   OSP.bpt   OSPG.idn  POSG.dat  SPOG.bpt  nodes-data.obj  prefixes-data.obj
> >> GPOS.bpt  GPU.idn   OSP.dat   POS.bpt   POSG.idn  SPOG.dat  nodes.bpt       prefixes.bpt
> >> GPOS.dat  GSPO.bpt  OSP.idn   POS.dat   SPO.bpt   SPOG.idn  nodes.dat       prefixes.dat
> >>
> >>
> >> Thanks,
> >> Aman
> >>
> >> On Thu, Nov 28, 2019 at 3:59 PM Amandeep Srivastava <
> >> [email protected]> wrote:
> >>
> >>> Correction, using
> >>>
> >>> tdb2.tdbloader --loader=parallel --loc=../../db ../../file.nt
> >>>
> >>> On Thu, 28 Nov, 2019, 3:56 PM Amandeep Srivastava, <
> >>> [email protected]> wrote:
> >>>
> >>>> Yes, I have the jena-log4j.properties file within the jena repo and the
> >>>> tdb2.tdbloader file under bin in the same repo.
> >>>>
> >>>> For me, when I run tdb2.tdbloader --loc=../../db ../../file.nt, I see
> >>>> no logs. The process starts consuming cores and RAM but there's nothing
> >>>> on the console. When the loading is finished, the cursor moves on to the
> >>>> next line.
> >>>>
> >>>> On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne, <[email protected]> wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> On 28/11/2019 05:44, Amandeep Srivastava wrote:
> >>>>>> Thanks Andy, setting it that way worked.
> >>>>>>
> >>>>>> Also, can we turn on verbose logging in tdb2.tdbloader like we have
> >>>>>> in tdbloader2?
> >>>>>>
> >>>>>> Basically, giving an output of how many triples it's loading and how
> >>>>>> much time has elapsed so far.
> >>>>>
> >>>>> It does that by default for the data phase. The report step size is
> >>>>> longer (500k) than in TDB1.
> >>>>>
> >>>>> The index phase is more parallel and not all mods report progress.
> >>>>>
> >>>>> What are you seeing?
> >>>>> (Do you have a log4j.properties in the current directory?)
> >>>>>
> >>>>>       Andy
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz
> >>>>>
> >>>>> INFO  Loader = LoaderParallel
> >>>>> INFO  Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
> >>>>> INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
> >>>>> INFO  Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
> >>>>> INFO  Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
> >>>>> INFO  Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
> >>>>> INFO  Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
> >>>>> INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
> >>>>> INFO  Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
> >>>>> INFO  Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
> >>>>> INFO  Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
> >>>>> INFO  Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
> >>>>> INFO    Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
> >>>>> INFO  Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.96s (Avg: 172,690)
> >>>>> INFO  Finish - index POS
> >>>>> INFO  Finish - index SPO
> >>>>> INFO  Finish - index OSP
> >>>>> INFO  Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s
> >>>>>
> >>>>> though the default may be faster on this small dataset.
> >>>>>
> >>>>>>
> >>>>>> On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne, <[email protected]> wrote:
> >>>>>>
> >>>>>>> Firstly - just to be clear - tdb.tdbloader2 is (confusingly) for TDB1.
> >>>>>>> Old name, before TDB2 came along so we're a bit stuck with it.
> >>>>>>>
> >>>>>>> tdbloader2 respects the $TMPDIR environment variable.
> >>>>>>>
> >>>>>>> Or set the SORT_ARGS environment variable with --temporary-directory=
> >>>>>>> (or -T). See tdbloader2 --help
> >>>>>>>
> >>>>>>>        Andy
> >>>>>>>
> >>>>>>> On 14/11/2019 02:54, Amandeep Srivastava wrote:
> >>>>>>>> I was trying to test the performance of tdb.tdbloader2 by creating
> >>>>>>>> a TDB database. The loader failed at the sort SPO step. The failure
> >>>>>>>> seems to occur because of insufficient storage in the /tmp folder.
> >>>>>>>> Can we point tdb to another folder to use instead of /tmp?
> >>>>>>>>
> >>>>>>>> Error log:
> >>>>>>>> sort: write failed: /tmp/sortxRql3B: No space left on device
> >>>>>>>>
> >>>>>>>> On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava, <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks, Andy, for the detailed explanation :)
> >>>>>>>>>
> >>>>>>>>> On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne, <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 12/11/2019 15:53, Amandeep Srivastava wrote:
> >>>>>>>>>>> Thanks for the heads up, Dan. Will go and check the archives.
> >>>>>>>>>>>
> >>>>>>>>>>> I think I should be able to work out how to decide between TDB
> >>>>>>>>>>> and TDB2 from the archives themselves.
> >>>>>>>>>>
> >>>>>>>>>> For large bulk loads, the TDB2 loader is faster, if you use
> >>>>>>>>>> --loader=parallel (NB it can take over your machine's I/O!)
> >>>>>>>>>>
> >>>>>>>>>> See tdb2.tdbloader --help for names of plans that are built-in.
> >>>>>>>>>>
> >>>>>>>>>> The only way to know which is best is to try, but the order of
> >>>>>>>>>> threading used is:
> >>>>>>>>>>
> >>>>>>>>>> sequential < light < phased < parallel
> >>>>>>>>>>
> >>>>>>>>>> (more threads does not always mean faster).
> >>>>>>>>>>
> >>>>>>>>>> sequential is roughly the same as the TDB1 bulk loader.
> >>>>>>>>>>
> >>>>>>>>>> parallel usually wins as data gets larger (several 100m) if the
> >>>>>>>>>> machine has the I/O to handle it.
> >>>>>>>>>>
> >>>>>>>>>>         Andy
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts, <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Look through the list archives for posts from Andy describing the
> >>>>>>>>>>>> differences between tdb1 and tdb2. They have different
> >>>>>>>>>>>> optimizations; I don't recall the differences.
> >>>>>>>>>>>>
> >>>>>>>>>>>> thanks
> >>>>>>>>>>>> danno
> >>>>>>>>>>>>
> >>>>>>>>>>>> Dan Pritts
> >>>>>>>>>>>> ICPSR Computing and Network Services
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm trying to create a TDB database from Wikidata's official RDF
> >>>>>>>>>>>>> dump to read the data using the Fuseki service. I need to make a
> >>>>>>>>>>>>> few queries for my personal project, and the online service times
> >>>>>>>>>>>>> out when running them.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have a 12 core machine with 36 GB memory.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Can you please advise on the best way to create the database?
> >>>>>>>>>>>>> Since the dump is huge, I cannot try all the approaches. Besides,
> >>>>>>>>>>>>> I'm not sure if the tdbloader function works in a similar way on
> >>>>>>>>>>>>> data of different sizes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Questions:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. Which one would be better to use - tdb.tdbloader2 (TDB1) or
> >>>>>>>>>>>>> tdb2.tdbloader (TDB2) - for creating the database, and why? Any
> >>>>>>>>>>>>> specific configurations that I should be aware of?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2. I'm currently running a job using tdb.tdbloader2 but it is
> >>>>>>>>>>>>> using just a single core. Also, its loading speed is slowly
> >>>>>>>>>>>>> decreasing. It started at an avg of 120k tuples and is currently
> >>>>>>>>>>>>> at 80k tuples. Can you advise how I can utilize all the cores of
> >>>>>>>>>>>>> my machine and maintain the loading speed at the same time?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Aman
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> >> --
> >> Regards,
> >> Amandeep Srivastava
> >> Final Year Bachelor of Technology,
> >> Computer Science and Engineering Department,
> >> Indian Institute of Technology (ISM), Dhanbad.
> >>
> >>
> >
>
