For some reason, Java can't read the $LOGGING value set in the environment.
Replacing
LOGGING="${LOGGING:--Dlog4j.configuration=file:$JENA_HOME/jena-log4j.properties}"
with
LOGGING="-Djava.util.logging.config.file=$JENA_HOME/jena-log4j.properties"
in the tdb2.tdbloader script worked for me. Now it shows me all the logs.
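For reference, a minimal shell sketch (the install path is illustrative, not from this thread) of why the replacement behaves differently: `${VAR:-default}` only falls back when $VAR is unset or empty, while a plain assignment always overrides whatever the environment supplied.

```shell
# Assumed install location - purely illustrative.
JENA_HOME=/opt/jena

# Original form: keeps a caller-supplied $LOGGING, and only falls back
# to the log4j file when $LOGGING is unset or empty.
LOGGING="${LOGGING:--Dlog4j.configuration=file:$JENA_HOME/jena-log4j.properties}"
echo "$LOGGING"

# Hard override: forces java.util.logging regardless of the environment.
LOGGING="-Djava.util.logging.config.file=$JENA_HOME/jena-log4j.properties"
echo "$LOGGING"
```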
One final thing I wanted to ask: the database created by tdb2.tdbloader is
much larger than the one created by tdbloader2. From the archives, I see
this is because of the full B+ tree implementation. Is there any way to
reduce the index size with tdb2.tdbloader?
Thanks in advance, you guys have been super helpful. Appreciate it.
On Thu, Nov 28, 2019 at 6:10 PM Amandeep Srivastava <
[email protected]> wrote:
> Attaching cmd line details for reference. Also, after creating the
> database, it isn't removing the tdb lock which hinders fuseki server from
> reading from the database
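> A workaround sketch (not official Jena guidance; the lock path matches the
> listing below) for clearing a stale lock once the loader has exited:

```shell
# Remove a stale tdb.lock left behind by a finished loader run, but
# only when no loader process still holds the database open.
if ! pgrep -f "tdb2.tdbloader" > /dev/null; then
    rm -f ../test/tdb.lock
fi
```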
>
> aman@DESKTOP-ML2LO1I:~$ cd apache-jena-3.13.1/
> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls
> LICENSE NOTICE README bat bin jena-log4j.properties lib lib-src
> src-examples test
> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ./bin/tdb2.tdbloader
> --loader=parallel --loc=../test ../bsbm-generated-dataset.nt
>
> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/
>
> Data-0001 tdb.lock
> aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/Data-0001/
> GOSP.bpt GPOS.idn GSPO.dat OSPG.bpt POS.idn SPO.dat journal.jrnl
> nodes.idn prefixes.idn GOSP.dat GPU.bpt GSPO.idn
> OSPG.dat POSG.bpt SPO.idn nodes-data.bdf prefixes-data.bdf
> tdb.lock GOSP.idn GPU.dat OSP.bpt OSPG.idn POSG.dat SPOG.bpt
> nodes-data.obj prefixes-data.obj GPOS.bpt GPU.idn OSP.dat POS.bpt
> POSG.idn SPOG.dat nodes.bpt prefixes.bpt GPOS.dat GSPO.bpt
> OSP.idn POS.dat SPO.bpt SPOG.idn nodes.dat prefixes.dat
>
>
> Thanks,
> Aman
>
> On Thu, Nov 28, 2019 at 3:59 PM Amandeep Srivastava <
> [email protected]> wrote:
>
>> Correction, using
>>
>> tdb2.tdbloader --loader=parallel --loc=../../db ../../file.nt
>>
>> On Thu, 28 Nov, 2019, 3:56 PM Amandeep Srivastava, <
>> [email protected]> wrote:
>>
>>> Yes, I have the jena-log4j.properties file within the Jena directory and
>>> the tdb2.tdbloader script under bin in the same directory.
>>>
>>> For me, when I run tdb2.tdbloader --loc=../../db ../../file.nt, I see no
>>> logs. The process starts consuming cores and RAM, but there's nothing on
>>> the console. When the loading finishes, the cursor moves on to the next
>>> line.
>>>
>>> On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne, <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On 28/11/2019 05:44, Amandeep Srivastava wrote:
>>>> > Thanks Andy, setting it that way worked.
>>>> >
>>>> > Also, can we turn on verbose logging in tdb2.tdbloader like we have
>>>> > in tdbloader2?
>>>> >
>>>> > Basically, giving an output of how many triples it's loading and how
>>>> > much time has elapsed so far.
>>>>
>>>> It does that by default for the data phase. The report step size is
>>>> larger (500k) than TDB1's.
>>>>
>>>> The index phase is more parallel, and not all of its steps report progress.
>>>>
>>>> What are you seeing?
>>>> (Do you have a log4j.properties in the current directory?)
>>>>
>>>> Andy
>>>>
>>>>
>>>>
>>>>
>>>> tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz
>>>>
>>>> INFO Loader = LoaderParallel
>>>> INFO Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
>>>> INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
>>>> INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
>>>> INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
>>>> INFO Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
>>>> INFO Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
>>>> INFO Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
>>>> INFO Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
>>>> INFO Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
>>>> INFO Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
>>>> INFO Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
>>>> INFO Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
>>>> INFO Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples
>>>> in 28.96s (Avg: 172,690)
>>>> INFO Finish - index POS
>>>> INFO Finish - index SPO
>>>> INFO Finish - index OSP
>>>> INFO Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s
>>>>
>>>> though the default may be faster on this small dataset
>>>>
>>>> >
>>>> > On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne, <[email protected]> wrote:
>>>> >
>>>> >> Firstly - just to be clear - tdbloader2 is (confusingly) for TDB1.
>>>> >> It's an old name from before TDB2 came along, so we're a bit stuck
>>>> >> with it.
>>>> >>
>>>> >> tdbloader2 respects the $TMPDIR environment variable.
>>>> >>
>>>> >> Or set the SORT_ARGS environment variable with --temporary-directory=
>>>> >> (or -T). See tdbloader2 --help
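>>>> >> A sketch of the two options above (the scratch path is an
>>>> >> assumption, not from the thread):

```shell
# Point tdbloader2's external sort(1) at a partition with enough free
# space instead of /tmp. /data/scratch is an assumed path - use any
# disk with room for the temporary sort files.
export TMPDIR=/data/scratch                            # respected by tdbloader2
export SORT_ARGS="--temporary-directory=/data/scratch" # passed to sort(1)
# then rerun the load, e.g.:
# tdbloader2 --loc ../db ../file.nt
```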
>>>> >>
>>>> >> Andy
>>>> >>
>>>> >> On 14/11/2019 02:54, Amandeep Srivastava wrote:
>>>> >>> I was trying to test the performance of tdbloader2 by creating a
>>>> >>> TDB database. The loader failed at the sort SPO step. The failure
>>>> >>> seems to occur because of insufficient storage in the /tmp folder.
>>>> >>> Can we point TDB to use another folder as /tmp?
>>>> >>>
>>>> >>> Error log:
>>>> >>> sort: write failed: /tmp/sortxRql3B: No space left on device
>>>> >>>
>>>> >>> On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava, <
>>>> >>> [email protected]> wrote:
>>>> >>>
>>>> >>>> Thanks, Andy, for the detailed explanation :)
>>>> >>>>
>>>> >>>> On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne, <[email protected]>
>>>> wrote:
>>>> >>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On 12/11/2019 15:53, Amandeep Srivastava wrote:
>>>> >>>>>> Thanks for the heads up, Dan. Will go and check the archives.
>>>> >>>>>>
>>>> >>>>>> I think I should find how to decide between TDB1 and TDB2 in
>>>> >>>>>> the archives themselves.
>>>> >>>>>
>>>> >>>>> For large bulk loads, the TDB2 loader is faster if you use
>>>> >>>>> --loader=parallel (NB it can take over your machine's I/O!)
>>>> >>>>>
>>>> >>>>> See tdb2.tdbloader --help for names of plans that are built-in.
>>>> >>>>>
>>>> >>>>> The only way to know which is best is to try.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> In order of threading used:
>>>> >>>>>
>>>> >>>>> sequential < light < phased < parallel
>>>> >>>>>
>>>> >>>>> (more threads does not always mean faster).
>>>> >>>>>
>>>> >>>>> sequential is roughly the same as the TDB1 bulk loader.
>>>> >>>>>
>>>> >>>>> parallel usually wins as data gets larger (several hundred
>>>> >>>>> million triples) if the machine has the I/O to handle it.
>>>> >>>>>
>>>> >>>>> Andy
>>>> >>>>>
>>>> >>>>>>
>>>> >>>>>> On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts, <[email protected]>
>>>> wrote:
>>>> >>>>>>
>>>> >>>>>>> Look through the list archives for posts from Andy describing
>>>> >>>>>>> the differences between TDB1 and TDB2. They have different
>>>> >>>>>>> optimizations; I don't recall the specifics.
>>>> >>>>>>>
>>>> >>>>>>> thanks
>>>> >>>>>>> danno
>>>> >>>>>>>
>>>> >>>>>>> Dan Pritts
>>>> >>>>>>> ICPSR Computing and Network Services
>>>> >>>>>>>
>>>> >>>>>>> On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:
>>>> >>>>>>>
>>>> >>>>>>>> Hi,
>>>> >>>>>>>>
>>>> >>>>>>>> I'm trying to create a TDB database from Wikidata's official
>>>> >>>>>>>> RDF dump, to read the data using the Fuseki service. I need to
>>>> >>>>>>>> make a few queries for my personal project, on which the online
>>>> >>>>>>>> service times out.
>>>> >>>>>>>>
>>>> >>>>>>>> I have a 12 core machine with 36 GB memory.
>>>> >>>>>>>>
>>>> >>>>>>>> Can you please advise on the best way to create the database?
>>>> >>>>>>>> Since the dump is huge, I cannot try all the approaches.
>>>> >>>>>>>> Besides, I'm not sure whether the loader behaves the same way
>>>> >>>>>>>> on data of different sizes.
>>>> >>>>>>>>
>>>> >>>>>>>> Questions:
>>>> >>>>>>>>
>>>> >>>>>>>> 1. Which one would be better for creating the database -
>>>> >>>>>>>> tdbloader2 (TDB1) or tdb2.tdbloader (TDB2) - and why? Any
>>>> >>>>>>>> specific configurations I should be aware of?
>>>> >>>>>>>>
>>>> >>>>>>>> 2. I'm currently running a job using tdbloader2, but it uses
>>>> >>>>>>>> just a single core. Also, its loading speed is slowly
>>>> >>>>>>>> decreasing: it started at an average of 120k tuples/s and is
>>>> >>>>>>>> currently at 80k tuples/s. Can you advise how I can utilize all
>>>> >>>>>>>> the cores of my machine while maintaining the loading speed?
>>>> >>>>>>>>
>>>> >>>>>>>> Regards,
>>>> >>>>>>>> Aman
>>>> >>>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>
--
Regards,
Amandeep Srivastava
Final Year Bachelor of Technology,
Computer Science and Engineering Department,
Indian Institute of Technology (ISM), Dhanbad.