On 28/11/2019 16:29, Amandeep Srivastava wrote:
For some reason, java can't read the set $LOGGING value.

Replacing
LOGGING="${LOGGING:--Dlog4j.configuration=file:$JENA_HOME/jena-log4j.properties}"
with
LOGGING="-Djava.util.logging.config.file=$JENA_HOME/jena-log4j.properties"
in the tdb2.tdbloader script worked for me. Now it shows me all the logs.

Actually, that works by not setting log4j at all and falling back to the built-in default! Just not setting -Dlog4j.configuration should work.
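If the goal is console output from log4j itself rather than the java.util.logging fallback, $LOGGING can point at a minimal console-only log4j 1.x configuration instead. A sketch; the properties below are an illustrative minimal config written from scratch, not a copy of Jena's shipped jena-log4j.properties, and the path is hypothetical:

```shell
# Write a minimal console-only log4j 1.x config (illustrative content).
cat > /tmp/console-log4j.properties <<'EOF'
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%-5p %m%n
EOF

# Point the loader script at it; the bin scripts pass $LOGGING to the JVM.
export LOGGING="-Dlog4j.configuration=file:/tmp/console-log4j.properties"

# Then run the loader as usual, e.g.:
#   ./bin/tdb2.tdbloader --loader=parallel --loc=../test data.nt
echo "$LOGGING"
```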

One final thing I wanted to ask: the database size created by tdb2.tdbloader is much larger than what was created by tdbloader2. From the archives, I see that this is because of the full B+ tree implementation. Is there any way I can reduce the index size of tdb2.tdbloader?

Not really, and the effect does not last if incremental updates happen to the tdbloader2 database.

tdb2.tdbloader will work on an existing database, unlike tdbloader2.

It might be possible to write a compressor that rewrites indexes as part of the compaction step, but it's not there at the moment.
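In the meantime, the per-index file sizes show where the space goes. A sketch using a fabricated directory in place of a real Data-0001 (the dummy .bpt files stand in for real index files such as SPO.bpt):

```shell
# Fabricated stand-in for a TDB2 Data-0001 directory; real index files
# would be listed the same way with du.
mkdir -p demo/Data-0001
head -c 1048576 /dev/zero > demo/Data-0001/SPO.bpt   # 1 MiB dummy "index"
head -c 4096    /dev/zero > demo/Data-0001/POS.bpt   # 4 KiB dummy "index"

# Largest B+tree files first, sizes in KiB.
du -k demo/Data-0001/*.bpt | sort -rn | head -5
```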

    Andy

Thanks in advance, you guys have been super helpful. Appreciate it.

On Thu, Nov 28, 2019 at 6:10 PM Amandeep Srivastava <
[email protected]> wrote:

Attaching cmd line details for reference. Also, after creating the database, it isn't removing the tdb lock, which stops the Fuseki server from reading the database.
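If the loader exited uncleanly, the lock may simply be stale. As I understand it, the tdb.lock file holds the owning process id, so it can be cleared safely once that process is gone. A sketch with a fabricated database directory and a deliberately dead PID (both are illustrative, not from the transcript):

```shell
# Remove tdb.lock only if the PID recorded in it is no longer running.
remove_stale_lock() {
  lock="$1"
  [ -f "$lock" ] || return 0
  pid=$(cat "$lock")
  if ! kill -0 "$pid" 2>/dev/null; then   # signal 0 = existence check only
    rm -f "$lock"
    echo "removed stale lock (pid $pid)"
  fi
}

# Demo with a fabricated lock owned by a PID that cannot be alive.
mkdir -p demo-db
echo 99999999 > demo-db/tdb.lock
remove_stale_lock demo-db/tdb.lock
```

Never remove the lock while a loader or Fuseki is actually still running against the database.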

aman@DESKTOP-ML2LO1I:~$ cd apache-jena-3.13.1/
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls
LICENSE  NOTICE  README  bat  bin  jena-log4j.properties  lib  lib-src  src-examples  test
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ./bin/tdb2.tdbloader --loader=parallel --loc=../test ../bsbm-generated-dataset.nt

aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/
Data-0001  tdb.lock
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/Data-0001/
GOSP.bpt  GOSP.dat  GOSP.idn  GPOS.bpt  GPOS.dat  GPOS.idn  GPU.bpt   GPU.dat   GPU.idn
GSPO.bpt  GSPO.dat  GSPO.idn  OSP.bpt   OSP.dat   OSP.idn   OSPG.bpt  OSPG.dat  OSPG.idn
POS.bpt   POS.dat   POS.idn   POSG.bpt  POSG.dat  POSG.idn  SPO.bpt   SPO.dat   SPO.idn
SPOG.bpt  SPOG.dat  SPOG.idn  journal.jrnl  nodes-data.bdf  nodes-data.obj  nodes.bpt
nodes.dat  nodes.idn  prefixes-data.bdf  prefixes-data.obj  prefixes.bpt  prefixes.dat
prefixes.idn  tdb.lock


Thanks,
Aman

On Thu, Nov 28, 2019 at 3:59 PM Amandeep Srivastava <
[email protected]> wrote:

Correction, using

tdb2.tdbloader --loader=parallel --loc=../../db ../../file.nt

On Thu, 28 Nov, 2019, 3:56 PM Amandeep Srivastava, <
[email protected]> wrote:

Yes, I have the jena-log4j.properties file in the Jena directory and the tdb2.tdbloader script under bin in the same directory.

For me, when I run tdb2.tdbloader --loc=../../db ../../file.nt, I see no logs. The process starts consuming cores and RAM, but there's nothing on the console. When the loading is finished, the cursor moves on to the next line.

On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne, <[email protected]> wrote:



On 28/11/2019 05:44, Amandeep Srivastava wrote:
Thanks Andy, setting it that way worked.

Also, can we turn on verbose logging in tdb2.tdbloader like we have in tdbloader2?

Basically, an output of how many triples it's loading and how much time has elapsed so far.

It does that by default for the data phase. The report step size is longer (500k) than in TDB1.

The index phase is more parallel and not all of its steps report progress.

What are you seeing?
(Do you have a log4j.properties in the current directory?)

      Andy




tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz

INFO  Loader = LoaderParallel
INFO  Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
INFO  Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
INFO  Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
INFO  Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
INFO  Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
INFO  Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
INFO  Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
INFO  Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
INFO  Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
INFO    Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
INFO  Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.96s (Avg: 172,690)
INFO  Finish - index POS
INFO  Finish - index SPO
INFO  Finish - index OSP
INFO  Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s

though the default may be faster on this small dataset
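The overall rate reported on the last line is just total tuples over total wall-clock seconds; checking the arithmetic from the log above:

```shell
# 5,000,599 tuples in 39.180s should reproduce the reported 127,631/s
# (printf %d truncates the fractional part).
awk 'BEGIN { printf "%d\n", 5000599 / 39.180 }'
# prints 127631
```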


On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne, <[email protected]> wrote:

Firstly - just to be clear - tdb.tdbloader2 is (confusingly) for TDB1. It's the old name from before TDB2 came along, so we're a bit stuck with it.

tdbloader2 respects the $TMPDIR environment variable.

Or set the SORT_ARGS environment variable with --temporary-directory=
(or -T). See tdbloader2 --help
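Both routes end up controlling where sort(1) spills its temporary files. A sketch; the spill directory is a hypothetical path standing in for a disk with more free space:

```shell
# Route 1: set TMPDIR globally for the whole loader run.
mkdir -p /tmp/big-sort-spill
export TMPDIR=/tmp/big-sort-spill

# Route 2: pass the flag only to tdbloader2's sort invocations.
export SORT_ARGS="--temporary-directory=/tmp/big-sort-spill"

# sort itself honours the flag directly:
printf 'b\nc\na\n' | sort --temporary-directory=/tmp/big-sort-spill
# prints a, b, c on separate lines
```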

       Andy

On 14/11/2019 02:54, Amandeep Srivastava wrote:
I was trying to test the performance of tdb.tdbloader2 by creating a TDB database. The loader failed at the sort SPO step. The failure seems to occur because of insufficient storage in the /tmp folder. Can we point TDB to use another folder as /tmp?

Error log:
sort: write failed: /tmp/sortxRql3B: No space left on device

On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava, <
[email protected]> wrote:

Thanks, Andy, for the detailed explanation :)

On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne, <[email protected]>
wrote:



On 12/11/2019 15:53, Amandeep Srivastava wrote:
Thanks for the heads up, Dan. Will go and check the archives.

I think I should find how to decide between TDB1 and TDB2 in the archives themselves.

For large bulk loads, the TDB2 loader is faster if you use --loader=parallel (NB it can take over your machine's I/O!).

See tdb2.tdbloader --help for the names of the built-in plans.

The only way to know which is best is to try.


The order of threading used is:

sequential < light < phased < parallel

(more threads does not always mean faster).

sequential is roughly the same as the TDB1 bulk loader.

parallel usually wins as data gets larger (several hundred million triples) if the machine has the I/O to handle it.
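One way to choose is to time each plan on a sample of the data. A sketch that only prints the commands to run rather than running them (the database locations and data file name are placeholders; the plan names are the ones listed above):

```shell
# Print a timing command per built-in loader plan; wrap each in
# /usr/bin/time and load into a fresh --loc to compare fairly.
for plan in sequential light phased parallel; do
  cmd="tdb2.tdbloader --loader=$plan --loc=DB-$plan data.nt"
  echo "$cmd"
done
```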

        Andy


On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts, <[email protected]>
wrote:

Look through the list archives for posts from Andy describing the differences between TDB1 and TDB2. They have different optimizations; I don't recall the differences.

thanks
danno

Dan Pritts
ICPSR Computing and Network Services

On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:

Hi,

I'm trying to create a TDB database from Wikidata's official RDF dump, to read the data using the Fuseki service. I need to make a few queries for my personal project, and the online service times out when running them.

I have a 12 core machine with 36 GB memory.

Can you please advise on the best way to create the database? Since the dump is huge, I cannot try all the approaches. Besides, I'm not sure whether the tdbloader works in a similar way on data of different sizes.

Questions:

1. Which would be better for creating the database, tdb.tdbloader2 (TDB1) or tdb2.tdbloader (TDB2), and why? Any specific configurations I should be aware of?

2. I'm currently running a job using tdb.tdbloader2 but it is using just a single core. Also, its loading speed is slowly decreasing: it started at an average of 120k tuples/s and is currently at 80k tuples/s. Can you advise how I can utilize all the cores of my machine and maintain the loading speed at the same time?

Regards,
Aman


--
Regards,
Amandeep Srivastava
Final Year Bachelor of Technology,
Computer Science and Engineering Department,
Indian Institute of Technology (ISM), Dhanbad.


