Re: Testing tdb2.xloader

2022-01-11 Thread Øyvind Gjesdal
I'm debugging now and think I've found some possible culprits of the
slow data ingest stage on my setup.

In the ingest data stage, I see 100% disk utilization on reads at only 2.5
MB/s, and ps also shows my processor spending time waiting for I/O.
In the jena-base/src/main/java/org/apache/jena/atlas/io/IO.java class, both
GZIPInputStream and GZIPOutputStream are constructed with the single-argument
constructor, which defaults to a 512-byte buffer. (
https://github.com/openjdk/jdk/blob/6e1da6440b898d8a10b3db56707a75f32a489149/src/java.base/share/classes/java/util/zip/GZIPInputStream.java#L90
)

I've built a snapshot with "return new
GZIP[Output/Input]Stream(out, 8*1024);" (the buffer value used in
other places for GZIPOutput/Input in Jena), which is currently running, but
I'm thinking of also testing larger buffers to see what my "lowest best
buffer" is. I'm now waiting for the first steps to complete, to see the
effect of the change.
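
For reference, this is the shape of the change (a sketch only, not the actual
IO.java code; the 8 KiB value is just the one I'm testing first):

  import java.io.*;
  import java.util.zip.GZIPInputStream;
  import java.util.zip.GZIPOutputStream;

  // Sketch: the one-argument GZIP stream constructors use a 512-byte internal
  // buffer; the two-argument form lets the caller pick a larger one.
  class GzipBufferSketch {
      static final int BUF_SIZE = 8 * 1024;   // buffer size under test

      static InputStream gzipInput(InputStream in) throws IOException {
          return new GZIPInputStream(in, BUF_SIZE);    // was: new GZIPInputStream(in)
      }

      static OutputStream gzipOutput(OutputStream out) throws IOException {
          return new GZIPOutputStream(out, BUF_SIZE);  // was: new GZIPOutputStream(out)
      }
  }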

What do I need to do to start from the exec_java $PKG.CmdxIngestData step
in tdb2.xloader?

Would
* stopping xloader before this step once
* taking a copy of $LOC
* then copying the copy to $LOC before each run (and commenting out
preceding steps in bin/tdb2.xloader)

be enough to have a working test setup to start at "data ingest", or do I
need to do something else?
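
For concreteness, this is the kind of loop I have in mind (a sketch only; it
assumes the steps before the ingest stage are commented out in my copy of
bin/tdb2.xloader, and the file names are illustrative):

  # one-off: stop xloader just before the ingest step, then snapshot the database directory
  cp -a "$LOC" "$LOC.before-ingest"

  # before each test run: restore the snapshot, then rerun the modified script
  rm -rf "$LOC"
  cp -a "$LOC.before-ingest" "$LOC"
  bin/tdb2.xloader --tmpdir "$TMPDIR" --loc "$LOC" data.nt.gz   # same arguments as the full run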

Best regards,
Øyvind



On Tue, 28 Dec 2021 at 14:21, Andy Seaborne  wrote:

> Excellent news!
>
> Updated: https://www.w3.org/wiki/LargeTripleStores
>
>  Andy
>
> On 28/12/2021 10:11, Marco Neumann wrote:
> > Ok here is another successful tdb2 load. this time with the full wikidata
> > download (20211222_latest-all.nt.gz 172G ) file.
> >
> > counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
> > entire load.
> >
> > I think with the right hardware this could easily be time compressed
> quite
> > a bit.
> >
> > http://www.lotico.com/temp/LOG-45497
> >
> > Best,
> > Marco
> >
> >
> > On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann 
> > wrote:
> >
> >> Thank you Andy. found it in revisions somewhere
> >>
> >> just finished another run with truthy
> >>
> >> http://lotico.com/temp/LOG-1214
> >>
> >> will now increase RAM before running an additional load with increased
> >> thread count.
> >>
> >> Marco
> >>
> >> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne  wrote:
> >>
> >>> gists are git repos: so the file is there ... somewhere:
> >>>
> >>>
> >>>
> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
> >>>
> >>>   Andy
> >>>
> >>> On 19/12/2021 17:56, Marco Neumann wrote:
>  Thank you Lorenz,
>  unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
>  github
> 
> 
>  On Sun, Dec 19, 2021 at 9:46 AM LB  >>> .invalid>
>  wrote:
> 
> > I edited the Gist [1] and put the default stats there. Takes ~4min to
> > compute the stats.
> >
> > Findings:
> >
> > - for Wikidata we have to extend those stats with the stats for
> wdt:P31
> > property as Wikidata does use this property as their own rdf:type
> > relation. It is indeed trivial, just execute
> >
> > select ?c (count(*) as ?cnt) {?s
> >  ?c} group by ?c
> >
> > and convert it into the stats rule language (SSE) and put those rules
> > before the more generic rule
> >
> > |( 98152611)|
> >
> > - I didn't want to touch the stats script itself, but we could for
> > example also make this type relation generic and allow for other like
> > wdt:P31 or skos:subject via a commandline option which provides any
> URI
> > as the type relation with default being rdf:type - but that's for
> sure
> > probably overkill
> >
> > - there is a bug in the stats script or file I guess, because of of
> >>> some
> > overflow? the count value is
> >
> > (count -1983667112))
> >
> > which indicates this.  I opened a ticket:
> > https://issues.apache.org/jira/browse/JENA-2225
> >
> >
> > [1]
> >
> >>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >
> > On 18.12.21 11:35, Marco Neumann wrote:
> >> good morning Lorenz,
> >>
> >> Maybe time to get a few query bencharms tests? :)
> >>
> >> What does tdb2.tdbstats report?
> >>
> >> Marco
> >>
> >>
> >> On Sat, Dec 18, 2021 at 8:09 AM LB  >>> .invalid>
> >> wrote:
> >>
> >>> Good morning,
> >>>
> >>> loading of Wikidata truthy is done, this time I didn't forget to
> keep
> >>> logs:
> >>>
> >>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>
> >>> I'm a bit surprised that this time it was 8h faster than last time,
> >>> 31h
> >>> vs 39h. Not sure if a) there was something else on the server last
> >>> 

Re: Testing tdb2.xloader

2021-12-28 Thread Andy Seaborne

Excellent news!

Updated: https://www.w3.org/wiki/LargeTripleStores

Andy

On 28/12/2021 10:11, Marco Neumann wrote:

OK, here is another successful TDB2 load, this time with the full Wikidata
download (20211222_latest-all.nt.gz, 172G).

counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
entire load.

I think with the right hardware this could easily be time compressed quite
a bit.

http://www.lotico.com/temp/LOG-45497

Best,
Marco


On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann 
wrote:


Thank you Andy. found it in revisions somewhere

just finished another run with truthy

http://lotico.com/temp/LOG-1214

will now increase RAM before running an additional load with increased
thread count.

Marco

On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne  wrote:


gists are git repos: so the file is there ... somewhere:


https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log

  Andy

On 19/12/2021 17:56, Marco Neumann wrote:

Thank you Lorenz,
unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
github


On Sun, Dec 19, 2021 at 9:46 AM LB 
.invalid>

wrote:


I edited the Gist [1] and put the default stats there. Takes ~4min to
compute the stats.

Findings:

- for Wikidata we have to extend those stats with the stats for wdt:P31
property as Wikidata does use this property as their own rdf:type
relation. It is indeed trivial, just execute

select ?c (count(*) as ?cnt) {?s
 ?c} group by ?c

and convert it into the stats rule language (SSE) and put those rules
before the more generic rule

|( 98152611)|

- I didn't want to touch the stats script itself, but we could for
example also make this type relation generic and allow for others like
wdt:P31 or skos:subject via a commandline option which provides any URI
as the type relation with default being rdf:type - but that's for sure
probably overkill

- there is a bug in the stats script or file I guess, because of some
overflow? the count value is

(count -1983667112))

which indicates this.  I opened a ticket:
https://issues.apache.org/jira/browse/JENA-2225


[1]


https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3


On 18.12.21 11:35, Marco Neumann wrote:

good morning Lorenz,

Maybe time to get a few query benchmark tests? :)

What does tdb2.tdbstats report?

Marco


On Sat, Dec 18, 2021 at 8:09 AM LB 
.invalid>

wrote:


Good morning,

loading of Wikidata truthy is done, this time I didn't forget to keep
logs:


https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3


I'm a bit surprised that this time it was 8h faster than last time,

31h

vs 39h. Not sure if a) there was something else on the server last

time

(at least I couldn't see any running tasks) or b) if this is a
consequence of the more parallelized Unix sort now - I set it to
--parallel=16

I mean, the piped input stream is single threaded I guess, but maybe

the

sort merge step can benefit from more threads? I guess I have to

clean

up everything and run it again with the original setup with 2 Unix

sort

threads ...


On 16.12.21 14:48, Andy Seaborne wrote:


On 16/12/2021 10:52, Andy Seaborne wrote:
...


I am getting a slow down during data ingestion. However, your

summary

figures don't show that in the ingest phase. The whole logs may

have

the signal in it but less pronounced.

My working assumption is now that it is random access to the node
table. Your results point to it not being a CPU issue but that my
setup is saturating the I/O path. While the portable has a NVMe

SSD,

it has probably not got the same I/O bandwidth as a server class
machine.

I'm not sure what to do about this other than run with a much

bigger

node table cache for the ingestion phase. Substituting some file
mapper file area for bigger cache should be a win. While I hadn't
noticed before, it is probably visible in logs of smaller loads on
closer inspection. Experimenting on a small dataset is a lot

easier.

I'm more sure of this - not yet definite.

The nodeToNodeId cache is 200k -- this is on the load/update path.
Seems rather small for the task.

The nodeIdToNode cache is 1e6 -- this is the one that is hit by

SPARQL

results.

2 pieces of data will help:

Experimenting with very small cache settings.

Letting my slow load keep going to see if there is the same
characteristics at the index stage. There shouldn't be if

nodeToNodeId

is the cause; it's only an influence in the data ingestion step.

Aside : Increasing nodeToNodeId could also help tdbloader=parallel

and

maybe loader=phased. It falls into the same situation although the
improvement there is going to be less marked. "Parallel" saturates

the

I/O by other means as well.

   Andy












--


---
Marco Neumann
KONA






Re: Testing tdb2.xloader

2021-12-28 Thread Marco Neumann
OK, here is another successful TDB2 load, this time with the full Wikidata
download (20211222_latest-all.nt.gz, 172G).

counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
entire load.

I think with the right hardware this load time could easily be compressed
quite a bit.

http://www.lotico.com/temp/LOG-45497

Best,
Marco


On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann 
wrote:

> Thank you Andy. found it in revisions somewhere
>
> just finished another run with truthy
>
> http://lotico.com/temp/LOG-1214
>
> will now increase RAM before running an additional load with increased
> thread count.
>
> Marco
>
> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne  wrote:
>
>> gists are git repos: so the file is there ... somewhere:
>>
>>
>> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>>
>>  Andy
>>
>> On 19/12/2021 17:56, Marco Neumann wrote:
>> > Thank you Lorenz,
>> > unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
>> > github
>> >
>> >
>> > On Sun, Dec 19, 2021 at 9:46 AM LB > .invalid>
>> > wrote:
>> >
>> >> I edited the Gist [1] and put the default stats there. Takes ~4min to
>> >> compute the stats.
>> >>
>> >> Findings:
>> >>
>> >> - for Wikidata we have to extend those stats with the stats for wdt:P31
>> >> property as Wikidata does use this property as their own rdf:type
>> >> relation. It is indeed trivial, just execute
>> >>
>> >> select ?c (count(*) as ?cnt) {?s
>> >>  ?c} group by ?c
>> >>
>> >> and convert it into the stats rule language (SSE) and put those rules
>> >> before the more generic rule
>> >>
>> >> |( 98152611)|
>> >>
>> >> - I didn't want to touch the stats script itself, but we could for
>> >> example also make this type relation generic and allow for other like
>> >> wdt:P31 or skos:subject via a commandline option which provides any URI
>> >> as the type relation with default being rdf:type - but that's for sure
>> >> probably overkill
>> >>
>> >> - there is a bug in the stats script or file I guess, because of of
>> some
>> >> overflow? the count value is
>> >>
>> >> (count -1983667112))
>> >>
>> >> which indicates this.  I opened a ticket:
>> >> https://issues.apache.org/jira/browse/JENA-2225
>> >>
>> >>
>> >> [1]
>> >>
>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>> >>
>> >> On 18.12.21 11:35, Marco Neumann wrote:
>> >>> good morning Lorenz,
>> >>>
>> >>> Maybe time to get a few query bencharms tests? :)
>> >>>
>> >>> What does tdb2.tdbstats report?
>> >>>
>> >>> Marco
>> >>>
>> >>>
>> >>> On Sat, Dec 18, 2021 at 8:09 AM LB > .invalid>
>> >>> wrote:
>> >>>
>>  Good morning,
>> 
>>  loading of Wikidata truthy is done, this time I didn't forget to keep
>>  logs:
>> 
>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>> 
>>  I'm a bit surprised that this time it was 8h faster than last time,
>> 31h
>>  vs 39h. Not sure if a) there was something else on the server last
>> time
>>  (at least I couldn't see any running tasks) or b) if this is a
>>  consequence of the more parallelized Unix sort now - I set it to
>>  --parallel=16
>> 
>>  I mean, the piped input stream is single threaded I guess, but maybe
>> the
>>  sort merge step can benefit from more threads? I guess I have to
>> clean
>>  up everything and run it again with the original setup with 2 Unix
>> sort
>>  threads ...
>> 
>> 
>>  On 16.12.21 14:48, Andy Seaborne wrote:
>> >
>> > On 16/12/2021 10:52, Andy Seaborne wrote:
>> > ...
>> >
>> >> I am getting a slow down during data ingestion. However, your
>> summary
>> >> figures don't show that in the ingest phase. The whole logs may
>> have
>> >> the signal in it but less pronounced.
>> >>
>> >> My working assumption is now that it is random access to the node
>> >> table. Your results point to it not being a CPU issue but that my
>> >> setup is saturating the I/O path. While the portable has a NVMe
>> SSD,
>> >> it has probably not got the same I/O bandwidth as a server class
>> >> machine.
>> >>
>> >> I'm not sure what to do about this other than run with a much
>> bigger
>> >> node table cache for the ingestion phase. Substituting some file
>> >> mapper file area for bigger cache should be a win. While I hadn't
>> >> noticed before, it is probably visible in logs of smaller loads on
>> >> closer inspection. Experimenting on a small dataset is a lot
>> easier.
>> > I'm more sure of this - not yet definite.
>> >
>> > The nodeToNodeId cache is 200k -- this is on the load/update path.
>> > Seems rather small for the task.
>> >
>> > The nodeIdToNode cache is 1e6 -- this is the one that is hit by
>> SPARQL
>> > 

Re: Testing tdb2.xloader

2021-12-21 Thread Marco Neumann
Thank you Andy. Found it in the revisions somewhere.

just finished another run with truthy

http://lotico.com/temp/LOG-1214

will now increase RAM before running an additional load with increased
thread count.

Marco

On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne  wrote:

> gists are git repos: so the file is there ... somewhere:
>
>
> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>
>  Andy
>
> On 19/12/2021 17:56, Marco Neumann wrote:
> > Thank you Lorenz,
> > unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
> > github
> >
> >
> > On Sun, Dec 19, 2021 at 9:46 AM LB 
> > wrote:
> >
> >> I edited the Gist [1] and put the default stats there. Takes ~4min to
> >> compute the stats.
> >>
> >> Findings:
> >>
> >> - for Wikidata we have to extend those stats with the stats for wdt:P31
> >> property as Wikidata does use this property as their own rdf:type
> >> relation. It is indeed trivial, just execute
> >>
> >> select ?c (count(*) as ?cnt) {?s
> >>  ?c} group by ?c
> >>
> >> and convert it into the stats rule language (SSE) and put those rules
> >> before the more generic rule
> >>
> >> |( 98152611)|
> >>
> >> - I didn't want to touch the stats script itself, but we could for
> >> example also make this type relation generic and allow for other like
> >> wdt:P31 or skos:subject via a commandline option which provides any URI
> >> as the type relation with default being rdf:type - but that's for sure
> >> probably overkill
> >>
> >> - there is a bug in the stats script or file I guess, because of of some
> >> overflow? the count value is
> >>
> >> (count -1983667112))
> >>
> >> which indicates this.  I opened a ticket:
> >> https://issues.apache.org/jira/browse/JENA-2225
> >>
> >>
> >> [1]
> >> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>
> >> On 18.12.21 11:35, Marco Neumann wrote:
> >>> good morning Lorenz,
> >>>
> >>> Maybe time to get a few query bencharms tests? :)
> >>>
> >>> What does tdb2.tdbstats report?
> >>>
> >>> Marco
> >>>
> >>>
> >>> On Sat, Dec 18, 2021 at 8:09 AM LB  .invalid>
> >>> wrote:
> >>>
>  Good morning,
> 
>  loading of Wikidata truthy is done, this time I didn't forget to keep
>  logs:
> 
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> 
>  I'm a bit surprised that this time it was 8h faster than last time,
> 31h
>  vs 39h. Not sure if a) there was something else on the server last
> time
>  (at least I couldn't see any running tasks) or b) if this is a
>  consequence of the more parallelized Unix sort now - I set it to
>  --parallel=16
> 
>  I mean, the piped input stream is single threaded I guess, but maybe
> the
>  sort merge step can benefit from more threads? I guess I have to clean
>  up everything and run it again with the original setup with 2 Unix
> sort
>  threads ...
> 
> 
>  On 16.12.21 14:48, Andy Seaborne wrote:
> >
> > On 16/12/2021 10:52, Andy Seaborne wrote:
> > ...
> >
> >> I am getting a slow down during data ingestion. However, your
> summary
> >> figures don't show that in the ingest phase. The whole logs may have
> >> the signal in it but less pronounced.
> >>
> >> My working assumption is now that it is random access to the node
> >> table. Your results point to it not being a CPU issue but that my
> >> setup is saturating the I/O path. While the portable has a NVMe SSD,
> >> it has probably not got the same I/O bandwidth as a server class
> >> machine.
> >>
> >> I'm not sure what to do about this other than run with a much bigger
> >> node table cache for the ingestion phase. Substituting some file
> >> mapper file area for bigger cache should be a win. While I hadn't
> >> noticed before, it is probably visible in logs of smaller loads on
> >> closer inspection. Experimenting on a small dataset is a lot easier.
> > I'm more sure of this - not yet definite.
> >
> > The nodeToNodeId cache is 200k -- this is on the load/update path.
> > Seems rather small for the task.
> >
> > The nodeIdToNode cache is 1e6 -- this is the one that is hit by
> SPARQL
> > results.
> >
> > 2 pieces of data will help:
> >
> > Experimenting with very small cache settings.
> >
> > Letting my slow load keep going to see if there is the same
> > characteristics at the index stage. There shouldn't be if
> nodeToNodeId
> > is the cause; it's only an influence in the data ingestion step.
> >
> > Aside : Increasing nodeToNodeId could also help tdbloader=parallel
> and
> > maybe loader=phased. It falls into the same situation although the
> > improvement there is going to be less marked. 

Re: Testing tdb2.xloader

2021-12-21 Thread Andy Seaborne

gists are git repos: so the file is there ... somewhere:

https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log

Andy

On 19/12/2021 17:56, Marco Neumann wrote:

Thank you Lorenz,
unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
github


On Sun, Dec 19, 2021 at 9:46 AM LB 
wrote:


I edited the Gist [1] and put the default stats there. Takes ~4min to
compute the stats.

Findings:

- for Wikidata we have to extend those stats with the stats for wdt:P31
property as Wikidata does use this property as their own rdf:type
relation. It is indeed trivial, just execute

select ?c (count(*) as ?cnt) {?s
 ?c} group by ?c

and convert it into the stats rule language (SSE) and put those rules
before the more generic rule

|( 98152611)|

- I didn't want to touch the stats script itself, but we could for
example also make this type relation generic and allow for others like
wdt:P31 or skos:subject via a commandline option which provides any URI
as the type relation with default being rdf:type - but that's for sure
probably overkill

- there is a bug in the stats script or file I guess, because of some
overflow? the count value is

(count -1983667112))

which indicates this.  I opened a ticket:
https://issues.apache.org/jira/browse/JENA-2225


[1]
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3

On 18.12.21 11:35, Marco Neumann wrote:

good morning Lorenz,

Maybe time to get a few query benchmark tests? :)

What does tdb2.tdbstats report?

Marco


On Sat, Dec 18, 2021 at 8:09 AM LB 
wrote:


Good morning,

loading of Wikidata truthy is done, this time I didn't forget to keep
logs:
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3

I'm a bit surprised that this time it was 8h faster than last time, 31h
vs 39h. Not sure if a) there was something else on the server last time
(at least I couldn't see any running tasks) or b) if this is a
consequence of the more parallelized Unix sort now - I set it to
--parallel=16

I mean, the piped input stream is single threaded I guess, but maybe the
sort merge step can benefit from more threads? I guess I have to clean
up everything and run it again with the original setup with 2 Unix sort
threads ...


On 16.12.21 14:48, Andy Seaborne wrote:


On 16/12/2021 10:52, Andy Seaborne wrote:
...


I am getting a slow down during data ingestion. However, your summary
figures don't show that in the ingest phase. The whole logs may have
the signal in it but less pronounced.

My working assumption is now that it is random access to the node
table. Your results point to it not being a CPU issue but that my
setup is saturating the I/O path. While the portable has a NVMe SSD,
it has probably not got the same I/O bandwidth as a server class
machine.

I'm not sure what to do about this other than run with a much bigger
node table cache for the ingestion phase. Substituting some file
mapper file area for bigger cache should be a win. While I hadn't
noticed before, it is probably visible in logs of smaller loads on
closer inspection. Experimenting on a small dataset is a lot easier.

I'm more sure of this - not yet definite.

The nodeToNodeId cache is 200k -- this is on the load/update path.
Seems rather small for the task.

The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
results.

2 pieces of data will help:

Experimenting with very small cache settings.

Letting my slow load keep going to see if there is the same
characteristics at the index stage. There shouldn't be if nodeToNodeId
is the cause; it's only an influence in the data ingestion step.

Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
maybe loader=phased. It falls into the same situation although the
improvement there is going to be less marked. "Parallel" saturates the
I/O by other means as well.

  Andy









Re: Testing tdb2.xloader

2021-12-19 Thread Marco Neumann
Thank you Lorenz,
unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on
GitHub


On Sun, Dec 19, 2021 at 9:46 AM LB 
wrote:

> I edited the Gist [1] and put the default stats there. Takes ~4min to
> compute the stats.
>
> Findings:
>
> - for Wikidata we have to extend those stats with the stats for wdt:P31
> property as Wikidata does use this property as their own rdf:type
> relation. It is indeed trivial, just execute
>
> select ?c (count(*) as ?cnt) {?s
>  ?c} group by ?c
>
> and convert it into the stats rule language (SSE) and put those rules
> before the more generic rule
>
> |( 98152611)|
>
> - I didn't want to touch the stats script itself, but we could for
> example also make this type relation generic and allow for other like
> wdt:P31 or skos:subject via a commandline option which provides any URI
> as the type relation with default being rdf:type - but that's for sure
> probably overkill
>
> - there is a bug in the stats script or file I guess, because of of some
> overflow? the count value is
>
> (count -1983667112))
>
> which indicates this.  I opened a ticket:
> https://issues.apache.org/jira/browse/JENA-2225
>
>
> [1]
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> On 18.12.21 11:35, Marco Neumann wrote:
> > good morning Lorenz,
> >
> > Maybe time to get a few query bencharms tests? :)
> >
> > What does tdb2.tdbstats report?
> >
> > Marco
> >
> >
> > On Sat, Dec 18, 2021 at 8:09 AM LB 
> > wrote:
> >
> >> Good morning,
> >>
> >> loading of Wikidata truthy is done, this time I didn't forget to keep
> >> logs:
> >> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>
> >> I'm a bit surprised that this time it was 8h faster than last time, 31h
> >> vs 39h. Not sure if a) there was something else on the server last time
> >> (at least I couldn't see any running tasks) or b) if this is a
> >> consequence of the more parallelized Unix sort now - I set it to
> >> --parallel=16
> >>
> >> I mean, the piped input stream is single threaded I guess, but maybe the
> >> sort merge step can benefit from more threads? I guess I have to clean
> >> up everything and run it again with the original setup with 2 Unix sort
> >> threads ...
> >>
> >>
> >> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>
> >>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>> ...
> >>>
>  I am getting a slow down during data ingestion. However, your summary
>  figures don't show that in the ingest phase. The whole logs may have
>  the signal in it but less pronounced.
> 
>  My working assumption is now that it is random access to the node
>  table. Your results point to it not being a CPU issue but that my
>  setup is saturating the I/O path. While the portable has a NVMe SSD,
>  it has probably not got the same I/O bandwidth as a server class
>  machine.
> 
>  I'm not sure what to do about this other than run with a much bigger
>  node table cache for the ingestion phase. Substituting some file
>  mapper file area for bigger cache should be a win. While I hadn't
>  noticed before, it is probably visible in logs of smaller loads on
>  closer inspection. Experimenting on a small dataset is a lot easier.
> >>> I'm more sure of this - not yet definite.
> >>>
> >>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>> Seems rather small for the task.
> >>>
> >>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> >>> results.
> >>>
> >>> 2 pieces of data will help:
> >>>
> >>> Experimenting with very small cache settings.
> >>>
> >>> Letting my slow load keep going to see if there is the same
> >>> characteristics at the index stage. There shouldn't be if nodeToNodeId
> >>> is the cause; it's only an influence in the data ingestion step.
> >>>
> >>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
> >>> maybe loader=phased. It falls into the same situation although the
> >>> improvement there is going to be less marked. "Parallel" saturates the
> >>> I/O by other means as well.
> >>>
> >>>  Andy
> >
>


-- 


---
Marco Neumann
KONA


Re: Testing tdb2.xloader

2021-12-19 Thread Andy Seaborne



On 19/12/2021 09:46, LB wrote:
I edited the Gist [1] and put the default stats there. Takes ~4min to 
compute the stats.


Findings:

- for Wikidata we have to extend those stats with the stats for wdt:P31 
property as Wikidata does use this property as their own rdf:type 
relation. It is indeed trivial, just execute


select ?c (count(*) as ?cnt) {?s <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c


Hmm, that's going to have an impact on the default fixed reorder 
algorithm which has "rdf:type" built-in.


Probably another thread here - there are places where the design of 
ARQ/TDB could be improved for multi-billion triple datasets.


Several high-volume triple stores are read-only with 6 indexes for the 
graph. That gives two things: no need to support transactions, and any 
join can be a merge join or parallel merge join.


Having cursors into the B+Trees (iterators that can jump over sections 
of the scan to support SIP (sideways information passing)) is another area.


There's some mileage yet in single machine stores, mostly because 
hardware keeps getting better, at a cost, and persistent RAM will change 
things ("in 5 yars time").


But native cluster engines already have some advantages.

and convert it into the stats rule language (SSE) and put those rules 
before the more generic rule


|( 98152611)|

- I didn't want to touch the stats script itself, but we could for 
example also make this type relation generic and allow for others like 
wdt:P31 or skos:subject via a commandline option which provides any URI 
as the type relation with default being rdf:type - but that's for sure 
probably overkill


- there is a bug in the stats script or file I guess, because of some 
overflow? the count value is


(count -1983667112))

which indicates this.  I opened a ticket: 
https://issues.apache.org/jira/browse/JENA-2225


Thanks.




[1] https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3

On 18.12.21 11:35, Marco Neumann wrote:

good morning Lorenz,

Maybe time to get a few query benchmark tests? :)

What does tdb2.tdbstats report?

Marco


On Sat, Dec 18, 2021 at 8:09 AM LB 
wrote:


Good morning,

loading of Wikidata truthy is done, this time I didn't forget to keep
logs:
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3

I'm a bit surprised that this time it was 8h faster than last time, 31h
vs 39h. Not sure if a) there was something else on the server last time
(at least I couldn't see any running tasks) or b) if this is a
consequence of the more parallelized Unix sort now - I set it to
--parallel=16

I mean, the piped input stream is single threaded I guess, but maybe the
sort merge step can benefit from more threads? I guess I have to clean
up everything and run it again with the original setup with 2 Unix sort
threads ...


On 16.12.21 14:48, Andy Seaborne wrote:


On 16/12/2021 10:52, Andy Seaborne wrote:
...


I am getting a slow down during data ingestion. However, your summary
figures don't show that in the ingest phase. The whole logs may have
the signal in it but less pronounced.

My working assumption is now that it is random access to the node
table. Your results point to it not being a CPU issue but that my
setup is saturating the I/O path. While the portable has a NVMe SSD,
it has probably not got the same I/O bandwidth as a server class
machine.

I'm not sure what to do about this other than run with a much bigger
node table cache for the ingestion phase. Substituting some file
mapper file area for bigger cache should be a win. While I hadn't
noticed before, it is probably visible in logs of smaller loads on
closer inspection. Experimenting on a small dataset is a lot easier.

I'm more sure of this - not yet definite.

The nodeToNodeId cache is 200k -- this is on the load/update path.
Seems rather small for the task.

The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
results.

2 pieces of data will help:

Experimenting with very small cache settings.

Letting my slow load keep going to see if there is the same
characteristics at the index stage. There shouldn't be if nodeToNodeId
is the cause; it's only an influence in the data ingestion step.

Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
maybe loader=phased. It falls into the same situation although the
improvement there is going to be less marked. "Parallel" saturates the
I/O by other means as well.

 Andy






Re: Testing tdb2.xloader

2021-12-19 Thread Andy Seaborne

I've updated:

https://www.w3.org/wiki/LargeTripleStores#Apache_Jena_TDB_.286.6B.29

for Lorenz's first run.

Andy

On 16/12/2021 08:49, LB wrote:

39h 53m 27s
04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded = 0
04:14:28 INFO  Overall Rate 46.028 tuples per second


Re: Testing tdb2.xloader

2021-12-19 Thread LB
I edited the Gist [1] and put the default stats there. Takes ~4min to 
compute the stats.


Findings:

- for Wikidata we have to extend those stats with stats for the wdt:P31 
property, as Wikidata uses this property as its own rdf:type 
relation. It is indeed trivial - just execute (a runnable sketch follows 
this list):

select ?c (count(*) as ?cnt) {?s <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c


and convert it into the stats rule language (SSE) and put those rules 
before the more generic rule


|( 98152611)|

- I didn't want to touch the stats script itself, but we could for 
example also make this type relation generic and allow for others like 
wdt:P31 or skos:subject via a commandline option which provides any URI 
as the type relation with default being rdf:type - but that's for sure 
probably overkill


- there is a bug in the stats script or file I guess, because of some 
overflow? the count value is


(count -1983667112))

which indicates this.  I opened a ticket: 
https://issues.apache.org/jira/browse/JENA-2225
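
For reference, the query above can be run directly against the loaded database
with tdb2.tdbquery (a sketch; the query file name is made up, the --loc path is
the one from the load):

  # p31-counts.rq contains the SELECT query from the first point above
  tools/apache-jena-4.3.1/bin/tdb2.tdbquery --loc=datasets/wikidata-tdb --query=p31-counts.rq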



[1] https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3

On 18.12.21 11:35, Marco Neumann wrote:

good morning Lorenz,

Maybe time to get a few query benchmark tests? :)

What does tdb2.tdbstats report?

Marco


On Sat, Dec 18, 2021 at 8:09 AM LB 
wrote:


Good morning,

loading of Wikidata truthy is done, this time I didn't forget to keep
logs:
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3

I'm a bit surprised that this time it was 8h faster than last time, 31h
vs 39h. Not sure if a) there was something else on the server last time
(at least I couldn't see any running tasks) or b) if this is a
consequence of the more parallelized Unix sort now - I set it to
--parallel=16

I mean, the piped input stream is single threaded I guess, but maybe the
sort merge step can benefit from more threads? I guess I have to clean
up everything and run it again with the original setup with 2 Unix sort
threads ...


On 16.12.21 14:48, Andy Seaborne wrote:


On 16/12/2021 10:52, Andy Seaborne wrote:
...


I am getting a slow down during data ingestion. However, your summary
figures don't show that in the ingest phase. The whole logs may have
the signal in it but less pronounced.

My working assumption is now that it is random access to the node
table. Your results point to it not being a CPU issue but that my
setup is saturating the I/O path. While the portable has a NVMe SSD,
it has probably not got the same I/O bandwidth as a server class
machine.

I'm not sure what to do about this other than run with a much bigger
node table cache for the ingestion phase. Substituting some file
mapper file area for bigger cache should be a win. While I hadn't
noticed before, it is probably visible in logs of smaller loads on
closer inspection. Experimenting on a small dataset is a lot easier.

I'm more sure of this - not yet definite.

The nodeToNodeId cache is 200k -- this is on the load/update path.
Seems rather small for the task.

The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
results.

2 pieces of data will help:

Experimenting with very small cache settings.

Letting my slow load keep going to see if there is the same
characteristics at the index stage. There shouldn't be if nodeToNodeId
is the cause; it's only an influence in the data ingestion step.

Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
maybe loader=phased. It falls into the same situation although the
improvement there is going to be less marked. "Parallel" saturates the
I/O by other means as well.

 Andy




Re: Testing tdb2.xloader

2021-12-18 Thread Andy Seaborne

https://gist.github.com/afs/c97ebc7351478bce2989b79c9195ef11

Dell XPS13 (2021 edition)
32G RAM
4 core
1T SSD disk
Jena 4.3.1

Data:
wikidata-20211208-truthy-BETA.nt.gz

14:47:09 INFO  Load node table  = 39976 seconds
14:47:09 INFO  Load ingest data = 17 seconds
14:47:09 INFO  Build index SPO  = 32927 seconds
14:47:09 INFO  Build index POS  = 62628 seconds
14:47:09 INFO  Build index OSP  = 43457 seconds
14:47:09 INFO  Overall  278995 seconds
14:47:09 INFO  Overall  77h 29m 55s
14:47:09 INFO  Triples loaded   = 6,610,055,778
14:47:09 INFO  Quads loaded = 0
14:47:09 INFO  Overall Rate 23,692 tuples per second

Of note:

  Ingest time
  POS time

Haven't had the chance to look in detail yet.

Andy


Re: Testing tdb2.xloader

2021-12-18 Thread Andy Seaborne

Hi Lorenz,

On 18/12/2021 08:09, LB wrote:

Good morning,

loading of Wikidata truthy is done, this time I didn't forget to keep 
logs: 
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3


I'm a bit surprised that this time it was 8h faster than last time, 31h 
vs 39h.


Great!

Not sure if a) there was something else on the server last time 
(at least I couldn't see any running tasks) or b) if this is a 
consequence of the more parallelized Unix sort now - I set it to 
--parallel=16


I mean, the piped input stream is single threaded I guess, but maybe the 
sort merge step can benefit from more threads?


yes - the sorting itself can be more parallel on a machine the size of 
the one you have.


Time to add a configuration file, rather than a slew of command line 
arguments. The file also then acts as a record of the setup.



I'm finding a new characteristic:

Loading on a smaller machine (32G RAM), I think the index sorting is 
recombining temp files. That results in more I/O and higher peak disk 
usage. While POS is always slower, it appears to be very much slower 
than SPO.


The internet has not been very clear on the effect of "batch size" but 
the GNU man page talks about "--batch-size=16".  I get more than 16 temp 
files - you probably don't at this scale.


--batch-size=128 seems better -- unlikely to be a problem with the 
number of file descriptors nowadays. 16 is probably just how it always was.


On my machine: per process:

ulimit -Sn is 1024 -- ulimit current setting
ulimit -Hn is 1048576  -- ulimit max without being root.
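
For reference, the kind of sort invocation being discussed looks roughly like
this (illustrative only - not the exact command line xloader builds; paths and
sizes are made up):

  ulimit -Sn 4096     # raise the soft fd limit if merging many temp files

  # --parallel: sort threads; --batch-size: how many temp files to merge at
  # once (GNU default 16); -S: memory for in-core runs; -T: temp directory.
  sort --parallel=16 --batch-size=128 -S 50% -T /data/tmp/tdb \
       --compress-program=gzip -o sorted.txt input.txt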

I'll investigate when the load finishes. I'm trying not to touch the 
machine to avoid breaking something. It is currently doing OSP.


I guess I have to clean 
up everything and run it again with the original setup with 2 Unix sort 
threads ...


Andy



Re: Testing tdb2.xloader

2021-12-18 Thread Marco Neumann
good morning Lorenz,

Maybe time to get a few query benchmark tests? :)

What does tdb2.tdbstats report?

Marco


On Sat, Dec 18, 2021 at 8:09 AM LB 
wrote:

> Good morning,
>
> loading of Wikidata truthy is done, this time I didn't forget to keep
> logs:
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> I'm a bit surprised that this time it was 8h faster than last time, 31h
> vs 39h. Not sure if a) there was something else on the server last time
> (at least I couldn't see any running tasks) or b) if this is a
> consequence of the more parallelized Unix sort now - I set it to
> --parallel=16
>
> I mean, the piped input stream is single threaded I guess, but maybe the
> sort merge step can benefit from more threads? I guess I have to clean
> up everything and run it again with the original setup with 2 Unix sort
> threads ...
>
>
> On 16.12.21 14:48, Andy Seaborne wrote:
> >
> >
> > On 16/12/2021 10:52, Andy Seaborne wrote:
> > ...
> >
> >> I am getting a slow down during data ingestion. However, your summary
> >> figures don't show that in the ingest phase. The whole logs may have
> >> the signal in it but less pronounced.
> >>
> >> My working assumption is now that it is random access to the node
> >> table. Your results point to it not being a CPU issue but that my
> >> setup is saturating the I/O path. While the portable has a NVMe SSD,
> >> it has probably not got the same I/O bandwidth as a server class
> >> machine.
> >>
> >> I'm not sure what to do about this other than run with a much bigger
> >> node table cache for the ingestion phase. Substituting some file
> >> mapper file area for bigger cache should be a win. While I hadn't
> >> noticed before, it is probably visible in logs of smaller loads on
> >> closer inspection. Experimenting on a small dataset is a lot easier.
> >
> > I'm more sure of this - not yet definite.
> >
> > The nodeToNodeId cache is 200k -- this is on the load/update path.
> > Seems rather small for the task.
> >
> > The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> > results.
> >
> > 2 pieces of data will help:
> >
> > Experimenting with very small cache settings.
> >
> > Letting my slow load keep going to see if there is the same
> > characteristics at the index stage. There shouldn't be if nodeToNodeId
> > is the cause; it's only an influence in the data ingestion step.
> >
> > Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
> > maybe loader=phased. It falls into the same situation although the
> > improvement there is going to be less marked. "Parallel" saturates the
> > I/O by other means as well.
> >
> > Andy
>


-- 


---
Marco Neumann
KONA


Re: Testing tdb2.xloader

2021-12-18 Thread LB

Good morning,

loading of Wikidata truthy is done, this time I didn't forget to keep 
logs: 
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3


I'm a bit surprised that this time it was 8h faster than last time, 31h 
vs 39h. Not sure if a) there was something else on the server last time 
(at least I couldn't see any running tasks) or b) if this is a 
consequence of the more parallelized Unix sort now - I set it to 
--parallel=16


I mean, the piped input stream is single threaded I guess, but maybe the 
sort merge step can benefit from more threads? I guess I have to clean 
up everything and run it again with the original setup with 2 Unix sort 
threads ...



On 16.12.21 14:48, Andy Seaborne wrote:



On 16/12/2021 10:52, Andy Seaborne wrote:
...

I am getting a slow down during data ingestion. However, your summary 
figures don't show that in the ingest phase. The whole logs may have 
the signal in it but less pronounced.


My working assumption is now that it is random access to the node 
table. Your results point to it not being a CPU issue but that my 
setup is saturating the I/O path. While the portable has a NVMe SSD, 
it has probably not got the same I/O bandwidth as a server class 
machine.


I'm not sure what to do about this other than run with a much bigger 
node table cache for the ingestion phase. Substituting some file 
mapper file area for bigger cache should be a win. While I hadn't 
noticed before, it is probably visible in logs of smaller loads on 
closer inspection. Experimenting on a small dataset is a lot easier.


I'm more sure of this - not yet definite.

The nodeToNodeId cache is 200k -- this is on the load/update path. 
Seems rather small for the task.


The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL 
results.


2 pieces of data will help:

Experimenting with very small cache settings.

Letting my slow load keep going to see if there is the same 
characteristics at the index stage. There shouldn't be if nodeToNodeId 
is the cause; it's only an influence in the data ingestion step.


Aside : Increasing nodeToNodeId could also help tdbloader=parallel and 
maybe loader=phased. It falls into the same situation although the 
improvement there is going to be less marked. "Parallel" saturates the 
I/O by other means as well.


    Andy


Re: Testing tdb2.xloader

2021-12-17 Thread Andy Seaborne




On 16/12/2021 10:08, Marco Neumann wrote:

thank you Lorenz, I am running this test myself now again with a larger
disk. You may want to consider running a full load of wikidata as well. The
timing info and disk space you have should be sufficient.

Full Wikidata (WD).

I've tried to gather a summary as of 2021-12.
The video [2] is most up-to-date at the time of writing.

WD about 16B currently.
It's growing at 1B/3 months [1]

The query service is 6 active machines (11 in total)
   58 queries per server per second average (no figures for peak)
   12 wikidata updates/330 triples per second  (no figures for peak)

It seems this mixed workload is causing the most pain.

Wikidata will want at least a 5 year strategy, and planning for 10 years 
isn't unreasonable. The service is 6 years old.


That's 36B and 56B as a baseline *minimum* (16B today plus roughly 4B/year 
over 5 and 10 years).
That's without new streams of data.

They can't design for a minimum. "plan for success"
   --> towards 100B triples.

Andy

[1] 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy


[2] Scaling the Wikidata Query Service
https://www.youtube.com/watch?v=oV4qelj9fxM


Re: Testing tdb2.xloader

2021-12-16 Thread Andy Seaborne




On 16/12/2021 10:52, Andy Seaborne wrote:
...

I am getting a slow down during data ingestion. However, your summary 
figures don't show that in the ingest phase. The whole logs may have the 
signal in it but less pronounced.


My working assumption is now that it is random access to the node table. 
Your results point to it not being a CPU issue but to my setup 
saturating the I/O path. While the portable has an NVMe SSD, it has 
probably not got the same I/O bandwidth as a server-class machine.


I'm not sure what to do about this other than run with a much bigger 
node table cache for the ingestion phase. Substituting some of the 
mapped-file area for a bigger cache should be a win. While I hadn't 
noticed before, it is probably visible in logs of smaller loads on 
closer inspection. Experimenting on a small dataset is a lot easier.


I'm more sure of this - not yet definite.

The nodeToNodeId cache is 200k -- this is on the load/update path. Seems 
rather small for the task.


The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL 
results.


2 pieces of data will help:

Experimenting with very small cache settings.

Letting my slow load keep going to see if the same characteristics 
appear at the index stage. They shouldn't, if nodeToNodeId 
is the cause; it's only an influence in the data ingestion step.


Aside : Increasing nodeToNodeId could also help tdbloader=parallel and 
maybe loader=phased. It falls into the same situation although the 
improvement there is going to be less marked. "Parallel" saturates the 
I/O by other means as well.


Andy


Re: Testing tdb2.xloader

2021-12-16 Thread Andy Seaborne

On 16/12/2021 12:32, LB wrote:
I couldn't get access to the full log as the output was too verbose for 
the screen and I forgot to pipe into a file ...


Yes - familiar ...

Maybe xloader should capture its logging.
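
In the meantime, redirecting the output by hand works (a sketch; the log file
name is arbitrary):

  tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb \
      --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2 2>&1 | tee xloader.log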



I can confirm the triples.tmp.gz size was something around 35-40G if I 
remember correctly.


It leaves the file behind.



I rerun the load now to a) keep logs and b) see if increasing the number 
of threads for parallel sort does change anything (though I don't think 
so).


a useful data point.



In ~40h I will provide an update and share the logs.


Wikidata full is 143GB BZip2 compressed, I'm wondering how large TDB 
would be on disk in that case. We have ~4 x the size of the truthy dump, 
both compressed, I'd guess we get something between 1.8T and 2.3T - 
clearly, the number of nodes shouldn't increase linearly as many 
statements are still about nodes from truthy dump. On the other hand, 
Wikidata has this statement about statement stuff, i.e. we have lots of 
statement identifiers.


Once I've time and enough resources on the server, I'll give it a try. 
Now I'm interested to see the final result


The indexes and the triples.tmp scale with number of triples.

I haven't parsed "all" to make sure it's parseable for a load. Given the 
WD data model, most of the dubious stuff is in truthy but there is 
always a chance of stray rubbish creeping in.





On 16.12.21 11:52, Andy Seaborne wrote:

Awesome!
I'm really pleased to hear the news.

That's better than I feared at this scale!

How big is triples.tmp.gz? 2* that size, and the database size is the 
peak storage space used. My estimate is about 40G making 604G overall.


I'd appreciate having the whole log file. Could you email it to me?

Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a 
modern portable with 4 cores and a single notional 1TB SSD. If the 
estimate is right, it will fit. More good news.



I am getting a slow down during data ingestion. However, your summary 
figures don't show that in the ingest phase. The whole logs may have 
the signal in it but less pronounced.


My working assumption is now that it is random access to the node 
table. Your results point to it not being a CPU issue but that my 
setup is saturating the I/O path. While the portable has a NVMe SSD, 
it has probably not got the same I/O bandwidth as a server class machine.


I'm not sure what to do about this other than run with a much bigger 
node table cache for the ingestion phase. Substituting some file 
mapper file area for bigger cache should be a win. While I hadn't 
noticed before, it is probably visible in logs of smaller loads on 
closer inspection. Experimenting on a small dataset is a lot easier.



I'm also watching the CPU temperature. When the graphics aren't 
active, the fans aren't even on. After a few minutes of active screen 
the fans spin up but the temperatures are still well within the 
limits. The machine is raised up by 1cm to give good airflow. And I 
keep the door shut to keep the cats away.


    Andy

Inline ...

On 16/12/2021 08:49, LB wrote:

Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD


Nice.


Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:

tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb 
--loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2


I found .gz to be slightly faster than .bz2. This maybe because .gz is 
better supported by the Java runtime or just the fact bz2 is designed 
for best compression.





Log summary:


04:14:28 INFO  Load node table  = 36600 seconds
04:14:28 INFO  Load ingest data = 25811 seconds
04:14:28 INFO  Build index SPO  = 20688 seconds
04:14:28 INFO  Build index POS  = 35466 seconds
04:14:28 INFO  Build index OSP  = 25042 seconds
04:14:28 INFO  Overall  143607 seconds
04:14:28 INFO  Overall  39h 53m 27s


Less than 2 days :-)


04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded = 0
04:14:28 INFO  Overall Rate 46.028 tuples per second



Disk space usage according to


du -sh datasets/wikidata-tdb


  is


524G    datasets/wikidata-tdb





During loading I could see ~90GB of RAM occupied (50% of total memory 
got to sort and it used 2 threads - is it intended to stick to 2 
threads with --parallel 2?)


It is fixed at two for the sort currently.

There may be some benefit in making this configurable but previously 
I've found that more threads does not seem to yield much benefit 
though you have a lot more threads! Experiment required.





Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:

Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the 
wikidata truthy

dump downloaded 2021-12-09.


This is 

Re: Testing tdb2.xloader

2021-12-16 Thread LB
I couldn't get access to the full log as the output was too verbose for 
the screen and I forgot to pipe into a file ...


I can confirm the triples.tmp.gz size was something around 35-40G if I 
remember correctly.


I'm rerunning the load now to a) keep logs and b) see if increasing the number 
of threads for the parallel sort changes anything (though I don't think so).


In ~40h I will provide an update and share the logs.


Wikidata full is 143GB BZip2 compressed; I'm wondering how large TDB 
would be on disk in that case. At ~4x the size of the truthy dump 
(both compressed), I'd guess we get something between 1.8T and 2.3T - 
clearly, the number of nodes shouldn't increase linearly, as many 
statements are still about nodes from the truthy dump. On the other hand, 
Wikidata has this statement-about-statement stuff, i.e. we have lots of 
statement identifiers.


Once I've time and enough resources on the server, I'll give it a try. 
Now I'm interested to see the final result



On 16.12.21 11:52, Andy Seaborne wrote:

Awesome!
I'm really pleased to hear the news.

That's better than I feared at this scale!

How big is triples.tmp.gz? 2* that size, and the database size is the 
peak storage space used. My estimate is about 40G making 604G overall.


I'd appreciate having the whole log file. Could you email it to me?

Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a 
modern portable with 4 cores and a single notional 1TB SSD. If the 
estimate is right, it will fit. More good news.



I am getting a slow down during data ingestion. However, your summary 
figures don't show that in the ingest phase. The whole logs may have 
the signal in it but less pronounced.


My working assumption is now that it is random access to the node 
table. Your results point to it not being a CPU issue but that my 
setup is saturating the I/O path. While the portable has a NVMe SSD, 
it has probably not got the same I/O bandwidth as a server class machine.


I'm not sure what to do about this other than run with a much bigger 
node table cache for the ingestion phase. Substituting some file 
mapper file area for bigger cache should be a win. While I hadn't 
noticed before, it is probably visible in logs of smaller loads on 
closer inspection. Experimenting on a small dataset is a lot easier.



I'm also watching the CPU temperature. When the graphics aren't 
active, the fans aren't even on. After a few minutes of active screen 
the fans spin up but the temperatures are still well within the 
limits. The machine is raised up by 1cm to give good airflow. And I 
keep the door shut to keep the cats away.


    Andy

Inline ...

On 16/12/2021 08:49, LB wrote:

Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD


Nice.


Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:

tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb 
--loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2


I found .gz to be slightly faster than .bz2. This maybe because .gz is 
better supported by the Java runtime or just the fact bz2 is designed 
for best compression.





Log summary:


04:14:28 INFO  Load node table  = 36600 seconds
04:14:28 INFO  Load ingest data = 25811 seconds
04:14:28 INFO  Build index SPO  = 20688 seconds
04:14:28 INFO  Build index POS  = 35466 seconds
04:14:28 INFO  Build index OSP  = 25042 seconds
04:14:28 INFO  Overall  143607 seconds
04:14:28 INFO  Overall  39h 53m 27s


Less than 2 days :-)


04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded = 0
04:14:28 INFO  Overall Rate 46.028 tuples per second



Disk space usage according to


du -sh datasets/wikidata-tdb


  is


524G    datasets/wikidata-tdb





During loading I could see ~90GB of RAM occupied (50% of total memory 
got to sort and it used 2 threads - is it intended to stick to 2 
threads with --parallel 2?)


It is fixed at two for the sort currently.

There may be some benefit in making this configurable but previously 
I've found that more threads does not seem to yield much benefit 
though you have a lot more threads! Experiment required.





Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:

Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the 
wikidata truthy

dump downloaded 2021-12-09.


This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out, 
including using less temporary space, the development version got 
merged in. It has had some testing.


It compresses the triples.tmp and intermediate sort files in the 
index stage making the peak usage much smaller.


The instance is a vm created on the Norwegian Research and 
Education Cloud,

an openstack cloud 

Re: Testing tdb2.xloader

2021-12-16 Thread Andy Seaborne

Awesome!
I'm really pleased to hear the news.

That's better than I feared at this scale!

How big is triples.tmp.gz? Twice that size, plus the database size, is the 
peak storage space used. My estimate is about 40G, making 2*40G + 524G ≈ 604G overall.


I'd appreciate having the whole log file. Could you email it to me?

Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a 
modern portable with 4 cores and a single notional 1TB SSD. If the 
estimate is right, it will fit. More good news.



I am getting a slowdown during data ingestion. However, your summary 
figures don't show that in the ingest phase. The whole logs may have the 
signal in them, but less pronounced.


My working assumption is now that it is random access to the node table. 
Your results point to it not being a CPU issue but to my setup 
saturating the I/O path. While the portable has an NVMe SSD, it 
probably does not have the same I/O bandwidth as a server-class machine.


I'm not sure what to do about this other than run with a much bigger 
node table cache for the ingestion phase. Giving up some memory-mapped 
file area in exchange for a bigger cache should be a win. While I hadn't 
noticed it before, it is probably visible in logs of smaller loads on 
closer inspection. Experimenting on a small dataset is a lot easier.



I'm also watching the CPU temperature. When the graphics aren't active, 
the fans aren't even on. After a few minutes of active screen the fans 
spin up but the temperatures are still well within the limits. The 
machine is raised up by 1cm to give good airflow. And I keep the door 
shut to keep the cats away.


Andy

Inline ...

On 16/12/2021 08:49, LB wrote:

Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD


Nice.


Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:

tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc 
datasets/wikidata-tdb datasets/latest-truthy.nt.bz2


I found .gz to be slightly faster than .bz2. This may be because .gz is 
better supported by the Java runtime, or just the fact that bz2 is 
designed for best compression.





Log summary:


04:14:28 INFO  Load node table  = 36600 seconds
04:14:28 INFO  Load ingest data = 25811 seconds
04:14:28 INFO  Build index SPO  = 20688 seconds
04:14:28 INFO  Build index POS  = 35466 seconds
04:14:28 INFO  Build index OSP  = 25042 seconds
04:14:28 INFO  Overall  143607 seconds
04:14:28 INFO  Overall  39h 53m 27s


Less than 2 days :-)


04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded = 0
04:14:28 INFO  Overall Rate 46.028 tuples per second



Disk space usage according to


du -sh datasets/wikidata-tdb


  is


524G    datasets/wikidata-tdb





During loading I could see ~90GB of RAM occupied (50% of total memory 
went to sort and it used 2 threads - is it intended to stick to 2 threads 
with --parallel 2?)


It is fixed at two for the sort currently.

There may be some benefit in making this configurable, but previously 
I've found that more threads do not seem to yield much benefit, though 
you have a lot more threads! Experiment required.





Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:

Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the wikidata 
truthy

dump downloaded 2021-12-09.


This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out anyway, 
the development version, which among other things uses less temporary 
space, got merged in. It has had some testing.


It compresses the triples.tmp and intermediate sort files in the index 
stage making the peak usage much smaller.


The instance is a vm created on the Norwegian Research and Education 
Cloud,

an openstack cloud provider.

Instance type:
32 GB memory
4 CPU


I'm using something similar on a 7-year-old desktop machine, with a SATA disk.

I haven't got a machine I can dedicate to the multi-day load. I'll try 
to find a way to at least push it through building the node table.


Loading the first 1B of truthy:

1B triples , 40kTPS , 06h 54m 10s

The database is 81G and building needs an additional 11.6G for workspace, 
for a total of 92G (+ the data file).


While smaller, it seems bz2 files are much slower to decompress, so 
I've been using gz files.


My current best guess for 6.4B truthy is

Temp    96G
Database   540G
Data    48G
Total: 684G  -- peak disk needed

based on scaling up 1B truthy. Personally, I would make sure there was 
more space. Also - I don't know if the shape of the data is 
sufficiently uniform to make scaling predictable.  The time doesn't 
scale so simply.


This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.

Compression reduces the size of triples.tmp -- the related sort 
temporary files which add 

Re: Testing tdb2.xloader

2021-12-16 Thread Marco Neumann
thank you Lorenz, I am now running this test again myself with a larger
disk. You may want to consider running a full load of wikidata as well. The
timing info and disk space you have should be sufficient.

Did we figure out a place to post the parser messages?

Marco


On Thu, Dec 16, 2021 at 10:01 AM LB 
wrote:

> Sure
>
> > wikidata-tdb/Data-0001:
> > total 524G
> > -rw-r--r-- 1   24 Dez 15 05:41 GOSP.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 GPOS.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 GPU.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 GSPO.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.idn
> > -rw-r--r-- 1    0 Dez 15 05:41 journal.jrnl
> > -rw-r--r-- 1   24 Dez 15 05:41 nodes.bpt
> > -rw-r--r-- 1  36G Dez 15 05:41 nodes.dat
> > -rw-r--r-- 1   16 Dez 15 05:41 nodes-data.bdf
> > -rw-r--r-- 1  44G Dez 15 05:41 nodes-data.obj
> > -rw-r--r-- 1 312M Dez 15 05:41 nodes.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 OSP.bpt
> > -rw-r--r-- 1 148G Dez 16 04:14 OSP.dat
> > -rw-r--r-- 1   24 Dez 15 05:41 OSPG.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.idn
> > -rw-r--r-- 1 528M Dez 16 04:14 OSP.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 POS.bpt
> > -rw-r--r-- 1 148G Dez 15 21:17 POS.dat
> > -rw-r--r-- 1   24 Dez 15 05:41 POSG.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.idn
> > -rw-r--r-- 1 528M Dez 15 21:17 POS.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 prefixes.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.dat
> > -rw-r--r-- 1   16 Dez 15 05:41 prefixes-data.bdf
> > -rw-r--r-- 1    0 Dez 14 12:21 prefixes-data.obj
> > -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 SPO.bpt
> > -rw-r--r-- 1 148G Dez 15 11:25 SPO.dat
> > -rw-r--r-- 1   24 Dez 15 05:41 SPOG.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.idn
> > -rw-r--r-- 1 528M Dez 15 11:25 SPO.idn
> > -rw-r--r-- 1    8 Dez 15 21:17 tdb.lock
>
> On 16.12.21 10:27, Marco Neumann wrote:
> > Thank you Lorenz, can you please post a directory list for Data-0001 with
> > file sizes.
> >
> >
> > On Thu, Dec 16, 2021 at 8:49 AM LB 
> > wrote:
> >
> >> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
> >>
> >> Server:
> >>
> >> AMD Ryzen 9 5950X  (16C/32T)
> >> 128 GB DDR4 ECC RAM
> >> 2 x 3.84 TB NVMe SSD
> >>
> >>
> >> Environment:
> >>
> >> - Ubuntu 20.04.3 LTS
> >> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> >> - Jena 4.3.1
> >>
> >>
> >> Command:
> >>
> >>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
> >>> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
> >>
> >> Log summary:
> >>
> >>> 04:14:28 INFO  Load node table  = 36600 seconds
> >>> 04:14:28 INFO  Load ingest data = 25811 seconds
> >>> 04:14:28 INFO  Build index SPO  = 20688 seconds
> >>> 04:14:28 INFO  Build index POS  = 35466 seconds
> >>> 04:14:28 INFO  Build index OSP  = 25042 seconds
> >>> 04:14:28 INFO  Overall  143607 seconds
> >>> 04:14:28 INFO  Overall  39h 53m 27s
> >>> 04:14:28 INFO  Triples loaded   = 6.610.055.778
> >>> 04:14:28 INFO  Quads loaded = 0
> >>> 04:14:28 INFO  Overall Rate 46.028 tuples per second
> >>
> >> Disk space usage according to
> >>
> >>> du -sh datasets/wikidata-tdb
> >>   is
> >>
> >>> 524G    datasets/wikidata-tdb
> >> During loading I could see ~90GB of RAM occupied (50% of total memory
> >> got to sort and it used 2 threads - is it intended to stick to 2 threads
> >> with --parallel 2?)
> >>
> >>
> >> Cheers,
> >> Lorenz
> >>
> >>
> >> On 12.12.21 13:07, Andy Seaborne wrote:
> >>> Hi, Øyvind,
> >>>
> >>> This is all very helpful feedback. Thank you.
> >>>
> >>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>  I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>  truthy
>  dump downloaded 2021-12-09.
> >>> This is the 4.3.0 xloader?
> >>>
> >>> There are improvements in 4.3.1. Since that release was going out,
> >>> including using less temporary space, the development version got
> >>> merged in. It has had some testing.
> >>>
> >>> It compresses the triples.tmp and intermediate sort files in the index
> >>> stage making the peak usage much smaller.
> >>>
>  The instance is a vm created on the Norwegian Research and Education
>  Cloud,
>  an openstack cloud provider.
> 
>  Instance type:
>  32 GB memory
>  4 CPU
> >>> I using similar on a 7 year old desktop machine, SATA disk.
> >>>
> >>> I haven't got a machine I can dedicate to the multi-day load. I'll try
> >>> to find a way to at least push it though building the node table.
> >>>
> >>> Loading the first 1B of truthy:
> >>>
> >>> 

Re: Testing tdb2.xloader

2021-12-16 Thread LB

Sure


wikidata-tdb/Data-0001:
total 524G
-rw-r--r-- 1   24 Dez 15 05:41 GOSP.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
-rw-r--r-- 1   24 Dez 15 05:41 GPOS.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
-rw-r--r-- 1   24 Dez 15 05:41 GPU.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GPU.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GPU.idn
-rw-r--r-- 1   24 Dez 15 05:41 GSPO.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.idn
-rw-r--r-- 1    0 Dez 15 05:41 journal.jrnl
-rw-r--r-- 1   24 Dez 15 05:41 nodes.bpt
-rw-r--r-- 1  36G Dez 15 05:41 nodes.dat
-rw-r--r-- 1   16 Dez 15 05:41 nodes-data.bdf
-rw-r--r-- 1  44G Dez 15 05:41 nodes-data.obj
-rw-r--r-- 1 312M Dez 15 05:41 nodes.idn
-rw-r--r-- 1   24 Dez 15 05:41 OSP.bpt
-rw-r--r-- 1 148G Dez 16 04:14 OSP.dat
-rw-r--r-- 1   24 Dez 15 05:41 OSPG.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.idn
-rw-r--r-- 1 528M Dez 16 04:14 OSP.idn
-rw-r--r-- 1   24 Dez 15 05:41 POS.bpt
-rw-r--r-- 1 148G Dez 15 21:17 POS.dat
-rw-r--r-- 1   24 Dez 15 05:41 POSG.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 POSG.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 POSG.idn
-rw-r--r-- 1 528M Dez 15 21:17 POS.idn
-rw-r--r-- 1   24 Dez 15 05:41 prefixes.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.dat
-rw-r--r-- 1   16 Dez 15 05:41 prefixes-data.bdf
-rw-r--r-- 1    0 Dez 14 12:21 prefixes-data.obj
-rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.idn
-rw-r--r-- 1   24 Dez 15 05:41 SPO.bpt
-rw-r--r-- 1 148G Dez 15 11:25 SPO.dat
-rw-r--r-- 1   24 Dez 15 05:41 SPOG.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.idn
-rw-r--r-- 1 528M Dez 15 11:25 SPO.idn
-rw-r--r-- 1    8 Dez 15 21:17 tdb.lock


On 16.12.21 10:27, Marco Neumann wrote:

Thank you Lorenz, can you please post a directory list for Data-0001 with
file sizes.


On Thu, Dec 16, 2021 at 8:49 AM LB 
wrote:


Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD


Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:


tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
datasets/wikidata-tdb datasets/latest-truthy.nt.bz2


Log summary:


04:14:28 INFO  Load node table  = 36600 seconds
04:14:28 INFO  Load ingest data = 25811 seconds
04:14:28 INFO  Build index SPO  = 20688 seconds
04:14:28 INFO  Build index POS  = 35466 seconds
04:14:28 INFO  Build index OSP  = 25042 seconds
04:14:28 INFO  Overall  143607 seconds
04:14:28 INFO  Overall  39h 53m 27s
04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded = 0
04:14:28 INFO  Overall Rate 46.028 tuples per second


Disk space usage according to


du -sh datasets/wikidata-tdb

   is


524G    datasets/wikidata-tdb

During loading I could see ~90GB of RAM occupied (50% of total memory
went to sort and it used 2 threads - is it intended to stick to 2 threads
with --parallel 2?)


Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:

Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:

I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
truthy
dump downloaded 2021-12-09.

This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out,
including using less temporary space, the development version got
merged in. It has had some testing.

It compresses the triples.tmp and intermediate sort files in the index
stage making the peak usage much smaller.


The instance is a vm created on the Norwegian Research and Education
Cloud,
an openstack cloud provider.

Instance type:
32 GB memory
4 CPU

I'm using something similar on a 7-year-old desktop machine, with a SATA disk.

I haven't got a machine I can dedicate to the multi-day load. I'll try
to find a way to at least push it through building the node table.

Loading the first 1B of truthy:

1B triples , 40kTPS , 06h 54m 10s

The database is 81G and building needs an additional 11.6G for workspace,
for a total of 92G (+ the data file).

While smaller, it seems bz2 files are much slower to decompress, so
I've been using gz files.

My current best guess for 6.4B truthy is

Temp    96G
Database   540G
Data    48G
Total: 684G  -- peak disk needed

based on scaling up 1B truthy. Personally, I would make sure there was
more space. Also - I don't know if the shape of the data is
sufficiently uniform to make scaling predictable.  The time doesn't
scale so simply.

This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.

Compression reduces the size of triples.tmp -- and of the related sort
temporary files, which add up to the same again -- to 1/6 of the size.


The storage used for dump + temp files  is mounted as a separate  900GB
volume and is mounted on /var/fuseki/databases
.The type of storage is described as

   

Re: Testing tdb2.xloader

2021-12-16 Thread Marco Neumann
Thank you Lorenz, can you please post a directory list for Data-0001 with
file sizes.


On Thu, Dec 16, 2021 at 8:49 AM LB 
wrote:

> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>
> Server:
>
> AMD Ryzen 9 5950X  (16C/32T)
> 128 GB DDR4 ECC RAM
> 2 x 3.84 TB NVMe SSD
>
>
> Environment:
>
> - Ubuntu 20.04.3 LTS
> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> - Jena 4.3.1
>
>
> Command:
>
> > tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
> > datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>
>
> Log summary:
>
> > 04:14:28 INFO  Load node table  = 36600 seconds
> > 04:14:28 INFO  Load ingest data = 25811 seconds
> > 04:14:28 INFO  Build index SPO  = 20688 seconds
> > 04:14:28 INFO  Build index POS  = 35466 seconds
> > 04:14:28 INFO  Build index OSP  = 25042 seconds
> > 04:14:28 INFO  Overall  143607 seconds
> > 04:14:28 INFO  Overall  39h 53m 27s
> > 04:14:28 INFO  Triples loaded   = 6.610.055.778
> > 04:14:28 INFO  Quads loaded = 0
> > 04:14:28 INFO  Overall Rate 46.028 tuples per second
>
>
> Disk space usage according to
>
> > du -sh datasets/wikidata-tdb
>
>   is
>
> > 524G    datasets/wikidata-tdb
>
> During loading I could see ~90GB of RAM occupied (50% of total memory
> got to sort and it used 2 threads - is it intended to stick to 2 threads
> with --parallel 2?)
>
>
> Cheers,
> Lorenz
>
>
> On 12.12.21 13:07, Andy Seaborne wrote:
> > Hi, Øyvind,
> >
> > This is all very helpful feedback. Thank you.
> >
> > On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >> truthy
> >> dump downloaded 2021-12-09.
> >
> > This is the 4.3.0 xloader?
> >
> > There are improvements in 4.3.1. Since that release was going out,
> > including using less temporary space, the development version got
> > merged in. It has had some testing.
> >
> > It compresses the triples.tmp and intermediate sort files in the index
> > stage making the peak usage much smaller.
> >
> >> The instance is a vm created on the Norwegian Research and Education
> >> Cloud,
> >> an openstack cloud provider.
> >>
> >> Instance type:
> >> 32 GB memory
> >> 4 CPU
> >
> > I using similar on a 7 year old desktop machine, SATA disk.
> >
> > I haven't got a machine I can dedicate to the multi-day load. I'll try
> > to find a way to at least push it though building the node table.
> >
> > Loading the first 1B of truthy:
> >
> > 1B triples , 40kTPS , 06h 54m 10s
> >
> > The database is 81G and building needs an addition 11.6G for workspace
> > for a total of 92G (+ the data file).
> >
> > While smaller, its seems bz2 files are much slower to decompress so
> > I've been using gz files.
> >
> > My current best guess for 6.4B truthy is
> >
> > Temp    96G
> > Database   540G
> > Data    48G
> > Total: 684G  -- peak disk needed
> >
> > based on scaling up 1B truthy. Personally, I would make sure there was
> > more space. Also - I don't know if the shape of the data is
> > sufficiently uniform to make scaling predictable.  The time doesn't
> > scale so simply.
> >
> > This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
> >
> > Compression reduces the size of triples.tmp -- the related sort
> > temporary files which add up to the same again -- 1/6 of the size.
> >
> >> The storage used for dump + temp files  is mounted as a separate  900GB
> >> volume and is mounted on /var/fuseki/databases
> >> .The type of storage is described as
> >>>   *mass-storage-default*: Storage backed by spinning hard drives,
> >> available to everybody and is the default type.
> >> with ext4 configured. At the moment I don't have access to the faster
> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
> >> can be
> >> overcommitted.
> >
> > "overcommitted" may be a problem.
> >
> > While it's not "tdb2 loader parallel" it does use a continuous CPU in
> > several threads.
> >
> > For memory - "it's complicated".
> >
> > The java parts only need say 2G. The sort is set to "buffer 50%
> > --parallel=2" and the java pipes into sort, that's another thread. I
> > think the effective peak is 3 active threads and they'll all be at
> > 100% for some of the time.
> >
> > So it's going to need 50% of RAM + 2G for a java proces, +OS.
> >
> > It does not need space for memory mapped files (they aren't used at
> > all in the loading process and I/O is sequential.
> >
> > If that triggers over commitment swap out, the performance may go down
> > a lot.
> >
> > For disk - if that is physically remote, it should not a problem
> > (famous last words). I/O is sequential and in large continuous chunks
> > - typical for batch processing jobs.
> >
> >> OS for the instance is a clean Rocky Linux image, with no services
> >> except
> >> jena/fuseki installed. The systemd service
> >
> >  set up for fuseki is stopped.
> >> jena and fuseki version is 4.3.0.
> >>
> >> openjdk 11.0.13 2021-10-19 

Re: Testing tdb2.xloader

2021-12-16 Thread LB

Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD


Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:

tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc 
datasets/wikidata-tdb datasets/latest-truthy.nt.bz2



Log summary:


04:14:28 INFO  Load node table  = 36600 seconds
04:14:28 INFO  Load ingest data = 25811 seconds
04:14:28 INFO  Build index SPO  = 20688 seconds
04:14:28 INFO  Build index POS  = 35466 seconds
04:14:28 INFO  Build index OSP  = 25042 seconds
04:14:28 INFO  Overall  143607 seconds
04:14:28 INFO  Overall  39h 53m 27s
04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded = 0
04:14:28 INFO  Overall Rate 46.028 tuples per second



Disk space usage according to


du -sh datasets/wikidata-tdb


 is


524G    datasets/wikidata-tdb


During loading I could see ~90GB of RAM occupied (50% of total memory 
went to sort and it used 2 threads - is it intended to stick to 2 threads 
with --parallel 2?)



Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:

Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the wikidata 
truthy

dump downloaded 2021-12-09.


This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out anyway, 
the development version, which among other things uses less temporary 
space, got merged in. It has had some testing.


It compresses the triples.tmp and intermediate sort files in the index 
stage making the peak usage much smaller.


The instance is a vm created on the Norwegian Research and Education 
Cloud,

an openstack cloud provider.

Instance type:
32 GB memory
4 CPU


I'm using something similar on a 7-year-old desktop machine, with a SATA disk.

I haven't got a machine I can dedicate to the multi-day load. I'll try 
to find a way to at least push it through building the node table.


Loading the first 1B of truthy:

1B triples , 40kTPS , 06h 54m 10s

The database is 81G and building needs an additional 11.6G for workspace, 
for a total of 92G (+ the data file).


While smaller, it seems bz2 files are much slower to decompress, so 
I've been using gz files.


My current best guess for 6.4B truthy is

Temp    96G
Database   540G
Data    48G
Total: 684G  -- peak disk needed

based on scaling up 1B truthy. Personally, I would make sure there was 
more space. Also - I don't know if the shape of the data is 
sufficiently uniform to make scaling predictable.  The time doesn't 
scale so simply.


This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.

Compression reduces the size of triples.tmp -- and of the related sort 
temporary files, which add up to the same again -- to 1/6 of the size.



The storage used for dump + temp files  is mounted as a separate  900GB
volume and is mounted on /var/fuseki/databases
.The type of storage is described as

  *mass-storage-default*: Storage backed by spinning hard drives,

available to everybody and is the default type.
with ext4 configured. At the moment I don't have access to the faster
volume type mass-storage-ssd. CPU and memory are not dedicated, and 
can be

overcommitted.


"overcommitted" may be a problem.

While it's not "tdb2 loader parallel", it does use the CPU continuously 
in several threads.


For memory - "it's complicated".

The Java parts only need, say, 2G. The sort is set to "buffer 50% 
--parallel=2", and the Java process pipes into sort; that's another thread. 
I think the effective peak is 3 active threads, and they'll all be at 
100% for some of the time.


So it's going to need 50% of RAM, plus 2G for a Java process, plus the OS.

It does not need space for memory-mapped files (they aren't used at 
all in the loading process, and I/O is sequential).


If that triggers overcommitment swap-out, performance may go down a lot.


For disk - if that is physically remote, it should not be a problem 
(famous last words). I/O is sequential and in large continuous chunks 
- typical for batch processing jobs.


OS for the instance is a clean Rocky Linux image, with no services 
except
jena/fuseki installed. The systemd service 


 set up for fuseki is stopped.

jena and fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)


Just FYI: Java 17 is a little faster. Some Java improvements have 
increased RDF parsing speed by up to 10%; in xloader that is not 
significant to the overall time.


I'm running from a tmux session to avoid connectivity issues and to 
capture
the output. 


I use

tdb2.xloader  |& tee LOG-FILE-NAME

to capture the logs and see them. ">&" and "tail -f" would achieve 
much the same effect



I think the output is stored in memory and not on disk.

Re: Testing tdb2.xloader

2021-12-14 Thread Marco Neumann
The more tests we have on different machines the better. :)

Personally I'd say if you have a choice go for PCIe 4.0 NVMe SSDs and
stay away from SATA < III SSDs. Also, for the tests SSD RAID isn't necessary.

These components have become extremely affordable in recent years and
really should be part of a fast pipeline imo in particular for
tdb2.tdbloader in parallel mode.

But as Andy has emphasized, he has designed the tdb2.xloader process to
be spinning-disk friendly, so SSDs are not a prerequisite for xloader.


On Tue, Dec 14, 2021 at 10:38 AM Øyvind Gjesdal  wrote:

> Hi Marco,
>
> Very useful to compare with your log on the different runs. Still working
> with configuration to see if I can get the ingest data stage to be usable
> for hdd. It looks like I get close to the performance of your run on the
> earlier stages, while ingest data is still very much too slow. Having to
> use SSD may be necessary, for a real world large import to complete?  I'lI
> request some ssd storage as well, and hope there's a quota for me :)
>
> Maybe I could also test different distros, to see if some of the default OS
> settings affect the import.
>
> Best regards,
> Øyvind
>
> søn. 12. des. 2021 kl. 10:21 skrev Marco Neumann  >:
>
> > Øyvind, looks like the above was the wrong log from a prior sharding
> > experiment.
> >
> > This is the correct log file for the truthy dataset.
> >
> > http://www.lotico.com/temp/LOG-98085
> >
> >
> >
> > On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann 
> > wrote:
> >
> > > Thank you Øyvind for sharing, great to see more tests in the wild.
> > >
> > > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > > dataset and quickly ran out of disk space. It finished the job but did
> > not
> > > write any of the indexes to disk due to lack of space. no error
> messages.
> > >
> > > http://www.lotico.com/temp/LOG-95239
> > >
> > > I have now ordered a new 4TB SSD drive to rerun the test possibly with
> > the
> > > full wikidata dataset,
> > >
> > > I personally had the best experience with dedicated hardware so far
> (can
> > > be in the data center), shared or dedicated virtual compute engines did
> > not
> > > deliver as expected. And I have not seen great benefits from data
> center
> > > grade multicore cpus. But I think they will during runtime in multi
> user
> > > settings (eg fuseki).
> > >
> > > Best,
> > > Marco
> > >
> > > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal 
> > wrote:
> > >
> > >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> > >> truthy
> > >> dump downloaded 2021-12-09.
> > >>
> > >> The instance is a vm created on the Norwegian Research and Education
> > >> Cloud,
> > >> an openstack cloud provider.
> > >>
> > >> Instance type:
> > >> 32 GB memory
> > >> 4 CPU
> > >>
> > >> The storage used for dump + temp files  is mounted as a separate
> 900GB
> > >> volume and is mounted on /var/fuseki/databases
> > >> .The type of storage is described as
> > >> >  *mass-storage-default*: Storage backed by spinning hard drives,
> > >> available to everybody and is the default type.
> > >> with ext4 configured. At the moment I don't have access to the faster
> > >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
> can
> > be
> > >> overcommitted.
> > >>
> > >> OS for the instance is a clean Rocky Linux image, with no services
> > except
> > >> jena/fuseki installed. The systemd service  set up for fuseki is
> > stopped.
> > >> jena and fuseki version is 4.3.0.
> > >>
> > >> openjdk 11.0.13 2021-10-19 LTS
> > >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> > >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
> sharing)
> > >>
> > >> I'm running from a tmux session to avoid connectivity issues and to
> > >> capture
> > >> the output. I think the output is stored in memory and not on disk.
> > >> On First run I tried to have the tmpdir on the root partition, to
> > separate
> > >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
> > disk
> > >> full. For the second (current run) all directories are under
> > >> /var/fuseki/databases.
> > >>
> > >>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> > >> --tmpdir
> > >> /var/fuseki/databases/tmp latest-truthy.nt.gz
> > >>
> > >> The import is so far at the "ingest data" stage where it has really
> > slowed
> > >> down.
> > >>
> > >> Current output is:
> > >>
> > >> 20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356
> /
> > >> Avg: 7,593)
> > >>
> > >> See full log so far at
> > >>
> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> > >>
> > >> Some notes:
> > >>
> > >> * There is a (time/info) lapse in the output log between the  end of
> > >> 'parse' and the start of 'index' for Terms.  It is unclear to me what
> is
> > >> happening in the 1h13 minutes between the lines.
> > >>
> > >> 22:33:46 INFO  Terms   ::   Elapsed: 50,720.20 seconds
> > [2021/12/10
> > 

Re: Testing tdb2.xloader

2021-12-14 Thread Andy Seaborne




On 14/12/2021 10:38, Øyvind Gjesdal wrote:

Hi Marco,

Very useful to compare with your log on the different runs. Still working
with configuration to see if I can get the ingest data stage to be usable
for hdd. It looks like I get close to the performance of your run on the
earlier stages, while ingest data is still very much too slow. Having to
use an SSD may be necessary for a real-world large import to complete? I'll
request some SSD storage as well, and hope there's a quota for me :)


The access patterns should (tm!) be spinning-disk friendly. There is no 
random IO updating B+trees directly.


All the B+Trees are written "bottom-up" by specially writing blocks of 
the right layout to disk, not via the B+Tree runtime code which would be 
top down via "add record" style access.


Andy


Maybe I could also test different distros, to see if some of the default OS
settings affect the import.

Best regards,
Øyvind

søn. 12. des. 2021 kl. 10:21 skrev Marco Neumann :


Øyvind, looks like the above was the wrong log from a prior sharding
experiment.

This is the correct log file for the truthy dataset.

http://www.lotico.com/temp/LOG-98085



On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann 
wrote:


Thank you Øyvind for sharing, great to see more tests in the wild.

I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
dataset and quickly ran out of disk space. It finished the job but did

not

write any of the indexes to disk due to lack of space. no error messages.

http://www.lotico.com/temp/LOG-95239

I have now ordered a new 4TB SSD drive to rerun the test possibly with

the

full wikidata dataset,

I personally had the best experience with dedicated hardware so far (can
be in the data center), shared or dedicated virtual compute engines did

not

deliver as expected. And I have not seen great benefits from data center
grade multicore cpus. But I think they will during runtime in multi user
settings (eg fuseki).

Best,
Marco

On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal 

wrote:



I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
truthy
dump downloaded 2021-12-09.

The instance is a vm created on the Norwegian Research and Education
Cloud,
an openstack cloud provider.

Instance type:
32 GB memory
4 CPU

The storage used for dump + temp files  is mounted as a separate  900GB
volume and is mounted on /var/fuseki/databases
.The type of storage is described as

  *mass-storage-default*: Storage backed by spinning hard drives,

available to everybody and is the default type.
with ext4 configured. At the moment I don't have access to the faster
volume type mass-storage-ssd. CPU and memory are not dedicated, and can

be

overcommitted.

OS for the instance is a clean Rocky Linux image, with no services

except

jena/fuseki installed. The systemd service  set up for fuseki is

stopped.

jena and fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)

I'm running from a tmux session to avoid connectivity issues and to
capture
the output. I think the output is stored in memory and not on disk.
On First run I tried to have the tmpdir on the root partition, to

separate

temp dir and data dir, but with only 19 GB free, the tmpdir soon was

disk

full. For the second (current run) all directories are under
/var/fuseki/databases.

  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
--tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz

The import is so far at the "ingest data" stage where it has really

slowed

down.

Current output is:

20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356 /
Avg: 7,593)

See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab

Some notes:

* There is a (time/info) lapse in the output log between the  end of
'parse' and the start of 'index' for Terms.  It is unclear to me what is
happening in the 1h13 minutes between the lines.

22:33:46 INFO  Terms   ::   Elapsed: 50,720.20 seconds

[2021/12/10

22:33:46 CET]
22:33:52 INFO  Terms   :: == Parse: 50726.071 seconds :
6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms   :: Add: 1,000,000 Index (Batch: 237,755 /
Avg: 237,755)

* The ingest data step really slows down on the "ingest data stage": At
the
current rate, if I calculated correctly, it looks like

PKG.CmdxIngestData

has 10 days left before it finishes.

* When I saw sort running in the background for the first parts of the
job,
I looked at the `sort` command. I noticed from some online sources that
setting the environment variable LC_ALL=C improves speed for `sort`.

Could

this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for `sort`.

*** WARNING *** The locale specified by the environment affects
sort order.  

Re: Testing tdb2.xloader

2021-12-14 Thread Øyvind Gjesdal
Hi Marco,

Very useful to compare with your log on the different runs. Still working
with configuration to see if I can get the ingest data stage to be usable
for hdd. It looks like I get close to the performance of your run on the
earlier stages, while ingest data is still very much too slow. Having to
use an SSD may be necessary for a real-world large import to complete? I'll
request some SSD storage as well, and hope there's a quota for me :)

Maybe I could also test different distros, to see if some of the default OS
settings affect the import.

Best regards,
Øyvind

søn. 12. des. 2021 kl. 10:21 skrev Marco Neumann :

> Øyvind, looks like the above was the wrong log from a prior sharding
> experiment.
>
> This is the correct log file for the truthy dataset.
>
> http://www.lotico.com/temp/LOG-98085
>
>
>
> On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann 
> wrote:
>
> > Thank you Øyvind for sharing, great to see more tests in the wild.
> >
> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > dataset and quickly ran out of disk space. It finished the job but did
> not
> > write any of the indexes to disk due to lack of space. no error messages.
> >
> > http://www.lotico.com/temp/LOG-95239
> >
> > I have now ordered a new 4TB SSD drive to rerun the test possibly with
> the
> > full wikidata dataset,
> >
> > I personally had the best experience with dedicated hardware so far (can
> > be in the data center), shared or dedicated virtual compute engines did
> not
> > deliver as expected. And I have not seen great benefits from data center
> > grade multicore cpus. But I think they will during runtime in multi user
> > settings (eg fuseki).
> >
> > Best,
> > Marco
> >
> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal 
> wrote:
> >
> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >> truthy
> >> dump downloaded 2021-12-09.
> >>
> >> The instance is a vm created on the Norwegian Research and Education
> >> Cloud,
> >> an openstack cloud provider.
> >>
> >> Instance type:
> >> 32 GB memory
> >> 4 CPU
> >>
> >> The storage used for dump + temp files  is mounted as a separate  900GB
> >> volume and is mounted on /var/fuseki/databases
> >> .The type of storage is described as
> >> >  *mass-storage-default*: Storage backed by spinning hard drives,
> >> available to everybody and is the default type.
> >> with ext4 configured. At the moment I don't have access to the faster
> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and can
> be
> >> overcommitted.
> >>
> >> OS for the instance is a clean Rocky Linux image, with no services
> except
> >> jena/fuseki installed. The systemd service  set up for fuseki is
> stopped.
> >> jena and fuseki version is 4.3.0.
> >>
> >> openjdk 11.0.13 2021-10-19 LTS
> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >>
> >> I'm running from a tmux session to avoid connectivity issues and to
> >> capture
> >> the output. I think the output is stored in memory and not on disk.
> >> On First run I tried to have the tmpdir on the root partition, to
> separate
> >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
> disk
> >> full. For the second (current run) all directories are under
> >> /var/fuseki/databases.
> >>
> >>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> >> --tmpdir
> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage where it has really
> slowed
> >> down.
> >>
> >> Current output is:
> >>
> >> 20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356 /
> >> Avg: 7,593)
> >>
> >> See full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >>
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the  end of
> >> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
> >> happening in the 1h13 minutes between the lines.
> >>
> >> 22:33:46 INFO  Terms   ::   Elapsed: 50,720.20 seconds
> [2021/12/10
> >> 22:33:46 CET]
> >> 22:33:52 INFO  Terms   :: == Parse: 50726.071 seconds :
> >> 6,560,468,631 triples/quads 129,331 TPS
> >> 23:46:13 INFO  Terms   :: Add: 1,000,000 Index (Batch: 237,755 /
> >> Avg: 237,755)
> >>
> >> * The ingest data step really slows down on the "ingest data stage": At
> >> the
> >> current rate, if I calculated correctly, it looks like
> PKG.CmdxIngestData
> >> has 10 days left before it finishes.
> >>
> >> * When I saw sort running in the background for the first parts of the
> >> job,
> >> I looked at the `sort` command. I noticed from some online sources that
> >> setting the environment variable LC_ALL=C improves speed for `sort`.
> Could
> >> this be set on the ProcessBuilder for the `sort` process? Could it
> >> break/change something? I see the warning from the man page for 

Re: Testing tdb2.xloader

2021-12-12 Thread Marco Neumann
Does 4.3.1 already contain the mitigation for the Log4j2 vulnerability?

On Sun, Dec 12, 2021 at 1:24 PM Marco Neumann 
wrote:

> As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB
> SSD drive and an old laptop.
>
> I also have a contact who has just set up a new datacenter in Ireland. I
> may be able to run a few tests on much bigger machines as well. Otherwise I
> am very happy with the iron in Finland.as long as they are dedicated
> machines.
>
> On Sun, Dec 12, 2021 at 12:44 PM Andy Seaborne  wrote:
>
>>
>>
>> On 11/12/2021 22:02, Marco Neumann wrote:
>> > Thank you Øyvind for sharing, great to see more tests in the wild.
>> >
>> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
>> > dataset and quickly ran out of disk space. It finished the job but did
>> not
>> > write any of the indexes to disk due to lack of space. no error
>> messages.
>>
>> The 4.3.1 xloader should hopefully address the space issue.
>>
>>  Andy
>>
>> >
>> > http://www.lotico.com/temp/LOG-95239
>> >
>> > I have now ordered a new 4TB SSD drive to rerun the test possibly with
>> the
>> > full wikidata dataset,
>> >
>> > I personally had the best experience with dedicated hardware so far
>> (can be
>> > in the data center), shared or dedicated virtual compute engines did not
>> > deliver as expected. And I have not seen great benefits from data center
>> > grade multicore cpus. But I think they will during runtime in multi user
>> > settings (eg fuseki).
>> >
>> > Best,
>> > Marco
>> >
>> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal 
>> wrote:
>> >
>> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>> truthy
>> >> dump downloaded 2021-12-09.
>> >>
>> >> The instance is a vm created on the Norwegian Research and Education
>> Cloud,
>> >> an openstack cloud provider.
>> >>
>> >> Instance type:
>> >> 32 GB memory
>> >> 4 CPU
>> >>
>> >> The storage used for dump + temp files  is mounted as a separate  900GB
>> >> volume and is mounted on /var/fuseki/databases
>> >> .The type of storage is described as
>> >>>   *mass-storage-default*: Storage backed by spinning hard drives,
>> >> available to everybody and is the default type.
>> >> with ext4 configured. At the moment I don't have access to the faster
>> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
>> can be
>> >> overcommitted.
>> >>
>> >> OS for the instance is a clean Rocky Linux image, with no services
>> except
>> >> jena/fuseki installed. The systemd service  set up for fuseki is
>> stopped.
>> >> jena and fuseki version is 4.3.0.
>> >>
>> >> openjdk 11.0.13 2021-10-19 LTS
>> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
>> sharing)
>> >>
>> >> I'm running from a tmux session to avoid connectivity issues and to
>> capture
>> >> the output. I think the output is stored in memory and not on disk.
>> >> On First run I tried to have the tmpdir on the root partition, to
>> separate
>> >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
>> disk
>> >> full. For the second (current run) all directories are under
>> >> /var/fuseki/databases.
>> >>
>> >>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>> --tmpdir
>> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
>> >>
>> >> The import is so far at the "ingest data" stage where it has really
>> slowed
>> >> down.
>> >>
>> >> Current output is:
>> >>
>> >> 20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356 /
>> >> Avg: 7,593)
>> >>
>> >> See full log so far at
>> >>
>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>> >>
>> >> Some notes:
>> >>
>> >> * There is a (time/info) lapse in the output log between the  end of
>> >> 'parse' and the start of 'index' for Terms.  It is unclear to me what
>> is
>> >> happening in the 1h13 minutes between the lines.
>> >>
>> >> 22:33:46 INFO  Terms   ::   Elapsed: 50,720.20 seconds
>> [2021/12/10
>> >> 22:33:46 CET]
>> >> 22:33:52 INFO  Terms   :: == Parse: 50726.071 seconds :
>> >> 6,560,468,631 triples/quads 129,331 TPS
>> >> 23:46:13 INFO  Terms   :: Add: 1,000,000 Index (Batch: 237,755
>> /
>> >> Avg: 237,755)
>> >>
>> >> * The ingest data step really slows down on the "ingest data stage":
>> At the
>> >> current rate, if I calculated correctly, it looks like
>> PKG.CmdxIngestData
>> >> has 10 days left before it finishes.
>> >>
>> >> * When I saw sort running in the background for the first parts of the
>> job,
>> >> I looked at the `sort` command. I noticed from some online sources that
>> >> setting the environment variable LC_ALL=C improves speed for `sort`.
>> Could
>> >> this be set on the ProcessBuilder for the `sort` process? Could it
>> >> break/change something? I see the warning from the man page for `sort`.
>> >>
>> >> *** WARNING *** The locale specified by the environment affects
>> >> 

Re: Testing tdb2.xloader

2021-12-12 Thread Marco Neumann
As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB SSD
drive and an old laptop.

I also have a contact who has just set up a new datacenter in Ireland. I
may be able to run a few tests on much bigger machines as well. Otherwise I
am very happy with the iron in Finland, as long as they are dedicated
machines.

On Sun, Dec 12, 2021 at 12:44 PM Andy Seaborne  wrote:

>
>
> On 11/12/2021 22:02, Marco Neumann wrote:
> > Thank you Øyvind for sharing, great to see more tests in the wild.
> >
> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > dataset and quickly ran out of disk space. It finished the job but did
> not
> > write any of the indexes to disk due to lack of space. no error messages.
>
> The 4.3.1 xloader should hopefully address the space issue.
>
>  Andy
>
> >
> > http://www.lotico.com/temp/LOG-95239
> >
> > I have now ordered a new 4TB SSD drive to rerun the test possibly with
> the
> > full wikidata dataset,
> >
> > I personally had the best experience with dedicated hardware so far (can
> be
> > in the data center), shared or dedicated virtual compute engines did not
> > deliver as expected. And I have not seen great benefits from data center
> > grade multicore cpus. But I think they will during runtime in multi user
> > settings (eg fuseki).
> >
> > Best,
> > Marco
> >
> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal 
> wrote:
> >
> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> truthy
> >> dump downloaded 2021-12-09.
> >>
> >> The instance is a vm created on the Norwegian Research and Education
> Cloud,
> >> an openstack cloud provider.
> >>
> >> Instance type:
> >> 32 GB memory
> >> 4 CPU
> >>
> >> The storage used for dump + temp files  is mounted as a separate  900GB
> >> volume and is mounted on /var/fuseki/databases
> >> .The type of storage is described as
> >>>   *mass-storage-default*: Storage backed by spinning hard drives,
> >> available to everybody and is the default type.
> >> with ext4 configured. At the moment I don't have access to the faster
> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and can
> be
> >> overcommitted.
> >>
> >> OS for the instance is a clean Rocky Linux image, with no services
> except
> >> jena/fuseki installed. The systemd service  set up for fuseki is
> stopped.
> >> jena and fuseki version is 4.3.0.
> >>
> >> openjdk 11.0.13 2021-10-19 LTS
> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >>
> >> I'm running from a tmux session to avoid connectivity issues and to
> capture
> >> the output. I think the output is stored in memory and not on disk.
> >> On First run I tried to have the tmpdir on the root partition, to
> separate
> >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
> disk
> >> full. For the second (current run) all directories are under
> >> /var/fuseki/databases.
> >>
> >>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> --tmpdir
> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage where it has really
> slowed
> >> down.
> >>
> >> Current output is:
> >>
> >> 20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356 /
> >> Avg: 7,593)
> >>
> >> See full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >>
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the  end of
> >> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
> >> happening in the 1h13 minutes between the lines.
> >>
> >> 22:33:46 INFO  Terms   ::   Elapsed: 50,720.20 seconds
> [2021/12/10
> >> 22:33:46 CET]
> >> 22:33:52 INFO  Terms   :: == Parse: 50726.071 seconds :
> >> 6,560,468,631 triples/quads 129,331 TPS
> >> 23:46:13 INFO  Terms   :: Add: 1,000,000 Index (Batch: 237,755 /
> >> Avg: 237,755)
> >>
> >> * The ingest data step really slows down on the "ingest data stage": At
> the
> >> current rate, if I calculated correctly, it looks like
> PKG.CmdxIngestData
> >> has 10 days left before it finishes.
> >>
> >> * When I saw sort running in the background for the first parts of the
> job,
> >> I looked at the `sort` command. I noticed from some online sources that
> >> setting the environment variable LC_ALL=C improves speed for `sort`.
> Could
> >> this be set on the ProcessBuilder for the `sort` process? Could it
> >> break/change something? I see the warning from the man page for `sort`.
> >>
> >> *** WARNING *** The locale specified by the environment affects
> >> sort order.  Set LC_ALL=C to get the traditional sort order that
> >> uses native byte values.
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >>
> >>
> 

Re: Testing tdb2.xloader

2021-12-12 Thread Andy Seaborne




On 11/12/2021 22:02, Marco Neumann wrote:

Thank you Øyvind for sharing, great to see more tests in the wild.

I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
dataset and quickly ran out of disk space. It finished the job but did not
write any of the indexes to disk due to lack of space. no error messages.


The 4.3.1 xloader should hopefully address the space issue.

Andy



http://www.lotico.com/temp/LOG-95239

I have now ordered a new 4TB SSD drive to rerun the test possibly with the
full wikidata dataset,

I personally had the best experience with dedicated hardware so far (can be
in the data center), shared or dedicated virtual compute engines did not
deliver as expected. And I have not seen great benefits from data center
grade multicore cpus. But I think they will during runtime in multi user
settings (eg fuseki).

Best,
Marco

On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal  wrote:


I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
dump downloaded 2021-12-09.

The instance is a vm created on the Norwegian Research and Education Cloud,
an openstack cloud provider.

Instance type:
32 GB memory
4 CPU

The storage used for dump + temp files  is mounted as a separate  900GB
volume and is mounted on /var/fuseki/databases
.The type of storage is described as

  *mass-storage-default*: Storage backed by spinning hard drives,

available to everybody and is the default type.
with ext4 configured. At the moment I don't have access to the faster
volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
overcommitted.

OS for the instance is a clean Rocky Linux image, with no services except
jena/fuseki installed. The systemd service  set up for fuseki is stopped.
jena and fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)

I'm running from a tmux session to avoid connectivity issues and to capture
the output. I think the output is stored in memory and not on disk.
On First run I tried to have the tmpdir on the root partition, to separate
temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
full. For the second (current run) all directories are under
/var/fuseki/databases.

  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz

The import is so far at the "ingest data" stage where it has really slowed
down.

Current output is:

20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356 /
Avg: 7,593)

See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab

Some notes:

* There is a (time/info) lapse in the output log between the  end of
'parse' and the start of 'index' for Terms.  It is unclear to me what is
happening in the 1h13 minutes between the lines.

22:33:46 INFO  Terms   ::   Elapsed: 50,720.20 seconds [2021/12/10
22:33:46 CET]
22:33:52 INFO  Terms   :: == Parse: 50726.071 seconds :
6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms   :: Add: 1,000,000 Index (Batch: 237,755 /
Avg: 237,755)

* The ingest data step really slows down on the "ingest data stage": At the
current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
has 10 days left before it finishes.

* When I saw sort running in the background for the first parts of the job,
I looked at the `sort` command. I noticed from some online sources that
setting the environment variable LC_ALL=C improves speed for `sort`. Could
this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for `sort`.

*** WARNING *** The locale specified by the environment affects
sort order.  Set LC_ALL=C to get the traditional sort order that
uses native byte values.

Links:
https://access.redhat.com/solutions/445233

https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram

https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort

Best regards,
Øyvind






Re: Testing tdb2.xloader

2021-12-12 Thread Andy Seaborne

Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:

I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
dump downloaded 2021-12-09.


This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out anyway, 
the development version, which among other things uses less temporary 
space, got merged in. It has had some testing.


It compresses the triples.tmp and intermediate sort files in the index 
stage making the peak usage much smaller.



The instance is a vm created on the Norwegian Research and Education Cloud,
an openstack cloud provider.

Instance type:
32 GB memory
4 CPU


I'm using something similar on a 7-year-old desktop machine, with a SATA disk.

I haven't got a machine I can dedicate to the multi-day load. I'll try 
to find a way to at least push it through building the node table.


Loading the first 1B of truthy:

1B triples , 40kTPS , 06h 54m 10s

The database is 81G and building needs an additional 11.6G for workspace, 
for a total of 92G (+ the data file).


While smaller, it seems bz2 files are much slower to decompress, so I've 
been using gz files.


My current best guess for 6.4B truthy is

Temp    96G
Database   540G
Data    48G
Total: 684G  -- peak disk needed

based on scaling up 1B truthy. Personally, I would make sure there was 
more space. Also - I don't know if the shape of the data is sufficiently 
uniform to make scaling predictable.  The time doesn't scale so simply.


This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.

Compression reduces the size of triples.tmp -- and of the related sort 
temporary files, which add up to the same again -- to 1/6 of the size.
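
To keep an eye on the actual peak while a load runs, something along these 
lines works (paths as used in the runs reported in this thread):

# Free space on the volume plus current size of the database and temp dirs,
# refreshed every 5 minutes.
watch -n 300 'df -h /var/fuseki/databases; du -sh /var/fuseki/databases/wd-truthy /var/fuseki/databases/tmp'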



The storage used for dump + temp files  is mounted as a separate  900GB
volume and is mounted on /var/fuseki/databases
.The type of storage is described as

  *mass-storage-default*: Storage backed by spinning hard drives,

available to everybody and is the default type.
with ext4 configured. At the moment I don't have access to the faster
volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
overcommitted.


"overcommitted" may be a problem.

While it's not "tdb2 loader parallel", it does use the CPU continuously 
in several threads.


For memory - "it's complicated".

The Java parts only need, say, 2G. The sort is set to "buffer 50% 
--parallel=2", and the Java process pipes into sort; that's another thread. 
I think the effective peak is 3 active threads, and they'll all be at 100% 
for some of the time.
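
As a plain-shell sketch of that stage -- illustrative only, not the loader's 
actual command line, and the file names are placeholders:

# The Java side streams records out, an external GNU sort does the heavy
# lifting with a 50% memory buffer and two threads, and the sorted stream
# is consumed again.
zcat records.tmp.gz \
  | sort --buffer-size=50% --parallel=2 --temporary-directory=/data/tmp/tdb \
  | gzip > records-sorted.tmp.gz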


So it's going to need 50% of RAM, plus 2G for a Java process, plus the OS.

It does not need space for memory-mapped files (they aren't used at all 
in the loading process, and I/O is sequential).


If that triggers overcommitment swap-out, performance may go down a lot.


For disk - if that is physically remote, it should not be a problem (famous 
last words). I/O is sequential and in large continuous chunks - typical 
for batch processing jobs.



OS for the instance is a clean Rocky Linux image, with no services except
jena/fuseki installed. The systemd service set up for fuseki is stopped.

jena and fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)


Just FYI: Java 17 is a little faster. Some Java improvements have 
improved RDF parsing speed by up to 10%; in xloader that's not significant 
to the overall time.



I'm running from a tmux session to avoid connectivity issues and to capture
the output. 


I use

tdb2.xloader  |& tee LOG-FILE-NAME

to capture the logs and see them. ">&" and "tail -f" would achieve much 
the same effect.
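
For example, something like this gives the same effect, with the load 
running in the background (the "..." stands for the usual arguments):

   tdb2.xloader ... >& LOG-FILE-NAME &
   tail -f LOG-FILE-NAME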



I think the output is stored in memory and not on disk.
On First run I tried to have the tmpdir on the root partition, to separate
temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
full. For the second (current run) all directories are under
/var/fuseki/databases.


Yes - after making that mistake myself, the new version ignores system 
TMPDIR.  Using --tmpdir is best but otherwise it defaults to the data 
directory.




  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz

The import is so far at the "ingest data" stage where it has really slowed
down.


FYI: The first line of ingest is always very slow. It is not measuring 
the start point correctly.




Current output is:

20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356 /
Avg: 7,593)

See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab


The earlier first pass also slows down, and that should be a fairly 
constant-ish speed step once everything settles down.



Some notes:

* There is a (time/info) lapse in the output log between the  end of
'parse' and the start of 'index' for Terms.  It is unclear to me what is
happening in the 1h13 minutes between the lines.



Re: Testing tdb2.xloader

2021-12-12 Thread Marco Neumann
Øyvind, looks like the above was the wrong log from a prior sharding
experiment.

This is the correct log file for the truthy dataset.

http://www.lotico.com/temp/LOG-98085



On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann 
wrote:

> Thank you Øyvind for sharing, great to see more tests in the wild.
>
> I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> dataset and quickly ran out of disk space. It finished the job but did not
> write any of the indexes to disk due to lack of space. no error messages.
>
> http://www.lotico.com/temp/LOG-95239
>
> I have now ordered a new 4TB SSD drive to rerun the test possibly with the
> full wikidata dataset,
>
> I personally had the best experience with dedicated hardware so far (can
> be in the data center), shared or dedicated virtual compute engines did not
> deliver as expected. And I have not seen great benefits from data center
> grade multicore cpus. But I think they will during runtime in multi user
> settings (eg fuseki).
>
> Best,
> Marco
>
> On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal  wrote:
>
>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>> truthy
>> dump downloaded 2021-12-09.
>>
>> The instance is a vm created on the Norwegian Research and Education
>> Cloud,
>> an openstack cloud provider.
>>
>> Instance type:
>> 32 GB memory
>> 4 CPU
>>
>> The storage used for dump + temp files  is mounted as a separate  900GB
>> volume and is mounted on /var/fuseki/databases
>> .The type of storage is described as
>> >  *mass-storage-default*: Storage backed by spinning hard drives,
>> available to everybody and is the default type.
>> with ext4 configured. At the moment I don't have access to the faster
>> volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
>> overcommitted.
>>
>> OS for the instance is a clean Rocky Linux image, with no services except
>> jena/fuseki installed. The systemd service  set up for fuseki is stopped.
>> jena and fuseki version is 4.3.0.
>>
>> openjdk 11.0.13 2021-10-19 LTS
>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>>
>> I'm running from a tmux session to avoid connectivity issues and to
>> capture
>> the output. I think the output is stored in memory and not on disk.
>> On First run I tried to have the tmpdir on the root partition, to separate
>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
>> full. For the second (current run) all directories are under
>> /var/fuseki/databases.
>>
>>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>> --tmpdir
>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>
>> The import is so far at the "ingest data" stage where it has really slowed
>> down.
>>
>> Current output is:
>>
>> 20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356 /
>> Avg: 7,593)
>>
>> See full log so far at
>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>
>> Some notes:
>>
>> * There is a (time/info) lapse in the output log between the  end of
>> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
>> happening in the 1h13 minutes between the lines.
>>
>> 22:33:46 INFO  Terms   ::   Elapsed: 50,720.20 seconds [2021/12/10
>> 22:33:46 CET]
>> 22:33:52 INFO  Terms   :: == Parse: 50726.071 seconds :
>> 6,560,468,631 triples/quads 129,331 TPS
>> 23:46:13 INFO  Terms   :: Add: 1,000,000 Index (Batch: 237,755 /
>> Avg: 237,755)
>>
>> * The ingest data step really slows down on the "ingest data stage": At
>> the
>> current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
>> has 10 days left before it finishes.
>>
>> * When I saw sort running in the background for the first parts of the
>> job,
>> I looked at the `sort` command. I noticed from some online sources that
>> setting the environment variable LC_ALL=C improves speed for `sort`. Could
>> this be set on the ProcessBuilder for the `sort` process? Could it
>> break/change something? I see the warning from the man page for `sort`.
>>
>>*** WARNING *** The locale specified by the environment affects
>>sort order.  Set LC_ALL=C to get the traditional sort order that
>>uses native byte values.
>>
>> Links:
>> https://access.redhat.com/solutions/445233
>>
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>
>> Best regards,
>> Øyvind
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>

-- 


---
Marco Neumann
KONA


Re: Testing tdb2.xloader

2021-12-11 Thread Marco Neumann
Thank you Øyvind for sharing, great to see more tests in the wild.

I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
dataset and quickly ran out of disk space. It finished the job but did not
write any of the indexes to disk due to lack of space. No error messages.

http://www.lotico.com/temp/LOG-95239

I have now ordered a new 4TB SSD drive to rerun the test, possibly with the
full wikidata dataset.

I personally had the best experience with dedicated hardware so far (it can
be in the data center); shared or dedicated virtual compute engines did not
deliver as expected. And I have not seen great benefits from data-center
grade multicore CPUs. But I think they will help at runtime in multi-user
settings (e.g. Fuseki).

Best,
Marco

On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal  wrote:

> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
> dump downloaded 2021-12-09.
>
> The instance is a vm created on the Norwegian Research and Education Cloud,
> an openstack cloud provider.
>
> Instance type:
> 32 GB memory
> 4 CPU
>
> The storage used for dump + temp files  is mounted as a separate  900GB
> volume and is mounted on /var/fuseki/databases
> .The type of storage is described as
> >  *mass-storage-default*: Storage backed by spinning hard drives,
> available to everybody and is the default type.
> with ext4 configured. At the moment I don't have access to the faster
> volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
> overcommitted.
>
> OS for the instance is a clean Rocky Linux image, with no services except
> jena/fuseki installed. The systemd service  set up for fuseki is stopped.
> jena and fuseki version is 4.3.0.
>
> openjdk 11.0.13 2021-10-19 LTS
> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>
> I'm running from a tmux session to avoid connectivity issues and to capture
> the output. I think the output is stored in memory and not on disk.
> On First run I tried to have the tmpdir on the root partition, to separate
> temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
> full. For the second (current run) all directories are under
> /var/fuseki/databases.
>
>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
> /var/fuseki/databases/tmp latest-truthy.nt.gz
>
> The import is so far at the "ingest data" stage where it has really slowed
> down.
>
> Current output is:
>
> 20:03:43 INFO  Data:: Add: 502,000,000 Data (Batch: 3,356 /
> Avg: 7,593)
>
> See full log so far at
> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>
> Some notes:
>
> * There is a (time/info) lapse in the output log between the  end of
> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
> happening in the 1h13 minutes between the lines.
>
> 22:33:46 INFO  Terms   ::   Elapsed: 50,720.20 seconds [2021/12/10
> 22:33:46 CET]
> 22:33:52 INFO  Terms   :: == Parse: 50726.071 seconds :
> 6,560,468,631 triples/quads 129,331 TPS
> 23:46:13 INFO  Terms   :: Add: 1,000,000 Index (Batch: 237,755 /
> Avg: 237,755)
>
> * The ingest data step really slows down on the "ingest data stage": At the
> current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
> has 10 days left before it finishes.
>
> * When I saw sort running in the background for the first parts of the job,
> I looked at the `sort` command. I noticed from some online sources that
> setting the environment variable LC_ALL=C improves speed for `sort`. Could
> this be set on the ProcessBuilder for the `sort` process? Could it
> break/change something? I see the warning from the man page for `sort`.
>
>*** WARNING *** The locale specified by the environment affects
>sort order.  Set LC_ALL=C to get the traditional sort order that
>uses native byte values.
>
> Links:
> https://access.redhat.com/solutions/445233
>
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>
> Best regards,
> Øyvind
>


-- 


---
Marco Neumann
KONA