Hi Sebastian, below are some considerations that lead me to think that
Jena SDB (or TDB) could be a better solution, but I understand that it would
have a big impact on the codebase, so I would proceed cautiously.
On 23 May 2013 12:20, Sebastian Schaffert <[email protected]> wrote:
Hi Raffaele,
2013/5/22 Raffaele Palmieri <[email protected]>
On 22 May 2013 15:04, Andy Seaborne <[email protected]> wrote:
What is the current loading rate?
Tried a test with a graph of 661 nodes and 957 triples: it took about 18
sec. So, looking at the triples, the average rate is 18.8 ms per triple;
tested on Tomcat with a maximum size of 1.5 GB.
This is a bit too small for a real test, because you will have a high
influence of side effects (like cache initialisation). I have done some
performance comparisons with importing about 10% of GeoNames (about 15
million triples, 1.5 million resources). The test uses a specialised
parallel importer that was configured to run 8 importing threads in
parallel. Here are some figures on different hardware:
- VMWare, 4 CPU, 6GB RAM, HDD: 4:20h (avg per 100 resources: 10-13 seconds,
8 in parallel). In case of VMWare, the CPU is waiting most of the time for
I/O, so apparently the harddisk is slow. Could also be related to an older
Linux kernel or the host the instance is running on (might not have 4
physical CPUs assigned to the instance).
- QEmu, 4 CPU (8GHz), 6GB RAM, SSD: 2:10h (avg per 100 resources: 4-5
seconds, 8 in parallel). The change to SSD does not deliver the expected
performance gain; the limit is mostly the CPU power (load always between
90-100%). However, the variance in the average time for 100 resources is
lower, so the results are more stable over time.
- Workstation, 8 CPU, 24GB RAM, SSD: 0:40h (avg per 100 resources: 1-2
seconds, 8 in parallel). Running on physical hardware obviously shows the
highest performance. All 8 CPUs between 85-95% load.
In this setup, my observation was that about 80% of the CPU time is
actually spent in Postgres, and most of the database time in SELECTs (not
INSERTs) because of checking if a node or triple already exists. So the
highest performance gain will be achieved by reducing the load on the
database. There is already a quite good caching system in place (using
EHCache); unfortunately, caching cannot solve the issue of checking for
non-existence (a cache can only help when checking for existence). This is
why especially the initial import is comparably slow.
You are right that my test of about 1000 triples was too limited,
especially with lower resources than yours; but with the same graph and the
same resources Jena SDB offers better performance. However, I agree with
you that we will need more benchmarks.
In favor of Jena, both SDB and TDB already have a command-line tool for
direct access to the database.
Conceptually, when inserting a triple, the workflow currently looks as
follows:
1. for each node of subject, predicate, object, context:
1.1. check for existence of the node
1.1.a node exists in cache: return its database ID
1.1.b node does not exist in cache: look in the database if it exists there
(SELECT) and return its ID, or null
1.2. if the database ID is null:
1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...)) or the
sequence simulation table (MySQL: SELECT) to get the next database ID and
assign it to the node
1.2.2 store the node in the database (INSERT) and add it to the cache
2. check for existence of the triple:
2.a triple exists in cache: return its database ID
2.b triple does not exist in cache: look in the database if it exists there
(SELECT) and return its ID, or null
3. if the triple ID is null:
3.1 query the sequence or the sequence simulation table (MySQL) to get the
next database ID for triples and assign it to the triple
3.2 store the triple in the database (INSERT) and add it to the cache
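To make the steps above concrete, here is a minimal Java sketch of the
lookup-or-create workflow. HashMaps stand in for the EHCache layer and the
database tables, and the sequences are plain counters; all class and method
names here are hypothetical, not actual Marmotta/KiWi API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: maps simulate the cache and the database tables.
class TripleStoreSketch {
    private final Map<String, Long> nodeCache = new HashMap<>();   // cache (step 1.1.a)
    private final Map<String, Long> nodeTable = new HashMap<>();   // database nodes (step 1.1.b)
    private final Map<String, Long> tripleTable = new HashMap<>(); // database triples (step 2)
    private final AtomicLong nodeSeq = new AtomicLong();           // sequence (step 1.2.1)
    private final AtomicLong tripleSeq = new AtomicLong();

    long lookupOrCreateNode(String value) {
        Long id = nodeCache.get(value);          // 1.1.a: cache hit
        if (id == null) {
            id = nodeTable.get(value);           // 1.1.b: SELECT from database
            if (id == null) {
                id = nodeSeq.incrementAndGet();  // 1.2.1: SELECT nextval(...)
                nodeTable.put(value, id);        // 1.2.2: INSERT node
            }
            nodeCache.put(value, id);            // add to cache
        }
        return id;
    }

    long addTriple(String s, String p, String o, String c) {
        // step 1: resolve all four nodes first
        String key = lookupOrCreateNode(s) + ":" + lookupOrCreateNode(p)
                   + ":" + lookupOrCreateNode(o) + ":" + lookupOrCreateNode(c);
        // steps 2-3: existence check, then sequence + INSERT if the triple is new
        return tripleTable.computeIfAbsent(key, k -> tripleSeq.incrementAndGet());
    }
}
```

The sketch makes the worst case visible: a completely new triple touches
every map once for lookup and once for insert, mirroring the alternating
SELECT/INSERT pattern described below.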
So, in the worst case (i.e. all nodes are new and the triple is new, so
nothing can be answered from the cache) you will have:
- 4 INSERT commands (three nodes, 1 triple); these are comparably cheap
- 4 SELECT commands for existence checking (three nodes, 1 triple); these
are comparably expensive
- 4 SELECT from sequence commands in case of PostgreSQL or H2, very cheap,
or 4 SELECT from table commands in case of MySQL, comparably cheap (but not
as good as a real sequence)
What is even worse is that the INSERT and SELECT commands will be
interwoven, i.e. there will be alternating SELECTs and INSERTs, which
databases do not really like.
This workflow is for duplicate checking; from the documentation I see that
the Jena SDB loader already performs duplicate suppression.
To optimize the performance, the best options are therefore:
- avoiding alternating SELECTs and INSERTs as much as possible (e.g.
batching at least the node insertions for each triple)
- avoiding the comparably expensive existence checks (e.g. another way of
caching/looking up that supports checking for non-existence)
If bulk import is then still slow, it might make sense to look into the
database-specific bulk loading commands you suggested.
If I find some time, I might be able to look into the first optimization
(i.e. avoiding too many alternate SELECT and INSERT commands). Maybe a
certain improvement can already be achieved by optimizing this per
triple.
If you want to try out more sophisticated improvements or completely
alternate ways of bulk loading, I would be very happy to see it. Just make
sure the database schema and integrity constraints are kept as they are and
the rest will work. The main constraint is that nodes are unique (i.e. each
URI or Literal has exactly one database row) and non-deleted triples are
unique (i.e. each non-deleted triple has exactly one database ID).
The Jena SDB bulk loader may have some ideas you can draw on. It bulk loads
a chunk (typically 10K) of triples at a time and uses DB temporary tables
as working storage. The database layout is a triples+nodes database layout.
SDB manipulates those tables in the DB to find new nodes to add to the node
table and new triples to add to the triples table as single SQL operations.
The original designer may be around on [email protected]
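The core of that chunked approach, reducing the per-node existence checks
to one set operation per chunk, can be illustrated in plain Java. The
collections below stand in for the temporary table and the node table; in
SQL this would be a single INSERT ... SELECT ... WHERE NOT EXISTS over the
temp table. The class and method names are hypothetical, not SDB's actual
code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of SDB-style chunked loading: instead of one SELECT
// per node, collect all node values of a chunk (the "temp table") and
// subtract the already-known nodes in a single operation (the anti-join).
class ChunkLoaderSketch {
    // returns the nodes of this chunk that are not yet in the node table
    static Set<String> newNodes(List<String[]> chunk, Set<String> nodeTable) {
        Set<String> candidates = new HashSet<>();
        for (String[] triple : chunk) {
            candidates.addAll(Arrays.asList(triple)); // load chunk into temp table
        }
        candidates.removeAll(nodeTable); // one set difference instead of N SELECTs
        return candidates;               // bulk-INSERT these, then the triples
    }
}
```

The point is that the cost of duplicate checking is paid once per chunk
rather than once per node, which is exactly where the profiling above says
the time goes.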
This design looks interesting and it seems to be a similar approach to my
idea; it could be investigated. In that case, can we think about using Jena
SDB in Marmotta?
This could be implemented by wrapping Jena SDB (or also TDB) in a Sesame
Sail, and actually there is already an issue for this in Jira. However,
when doing this you will lose support for the Marmotta/KiWi Reasoner and
Versioning.
That would be a good approach that avoids too much refactoring. For the
KiWi Reasoner and Versioning, a move to the Jena RDF API could be needed.
My suggestion would instead be to look how Jena SDB is
implementing the bulk import and try a similar solution.
We could implement the same approach from scratch (with queues, chunks and
threads), combined with JDBC batch processing, and we would obtain a better
result; but wouldn't it make more sense to try to use an already
implemented solution directly?
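As a toy illustration of why JDBC batch processing helps here: grouping
inserts turns one database round trip per row into one per chunk. The real
code would use PreparedStatement.addBatch()/executeBatch() on the JDBC
connection; this counter class and its names are purely hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: counts the "round trips" a batched insert strategy
// needs. pending plays the role of the JDBC batch buffer.
class BatchSketch {
    int roundTrips = 0;
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();

    BatchSketch(int batchSize) { this.batchSize = batchSize; }

    void add(String row) {          // cf. PreparedStatement.addBatch()
        pending.add(row);
        if (pending.size() >= batchSize) flush();
    }

    void flush() {                  // cf. PreparedStatement.executeBatch()
        if (!pending.isEmpty()) {
            roundTrips++;           // one database round trip per batch
            pending.clear();
        }
    }
}
```

With a batch size of 10, inserting 25 rows costs 3 round trips instead of
25, which is the same effect the SDB chunked loader gets from its temporary
tables.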
But if we start with the optimizations I have already suggested, there
might be a huge gain already. It just has not been in our focus right now,
because the scenarios we are working on do not require bulk-loading huge
amounts of data. Data consistency and parallel access were more important
to us. But it would be a nice feature to be able to run a local copy of
GeoNames or DBPedia using Marmotta ;-)
Yes, it would be nice :)
Greetings,
Sebastian
Greetings,
Raffaele.