Hi Raffaele,

2013/5/22 Raffaele Palmieri <[email protected]>

> On 22 May 2013 15:04, Andy Seaborne <[email protected]> wrote:
>
> > What is the current loading rate?
> >
>
> Tried a test with a graph of 661 nodes and 957 triples: it took about 18
> sec. So, looking the triples the medium rate is 18.8 ms per triple; tested
> on Tomcat with maximum size of 1.5 Gb.
>
>
This is a bit too small for a real test, because side effects (like cache
initialisation) will have a strong influence on the result. I have done some
performance comparisons with importing about 10% of GeoNames (about 15
million triples, 1.5 million resources). The test uses a specialised
parallel importer that was configured to run 8 importing threads in
parallel. Here are some figures on different hardware:
- VMWare, 4 CPU, 6GB RAM, HDD: 4:20h (avg per 100 resources: 10-13 seconds, 8
  in parallel). In the VMWare case, the CPU is waiting most of the time for
  I/O, so apparently the hard disk is slow. This could also be related to an
  older Linux kernel or to the host the instance is running on (it might not
  have 4 physical CPUs assigned to the instance).
- QEmu, 4 CPU (8GHz), 6GB RAM, SSD: 2:10h (avg per 100 resources: 4-5
  seconds, 8 in parallel). The change to SSD does not deliver the expected
  performance gain; the limit is mostly the CPU power (load always between
  90-100%). However, the variance in the average time per 100 resources is
  smaller, so the results are more stable over time.
- Workstation, 8 CPU, 24GB RAM, SSD: 0:40h (avg per 100 resources: 1-2
  seconds, 8 in parallel). Running on physical hardware obviously shows the
  highest performance. All 8 CPUs are at 85-95% load.

In this setup, my observation was that about 80% of the CPU time is
actually spent in Postgres, and most of the database time in SELECTs (not
INSERTs), because of checking whether a node or triple already exists. So the
highest performance gain will be achieved by reducing the load on the
database. There is already a quite good caching system in place (using
EHCache); unfortunately, caching cannot solve the problem of checking for
non-existence (a cache can only help when checking for existence). This is
why especially the initial import is comparably slow.
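
To make this concrete, here is a minimal sketch of such a node lookup in
plain JDBC. This is not the actual KiWi code; the table and column names
"nodes" and "svalue" are assumptions for the example only:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Map;

class NodeLookupSketch {
    // Returns the database ID of a node value, or null if it does not exist.
    static Long lookupNodeId(Connection con, Map<String, Long> cache, String value)
            throws SQLException {
        Long id = cache.get(value);
        if (id != null) {
            return id;                     // cache confirms existence, no database access
        }
        // cache miss: the node might still exist in the database, so the SELECT
        // is unavoidable; for genuinely new nodes (the common case during an
        // initial import) this query is guaranteed to come back empty
        try (PreparedStatement ps =
                 con.prepareStatement("SELECT id FROM nodes WHERE svalue = ?")) {
            ps.setString(1, value);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    id = rs.getLong(1);
                    cache.put(value, id);  // remember the positive answer for next time
                    return id;
                }
            }
        }
        return null;                       // node does not exist (yet)
    }
}

The cache can only short-circuit the positive case; every new node still
costs one SELECT that returns nothing.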

Conceptually, when inserting a triple, the workflow currently looks as
follows (a rough JDBC sketch comes after the list):

1. for each node of subject, predicate, object, context:
1.1. check for existence of node
1.1.a node exists in cache, return its database ID
1.1.b node does not exist in cache, look in the database if it exists there
(SELECT) and return its ID, or null
1.2. if the database ID is null:
1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...)) or the
sequence simulation table (MySQL: SELECT) to get the next database ID and
assign it to the node
1.2.2 store the node in the database (INSERT) and add it to the cache
2. check for existence of triple:
2.a triple exists in cache, return its database ID
2.b triple does not exist in cache, look in the database if it exists there
(SELECT) and return its ID, or null
3. if the triple ID is null:
3.1 query the sequence or the sequence simulation table (MySQL) to get the
next database ID for triples and assign it to the triple
3.2 store the triple in the database (INSERT) and add it to the cache
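
As an illustration of these steps (again not the actual KiWi implementation;
table, column and sequence names such as "triples" and "seq_nodes" are
assumptions, and the triple cache is left out for brevity), the per-triple
flow in plain JDBC could look roughly like this; the node lookup at the top
mirrors the lookupNodeId sketch above, extended with ID creation:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

class TripleInsertSketch {

    private final Map<String, Long> nodeCache = new HashMap<String, Long>();

    // step 1: resolve (or create) the database ID of a single node
    long resolveNode(Connection con, String value) throws SQLException {
        Long id = nodeCache.get(value);                         // 1.1.a cache
        if (id == null) {                                       // 1.1.b database
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id FROM nodes WHERE svalue = ?")) {
                ps.setString(1, value);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) id = rs.getLong(1);
                }
            }
        }
        if (id == null) {                                       // 1.2 node is new
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT nextval('seq_nodes')")) {
                rs.next();                                      // 1.2.1 next node ID
                id = rs.getLong(1);
            }
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO nodes (id, svalue) VALUES (?, ?)")) {
                ps.setLong(1, id);                              // 1.2.2 store node
                ps.setString(2, value);
                ps.executeUpdate();
            }
        }
        nodeCache.put(value, id);
        return id;
    }

    // steps 2 and 3: check whether the triple exists and insert it if it is new
    long storeTriple(Connection con, String s, String p, String o, String c)
            throws SQLException {
        long sid = resolveNode(con, s), pid = resolveNode(con, p),
             oid = resolveNode(con, o), cid = resolveNode(con, c);
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id FROM triples WHERE subject = ? AND predicate = ? "
                + "AND object = ? AND context = ? AND deleted = false")) {   // step 2
            ps.setLong(1, sid); ps.setLong(2, pid);
            ps.setLong(3, oid); ps.setLong(4, cid);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) return rs.getLong(1);            // triple already there
            }
        }
        long tid;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT nextval('seq_triples')")) {
            rs.next();                                          // 3.1 next triple ID
            tid = rs.getLong(1);
        }
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO triples (id, subject, predicate, object, context, deleted) "
                + "VALUES (?, ?, ?, ?, ?, false)")) {           // 3.2 store triple
            ps.setLong(1, tid);
            ps.setLong(2, sid); ps.setLong(3, pid);
            ps.setLong(4, oid); ps.setLong(5, cid);
            ps.executeUpdate();
        }
        return tid;
    }
}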

So, in the worst case (i.e. all nodes are new and the triple is new, so
nothing can be answered from the cache) you will have:
- 4 INSERT commands (three nodes, 1 triple); these are comparably cheap
- 4 SELECT commands for existence checking (three nodes, 1 triple); these
are comparably expensive
- 4 SELECT from sequence commands in case of PostgreSQL or H2, very cheap, or
4 SELECT from table commands in case of MySQL, comparably cheap (but not as
good as a real sequence)
What is even worse is that the INSERT and SELECT commands will be
interwoven, i.e. there will be alternating SELECTs and INSERTs, which
databases do not really like.

To optimize the performance, the best options are therefore:
- avoiding alternating SELECTs and INSERTs as much as possible (e.g. by at
least batching the node insertions for each triple; a rough sketch of the
idea follows below)
- avoiding the comparably expensive existence checks (e.g. another way of
caching/looking up that supports checking for non-existence)
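
For the first option, one possible shape is sketched below (an illustration
only, reusing the assumed table names from above; this is not how the
importer currently works, and it takes the idea one step further by grouping
per chunk of triples): resolve all node values of a chunk with a single
grouped SELECT and insert the missing ones with a single batched INSERT, so
that SELECTs and INSERTs no longer alternate per node:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class BatchedNodeImportSketch {

    // Resolve the database IDs for all node values of a chunk: one grouped
    // SELECT for the existing nodes, then one batched INSERT for the new ones.
    static Map<String, Long> resolveNodes(Connection con, Set<String> values)
            throws SQLException {
        Map<String, Long> ids = new HashMap<String, Long>();
        if (values.isEmpty()) return ids;

        // single SELECT ... IN (...) for the whole chunk
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < values.size(); i++) in.append(i == 0 ? "?" : ",?");
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, svalue FROM nodes WHERE svalue IN (" + in + ")")) {
            int i = 1;
            for (String v : values) ps.setString(i++, v);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.put(rs.getString(2), rs.getLong(1));
            }
        }

        // single batched INSERT for all nodes that were not found
        try (PreparedStatement seq = con.prepareStatement("SELECT nextval('seq_nodes')");
             PreparedStatement ins = con.prepareStatement(
                 "INSERT INTO nodes (id, svalue) VALUES (?, ?)")) {
            for (String v : values) {
                if (ids.containsKey(v)) continue;          // already in the database
                long id;
                try (ResultSet rs = seq.executeQuery()) {  // sequence access stays cheap
                    rs.next();
                    id = rs.getLong(1);
                }
                ins.setLong(1, id);
                ins.setString(2, v);
                ins.addBatch();
                ids.put(v, id);
            }
            ins.executeBatch();                            // all new nodes in one round trip
        }
        return ids;
    }
}

The triples of a chunk could then be inserted with a second batched INSERT
after a similar grouped duplicate check, so each chunk produces a few large
statements instead of many small alternating ones.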

If bulk import is then still slow, it might make sense to look into the
database-specific bulk loading commands you suggested.

If I find some time, I might be able to look into the first optimization
(i.e. avoiding too many alternating SELECT and INSERT commands). Maybe some
improvement can already be achieved by optimizing this per triple.

If you want to try out more sophisticated improvements or completely
alternative ways of bulk loading, I would be very happy to see it. Just make
sure the database schema and integrity constraints are kept as they are and
the rest will work. The main constraint is that nodes are unique (i.e. each
URI or Literal has exactly one database row) and non-deleted triples are
unique (i.e. each non-deleted triple has exactly one database ID).




> >
> > The Jena SDB bulk loader may have some ideas you can draw on.  It bulk
> > loads a chunk (typically 10K) of triples at a time uses DB temporary
> tables
> > as working storage.  The database layout is a triples+nodes database
> > layout.  SDB manipulates those tables in the DB to find new nodes to add
> to
> > the node table and new triples to add to the triples table as single SQL
> > operations.  The original designer may be around on [email protected]
> >
> >
> This design looks interesting and it seems to be a similar approach to my
> idea, it could be investigated. In the case, can we think about use of Jena
> SDB in Marmotta?
>
>
This could be implemented by wrapping Jena SDB (or also TDB) in a Sesame
Sail, and there is actually already an issue for this in Jira. However,
when doing this you will lose support for the Marmotta/KiWi Reasoner and
Versioning. My suggestion would instead be to look at how Jena SDB
implements the bulk import and to try a similar solution. But if we start
with the optimizations I have already suggested, there might be a huge gain
already. It just has not been our focus so far, because the scenarios
we are working on do not require bulk-loading huge amounts of data. Data
consistency and parallel access were more important to us. But it would be a
nice feature to be able to run a local copy of GeoNames or DBPedia using
Marmotta ;-)


Greetings,

Sebastian
