Hi Sebastian, below are some considerations that lead me to think that
Jena SDB (or TDB) could be a better solution, but I understand that it would
have a big impact on the codebase, so I would proceed cautiously.
On 23 May 2013 12:20, Sebastian Schaffert <[email protected]> wrote:
Hi Raffaele,
2013/5/22 Raffaele Palmieri <[email protected]>
On 22 May 2013 15:04, Andy Seaborne <[email protected]> wrote:
What is the current loading rate?
Tried a test with a graph of 661 nodes and 957 triples: it took about 18
sec. So, looking at the triples, the average rate is 18.8 ms per triple;
tested on Tomcat with a maximum size of 1.5 GB.
This is a bit too small for a real test, because you will have a high
influence of side effects (like cache initialisation). I have done some
performance comparisons with importing about 10% of GeoNames (about 15
million triples, 1.5 million resources). The test uses a specialised
parallel importer that was configured to run 8 importing threads in
parallel. Here are some figures on different hardware:
- VMWare, 4 CPU, 6GB RAM, HDD: 4:20h (avg per 100 resources: 10-13 seconds,
8 in parallel). In case of VMWare, the CPU is waiting most of the time for
I/O, so apparently the harddisk is slow. Could also be related to an older
Linux kernel or the host the instance is running on (might not have 4
physical CPUs assigned to the instance).
- QEmu, 4 CPU (8GHz), 6GB RAM, SSD: 2:10h (avg per 100 resources: 4-5
seconds, 8 in parallel). The change to SSD does not deliver the expected
performance gain; the limit is mostly the CPU power (load always between
90-100%). However, the variance in the average time for 100 resources is
lower, so the results are more stable over time.
- Workstation, 8 CPU, 24GB RAM, SSD: 0:40h (avg per 100 resources: 1-2
seconds, 8 in parallel). Running on physical hardware obviously shows the
highest performance. All 8 CPUs between 85-95% load.
In this setup, my observation was that about 80% of the CPU time is
actually spent in Postgres, and most of the database time in SELECTs (not
INSERTs) because of checking if a node or triple already exists. So the
highest performance gain will be achieved by reducing the load on the
database. There is already a quite good caching system in place (using
EHCache); unfortunately, caching cannot solve the issue of checking for
non-existence (a cache can only help when checking for existence). This is
why especially the initial import is comparably slow.
You are right that my test of about 1000 triples was too limited,
especially with lower resources than yours; but with the same graph and the
same resources Jena SDB offers better performance. However, I agree with
you that we will need more benchmarks.
In favor of Jena, both SDB and TDB already have a command-line tool for
direct access to the database.
Conceptually, when inserting a triple, the workflow currently looks as
follows:
1. for each node of subject, predicate, object, context:
1.1. check for existence of the node
1.1.a node exists in cache: return its database ID
1.1.b node does not exist in cache: look in the database if it exists there
(SELECT) and return its ID, or null
1.2. if the database ID is null:
1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...)) or the
sequence simulation table (MySQL: SELECT) to get the next database ID and
assign it to the node
1.2.2 store the node in the database (INSERT) and add it to the cache
2. check for existence of the triple:
2.a triple exists in cache: return its database ID
2.b triple does not exist in cache: look in the database if it exists there
(SELECT) and return its ID, or null
3. if the triple ID is null:
3.1 query the sequence or the sequence simulation table (MySQL) to get the
next database ID for triples and assign it to the triple
3.2 store the triple in the database (INSERT) and add it to the cache
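To make the steps above concrete, here is a minimal Java sketch of the
lookup-or-create workflow. HashMaps stand in for the EHCache layer and the
database tables, and the sequences are plain counters; all class and method
names here are hypothetical, not actual Marmotta/KiWi API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: maps simulate the cache and the database tables.
class TripleStoreSketch {
    private final Map<String, Long> nodeCache = new HashMap<>();   // cache (step 1.1.a)
    private final Map<String, Long> nodeTable = new HashMap<>();   // database nodes (step 1.1.b)
    private final Map<String, Long> tripleTable = new HashMap<>(); // database triples (step 2)
    private final AtomicLong nodeSeq = new AtomicLong();           // sequence (step 1.2.1)
    private final AtomicLong tripleSeq = new AtomicLong();

    long lookupOrCreateNode(String value) {
        Long id = nodeCache.get(value);          // 1.1.a: cache hit
        if (id == null) {
            id = nodeTable.get(value);           // 1.1.b: SELECT from database
            if (id == null) {
                id = nodeSeq.incrementAndGet();  // 1.2.1: SELECT nextval(...)
                nodeTable.put(value, id);        // 1.2.2: INSERT node
            }
            nodeCache.put(value, id);            // add to cache
        }
        return id;
    }

    long addTriple(String s, String p, String o, String c) {
        // step 1: resolve all four nodes first
        String key = lookupOrCreateNode(s) + ":" + lookupOrCreateNode(p)
                   + ":" + lookupOrCreateNode(o) + ":" + lookupOrCreateNode(c);
        // steps 2-3: existence check, then sequence + INSERT if the triple is new
        return tripleTable.computeIfAbsent(key, k -> tripleSeq.incrementAndGet());
    }
}
```

The sketch makes the worst case visible: a completely new triple touches
every map once for lookup and once for insert, mirroring the alternating
SELECT/INSERT pattern described below.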
So, in the worst case (i.e. all nodes are new and the triple is new, so
nothing can be answered from the cache) you will have:
- 4 INSERT commands (three nodes, 1 triple); these are comparably cheap
- 4 SELECT commands for existence checking (three nodes, 1 triple); these
are comparably expensive
- 4 SELECT from sequence commands in case of PostgreSQL or H2, very cheap,
or 4 SELECT from table commands in case of MySQL, comparably cheap (but not
as good as a real sequence)
What is even worse is that the INSERT and SELECT commands will be
interwoven, i.e. there will be alternating SELECTs and INSERTs, which
databases do not really like.
This workflow is for duplicate checking; from the documentation I see that
the Jena SDB loader already performs duplicate suppression.
To optimize the performance, the best options are therefore:
- avoiding alternating SELECTs and INSERTs as much as possible (e.g.
batching at least the node insertions for each triple)
- avoiding the comparably expensive existence checks (e.g. another way of
caching/looking up that supports checking for non-existence)
If bulk import is then still slow, it might make sense to look into the
database-specific bulk loading commands you suggested.
If I find some time, I might be able to look into the first optimization
(i.e. avoiding too many alternate SELECT and INSERT commands). Maybe a
certain improvement can already be achieved by optimizing this per
triple.
If you want to try out more sophisticated improvements or completely
alternate ways of bulk loading, I would be very happy to see it. Just make
sure the database schema and integrity constraints are kept as they are and
the rest will work. The main constraint is that nodes are unique (i.e. each
URI or Literal has exactly one database row) and non-deleted triples are
unique (i.e. each non-deleted triple has exactly one database ID).
The Jena SDB bulk loader may have some ideas you can draw on. It bulk loads
a chunk (typically 10K) of triples at a time and uses DB temporary tables
as working storage. The database layout is a triples+nodes database layout.
SDB manipulates those tables in the DB to find new nodes to add to the node
table and new triples to add to the triples table as single SQL operations.
The original designer may be around on [email protected]
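The core of that chunked approach, reducing the per-node existence checks
to one set operation per chunk, can be illustrated in plain Java. The
collections below stand in for the temporary table and the node table; in
SQL this would be a single INSERT ... SELECT ... WHERE NOT EXISTS over the
temp table. The class and method names are hypothetical, not SDB's actual
code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of SDB-style chunked loading: instead of one SELECT
// per node, collect all node values of a chunk (the "temp table") and
// subtract the already-known nodes in a single operation (the anti-join).
class ChunkLoaderSketch {
    // returns the nodes of this chunk that are not yet in the node table
    static Set<String> newNodes(List<String[]> chunk, Set<String> nodeTable) {
        Set<String> candidates = new HashSet<>();
        for (String[] triple : chunk) {
            candidates.addAll(Arrays.asList(triple)); // load chunk into temp table
        }
        candidates.removeAll(nodeTable); // one set difference instead of N SELECTs
        return candidates;               // bulk-INSERT these, then the triples
    }
}
```

The point is that the cost of duplicate checking is paid once per chunk
rather than once per node, which is exactly where the profiling above says
the time goes.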
This design looks interesting and it seems to be a similar approach to my
idea; it could be investigated. In that case, can we think about using Jena
SDB in Marmotta?
This could be implemented by wrapping Jena SDB (or also TDB) in a Sesame
Sail, and actually there is already an issue for this in Jira. However,
when doing this you will lose support for the Marmotta/KiWi Reasoner and
Versioning.
That would be a good approach that avoids too much refactoring. For the
KiWi Reasoner and Versioning, a move to the Jena RDF API could be needed.
My suggestion would instead be to look how Jena SDB is
implementing the bulk import and try a similar solution.
We could implement the same approach from scratch (with queues, chunks and
threads), combined with JDBC batch processing, and we would obtain a better
result; but wouldn't it make more sense to try to use an already
implemented solution directly?
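As a toy illustration of why JDBC batch processing helps here: grouping
inserts turns one database round trip per row into one per chunk. The real
code would use PreparedStatement.addBatch()/executeBatch() on the JDBC
connection; this counter class and its names are purely hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: counts the "round trips" a batched insert strategy
// needs. pending plays the role of the JDBC batch buffer.
class BatchSketch {
    int roundTrips = 0;
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();

    BatchSketch(int batchSize) { this.batchSize = batchSize; }

    void add(String row) {          // cf. PreparedStatement.addBatch()
        pending.add(row);
        if (pending.size() >= batchSize) flush();
    }

    void flush() {                  // cf. PreparedStatement.executeBatch()
        if (!pending.isEmpty()) {
            roundTrips++;           // one database round trip per batch
            pending.clear();
        }
    }
}
```

With a batch size of 10, inserting 25 rows costs 3 round trips instead of
25, which is the same effect the SDB chunked loader gets from its temporary
tables.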
But if we start with the optimizations I have already suggested, there
might be a huge gain already. It just has not been in our focus right now,
because the scenarios we are working on do not require bulk-loading huge
amounts of data. Data consistency and parallel access were more important
to us. But it would be a nice feature to be able to run a local copy of
GeoNames or DBPedia using Marmotta ;-)
Yes, it would be nice :)
Greetings,
Sebastian
Greetings,
Raffaele.