Hi Sebastian, below are some considerations that lead me to think that Jena SDB (or TDB) could be a better solution. I understand that it would have a big impact on the codebase, though, so I would go cautiously.
On 23 May 2013 12:20, Sebastian Schaffert <[email protected]> wrote:

> Hi Raffaele,
>
> 2013/5/22 Raffaele Palmieri <[email protected]>
>
> > On 22 May 2013 15:04, Andy Seaborne <[email protected]> wrote:
> >
> > > What is the current loading rate?
> >
> > Tried a test with a graph of 661 nodes and 957 triples: it took about
> > 18 sec. So, looking at the triples, the average rate is 18.8 ms per
> > triple; tested on Tomcat with a maximum memory size of 1.5 GB.
>
> This is a bit too small for a real test, because you will have a high
> influence of side effects (like cache initialisation). I have done some
> performance comparisons with importing about 10% of GeoNames (about 15
> million triples, 1.5 million resources). The test uses a specialised
> parallel importer that was configured to run 8 importing threads in
> parallel. Here are some figures on different hardware:
>
> - VMWare, 4 CPU, 6 GB RAM, HDD: 4:20h (avg per 100 resources: 10-13
>   seconds, 8 in parallel). In the case of VMWare, the CPU is waiting
>   most of the time for I/O, so apparently the harddisk is slow. Could
>   also be related to an older Linux kernel or the host the instance is
>   running on (might not have 4 physical CPUs assigned to the instance).
> - QEmu, 4 CPU (8 GHz), 6 GB RAM, SSD: 2:10h (avg per 100 resources: 4-5
>   seconds, 8 in parallel). The change to SSD does not deliver the
>   expected performance gain; the limit is mostly the CPU power (load
>   always between 90-100%). However, the variance in the average time
>   per 100 is less, so the results are more stable over time.
> - Workstation, 8 CPU, 24 GB RAM, SSD: 0:40h (avg per 100 resources: 1-2
>   seconds, 8 in parallel). Running on physical hardware obviously shows
>   the highest performance.
> All 8 CPUs between 85-95% load.
>
> In this setup, my observation was that about 80% of the CPU time is
> actually spent in Postgres, and most of the database time in SELECTs
> (not INSERTs), because of checking whether a node or triple already
> exists. So the highest performance gain will be achieved by reducing
> the load on the database. There is already a quite good caching system
> in place (using EHCache); unfortunately, the caching cannot solve the
> issue of checking for non-existence (a cache can only help when
> checking for existence). This is why especially the initial import is
> comparably slow.

You are right that my test of about 1000 triples was limited, especially
with fewer resources than yours; but with the same graph and the same
resources Jena SDB offers better performance. However, I agree with you
that we will need more benchmarks. In favour of Jena, both SDB and TDB
already have a command line tool to access the database directly.

> Conceptually, when inserting a triple, the workflow currently looks as
> follows:
>
> 1. for each node of subject, predicate, object, context:
>    1.1. check for existence of the node:
>         1.1.a the node exists in the cache: return its database ID
>         1.1.b the node does not exist in the cache: look in the
>               database if it exists there (SELECT) and return its ID,
>               or null
>    1.2. if the database ID is null:
>         1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...))
>               or the sequence simulation table (MySQL: SELECT) to get
>               the next database ID and assign it to the node
>         1.2.2 store the node in the database (INSERT) and add it to
>               the cache
> 2. check for existence of the triple:
>    2.a the triple exists in the cache: return its database ID
>    2.b the triple does not exist in the cache: look in the database if
>        it exists there (SELECT) and return its ID, or null
> 3. if the triple ID is null:
>    3.1 query the sequence or the sequence simulation table (MySQL) to
>        get the next database ID for triples and assign it to the
>        triple
>    3.2 store the triple in the database (INSERT) and add it to the
>        cache
>
> So, in the worst case (i.e. all nodes are new and the triple is new,
> so nothing can be answered from the cache) you will have:
>
> - 4 INSERT commands (three nodes, 1 triple); these are comparably
>   cheap
> - 4 SELECT commands for existence checking (three nodes, 1 triple);
>   these are comparably expensive
> - 4 SELECT from sequence commands in the case of PostgreSQL or H2,
>   very cheap, or 4 SELECT from table commands in the case of MySQL,
>   comparably cheap (but not as good as a real sequence)
>
> What is even worse is that the INSERT and SELECT commands will be
> interwoven, i.e. there will be alternating SELECTs and INSERTs, which
> databases do not really like.

This workflow is for duplicate checking; from the documentation I see
that the Jena SDB loader already performs duplicate suppression.

> To optimize the performance, the best options are therefore:
>
> - avoiding alternating SELECTs and INSERTs as much as possible (e.g.
>   at least batch the node insertions for each triple)
> - avoiding the comparably expensive existence checks (e.g. another way
>   of caching/looking up that supports checking for non-existence)
>
> If bulk import then is still slow, it might make sense to look into
> the database-specific bulk loading commands you suggested.
>
> If I find some time, I might be able to look into the first
> optimization (i.e. avoiding too many alternating SELECT and INSERT
> commands). Maybe a certain improvement can already be achieved by
> optimizing this per triple.
>
> If you want to try out more sophisticated improvements or completely
> alternative ways of bulk loading, I would be very happy to see it.
> Just make sure the database schema and integrity constraints are kept
> as they are and the rest will work.
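To make the check-then-insert workflow above concrete, here is a minimal sketch of steps 1.1-3.2. It is Python with an in-memory SQLite database standing in for the real Java/PostgreSQL stack; the table and column names are illustrative, not Marmotta's actual schema, and the context node is omitted for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, value TEXT UNIQUE)")
conn.execute("CREATE TABLE triples (id INTEGER PRIMARY KEY, "
             "s INTEGER, p INTEGER, o INTEGER, UNIQUE (s, p, o))")

node_cache = {}  # value -> database ID (stand-in for the EHCache layer)

def lookup_or_create_node(value):
    # 1.1.a: node exists in the cache -> return its database ID
    if value in node_cache:
        return node_cache[value]
    # 1.1.b: not cached -> SELECT to check whether it exists in the database
    row = conn.execute("SELECT id FROM nodes WHERE value = ?",
                       (value,)).fetchone()
    if row is not None:
        node_id = row[0]
    else:
        # 1.2: new node -> INSERT; SQLite's rowid stands in for the
        # PostgreSQL/H2 sequence (SELECT nextval(...))
        node_id = conn.execute("INSERT INTO nodes (value) VALUES (?)",
                               (value,)).lastrowid
    node_cache[value] = node_id
    return node_id

def add_triple(s, p, o):
    ids = tuple(lookup_or_create_node(v) for v in (s, p, o))
    # 2/3: the same check-then-insert dance for the triple itself
    row = conn.execute("SELECT id FROM triples WHERE s = ? AND p = ? AND o = ?",
                       ids).fetchone()
    if row is None:
        conn.execute("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", ids)

add_triple("ex:s", "ex:p", "ex:o")
add_triple("ex:s", "ex:p", "ex:o")  # duplicate: suppressed by the checks
```

Note how, for every genuinely new triple, the SELECTs and INSERTs alternate per node: exactly the pattern the optimization suggestions above try to avoid.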
> The main constraint is that nodes are unique (i.e. each URI or literal
> has exactly one database row) and non-deleted triples are unique (i.e.
> each non-deleted triple has exactly one database ID).
>
> > > The Jena SDB bulk loader may have some ideas you can draw on. It
> > > bulk loads a chunk (typically 10K) of triples at a time and uses
> > > DB temporary tables as working storage. The database layout is a
> > > triples+nodes layout. SDB manipulates those tables in the DB to
> > > find new nodes to add to the node table and new triples to add to
> > > the triples table as single SQL operations. The original designer
> > > may be around on [email protected].
> >
> > This design looks interesting and it seems to be a similar approach
> > to my idea; it could be investigated. In that case, can we think
> > about using Jena SDB in Marmotta?
>
> This could be implemented by wrapping Jena SDB (or also TDB) in a
> Sesame Sail, and actually there is already an issue for this in Jira.
> However, when doing this you will lose support for the Marmotta/KiWi
> reasoner and versioning.

That would be a good approach, since it avoids too much refactoring. For
the KiWi reasoner and versioning, a move to the Jena RDF API could be
needed.

> My suggestion would instead be to look at how Jena SDB implements the
> bulk import and try a similar solution.

We could implement the same approach from scratch (with queue, chunks
and threads) and, combined with JDBC batch processing, we would obtain a
better result; but doesn't it make more sense to try to use the already
implemented solution directly?

> But if we start with the optimizations I have already suggested, there
> might be a huge gain already. It just has not been in our focus right
> now, because the scenarios we are working on do not require
> bulk-loading huge amounts of data. Data consistency and parallel
> access were more important to us.
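As a rough sketch of the SDB-style chunked load Andy describes (again in Python/SQLite only for self-containedness; Marmotta itself is Java on PostgreSQL/MySQL/H2, and all names here are illustrative): the point is that per-triple existence checks become a couple of set-based statements per chunk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
    CREATE TABLE triples (id INTEGER PRIMARY KEY,
                          s INTEGER, p INTEGER, o INTEGER, UNIQUE (s, p, o));
    CREATE TEMPORARY TABLE staging (s TEXT, p TEXT, o TEXT);
""")

def bulk_load(chunk):
    """Load one chunk of (s, p, o) string triples with set-based SQL."""
    # 1. dump the whole chunk into the working table in a single batch
    conn.executemany("INSERT INTO staging (s, p, o) VALUES (?, ?, ?)", chunk)
    # 2. one set-based INSERT adds every node not yet in the node table
    #    (UNION already de-duplicates the nodes within the chunk)
    conn.execute("""
        INSERT INTO nodes (value)
        SELECT v FROM (SELECT s AS v FROM staging
                       UNION SELECT p FROM staging
                       UNION SELECT o FROM staging)
        WHERE v NOT IN (SELECT value FROM nodes)
    """)
    # 3. one join resolves node IDs and inserts the new triples;
    #    OR IGNORE suppresses duplicates via the UNIQUE constraint
    conn.execute("""
        INSERT OR IGNORE INTO triples (s, p, o)
        SELECT n1.id, n2.id, n3.id
        FROM staging
          JOIN nodes n1 ON n1.value = staging.s
          JOIN nodes n2 ON n2.value = staging.p
          JOIN nodes n3 ON n3.value = staging.o
    """)
    conn.execute("DELETE FROM staging")
    conn.commit()

bulk_load([("ex:a", "ex:p", "ex:b"),
           ("ex:b", "ex:p", "ex:c"),
           ("ex:a", "ex:p", "ex:b")])  # duplicate inside the chunk
```

Here `executemany` plays the role a JDBC batch would play on our side; in a real implementation the chunk size (SDB reportedly uses around 10K triples) and the temporary table handling would need tuning per database.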
> But it would be a nice feature to be able to run a local copy of
> GeoNames or DBPedia using Marmotta ;-)

Yes, it would be nice :)

> Greetings,
>
> Sebastian

Greetings,
Raffaele.
