Hi Sebastian, below are some considerations that lead me to think that Jena SDB (or TDB) could be a better solution. I understand that it would have a big impact on the codebase, though, so I would go cautiously.
On 23 May 2013 12:20, Sebastian Schaffert <[email protected]> wrote:

> Hi Raffaele,
>
> 2013/5/22 Raffaele Palmieri <[email protected]>
>
> > On 22 May 2013 15:04, Andy Seaborne <[email protected]> wrote:
> >
> > > What is the current loading rate?
> >
> > Tried a test with a graph of 661 nodes and 957 triples: it took about
> > 18 sec. So, looking at the triples, the average rate is 18.8 ms per
> > triple; tested on Tomcat with a maximum memory size of 1.5 GB.
>
> This is a bit too small for a real test, because you will have a high
> influence of side effects (like cache initialisation). I have done some
> performance comparisons with importing about 10% of GeoNames (about 15
> million triples, 1.5 million resources). The test uses a specialised
> parallel importer that was configured to run 8 importing threads in
> parallel. Here are some figures on different hardware:
>
> - VMWare, 4 CPU, 6 GB RAM, HDD: 4:20h (avg per 100 resources: 10-13
>   seconds, 8 in parallel). In the case of VMWare, the CPU is waiting
>   most of the time for I/O, so apparently the harddisk is slow. Could
>   also be related to an older Linux kernel or the host the instance is
>   running on (might not have 4 physical CPUs assigned to the instance).
> - QEmu, 4 CPU (8 GHz), 6 GB RAM, SSD: 2:10h (avg per 100 resources: 4-5
>   seconds, 8 in parallel). The change to SSD does not deliver the
>   expected performance gain; the limit is mostly the CPU power (load
>   always between 90-100%). However, the variance in the average time
>   per 100 is less, so the results are more stable over time.
> - Workstation, 8 CPU, 24 GB RAM, SSD: 0:40h (avg per 100 resources: 1-2
>   seconds, 8 in parallel). Running on physical hardware obviously shows
>   the highest performance.
> All 8 CPUs between 85-95% load.
>
> In this setup, my observation was that about 80% of the CPU time is
> actually spent in Postgres, and most of the database time in SELECTs
> (not INSERTs), because of checking whether a node or triple already
> exists. So the highest performance gain will be achieved by reducing
> the load on the database. There is already a quite good caching system
> in place (using EHCache); unfortunately, the caching cannot solve the
> issue of checking for non-existence (a cache can only help when
> checking for existence). This is why especially the initial import is
> comparably slow.

You are right that my test of about 1000 triples was limited, especially
with fewer resources than yours; but with the same graph and the same
resources Jena SDB offers better performance. However, I agree with you
that we will need more benchmarks. In favour of Jena, both SDB and TDB
already have a command line tool to access the database directly.

> Conceptually, when inserting a triple, the workflow currently looks as
> follows:
>
> 1. for each node of subject, predicate, object, context:
>    1.1. check for existence of the node:
>         1.1.a the node exists in the cache: return its database ID
>         1.1.b the node does not exist in the cache: look in the
>               database if it exists there (SELECT) and return its ID,
>               or null
>    1.2. if the database ID is null:
>         1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...))
>               or the sequence simulation table (MySQL: SELECT) to get
>               the next database ID and assign it to the node
>         1.2.2 store the node in the database (INSERT) and add it to
>               the cache
> 2. check for existence of the triple:
>    2.a the triple exists in the cache: return its database ID
>    2.b the triple does not exist in the cache: look in the database if
>        it exists there (SELECT) and return its ID, or null
> 3. if the triple ID is null:
>    3.1 query the sequence or the sequence simulation table (MySQL) to
>        get the next database ID for triples and assign it to the
>        triple
>    3.2 store the triple in the database (INSERT) and add it to the
>        cache
>
> So, in the worst case (i.e. all nodes are new and the triple is new,
> so nothing can be answered from the cache) you will have:
>
> - 4 INSERT commands (three nodes, 1 triple); these are comparably
>   cheap
> - 4 SELECT commands for existence checking (three nodes, 1 triple);
>   these are comparably expensive
> - 4 SELECT from sequence commands in the case of PostgreSQL or H2,
>   very cheap, or 4 SELECT from table commands in the case of MySQL,
>   comparably cheap (but not as good as a real sequence)
>
> What is even worse is that the INSERT and SELECT commands will be
> interwoven, i.e. there will be alternating SELECTs and INSERTs, which
> databases do not really like.

This workflow is for duplicate checking; from the documentation I see
that the Jena SDB loader already performs duplicate suppression.

> To optimize the performance, the best options are therefore:
>
> - avoiding alternating SELECTs and INSERTs as much as possible (e.g.
>   at least batch the node insertions for each triple)
> - avoiding the comparably expensive existence checks (e.g. another way
>   of caching/looking up that supports checking for non-existence)
>
> If bulk import then is still slow, it might make sense to look into
> the database-specific bulk loading commands you suggested.
>
> If I find some time, I might be able to look into the first
> optimization (i.e. avoiding too many alternating SELECT and INSERT
> commands). Maybe a certain improvement can already be achieved by
> optimizing this per triple.
>
> If you want to try out more sophisticated improvements or completely
> alternative ways of bulk loading, I would be very happy to see it.
> Just make sure the database schema and integrity constraints are kept
> as they are and the rest will work.
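To make the check-then-insert workflow above concrete, here is a minimal sketch of steps 1.1-3.2. It is Python with an in-memory SQLite database standing in for the real Java/PostgreSQL stack; the table and column names are illustrative, not Marmotta's actual schema, and the context node is omitted for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, value TEXT UNIQUE)")
conn.execute("CREATE TABLE triples (id INTEGER PRIMARY KEY, "
             "s INTEGER, p INTEGER, o INTEGER, UNIQUE (s, p, o))")

node_cache = {}  # value -> database ID (stand-in for the EHCache layer)

def lookup_or_create_node(value):
    # 1.1.a: node exists in the cache -> return its database ID
    if value in node_cache:
        return node_cache[value]
    # 1.1.b: not cached -> SELECT to check whether it exists in the database
    row = conn.execute("SELECT id FROM nodes WHERE value = ?",
                       (value,)).fetchone()
    if row is not None:
        node_id = row[0]
    else:
        # 1.2: new node -> INSERT; SQLite's rowid stands in for the
        # PostgreSQL/H2 sequence (SELECT nextval(...))
        node_id = conn.execute("INSERT INTO nodes (value) VALUES (?)",
                               (value,)).lastrowid
    node_cache[value] = node_id
    return node_id

def add_triple(s, p, o):
    ids = tuple(lookup_or_create_node(v) for v in (s, p, o))
    # 2/3: the same check-then-insert dance for the triple itself
    row = conn.execute("SELECT id FROM triples WHERE s = ? AND p = ? AND o = ?",
                       ids).fetchone()
    if row is None:
        conn.execute("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", ids)

add_triple("ex:s", "ex:p", "ex:o")
add_triple("ex:s", "ex:p", "ex:o")  # duplicate: suppressed by the checks
```

Note how, for every genuinely new triple, the SELECTs and INSERTs alternate per node: exactly the pattern the optimization suggestions above try to avoid.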
> The main constraint is that nodes are unique (i.e. each URI or literal
> has exactly one database row) and non-deleted triples are unique (i.e.
> each non-deleted triple has exactly one database ID).
>
> > > The Jena SDB bulk loader may have some ideas you can draw on. It
> > > bulk loads a chunk (typically 10K) of triples at a time and uses
> > > DB temporary tables as working storage. The database layout is a
> > > triples+nodes layout. SDB manipulates those tables in the DB to
> > > find new nodes to add to the node table and new triples to add to
> > > the triples table as single SQL operations. The original designer
> > > may be around on [email protected].
> >
> > This design looks interesting and it seems to be a similar approach
> > to my idea; it could be investigated. In that case, can we think
> > about using Jena SDB in Marmotta?
>
> This could be implemented by wrapping Jena SDB (or also TDB) in a
> Sesame Sail, and actually there is already an issue for this in Jira.
> However, when doing this you will lose support for the Marmotta/KiWi
> reasoner and versioning.

That would be a good approach, since it avoids too much refactoring. For
the KiWi reasoner and versioning, a move to the Jena RDF API could be
needed.

> My suggestion would instead be to look at how Jena SDB implements the
> bulk import and try a similar solution.

We could implement the same approach from scratch (with queue, chunks
and threads) and, combined with JDBC batch processing, we would obtain a
better result; but doesn't it make more sense to try to use the already
implemented solution directly?

> But if we start with the optimizations I have already suggested, there
> might be a huge gain already. It just has not been in our focus right
> now, because the scenarios we are working on do not require
> bulk-loading huge amounts of data. Data consistency and parallel
> access were more important to us.
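As a rough sketch of the SDB-style chunked load Andy describes (again in Python/SQLite only for self-containedness; Marmotta itself is Java on PostgreSQL/MySQL/H2, and all names here are illustrative): the point is that per-triple existence checks become a couple of set-based statements per chunk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
    CREATE TABLE triples (id INTEGER PRIMARY KEY,
                          s INTEGER, p INTEGER, o INTEGER, UNIQUE (s, p, o));
    CREATE TEMPORARY TABLE staging (s TEXT, p TEXT, o TEXT);
""")

def bulk_load(chunk):
    """Load one chunk of (s, p, o) string triples with set-based SQL."""
    # 1. dump the whole chunk into the working table in a single batch
    conn.executemany("INSERT INTO staging (s, p, o) VALUES (?, ?, ?)", chunk)
    # 2. one set-based INSERT adds every node not yet in the node table
    #    (UNION already de-duplicates the nodes within the chunk)
    conn.execute("""
        INSERT INTO nodes (value)
        SELECT v FROM (SELECT s AS v FROM staging
                       UNION SELECT p FROM staging
                       UNION SELECT o FROM staging)
        WHERE v NOT IN (SELECT value FROM nodes)
    """)
    # 3. one join resolves node IDs and inserts the new triples;
    #    OR IGNORE suppresses duplicates via the UNIQUE constraint
    conn.execute("""
        INSERT OR IGNORE INTO triples (s, p, o)
        SELECT n1.id, n2.id, n3.id
        FROM staging
          JOIN nodes n1 ON n1.value = staging.s
          JOIN nodes n2 ON n2.value = staging.p
          JOIN nodes n3 ON n3.value = staging.o
    """)
    conn.execute("DELETE FROM staging")
    conn.commit()

bulk_load([("ex:a", "ex:p", "ex:b"),
           ("ex:b", "ex:p", "ex:c"),
           ("ex:a", "ex:p", "ex:b")])  # duplicate inside the chunk
```

Here `executemany` plays the role a JDBC batch would play on our side; in a real implementation the chunk size (SDB reportedly uses around 10K triples) and the temporary table handling would need tuning per database.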
> But it would be a nice feature to be able to run a local copy of
> GeoNames or DBPedia using Marmotta ;-)

Yes, it would be nice :)

> Greetings,
>
> Sebastian

Greetings,
Raffaele.
