On 22 May 2013 15:04, Andy Seaborne <[email protected]> wrote:

> What is the current loading rate?
Tried a test with a graph of 661 nodes and 957 triples: it took about 18
seconds, so the average rate is about 18.8 ms per triple; tested on Tomcat
with a maximum memory size of 1.5 GB.

> The Jena SDB bulk loader may have some ideas you can draw on. It bulk
> loads a chunk (typically 10K) of triples at a time, using DB temporary
> tables as working storage. The database layout is a triples+nodes
> database layout. SDB manipulates those tables in the DB to find new
> nodes to add to the node table and new triples to add to the triples
> table as single SQL operations. The original designer may be around on
> [email protected]

This design looks interesting, and it seems to be a similar approach to my
idea; it could be investigated. In that case, could we think about using
Jena SDB in Marmotta?

>         Andy

Cheers,
Raffaele.

> On 22/05/13 09:00, Sebastian Schaffert wrote:
>
>> Hi Raffaele,
>>
>> thanks for your ideas. I have been spending a lot of time thinking on
>> how to improve the performance of bulk imports. There are currently
>> several reasons why a bulk import is slow:
>> 1) Marmotta uses (database) transactions to ensure good behaviour and
>> consistent data in highly parallel environments; transactions, however,
>> introduce a big performance impact, especially when they get long
>> (because the database needs to keep a journal and merge it at the end)
>> 2) Marmotta needs to check, before creating a node or triple, whether
>> this node or triple already exists, because you don't want duplicates
>> 3) Marmotta needs to issue a single SQL command for every inserted
>> triple (because of 2)
>>
>> 3) could be addressed as you say, but even the Java JDBC API offers
>> "batch commands" that would improve performance, i.e. if you manage to
>> run the same statement in a sequence many times, the performance will
>> be greatly optimized. Unfortunately, I was not able to do this because
>> I don't have a good solution for 2).
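The batching idea mentioned above can be illustrated with a minimal sketch. Marmotta itself is Java, where this would be `PreparedStatement.addBatch()`/`executeBatch()` over JDBC; for a self-contained illustration the same idea is shown here with Python's stdlib `sqlite3`, whose `executemany` plays the role of a JDBC batch. The table and column names are made up for the example, not Marmotta's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

triples = [
    ("ex:s1", "ex:p", "ex:o1"),
    ("ex:s2", "ex:p", "ex:o2"),
    ("ex:s3", "ex:p", "ex:o3"),
]

# One prepared statement, many parameter sets: the driver can process the
# whole batch at once instead of issuing one SQL command per triple.
conn.executemany(
    "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)",
    triples,
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
print(count)  # 3
```

As Sebastian notes, batching only pays off once the existence check (problem 2) is solved, since a SELECT interleaved before each INSERT breaks the batch.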
>> 3) depends on 2) because for every inserted triple I need to check if
>> the nodes already exist, so there will be select statements before the
>> insert statements.
>>
>> 2) is a really tricky issue, because the check is needed to ensure data
>> integrity. I have been thinking about different options here. Keep in
>> mind that two tables are affected (triples and nodes) and both need to
>> be handled in a different way:
>> - if you know that the *triples* do not yet exist (e.g. empty database
>> or the user assures that they do not exist) you can avoid the check for
>> triple existence, but the node check is still needed because several
>> triples might refer to the same node
>> - if the dataset is reasonably small, you can implement the node check
>> using an in-memory hashtable, which would be very fast; unfortunately
>> you don't know this in advance, and once a node exists the Marmotta
>> caching backend takes care of it anyway as long as Marmotta has memory,
>> so the expensive part is checking for non-existence rather than for
>> existence
>> - you could also implement a persistent hash map (like MapDB) to keep
>> track of the node ids, but I doubt it would give you much benefit over
>> the database lookup once the dataset is big
>> Even if you implement this solution, you would need a two-pass import
>> to achieve bulk-load behaviour in the database, because two tables are
>> affected, i.e. in the first pass you would import only the nodes, and
>> in the second pass only the triples.
>>
>> Another possibility is to relax the data integrity constraints a bit
>> (e.g. allowing the same node to exist with different IDs), but I cannot
>> foresee the consequences of such a choice - it is against the data
>> model.
>>
>>
>> 1) is easy to solve by putting Marmotta in some kind of "maintenance
>> mode", i.e. when bulk importing there is an exclusive lock on the
>> database for the import process.
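The two-pass idea above (deduplicate nodes via an in-memory hashtable, bulk-insert the nodes, then bulk-insert the triples by node id) can be sketched as follows. Again `sqlite3` stands in for Marmotta's real JDBC backend, and the schema is a made-up miniature of a triples+nodes layout, not Marmotta's actual one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes   (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
    CREATE TABLE triples (s INTEGER, p INTEGER, o INTEGER);
""")

raw_triples = [
    ("ex:s1", "ex:p", "ex:o"),
    ("ex:s2", "ex:p", "ex:o"),   # shares two nodes with the first triple
]

# Pass 1: collect distinct node values in memory (the hashtable check),
# bulk-insert them, then read back the id the database assigned to each.
node_ids = {}
for t in raw_triples:
    for value in t:
        node_ids.setdefault(value, None)
conn.executemany("INSERT INTO nodes (value) VALUES (?)",
                 [(v,) for v in node_ids])
for value, node_id in conn.execute("SELECT value, id FROM nodes"):
    node_ids[value] = node_id

# Pass 2: bulk-insert triples, now referring to nodes by id only,
# with no per-triple SELECT needed.
conn.executemany("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)",
                 [tuple(node_ids[v] for v in t) for t in raw_triples])
conn.commit()

n_nodes = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
n_triples = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
print(n_nodes, n_triples)  # 4 2
```

Against a non-empty database, the first pass would need to merge with existing nodes (e.g. via a temporary table and an `INSERT ... SELECT ... WHERE NOT EXISTS`, roughly the set-operation approach Andy describes for SDB), which is where the hard part of problem 2 lives.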
>> Another (similar) solution is to provide a separate command-line tool
>> for importing into a database while Marmotta is not running at all.
>>
>>
>> The solution I was going to implement as a result of this thinking is
>> as follows:
>> - a separate command-line tool that accesses the database directly
>> - when importing, all nodes and triples are first created only
>> in-memory and stored in standard Java data structures (or in a
>> temporary log on the file system)
>> - when the import is finished, first all nodes are bulk-inserted and
>> the Java objects get IDs
>> - second, all triples are bulk-imported with the proper node ids
>>
>>
>> If you want to try out different solutions, I'd be happy if this
>> problem can be solved ;-)
>>
>>
>> Greetings,
>>
>> Sebastian
>>
>>
>> 2013/5/21 Raffaele Palmieri <[email protected]>
>>
>>> Hi to all,
>>> I would like to propose a little change to the architecture of the
>>> Importer Service. Currently, for every triple there are single SQL
>>> commands, invoked from SailConnectionBase, that persist the triple
>>> information in the DB. That's probably one of the major causes of the
>>> slowness of the import operation.
>>> I thought of a way to optimize that operation, for example building a
>>> csv, tsv, or *sv file that most RDBMSs are able to import in an
>>> optimized way.
>>> For example, MySQL has the Load Data Infile command, PostgreSQL has
>>> the Copy command, and H2 has Insert into ... Select from Csvread.
>>> I am checking whether this modification is feasible; it will surely
>>> need a specialization of the SQL dialect depending on the RDBMS used.
>>> What do you think about it? Would it have too much impact?
>>> Regards,
>>> Raffaele.
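Raffaele's *sv proposal amounts to: serialize the triples to a delimiter-separated file, then hand it to each database's native bulk loader. A minimal sketch of the dialect-specific part follows (Python for illustration; the three loader commands are the ones named in the thread, while the table name `triples` and the file layout are hypothetical):

```python
import csv
import os
import tempfile

triples = [("ex:s1", "ex:p", "ex:o1"), ("ex:s2", "ex:p", "ex:o2")]

# Write the triples to a TSV file that a native bulk loader can consume.
fd, path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(triples)

# Dialect-specific bulk-load commands, as mentioned in the thread.
# PostgreSQL COPY and MySQL LOAD DATA INFILE both default to
# tab-separated input; H2 needs the separator spelled out for CSVREAD.
commands = {
    "postgresql": f"COPY triples FROM '{path}'",
    "mysql":      f"LOAD DATA INFILE '{path}' INTO TABLE triples",
    "h2":         (f"INSERT INTO triples SELECT * FROM "
                   f"CSVREAD('{path}', NULL, 'fieldSeparator=\\t')"),
}

with open(path) as f:
    n_rows = len(f.read().splitlines())
print(n_rows)  # 2
```

This sidesteps per-triple SQL entirely, but note it inherits both of Sebastian's open problems: the file must already contain node ids (or values deduplicated in a prior pass), and the loaders generally assume the rows do not yet exist.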
