On 22 May 2013 15:04, Andy Seaborne <[email protected]> wrote:

> What is the current loading rate?
Tried a test with a graph of 661 nodes and 957 triples: it took about 18
seconds, so the average rate is about 18.8 ms per triple; tested on Tomcat
with a maximum memory size of 1.5 GB.

> The Jena SDB bulk loader may have some ideas you can draw on. It bulk
> loads a chunk (typically 10K) of triples at a time, using DB temporary
> tables as working storage. The database layout is a triples+nodes
> database layout. SDB manipulates those tables in the DB to find new
> nodes to add to the node table and new triples to add to the triples
> table as single SQL operations. The original designer may be around on
> [email protected]

This design looks interesting, and it seems to be a similar approach to my
idea; it could be investigated. In that case, could we think about using
Jena SDB in Marmotta?

>         Andy

Cheers,
Raffaele.

> On 22/05/13 09:00, Sebastian Schaffert wrote:
>
>> Hi Raffaele,
>>
>> thanks for your ideas. I have been spending a lot of time thinking on
>> how to improve the performance of bulk imports. There are currently
>> several reasons why a bulk import is slow:
>> 1) Marmotta uses (database) transactions to ensure good behaviour and
>> consistent data in highly parallel environments; transactions, however,
>> introduce a big performance impact, especially when they get long
>> (because the database needs to keep a journal and merge it at the end)
>> 2) Marmotta needs to check, before creating a node or triple, whether
>> this node or triple already exists, because you don't want duplicates
>> 3) Marmotta needs to issue a single SQL command for every inserted
>> triple (because of 2)
>>
>> 3) could be addressed as you say, but even the Java JDBC API offers
>> "batch commands" that would improve performance, i.e. if you manage to
>> run the same statement in a sequence many times, the performance will
>> be greatly optimized. Unfortunately, I was not able to do this because
>> I don't have a good solution for 2).
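The batching idea mentioned above can be illustrated with a minimal sketch. Marmotta itself is Java, where this would be `PreparedStatement.addBatch()`/`executeBatch()` over JDBC; for a self-contained illustration the same idea is shown here with Python's stdlib `sqlite3`, whose `executemany` plays the role of a JDBC batch. The table and column names are made up for the example, not Marmotta's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

triples = [
    ("ex:s1", "ex:p", "ex:o1"),
    ("ex:s2", "ex:p", "ex:o2"),
    ("ex:s3", "ex:p", "ex:o3"),
]

# One prepared statement, many parameter sets: the driver can process the
# whole batch at once instead of issuing one SQL command per triple.
conn.executemany(
    "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)",
    triples,
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
print(count)  # 3
```

As Sebastian notes, batching only pays off once the existence check (problem 2) is solved, since a SELECT interleaved before each INSERT breaks the batch.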
>> 3) depends on 2) because for every inserted triple I need to check if
>> the nodes already exist, so there will be select statements before the
>> insert statements.
>>
>> 2) is a really tricky issue, because the check is needed to ensure data
>> integrity. I have been thinking about different options here. Keep in
>> mind that two tables are affected (triples and nodes) and both need to
>> be handled in a different way:
>> - if you know that the *triples* do not yet exist (e.g. empty database
>> or the user assures that they do not exist) you can avoid the check for
>> triple existence, but the node check is still needed because several
>> triples might refer to the same node
>> - if the dataset is reasonably small, you can implement the node check
>> using an in-memory hashtable, which would be very fast; unfortunately
>> you don't know this in advance, and once a node exists the Marmotta
>> caching backend takes care of it anyway as long as Marmotta has memory,
>> so the expensive part is checking for non-existence rather than for
>> existence
>> - you could also implement a persistent hash map (like MapDB) to keep
>> track of the node ids, but I doubt it would give you much benefit over
>> the database lookup once the dataset is big
>> Even if you implement this solution, you would need a two-pass import
>> to achieve bulk-load behaviour in the database, because two tables are
>> affected, i.e. in the first pass you would import only the nodes, and
>> in the second pass only the triples.
>>
>> Another possibility is to relax the data integrity constraints a bit
>> (e.g. allowing the same node to exist with different IDs), but I cannot
>> foresee the consequences of such a choice - it is against the data
>> model.
>>
>>
>> 1) is easy to solve by putting Marmotta in some kind of "maintenance
>> mode", i.e. when bulk importing there is an exclusive lock on the
>> database for the import process.
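The two-pass idea above (deduplicate nodes via an in-memory hashtable, bulk-insert the nodes, then bulk-insert the triples by node id) can be sketched as follows. Again `sqlite3` stands in for Marmotta's real JDBC backend, and the schema is a made-up miniature of a triples+nodes layout, not Marmotta's actual one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes   (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
    CREATE TABLE triples (s INTEGER, p INTEGER, o INTEGER);
""")

raw_triples = [
    ("ex:s1", "ex:p", "ex:o"),
    ("ex:s2", "ex:p", "ex:o"),   # shares two nodes with the first triple
]

# Pass 1: collect distinct node values in memory (the hashtable check),
# bulk-insert them, then read back the id the database assigned to each.
node_ids = {}
for t in raw_triples:
    for value in t:
        node_ids.setdefault(value, None)
conn.executemany("INSERT INTO nodes (value) VALUES (?)",
                 [(v,) for v in node_ids])
for value, node_id in conn.execute("SELECT value, id FROM nodes"):
    node_ids[value] = node_id

# Pass 2: bulk-insert triples, now referring to nodes by id only,
# with no per-triple SELECT needed.
conn.executemany("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)",
                 [tuple(node_ids[v] for v in t) for t in raw_triples])
conn.commit()

n_nodes = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
n_triples = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
print(n_nodes, n_triples)  # 4 2
```

Against a non-empty database, the first pass would need to merge with existing nodes (e.g. via a temporary table and an `INSERT ... SELECT ... WHERE NOT EXISTS`, roughly the set-operation approach Andy describes for SDB), which is where the hard part of problem 2 lives.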
>> Another (similar) solution is to provide a separate command-line tool
>> for importing into a database while Marmotta is not running at all.
>>
>>
>> The solution I was going to implement as a result of this thinking is
>> as follows:
>> - a separate command-line tool that accesses the database directly
>> - when importing, all nodes and triples are first created only
>> in-memory and stored in standard Java data structures (or in a
>> temporary log on the file system)
>> - when the import is finished, first all nodes are bulk-inserted and
>> the Java objects get IDs
>> - second, all triples are bulk-imported with the proper node ids
>>
>>
>> If you want to try out different solutions, I'd be happy if this
>> problem can be solved ;-)
>>
>>
>> Greetings,
>>
>> Sebastian
>>
>>
>> 2013/5/21 Raffaele Palmieri <[email protected]>
>>
>>> Hi to all,
>>> I would like to propose a little change to the architecture of the
>>> Importer Service. Currently, for every triple there are single SQL
>>> commands, invoked from SailConnectionBase, that persist the triple
>>> information in the DB. That's probably one of the major causes of the
>>> slowness of the import operation.
>>> I thought of a way to optimize that operation, for example building a
>>> csv, tsv, or *sv file that most RDBMSs are able to import in an
>>> optimized way.
>>> For example, MySQL has the Load Data Infile command, PostgreSQL has
>>> the Copy command, and H2 has Insert into ... Select from Csvread.
>>> I am checking whether this modification is feasible; it will surely
>>> need a specialization of the SQL dialect depending on the RDBMS used.
>>> What do you think about it? Would it have too much impact?
>>> Regards,
>>> Raffaele.
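Raffaele's *sv proposal amounts to: serialize the triples to a delimiter-separated file, then hand it to each database's native bulk loader. A minimal sketch of the dialect-specific part follows (Python for illustration; the three loader commands are the ones named in the thread, while the table name `triples` and the file layout are hypothetical):

```python
import csv
import os
import tempfile

triples = [("ex:s1", "ex:p", "ex:o1"), ("ex:s2", "ex:p", "ex:o2")]

# Write the triples to a TSV file that a native bulk loader can consume.
fd, path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(triples)

# Dialect-specific bulk-load commands, as mentioned in the thread.
# PostgreSQL COPY and MySQL LOAD DATA INFILE both default to
# tab-separated input; H2 needs the separator spelled out for CSVREAD.
commands = {
    "postgresql": f"COPY triples FROM '{path}'",
    "mysql":      f"LOAD DATA INFILE '{path}' INTO TABLE triples",
    "h2":         (f"INSERT INTO triples SELECT * FROM "
                   f"CSVREAD('{path}', NULL, 'fieldSeparator=\\t')"),
}

with open(path) as f:
    n_rows = len(f.read().splitlines())
print(n_rows)  # 2
```

This sidesteps per-triple SQL entirely, but note it inherits both of Sebastian's open problems: the file must already contain node ids (or values deduplicated in a prior pass), and the loaders generally assume the rows do not yet exist.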
