Re: TDBLoader2 Performance on Empty vs Existing Store (WAS: Import Messures)

Rob Vesse Fri, 22 Jun 2012 09:22:46 -0700

Off the top of my head, I believe loading into an empty database is always
faster because of the way the loader generates the index files and node tables.
When loading into an existing dataset it tends to be slower because it has
to add to the existing files rather than generating them from scratch.
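
As a toy illustration of that difference (an analogy only, not how TDB is
implemented; the function names are made up for this sketch), compare
building a sorted index in one pass with adding keys one at a time to an
index that already holds data:

```python
import bisect
import random


def build_from_scratch(keys):
    """Bulk build: sort all keys once and emit them in order."""
    return sorted(keys)


def add_to_existing(index, keys):
    """Incremental build: each key must find its place among the keys
    already present, and later entries are shifted to make room."""
    for key in keys:
        bisect.insort(index, key)
    return index


random.seed(0)
keys = random.sample(range(1_000_000), 6_000)   # distinct hypothetical keys
existing = sorted(keys[:5_000])                 # the "already filled" index
incoming = keys[5_000:]                         # the new data to load

fresh = build_from_scratch(incoming)
merged = add_to_existing(existing.copy(), incoming)
```

Both paths end up with a correctly sorted index. The point is only that
the second one does per-key work that depends on how much data is already
there, while the first one does not.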


Andy/Paolo would be better placed to comment on this. I changed the
subject so they'll be more likely to notice this thread.

Rob

Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941  |  o: 925.264.4729  |  @: rve...@yarcdata.com
Skype: rvesse
6210 Stoneridge Mall Rd  |  Suite 120  | Pleasanton CA, 94588


On 6/22/12 8:24 AM, "Paul Gearon" <gea...@ieee.org> wrote:

>Without knowing much about TDB architecture I can still describe a
>couple of things.
>
>One of the most important aspects of speed of indexing and size of the
>resulting store is the shape of the data. Some data sets have many
>unique resources, meaning that there are lots of URIs and unique
>string literals. Other data sets can have many more triples, but each
>URI and string is re-used a lot. This is both faster to index and
>results in a much smaller index.
>
>Some indexes can also index strings for fast searching, which can have
>its own effects. I don't know if TDB does anything interesting there,
>but this is another area where shape can have an impact.
>
>Finally, the type of work done during indexing can lead to files being
>accessed with a totally different pattern, again depending on the
>shape of the data. This can mean that operations which are fast under
>some circumstances can slow right down in others (due to seeking,
>write contention, and other vagaries of the disk system). I'm not
>saying that this is what made loading so much slower in your second
>run while indexing stayed the same, but it's a common enough
>occurrence that I'm not shocked to see it.
>
>Also, did you ensure that you had nothing else going on during either
>load operation? It can be difficult to benchmark these things in
>modern operating systems, due to the number of simultaneous tasks
>which are necessarily running. My own desktop invariably starts
>backing up the hard drive whenever I try to time something.  :-)
>
>I look forward to hearing a response from the TDB developers with
>their opinions.
>
>Regards,
>Paul
>
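
One rough way to get a feel for the "shape" Paul describes is to compare
the number of triples with the number of distinct terms. The sketch below
assumes well-formed, one-triple-per-line N-Triples input and uses a naive
whitespace split rather than a real parser; `term_reuse` and the example
URIs are made up for illustration.

```python
def term_reuse(lines):
    """Return (triple_count, distinct_term_count) for N-Triples-like lines.

    Naive: splits on whitespace into subject, predicate and the rest,
    then strips the terminating ' .' from the object.
    """
    triples = 0
    terms = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject, predicate, rest = line.split(None, 2)
        obj = rest.rstrip()
        if obj.endswith("."):
            obj = obj[:-1].rstrip()
        terms.update((subject, predicate, obj))
        triples += 1
    return triples, len(terms)


sample = [
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .',
    '<http://example.org/b> <http://example.org/knows> <http://example.org/a> .',
    '<http://example.org/a> <http://example.org/name> "Alice" .',
]
print(term_reuse(sample))  # (3, 5)
```

A data set where the second number stays small relative to the first
re-uses its URIs and strings heavily. One where the two grow together has
many unique resources.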
>On Fri, Jun 22, 2012 at 9:05 AM, Stefan Scheffler
><sscheff...@avantgarde-labs.de> wrote:
>> Hello,
>> At the moment I am doing some performance checks on TDB. The first thing I
>> checked was the import with tdbloader2, and I got some weird results.
>> Maybe someone can help me out. Here are my test setup and the results.
>>
>> The first test was to store 12 GB of triples into an empty store (I used the
>> German DBpedia).
>>
>> Load time: 16 minutes
>> average loading: ca. 81,000 triples / second
>> index time: 40 minutes
>> store size: 9.3 GB
>>
>>
>> The second test was to store the same data into an already filled store.
>> Before starting the import I created a store with 348,398,593 triples from
>> DNB and HBZ (which are German libraries, store size: 33 GB).
>> Then I started to load the German DBpedia in.
>>
>> Load time: 3 hours and 4 minutes
>> average loading: ca. 7,200 triples / second
>> index time: 38 minutes
>> store size: 19 GB!!!!!
>>
>> Why does the loading time increase that immensely? My expectation was that
>> the index time would increase, but it does not. There were no other big
>> processes running at the same time. And why does the store size shrink to
>> 19 GB? I am totally confused about that point.
>>
>> With friendly regards
>> Stefan
>>
>> --
>> Stefan Scheffler
>> Avantgarde Labs GbR
>> Löbauer Straße 19, 01099 Dresden
>> Telefon: + 49 (0) 351 21590834
>> Email: sscheff...@avantgarde-labs.de
>>
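
As a quick sanity check on the figures quoted above, the load times and
the approximate load rates can be multiplied out to see how many triples
each run implies. The rates are given as "ca", so the results are rough;
`triples_loaded` is just a helper name for this sketch.

```python
def triples_loaded(hours, minutes, rate_per_second):
    """Triples implied by a load time and an average load rate."""
    seconds = (hours * 60 + minutes) * 60
    return seconds * rate_per_second


empty_store = triples_loaded(0, 16, 81_000)    # 16 min at ~81,000/s
existing_store = triples_loaded(3, 4, 7_200)   # 3 h 4 min at ~7,200/s

time_ratio = (3 * 60 + 4) / 16                 # load time, second run / first run
rate_ratio = 81_000 / 7_200                    # load rate, first run / second run

print(empty_store)      # 77760000
print(existing_store)   # 79488000
print(time_ratio)       # 11.5
print(rate_ratio)       # 11.25
```

Both runs work out to roughly 78 to 79 million triples, which fits the
same data set being loaded twice. The load phase was about eleven times
slower on the existing store, while the reported index time barely moved
(40 versus 38 minutes).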
