Re: Very slow tdbloader2 insertion

2017-04-17 Thread Laura Morales
> tdbloader2 builds b+trees from bottom to top, given sorted input. As
> such blocks are streamed to disk which is disk-efficient.
> 
> It is a series of java programs scripted together by a shell script.
> 
> tdbloader is pure java. It builds the b+trees by inserting, which for
> some indexes is not optimal because it causes random inserts leading to
> random I/O, which is bad for disk performance.
> 
> Andy


But why is tdbloader better for smaller datasets, whereas tdbloader2 is better 
for very large datasets ("100M+ triples")? Wouldn't the approach of tdbloader2 
be superior in all cases?


Re: tdbloader skip bad file

2017-04-17 Thread Laura Morales
> Check the data before loading.
> 
> This is generally good practice.
> 
> Call "riot --validate" before loading to check each file.


Let's say I've downloaded these RDF files [1]. Some of those files are broken. 
How can I check-and-load all those files with a bash script? Should I loop over 
all the files, call riot on each one individually, then parse the riot output 
for each file?

[1] https://svn.apache.org/repos/asf/comdev/projects.apache.org/data/projects.xml
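
A rough sketch of such a check-and-load script follows. It validates each file with "riot --validate" and then passes only the files that passed to a single tdbloader call, so nothing needs to be parsed from the riot output. It assumes that riot exits with a non-zero status when a file fails validation; the function name load_valid_files and the paths and graph IRI in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Validate every candidate file, then load the good ones in one go.
# $1 = TDB directory, $2 = graph IRI, remaining arguments = RDF files
# Assumption: "riot --validate FILE" exits non-zero if FILE is invalid.
load_valid_files() {
    local loc="$1" graph="$2" f
    shift 2
    # Rebuild the argument list in place: drop each file from the front
    # and re-append it at the end only if it validates.
    for f in "$@"; do
        shift
        if riot --validate "$f" >/dev/null 2>&1; then
            set -- "$@" "$f"
        else
            echo "skipping invalid file: $f" >&2
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "no valid files to load" >&2
        return 1
    fi
    # One tdbloader call for all remaining files.
    tdbloader --quiet --graph="$graph" --loc="$loc" "$@"
}

# Usage (hypothetical values):
# load_valid_files /data/tdb http://example.org/projects data/*.rdf
```

Loading everything in one call, rather than one call per file, matters because loading into an empty database is reported in this thread to be much faster than adding to an existing one.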


Re: Delete/Insert single graph in dataset

2017-04-17 Thread Andy Seaborne



On 16/04/17 18:05, A. Soroka wrote:
> To load, yes. Just use the --graph=IRI (Act on a named graph) switch.
> 
> You'll find all the useful switches by executing tdbloader --help.
> 
> Neither of the loaders will delete anything, ever. I believe that tdbquery can 
> execute SPARQL Update, which you could use for the purpose. If your database is 
> supporting a Fuseki instance, you can use the Graph Store protocol:

tdbupdate for SPARQL Update.

> https://www.w3.org/TR/sparql11-http-rdf-update
> 
> and Fuseki includes convenient command-line scripts:
> 
> https://jena.apache.org/documentation/fuseki2/soh.html
> 
> in this case, s-delete.
> 
> ---
> A. Soroka
> The University of Virginia Library
> 
>> On Apr 16, 2017, at 12:54 PM, Laura Morales wrote:
>> 
>> Can I use tdbloader to load or delete a single graph from a dataset? Or maybe 
>> some other command line tool?
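
As a sketch of how the pieces mentioned above could fit together on the command line: the function below drops one named graph with tdbupdate and reloads it with tdbloader --graph. The function name replace_graph and all paths, IRIs and the Fuseki URL are placeholders, and passing the update request as a plain string argument to tdbupdate is an assumption about its command-line handling (a request file could be used instead).

```shell
#!/bin/sh
# Replace a single named graph in a TDB database: drop it, then reload it.
# $1 = TDB directory, $2 = graph IRI, $3 = RDF file (all placeholders)
replace_graph() {
    local loc="$1" graph="$2" file="$3"
    # SILENT: do not report an error if the graph does not exist yet.
    tdbupdate --loc="$loc" "DROP SILENT GRAPH <$graph>" &&
        tdbloader --loc="$loc" --graph="$graph" "$file"
}

# Usage (hypothetical values):
# replace_graph /data/tdb http://example.org/graph1 graph1.rdf
#
# If the database is served by Fuseki, the SOH script does the delete
# over the Graph Store protocol instead (hypothetical dataset URL):
# s-delete http://localhost:3030/ds/data http://example.org/graph1
```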




Re: Very slow tdbloader2 insertion

2017-04-17 Thread Andy Seaborne
tdbloader2 builds b+trees from bottom to top, given sorted input.  As 
such blocks are streamed to disk which is disk-efficient.


It is a series of java programs scripted together by a shell script.

tdbloader is pure java.  It builds the b+trees by inserting, which for 
some indexes is not optimal because it causes random inserts leading to 
random I/O, which is bad for disk performance.


Andy



On 15/04/17 22:13, A. Soroka wrote:

To start with, tdbloader2 uses the assumption that the tuples are sorted 
(actually, it sorts them, then uses that assumption) as described in this old 
blog post of Andy's:

https://seaborne.blogspot.com/2010/12/repacking-btrees.html

That's one reason that you only want to use tdbloader2 to start from scratch. 
Andy, of course, can say more.

---
A. Soroka
The University of Virginia Library


On Apr 15, 2017, at 2:58 PM, Laura Morales  wrote:


Use tdbloader for 10M quads.


I wonder how tdbloader is technically different from tdbloader2. What makes 
tdbloader more suited for small/medium datasets and tdbloader2 more suited for 
very large datasets? Do they implement different insertion algorithms?
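
Since tdbloader2 is only meant for starting from scratch, a load script could guard against using it on an existing database. The following is a minimal sketch of such a guard; the function name can_use_tdbloader2 is made up, and treating "directory missing or empty" as "no database yet" is an assumption, not something the loaders themselves document here.

```shell
#!/bin/sh
# Succeed only if the TDB location does not hold a database yet.
# Assumption: a missing or empty directory means "start from scratch".
can_use_tdbloader2() {
    local loc="$1"
    # No directory at all: nothing has been loaded yet.
    [ -d "$loc" ] || return 0
    # Directory exists: it must contain no files (including dotfiles).
    [ -z "$(ls -A "$loc" 2>/dev/null)" ]
}

# Usage (hypothetical values): bulk-build from sorted input when the
# location is empty, otherwise fall back to inserting with tdbloader.
# if can_use_tdbloader2 /data/tdb; then
#     tdbloader2 --loc=/data/tdb big.nt
# else
#     tdbloader --loc=/data/tdb big.nt
# fi
```

Whether tdbloader2 is actually the faster choice for an empty database still depends on the data size, as discussed in this thread.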




Re: tdbloader skip bad file

2017-04-17 Thread Andy Seaborne



On 17/04/17 19:46, Laura Morales wrote:

I'm trying to tdbload several .rdf files like this

$ tdbloader --quiet --graph=... --loc=...   
 ...

The problem is that if one file raises an exception (e.g. a bad IRI), the whole bunch is 
dropped, and no triples are loaded from any file. I've tried calling tdbloader 
for each file, but it seems significantly slower.


Yes.

If the database is empty, tdbloader can use its optimized loading; 
otherwise it has to add the data with special care as to index creation, 
which is much less optimal.



Is there some command line argument that I can use to tell tdbloader to skip 
bad .rdf files, but keep loading the good ones?


Check the data before loading.

This is generally good practice.

Call "riot --validate" before loading to check each file.

Andy


tdbloader skip bad file

2017-04-17 Thread Laura Morales
I'm trying to tdbload several .rdf files like this

$ tdbloader --quiet --graph=... --loc=...   
 ...

The problem is that if one file raises an exception (e.g. a bad IRI), the whole bunch is 
dropped, and no triples are loaded from any file. I've tried calling tdbloader 
for each file, but it seems significantly slower.
Is there some command line argument that I can use to tell tdbloader to skip 
bad .rdf files, but keep loading the good ones?