Re: Very slow tdbloader2 insertion
> tdbloader2 builds b+trees from bottom to top, given sorted input. As
> such blocks are streamed to disk which is disk-efficient.
>
> It is a series of java programs scripted together by a shell script.
>
> tdbloader is pure java. It builds the b+trees by inserting, which for
> some indexes is not optimal because it causes random inserts leading to
> random I/O, which is bad for disk performance.
>
> Andy

But why is tdbloader better for smaller datasets, whereas tdbloader2 is better for very large datasets ("100M+ triples")? Wouldn't the approach of tdbloader2 be superior in all cases?
Re: tdbloader skip bad file
> Check the data before loading.
>
> This is generally good practice.
>
> Call "riot --validate" before loading to check each file.

Let's say I've downloaded these RDF files [1]. Some of those files are broken. How can I check and load all those files with a bash script? Should I loop over all the files, call riot on each one individually, then parse the riot output for each file?

[1] https://svn.apache.org/repos/asf/comdev/projects.apache.org/data/projects.xml
Re: Delete/Insert single graph in dataset
On 16/04/17 18:05, A. Soroka wrote:
> To load, yes. Just use the --graph=IRI (Act on a named graph) switch.
> You'll find all the useful switches by executing tdbloader --help.
>
> Neither of the loaders will delete anything, ever. I believe that
> tdbquery can execute SPARQL Update, which you could use for the
> purpose. If your database is supporting a Fuseki instance, you can use
> the Graph Store protocol:

tdbupdate for SPARQL Update.

> https://www.w3.org/TR/sparql11-http-rdf-update
>
> and Fuseki includes convenient command-line scripts:
>
> https://jena.apache.org/documentation/fuseki2/soh.html

in this case, s-delete.

> ---
> A. Soroka
> The University of Virginia Library
>
>> On Apr 16, 2017, at 12:54 PM, Laura Morales wrote:
>>
>> Can I use tdbloader to load or delete a single graph from a dataset?
>> Or maybe some other command line tool?
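
As a rough illustration of the advice above, replacing the contents of one named graph could be scripted as a SPARQL Update followed by a load. This is only a sketch: the function name `replace_graph`, the database directory, and the graph IRI are hypothetical, and it assumes `tdbupdate` and `tdbloader` are on the PATH and that `tdbupdate` accepts the update string as an argument.

```sh
#!/bin/sh
# Sketch: drop one named graph, then reload it from files.
# replace_graph is a hypothetical helper, not part of Jena.
# Usage: replace_graph DB_DIR GRAPH_IRI FILE...
replace_graph() {
    db=$1
    graph=$2
    shift 2
    # The loaders never delete, so remove the old graph with SPARQL Update.
    # SILENT keeps the DROP from failing if the graph does not exist yet.
    tdbupdate --loc="$db" "DROP SILENT GRAPH <$graph>" || return 1
    # Load the remaining arguments (the files) into the same named graph.
    tdbloader --loc="$db" --graph="$graph" "$@"
}

# Example call (hypothetical paths and IRI):
#   replace_graph /data/tdb http://example/graph new-data.ttl
#
# With a running Fuseki server, the delete step would instead go through
# the Graph Store protocol, e.g. with the s-delete script
# (the dataset URL here is an assumption):
#   s-delete http://localhost:3030/ds/data http://example/graph
```

Dropping first and loading second means the graph is briefly empty; whether that matters depends on how the database is being used.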
Re: Very slow tdbloader2 insertion
tdbloader2 builds b+trees from bottom to top, given sorted input. As such blocks are streamed to disk which is disk-efficient.

It is a series of java programs scripted together by a shell script.

tdbloader is pure java. It builds the b+trees by inserting, which for some indexes is not optimal because it causes random inserts leading to random I/O, which is bad for disk performance.

    Andy

On 15/04/17 22:13, A. Soroka wrote:
> To start with, tdbloader2 uses the assumption that the tuples are
> sorted (actually, it sorts them, then uses that assumption) as
> described in this old blog post of Andy's:
>
> https://seaborne.blogspot.com/2010/12/repacking-btrees.html
>
> That's one reason that you only want to use tdbloader2 to start from
> scratch. Andy, of course, can say more.
>
> ---
> A. Soroka
> The University of Virginia Library
>
>> On Apr 15, 2017, at 2:58 PM, Laura Morales wrote:
>>
>>> Use tdbloader for 10M quads.
>>
>> I wonder how tdbloader is technically different from tdbloader2. What
>> makes tdbloader more suited for small/medium datasets and tdbloader2
>> more suited for very large datasets? Do they implement different
>> insertion algorithms?
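
The rules of thumb from this thread (tdbloader2 only when starting from scratch, and only for very large inputs) can be written down as a small helper. This is a sketch, not an official recommendation: `choose_loader` is a hypothetical name, the 100M threshold is simply the "100M+ triples" figure quoted in the thread, and the triple count has to be supplied by the caller (for N-Triples, a line count is a reasonable estimate).

```sh
#!/bin/sh
# Sketch: pick a loader name from the rules of thumb in this thread.
# choose_loader is a hypothetical helper, not part of Jena.
# Usage: choose_loader DB_DIR TRIPLE_COUNT   (prints tdbloader or tdbloader2)
choose_loader() {
    db=$1
    count=$2
    if [ -d "$db" ] && [ -n "$(ls -A "$db")" ]; then
        # Database directory already has content: tdbloader2 is only
        # meant for building from scratch, so fall back to tdbloader.
        echo tdbloader
    elif [ "$count" -ge 100000000 ]; then
        # Empty or missing database and a very large input (100M+ triples).
        echo tdbloader2
    else
        # Small or medium input: the pure-Java loader.
        echo tdbloader
    fi
}

# Example call (hypothetical path and count):
#   loader=$(choose_loader /data/tdb 250000000)
```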
Re: tdbloader skip bad file
On 17/04/17 19:46, Laura Morales wrote:
> I'm trying to tdbload several .rdf files like this
>
> $ tdbloader --quiet --graph=... --loc=... ...
>
> problem is, if one file raises an exception (e.g. bad IRI), the whole
> bunch is dropped, and no triples are loaded from any file. I've tried
> calling tdbloader for each file, but it seems significantly slower.

Yes. If the database is empty, tdbloader can use its optimized loading; otherwise it has to add the data with special care as to index creation, which is much less optimal.

> Is there some command line argument that I can use to tell tdbloader
> to skip bad .rdf files, but keep loading the good ones?

Check the data before loading.

This is generally good practice.

Call "riot --validate" before loading to check each file.

    Andy
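
One way to follow this advice in a script is to validate every file first and then hand only the files that passed to a single tdbloader run, so the load into an empty database still happens in one go. The sketch below makes some assumptions: `load_valid` is a hypothetical helper name, `riot` and `tdbloader` are on the PATH, and `riot --validate FILE` exits with a non-zero status when the file has errors (worth confirming against the installed Jena version).

```sh
#!/bin/sh
# Sketch: validate each file with riot, then load the good ones together.
# load_valid is a hypothetical helper, not part of Jena.
# Assumption: "riot --validate FILE" returns non-zero for a bad file.
# Usage: load_valid DB_DIR GRAPH_IRI FILE...
load_valid() {
    db=$1
    graph=$2
    shift 2
    # Walk the file list once, rebuilding the positional parameters so
    # that only files which pass validation remain (order is preserved).
    n=$#
    while [ "$n" -gt 0 ]; do
        f=$1
        shift
        n=$((n - 1))
        if riot --validate "$f"; then
            set -- "$@" "$f"
        else
            echo "skipping bad file: $f" >&2
        fi
    done
    # No valid files left: do not call tdbloader at all.
    [ "$#" -gt 0 ] || return 1
    # One tdbloader run for all good files.
    tdbloader --loc="$db" --graph="$graph" "$@"
}

# Example call (hypothetical paths and IRI):
#   load_valid /data/tdb http://example/graph data/*.rdf
```

Checking the exit status avoids having to parse riot's output; the messages riot prints still show why a file was rejected.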
tdbloader skip bad file
I'm trying to tdbload several .rdf files like this

$ tdbloader --quiet --graph=... --loc=... ...

problem is, if one file raises an exception (e.g. bad IRI), the whole bunch is dropped, and no triples are loaded from any file. I've tried calling tdbloader for each file, but it seems significantly slower.

Is there some command line argument that I can use to tell tdbloader to skip bad .rdf files, but keep loading the good ones?