Re: tdbloader skip bad file
One of the several advantages of N-Triples (and this is not an accident) is how easy it is to use standard POSIX tools with it, e.g. cut, sed, grep, etc.

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 11:46 AM, Laura Morales wrote:
>
>> In the meantime, you can use something like sed for this, something like:
>> sed -e "s|\(.*\)|\1 |"
>
> ah, right! This is a good suggestion. This seems to work: sed "s/\(.*\) \.$/\1 ./" (all triples have a period at the end).
> I think I'll use this until RIOT has a --graph option, which would be much easier to work with :)
Re: tdbloader skip bad file
> In the meantime, you can use something like sed for this, something like:
> sed -e "s|\(.*\)|\1 |"

Ah, right! This is a good suggestion. This seems to work: sed "s/\(.*\) \.$/\1 ./" (all triples have a period at the end).
I think I'll use this until RIOT has a --graph option, which would be much easier to work with :)
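A minimal sketch of the sed approach discussed above, for turning N-Triples into N-Quads by inserting a graph label before the final period of each line. The function name nt_to_nq and the graph IRI in the usage line are hypothetical. It assumes one triple per line, each ending in " .", and a graph IRI that contains none of the characters "|", "&" or "\", which are special in this sed command.

```shell
# Read N-Triples on stdin, write N-Quads on stdout.
# $1 is the graph IRI without angle brackets (hypothetical example below).
nt_to_nq() {
  graph="$1"
  # Capture everything before the trailing " ." and re-emit it
  # followed by the graph term and the period.
  sed "s|\(.*\) \.\$|\1 <${graph}> .|"
}
```

Possible usage: nt_to_nq 'http://example.org/graph' < data.nt > data.nq. Lines that do not end in " ." pass through unchanged, so it is worth checking the output with riot before loading it.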
Re: tdbloader skip bad file
In the meantime, you can use something like sed for this, something like:

sed -e "s|\(.*\)|\1 |"

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 10:28 AM, Laura Morales wrote:
>
>> Convert to something cheaper (preferably stream-able, like N-triples, as Andy says) as early as possible.
>
> It would be very handy if riot had an "--graph=..." option as well, such that I could immediately output all XML files into n-quads with a graph label (and `cat` all of them into a single .nq file).
Re: tdbloader skip bad file
You can file a ticket for that functionality at the Jena JIRA instance:

https://issues.apache.org/jira/browse/JENA

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 10:28 AM, Laura Morales wrote:
>
>> Convert to something cheaper (preferably stream-able, like N-triples, as Andy says) as early as possible.
>
> It would be very handy if riot had an "--graph=..." option as well, such that I could immediately output all XML files into n-quads with a graph label (and `cat` all of them into a single .nq file).
Re: tdbloader skip bad file
> Convert to something cheaper (preferably stream-able, like N-triples, as Andy says) as early as possible.

It would be very handy if riot had an "--graph=..." option as well, such that I could immediately output all XML files into n-quads with a graph label (and `cat` all of them into a single .nq file).
Re: tdbloader skip bad file
If you don't have a specific reason to use RDF/XML inside your workflow, you almost certainly shouldn't. It's one of the most expensive RDF serializations to process. Convert to something cheaper (preferably stream-able, like N-triples, as Andy says) as early as possible.

As for the costs of validation, depending on your operating resources, it might be worthwhile to use something like GNU parallel or xargs -P to run several riot invocations together. That will only be true if the startup time for riot is very small compared to the time it takes to run over a given file, which will depend on the size of your files. In this case it seems unlikely to help much, but it may be useful at a different time.

You can only load one file at a time into TDB with tdbloader, because only one process at a time can act against a given TDB database.

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 5:38 AM, Andy Seaborne wrote:
>
> On 18/04/17 10:19, Laura Morales wrote:
>>> riot sets the Unix return code to 0 on success and 1 on failure in the usual Unix fashion.
>>>
>>> So build up a list of valid files by looping on the input files then load all the valid ones in one go with tdbloader.
>>
>> Thank you.
>> Unfortunately however, running "riot --validate" on each file doesn't seem much faster than running tdbloader on each single file. Processing all files seems to take approximately the same time.
>
> running tdbloader with bad data can corrupt the database.
>
> It's a bulk loader - not a fix-up-the-data tool.
>
> If they take about the same time, then the parse costs dominate - which is possible with RDF/XML on small data files.
>
> If performance matters, parse/validate and output N-triples, then load the N-triples.
>
> Andy
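The xargs -P idea above could be sketched roughly as follows. The function name validate_parallel, the .rdf file extension, and the default of 4 concurrent jobs are assumptions made for illustration; -print0, -0 and -P are widely available in GNU and BSD find/xargs but are not required by POSIX. The function only reports which files pass "riot --validate"; loading still has to happen in a single tdbloader process afterwards.

```shell
# Print the path of every .rdf file under $1 that passes "riot --validate",
# running up to $2 riot processes at once (default 4, an arbitrary choice).
validate_parallel() {
  dir="$1"
  jobs="${2:-4}"
  find "$dir" -type f -name '*.rdf' -print0 |
    # One file per riot invocation; riot's exit status decides whether
    # the file name is printed. Output order is not deterministic.
    xargs -0 -n 1 -P "$jobs" sh -c \
      'riot --validate "$1" >/dev/null 2>&1 && printf "%s\n" "$1"' sh
}
```

Possible usage: validate_parallel ./data 8 > valid-files.txt. Whether this is faster than a plain loop depends on how riot's startup time compares with the per-file parse time, as noted above.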
Re: tdbloader skip bad file
On 18/04/17 10:19, Laura Morales wrote:
>> riot sets the Unix return code to 0 on success and 1 on failure in the usual Unix fashion.
>>
>> So build up a list of valid files by looping on the input files then load all the valid ones in one go with tdbloader.
>
> Thank you.
> Unfortunately however, running "riot --validate" on each file doesn't seem much faster than running tdbloader on each single file. Processing all files seems to take approximately the same time.

Running tdbloader with bad data can corrupt the database.

It's a bulk loader - not a fix-up-the-data tool.

If they take about the same time, then the parse costs dominate - which is possible with RDF/XML on small data files.

If performance matters, parse/validate and output N-triples, then load the N-triples.

Andy
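A rough sketch of the "parse/validate and output N-triples" step, so that each RDF/XML file is parsed only once. The function name to_ntriples and the output layout are hypothetical. It assumes that riot accepts --output=ntriples and exits non-zero on a parse error, so check riot --help for your installed version.

```shell
# Convert each input file to N-Triples in directory $1.
# Files that riot rejects are reported on stderr and leave no output file.
to_ntriples() {
  out_dir="$1"
  shift
  for f in "$@"; do
    nt="$out_dir/$(basename "${f%.*}").nt"
    # Write to a temporary name first so a failed parse never leaves
    # a partial .nt file behind.
    if riot --output=ntriples "$f" > "$nt.tmp" 2>/dev/null; then
      mv "$nt.tmp" "$nt"
    else
      rm -f "$nt.tmp"
      echo "skipping invalid file: $f" >&2
    fi
  done
}
```

Possible usage: to_ntriples ./nt ./data/*.rdf, after which the resulting .nt files can be passed to tdbloader in one invocation.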
Re: tdbloader skip bad file
> riot sets the Unix return code to 0 on success and 1 on failure in the usual Unix fashion.
>
> So build up a list of valid files by looping on the input files then load all the valid ones in one go with tdbloader.

Thank you.
Unfortunately however, running "riot --validate" on each file doesn't seem much faster than running tdbloader on each single file. Processing all files seems to take approximately the same time.
Re: tdbloader skip bad file
On 17/04/17 22:56, Laura Morales wrote:
>> Check the data before loading.
>>
>> This is generally good practice.
>>
>> Call "riot --validate" before loading to check each file.
>
> Let's say I've downloaded these RDF files [1]. Some of those files are broken. How can I check-and-load all those files with a bash script? Should I loop over all the files, call riot on each of them individually, then parse the riot output for each file?
>
> [1] https://svn.apache.org/repos/asf/comdev/projects.apache.org/data/projects.xml

riot sets the Unix return code to 0 on success and 1 on failure in the usual Unix fashion.

So build up a list of valid files by looping on the input files then load all the valid ones in one go with tdbloader.

The broken ones need fixing to be loadable.

Andy
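The loop described above might look something like this. The function name load_valid is hypothetical; "riot --validate" and "tdbloader --loc=..." are the commands already used in this thread. The sketch sticks to POSIX sh, which has no arrays, so the list of valid files is built by appending to the positional parameters.

```shell
# Validate every file with riot, then load the valid ones into the
# TDB database at $1 with a single tdbloader invocation.
load_valid() {
  loc="$1"
  shift
  n=$#                        # number of candidate files
  for f in "$@"; do           # the list is expanded once, before the loop runs
    if riot --validate "$f" >/dev/null 2>&1; then
      set -- "$@" "$f"        # append the valid file after the originals
    else
      echo "skipping invalid file: $f" >&2
    fi
  done
  shift "$n"                  # drop the originals, keeping only valid files
  if [ "$#" -eq 0 ]; then
    echo "no valid files to load" >&2
    return 1
  fi
  tdbloader --loc="$loc" "$@"
}
```

Possible usage: load_valid /path/to/tdb ./data/*.rdf. Skipped files are listed on stderr so they can be fixed and loaded later.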
Re: tdbloader skip bad file
> Check the data before loading.
>
> This is generally good practice.
>
> Call "riot --validate" before loading to check each file.

Let's say I've downloaded these RDF files [1]. Some of those files are broken. How can I check-and-load all those files with a bash script? Should I loop over all the files, call riot on each of them individually, then parse the riot output for each file?

[1] https://svn.apache.org/repos/asf/comdev/projects.apache.org/data/projects.xml
Re: tdbloader skip bad file
On 17/04/17 19:46, Laura Morales wrote:
> I'm trying to tdbload several .rdf files like this
>
> $ tdbloader --quiet --graph=... --loc=... ...
>
> problem is, if one file raises an exception (e.g. bad IRI), the whole bunch is dropped, and no triples are loaded from any file. I've tried calling tdbloader for each file, but it seems significantly slower.

Yes. If the database is empty, tdbloader can use its optimized loading; otherwise it has to add the data with special care as to index creation, which is much less optimal.

> Is there some command line argument that I can use to tell tdbloader to skip bad .rdf files, but keep loading the good ones?

Check the data before loading.

This is generally good practice.

Call "riot --validate" before loading to check each file.

Andy
tdbloader skip bad file
I'm trying to tdbload several .rdf files like this

$ tdbloader --quiet --graph=... --loc=... ...

problem is, if one file raises an exception (e.g. bad IRI), the whole bunch is dropped, and no triples are loaded from any file. I've tried calling tdbloader for each file, but it seems significantly slower.

Is there some command line argument that I can use to tell tdbloader to skip bad .rdf files, but keep loading the good ones?