Re: tdbloader skip bad file

2017-04-18 Thread A. Soroka
One of the several advantages of N-Triples (and this is not an accident) is how 
easy it is to use standard POSIX tools with it, such as cut, sed, and grep.
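
A minimal sketch of that point, assuming a file with exactly one triple per line and no blank or comment lines. The function names are made up for illustration, and the predicate match is a plain text match, so it can also hit the same text inside a literal:

```shell
# Count triples: one line per triple (assumes no blank or comment lines).
count_triples() {
  wc -l < "$1" | tr -d ' '
}

# Print triples that use a given predicate IRI ($2, without angle brackets).
# This is a fixed-string match on " <IRI> ", not a real parse.
triples_with_predicate() {
  grep -F " <$2> " "$1"
}

# List distinct subjects: the subject is the first space-separated field.
distinct_subjects() {
  cut -d ' ' -f 1 "$1" | sort -u
}
```

For example, `triples_with_predicate data.nt http://example.org/name` would print only the matching lines of a hypothetical `data.nt`.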

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 11:46 AM, Laura Morales wrote:
> 
>> In the meantime, you can use something like sed for this, something like: 
>> sed -e "s|\(.*\)|\1 |"
> 
> ah, right! This is a good suggestion. This seems to work:
> sed "s/\(.*\) \.$/\1  ./"
> (all triples have a period at the end).
> I think I'll use this until RIOT has a --graph option that would be much more 
> easy to work with :)



Re: tdbloader skip bad file

2017-04-18 Thread Laura Morales
> In the meantime, you can use something like sed for this, something like:
> sed -e "s|\(.*\)|\1 |"

ah, right! This is a good suggestion. This seems to work:
sed "s/\(.*\) \.$/\1  ./"
(all triples have a period at the end).
I think I'll use this until RIOT has a --graph option, which would be much 
easier to work with :)
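
As a sketch of the same idea in reusable form, the function below appends a graph label to each N-Triples line read from stdin, producing N-Quads. The function name and the graph IRI argument are assumptions for illustration; the graph IRI actually used in this thread is not shown above. It assumes every triple ends with " ." and that the IRI contains no "|" or "&", which sed would treat specially here.

```shell
# Turn N-Triples on stdin into N-Quads on stdout by inserting a graph label
# before the final period. $1 is the graph IRI without angle brackets.
# Lines that do not end in " ." are passed through unchanged.
nt_to_nq() {
  sed -e "s|\(.*\) \.\$|\1 <$1> .|"
}
```

A possible use would be `nt_to_nq http://example.org/graph1 < file1.nt >> all.nq`, where the IRI and file names are placeholders.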


Re: tdbloader skip bad file

2017-04-18 Thread A. Soroka
In the meantime, you can use something like sed for this, something like:
sed -e "s|\(.*\)|\1 |"

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 10:28 AM, Laura Morales wrote:
> 
>> Convert to something cheaper (preferably stream-able, like N-triples, as 
>> Andy says) as early as possible.
> 
> It would be very handy if riot had an "--graph=..." option as well, such that 
> I could immediately output all XML files into n-quads with a graph label (and 
> `cat` all of them into a single .nq file).



Re: tdbloader skip bad file

2017-04-18 Thread A. Soroka
You can file a ticket for that functionality at the Jena JIRA instance:

https://issues.apache.org/jira/browse/JENA

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 10:28 AM, Laura Morales wrote:
> 
>> Convert to something cheaper (preferably stream-able, like N-triples, as 
>> Andy says) as early as possible.
> 
> It would be very handy if riot had an "--graph=..." option as well, such that 
> I could immediately output all XML files into n-quads with a graph label (and 
> `cat` all of them into a single .nq file).



Re: tdbloader skip bad file

2017-04-18 Thread Laura Morales
> Convert to something cheaper (preferably stream-able, like N-triples, as Andy 
> says) as early as possible.

It would be very handy if riot had an "--graph=..." option as well, such that I 
could immediately output all XML files into n-quads with a graph label (and 
`cat` all of them into a single .nq file).


Re: tdbloader skip bad file

2017-04-18 Thread A. Soroka
If you don't have a specific reason to use RDF/XML inside your workflow, you 
almost certainly shouldn't. It's one of the most expensive RDF serializations 
to process. Convert to something cheaper (preferably stream-able, like 
N-triples, as Andy says) as early as possible.

As for the costs of validation, depending on your operating resources, it might 
be worthwhile to use something like GNU parallel or xargs -P to run several 
riot invocations together. That will only be true if the startup time for riot 
is very small compared to the time it takes to run over a given file, which 
will depend on the size of your files. In this case it seems unlikely to help 
much, but it may be useful at a different time. You can only load one file at a 
time into TDB with tdbloader, because only one process at a time can act 
against a given TDB database.
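
A rough sketch of the xargs -P variant, assuming `riot` is on the PATH and that your xargs supports the `-0` and `-P` extensions (GNU and BSD xargs do; they are not in POSIX). The function name and the default of 4 jobs are arbitrary choices for illustration.

```shell
# Read NUL-separated file names on stdin, validate them in parallel, and
# print the names of the files that riot accepts (in no particular order).
# $1 is the number of parallel jobs (defaults to 4).
validate_parallel() {
  xargs -0 -n 1 -P "${1:-4}" sh -c '
    if riot --validate "$1" >/dev/null 2>&1; then
      printf "%s\n" "$1"
    fi
  ' sh
}
```

It could be driven with something like `find data -name '*.rdf' -print0 | validate_parallel 4 > valid.txt`, where `data` and `valid.txt` are placeholder names. The load into TDB still has to happen in a single tdbloader process afterwards.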


---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 5:38 AM, Andy Seaborne wrote:
> 
> 
> 
> On 18/04/17 10:19, Laura Morales wrote:
>>> riot sets the Unix return code to 0 on success and 1 on failure in the
>>> usual Unix fashion.
>>> 
>>> So build up a list of valid files by looping on the input files then
>>> load all the valid ones in one go with tdbloader.
>> 
>> Thank you.
>> Unfortunately however, running "riot --validate" on each file doesn't seem 
>> much faster than running tdbloader on each single file. Processing all files 
>> seem to take approximately the same time.
>> 
> 
> running tdbloader with bad data can corrupt the database.
> 
> It's a bulk loader - not a fix-up-the data tool.
> 
> If they take about the same time, then the parse costs dominate - which is 
> possible with RDF/XML on small data files.
> 
> If performance matters, parse/validate and output N-triples, then load the 
> N-triples.
> 
> Andy



Re: tdbloader skip bad file

2017-04-18 Thread Andy Seaborne



On 18/04/17 10:19, Laura Morales wrote:
>> riot sets the Unix return code to 0 on success and 1 on failure in the
>> usual Unix fashion.
>>
>> So build up a list of valid files by looping on the input files then
>> load all the valid ones in one go with tdbloader.
>
> Thank you.
> Unfortunately however, running "riot --validate" on each file doesn't seem much 
> faster than running tdbloader on each single file. Processing all files seem to 
> take approximately the same time.

running tdbloader with bad data can corrupt the database.

It's a bulk loader - not a fix-up-the data tool.

If they take about the same time, then the parse costs dominate - which 
is possible with RDF/XML on small data files.

If performance matters, parse/validate and output N-triples, then load 
the N-triples.


Andy
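
One way the parse-then-load suggestion might look in a script, as a sketch only. It assumes `riot` is on the PATH, that it accepts `--output=ntriples` (check `riot --help` for your Jena version), and that it exits non-zero on a bad file as described in this thread. The function name is made up for illustration.

```shell
# Convert each input file to N-Triples and append it to one combined file,
# skipping any file that riot rejects.
# $1 is the combined output file; the remaining arguments are input files.
# The body runs in a subshell so its variables do not leak to the caller.
convert_valid_to_nt() (
  out=$1
  shift
  : > "$out"
  part=$(mktemp) || exit 1
  for f in "$@"; do
    # Write to a scratch file first so that partial output from a file
    # that fails halfway through never reaches the combined file.
    if riot --output=ntriples "$f" > "$part" 2>/dev/null; then
      cat "$part" >> "$out"
    else
      echo "skipping bad file: $f" >&2
    fi
  done
  rm -f "$part"
)
```

After something like `convert_valid_to_nt all.nt data/*.rdf`, the single file `all.nt` can be handed to tdbloader in one run. Both names are placeholders.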


Re: tdbloader skip bad file

2017-04-18 Thread Laura Morales
> riot sets the Unix return code to 0 on success and 1 on failure in the
> usual Unix fashion.
> 
> So build up a list of valid files by looping on the input files then
> load all the valid ones in one go with tdbloader.

Thank you.
Unfortunately, however, running "riot --validate" on each file doesn't seem much 
faster than running tdbloader on each single file. Processing all files seems to 
take approximately the same time.


Re: tdbloader skip bad file

2017-04-18 Thread Andy Seaborne




On 17/04/17 22:56, Laura Morales wrote:
>> Check the data before loading.
>>
>> This is generally good practice.
>>
>> Call "riot --validate" before loading to check each file.
>
> Let's say I've downloaded these RDF files [1]. Some of those files are broken. 
> How can I check-and-load all those files with a bash script? Should I loop all 
> files, call riot for each of them singularly, then parse the riot output for 
> each file?
>
> [1] https://svn.apache.org/repos/asf/comdev/projects.apache.org/data/projects.xml

riot sets the Unix return code to 0 on success and 1 on failure in the 
usual Unix fashion.

So build up a list of valid files by looping on the input files then 
load all the valid ones in one go with tdbloader.

The broken ones need fixing to be loadable.

Andy
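
The loop described above could be sketched like this, relying only on riot's exit code rather than parsing its output. It assumes `riot` is on the PATH; the function name is made up for illustration.

```shell
# Print the files that pass "riot --validate", one per line, and report
# the rejected ones on stderr.
# The body runs in a subshell so the loop variable does not leak.
valid_files() (
  for f in "$@"; do
    if riot --validate "$f" >/dev/null 2>&1; then
      printf '%s\n' "$f"
    else
      echo "invalid, skipped: $f" >&2
    fi
  done
)
```

For example, `valid_files data/*.rdf > valid.txt` would build the list, and the files named in `valid.txt` can then be passed to a single tdbloader invocation. `data` and `valid.txt` are placeholder names, and file names containing newlines would need different handling.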


Re: tdbloader skip bad file

2017-04-17 Thread Andy Seaborne



On 17/04/17 19:46, Laura Morales wrote:
> I'm trying to tdbload several .rdf files like this
>
> $ tdbloader --quiet --graph=... --loc=...
>  ...
>
> problem is, if one file raises an exception (eg. bad IRI), the whole bunch is 
> dropped, and no triples are loaded from any file. I've tried calling tdbloader 
> for each file, but it seems significantly slower.

Yes.

If the database is empty, tdbloader can use its optimized loading; 
otherwise it has to add the data with special care as to index creation, 
which is much less optimal.

> Is there some command line argument that I can use to tell tdbloader to skip 
> bad .rdf files, but keep loading the good ones?

Check the data before loading.

This is generally good practice.

Call "riot --validate" before loading to check each file.

Andy


tdbloader skip bad file

2017-04-17 Thread Laura Morales
I'm trying to tdbload several .rdf files like this

$ tdbloader --quiet --graph=... --loc=...   
 ...

The problem is that if one file raises an exception (e.g. a bad IRI), the whole 
batch is dropped, and no triples are loaded from any file. I've tried calling 
tdbloader for each file, but it seems significantly slower.
Is there some command line argument that I can use to tell tdbloader to skip 
bad .rdf files, but keep loading the good ones?