Hi Patrick,

Thanks for your answers, they helped me a lot.

On Saturday 07 August 2010, Patrick van Kleef wrote:
> > 1. Is there a possibility to for example get the line number where
> > the error occurred?
>
> The loader process creates a table called load_list which records
> which datasets were not loaded completely and for what reason.
> 
> Please try:
> 
>       select * from load_list where ll_error is not null;
> 
> This should give you a good indication of what is going on, including
> the line number where the failure happened.

I tried this before, but it only gives me the following error message for
both files:
23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'

There are no line numbers :(

> > 2. Can I somehow tell virtuoso not to quit TTLP on such lines, but
> > to either
> > ignore or truncate them?
> 
> There are some flags to the TTLP code that would probably skip this
> error, but in certain cases that could insert partial data into the
> database, which is much harder to clean up, so we have not made it
> default.

Actually, the loader script calls the TTLP function with 255 as flags, so
basically every flag that tolerates more than the standard is already
activated.

To work around my problem for now, I preprocess the DBpedia 3.5.1 dumps as
follows, which strips out all lines with URLs longer than 1024 chars:

for i in external_links_en.nt.gz page_links_en.nt.gz ; do
  echo -n "cleaning $i..."
  zcat "$i" | grep -v -E '^<.+> <.+> <.{1025,}> .$' \
    | gzip > "${i%.nt.gz}_cleaned.nt.gz" &&
  mv "${i%.nt.gz}_cleaned.nt.gz" "$i"
  echo "done."
done
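
Since load_list does not report line numbers, something like the following
could be used to list the offending lines before they are filtered out. It is
only a rough sketch: long_object_lines is a made-up helper name, and it
assumes one triple per line with a URI object that contains no ">" character.

```shell
# Hypothetical helper (not part of Virtuoso or the loader script):
# print "LINE_NUMBER:OBJECT_LENGTH" for every line of a gzipped N-Triples
# file whose object URI is longer than the given limit (default 1024).
long_object_lines() {
  file=$1
  limit=${2:-1024}
  gzip -dc "$file" | awk -v limit="$limit" '
    # Assumption: each line looks like "<s> <p> <o> ." and the URIs
    # contain no ">" character, so only the object matches up to the end.
    match($0, /<[^>]*> \.$/) {
      # RLENGTH covers "<o> ."; drop the 4 framing chars "<", ">", " ", "."
      len = RLENGTH - 4
      if (len > limit) print NR ":" len
    }'
}
```

Calling long_object_lines page_links_en.nt.gz on an uncleaned dump should then
print one line per triple that the grep above would drop.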

Cheers,
Jörn
