Hi Patrick, thanks for your answers, helped me a lot.
On Saturday 07 August 2010, Patrick van Kleef wrote:
> > 1. Is there a possibility to for example get the line number where
> > the error occurred?
>
> The loader process creates a table called load_list which records
> which datasets were not loaded completely and for what reason.
>
> Please try:
>
> select * from load_list where ll_error is not null;
>
> This should give you good indication of what is going on, including
> the line number where the failure happened.

I tried this before, but it only gives me the following error message for
both files:

23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'

There are no line numbers :(

> > 2. Can I somehow tell virtuoso not to quit TTLP on such lines, but
> > to either
> > ignore or truncate them?
>
> There are some flags to the TTLP code that would probably skip this
> error, but in certain cases that could insert partial data into the
> database, which is much harder to clean up, so we have not made it
> default.

Actually the loader script calls the TTLP function with 255 as flags, so
basically every flag that tolerates more than the standard is already
activated.

To work around my problem, for now I preprocess the DBpedia 3.5.1 dumps as
follows, which strips out all lines whose object URL is longer than 1024
chars:

for i in external_links_en.nt.gz page_links_en.nt.gz ; do
    echo -n "cleaning $i..."
    zcat "$i" | grep -v -E '^<.+> <.+> <.{1025,}> .$' | gzip > "${i%.nt.gz}_cleaned.nt.gz" &&
        mv "${i%.nt.gz}_cleaned.nt.gz" "$i"
    echo "done."
done

Cheers,
Jörn
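
P.S. Since load_list gives no line numbers, it may help to look at the lines the filter above drops before overwriting the dumps. A minimal sketch, assuming gzip and a grep -E that accepts the {1025,} interval (GNU grep does); the function name long_object_lines is made up for illustration and is not part of the loader scripts:

```shell
# Hypothetical helper: print the lines of a gzipped N-Triples file whose
# object URL is longer than 1024 chars, i.e. exactly the lines that the
# "grep -v" in the cleanup loop removes.
long_object_lines() {
    gzip -dc "$1" | grep -E '^<.+> <.+> <.{1025,}> .$'
}

# Usage (commented out, needs the dump files in the current directory):
# for i in external_links_en.nt.gz page_links_en.nt.gz ; do
#     echo "$i: $(long_object_lines "$i" | wc -l) lines would be dropped"
# done
```

This only reads the files, so it can be run safely before the cleanup loop replaces them.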