Re: [Virtuoso-users] Problems loading the DBpedia dump (very long URIs)

2010-08-08 Thread Jörn Hees
Hi Patrick,

thanks for your answers; they helped me a lot.

On Saturday 07 August 2010, Patrick van Kleef wrote:
> > 1. Is there a possibility to, for example, get the line number where
> > the error occurred?
>
> The loader process creates a table called load_list, which records
> which datasets were not loaded completely and for what reason.
>
> Please try:
>
>   select * from load_list where ll_error is not null;
>
> This should give you a good indication of what is going on, including
> the line number where the failure happened.

I tried this before, but it only gives me the following as error message for 
both files:
23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'

There are no line numbers :(

> > 2. Can I somehow tell Virtuoso not to quit TTLP on such lines, but
> > to either ignore or truncate them?
>
> There are some flags to the TTLP code that would probably skip this
> error, but in certain cases that could insert partial data into the
> database, which is much harder to clean up, so we have not made it
> the default.

Actually the loader script calls the TTLP function with 255 as flags, so
basically every flag that tolerates more than the standard is already
activated.

To fix my problem I for now did the following preprocessing of the DBpedia
3.5.1 dumps, which strips out all lines with URIs longer than 1024 chars:

for i in external_links_en.nt.gz page_links_en.nt.gz ; do
  echo -n "cleaning $i..."
  zcat $i | grep -v -E '^<.+> <.+> <.{1025,}> \.$' | gzip > ${i%.nt.gz}_cleaned.nt.gz
  mv ${i%.nt.gz}_cleaned.nt.gz $i
  echo done.
done
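For reference, the same filtering can be done by measuring the object field directly instead of via a regex. A minimal sketch, assuming simple N-Triples lines (space-separated `<subject> <predicate> <object> .` with no literals containing spaces); the 1024-character limit mirrors the one above:

```shell
# Keep only N-Triples lines whose object term (field 3) is at most
# 1024 characters long; lines with overlong objects are dropped.
filter_long_objects() {
  awk 'length($3) <= 1024'
}

# Synthetic demonstration: the second line has an overlong object URI.
# Against the real dump this would be:
#   zcat external_links_en.nt.gz | filter_long_objects | gzip > cleaned.nt.gz
long=$(printf 'x%.0s' $(seq 1 1100))
printf '<s> <p> <http://example.org/ok> .\n<s> <p> <http://example.org/%s> .\n' "$long" \
  | filter_long_objects
# prints: <s> <p> <http://example.org/ok> .
```

Unlike the grep pattern, this checks exactly one field, so it cannot accidentally match across field boundaries.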

Cheers,
Jörn



[Virtuoso-users] Problems loading the DBpedia dump (very long URIs)

2010-08-07 Thread Jörn Hees
Hi all,

after having a lot of trouble with our group-internal DBpedia mirror, I
decided to do a fresh install.

I downloaded all the files (we only need a subset of the dbpedia: {en,de}) and 
followed this guide yesterday:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia

It uses the 
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript 
to load the dumps (btw: the script could greatly benefit from a few comments).

Now today I found these lines in the logs:


23:33:20 PL LOG: Loader started

Sat Aug 07 2010
00:06:49 PL LOG:  File 
/usr/local/data/dbpedia/3.5.1/en/external_links_en.nt.gz error 23000 SR133: 
Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
00:32:56 Checkpoint started
00:34:39 Checkpoint finished, log reused
02:05:30 PL LOG:  File /usr/local/data/dbpedia/3.5.1/en/page_links_en.nt.gz 
error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
02:47:44 PL LOG: No more files to load. Loader has finished,


So I went to investigate where this comes from and it seems that inside the 
ld_file procedure of the VirtBulkRDFLoaderScript the error is caught. It uses 
the TTLP procedure to load the data.

1. Is there a possibility to, for example, get the line number where the error
occurred? I did a few checks on the external_links_en.nt.gz file with zcat and
grep and I think that a very long URL is the problem (see appendix).

2. Can I somehow tell Virtuoso not to quit TTLP on such lines, but to either
ignore or truncate them?

3. How much data was not inserted? The error handler calls a rollback work,
but it seems that in the rdf_loader_run procedure a commit work is done only
after every 100 files loaded, which would mean that everything inserted
before is lost? At the same time log_enable(2, 1) is set, which means
autocommit for every row, no log, so why is there a commit at all?

4. How do I continue? Can I simply restart with just these two files after 
fixing?

Cheers,
Jörn




Appendix:
=== did a few checks ===
SPARQL select count(*) where { ?s <http://dbpedia.org/property/reference> ?o . };
results in:
348955

At the same time the external_links_en.nt.gz file has:
5081932 lines (zcat external_links_en.nt.gz | wc -l)

The corresponding lines in the file look like this
(zcat external_links_en.nt.gz | cat -n | head -n $((348955+1)) | tail -n 2 ):
348955  <http://dbpedia.org/resource/Fourteen_Points> <http://dbpedia.org/property/reference> <http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points> .
348956  <http://dbpedia.org/resource/Fourteen_Points> <http://dbpedia.org/property/reference> <http://www.mtholyoke.edu/acad/intrel/doc31.htm> .

Notice the 's in line 348955; but since we got 348955 triples, the problem
should've occurred in the line after that one, yet in line 348956 I see
nothing wrong.

Also notice that
sparql select * where { <http://dbpedia.org/resource/Fourteen_Points> <http://dbpedia.org/property/reference> ?o . };
results in:
http://www.loc.gov/exhibits/treasures/trm053.html
http://wwi.lib.byu.edu/index.php/Main_Page
http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points
http://www.mtholyoke.edu/acad/intrel/doc31.htm
http://www.ourdocuments.gov/doc.php?flash=true&doc=62
http://web.jjay.cuny.edu/jobrien/reference/ob34.html

So line 348956 is imported.
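Instead of bisecting by hand, all candidate lines can also be listed directly with their line numbers. A sketch, assuming simple N-Triples lines (the 1024-character threshold is my assumption about where Virtuoso draws the line):

```shell
# Print line number and subject of every triple whose object term
# (field 3) exceeds a given length.
# Usage: find_long_objects MAXLEN < file.nt
find_long_objects() {
  awk -v max="$1" 'length($3) > max { print NR ": " $1 }'
}

# Synthetic demonstration; against the real dump this would be:
#   zcat external_links_en.nt.gz | find_long_objects 1024
printf '<a> <p> <short> .\n<b> <p> <%s> .\n' "$(printf 'y%.0s' $(seq 1 40))" \
  | find_long_objects 20
# prints: 2: <b>
```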

Using nested intervals I found this:
356036  <http://dbpedia.org/resource/Antisocial_personality_disorder> <http://dbpedia.org/property/reference> <http://www.faculty.missouristate.edu/M/MichaelCarlie/what_I_learned_about/GANGS/WHYFORM/pathological.htm> .
356037  <http://dbpedia.org/resource/Hugo_Simberg> <http://dbpedia.org/property/reference>

Re: [Virtuoso-users] Problems loading the DBpedia dump (very long URIs)

2010-08-07 Thread Patrick van Kleef

Hi Jörn,

> after having a lot of trouble with our group-internal DBpedia mirror, I
> decided to do a fresh install.
>
> I downloaded all the files (we only need a subset of the dbpedia: {en,de})
> and followed this guide yesterday:
> http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia
>
> It uses the
> http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript
> to load the dumps (btw: the script could greatly benefit from a few
> comments).


> Now today I found these lines in the logs:
>
> 23:33:20 PL LOG: Loader started
>
> Sat Aug 07 2010
> 00:06:49 PL LOG:  File /usr/local/data/dbpedia/3.5.1/en/external_links_en.nt.gz
> error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
> 00:32:56 Checkpoint started
> 00:34:39 Checkpoint finished, log reused
> 02:05:30 PL LOG:  File /usr/local/data/dbpedia/3.5.1/en/page_links_en.nt.gz
> error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
> 02:47:44 PL LOG: No more files to load. Loader has finished,
>
> So I went to investigate where this comes from and it seems that inside the
> ld_file procedure of the VirtBulkRDFLoaderScript the error is caught. It uses
> the TTLP procedure to load the data.
>
> 1. Is there a possibility to, for example, get the line number where the error
> occurred? I did a few checks on the external_links_en.nt.gz file with zcat and
> grep and I think that a very long URL is the problem (see appendix).



The loader process creates a table called load_list, which records
which datasets were not loaded completely and for what reason.

Please try:

  select * from load_list where ll_error is not null;

This should give you a good indication of what is going on, including
the line number where the failure happened.
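Once the cause is fixed, the failed entries can be reset so that rdf_loader_run() retries only those files. A hedged sketch using the load_list columns (ll_file, ll_state, ll_error; verify the exact column names against your Virtuoso version):

```sql
-- Requeue every file that previously failed, then rerun the loader.
UPDATE DB.DBA.load_list
   SET ll_state = 0, ll_error = NULL
 WHERE ll_error IS NOT NULL;

rdf_loader_run();
```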



> 2. Can I somehow tell Virtuoso not to quit TTLP on such lines, but to
> either ignore or truncate them?


There are some flags to the TTLP code that would probably skip this
error, but in certain cases that could insert partial data into the
database, which is much harder to clean up, so we have not made it
the default.



> 3. How much data was not inserted? The error handler calls a rollback work,
> but it seems that in the rdf_loader_run procedure a commit work is done only
> after every 100 files loaded, which would mean that everything inserted
> before is lost? At the same time log_enable(2, 1) is set, which means
> autocommit for every row, no log, so why is there a commit at all?



The commit is there because you can overrule the log_enable setting when
calling rdf_loader_run. Since you did not, it was on autocommit, so all the
triples loaded before the bad triple have been committed.



> 4. How do I continue? Can I simply restart with just these two files
> after fixing?



When you have found the offending line, you can dump all the remaining  
triples of this file to a new file and just load that partial result.  
Since all records before this point in the file are already in, that  
should complete without this one bad record.
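For example, once the failing line number is known, the remainder can be split off with tail (tail -n +N keeps everything from line N onward, 1-based). A sketch:

```shell
# Keep everything from line N onward.
# Usage: remainder_from N < file
remainder_from() {
  tail -n "+$1"
}

# Synthetic demonstration; against the real dump this would be:
#   zcat external_links_en.nt.gz | remainder_from <failing-line> \
#     | gzip > external_links_en_rest.nt.gz
printf 'line1\nline2\nline3\n' | remainder_from 2
# prints:
# line2
# line3
```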



