[Virtuoso-users] Problems loading the DBpedia dump (very long URIs)
Hi all, after having a lot trouble with our group internal DBpedia mirror, I decided to do a fresh install. I downloaded all the files (we only need a subset of the dbpedia: {en,de}) and followed this guide yesterday: http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia It uses the http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript to load the dumps (btw: the script could greatly benefit from a few comments). Now today I found these lines in the logs: 23:33:20 PL LOG: Loader started Sat Aug 07 2010 00:06:49 PL LOG: File /usr/local/data/dbpedia/3.5.1/en/external_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O' 00:32:56 Checkpoint started 00:34:39 Checkpoint finished, log reused 02:05:30 PL LOG: File /usr/local/data/dbpedia/3.5.1/en/page_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O' 02:47:44 PL LOG: No more files to load. Loader has finished, So I went to investigate where this comes from and it seems that inside the ld_file procedure of the VirtBulkRDFLoaderScript the error is caught. It uses the TTLP procedure to load the data. 1. Is there a possibility to for example get the line number where the error occurred? I did a few checks on the external_links_en.nt.gz file with zcat and grep and I think that a very long URL is the problem (see appendix). 2. Can I somehow tell virtuoso not to quit TTLP on such lines, but to either ignore or truncate them? 3. How much data was not inserted? The error handler calls a rollback work, but it seems that in the rdf_loader_run procedure a commit work is done only after every 100 files loaded, which would mean that all which was inserted before is lost? At the same time log_enable(2, 1) is set, which means autocommit for every row, no log, so why is there a commit at all? 4. How do I continue? Can I simply restart with just these two files after fixing? Cheers, Jörn Appendix: === did a few checks: == SPARQL select count(*) where { ?s http://dbpedia.org/property/reference ?o . }; results in: 348955 At the same time the external_links_en.nt.gz file has: 5081932 lines (zcat external_links_en.nt.gz | wc -l) The corresponding lines in the file look like this (zcat external_links_en.nt.gz | cat -n | head -n $((348955+1)) | tail -n 2 ): 348955 http://dbpedia.org/resource/Fourteen_Points http://dbpedia.org/property/reference http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points . 348956 http://dbpedia.org/resource/Fourteen_Points http://dbpedia.org/property/reference http://www.mtholyoke.edu/acad/intrel/doc31.htm . Notice the 's in line 348955, but actually as we got 348955 triples, the problem should've occured in the line after that one, but in line 348956 I see nothing wrong. Also notice that sparql select * where { http://dbpedia.org/resource/Fourteen_Points http://dbpedia.org/property/reference ?o .}; results in: http://www.loc.gov/exhibits/treasures/trm053.html http://wwi.lib.byu.edu/index.php/Main_Page http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points http://www.mtholyoke.edu/acad/intrel/doc31.htm http://www.ourdocuments.gov/doc.php?flash=truedoc=62 http://web.jjay.cuny.edu/jobrien/reference/ob34.html So line 348956 is imported. Using nested intervals i found this: 356036 http://dbpedia.org/resource/Antisocial_personality_disorder http://dbpedia.org/property/reference http://www.faculty.missouristate.edu/M/MichaelCarlie/what_I_learned_about/GANGS/WHYFORM/pathological.htm . 356037 http://dbpedia.org/resource/Hugo_Simberg http://dbpedia.org/property/reference
[Virtuoso-users] php hosting with virtuoso on Leopard
I'm am not having success in getting php hosting up on Mac OS 10.5. Following the instructions here http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSBldPHP , I am getting a link error. I can't figure out how to get a dylib out of the php build process. configure:23970: /usr/bin/gcc-4.0 -o conftest -O2 -arch ppc -I/opt/ local/include -I/Users/virgin ia/php53/include/php/main -I/Users/virginia/php53/include/php/Zend -I/ Users/virginia/php53/inclu de/php/TSRM -I/Users/virginia/php53/include/php -L/opt/local/lib -arch ppc conftest.c -L/Users/ virginia/php53/lib -lphp5 5 ld: in /Users/virginia/php53/lib/libphp5.so, can't link with bundle (MH_BUNDLE) only dylibs (MH_ DYLIB) collect2: ld returned 1 exit status any ideas on the proper build flags and where to add them to the build process (acinclude.m4 maybe?)? thanks David Coate
Re: [Virtuoso-users] Problems loading the DBpedia dump (very long URIs)
Hi Jörn, after having a lot trouble with our group internal DBpedia mirror, I decided to do a fresh install. I downloaded all the files (we only need a subset of the dbpedia: {en,de}) and followed this guide yesterday: http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia It uses the http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript to load the dumps (btw: the script could greatly benefit from a few comments). Now today I found these lines in the logs: 23:33:20 PL LOG: Loader started Sat Aug 07 2010 00:06:49 PL LOG: File /usr/local/data/dbpedia/3.5.1/en/external_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O' 00:32:56 Checkpoint started 00:34:39 Checkpoint finished, log reused 02:05:30 PL LOG: File /usr/local/data/dbpedia/3.5.1/en/ page_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O' 02:47:44 PL LOG: No more files to load. Loader has finished, So I went to investigate where this comes from and it seems that inside the ld_file procedure of the VirtBulkRDFLoaderScript the error is caught. It uses the TTLP procedure to load the data. 1. Is there a possibility to for example get the line number where the error occurred? I did a few checks on the external_links_en.nt.gz file with zcat and grep and I think that a very long URL is the problem (see appendix). The loader process creates a table called load_list which records which datasets where not loaded completely and for what reason. Please try: select * from load_list where where ll_error is not null; This should give you good indication of what is going on, including the line number where the failure happened. 2. Can I somehow tell virtuoso not to quit TTLP on such lines, but to either ignore or truncate them? There are some flags to the TTLP code that would probably skip this errror, but in certain cases that could insert partial data into the database, which is much harder to clean up, so we have not made it default. 3. How much data was not inserted? The error handler calls a rollback work, but it seems that in the rdf_loader_run procedure a commit work is done only after every 100 files loaded, which would mean that all which was inserted before is lost? At the same time log_enable(2, 1) is set, which means autocommit for every row, no log, so why is there a commit at all? The commit is there because you can overrule the log_enable when calling the rdf_loader_run call. Since you did not, it was on autocommit so all the triples loaded before the bad triple have been committed. 4. How do I continue? Can I simply restart with just these two files after fixing? When you have found the offending line, you can dump all the remaining triples of this file to a new file and just load that partial result. Since all records before this point in the file are already in, that should complete without this one bad record. Appendix: === did a few checks: == SPARQL select count(*) where { ?s http://dbpedia.org/property/reference ?o . }; results in: 348955 At the same time the external_links_en.nt.gz file has: 5081932 lines (zcat external_links_en.nt.gz | wc -l) The corresponding lines in the file look like this (zcat external_links_en.nt.gz | cat -n | head -n $((348955+1)) | tail -n 2 ): 348955 http://dbpedia.org/resource/Fourteen_Points http://dbpedia.org/property/reference http://wwi.lib.byu.edu/index.php/ President_Wilson's_Fourteen_Points . 348956 http://dbpedia.org/resource/Fourteen_Points http://dbpedia.org/property/reference http://www.mtholyoke.edu/acad/intrel/doc31.htm . Notice the 's in line 348955, but actually as we got 348955 triples, the problem should've occured in the line after that one, but in line 348956 I see nothing wrong. Also notice that sparql select * where { http://dbpedia.org/resource/Fourteen_Points http://dbpedia.org/property/reference ?o .}; results in: http://www.loc.gov/exhibits/treasures/trm053.html http://wwi.lib.byu.edu/index.php/Main_Page http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points http://www.mtholyoke.edu/acad/intrel/doc31.htm http://www.ourdocuments.gov/doc.php?flash=truedoc=62 http://web.jjay.cuny.edu/jobrien/reference/ob34.html So line 348956 is imported. Using nested intervals i found this: 356036 http://dbpedia.org/resource/Antisocial_personality_disorder http://dbpedia.org/property/reference http://www.faculty.missouristate.edu/M/MichaelCarlie/what_I_learned_about/GANGS/WHYFORM/pathological.htm . 356037 http://dbpedia.org/resource/Hugo_Simberg http://dbpedia.org/property/reference