Hello,
I would like to report back on my loading of DBpedia 3.2 into Open-Source Virtuoso 5.0.9. The good news is that I was successful and now have a local DBpedia to play with. Thanks to everyone for their input and suggestions on configuration parameters!

Marv

----------------
Running Ubuntu 8.10 (Intrepid)
Kernel 2.6.27-7
8 GB DDR2 RAM
AMD Athlon 2.5 GHz dual core

It took around 22 hours to import the core (21 files) and build a .db database file from them. The import resulted in one dbpedia.db file that is about 20-something GB in size. It typically takes about an hour to start that database (load the .db file into memory) and start the Virtuoso process.

As a reference:
Time to load infobox_en.nt = 52 minutes

Some of the parameters in my dbpedia.ini:
MaxCheckpointRemap = 1000000
MaxMemPoolSize = 0
StopCompilerWhenXOverRunTime = 1
DefaultIsolation = 2
NumberOfBuffers = 550000
MaxDirtyBuffers = 320000

Files that had errors
---------------
Three files did not load because of malformed URIs (about 500 of them across the three files; 400-something of those lines were in the externallinks file). I tried to reload these files with the ttlp_mt bit mask that ignores errors, but it did not work. I deleted the corresponding triples and reloaded. Basically, you lose those triples. Someone needs to fix these in the DBpedia files.

The three files with errors are:
1> homepage_en.nt
2> externallinks_en.nt
3> infobox-mappingbased-loose.nt

The URIs had spaces, backslashes, or even Korean characters (in one case) in them. These files need cleaning up.

Some questions
---------------------------------------
* Why does short-abstracts take 4 hours to load when it is 982 MB, whereas long-abstracts took 2 hours to load even though it is 1.7 GB?! The only difference is that short was loaded a few files after long... does performance change as the database file (the one I am creating, dbpedia.db) grows larger?
* What is the best way to check for and delete duplicate triples in the database?
* Related to this last question, it seems the online DBpedia gateway at dbpedia.org/sparql does not return duplicates through the web page interface. However, it does return duplicates for the SAME query when submitted through Jena. To reproduce this, paste the following query into the web page:

select ?s where { ?s <http://dbpedia.org/property/influenced> <http://dbpedia.org/resource/Chris_Rock> }

This returns the following results in my web browser:

http://dbpedia.org/resource/Bill_Cosby
http://dbpedia.org/resource/Dick_Gregory
http://dbpedia.org/resource/Eddie_Murphy
http://dbpedia.org/resource/Flip_Wilson
http://dbpedia.org/resource/George_Carlin
http://dbpedia.org/resource/Mort_Sahl
http://dbpedia.org/resource/Redd_Foxx
http://dbpedia.org/resource/Richard_Pryor
http://dbpedia.org/resource/Rodney_Dangerfield
http://dbpedia.org/resource/Sam_Kinison
http://dbpedia.org/resource/Steve_Martin

No duplicates. Now run the *same* query through a Jena program. In my Java source, here is how I am connecting to what I assume is the SAME gateway:
QueryExecution qexec = QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);

and here is what I get (again, this is the exact same query):

----------------------------------------------------
| s                                                |
====================================================
| <http://dbpedia.org/resource/Bill_Cosby>         |
| <http://dbpedia.org/resource/Dick_Gregory>       |
| <http://dbpedia.org/resource/Eddie_Murphy>       |
| <http://dbpedia.org/resource/Flip_Wilson>        |
| <http://dbpedia.org/resource/George_Carlin>      |
| <http://dbpedia.org/resource/Mort_Sahl>          |
| <http://dbpedia.org/resource/Redd_Foxx>          |
| <http://dbpedia.org/resource/Richard_Pryor>      |
| <http://dbpedia.org/resource/Rodney_Dangerfield> |
| <http://dbpedia.org/resource/Sam_Kinison>        |
| <http://dbpedia.org/resource/Steve_Martin>       |
| <http://dbpedia.org/resource/Bill_Cosby>         |
| <http://dbpedia.org/resource/Bill_Cosby>         |
| <http://dbpedia.org/resource/Dick_Gregory>       |
| <http://dbpedia.org/resource/Eddie_Murphy>       |
| <http://dbpedia.org/resource/Flip_Wilson>        |
| <http://dbpedia.org/resource/George_Carlin>      |
| <http://dbpedia.org/resource/Mort_Sahl>          |
| <http://dbpedia.org/resource/Redd_Foxx>          |
| <http://dbpedia.org/resource/Richard_Pryor>      |
| <http://dbpedia.org/resource/Rodney_Dangerfield> |
| <http://dbpedia.org/resource/Sam_Kinison>        |
| <http://dbpedia.org/resource/Steve_Martin>       |
| <http://dbpedia.org/resource/Eddie_Murphy>       |
----------------------------------------------------

Duplicates! Can someone please explain this? As an aside, when I run this from isql on my newly installed local DBpedia I get no duplicates (I haven't tried Jena with my local copy).

<eom>
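
For anyone who wants to pre-clean the three .nt files before loading rather than delete lines by hand, here is a minimal sketch of a filter, not a tested tool. It assumes one triple per line and treats any URI containing a space, a backslash, or a non-ASCII character as malformed, following the description above. The class and method names are made up for illustration, and the example.org URIs are placeholders. Only the part of each line before the first double quote is inspected, so angle brackets inside literals are ignored, and a malformed URI that itself contains a quote would be missed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: drops N-Triples lines whose URIs look malformed.
public class NTriplesUriFilter {
    // Anything between angle brackets is treated as a URI.
    private static final Pattern IRI = Pattern.compile("<([^>]*)>");

    // Assumption: whitespace/control characters, backslashes, and
    // non-ASCII characters all count as "malformed".
    static boolean isSuspicious(String iri) {
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= ' ' || c == '\\' || c > '~') {
                return true;
            }
        }
        return false;
    }

    // Checks only the text before the first double quote, so that
    // literal values are not mistaken for URIs.
    static boolean lineIsClean(String line) {
        int quote = line.indexOf('"');
        String head = quote >= 0 ? line.substring(0, quote) : line;
        Matcher m = IRI.matcher(head);
        while (m.find()) {
            if (isSuspicious(m.group(1))) {
                return false;
            }
        }
        return true;
    }

    // Returns the lines that passed the check, in their original order.
    static List<String> keepClean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (lineIsClean(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "<http://example.org/a> <http://example.org/p> <http://example.org/ok> .",
            "<http://example.org/a> <http://example.org/p> <http://example.org/has space> .",
            "<http://example.org/a> <http://example.org/p> <http://example.org/back\\slash> .",
            "<http://example.org/a> <http://example.org/p> \"a literal with spaces\" ."
        );
        for (String line : keepClean(sample)) {
            System.out.println(line);
        }
    }
}
```

On the four sample lines, only the first and the last are kept. As with deleting the lines by hand, the dropped triples are lost, so it may be worth writing them to a separate file for later repair.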
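
As a possible workaround on the client side (it does not explain why the gateway behaves differently for Jena), the standard SPARQL way to ask for unique rows is to write "select distinct ?s" in the query. Alternatively, the values can be de-duplicated after they come back. The sketch below shows the second option using only the Java standard library; the class name is made up, and the input list stands in for the ?s values that would be collected from the Jena result set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper: removes repeated values while keeping first-seen order.
public class DistinctBindings {
    static List<String> distinct(List<String> values) {
        // LinkedHashSet drops repeats and remembers insertion order.
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        // A shortened stand-in for the ?s column shown above.
        List<String> rows = Arrays.asList(
            "http://dbpedia.org/resource/Bill_Cosby",
            "http://dbpedia.org/resource/Dick_Gregory",
            "http://dbpedia.org/resource/Eddie_Murphy",
            "http://dbpedia.org/resource/Bill_Cosby",
            "http://dbpedia.org/resource/Eddie_Murphy"
        );
        for (String uri : distinct(rows)) {
            System.out.println(uri);
        }
    }
}
```

Using DISTINCT in the query is usually preferable because the server then sends fewer rows; the helper is only useful when the query text cannot be changed.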