2008/11/23 Kingsley Idehen <kide...@openlinksw.com>

> Marvin Lugair wrote:
> > Hello,
> >
> > I would like to report back on my loading of DBpedia 3.2 into Open-Source
> > Virtuoso 5.0.9.
> > The good news is that I was successful and have a local DBpedia to play
> > with now. Thanks to everyone for their input and suggestions on
> > configuration parameters!
> >
> > Marv
> >
> > ----------------
> >
> > Running Ubuntu 8.10 (Intrepid)
> > Kernel 2.6.27-7
> > 8GB DDR2 RAM
> > AMD Athlon 2.5 GHz dual core
> >
> > It took around 22 hours to import the core (21 files) and make a .db
> > database file out of them. The import resulted in one dbpedia.db file that
> > is about 20-something GB in size.
> > It typically takes about an hour to start that database (load the .db
> > file in memory) and start the virtuoso process.
> > As a reference:
> > Time to load infobox_en.nt = 52 minutes
> >
> > Some of the parameters in my dbpedia.ini
> >
> > MaxCheckpointRemap = 1000000
> > MaxMemPoolSize = 0
> > StopCompilerWhenXOverRunTime = 1
> > DefaultIsolation = 2
> > NumberOfBuffers = 550000
> > MaxDirtyBuffers = 320000
> >
> > Files that had errors
> > ---------------
> > Three files did not load because of malformed URIs (about 500 of them
> > across the three files; 400-something lines were in the externallinks file).
> > I tried to reload these files with the ttlp_mt bit mask that ignores errors
> > but it did not work.
> > I deleted the corresponding triples and reloaded. Basically you lose
> > those triples. Someone needs to fix these in the DBpedia files.
> >
> > The three files with errors are:
> > 1> homepage_en.nt
> > 2> externallinks_en.nt
> > 3> infobox-mappingbased-loose.nt
> > The URIs either had spaces, backslashes or even Korean characters (in
> > one case) in them. These files need cleaning up.
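
One way to find the offending lines before loading would be a rough pre-filter over the .nt files. The following is only a minimal sketch, not anything Virtuoso or DBpedia ships: the class name `NTriplesUriCheck` and the method `hasSuspectUri` are hypothetical, and the check is deliberately crude. It flags any `<...>` term before the first literal that contains whitespace, a backslash or a non-ASCII character, which are the three problems reported above. Note that it will also flag `\u` escapes inside URIs, so treat hits as candidates for inspection rather than definite errors.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NTriplesUriCheck {
    // N-Triples writes IRIs between angle brackets.
    private static final Pattern IRI = Pattern.compile("<([^>]*)>");

    /** Returns true if a URI on this line contains whitespace, a backslash or a non-ASCII character. */
    static boolean hasSuspectUri(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return false; // blank line or comment
        }
        // Ignore everything from the first double quote on: that is the literal
        // object, which may legitimately contain '<', '>' and any characters.
        int quote = trimmed.indexOf('"');
        String head = quote >= 0 ? trimmed.substring(0, quote) : trimmed;

        Matcher m = IRI.matcher(head);
        while (m.find()) {
            String iri = m.group(1);
            for (int i = 0; i < iri.length(); i++) {
                char c = iri.charAt(i);
                if (c <= ' ' || c == '\\' || c > 0x7E) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        String[] samples = {
            "<http://example.org/a> <http://example.org/p> <http://example.org/o> .",
            "<http://example.org/a b> <http://example.org/p> <http://example.org/o> ."
        };
        for (String s : samples) {
            System.out.println((hasSuspectUri(s) ? "SUSPECT " : "ok      ") + s);
        }
    }
}
```

Lines flagged this way could be written to a separate file for manual repair, so the rest of each file can still be loaded.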
> >
> > Some questions
> > ---------------------------------------
> > * Why does short-abstracts take 4 hours to load though it is 982MB,
> > whereas long-abstracts took 2 hours to load though its size is 1.7 GB?
> > The only difference is that short was loaded a few files after long...
> > Does performance change as the database file (the one I am creating,
> > dbpedia.db) grows larger?
> >
> > * What is the best way to check for and delete duplicate triples in the
> > database?
> >
> > * Related to this last question, it seems the online DBpedia gateway at
> > dbpedia.org/sparql does not return duplicates over the webpage
> > interface. However, it does return duplicates for the SAME query when
> > submitted through Jena. To reproduce this, paste the following query in the
> > webpage:
> >
> > select ?s
> > where {
> > ?s
> > <http://dbpedia.org/property/influenced>
> > <http://dbpedia.org/resource/Chris_Rock>
> > }
> >
> > This will return the following results in my web browser:
> > http://dbpedia.org/resource/Bill_Cosby
> > http://dbpedia.org/resource/Dick_Gregory
> > http://dbpedia.org/resource/Eddie_Murphy
> > http://dbpedia.org/resource/Flip_Wilson
> > http://dbpedia.org/resource/George_Carlin
> > http://dbpedia.org/resource/Mort_Sahl
> > http://dbpedia.org/resource/Redd_Foxx
> > http://dbpedia.org/resource/Richard_Pryor
> > http://dbpedia.org/resource/Rodney_Dangerfield
> > http://dbpedia.org/resource/Sam_Kinison
> > http://dbpedia.org/resource/Steve_Martin
> >
> > No duplicates.
> > Now run the *same* query through a Jena program.
> > In my Java source, here is how I am connecting to what I assume is the
> > SAME gateway!
> > QueryExecution qexec = QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);
> >
> > and here is what I get (again, this is the exact same query):
> >
> > ----------------------------------------------------
> > | s                                                |
> > ====================================================
> > | <http://dbpedia.org/resource/Bill_Cosby>         |
> > | <http://dbpedia.org/resource/Dick_Gregory>       |
> > | <http://dbpedia.org/resource/Eddie_Murphy>       |
> > | <http://dbpedia.org/resource/Flip_Wilson>        |
> > | <http://dbpedia.org/resource/George_Carlin>      |
> > | <http://dbpedia.org/resource/Mort_Sahl>          |
> > | <http://dbpedia.org/resource/Redd_Foxx>          |
> > | <http://dbpedia.org/resource/Richard_Pryor>      |
> > | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> > | <http://dbpedia.org/resource/Sam_Kinison>        |
> > | <http://dbpedia.org/resource/Steve_Martin>       |
> > | <http://dbpedia.org/resource/Bill_Cosby>         |
> > | <http://dbpedia.org/resource/Bill_Cosby>         |
> > | <http://dbpedia.org/resource/Dick_Gregory>       |
> > | <http://dbpedia.org/resource/Eddie_Murphy>       |
> > | <http://dbpedia.org/resource/Flip_Wilson>        |
> > | <http://dbpedia.org/resource/George_Carlin>      |
> > | <http://dbpedia.org/resource/Mort_Sahl>          |
> > | <http://dbpedia.org/resource/Redd_Foxx>          |
> > | <http://dbpedia.org/resource/Richard_Pryor>      |
> > | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> > | <http://dbpedia.org/resource/Sam_Kinison>        |
> > | <http://dbpedia.org/resource/Steve_Martin>       |
> > | <http://dbpedia.org/resource/Eddie_Murphy>       |
> > ----------------------------------------------------
> >
> > Duplicates!
> > Can someone please explain this?
> >
> > As an aside, when I run this from isql on my newly installed local
> > DBpedia I get no duplicates (I haven't tried Jena with my local install).
> >
> > <eom>
> >
> > _______________________________________________
> > Dbpedia-discussion mailing list
> > dbpedia-discuss...@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
> >
> Marvin,
>
> You will see why when you run:
>
> select *
> where {graph ?g {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }}
>
> As you can see, there are two graphs:
> 1. http://dbpedia.org
> 2. http://dbpedia.org/resource/<entity> (this one results from cache
> activity associated with client interactions with Virtuoso)
>
> Solutions:
> -- Being specific about the source graph by specifying the graph IRI
>
> select ?s
> where {graph <http://dbpedia.org> {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }}
>
> OR
>
> select ?s
> from <http://dbpedia.org>
> where {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }
>
> -- Using DISTINCT
>
> select distinct ?s
> where {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }
>
What instruction should be given to Jena or other clients to make them behave the same way as the HTTP SPARQL page interface, so that they do not resolve triples from the cache graphs?

Cheers,
Peter
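
One candidate, offered as an assumption rather than a confirmed answer: the SPARQL Protocol defines a `default-graph-uri` request parameter, and the web form at dbpedia.org/sparql appears to send it with the value http://dbpedia.org, which would explain why the page shows no duplicates. A client that sends the same parameter should get the same dataset as the quoted `from <http://dbpedia.org>` query. The sketch below uses only the Java standard library to show the request URL such a client would issue; the class name `SparqlRequestUrl` and the method `build` are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SparqlRequestUrl {
    /**
     * Builds a SPARQL Protocol GET URL. Pass null for defaultGraph to leave
     * the dataset up to the endpoint (the situation that produced duplicates).
     */
    static String build(String endpoint, String query, String defaultGraph) {
        StringBuilder url = new StringBuilder(endpoint);
        url.append("?query=").append(URLEncoder.encode(query, StandardCharsets.UTF_8));
        if (defaultGraph != null) {
            // default-graph-uri is the SPARQL Protocol parameter naming the default graph.
            url.append("&default-graph-uri=")
               .append(URLEncoder.encode(defaultGraph, StandardCharsets.UTF_8));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        String query = "select ?s where { ?s "
                + "<http://dbpedia.org/property/influenced> "
                + "<http://dbpedia.org/resource/Chris_Rock> }";
        System.out.println(build("http://dbpedia.org/sparql", query, "http://dbpedia.org"));
    }
}
```

For Jena specifically, I believe `QueryExecutionFactory.sparqlService` has an overload that takes a default graph URI as a third argument, which would set the same parameter; please check that against the Javadoc for your ARQ version before relying on it. Failing that, the `from <http://dbpedia.org>` form quoted above works with any client, because the graph is named in the query itself.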