2008/11/23 Kingsley Idehen <kide...@openlinksw.com>

> Marvin Lugair wrote:
> > Hello,
> >
> > I would like to report back on my loading of DBpedia 3.2 into Open-Source
> > Virtuoso 5.0.9.
> > The good news is that I was successful and now have a local DBpedia to play
> > with. Thanks to everyone for their input and suggestions on
> > configuration parameters!
> >
> > Marv
> >
> > ----------------
> >
> > Running Ubuntu 8.10 (Intrepid)
> > Kernel 2.6.27-7
> > 8 GB DDR2 RAM
> > AMD Athlon 2.5 GHz dual core
> >
> > It took around 22 hours to import the core (21 files) and build a .db
> > database file from them. The import resulted in one dbpedia.db file that
> > is about 20-something GB in size.
> > It typically takes about an hour to start that database (load the .db
> > file into memory) and start the virtuoso process.
> > As a reference:
> > Time to load infobox_en.nt = 52 minutes
> >
> >
> > Some of the parameters in my dbpedia.ini
> >
> > MaxCheckpointRemap              = 1000000
> > MaxMemPoolSize                  = 0
> > StopCompilerWhenXOverRunTime    = 1
> > DefaultIsolation                = 2
> > NumberOfBuffers                 = 550000
> > MaxDirtyBuffers                 = 320000
> >
> >
> > Files that had errors
> > ---------------
> > Three files did not load because of malformed URIs (about 500 of them
> > across the three files; 400-something lines were in the externallinks file).
> > I tried to reload these files with the ttlp_mt bit mask that ignores errors,
> > but it did not work.
> > I deleted the corresponding triples and reloaded. Basically, you lose
> > those triples. Someone needs to fix these in the DBpedia files.
> >
> >
> > The three files with errors are:
> >  1> homepage_en.nt
> >  2> externallinks_en.nt
> >  3> infobox-mappingbased-loose.nt
> > The URIs had spaces, backslashes, or even Korean characters (in
> > one case) in them. These files need cleaning up.
> >
> >
> >
> >
> > Some questions
> > ---------------------------------------
> > * Why does short-abstracts take 4 hours to load although it is only 982 MB,
> > whereas long-abstracts took 2 hours to load although it is 1.7 GB?
> > The only difference is that short was loaded a few files after long.
> > Does load performance degrade as the database file (the one I am creating,
> > dbpedia.db) grows larger?
> >
> > * What is the best way to check for and delete duplicate triples in the
> > database?
> >
> > * Related to this last question, the online DBpedia SPARQL gateway at
> > dbpedia.org/sparql does not return duplicates through the web page
> > interface. However, it does return duplicates for the SAME query when
> > submitted through Jena. To reproduce this, paste the following query into the
> > web page:
> >
> > select ?s
> > where {
> > ?s
> >  <http://dbpedia.org/property/influenced>
> > <http://dbpedia.org/resource/Chris_Rock>
> > }
> >
> > This will return the following results in my web browser:
> > http://dbpedia.org/resource/Bill_Cosby
> > http://dbpedia.org/resource/Dick_Gregory
> > http://dbpedia.org/resource/Eddie_Murphy
> > http://dbpedia.org/resource/Flip_Wilson
> > http://dbpedia.org/resource/George_Carlin
> > http://dbpedia.org/resource/Mort_Sahl
> > http://dbpedia.org/resource/Redd_Foxx
> > http://dbpedia.org/resource/Richard_Pryor
> > http://dbpedia.org/resource/Rodney_Dangerfield
> > http://dbpedia.org/resource/Sam_Kinison
> > http://dbpedia.org/resource/Steve_Martin
> >
> >
> > No duplicates.
> > Now run the *same* query through a Jena program.
> > In my Java source, here is how I am connecting to what I assume is the
> > SAME gateway!
> >  QueryExecution qexec = QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);
> >
> > And here is what I get (again, this is the exact same query):
> >
> > ----------------------------------------------------
> > | s                                                |
> > ====================================================
> > | <http://dbpedia.org/resource/Bill_Cosby>         |
> > | <http://dbpedia.org/resource/Dick_Gregory>       |
> > | <http://dbpedia.org/resource/Eddie_Murphy>       |
> > | <http://dbpedia.org/resource/Flip_Wilson>        |
> > | <http://dbpedia.org/resource/George_Carlin>      |
> > | <http://dbpedia.org/resource/Mort_Sahl>          |
> > | <http://dbpedia.org/resource/Redd_Foxx>          |
> > | <http://dbpedia.org/resource/Richard_Pryor>      |
> > | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> > | <http://dbpedia.org/resource/Sam_Kinison>        |
> > | <http://dbpedia.org/resource/Steve_Martin>       |
> > | <http://dbpedia.org/resource/Bill_Cosby>         |
> > | <http://dbpedia.org/resource/Bill_Cosby>         |
> > | <http://dbpedia.org/resource/Dick_Gregory>       |
> > | <http://dbpedia.org/resource/Eddie_Murphy>       |
> > | <http://dbpedia.org/resource/Flip_Wilson>        |
> > | <http://dbpedia.org/resource/George_Carlin>      |
> > | <http://dbpedia.org/resource/Mort_Sahl>          |
> > | <http://dbpedia.org/resource/Redd_Foxx>          |
> > | <http://dbpedia.org/resource/Richard_Pryor>      |
> > | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> > | <http://dbpedia.org/resource/Sam_Kinison>        |
> > | <http://dbpedia.org/resource/Steve_Martin>       |
> > | <http://dbpedia.org/resource/Eddie_Murphy>       |
> > ----------------------------------------------------
> >
> > Duplicates!
> > Can someone please explain this?
> >
> > As an aside, when I run this from isql on my newly installed local
> > DBpedia I get no duplicates (I haven't tried Jena against my local install).
> >
> >
> > <eom>
> >
> >
> >
> >
> >
> > _______________________________________________
> > Dbpedia-discussion mailing list
> > dbpedia-discuss...@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
> >
> >
> Marvin,
>
> You will see why when you run:
>
> select *
> where {graph ?g {
> ?s
>  <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }}
>
>
> As you can see, there are two graphs:
> 1. http://dbpedia.org
> 2. http://dbpedia.org/resource/<entity> (this one results from cache
> activity associated with client interactions with Virtuoso)
>
> Solutions:
> -- Being specific about the source graph by specifying its graph IRI
> select ?s
> where {graph <http://dbpedia.org> {
> ?s
>  <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }}
>
> OR
>
> select ?s
> from <http://dbpedia.org>
> where {
> ?s
>  <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }
>
> -- Using DISTINCT
>
> select distinct ?s
> where {
> ?s
>  <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }
>
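The explanation quoted above can be illustrated with a toy model. This is only a sketch: the quads are made-up sample data, the cache graph name is a hypothetical instance of the http://dbpedia.org/resource/<entity> pattern, and select_subjects is an illustrative helper, not anything Virtuoso or Jena provides. It assumes the endpoint matches across all graphs when the query names none.

```python
# Toy model: the same triple stored in two named graphs shows up twice
# unless the query is limited to one graph or uses DISTINCT.
PREDICATE = "http://dbpedia.org/property/influenced"
OBJECT = "http://dbpedia.org/resource/Chris_Rock"
COSBY = "http://dbpedia.org/resource/Bill_Cosby"
MURPHY = "http://dbpedia.org/resource/Eddie_Murphy"

# (graph, subject, predicate, object) quads; hypothetical sample data.
QUADS = [
    ("http://dbpedia.org", COSBY, PREDICATE, OBJECT),
    ("http://dbpedia.org", MURPHY, PREDICATE, OBJECT),
    # Assumed cache graph holding a second copy of one triple.
    ("http://dbpedia.org/resource/Bill_Cosby", COSBY, PREDICATE, OBJECT),
]


def select_subjects(quads, graph=None, distinct=False):
    """Return matching subjects, optionally limited to a single graph."""
    rows = [
        s
        for (g, s, p, o) in quads
        if p == PREDICATE and o == OBJECT and (graph is None or g == graph)
    ]
    if distinct:
        # dict.fromkeys drops repeats while keeping first-seen order.
        rows = list(dict.fromkeys(rows))
    return rows


all_graphs = select_subjects(QUADS)                             # no graph named
one_graph = select_subjects(QUADS, graph="http://dbpedia.org")  # FROM / GRAPH
deduped = select_subjects(QUADS, distinct=True)                 # DISTINCT

print(len(all_graphs), len(one_graph), len(deduped))
```

Naming the graph and using DISTINCT give the same rows here, but they are not equivalent in general: DISTINCT still reads the cache graphs and would return any subject that appears only there.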

What instruction should be given to Jena or other clients to make them
behave the same way as the HTTP SPARQL page interface, so that they do not
resolve triples from the cache graphs?

Cheers,

Peter
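
One client-independent way to approach the question above is at the protocol level. The SPARQL Protocol defines a default-graph-uri request parameter, and a client can send it alongside the query. The sketch below only builds the request URL and does not contact the server. It assumes, without confirmation from this thread, that the web form does the same with http://dbpedia.org; build_request_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"
# Assumption: the graph the web form uses as its default graph.
DEFAULT_GRAPH = "http://dbpedia.org"

QUERY = """select ?s
where {
?s
 <http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}"""


def build_request_url(endpoint, query, default_graph=None):
    """Build a SPARQL Protocol GET URL, optionally naming a default graph."""
    params = [("query", query)]
    if default_graph is not None:
        # Standard SPARQL Protocol parameter for the default graph.
        params.append(("default-graph-uri", default_graph))
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY, DEFAULT_GRAPH)
print(url)
```

For Jena specifically, QueryExecutionFactory.sparqlService has, as far as I know, an overload that takes a default graph URI as a third argument; please check this against the ARQ version in use. Adding a FROM <http://dbpedia.org> clause to the query text, as in the quoted solution, should have the same effect.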
