Peter Ansell wrote:
2008/11/23 Kingsley Idehen <kide...@openlinksw.com>

    Marvin Lugair wrote:
    > Hello,
    >
    > I would like to report back on my loading of dbpedia 3.2 into
    Open-Source Virtuoso 5.0.9.
    > The good news is that I was successful and have a local DBPedia
    to play with now. Thanks to everyone for their input and
    suggestions on configuration parameters!
    >
    > Marv
    >
    > ----------------
    >
    > Running Ubuntu 8.10 (intrepid)
    > Kernel 2.6.27-7
    > 8GB DDR2 RAM
    > AMD Athlon 2.5 GHz Dual core
    >
    > It took around 22 hours to import the core (21 files) and make a
    .db database file out of them. The import resulted in one
    dbpedia.db file that is about 20-something GB in size.
    > It typically takes a little over an hour to start that database
    (load the .db file in memory) and start the virtuoso process.
    > As a reference:
    > Time to load infobox_en.nt = 52 minutes
    >
    >
    > Some of the parameters in my dbpedia.ini
    >
    > MaxCheckpointRemap              = 1000000
    > MaxMemPoolSize                  = 0
    > StopCompilerWhenXOverRunTime    = 1
    > DefaultIsolation                = 2
    > NumberOfBuffers                 = 550000
    > MaxDirtyBuffers                 = 320000
    >
    >
    > Files that had errors
    > ---------------
    > Three files did not load because of malformed URIs (about 500 of
    them across the three files, 400-something lines were in the
    externallinks file). I tried to reload these files with the
    ttlp_mt bit mask that ignores errors but it did not work.
    > I deleted the corresponding triples and reloaded. Basically you
    lose those triples. Someone needs to fix these in the DBpedia files.
    >
    >
    > The three files with errors are:
    >  1> homepage_en.nt
    >  2> externallinks_en.nt
    >  3> infobox-mappingbased-loose.nt
    > The URIs had spaces, backslashes or even Korean characters (in
    one case) in them. These files need cleaning up.
    >
    >
    >
    >
    > Some questions
    > ---------------------------------------
    > * Why does short-abstracts take 4 hours to load though it is 982MB
    > whereas long-abstracts took 2 hours to load though its size is
    1.7 gigs?!
    > The only difference is that short was loaded a few files after
    long... does performance change as the database file (the one I am
    creating, dbpedia.db) grows larger?
    >
    > * What is the best way to check for and delete duplicate triples
    in the database?
    >
    > * Related to this last question, it seems the online dbpedia at the
    dbpedia.org/sparql gateway does not return duplicates over the
    webpage interface. However, it does return duplicates for the SAME
    query when submitted through Jena. To reproduce this, paste the
    following query in the webpage:
    >
    > select ?s
    > where {
    > ?s
    >  <http://dbpedia.org/property/influenced>
    > <http://dbpedia.org/resource/Chris_Rock>
    > }
    >
    > This will return the following results in my web browser:
    > http://dbpedia.org/resource/Bill_Cosby
    > http://dbpedia.org/resource/Dick_Gregory
    > http://dbpedia.org/resource/Eddie_Murphy
    > http://dbpedia.org/resource/Flip_Wilson
    > http://dbpedia.org/resource/George_Carlin
    > http://dbpedia.org/resource/Mort_Sahl
    > http://dbpedia.org/resource/Redd_Foxx
    > http://dbpedia.org/resource/Richard_Pryor
    > http://dbpedia.org/resource/Rodney_Dangerfield
    > http://dbpedia.org/resource/Sam_Kinison
    > http://dbpedia.org/resource/Steve_Martin
    >
    >
    > No duplicates.
    > Now run the *same* query through a Jena program.
    > In my Java source, here is how I am connecting to what I assume
    is the SAME gateway!
    >  QueryExecution qexec =
    QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);
    >
    > and here is what I get (again this is the exact same query):
    >
    > ----------------------------------------------------
    > | s                                                |
    > ====================================================
    > | <http://dbpedia.org/resource/Bill_Cosby>         |
    > | <http://dbpedia.org/resource/Dick_Gregory>       |
    > | <http://dbpedia.org/resource/Eddie_Murphy>       |
    > | <http://dbpedia.org/resource/Flip_Wilson>        |
    > | <http://dbpedia.org/resource/George_Carlin>      |
    > | <http://dbpedia.org/resource/Mort_Sahl>          |
    > | <http://dbpedia.org/resource/Redd_Foxx>          |
    > | <http://dbpedia.org/resource/Richard_Pryor>      |
    > | <http://dbpedia.org/resource/Rodney_Dangerfield> |
    > | <http://dbpedia.org/resource/Sam_Kinison>        |
    > | <http://dbpedia.org/resource/Steve_Martin>       |
    > | <http://dbpedia.org/resource/Bill_Cosby>         |
    > | <http://dbpedia.org/resource/Bill_Cosby>         |
    > | <http://dbpedia.org/resource/Dick_Gregory>       |
    > | <http://dbpedia.org/resource/Eddie_Murphy>       |
    > | <http://dbpedia.org/resource/Flip_Wilson>        |
    > | <http://dbpedia.org/resource/George_Carlin>      |
    > | <http://dbpedia.org/resource/Mort_Sahl>          |
    > | <http://dbpedia.org/resource/Redd_Foxx>          |
    > | <http://dbpedia.org/resource/Richard_Pryor>      |
    > | <http://dbpedia.org/resource/Rodney_Dangerfield> |
    > | <http://dbpedia.org/resource/Sam_Kinison>        |
    > | <http://dbpedia.org/resource/Steve_Martin>       |
    > | <http://dbpedia.org/resource/Eddie_Murphy>       |
    > ----------------------------------------------------
    >
    > Duplicates!
    > Can someone please explain this?
    >
    > As an aside, when I run this from isql on my newly installed
    local dbpedia I get no duplicates (I haven't tried Jena with my
    local install).
    >
    >
    > <eom>
    >
    >
    >
    >
    >
    >
    > _______________________________________________
    > Dbpedia-discussion mailing list
    > dbpedia-discuss...@lists.sourceforge.net
    > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
    >
    >
    Marvin,

    You will see why when you run:

    select *
    where {graph ?g {
    ?s
     <http://dbpedia.org/property/influenced>
    <http://dbpedia.org/resource/Chris_Rock>
    }}


    As you can see, there are two graphs:
    1. http://dbpedia.org
    2. http://dbpedia.org/resource/<entity> (this one results from cache
    activity associated with client interactions with Virtuoso)

    Solutions:
    -- Being specific about source Graph by specifying Graph IRI
    select ?s
    where {graph <http://dbpedia.org> {
    ?s
     <http://dbpedia.org/property/influenced>
    <http://dbpedia.org/resource/Chris_Rock>
    }}

    OR

    select ?s
    from <http://dbpedia.org>
    where {
    ?s
     <http://dbpedia.org/property/influenced>
    <http://dbpedia.org/resource/Chris_Rock>
    }

    -- Using DISTINCT

    select distinct ?s
    where {
    ?s
     <http://dbpedia.org/property/influenced>
    <http://dbpedia.org/resource/Chris_Rock>
    }


What instruction should be given to Jena or other clients to make them behave the same way as the HTTP SPARQL page interface, so that they do not resolve triples from the cache graphs?

Cheers,

Peter

Peter,
Qualify the GRAPH IRI in your query pattern using the examples above (or use DISTINCT as per example above).


In relation to our Jena Provider also look at:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtJenaSPARQLExample2


If this isn't clear, just send a Jena code excerpt.
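
For Jena or any other HTTP client, here is a minimal sketch of the first solution (naming the graph) that leaves the query text untouched. It assumes the endpoint honours the standard SPARQL Protocol `default-graph-uri` request parameter. The class and method names (`SparqlGraphScope`, `requestUrl`) are made up for illustration and are not part of Jena or Virtuoso. Jena's ARQ should also accept the graph IRI as a third argument, `QueryExecutionFactory.sparqlService(service, query, defaultGraph)`, but check that against your Jena version.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not a Jena or Virtuoso class).
class SparqlGraphScope {

    // Graph IRI taken from the thread above.
    static final String DBPEDIA_GRAPH = "http://dbpedia.org";

    /**
     * Builds a SPARQL Protocol GET URL that names the default graph
     * explicitly, so the server does not have to pick one itself.
     * Assumption: the endpoint supports the "default-graph-uri" parameter.
     */
    static String requestUrl(String endpoint, String query, String defaultGraphIri) {
        return endpoint
                + "?default-graph-uri="
                + URLEncoder.encode(defaultGraphIri, StandardCharsets.UTF_8)
                + "&query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same query as in the thread, unchanged (no GRAPH, FROM or DISTINCT).
        String query = "select ?s where { ?s "
                + "<http://dbpedia.org/property/influenced> "
                + "<http://dbpedia.org/resource/Chris_Rock> }";

        // Print the URL an HTTP client would fetch.
        System.out.println(requestUrl("http://dbpedia.org/sparql", query, DBPEDIA_GRAPH));
    }
}
```

If the server applies the parameter as the SPARQL Protocol describes, fetching the printed URL should return rows from <http://dbpedia.org> only, which is the same effect as the FROM variant above.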


--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO OpenLink Software Web: http://www.openlinksw.com




