Peter Ansell wrote:
2008/11/23 Kingsley Idehen <kide...@openlinksw.com>
Marvin Lugair wrote:
> Hello,
>
> I would like to report back on my loading of dbpedia 3.2 into Open-Source Virtuoso 5.0.9.
> The good news is that I was successful and have a local DBPedia
to play with now. Thanks to everyone for their input and
suggestions on configuration parameters!
>
> Marv
>
> ----------------
>
> Running Ubuntu 8.10 (Intrepid)
> Kernel 2.6.27-7
> 8GB DDR2 RAM
> AMD Athlon 2.5 GHz dual core
>
> It took around 22 hours to import the core (21 files) and build a .db database file from them. The import resulted in one dbpedia.db file that is about 20-something GB in size.
> It typically takes about an hour to start that database (load the .db file into memory) and start the virtuoso process.
> As a reference:
> Time to load infobox_en.nt = 52 minutes
>
>
> Some of the parameters in my dbpedia.ini
>
> MaxCheckpointRemap = 1000000
> MaxMemPoolSize = 0
> StopCompilerWhenXOverRunTime = 1
> DefaultIsolation = 2
> NumberOfBuffers = 550000
> MaxDirtyBuffers = 320000
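
For a rough sense of what the two buffer settings above mean in memory terms, here is a small sketch. It assumes that each Virtuoso buffer caches one 8 KB database page; that page size is not stated in the message and should be checked against the documentation for the version in use. The function name is hypothetical.

```python
# Assumption (not from the thread): one buffer caches one 8 KB database page.
PAGE_SIZE_BYTES = 8 * 1024


def buffer_pool_gib(number_of_buffers, page_size=PAGE_SIZE_BYTES):
    """Approximate size in GiB of the pages held by the given number of buffers."""
    return number_of_buffers * page_size / 2**30


# The two values quoted from dbpedia.ini in the message above.
for name, count in [("NumberOfBuffers", 550000), ("MaxDirtyBuffers", 320000)]:
    print(f"{name} = {count}: about {buffer_pool_gib(count):.1f} GiB")
```

Under that assumption, 550000 buffers come to roughly 4.2 GiB, about half of the 8GB of RAM on this machine.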
>
>
> Files that had errors
> ---------------
> Three files did not load because of malformed URIs (about 500 of them across the three files; 400-something of those lines were in the externallinks file). I tried to reload these files with the ttlp_mt bit mask that ignores errors, but it did not work.
> I deleted the corresponding triples and reloaded. Basically you lose those triples. Someone needs to fix these in the DBpedia files.
>
>
> The three files with errors are:
> 1> homepage_en.nt
> 2> externallinks_en.nt
> 3> infobox-mappingbased-loose.nt
> The URIs had spaces, backslashes, or even Korean characters (in one case) in them. These files need cleaning up.
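
One way to set such lines aside before loading is sketched below. This is a heuristic, not a full N-Triples parser: it flags exactly the three problems reported above (whitespace, backslashes, non-ASCII characters inside an IRI) and sends any line it cannot recognise to the rejected file for manual review. The function names and file names are hypothetical, and the files are assumed to be readable as UTF-8.

```python
import re

# Heuristic shape of an N-Triples line: <subject> <predicate> object .
# The IRI groups deliberately allow spaces so broken IRIs are still captured.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

# Problems reported in the message: whitespace, backslash, non-ASCII characters.
BAD_CHAR = re.compile(r'[\s\\]|[^\x00-\x7f]')


def iris_in_line(line):
    """Return the IRIs of a triple line, or None if the line is not recognised."""
    match = TRIPLE.match(line)
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    iris = [subject, predicate]
    # Only IRI objects are checked; literals start with a double quote.
    if obj.startswith('<') and obj.endswith('>'):
        iris.append(obj[1:-1])
    return iris


def is_clean(line):
    """True for blank lines, comments, and triples whose IRIs pass the check."""
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):
        return True
    iris = iris_in_line(stripped)
    if iris is None:
        # Unrecognised shape (including blank-node subjects): reject for review.
        return False
    return not any(BAD_CHAR.search(iri) for iri in iris)


def split_file(src_path, clean_path, rejected_path):
    """Copy clean lines to clean_path and the rest to rejected_path."""
    kept = rejected = 0
    with open(src_path, encoding='utf-8') as src, \
         open(clean_path, 'w', encoding='utf-8') as clean, \
         open(rejected_path, 'w', encoding='utf-8') as bad:
        for line in src:
            if is_clean(line):
                clean.write(line)
                kept += 1
            else:
                bad.write(line)
                rejected += 1
    return kept, rejected
```

For example, `split_file("externallinks_en.nt", "externallinks_clean.nt", "externallinks_rejected.nt")` would leave a file that can be loaded, plus a list of the lines that still need fixing upstream.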
>
>
>
>
> Some questions
> ---------------------------------------
> * Why does short-abstracts take 4 hours to load though it is 982 MB, whereas long-abstracts took 2 hours to load though its size is 1.7 GB?
> The only difference is that short was loaded a few files after long. Does load performance change as the database file (the one I am creating, dbpedia.db) grows larger?
>
> * What is the best way to check for and delete duplicate triples
in the database?
>
> * Related to this last question, it seems the online DBpedia gateway at http://dbpedia.org/sparql does not return duplicates over the webpage interface. However, it does return duplicates for the SAME query when submitted through Jena. To reproduce this, paste the following query into the webpage:
>
> select ?s
> where {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }
>
> This will return the following results in my web browser:
> http://dbpedia.org/resource/Bill_Cosby
> http://dbpedia.org/resource/Dick_Gregory
> http://dbpedia.org/resource/Eddie_Murphy
> http://dbpedia.org/resource/Flip_Wilson
> http://dbpedia.org/resource/George_Carlin
> http://dbpedia.org/resource/Mort_Sahl
> http://dbpedia.org/resource/Redd_Foxx
> http://dbpedia.org/resource/Richard_Pryor
> http://dbpedia.org/resource/Rodney_Dangerfield
> http://dbpedia.org/resource/Sam_Kinison
> http://dbpedia.org/resource/Steve_Martin
>
>
> no duplicates,
> Now run the *same* query through a Jena program
> In my Java source, here is how I am connecting to what I assume is the SAME gateway!
> QueryExecution qexec =
QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);
>
> and here is what I get (again, this is the exact same query):
>
> ----------------------------------------------------
> | s |
> ====================================================
> | <http://dbpedia.org/resource/Bill_Cosby> |
> | <http://dbpedia.org/resource/Dick_Gregory> |
> | <http://dbpedia.org/resource/Eddie_Murphy> |
> | <http://dbpedia.org/resource/Flip_Wilson> |
> | <http://dbpedia.org/resource/George_Carlin> |
> | <http://dbpedia.org/resource/Mort_Sahl> |
> | <http://dbpedia.org/resource/Redd_Foxx> |
> | <http://dbpedia.org/resource/Richard_Pryor> |
> | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> | <http://dbpedia.org/resource/Sam_Kinison> |
> | <http://dbpedia.org/resource/Steve_Martin> |
> | <http://dbpedia.org/resource/Bill_Cosby> |
> | <http://dbpedia.org/resource/Bill_Cosby> |
> | <http://dbpedia.org/resource/Dick_Gregory> |
> | <http://dbpedia.org/resource/Eddie_Murphy> |
> | <http://dbpedia.org/resource/Flip_Wilson> |
> | <http://dbpedia.org/resource/George_Carlin> |
> | <http://dbpedia.org/resource/Mort_Sahl> |
> | <http://dbpedia.org/resource/Redd_Foxx> |
> | <http://dbpedia.org/resource/Richard_Pryor> |
> | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> | <http://dbpedia.org/resource/Sam_Kinison> |
> | <http://dbpedia.org/resource/Steve_Martin> |
> | <http://dbpedia.org/resource/Eddie_Murphy> |
> ----------------------------------------------------
>
> Duplicates!
> Can someone please explain this?
>
> As an aside, when I run this from isql on my newly installed local DBpedia I get no duplicates (I haven't tried Jena against my local copy).
>
>
> <eom>
> _______________________________________________
> Dbpedia-discussion mailing list
> dbpedia-discuss...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>
Marvin,
You will see why when you run:
select *
where {graph ?g {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}}
As you can see, there are two graphs:
1. http://dbpedia.org
2. http://dbpedia.org/resource/<entity> (this one results from cache
activity associated with client interactions with Virtuoso)
Solutions:
-- Being specific about source Graph by specifying Graph IRI
select ?s
where {graph <http://dbpedia.org> {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}}
OR
select ?s
from <http://dbpedia.org>
where {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}
-- Using DISTINCT
select distinct ?s
where {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}
What instruction should be given to Jena or other clients to make them
behave the same way as the HTTP SPARQL page interface and not
resolve triples from the cache graphs?
Cheers,
Peter
Peter,
Qualify the GRAPH IRI in your query pattern using the examples above (or
use DISTINCT as per example above).
In relation to our Jena Provider also look at:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtJenaSPARQLExample2
If this isn't clear, just send a Jena code excerpt.
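
For clients that talk to the endpoint over plain HTTP, the same graph restriction can also be expressed outside the query text. The sketch below is not Jena code; it only builds the request URL and relies on the `default-graph-uri` parameter defined by the SPARQL Protocol, which plays the same role as the FROM clause in the example above. The function name is hypothetical, and nothing here contacts the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"
GRAPH = "http://dbpedia.org"

# The query from the thread, without any graph qualifier in its text.
QUERY = """
select ?s
where {
  ?s
  <http://dbpedia.org/property/influenced>
  <http://dbpedia.org/resource/Chris_Rock>
}
"""


def sparql_get_url(endpoint, query, default_graph=None):
    """Build a SPARQL Protocol GET URL, optionally restricted to one graph."""
    params = [("query", query)]
    if default_graph is not None:
        # Standard SPARQL Protocol parameter naming the default graph.
        params.append(("default-graph-uri", default_graph))
    return endpoint + "?" + urlencode(params)


# URL that asks only the http://dbpedia.org graph, leaving cache graphs out.
print(sparql_get_url(ENDPOINT, QUERY, GRAPH))
```

Whether a particular client library exposes this parameter directly varies; putting the graph IRI in the query text, as in the examples above, works with any client.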
--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com