Re: [Dbpedia-discussion] Strategies to download subsets of DBPedia

Saeedeh Shekarpour Tue, 23 Jul 2013 06:43:48 -0700

Dear all

We are pleased to announce our Slicing approach.
Many LOD datasets are quite large and, despite progress in RDF data
management, loading and querying them within a triple store is extremely
time-consuming and resource-demanding. To overcome this consumption
obstacle, we propose RDF dataset slicing, a process inspired by the
classical Extract-Transform-Load (ETL) paradigm.
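
As a rough illustration of the idea only (this is not the RDFSlice
implementation), a slice can be pictured as a streaming filter that keeps
the triples matching a pattern, so the full dump never has to be loaded
into a triple store. The sketch below assumes one triple per line with IRI
subjects; the name `slice_triples` and the use of `None` as a wildcard are
assumptions made for this example.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Simplified N-Triples line: <subject> <predicate> object .
# Assumption: subjects are IRIs (blank-node subjects are skipped) and the
# object is kept as raw text, e.g. '<http://...>' or '"a literal"'.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

# (subject IRI, predicate IRI, raw object text); None matches anything.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def slice_triples(lines: Iterable[str], pattern: Pattern) -> Iterator[str]:
    """Yield the input lines whose triple matches the given pattern."""
    s_want, p_want, o_want = pattern
    for line in lines:
        m = TRIPLE.match(line)
        if m is None:
            continue  # comments, blank lines, or unsupported syntax
        s, p, o = m.groups()
        if s_want is not None and s != s_want:
            continue
        if p_want is not None and p != p_want:
            continue
        if o_want is not None and o != o_want:
            continue
        yield line
```

Because the function only iterates over lines, it can be fed an open file
object, so memory use stays constant regardless of the dump size.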



You can find further information here:
http://aksw.org/Projects/RDFSlice.html


The source code will be made publicly available within the next month.


On Thu, Jul 18, 2013 at 10:54 AM, Dan Gravell <d...@elstensoftware.com> wrote:

> Thanks Dimitris. I'm making fairly good progress right now by simply brute
> forcing a few scans over the .nt file and filtering out the lines of
> interest. On a consumer-grade SSD this takes about 7 minutes per scan, and
> as this is a batch job, neither interactive nor user-facing, this is
> acceptable. I hope to write up and maybe publish what I've done (my awk/sed
> skills fall short so I ended up scripting something in Scala).
>
> Dan
>
>
> On Thu, Jul 18, 2013 at 9:44 AM, Dimitris Kontokostas
> <jimk...@gmail.com> wrote:
>
>> Hi Dan,
>>
>> On Tue, Jul 16, 2013 at 11:26 AM, Dan Gravell <d...@elstensoftware.com> wrote:
>>
>>> Thanks Paul. The end goal of this data is import into AWS SimpleDB and
>>> CloudSearch (for the strings), as a matter of fact.
>>>
>>> What I was doing though was having all of my data sources (also:
>>> Discogs, MusicBrainz) export to a common-ish JSON structure which then gets
>>> uploaded to the above services.
>>>
>>> I was keen on ways of just working on the dbpedia tuples from the
>>> download. I'm still looking at the feasibility of this. One grep of the nt
>>> file on a consumer SSD gets through the file in just over two minutes,
>>> which bodes well. I will continue with this line of investigation.
>>>
>>> The other thing to investigate is writing custom formatters (I think
>>> they're called) for the extraction framework... not sure how 'pluggable'
>>> that is yet though.
>>>
>>
>> They're pretty pluggable already. There are 2 extra formatters for
>> DBpedia live [1] but both are used manually in the code.
>> You can adapt the PolicyParser [2] class to enable them in the
>> configuration file for the dump extraction.
>>
>> Best,
>> Dimitris
>>
>> [1]
>> https://github.com/dbpedia/extraction-framework/tree/master/live/src/main/scala/org/dbpedia/extraction/destinations/formatters
>> [2]
>> https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/PolicyParser.scala
>>
>>>
>>>
>>> On Mon, Jul 15, 2013 at 5:01 PM, Paul A. Houle <p...@ontology2.com> wrote:
>>>
>>>>   I can report my progress on this front.
>>>>
>>>> I’ve got a system in place that moves Freebase dumps,  recompresses
>>>> them and stores them in the AMZN cloud.  I can suck in DBpedia data the
>>>> same way.
>>>>
>>>> I’m hadoopifying my Infovore tools so I can do my preprocessing,
>>>> parallel super eyeball and be able to run basic reports.  The plan is to
>>>> keep most of the results in requester-pays S3 buckets,  which can be
>>>> accessed for free in the AWS,  particularly with Elastic MapReduce.
>>>>
>>>> The first release of the system will focus on rules that apply to
>>>> individual triples,  but it’s not a difficult extension of that to build
>>>> something that only copies records where the subjects are kings and
>>>> queens,  about sealing wax,  whatever.
>>>>
>>>> As a rough idea of costs and time involved,  it takes around two
>>>> hours,  $2 in transfer cost and about $1 in CPU to package the dump for
>>>> EMR.  It will cost more in EMR time to clean the data up and probably
>>>> compress it to speed up your queries.
>>>>
>>>> A somewhat tuned system could deliver you a custom subset of DBpedia in
>>>> an hour or two on a cluster that costs about as much to run as a minimum
>>>> wage employee.  You might then need to transfer the files out of AMZN but
>>>> TANSTAAFL.
>>>>
>>>>   *From:* Dan Gravell <d...@elstensoftware.com>
>>>> *Sent:* Monday, July 15, 2013 9:34 AM
>>>> *To:* dbpedia-discussion@lists.sourceforge.net
>>>> *Subject:* [Dbpedia-discussion] Strategies to download subsets of
>>>> DBPedia
>>>>
>>>>  What is the most efficient way (in CPU and network time) of extracting
>>>> subsets of DBPedia?
>>>>
>>>> I am only interested in <http://dbpedia.org/ontology/MusicalWork> and
>>>> the first level of relationships.
>>>>
>>>> First, I want to work on the data dumps provided either by DBPedia or
>>>> Wikipedia (via the extraction framework, maybe). I realise I could do what
>>>> I want via http://dbpedia.org/sparql but there are a number of
>>>> problems with this:
>>>>
>>>> - It adds load on dbpedia.org
>>>> - dbpedia.org often appears to have maintenance periods
>>>> - There are limits placed on the number of results from dbpedia.org
>>>>
>>>> However, the DBPedia dumps themselves have one big problem: they are so
>>>> massive it appears to take days to do anything with them. Loading them into
>>>> Apache Jena for instance takes ages. I also tried a little sed'ing and
>>>> awk'ing of the file but with little success.
>>>>
>>>> How is everyone else dealing with subsets of the data dumps? Is it
>>>> possible to configure the extraction framework to ignore input records, or
>>>> maybe output to something other than text n-tuples which would then be
>>>> easier to slice and dice (e.g. output to SQL, then perform a query?).
>>>>
>>>> Thanks,
>>>> Dan
>>>>
>>>> ------------------------------
>>>>
>>>> ------------------------------------------------------------------------------
>>>> See everything from the browser to the database with AppDynamics
>>>> Get end-to-end visibility with application monitoring from AppDynamics
>>>> Isolate bottlenecks and diagnose root cause in seconds.
>>>> Start your free trial of AppDynamics Pro today!
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>>>
>>>> ------------------------------
>>>> _______________________________________________
>>>> Dbpedia-discussion mailing list
>>>> Dbpedia-discussion@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>
>
>
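
For the use case in the quoted thread (everything typed as
<http://dbpedia.org/ontology/MusicalWork> plus the first level of
relationships), the "few scans over the .nt file" approach can be sketched
as two sequential passes: one to collect the subjects of the wanted type,
one to copy every triple about those subjects. This is only a sketch under
assumptions, not Dan's Scala script: it follows outgoing triples only,
assumes IRI subjects and one triple per line, and the function names and
file paths are hypothetical. If the type statements and the properties
live in separate dump files, each pass would simply read a different file.

```python
import re
from typing import Iterable, Iterator, Set

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MUSICAL_WORK = "http://dbpedia.org/ontology/MusicalWork"

# Simplified N-Triples line: <subject> <predicate> object .
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')


def subjects_of_type(lines: Iterable[str], type_iri: str) -> Set[str]:
    """Pass 1: collect every subject that has rdf:type <type_iri>."""
    wanted = f"<{type_iri}>"
    found: Set[str] = set()
    for line in lines:
        m = TRIPLE.match(line)
        if m and m.group(2) == RDF_TYPE and m.group(3) == wanted:
            found.add(m.group(1))
    return found


def first_level(lines: Iterable[str], subjects: Set[str]) -> Iterator[str]:
    """Pass 2: yield every triple whose subject was collected in pass 1."""
    for line in lines:
        m = TRIPLE.match(line)
        if m and m.group(1) in subjects:
            yield line


def slice_file(src_path: str, dst_path: str, type_iri: str = MUSICAL_WORK) -> None:
    """Run both passes over one dump file and write the subset to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        subjects = subjects_of_type(src, type_iri)
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(first_level(src, subjects))
```

Only the set of matching subject IRIs is held in memory; each pass is a
single sequential read of the file, which is the kind of scan the thread
reports as taking minutes on an SSD.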


-- 

Best Regards

--------------------------------------------------------

Saeedeh Shekarpour

PhD student

Department of Computer Science, University of Leipzig

Research Group: http://aksw.org



Whoever in this circle is not alive with love,
by my decree, say the funeral prayer over him before he has died.
