Hi Gaurav and welcome to the DBpedia community,

We also believe there is a lot of information in the Commons waiting to be
extracted :)

To improve your application, you should first get familiar with the
DBpedia extraction framework. A good start is to read the latest DBpedia
article, which describes the whole architecture in detail [1], and to
experiment a bit with the extraction options [2].

We already have a lot of extractors that can be re-used for category and
basic template extraction. Our focus here is to:
1) use the mappings wiki to align the information with the DBpedia
ontology [3] and assign types to Commons items,
2) integrate the mapping statistics for the Commons [4], and
3) publish license metadata for every Commons item.
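As a rough illustration of task (3), license metadata could start from something as simple as scanning a file page's wikitext for known license templates and emitting triples. This is only a sketch under my own assumptions: the template-to-URI table, the dcterms:license predicate, and the function name are illustrative and not part of the extraction framework.

```python
import re

# Hypothetical template-name -> license-URI table (an assumption for this
# sketch; real Commons license templates are far more numerous and varied).
LICENSE_TEMPLATES = {
    "cc-by-sa-3.0": "http://creativecommons.org/licenses/by-sa/3.0/",
    "cc-zero": "http://creativecommons.org/publicdomain/zero/1.0/",
}

# Matches the name of each {{Template}} or {{Template|...}} invocation.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|}]+?)\s*(?:\||\}\})")

def license_triples(page_uri, wikitext):
    """Return (subject, predicate, object) triples for recognised licenses."""
    triples = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip().lower()
        if name in LICENSE_TEMPLATES:
            triples.append((page_uri,
                            "http://purl.org/dc/terms/license",
                            LICENSE_TEMPLATES[name]))
    return triples

sample = "== Licensing ==\n{{Cc-by-sa-3.0}}\n{{Cc-zero}}"
print(license_triples("http://commons.dbpedia.org/resource/File:Example.jpg",
                      sample))
```

In practice this would plug into the framework's extractor interface rather than a regex pass, but it shows the shape of the output we would want to publish.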

<Just read Jimmy's answer>
I hadn't noticed the annotations before, and they look very interesting too.
Jimmy is right: the wiktionary extraction code is definitely the only
option for getting the annotations, but I cannot estimate the effort needed
to adapt that code for non-wiktionary projects. So let's add this as an
optional task (4) that the student working on this idea can experiment
with, depending on the time left.

Best,
Dimitris



[1] http://wiki.dbpedia.org/Publications
[2]
https://github.com/dbpedia/extraction-framework/wiki/Extraction-Instructions
[3] http://mappings.dbpedia.org/index.php/Main_Page
[4] http://mappings.dbpedia.org/server/statistics/



On Thu, Mar 6, 2014 at 12:24 PM, Gaurav Vaidya <gau...@ggvaidya.com> wrote:

> Hi everybody!
>
> I’m really interested in the project to extract data from the Wikimedia
> Commons (http://wiki.dbpedia.org/gsoc2014/ideas#h359-6). If I understand
> this project correctly, the goal is to look at all the different ways in
> which metadata can be represented in the Commons — as categories, as
> templates, as links to articles in the Commons, on the language Wikipedias,
> and as external links to museum catalogues, Flickr and other sources of
> metadata — and then build a set of RDF structured data for each Commons
> file and category. This sounds like an excellent idea to me, and I’d like
> to help in any way I can!
>
> I’m a graduate student at the University of Colorado Boulder in the USA.
> My research focuses on the different ways in which species names and their
> associated species definitions can be represented digitally, and how often
> taxonomic changes cause names to end up with multiple species definitions.
> While this isn’t directly relevant to this project, it does mean I have a
> general interest in how information is organised and what information can
> be extracted from sources which weren’t designed to have any information
> extracted at all, such as scientific articles in taxonomic journals
> published decades ago. As part of my work in grad school, I write code in
> Perl and Python for the Map of Life project [1], and have previously
> written scientific software in Java [2]. I’ve never coded in Scala before,
> but I’ve just spent a couple of days getting the DBpedia extraction
> framework to extract enwiki templates from the Commons database dump, and
> so far the ideas in the source code seem to make sense to me.
> I’ve never used DBpedia for anything more sophisticated than looking for
> what information may be automatically extractable from a Wikipedia page.
>
> More relevantly, I’ve been a Wikipedian since 2002 [3]; although I don’t
> do much editing, I do organise local events and am very interested in how
> Wikipedia can work closely with Galleries, Libraries, Archives and Museums
> (GLAMs). I coauthored a paper in 2012 on a way for scientists to use
> Wikisource to crowdsource data extraction from transcribed field notebooks
> [5]: we created templates that our volunteers could use to “tag” the
> information we were interested in (dates, locations and species names),
> which I then extracted with a Perl script via the MediaWiki API
> [6]. I also spent the summer of 2012 working on a project with the
> Biodiversity Heritage Library (BHL) which would allow the BHL to upload
> page scans containing illustrations to the Commons with a half-filled
> template [7], and later redownload semi-structured data into their own
> catalogues about those scans once Wikipedians had improved them. Of course,
> the Wikimedia Commons extraction project would give them a whole lot more
> information much more easily than the idea I worked on!
>
> Please let me know if you have any questions for me! I tried to look for
> previous work on extracting data from the Commons on the developers’
> mailing lists, and was unable to find anything — if there’s a thread in
> there that I missed, please let me know! Otherwise, I’ll keep poking around
> with the Extraction Framework and start coming up with a project plan for
> tackling this goal that I think might be doable in three months.
>
> Thanks for proposing such a useful and interesting project!
>
> cheers,
> Gaurav
> http://www.ggvaidya.com/cv.html
>
> [1] Map of Life: http://www.mol.org/
> [2] TaxonDNA: http://taxondna.sourceforge.net/
> [3] My user page on Wikipedia: http://en.wikipedia.org/wiki/User:Gaurav
> [5] Thomer et al. 2012:
> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3406479/
> [6] Henderson source code: https://github.com/gaurav/henderson
> [7] BHL Art of Life template:
> https://commons.wikimedia.org/wiki/Template:Information_Art_of_Life
>
> _______________________________________________
> Dbpedia-gsoc mailing list
> Dbpedia-gsoc@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>



-- 
Kontokostas Dimitris
