Re: [Dbpedia-gsoc] Wikimedia Commons extraction

Dimitris Kontokostas Mon, 10 Mar 2014 08:51:50 -0700

On Fri, Mar 7, 2014 at 12:01 PM, Gaurav Vaidya <gau...@ggvaidya.com> wrote:


> Hi Jimmy and Dimitris!
>
> Thanks for the quick replies!
>
> On 6 Mar, 2014, at 6:50 am, Dimitris Kontokostas <jimk...@gmail.com>
> wrote:
> > In order to improve your application you should first get familiar with
> the DBpedia extraction framework. A good start is to read the latest
> DBpedia article that describes the whole architecture in detail [1] and
> experiment a bit with extraction options [2].
> I’ll do that, thanks! I’ve gotten the enwiki extraction script running on
> the Commons wikipedia, but it’s running out of memory. I’ll try to run it
> on another computer and bump the Xmx up and see if that helps.
>
> > We already have a lot of extractors that can be re-used for category and
> basic template extraction. Our focus here is
> > 1) the mappings wiki, to get the information aligned with the DBpedia
> ontology [3] and assign types to commons items
> By types, do you mean the classes, such as:
>  - http://mappings.dbpedia.org/index.php/OntologyClass:Media for items
> with https://commons.wikimedia.org/wiki/Template:Information
>  - http://mappings.dbpedia.org/index.php/OntologyClass:Artwork for items
> with https://commons.wikimedia.org/wiki/Template:Artwork
> and so on?
>
> There are already lists of properties for these at:
>  - http://mappings.dbpedia.org/server/ontology/classes/Media
>  - http://mappings.dbpedia.org/server/ontology/classes/Artwork
>
>
Some information exists there, but it is not completely aligned with the
Commons metadata.
One task will be to enrich the ontology with additional (sub)classes and
properties, but this can be done at mapping time, when we will see
what is missing.
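
To make the typing question concrete, here is a minimal Python sketch of
assigning classes to a Commons page from the templates it uses. The
template-to-class pairs are only the ones suggested in the quoted question
(an assumption, not a confirmed mapping), and TEMPLATE_TO_CLASS and
ontology_classes are hypothetical names, not part of the extraction
framework.

```python
import re

# Assumed pairs taken from the quoted question; the real mappings would
# live on the mappings wiki and may differ.
TEMPLATE_TO_CLASS = {
    "information": "Media",
    "artwork": "Artwork",
}

# Captures the template name that follows "{{", up to the first "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def ontology_classes(wikitext):
    """Return the assumed ontology classes for the templates in wikitext."""
    found = []
    for match in TEMPLATE_NAME.finditer(wikitext):
        # Simplification: the whole name is lowercased, although MediaWiki
        # only treats the first letter as case-insensitive.
        name = match.group(1).strip().lower()
        cls = TEMPLATE_TO_CLASS.get(name)
        if cls is not None and cls not in found:
            found.append(cls)
    return found
```

Templates that are missing from the table are simply ignored, which is
where gaps in the ontology would show up during mapping.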


> > 2) integrate the mapping statistics for commons [4] and
> > 3) publish licence metadata for every commons item
> That sounds very useful!
>
> > <Just read Jimmy's answer>
> > I didn't notice annotations before and this looks very interesting too.
> Yes, I completely forgot about those! They’re added in between
> {{ImageNote}} and {{ImageNoteEnd}} templates, so they should be
> extractable. If not, they show up as a pretty distinctive set of ‘div’
> classes and values in the HTML of the File: page.
>
> > Jimmy is right, the Wiktionary extraction code is definitely the only
> option to get the annotations, but I cannot estimate the effort to adapt the
> Wiktionary code for non-Wiktionary projects. So let's add this too
> as an optional item (4) that the student who works on this idea can
> experiment with, depending on the time left.
> I’m not sure I understand why we’d need to use the Wiktionary extraction code
> instead of the main Extraction-Wrapper code: Wiktionary is divided between
> the different, separate language Wiktionaries, while the Commons is a
> single instance which keeps all its multilanguage stuff in templates. I’ve
> managed to get the main Extraction-Wrapper code running on the Commons dump
> by pretending that ‘commons’ is a different language of Wikipedia.
>
> The easiest way to implement the multilanguage templates might be a main
> class which identifies values inside other templates (say, the
> “description” parameter of
> http://commons.wikimedia.org/wiki/Template:Information, or the content
> between {{ImageNote}} and {{ImageNoteEnd}}), and then hands them on to
> another class which could identify {{en|}}, {{fr|}} and other templates
> inside the description and generate value/language pairs. The main class
> could then add the subjects and predicates to the values, creating separate
> Quads for each language. But I might be missing something important here!
>

The core DBpedia framework handles templates one at a time. DBpedia
Wiktionary has a powerful configuration that can handle multiple / nested
templates at once, which is what we need here.
See here for more details on DBpedia Wiktionary:
http://dbpedia.org/Wiktionary
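
As a rough illustration of the value/language-pair idea from the quoted
message, here is a minimal Python sketch. It is not the DBpedia Wiktionary
configuration and not framework code. Quad, language_values,
description_quads and the "ex:description" predicate are hypothetical names,
and the regular expression only handles flat templates such as {{en|...}}
with no further templates nested inside the value.

```python
import re
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    value: str
    language: str


# Matches flat language templates such as {{en|Some text}} or {{fr|1=Texte}}.
# Assumption: a language code is 2-3 lowercase letters with an optional
# hyphenated suffix. Any other short lowercase template name would be
# mistaken for a language, so real code would check a list of known codes.
LANG_TEMPLATE = re.compile(
    r"\{\{\s*([a-z]{2,3}(?:-[a-z]+)?)\s*\|(?:1=)?(.*?)\}\}", re.DOTALL
)


def language_values(wikitext):
    """Yield (language, value) pairs for each language template found."""
    for match in LANG_TEMPLATE.finditer(wikitext):
        yield match.group(1), match.group(2).strip()


def description_quads(subject, predicate, description):
    """Attach subject and predicate to every (language, value) pair."""
    return [
        Quad(subject, predicate, value, language)
        for language, value in language_values(description)
    ]
```

For example, calling description_quads with the description text
"{{en|A red house}} {{fr|Une maison rouge}}" gives one Quad per language.
Values that contain further templates are exactly the nested case where
this one-pass approach stops working.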

Best,
Dimitris


>
> cheers,
> Gaurav




-- 
Kontokostas Dimitris


_______________________________________________
Dbpedia-gsoc mailing list
Dbpedia-gsoc@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
