Good idea! It probably wouldn't be hard to write a specific extractor
for this. Maybe just a few dozen lines.

Only problem is, we may soon have dozens or hundreds of such
specialized extractors. But we can deal with that. :-)
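
To give a rough idea of the size, here is a minimal sketch of the core
logic, following the steps Julien lists below. It is not wired into the
extraction framework: the object and method names (PopulationSketch,
articleName, triples) are made up for this example, the mapping from
article name to resource URI is simplified to replacing spaces with
underscores, and I assume the last year is better typed as xsd:gYear
than as xsd:date, since a bare year is not a valid xsd:date.

```scala
object PopulationSketch {
  // Title pattern of the French data templates, as quoted in this thread.
  private val Title = """Modèle:Données/(.+)/évolution_population""".r
  // "|an=" and "|pop=" hold the last known year and population;
  // numbered keys such as "|an1=" or "|pop1=" do not match these patterns.
  private val LastYear = """\|\s*an\s*=\s*(\d+)""".r
  private val LastPop = """\|\s*pop\s*=\s*(\d+)""".r

  private val Resource = "http://fr.dbpedia.org/resource/"
  private val Property = "http://fr.dbpedia.org/property/"
  private val Xsd = "http://www.w3.org/2001/XMLSchema#"

  // Article name taken from the template title, if the title matches.
  def articleName(title: String): Option[String] = title match {
    case Title(name) => Some(name)
    case _ => None
  }

  // Two N-Triples lines, or nothing if title or values are missing.
  def triples(title: String, wikiText: String): Seq[String] = {
    val result = for {
      name <- articleName(title)
      year <- LastYear.findFirstMatchIn(wikiText).map(_.group(1))
      pop <- LastPop.findFirstMatchIn(wikiText).map(_.group(1))
    } yield {
      // Simplified URI building (assumption): spaces become underscores.
      val subject = "<" + Resource + name.replace(' ', '_') + ">"
      Seq(
        subject + " <" + Property + "population> \"" + pop + "\"^^<" + Xsd + "integer> .",
        subject + " <" + Property + "AnneePopulation> \"" + year + "\"^^<" + Xsd + "gYear> ."
      )
    }
    result.getOrElse(Seq.empty)
  }
}
```

A real extractor would get the title and wiki text from the page object
the framework passes in, and would build URIs and quads with the
framework's own helpers instead of string concatenation.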

If you want to write that extractor, we would be happy to include it
in the extraction framework. Here are some instructions on how you can
send a pull request on GitHub:

https://github.com/dbpedia/extraction-framework/wiki/Contributing

To keep things manageable, and since this extractor applies only to
the French Wikipedia edition, I would suggest you create a new
package org.dbpedia.extraction.mappings.fr in
extraction-framework/core/src/main/scala. Like many other extractors,
this one doesn't really belong in the 'core' module, but the
extraction framework is not yet very well modularized, so there's no
better place.

A minor addition: I guess we should change the syntax in the
extraction config files: currently, all extractor class names that *do
not contain a dot* are prefixed by "org.dbpedia.extraction.mappings.".
Example: "AbstractExtractor" becomes
"org.dbpedia.extraction.mappings.AbstractExtractor". If we change that
rule and prefix all extractor class names that *start with a dot* by
"org.dbpedia.extraction.mappings", then you could write
".fr.PopulationExtractor" in your extraction config file. With the
current rule, you would have to write the whole class name
"org.dbpedia.extraction.mappings.fr.PopulationExtractor". (Of course,
with the new rule, we would have to add a dot to all extractor class
names in all config files, but that's no big deal.)
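
For illustration only, the two rules could be sketched in Scala like
this. The names ExtractorNames, resolveCurrent and resolveProposed are
made up for this example; this is not the framework's actual
config-loading code.

```scala
object ExtractorNames {
  // Package prefix named in the paragraph above.
  val DefaultPackage = "org.dbpedia.extraction.mappings"

  // Current rule: a name that contains no dot gets the package prepended.
  def resolveCurrent(name: String): String =
    if (name.contains(".")) name else DefaultPackage + "." + name

  // Proposed rule: only a name that starts with a dot gets the package
  // prepended; anything else is taken as a full class name.
  def resolveProposed(name: String): String =
    if (name.startsWith(".")) DefaultPackage + name else name
}
```

With the proposed rule, ".fr.PopulationExtractor" resolves to
"org.dbpedia.extraction.mappings.fr.PopulationExtractor", and existing
entries would be written as ".AbstractExtractor".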

Cheers,
JC

On 21 April 2013 22:35, Julien Plu <julien....@redaction-developpez.com> wrote:
> I thought of the same implementation as you, Jona, but slightly
> different. Here are my steps:
>
> 1) Parse the XML file and retrieve all the data about these templates. For
> example, we see a "title" tag containing:
>
> Modèle:Données/Toulouse/évolution_population
>
> 2) Extract the last "an" and "pop" values
> 3) Write the triples to a file:
> <http://fr.dbpedia.org/resource/Toulouse>
> <http://fr.dbpedia.org/property/population> number pop^^xsd:integer .
> <http://fr.dbpedia.org/resource/Toulouse>
> <http://fr.dbpedia.org/property/AnneePopulation> year^^xsd:date .
>
> And so on, for all these templates. What do you think?
>
> I know it's not really generic, but it's a good starting point for
> thinking about a generic solution later.
>
> Best.
>
> Julien.
>
>
> 2013/4/21 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>
>> Good question. Short answer: No, DBpedia can't handle these templates,
>> and it's hard to change that.
>>
>> It would be nice to do it in a generic way: design a system that
>> allows users of the mappings wiki to add rules for how such templates
>> should be handled in a certain language. Write Scala code that executes
>> these rules and parses the template definitions (e.g.
>> Modèle:Données/Toulouse/évolution_population) to extract the data and
>> store it in memory or in a temporary file. Then during the main
>> extraction, when you find a template call like {{Dernière population
>> commune de France}}, get the data from storage and generate the
>> appropriate triples.
>>
>> A major effort. Related to
>> http://wiki.dbpedia.org/gsoc2013/ideas/CrowdsourceTestsAndRules , but
>> even bigger.
>>
>> Maybe it would be easier to extend DBpedia such that the framework can
>> "execute" template definitions.
>>
>> Maybe all that is a waste of time because the data will soon move to
>> Wikidata. We just don't know how soon: Three months? Three years?
>> Never?
>>
>> JC
>>
>> On 21 April 2013 22:04, Julien Plu <julien....@redaction-developpez.com>
>> wrote:
>> > Thanks Jona for these clarifications :-)
>> >
>> > One more thing: I would like to know whether the extraction
>> > framework can use the "data templates". I mean that some property
>> > values (in the French Wikipedia, for French settlements) are now
>> > replaced by templates, for example:
>> >
>> > population = {{Dernière population commune de France}} <!-- {{Last
>> > population french Settlement}} -->
>> >
>> > And this data is stored on pages whose addresses follow this pattern:
>> >
>> > http://fr.wikipedia.org/wiki/Modèle:Données/Nom de
>> > l'article/évolution_population
>> >
>> > In English:
>> >
>> > Template:Data/article name/evolution_population
>> >
>> > For example:
>> >
>> > http://fr.wikipedia.org/wiki/Modèle:Données/Toulouse/évolution_population
>> >
>> > It's always the same address pattern. And these templates look like
>> > this:
>> >
>> > <includeonly>{{#switch: {{{1|}}}
>> > |an1=1793|pop1=52612
>> > |anX=year|popX=number
>> > |an=last_year|pop=last_known_number}}</includeonly>
>> >
>> > These templates are in the XML dump.
>> >
>> > So has support for this been added to the extraction framework? If
>> > not, which files do I have to modify to handle this kind of
>> > exception?
>> >
>> > Best.
>> >
>> > Julien.
>> >
>> >
>> > 2013/4/21 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>> >>
>> >> On 21 April 2013 19:38, Julien Plu
>> >> <julien....@redaction-developpez.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > Any idea what I am doing wrong? (See my previous mail below.)
>> >> >
>> >> > Best.
>> >> >
>> >> > Julien.
>> >> >
>> >> > From: Julien Plu <julien....@redaction-developpez.com>
>> >> > Date: 2013/4/20
>> >> > Subject: Problem with extracted data
>> >> > To: "dbpedia-discussion@lists.sourceforge.net"
>> >> > <dbpedia-discussion@lists.sourceforge.net>
>> >> >
>> >> >
>> >> > Hi,
>> >> >
>> >> > After importing the extracted data into my Virtuoso server, I
>> >> > noticed some strange data. For example, all my URIs start with
>> >> > "http://dbpedia.org" and not with "http://fr.dbpedia.org", and I
>> >> > don't have the "prop-fr" properties either, even though I set
>> >> > "fr" throughout the extraction properties file.
>> >> >
>> >> > I also noticed that the data from http://fr.dbpedia.org and mine
>> >> > are not the same. For example, compare these two SPARQL
>> >> > results:
>> >> >
>> >> > mine :
>> >> >
>> >> >
>> >> > http://data.lirmm.fr:8890/sparql?default-graph-uri=&query=select+distinct+*+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FToulouse%3E+%3Fp+%3Fo%7D&should-sponge=&format=text%2Fhtml&timeout=0&debug=on
>> >> >
>> >> > fr.dbpedia.org :
>> >> >
>> >> >
>> >> > http://fr.dbpedia.org/sparql?default-graph-uri=&query=select+distinct+*+where+%7B%3Chttp%3A%2F%2Ffr.dbpedia.org%2Fresource%2FToulouse%3E+%3Fp+%3Fo%7D&format=text%2Fhtml&timeout=0&debug=on
>> >> >
>> >> > In mine, I don't have the "http://www.w3.org/2002/07/owl#sameAs" or
>> >>
>> >> Do you mean the triples like http://www.w3.org/2002/07/owl#sameAs
>> >> http://de.dbpedia.org/resource/Toulouse ? To get them, you would have
>> >> to download Wikipedia dumps for several other languages, run
>> >> InterLanguageLinksExtractor on them, and then run
>> >>
>> >>
>> >> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/scala/org/dbpedia/extraction/scripts/ProcessInterLanguageLinks.scala
>> >> on all the result files.
>> >>
>> >> Or you could use the links in
>> >>
>> >>
>> >> http://downloads.dbpedia.org/3.8/fr/interlanguage_links_same_as_chapters_fr.ttl.bz2
>> >> or a similar file.
>> >>
>> >> > "http://fr.dbpedia.org/property/population" properties among many
>> >> > others.
>> >> >
>> >> > My extraction properties file is attached.
>> >> >
>> >> > What did I do wrong?
>> >> >
>> >> > Best.
>> >> >
>> >> > Julien.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > ------------------------------------------------------------------------------
>> >> > Precog is a next-generation analytics platform capable of advanced
>> >> > analytics on semi-structured data. The platform includes APIs for
>> >> > building
>> >> > apps and a phenomenal toolset for data science. Developers can use
>> >> > our toolset for easy data analysis & visualization. Get a free
>> >> > account!
>> >> > http://www2.precog.com/precogplatform/slashdotnewsletter
>> >> > _______________________________________________
>> >> > Dbpedia-discussion mailing list
>> >> > Dbpedia-discussion@lists.sourceforge.net
>> >> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>> >> >
>> >
>> >
>
>
