On Sun, May 13, 2012 at 9:13 PM, Marco Amadori <marco.amad...@gmail.com> wrote:
> On Sunday 13 May 2012 21:04:05 Jona Christopher Sahnwaldt wrote:
>> On Thu, May 10, 2012 at 9:09 PM, Marco Amadori <marco.amad...@gmail.com> wrote:
>> > On Thursday 10 May 2012 21:06:30 Jona Christopher Sahnwaldt wrote:
>> >> I think what Marco meant was: the mapping says it's an object
>> >> property, so we should extract a URI, even if the property value is
>> >> just a string.
>> >
>> > Right.
>>
>> At first glance, that may look like a nice idea, but it would (very
>> likely) mean that DBpedia would extract many additional URIs that are
>> wrong and only a few additional URIs that are correct. Slightly better
>> recall, much worse precision. I should add that that's a (strong)
>> hunch based on my experience with DBpedia extractions and a few clicks
>> in Wikipedia. I don't have actual data to back this claim.
>
> In my requirements, this will happen if and only if the same happens in
> mediawiki code, or in other words the DBpedia heuristic is the same as
> Mediawiki's one.
>
> If the mediawiki template engine would produce a wikilink we should do it,
> otherwise we do not.

I agree. It's not really MediaWiki or the template engine in general
that does this, though, but specific templates. See below.

>
>> >> In the case of the musician infoboxes on it wiki, that would work, but
>> >> in many other cases, it wouldn't. For example:
>> >> http://en.wikipedia.org/wiki/Glenn_Danzig contains "label = Plan 9,
>> >> Evilive". Just that string, no links. We could use a heuristic to
>> >> split the string into multiple links etc, but I don't think there's a
>> >> good, clean solution. With a naive approach we would extract
>> >> <http://en.dbpedia.org/resource/Plan_9,_Evilive>, which would be
>> >> wrong.
>> >
>> > We should use the same heuristic used by the wikimedia template engine,
>> > that way it would match with the proper wikipedia page.
>>
>> The Wikimedia template engine in general does not use heuristics. The
>> specific template
>> http://it.wikipedia.org/wiki/Template:Artista_musicale also does not
>> use heuristics, but pretty simple rules: whatever value Wikipedia
>> users enter for one of the 'genere' properties is wrapped in '[[' and
>> ']]', and thus rendered as a link.
>
> There is no '[[', ']]' in the page, so who is adding them in 'genere'
> properties?

http://it.wikipedia.org/wiki/Template:Artista_musicale and/or
http://it.wikipedia.org/wiki/Template:Autocat_musica (and maybe
others, I didn't dig deep enough to be sure).

For example, this snippet from the source of Template:Artista_musicale:

{{Autocat musica|genere={{{genere|}}}|

passes the value of 'genere' on to Template:Autocat_musica, and
Template:Autocat_musica contains

[[{{{genere|}}}]]

which wraps the value of genere in '[[' and ']]'.

BUT if the template contains a property 'categorizza per genere = no',
then different code in Template:Artista_musicale applies:

{{#ifexist:{{{genere|}}}|[[{{{genere|}}}]]|{{{genere|}}}}}

This code handles the value of the 'genere' property. It first looks
for a page with that title. If such a page exists, it renders a link.
If such a page does not exist, it renders just the text.

The treatment of the other 'genere2' etc properties is a bit different:

{{#if:{{{genere2|}}}|<br />[[{{{genere2}}}]]{{{nota genere2|}}}|}}

If there is a value for genere2, it is rendered as a line break and a
link, appending the text of an optional note. If there is no value for
genere2, nothing is rendered.


So, to accurately mimic the behavior of these templates the DBpedia
extraction would have to do this:


if ($value of property 'genere' is not empty) then
  if (categorizza per genere = no) then
    if (page with title $value exists) then
      extract $value as URI
    else
      extract $value as string // [1]
    end if
  else
    // normalize $value, e.g. 'acidjazz' to 'Acid jazz'
    extract normalized $value as URI
  end if
end if

if ($value of property 'genere2' is not empty) then
  if (categorizza per genere = no) then
    extract $value as URI
  else
    // normalize $value, e.g. 'acidjazz' to 'Acid jazz'
    extract normalized $value as URI
  end if
end if

...the other genere properties are like genere2...

[1] (or rather, don't extract anything, because a string is not a
valid value for an object property)
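
For illustration, here is a minimal Python sketch of the rules above. It
is not DBpedia framework code: the function names, the 'page_exists' and
'normalize' callbacks, and the base URI constant are all hypothetical.
It follows footnote [1] and extracts nothing where the template would
render plain text.

```python
from typing import Callable, Mapping, Optional

# Assumption: URIs for the Italian chapter live under this prefix.
BASE_URI = "http://it.dbpedia.org/resource/"


def to_uri(title: str) -> str:
    """Build a resource URI from a page title (spaces become underscores)."""
    title = title.strip().replace(" ", "_")
    # Wikipedia page titles start with an upper-case letter.
    return BASE_URI + title[:1].upper() + title[1:]


def extract_genere(
    props: Mapping[str, str],
    key: str,
    page_exists: Callable[[str], bool],
    normalize: Callable[[str], str],
) -> Optional[str]:
    """Return a URI for the 'genere', 'genere2', ... property, or None.

    page_exists and normalize are hypothetical callbacks: the first
    stands in for #ifexist, the second for the Autocat_musica lookup.
    """
    value = props.get(key, "").strip()
    if not value:
        return None
    if props.get("categorizza per genere", "").strip().lower() == "no":
        # Only plain 'genere' is guarded by #ifexist; genere2 etc. always link.
        if key == "genere" and not page_exists(value):
            return None  # footnote [1]: no strings for object properties
        return to_uri(value)
    # Default path: the value goes through Autocat_musica's alias list.
    return to_uri(normalize(value))
```

With a 'normalize' callback that maps 'acidjazz' to 'Acid jazz', the
property 'genere=acidjazz' would yield the URI for Acid_jazz instead of
the useless Acidjazz.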

There are probably other cases that I have missed...

Wikipedia templates are a mess... :-)

>
> If anyone does it, this is 'my' heuristic.
>
>> >> It seems that Italian Wikipedia templates often work like this, while
>> >> English templates rarely do. To make that behavior possible, the
>> >> Italian templates use multiple properties like genere, genere2,
>> >> genere3 etc, while the English templates use one property which the
>> >> editors can fill with links or strings as they like.
>> >
>> > This again means to me that we should trust mappings.
>>
>> But only if the mapping explicitly states that this property should
>> always be extracted as a URI. The editor of the mapping should check
>> the source code of the Wikipedia template.
>
> Ok.
>
>> If the template ALWAYS
>> renders a property value as a link, then we should ALWAYS extract URIs
>> for the property values. If not, then we should ONLY extract a URI for
>> the property value if we can find a matching link somewhere on the
>> page (that's the heuristic we use now).
>
> In less words, it isn't true that we should do the same as wikipedia does?

We SHOULD do the same as Wikipedia. Or rather, we should try to get
close to it with reasonable effort. I shouldn't have written 'always'.
More precisely, balancing effort against extraction precision:

If the template MOST OF THE TIME renders a property value as a link,
then we should ALWAYS extract URIs for the property values.
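
As a rough Python sketch of how such a per-property flag could sit next
to the current link-matching heuristic: the function name, the
'always_uri' parameter, and the case-insensitive matching rule are my
assumptions here, not the actual framework API.

```python
from typing import Iterable, Optional


def to_uri(title: str, base: str) -> str:
    """Build a resource URI from a page title (spaces become underscores)."""
    title = title.strip().replace(" ", "_")
    # Wikipedia page titles start with an upper-case letter.
    return base + title[:1].upper() + title[1:]


def extract_object_value(
    value: str,
    page_links: Iterable[str],
    base: str,
    always_uri: bool = False,
) -> Optional[str]:
    """Return a URI for an object property value, or None.

    always_uri is the hypothetical mapping flag: set it only when the
    template renders this property as a link most of the time.
    """
    title = value.strip()
    if not title:
        return None
    if always_uri:
        return to_uri(title, base)
    # Fallback: extract only if the page links to a matching title.
    # Assumption: "matching" means equal ignoring case and outer spaces.
    wanted = title.lower()
    for target in page_links:
        if target.strip().lower() == wanted:
            return to_uri(target, base)
    return None
```

With the flag set for a property like the English 'label', the value
'Plan 9, Evilive' would become the wrong URI Plan_9,_Evilive. That is
why the flag should be set per mapping, after checking the template
source, and not globally.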

>
>> One more thing that may be relevant for this discussion:
>>
>> In the case of Template:Artista_musicale, things are even more
>> intricate. Template:Artista_musicale calls
>> http://it.wikipedia.org/wiki/Template:Autocat_musica for each 'genere'
>> property. Template:Autocat_musica contains a long list of musical
>> genres and slightly different names that may be used for them, for
>> example:
>>
>> |acidjazz
>> |acid-jazz
>> |acid jazz=[[Acid jazz]][[Categoria:{{{tipo}}} acid jazz]]
>>
>> This means that if a Wikipedia page contains "genere=acidjazz", the
>> rendered HTML will contain a link to "Acid jazz". DBpedia won't be so
>> smart. Even if we extend the framework with that special "always
>> extract this property as URI" flag, DBpedia would extract the URI
>> http://it.dbpedia.org/resource/Acidjazz for this property - which is
>> pretty useless, since there is no redirect from
>> http://it.wikipedia.org/wiki/Acidjazz to
>> http://it.wikipedia.org/wiki/Acid_jazz.
>
> This could be a nice wanted feature for DBpedia codebase.

Yes, it would be nice to have a feature that allows us to add such
mappings to the extraction configuration. But I think the specific
mappings should not be in the code, but on the mappings wiki. That of
course means that the mappings wiki needs a lot of new features,
probably new namespaces, etc.
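
To give an idea of what such a configuration could be built from, here
is a small Python sketch that turns #switch cases like the acid jazz
ones quoted above into an alias table. In a #switch, a case without '='
falls through to the next case that has a value. The function name and
the choice to keep only the first link target of each case are my
assumptions; real template code would need a proper wikitext parser.

```python
import re
from typing import Dict, List

# Captures the target of the first [[...]] link in a case value.
LINK = re.compile(r"\[\[([^\]|]+)")


def parse_switch_aliases(switch_body: str) -> Dict[str, str]:
    """Map each lower-cased #switch case name to the first link target."""
    aliases: Dict[str, str] = {}
    pending: List[str] = []  # case names still waiting for a value
    for raw in switch_body.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        key, sep, value = line[1:].partition("=")
        pending.append(key.strip().lower())
        if not sep:
            continue  # no '=': falls through to the next case
        match = LINK.search(value)
        if match:
            target = match.group(1).strip()
            for name in pending:
                aliases[name] = target
        pending = []
    return aliases
```

For the three acid jazz cases this gives one table entry per spelling,
all pointing to 'Acid jazz'. A table like that could live on the
mappings wiki instead of in the code.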


>
>> But that's a minor problem. I still think that adding that flag would
>> be a good thing.
>
> If it is the simplest thing to do, go for it.

I'm afraid I won't even have the time to implement this anytime soon.
I'm already way behind schedule, and we really need to get DBpedia 3.8
out.

Cheers,
JC

>
> --
> ESC:wq

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
