No, I'm not insane.

It's not a bug, it's a feature. Let me explain.

The ontology property 'genre' is an object property, so its values
must be URIs and its parser is looking for links to other Wikipedia
pages.

In this case, the values are rendered as links on the Wikipedia page,
but they are not entered as links in the wikitext. The infobox contains

genere = Heavy Metal

but not

genere = [[Heavy Metal]]

Even though it doesn't find a link, the object parser could simply
generate a URI from the string "Heavy Metal", and in the case of this
template property, it would be correct, but in many other cases, it
would be wrong.

As a compromise, we implemented a heuristic: If anywhere on the page
there is a link whose label or target (I'm not sure about the details)
matches the string, we generate a URI. If the page does not contain
such a link, we generate nothing.

Example: http://it.wikipedia.org/wiki/Glenn_Danzig contains a link to
[[Punk rock]], but no links for the other three genres. That's why a
triple is generated for "genere3 = Punk rock" but not for the other
template properties.
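
A minimal sketch of the heuristic described above, not the framework's
actual code. The names Link, toUri and resolve are hypothetical, and the
matching rule (label or target, ignoring only the case of the first
letter) is an assumption, since I'm not sure about the details:

```scala
object LinkHeuristicSketch {
  // Hypothetical stand-in for an internal link such as [[folk music|folk]].
  final case class Link(target: String, label: String)

  // Hypothetical helper: build a resource URI from a page title.
  // Assumes the first letter is upper-cased and spaces become underscores.
  def toUri(title: String): String =
    "http://it.dbpedia.org/resource/" + title.trim.capitalize.replace(' ', '_')

  // Generate a URI only if some link on the page has a label or target
  // matching the plain-text property value; otherwise generate nothing.
  def resolve(value: String, pageLinks: Seq[Link]): Option[String] = {
    val wanted = value.trim.capitalize
    pageLinks
      .find(l => l.label.trim.capitalize == wanted || l.target.trim.capitalize == wanted)
      .map(l => toUri(l.target))
  }
}
```

With the links Link("Punk rock", "Punk rock") and Link("Heavy metal",
"Heavy metal"), resolve("Punk rock", ...) yields a URI, while
resolve("Heavy Metal", ...) and resolve("Alternative Metal", ...) yield
None. With Link("folk music", "folk"), the value "Folk" resolves to the
Folk_music resource.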

Even worse: the page does contain a link to [[Heavy metal]], but in the
infobox the value is spelled "Heavy Metal" (with a capital M), so we
don't find that link. This behavior could be considered a bug, since
Wikipedia itself somehow resolves the difference in capitalization.
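
One possible way to tolerate such capitalization differences is to
compare normalized strings. This is only a sketch under my own
assumptions, not something the framework does: normalize and matches are
hypothetical names, and ignoring case entirely could produce wrong
matches, because wiki titles are case-sensitive after the first letter.

```scala
import java.util.Locale

object CaseTolerantMatchSketch {
  // Collapse runs of whitespace and underscores, then lower-case the string.
  def normalize(s: String): String =
    s.trim.replaceAll("[\\s_]+", " ").toLowerCase(Locale.ROOT)

  // True if the infobox value and a link label or target differ
  // only in case or spacing.
  def matches(value: String, linkText: String): Boolean =
    normalize(value) == normalize(linkText)
}
```

Here matches("Heavy Metal", "Heavy metal") is true, so the existing link
would be found, while matches("Heavy Metal", "Hardcore punk") stays
false.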

The page about Johnny Cash contains a link [[folk music|folk]], and
apparently that's why we extract
http://it.dbpedia.org/resource/Folk_music for "|genere2 = Folk".

As usual, when computers try to be smart, they tend to confuse us humans. :-)

Cheers,
JC

On Tue, May 8, 2012 at 6:53 PM, Jona Christopher Sahnwaldt
<j...@sahnwaldt.de> wrote:
> No, this is a different problem. The wikitext already contains four
> different properties:
>
> |genere = Heavy Metal
> |genere2 = Alternative Metal
> |genere3 = Punk rock
> |genere4 = Hardcore punk
>
> And http://mappings.dbpedia.org/index.php/Mapping_it:Artista_musicale
> contains mappings for all of them.
>
> The weird thing is that some of these numbered properties are
> extracted, others aren't. Here's an example where four out of six are
> extracted:
>
> http://mappings.dbpedia.org/server/extraction/it/extract?title=Johnny_Cash
>
> What's even more weird is that the Wikipedia article contains "Folk",
> but our framework extracts "Folk music".
>
> Excuse me, I have an urgent appointment with my psychiatrist.
>
> :-)
>
>
> On Tue, May 8, 2012 at 6:15 PM, Dimitris Kontokostas <jimk...@gmail.com> 
> wrote:
>>
>> Hi,
>>
>> I am pasting an old developers-list thread between me and Max (I could not
>> find a link to it through the archive search).
>> I think it is more or less about the same bug. I don't remember whether it
>> was fixed or not.
>>
>> Cheers,
>> Dimitris
>>
>> -----------------------------------------------
>> Thanks for pointing to this thread. The case reported by Roberto was
>> actually a bug in NodeUtil.splitPropertyNode.
>> It always produced one too many elements while the last one did not
>> have any children.
>>
>> For the ObjectParser we still need to fix the regex though, because a
>> whitespace TextNode is still a node.
>>
>> Cheers,
>> Max
>>
>>
>> On Tue, Jul 5, 2011 at 02:32, Dimitris Kontokostas <jimk...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I remembered an old thread from Roberto Mirizzi [1] that also involves
>> > whitespace between links.
>> > Fixing the regex would cover both cases, so I agree...
>> >
>> > Cheers
>> > Dimitris
>> >
>> > [1] http://sourceforge.net/mailarchive/message.php?msg_id=26916982
>> >
>> > On Mon, Jul 4, 2011 at 3:11 PM, Max Jakob <max.ja...@fu-berlin.de>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Wed, Apr 6, 2011 at 22:12, Dimitris Kontokostas <jimk...@gmail.com>
>> >> wrote:
>> >> > I am working on the FlagTemplateParser and I noticed something
>> >> > strange in the Mapping Extraction results.
>> >> > When there are multiple values separated by the DataParser split
>> >> > regex ("""<br\s*\/?>|\n| and | or | in |/|;|,"""),
>> >> > the values must not have a trailing or leading space; otherwise they
>> >> > are not parsed.
>> >> >
>> >> > for example
>> >> > from the following string (from an Infobox Property)
>> >> > {{GRE}}<br>{{CYP}} <br> {{ALB}}<br >{{MKD}}<br
>> >> > />{{SRB}}<br>{{UKR}}<br>
>> >> > {{RUS}},{{TUR}},{{EGY}} , {{USA}}<br>{{CAN}}
>> >> >
>> >> > extracts values only for {{GRE}} {{MKD}} {{SRB}} {{UKR}} {{TUR}}
>> >> > {{CAN}}.
>> >> > I haven't tested it, but it could affect other parsers as well.
>> >>
>> >> Line 36 in ObjectParser.scala is responsible, more specifically the if
>> >> clause:
>> >>
>> >> case templateNode : TemplateNode if(node.children.length == 1) =>
>> >>    resolveTemplate(templateNode) match ...
>> >>
>> >> The rationale behind it is that templates should only be extracted if
>> >> there is no other data around them. Only then can we be certain that the
>> >> property links to the country information.
>> >> As a counter-example, in the case of the flag template, there might be
>> >> a person name behind the flag icon. The property is then most probably
>> >> about the person and not about the person's nationality. I found this
>> >> on the page of the American Revolutionary War [1]:
>> >>
>> >> {{Infobox military conflict
>> >> ...
>> >> |commander1={{flagicon|United States|1777}} [[George Washington]]
>> >> ...
>> >> }}
>> >>
>> >> Clearly, the extraction of the following triple should be avoided:
>> >> res:American_Revolutionary_War ont:hasCommander res:United_States
>> >>
>> >> That is why the ObjectParser first checks if there are other nodes
>> >> under the property node. In the commander example, there is another
>> >> InternalLinkNode(George_Washington) under the
>> >> PropertyNode(commander1). In your example, there are TextNodes with
>> >> "left-over whitespaces" after splitting.
>> >> So I think that maybe adjusting the Regex is actually the way to go.
>> >> What do you think?
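
The whitespace left-overs described above can be reproduced on plain
strings. This is a simplified sketch, not the framework's node-based
code: it applies the quoted split regex directly to the example value.
The exact whitespace in the example is an assumption (the line breaks of
the mail are read as single spaces), and trimming is shown only as one
possible alternative to adjusting the regex.

```scala
object SplitWhitespaceSketch {
  // The DataParser split regex as quoted in the thread.
  val splitRegex = """<br\s*\/?>|\n| and | or | in |/|;|,"""

  // Example value from the thread; mail line breaks assumed to be spaces.
  val input: String =
    "{{GRE}}<br>{{CYP}} <br> {{ALB}}<br >{{MKD}}<br />{{SRB}}<br>{{UKR}}<br> " +
      "{{RUS}},{{TUR}},{{EGY}} , {{USA}}<br>{{CAN}}"

  // Raw pieces after splitting; some keep leading or trailing spaces.
  val parts: List[String] = input.split(splitRegex).toList

  // Pieces without surrounding whitespace, i.e. the ones that get parsed.
  val clean: List[String] = parts.filter(p => p == p.trim)

  // Trimming every piece recovers all values.
  val trimmed: List[String] = parts.map(_.trim).filter(_.nonEmpty)
}
```

Under these assumptions, clean contains exactly {{GRE}}, {{MKD}},
{{SRB}}, {{UKR}}, {{TUR}} and {{CAN}}, the same six values reported as
extracted, while trimmed contains all eleven templates.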
>> >>
>> >> Sidenote: the triple for Greece is not extracted when running with
>> >> English because it does not use an ISO code [2] but one of the other
>> >> alias names for flag templates (IOC or FIFA) [3]. Maybe we need to
>> >> extend the map for English in FlagTemplateParserConfig...
>> >>
>> >> Cheers,
>> >> Max
>> >>
>> >> [1] http://en.wikipedia.org/w/index.php?title=American_Revolutionary_War&oldid=437678888
>> >> [2] http://download.oracle.com/javase/1.4.2/docs/api/java/util/Locale.html#getISO3Country%28%29
>> >> [3] http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Flag_Template#Alias_names
>> >
>> >
>> >
>> > --
>> > Kontokostas Dimitris
>> >
>>
>>
>> On Tue, May 8, 2012 at 6:56 PM, Jona Christopher Sahnwaldt
>> <j...@sahnwaldt.de> wrote:
>>>
>>> Hi Marco,
>>>
>>> yes, this is a bug. I don't know what's going on.
>>>
>>> http://en.wikipedia.org/wiki/Glenn_Danzig contains:
>>>
>>> genre = [[Heavy metal music|Heavy metal]], [[blues rock]],
>>> [[horror punk]], [[deathrock]], [[Classical music|classical]]
>>>
>>> All of them are extracted.
>>>
>>> http://it.wikipedia.org/wiki/Glenn_Danzig contains:
>>>
>>> |genere = Heavy Metal
>>> |genere2 = Alternative Metal
>>> |genere3 = Punk rock
>>> |genere4 = Hardcore punk
>>>
>>> But only Punk rock is extracted:
>>>
>>>
>>> http://mappings.dbpedia.org/server/extraction/it/extract?title=Glenn+Danzig
>>>
>>> > Same thing for 'dbprop-it:nome', which maps to 'foaf:name'.
>>> Works for me - the sample extraction page contains
>>>
>>> http://it.dbpedia.org/resource/Glenn_Danzig__lenn__1
>>> http://xmlns.com/foaf/0.1/name  Glenn
>>> http://it.dbpedia.org/resource/Glenn_Danzig__lenn__1
>>> http://xmlns.com/foaf/0.1/surname       Danzig
>>> http://it.dbpedia.org/resource/Glenn_Danzig__lenn__1
>>> http://xmlns.com/foaf/0.1/surname       all'anagrafe Glenn Allen Anzalone
>>>
>>> But that 'genere' thing is strange. Maybe template properties that end
>>> with numbers are not mapped correctly? I looked through the code but
>>> didn't find an obvious problem. We'll have to start a debugger, I
>>> guess.
>>>
>>> Cheers,
>>> JC
>>>
>>>
>>> On Mon, May 7, 2012 at 12:58 PM, Marco Fossati <hell.j....@gmail.com>
>>> wrote:
>>> > Hi Jona,
>>> >
>>> > We have just generated fresh dumps for the Italian DBpedia with the
>>> > latest extractors code version and found that some data is lost in the
>>> > mapping-based dataset.
>>> > If you have a look at this example [1], 'dbprop-it:genere' property has
>>> > 4 objects, while 'dbpedia-owl:genre' only has 1.
>>> > Same thing for 'dbprop-it:nome', which maps to 'foaf:name'.
>>> > I checked the same resource in the English version [2] (the property is
>>> > dbprop:genre) and the data is there.
>>> > Is it a mapping extractor bug?
>>> > Cheers,
>>> >
>>> > Marco
>>> >
>>> > [1] http://it.dbpedia.org/page/Glenn_Danzig
>>> > [2] http://dbpedia.org/page/Glenn_Danzig
>>> >
>>> >
>>> > ------------------------------------------------------------------------------
>>> > Live Security Virtual Conference
>>> > Exclusive live event will cover all the ways today's security and
>>> > threat landscape has changed and how IT managers can respond.
>>> > Discussions
>>> > will include endpoint security, mobile security and the latest in
>>> > malware
>>> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>> > _______________________________________________
>>> > Dbpedia-discussion mailing list
>>> > Dbpedia-discussion@lists.sourceforge.net
>>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>
>>>
>>
>>
>>
>>
>> --
>> Kontokostas Dimitris
