I wouldn't claim that Freebase is bug-free, but that's quite an old and
simple algorithm, so unless they're triples from very early in its life
(say, 2007), I'd guess that bad input data from Wikipedia is more likely
than a problem with the transformation.
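
As a rough way to check for this kind of bad input, here is a minimal
Python sketch that flags titles which look like UTF-8 bytes stored as
individual code points.  The function names are made up for illustration,
and it assumes the key has already been MQL-unescaped.  It will not catch
the Windows-1252 cases.

```python
def looks_double_encoded(title: str) -> bool:
    """Heuristic: True if every character fits in one byte and those
    bytes form valid, non-ASCII UTF-8 (i.e. the title was probably
    UTF-8 bytes that were stored as separate code points)."""
    if title.isascii():
        return False
    try:
        raw = title.encode("latin-1")  # fails for any code point > U+00FF
        raw.decode("utf-8")            # fails if the bytes are not UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return True


def repair_double_encoding(title: str) -> str:
    """Reinterpret the code points as bytes and decode them as UTF-8."""
    return title.encode("latin-1").decode("utf-8")


bad = "Marl\u00c3\u00adn,_\u00c3\u0081vila"
if looks_double_encoded(bad):
    print(repair_double_encoding(bad))
```

A correctly stored title such as Marlín,_Ávila is not flagged, because its
Latin-1 bytes do not form valid UTF-8.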

It might help to give a little background on how Freebase deals with these
links.  The canonical link uses the article number (in the namespace
/wikipedia/en_id), but the alpha title (MQL key escaped) *and all
redirects* are also stored (namespace /wikipedia/en).  Additionally, the
same information has recently been added for a number of the other-language
Wikipedias.
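
For reference, unescaping a key from the /wikipedia/en namespace can be
sketched in a few lines of Python.  This assumes the escape form seen in
this thread ("$" followed by four uppercase hex digits giving a Unicode
code point); the function names are hypothetical, and this is not the
Scala script DBpedia uses.

```python
import re
import unicodedata

# "$" followed by exactly four uppercase hex digits, e.g. "$002C" for ","
_MQL_ESCAPE = re.compile(r"\$([0-9A-F]{4})")


def unescape_mql_key(key: str) -> str:
    """Replace each $XXXX escape with the code point it names."""
    return _MQL_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def suspicious_chars(title: str) -> list:
    """Return control characters (Cc) and separators (Zs, Zl, Zp),
    which should not normally appear in a Wikipedia page title key."""
    return [c for c in title
            if unicodedata.category(c) in ("Cc", "Zs", "Zl", "Zp")]


title = unescape_mql_key("Marl$00C3$00ADn$002C_$00C3$0081vila")
print(ascii(title))
print([hex(ord(c)) for c in suspicious_chars(title)])
```

Running the same check on Bene$009A_decrees and Switzerland$2003 flags
U+009A and U+2003 respectively.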

You can see them all here for the example that Andrea mentioned:

  https://www.freebase.com/m/09q3rp?keys

Outbound links from Freebase to Wikipedia are made using the article
number, so that's really the most important link.  The wisdom of including
redirects is debatable, I think.  Sometimes they're good alternate names,
but other times they represent misspellings, related concepts, etc.

If DBpedia has the Wikipedia article number, I'd suggest creating the links
based on those.  If not, I'd suggest using the redirect file to
canonicalize on a single "best" link.
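
The redirect approach could look roughly like the Python sketch below.  It
assumes the redirect file has already been loaded into a dict mapping each
redirect title to its target; the function name, the max_hops limit and
the sample data are all made up for illustration.

```python
def canonical_title(title: str, redirects: dict, max_hops: int = 10) -> str:
    """Follow redirects until reaching a title that is not itself a
    redirect.  Stops on a cycle or after max_hops steps."""
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


# Hypothetical sample data: two alternate titles pointing at one article.
redirects = {
    "Old_name": "Intermediate_name",
    "Intermediate_name": "Current_name",
}

keys = ["Old_name", "Intermediate_name", "Current_name"]
best = {canonical_title(k, redirects) for k in keys}
print(best)  # a single canonical title remains
```

Collapsing all of a topic's keys this way leaves one link per article, at
the cost of dropping the redirects that were useful alternate names.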

Tom


On Mon, Mar 25, 2013 at 6:41 AM, Andrea Di Menna <ninn...@gmail.com> wrote:

> Hi all,
>
> it looks like there are actually some pages in Wikipedia which contain
> wrong data, which is where the pages originate from in Freebase, e.g.
>
> http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
>
> This page was deleted on Jan 21, and this actually led to the
> Freebase key
>
> Marl$00C3$00ADn$002C_$00C3$0081vila
>
> since UTF-8 0xC3 0x83 -> Unicode U+00C3, etc.
>
> Cheers
> Andrea
>
>
> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>
>> Hi,
>>
>> Maybe the only thing that can be done is to notify the freebase
>> discussion list about this problem.
>> Agree with Jona that the number of problematic references is not relevant.
>>
>> Cheers
>> Andrea
>>
>>
>> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>
>>>
>>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>>> >
>>> > Can someone point to the part of the discussion which talks about what
>>> > the problem is?  This thread seems to start in mid-stream...
>>>
>>> That's right. Sorry. The start of the thread is in the middle of this
>>> page:
>>>
>>> https://github.com/dbpedia/extraction-framework/pull/25
>>>
>>> >
>>> > Freebase's MQL key encoding
>>> > (http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely private
>>> > encoding which shouldn't have any effect on external
>>> > URIs/IRIs/references/etc.
>>>
>>> That's correct, and that's how the Scala script has always worked: it
>>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>>> Windows-1252 bytes instead of Unicode code points), and some others contain
>>> whitespace like U+2003 that is invalid even in IRIs.
>>>
>>> I would guess though that it's not a big problem because the affected
>>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they do
>>> not represent valid, current, non-redirect Wikipedia page titles. That's
>>> just a guess though, based on only a very cursory look at a few bad keys.
>>>
>>> I don't remember if these problems also came up when I ran the script on
>>> the old freebase dump format.
>>>
>>> JC
>>>
>>> >
>>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt <j...@sahnwaldt.de> wrote:
>>> >>
>>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>> >> >
>>> >> > Hi Jona,
>>> >> >
>>> >> > thanks for merging the pull request!
>>> >> >
>>> >> > Anyway, couldn't we use percent encoding for Unicode code points which
>>> >> > are not allowed in N-Triples (namely those outside the [#x20,#x7E]
>>> >> > range)? In this case we should get UTF-8 bytes and percent encode them.
>>> >> >
>>> >> > For example, as far as I can see
>>> >> >
>>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>>> >> >
>>> >> > is
>>> >> >
>>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>> >> >
>>> >> > where \00C3 is 0xC3 0x83
>>> >> >          \00AD is 0xC2 0xAD
>>> >> >          \0081 is 0xC2 0x81
>>> >>
>>> >> Oh, by the way, it would be
>>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's the
>>> >> UTF-8-percent-encoding for Marlín,_Ávila.
>>> >>
>>> >> The weird thing is that these Wikipedia page titles in Freebase
>>> >> contain UTF-8-encoded characters when they should contain no encoding
>>> >> at all, just plain Unicode code points. (Of course, the characters and
>>> >> codepoints are also dollar-escaped as usual for Freebase, but that's
>>> >> not a problem.)
>>> >>
>>> >>
>>> >> JC
>>> >>
>>> >> >
>>> >> > WDYT?
>>> >> >
>>> >> > Cheers
>>> >> > Andrea
>>> >> >
>>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>> >> >>
>>> >> >> Ok, I got it. It has nothing to do with your platform. These are
>>> >> >> actually wrong URIs. There's not much we can do about it. I don't know
>>> >> >> where Freebase got them from, but I assume they may actually be wrong
>>> >> >> in Wikipedia.
>>> >> >>
>>> >> >> Examples:
>>> >> >>
>>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>>> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that the
>>> >> >> numbers should be plain Unicode code points, not UTF-8 bytes. 81 is an
>>> >> >> invalid code point, so we generate an invalid URI.
>>> >> >>
>>> >> >> Bene$009A_decrees
>>> >> >> 9A is the Windows-1252 encoding for š, but 9A is invalid in Unicode.
>>> >> >>
>>> >> >> Switzerland$2003
>>> >> >> 2003, 2029 etc. are valid Unicode code points, but for whitespace
>>> >> >> characters that are invalid in URIs
>>> >> >>
>>> >> >> In a nutshell: all these characters are invalid in URIs, and it's not
>>> >> >> our fault. I'll pull your changes in a moment.
>>> >> >>
>>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> ------------------------------------------------------------------------------
>>> >> Everyone hates slow websites. So do we.
>>> >> Make your web apps faster with AppDynamics
>>> >> Download AppDynamics Lite for free today:
>>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>>> >> _______________________________________________
>>> >> Dbpedia-discussion mailing list
>>> >> Dbpedia-discussion@lists.sourceforge.net
>>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>> >
>>> >
>>>
>>>
>>>
>>>
>>
>