Another approach might be to use the recently introduced Topic Equivalent
Webpage property:

ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage        <http://pt.wikipedia.org/wiki/Marlín>.
ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage        <http://es.wikipedia.org/wiki/Marlín_(Ávila)>.
ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage        <http://en.wikipedia.org/wiki/Marlín>.
ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage        <http://it.wikipedia.org/wiki/Marlín>.
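
For anyone who wants to pull these links out of the dump, here is a minimal
Python sketch. It assumes each triple sits on a single line in the layout
shown above and that the language code is the first label of the Wikipedia
host name. The names TRIPLE, WIKI_URL and wikipedia_links are mine for
illustration, not part of any Freebase or DBpedia tool.

```python
import re
from collections import defaultdict

# Assumed line layout (taken from the sample triples above):
#   ns:<mid>  ns:common.topic.topic_equivalent_webpage  <URL>.
TRIPLE = re.compile(
    r"^ns:(?P<mid>\S+)\s+"
    r"ns:common\.topic\.topic_equivalent_webpage\s+"
    r"<\s*(?P<url>[^>\s]+)\s*>\s*\.\s*$"
)

# Assumption: the language is the host prefix, e.g. "es" in es.wikipedia.org.
WIKI_URL = re.compile(
    r"^https?://(?P<lang>[a-z\-]+)\.wikipedia\.org/wiki/(?P<title>.+)$"
)


def wikipedia_links(lines):
    """Return {mid: {lang: title}} for Wikipedia equivalent-webpage triples."""
    links = defaultdict(dict)
    for line in lines:
        triple = TRIPLE.match(line.strip())
        if not triple:
            continue  # some other predicate or a malformed line
        url = WIKI_URL.match(triple.group("url"))
        if not url:
            continue  # equivalent webpage that is not on Wikipedia
        # If a language appears twice for one topic, the last line wins.
        links[triple.group("mid")][url.group("lang")] = url.group("title")
    return dict(links)


sample = [
    "ns:m.09q3rp\tns:common.topic.topic_equivalent_webpage\t"
    "<http://es.wikipedia.org/wiki/Marlín_(Ávila)>.",
    "ns:m.09q3rp\tns:common.topic.topic_equivalent_webpage\t"
    "<http://en.wikipedia.org/wiki/Marlín>.",
]
result = wikipedia_links(sample)
```

Here result maps m.09q3rp to one title per language (es and en in the
sample), which is the shape needed to build per-language links.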

The property appears to hold a single canonical alpha (title-based) link for
each language Wikipedia, with the MQL escaping undone and the redirects
resolved.
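
Undoing the MQL escaping on the /wikipedia/en keys is mechanical. Below is a
rough Python sketch, assuming the $XXXX form (four hex digits per code point)
described on the MQL key escaping page. The function names are hypothetical,
and the repair step is only a heuristic for the UTF-8-bytes-as-code-points
keys discussed further down the thread; it is not something Freebase or the
DBpedia script does.

```python
import re

# One escape is "$" followed by exactly four hex digits (assumed format).
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def unescape_mql_key(key):
    """Replace each $XXXX escape with the Unicode code point U+XXXX."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def repair_utf8_as_code_points(text):
    """Heuristic: if every character fits in one byte and those bytes are
    valid UTF-8, decode them; otherwise return the text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


bad_key = "Marl$00C3$00ADn$002C_$00C3$0081vila"
raw_title = unescape_mql_key(bad_key)            # still mojibake
fixed_title = repair_utf8_as_code_points(raw_title)
```

With the key from Andrea's example, fixed_title is Marlín,_Ávila. A title
that was already correct, such as Marlín, comes back unchanged because its
Latin-1 bytes are not valid UTF-8. The heuristic can still misfire on a title
that legitimately contains such a byte pattern, so treat it as a diagnostic
rather than a fix.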

Tom

On Mon, Mar 25, 2013 at 9:18 AM, Tom Morris <tfmor...@gmail.com> wrote:

> I wouldn't claim that Freebase is bug-free, but that's quite an old and
> simple algorithm, so unless they're triples from very early in its life
> (say, 2007), I'd guess that bad input data from Wikipedia is more likely
> than a problem with the transformation.
>
> It might help to give a little background on how Freebase deals with these
> links.  The canonical link uses the article number (in the namespace
> /wikipedia/en_id), but the alpha title (MQL key escaped) *and all
> redirects* are also stored (namespace /wikipedia/en).  Additionally, the
> same information has recently been added for a number of the other
> language Wikipedias.
>
> You can see them all here for the example that Andrea mentioned:
>
>   https://www.freebase.com/m/09q3rp?keys
>
> Outbound links from Freebase to Wikipedia are made using the article
> number, so that's really the most important link.  The wisdom of including
> redirects is debatable, I think.  Sometimes they're good alternate names,
> but other times they represent misspellings, related concepts, etc.
>
> If DBpedia has the Wikipedia article number, I'd suggest creating the
> links based on those.  If not, I'd suggest using the redirect file to
> canonicalize on a single "best" link.
>
> Tom
>
>
>
> On Mon, Mar 25, 2013 at 6:41 AM, Andrea Di Menna <ninn...@gmail.com> wrote:
>
>> Hi all,
>>
>> it looks like some Wikipedia pages actually have wrongly encoded titles,
>> and those pages are where the bad Freebase keys originate, e.g.
>>
>> http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
>>
>> This page was deleted on Jan 21, and it is what actually led to the
>> Freebase key
>>
>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>
>> since UTF-8 0xC3 0x83 -> Unicode U+00C3, etc.
>>
>> Cheers
>> Andrea
>>
>>
>> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>>
>>> Hi,
>>>
>>> Maybe the only thing that can be done is to notify the Freebase
>>> discussion list about this problem.
>>> I agree with Jona that the number of problematic references is not
>>> relevant.
>>>
>>> Cheers
>>> Andrea
>>>
>>>
>>> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>
>>>>
>>>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>>>> >
>>>> > Can someone point to the part of the discussion which talks about
>>>> > what the problem is?  This thread seems to start in mid-stream...
>>>>
>>>> That's right. Sorry. The start of the thread is in the middle of this
>>>> page:
>>>>
>>>> https://github.com/dbpedia/extraction-framework/pull/25
>>>>
>>>> >
>>>> > Freebase's MQL key encoding (
>>>> > http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely
>>>> > private encoding which shouldn't have any effect on external
>>>> > URIs/IRIs/references/etc.
>>>>
>>>> That's correct, and that's how the Scala script has always worked: it
>>>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>>>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>>>> Windows-1252 bytes instead of Unicode code points), and some others contain
>>>> whitespace like U+2003 that is invalid even in IRIs.
>>>>
>>>> I would guess though that it's not a big problem because the affected
>>>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they do
>>>> not represent valid, current, non-redirect Wikipedia page titles. That's
>>>> just a guess though, based on only a very cursory look at a few bad keys.
>>>>
>>>> I don't remember if these problems also came up when I ran the script
>>>> on the old Freebase dump format.
>>>>
>>>> JC
>>>>
>>>> >
>>>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt <
>>>> > j...@sahnwaldt.de> wrote:
>>>> >>
>>>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>> >> >
>>>> >> > Hi Jona,
>>>> >> >
>>>> >> > thanks for merging the pull request!
>>>> >> >
>>>> >> > Anyway, couldn't we use percent encoding for Unicode code points
>>>> >> > which are not allowed in N-Triples (namely those outside the
>>>> >> > [#x20,#x7E] range)?
>>>> >> > In this case we should get UTF-8 bytes and percent encode them.
>>>> >> >
>>>> >> > For example, as far as I can see
>>>> >> >
>>>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>>>> >> >
>>>> >> > is
>>>> >> >
>>>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>>> >> >
>>>> >> > where \00C3 is 0xC3 0x83
>>>> >> >          \00AD is 0xC2 0xAD
>>>> >> >          \0081 is 0xC2 0x81
>>>> >>
>>>> >> Oh, by the way, it would be
>>>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's
>>>> >> the UTF-8 percent-encoding for Marlín,_Ávila.
>>>> >>
>>>> >> The weird thing is that these Wikipedia page titles in Freebase
>>>> >> contain UTF-8-encoded characters when they should contain no encoding
>>>> >> at all, just plain Unicode code points. (Of course, the characters
>>>> >> and code points are also dollar-escaped as usual for Freebase, but
>>>> >> that's not a problem.)
>>>> >>
>>>> >>
>>>> >> JC
>>>> >>
>>>> >> >
>>>> >> > WDYT?
>>>> >> >
>>>> >> > Cheers
>>>> >> > Andrea
>>>> >> >
>>>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>>> >> >>
>>>> >> >> Ok, I got it. It has nothing to do with your platform. These are
>>>> >> >> actually wrong URIs. There's not much we can do about it. I don't
>>>> >> >> know where Freebase got them from, but I assume they may actually
>>>> >> >> be wrong in Wikipedia.
>>>> >> >>
>>>> >> >> Examples:
>>>> >> >>
>>>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that
>>>> >> >> the numbers should be plain Unicode code points, not UTF-8 bytes.
>>>> >> >> 81 is an invalid code point, so we generate an invalid URI.
>>>> >> >>
>>>> >> >> Bene$009A_decrees
>>>> >> >> 9A is the Windows-1252 encoding for š, but 9A is invalid in Unicode.
>>>> >> >>
>>>> >> >> Switzerland$2003
>>>> >> >> 2003, 2029 etc. are valid Unicode code points, but for whitespace
>>>> >> >> characters that are invalid in URIs
>>>> >> >>
>>>> >> >> In a nutshell: all these characters are invalid in URIs, and it's
>>>> >> >> not our fault. I'll pull your changes in a moment.
>>>> >> >>
>>>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >>
>>>> >>
>>>> ------------------------------------------------------------------------------
>>>> >> Everyone hates slow websites. So do we.
>>>> >> Make your web apps faster with AppDynamics
>>>> >> Download AppDynamics Lite for free today:
>>>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>>>> >> _______________________________________________
>>>> >> Dbpedia-discussion mailing list
>>>> >> Dbpedia-discussion@lists.sourceforge.net
>>>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
