On 25 March 2013 15:00, Tom Morris <tfmor...@gmail.com> wrote:
> Another approach might be to use the recently introduced Topic Equivalent
> Webpage property:
>
> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
> <http://pt.wikipedia.org/wiki/Marlín>.
> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
> <http://es.wikipedia.org/wiki/Marlín_(Ávila)>.
> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
> <http://en.wikipedia.org/wiki/Marlín>.
> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
> <http://it.wikipedia.org/wiki/Marlín>.
>
> It appears to be a single canonical alpha link for each language Wikipedia
> with the MQL escaping undone and the redirects resolved.

Sounds good! I think we would only have to change a few lines in our
script to use these instead.
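
A rough sketch of what reading these triples might look like, written in
Python rather than in our Scala script. It assumes each triple sits on a
single line (the excerpt above is wrapped by the mail client), and the
names parse_equivalent_webpage and dbpedia_iri are hypothetical. The
mapping of non-English links to <lang>.dbpedia.org is also an assumption,
not something the script is known to do.

```python
import re

# Matches one Turtle-style line of the form
#   ns:m.09q3rp  ns:common.topic.topic_equivalent_webpage  <http://es.wikipedia.org/wiki/...>.
# Assumption: the whole triple is on one line and the link is a Wikipedia URL.
TRIPLE = re.compile(
    r"^ns:(?P<mid>m\.\w+)\s+"
    r"ns:common\.topic\.topic_equivalent_webpage\s+"
    r"<http://(?P<lang>[a-z\-]+)\.wikipedia\.org/wiki/(?P<title>[^>]+)>\s*\.\s*$"
)


def parse_equivalent_webpage(line):
    """Return (mid, language, title) for a Wikipedia link line, else None."""
    match = TRIPLE.match(line.strip())
    if match is None:
        return None
    return match.group("mid"), match.group("lang"), match.group("title")


def dbpedia_iri(lang, title):
    """Hypothetical mapping: English on dbpedia.org, other languages on a subdomain."""
    host = "dbpedia.org" if lang == "en" else lang + ".dbpedia.org"
    return "http://" + host + "/resource/" + title
```

Since the titles in these links are already unescaped, no MQL key handling
would be needed on this path.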

>
> Tom
>
> On Mon, Mar 25, 2013 at 9:18 AM, Tom Morris <tfmor...@gmail.com> wrote:
>>
>> I wouldn't claim that Freebase is bug-free, but that's quite an old and
>> simple algorithm, so unless they're triples from very early in its life
>> (say, 2007), I'd guess that bad input data from Wikipedia is more likely
>> than a problem with the transformation.
>>
>> It might help to give a little background on how Freebase deals with these
>> links.  The canonical link uses the article number (in the namespace
>> /wikipedia/en_id), but the alpha title (MQL key escaped) *and all redirects*
>> are also stored (namespace /wikipedia/en).  Additionally, the same
>> information has recently been added for a number of the other language
>> Wikipedias.
>>
>> You can see them all here for the example that Andrea mentioned:
>>
>>   https://www.freebase.com/m/09q3rp?keys
>>
>> Outbound links from Freebase to Wikipedia are made using the article
>> number, so that's really the most important link.  The wisdom of including
>> redirects is debatable, I think.  Sometimes they're good alternate names,
>> but other times they represent misspellings, related concepts, etc.
>>
>> If DBpedia has the Wikipedia article number, I'd suggest creating the
>> links based on those.  If not, I'd suggest using the redirect file to
>> canonicalize on a single "best" link.
>>
>> Tom
>>
>>
>>
>> On Mon, Mar 25, 2013 at 6:41 AM, Andrea Di Menna <ninn...@gmail.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> it looks like some Wikipedia pages actually contain wrong data, and
>>> that is where the corresponding Freebase keys originate, e.g.
>>>
>>> http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
>>>
>>> This page was deleted on Jan 21, and it is what actually led to the
>>> Freebase key
>>>
>>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>
>>> since UTF-8 0xC3 0x83 -> Unicode U+00C3, etc.
>>>
>>> Cheers
>>> Andrea
>>>
>>>
>>> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>>>>
>>>> Hi,
>>>>
>>>> Maybe the only thing that can be done is to notify the Freebase
>>>> discussion list about this problem.
>>>> Agree with Jona that the number of problematic references is not
>>>> relevant.
>>>>
>>>> Cheers
>>>> Andrea
>>>>
>>>>
>>>> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>>>
>>>>>
>>>>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>>>>> >
>>>>> > Can someone point to the part of the discussion which talks about
>>>>> > what the problem is?  This thread seems to start in mid-stream...
>>>>>
>>>>> That's right. Sorry. The start of the thread is in the middle of this
>>>>> page:
>>>>>
>>>>> https://github.com/dbpedia/extraction-framework/pull/25
>>>>>
>>>>> >
>>>>> > Freebase's MQL key encoding
>>>>> > (http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely private
>>>>> > encoding which shouldn't have any effect on external
>>>>> > URIs/IRIs/references/etc
>>>>>
>>>>> That's correct, and that's how the Scala script has always worked: it
>>>>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>>>>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>>>>> Windows-1252 bytes instead of Unicode code points), and some others
>>>>> contain whitespace like U+2003 that is invalid even in IRIs.
>>>>>
>>>>> I would guess though that it's not a big problem because the affected
>>>>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they do
>>>>> not represent valid, current, non-redirect Wikipedia page titles. That's
>>>>> just a guess though, based on only a very cursory look at a few bad keys.
>>>>>
>>>>> I don't remember if these problems also came up when I ran the script
>>>>> on the old Freebase dump format.
>>>>>
>>>>> JC
>>>>>
>>>>> >
>>>>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt
>>>>> > <j...@sahnwaldt.de> wrote:
>>>>> >>
>>>>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>>> >> >
>>>>> >> > Hi Jona,
>>>>> >> >
>>>>> >> > thanks for merging the pull request!
>>>>> >> >
>>>>> >> > Anyway, couldn't we use percent encoding for Unicode code points
>>>>> >> > which are not allowed in N-Triples (namely those outside the
>>>>> >> > [#x20,#x7E] range)?
>>>>> >> > In this case we should get the UTF-8 bytes and percent-encode them.
>>>>> >> >
>>>>> >> > For example, as far as I can see
>>>>> >> >
>>>>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>> >> >
>>>>> >> > is
>>>>> >> >
>>>>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>>>> >> >
>>>>> >> > where \00C3 is 0xC3 0x83
>>>>> >> >          \00AD is 0xC2 0xAD
>>>>> >> >          \0081 is 0xC2 0x81
>>>>> >>
>>>>> >> Oh, by the way, it would be
>>>>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's
>>>>> >> the UTF-8 percent-encoding for Marlín,_Ávila.
>>>>> >>
>>>>> >> The weird thing is that these Wikipedia page titles in Freebase
>>>>> >> contain UTF-8-encoded characters when they should contain no
>>>>> >> encoding at all, just plain Unicode code points. (Of course, the
>>>>> >> characters and code points are also dollar-escaped as usual for
>>>>> >> Freebase, but that's not a problem.)
>>>>> >>
>>>>> >>
>>>>> >> JC
>>>>> >>
>>>>> >> >
>>>>> >> > WDYT?
>>>>> >> >
>>>>> >> > Cheers
>>>>> >> > Andrea
>>>>> >> >
>>>>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>>>> >> >>
>>>>> >> >> Ok, I got it. It has nothing to do with your platform. These are
>>>>> >> >> actually wrong URIs. There's not much we can do about it. I don't
>>>>> >> >> know where Freebase got them from, but I assume they may actually
>>>>> >> >> be wrong in Wikipedia.
>>>>> >> >>
>>>>> >> >> Examples:
>>>>> >> >>
>>>>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that the
>>>>> >> >> numbers should be plain Unicode code points, not UTF-8 bytes. 81 is
>>>>> >> >> a C1 control code point, not a printable character, so we generate
>>>>> >> >> an invalid URI.
>>>>> >> >>
>>>>> >> >> Bene$009A_decrees
>>>>> >> >> 9A is the Windows-1252 encoding for š, but U+009A is a C1 control
>>>>> >> >> code in Unicode.
>>>>> >> >>
>>>>> >> >> Switzerland$2003
>>>>> >> >> 2003, 2029 etc. are valid Unicode code points, but for whitespace
>>>>> >> >> characters that are invalid in URIs.
>>>>> >> >>
>>>>> >> >> In a nutshell: all these characters are invalid in URIs, and it's
>>>>> >> >> not our fault. I'll pull your changes in a moment.
>>>>> >> >>
>>>>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>>>> >> >>
>>>>> >> >
>>>>> >> >
>>>>> >>
>>>>> >>
>>>>> >> ------------------------------------------------------------------------------
>>>>> >> Everyone hates slow websites. So do we.
>>>>> >> Make your web apps faster with AppDynamics
>>>>> >> Download AppDynamics Lite for free today:
>>>>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>>>>> >> _______________________________________________
>>>>> >> Dbpedia-discussion mailing list
>>>>> >> Dbpedia-discussion@lists.sourceforge.net
>>>>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
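
For reference, the key handling discussed in the quoted thread can be
sketched roughly as follows. This is Python, not the actual Scala script,
and the function names (unescape_mql_key, diagnose, percent_encode) are
made up for illustration. It assumes an MQL escape is always "$" followed
by exactly four hex digits, and the "utf8-bytes" check is only a
heuristic: it reinterprets code points up to U+00FF as bytes and tries to
decode them as UTF-8.

```python
import re
from urllib.parse import quote

# Assumption: every MQL escape is "$" plus exactly four hex digits.
ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def unescape_mql_key(key):
    """Replace each $XXXX escape with the character at that code point."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def diagnose(title):
    """Classify an unescaped title; returns (label, repaired title or None)."""
    non_ascii = [ch for ch in title if ord(ch) > 0x7F]
    # Heuristic: code points <= U+00FF that form valid UTF-8 when read as
    # bytes were probably UTF-8 bytes stored as code points.
    if non_ascii and all(ord(ch) <= 0xFF for ch in non_ascii):
        try:
            return "utf8-bytes", title.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            pass
    # C1 controls (U+0080..U+009F), e.g. a stray Windows-1252 byte.
    if any(0x80 <= ord(ch) <= 0x9F for ch in title):
        return "c1-control", None
    # Any whitespace, e.g. U+2003, is not allowed in an IRI.
    if any(ch.isspace() for ch in title):
        return "whitespace", None
    return "ok", title


def percent_encode(title):
    """Percent-encode the UTF-8 bytes of a title, keeping "," readable."""
    return quote(title, safe=",")
```

With the examples from the thread, the Marlín key is classified as
"utf8-bytes" and can be repaired, while the Bene and Switzerland keys are
only flagged. Percent-encoding the unrepaired title reproduces the
double-encoded URI that Andrea quoted.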
