Hi all,

it looks like Wikipedia itself actually contains some pages with wrong
titles, and these are where the bad Freebase keys originate, e.g.

http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila

This page was deleted on Jan 21, and it is what actually led to the
Freebase key

Marl$00C3$00ADn$002C_$00C3$0081vila

since the UTF-8 bytes 0xC3 0x83 decode to Unicode U+00C3, etc.
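
To make the chain explicit, here is a minimal Python sketch of what I think
is going on. The helper names mql_unescape and repair_mojibake are my own
(hypothetical, not taken from the Scala script), and the regex assumes that
every escape is a dollar sign followed by exactly four hex digits, as in the
examples in this thread. The repair step assumes the title was UTF-8 that got
read as Latin-1 once, which fits this key but is only a guess for other keys.

```python
import re
from urllib.parse import quote


def mql_unescape(key: str) -> str:
    """Replace each $XXXX escape (four hex digits) with the code point it names."""
    return re.sub(
        r"\$([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        key,
    )


def repair_mojibake(text: str) -> str:
    """Treat code points U+0000..U+00FF as raw bytes and decode them as UTF-8.

    Raises UnicodeError if the text does not fit that pattern.
    """
    return text.encode("latin-1").decode("utf-8")


key = "Marl$00C3$00ADn$002C_$00C3$0081vila"

# Unescaping gives the doubly encoded title, which percent-encodes to the
# path of the deleted Wikipedia page.
title = mql_unescape(key)
print(quote(title, safe=",_"))

# Undoing one level of encoding gives the intended title and its
# percent-encoded form.
repaired = repair_mojibake(title)
print(repaired)
print(quote(repaired, safe=",_"))
```

The first print should reproduce the path of the deleted page above, and the
last one the percent-encoded form of the intended title.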

Cheers
Andrea

2013/3/25 Andrea Di Menna <ninn...@gmail.com>

> Hi,
>
> Maybe the only thing that can be done is to notify the freebase discussion
> list about this problem.
> I agree with Jona that the number of problematic references is too small to matter.
>
> Cheers
> Andrea
>
>
> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>
>>
>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>> >
>> > Can someone point to the part of the discussion which talks about what
>> > the problem is?  This thread seems to start in mid-stream...
>>
>> That's right. Sorry. The start of the thread is in the middle of this
>> page:
>>
>> https://github.com/dbpedia/extraction-framework/pull/25
>>
>> >
>> > Freebase's MQL key encoding
>> > (http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely private
>> > encoding which shouldn't have any effect on external
>> > URIs/IRIs/references/etc.
>>
>> That's correct, and that's how the Scala script has always worked: it
>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>> Windows-1252 bytes instead of Unicode code points), and some others contain
>> whitespace like U+2003 that is invalid even in IRIs.
>>
>> I would guess though that it's not a big problem because the affected
>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they do
>> not represent valid, current, non-redirect Wikipedia page titles. That's
>> just a guess though, based on only a very cursory look at a few bad keys.
>>
>> I don't remember if these problems also came up when I ran the script on
>> the old freebase dump format.
>>
>> JC
>>
>> >
>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt <
>> j...@sahnwaldt.de> wrote:
>> >>
>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>> >> >
>> >> > Hi Jona,
>> >> >
>> >> > thanks for merging the pull request!
>> >> >
>> >> > Anyway, couldn't we use percent encoding for Unicode code points which
>> >> > are not allowed in N-Triples (namely those outside the [#x20,#x7E]
>> >> > range)? In this case we should get the UTF-8 bytes and percent-encode
>> >> > them.
>> >> >
>> >> > For example, as far as I can see
>> >> >
>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>> >> >
>> >> > is
>> >> >
>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>> >> >
>> >> > where \00C3 is 0xC3 0x83
>> >> >          \00AD is 0xC2 0xAD
>> >> >          \0081 is 0xC2 0x81
>> >>
>> >> Oh, by the way, it would be
>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's the
>> >> UTF-8-percent-encoding for Marlín,_Ávila.
>> >>
>> >> The weird thing is that these Wikipedia page titles in Freebase
>> >> contain UTF-8-encoded characters when they should contain no encoding
>> >> at all, just plain Unicode code points. (Of course, the characters and
>> >> codepoints are also dollar-escaped as usual for Freebase, but that's
>> >> not a problem.)
>> >>
>> >>
>> >> JC
>> >>
>> >> >
>> >> > WDYT?
>> >> >
>> >> > Cheers
>> >> > Andrea
>> >> >
>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>> >> >>
>> >> >> Ok, I got it. It has nothing to do with your platform. These are
>> >> >> actually wrong URIs. There's not much we can do about it. I don't know
>> >> >> where Freebase got them from, but I assume they may actually be wrong
>> >> >> in Wikipedia.
>> >> >>
>> >> >> Examples:
>> >> >>
>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>> >> >> C3 AD and C3 81 are UTF-8 encodings (of í and Á), but Freebase says [1]
>> >> >> that the numbers should be plain Unicode code points, not UTF-8 bytes.
>> >> >> Read as a code point, 81 is a C1 control character, so we generate an
>> >> >> invalid URI.
>> >> >>
>> >> >> Bene$009A_decrees
>> >> >> 9A is the Windows-1252 encoding for š, but in Unicode, U+009A is a
>> >> >> C1 control character.
>> >> >>
>> >> >> Switzerland$2003
>> >> >> 2003, 2029 etc. are valid Unicode code points, but they are whitespace
>> >> >> characters that are invalid in URIs.
>> >> >>
>> >> >> In a nutshell: all these characters are invalid in URIs, and it's not
>> >> >> our fault. I'll pull your changes in a moment.
>> >> >>
>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> Everyone hates slow websites. So do we.
>> >> Make your web apps faster with AppDynamics
>> >> Download AppDynamics Lite for free today:
>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>> >> _______________________________________________
>> >> Dbpedia-discussion mailing list
>> >> Dbpedia-discussion@lists.sourceforge.net
>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>> >
>> >
>>
>>
>>
>>
>
