On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>
> Can someone point to the part of the discussion which talks about what
> the problem is? This thread seems to start in mid-stream...
That's right. Sorry. The start of the thread is in the middle of this page:
https://github.com/dbpedia/extraction-framework/pull/25
>
> Freebase's MQL key encoding (http://wiki.freebase.com/wiki/MQL_key_escaping)
> is a completely private encoding which shouldn't have any effect on external
> URIs/IRIs/references/etc.
That's correct, and that's how the Scala script has always worked: it
unescapes the MQL keys and uses the result to form DBpedia IRIs. The
problems arise because some MQL keys contain invalid escapes (UTF-8 and
Windows-1252 bytes instead of Unicode code points), and some others contain
whitespace like U+2003 that is invalid even in IRIs.
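
To make the unescaping step concrete, here is a minimal Python sketch (not the Scala script itself). It assumes an escape is "$" followed by exactly four uppercase hex digits, as in the keys quoted in this thread. The names unescape_mql_key and to_resource_uri are made up for illustration, and the percent-encoding follows Andrea's suggestion quoted below rather than what the script does.

```python
import re
from urllib.parse import quote

# Assumption: an MQL escape is "$" plus four uppercase hex digits,
# matching the keys quoted in this thread.
MQL_ESCAPE = re.compile(r"\$([0-9A-F]{4})")

def unescape_mql_key(key: str) -> str:
    """Replace every $XXXX escape with the code point U+XXXX."""
    return MQL_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)

def to_resource_uri(key: str) -> str:
    """Percent-encode the unescaped title as UTF-8 (hypothetical helper)."""
    title = unescape_mql_key(key)
    # "_" is never escaped by quote(); "," is kept literal via safe=.
    return "http://dbpedia.org/resource/" + quote(title, safe=",_")

key = "Marl$00C3$00ADn$002C_$00C3$0081vila"
title = unescape_mql_key(key)
uri = to_resource_uri(key)
print([hex(ord(c)) for c in title if ord(c) > 0x7F])
print(uri)
```

With the Marlín key, the unescaped title contains U+00C3, U+00AD and U+0081, and the resulting URI is the double-encoded one from Andrea's example.
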
I would guess though that it's not a big problem because the affected keys
are 1. not many, i.e. <1% and 2. not relevant anyway because they do not
represent valid, current, non-redirect Wikipedia page titles. That's just a
guess though, based on only a very cursory look at a few bad keys.
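
A rough way to check that guess would be to sort the unescaped titles into the three problem classes and count them. The following is only a sketch: classify_title and the class names are my own, and the UTF-8 test is a heuristic that could misfire on a title that really does contain a sequence like "Ã" followed by another Latin-1 character.

```python
import unicodedata
from collections import Counter

def classify_title(title: str) -> str:
    """Sort an already unescaped title into one of the problem classes."""
    # Class 1: all characters fit in one byte each, and those bytes form
    # valid UTF-8 with at least one non-ASCII byte.
    try:
        raw = title.encode("latin-1")
    except UnicodeEncodeError:
        raw = b""
    if any(b >= 0x80 for b in raw):
        try:
            raw.decode("utf-8")
            return "utf8-bytes"
        except UnicodeDecodeError:
            pass
    # Class 2: C1 control range U+0080..U+009F, which is where stray
    # Windows-1252 bytes such as 9A end up.
    if any(0x80 <= ord(c) <= 0x9F for c in title):
        return "c1-control"
    # Class 3: non-ASCII space and separator characters (U+2003, U+2029, ...).
    if any(ord(c) > 0x7F and unicodedata.category(c) in ("Zs", "Zl", "Zp")
           for c in title):
        return "whitespace"
    return "ok"

samples = [
    "Marl\u00c3\u00adn,_\u00c3\u0081vila",
    "Bene\u009a_decrees",
    "Switzerland\u2003",
    "Z\u00fcrich",
]
counts = Counter(classify_title(t) for t in samples)
print(counts)
```

Run over all keys in a dump, the counts would give the actual share of bad keys instead of my estimate.
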
I don't remember if these problems also came up when I ran the script on
the old Freebase dump format.
JC
>
> On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt <j...@sahnwaldt.de> wrote:
>>
>> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>> >
>> > Hi Jona,
>> >
>> > thanks for merging the pull request!
>> >
>> > Anyway, couldn't we use percent encoding for Unicode code points which are
>> > not allowed in N-Triples (namely those outside the [#x20,#x7E] range)?
>> > In this case we should get UTF-8 bytes and percent encode them.
>> >
>> > For example, as far as I can see
>> >
>> > Marl$00C3$00ADn$002C_$00C3$0081vila
>> >
>> > is
>> >
>> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>> >
>> > where $00C3 (U+00C3) is 0xC3 0x83 in UTF-8
>> > $00AD (U+00AD) is 0xC2 0xAD
>> > $0081 (U+0081) is 0xC2 0x81
>>
>> Oh, by the way, the correct URI would be
>> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's the
>> UTF-8 percent-encoding for Marlín,_Ávila.
>>
>> The weird thing is that these Wikipedia page titles in Freebase
>> contain UTF-8-encoded characters when they should contain no encoding
>> at all, just plain Unicode code points. (Of course, the characters and
>> code points are also dollar-escaped as usual for Freebase, but that's
>> not a problem.)
>>
>>
>> JC
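
For what it's worth, keys of this kind could probably be repaired by treating the code points as bytes and decoding them as UTF-8. A small Python sketch of that idea, assuming the title has already been unescaped; repair_utf8_as_code_points is a hypothetical name, and nothing like this is in the extraction framework as far as this thread shows.

```python
from urllib.parse import quote

def repair_utf8_as_code_points(title: str) -> str:
    """Reinterpret code points U+0000..U+00FF as bytes and decode as UTF-8.

    Returns the title unchanged when it contains code points above U+00FF
    or when the bytes are not valid UTF-8.
    """
    try:
        return title.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return title

# Unescaped form of Marl$00C3$00ADn$002C_$00C3$0081vila
broken = "Marl\u00c3\u00adn,_\u00c3\u0081vila"
fixed = repair_utf8_as_code_points(broken)
uri = "http://dbpedia.org/resource/" + quote(fixed, safe=",_")
print(fixed)
print(uri)
```

Titles that are already correct, such as one containing a plain ü, fail the UTF-8 decode and come back unchanged. The Windows-1252 and whitespace cases are not handled here.
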
>>
>> >
>> > WDYT?
>> >
>> > Cheers
>> > Andrea
>> >
>> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>> >>
>> >> Ok, I got it. It has nothing to do with your platform. These are actually
>> >> wrong URIs. There's not much we can do about it. I don't know where Freebase
>> >> got them from, but I assume they may actually be wrong in Wikipedia.
>> >>
>> >> Examples:
>> >>
>> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>> >> C3 AD and C3 81 are UTF-8 encodings (of í and Á), but Freebase says [1]
>> >> that the numbers should be plain Unicode code points, not UTF-8 bytes.
>> >> Read as a code point, 81 is U+0081, a control character, so we generate
>> >> an invalid URI.
>> >>
>> >> Bene$009A_decrees
>> >> 9A is the Windows-1252 encoding for š, but as a Unicode code point 9A is
>> >> a control character, not š.
>> >>
>> >> Switzerland$2003
>> >> 2003, 2029 etc. are valid Unicode code points, but they are whitespace
>> >> characters that are invalid in URIs.
>> >>
>> >> In a nutshell: all these characters are invalid in URIs, and it's not our
>> >> fault. I'll pull your changes in a moment.
>> >>
>> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>> >>
>> >
>> >
>>
>>
>> ------------------------------------------------------------------------------
>> Everyone hates slow websites. So do we.
>> Make your web apps faster with AppDynamics
>> Download AppDynamics Lite for free today:
>> http://p.sf.net/sfu/appdyn_d2d_mar
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> Dbpedia-discussion@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>