Hi Jona,

thanks for merging the pull request!

Anyway, couldn't we use percent-encoding for Unicode code points that are
not allowed in N-Triples (namely those outside the [#x20,#x7E] range)?
In that case we would take the UTF-8 bytes of each such code point and
percent-encode them.

For example, as far as I can see

Marl$00C3$00ADn$002C_$00C3$0081vila

would become

<http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>

where $00C3 (U+00C3) is 0xC3 0x83 in UTF-8,
      $00AD (U+00AD) is 0xC2 0xAD,
      $0081 (U+0081) is 0xC2 0x81.
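
To make the idea concrete, here is a rough sketch in Python (not the
framework's actual code; the function names are made up for illustration).
It assumes every Freebase escape has the form $XXXX with exactly four hex
digits, and it only handles the [#x20,#x7E] rule described above. ASCII
characters that are themselves problematic in URIs (space, <, >, " and so
on) are left untouched here and would need separate handling.

```python
import re

# Hypothetical helpers for illustration; not part of the extraction framework.
# Assumption: an MQL key escape is "$" followed by exactly four hex digits.
_MQL_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def unescape_mql_key(key: str) -> str:
    """Replace each $XXXX escape with the Unicode code point U+XXXX."""
    return _MQL_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def percent_encode_outside_ascii(text: str) -> str:
    """Percent-encode the UTF-8 bytes of every char outside U+0020..U+007E."""
    out = []
    for ch in text:
        if 0x20 <= ord(ch) <= 0x7E:
            out.append(ch)  # printable ASCII is kept as-is
        else:
            # Note: a lone surrogate (e.g. from $D800) cannot be encoded as
            # UTF-8 and would raise UnicodeEncodeError here.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


key = "Marl$00C3$00ADn$002C_$00C3$0081vila"
uri = ("http://dbpedia.org/resource/"
       + percent_encode_outside_ascii(unescape_mql_key(key)))
print(uri)
```

For the key above this yields the URI shown earlier. The result is still
not the "right" name of the place, since the Freebase key itself treats
UTF-8 bytes as code points, but at least the URI would be syntactically
valid.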

WDYT?

Cheers
Andrea

2013/3/22 Christopher Sahnwaldt <notificati...@github.com>

> Ok, I got it. It has nothing to do with your platform. These are actually
> wrong URIs. There's not much we can do about it. I don't know where
> Freebase got them from, but I assume they may actually be wrong in
> Wikipedia.
>
> Examples:
>
> Marl$00C3$00ADn$002C_$00C3$0081vila
> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that the
> numbers should be plain Unicode code points, not UTF-8 bytes. 81 is an
> invalid code point, so we generate an invalid URI.
>
> Bene$009A_decrees
> 9A is the Windows-1252 encoding for š, but 9A is invalid in Unicode.
>
> Switzerland$2003
> 2003, 2029 etc. are valid Unicode code points, but for whitespace
> characters that are invalid in URIs.
>
> In a nutshell: all these characters are invalid in URIs, and it's not our
> fault. I'll pull your changes in a moment.
>
> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>
> —
> Reply to this email directly or view it on 
> GitHub<https://github.com/dbpedia/extraction-framework/pull/25#issuecomment-15319409>
> .
>
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
