> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>
>> Hi Jona,
>>
>> thanks for merging the pull request!
>>
>> Anyway, couldn't we use percent encoding for Unicode code points which are
>> not allowed in N-Triples (namely those outside the [#x20,#x7E] range)?
>> In this case we should get the UTF-8 bytes and percent-encode them.
>>
>> For example, as far as I can see
>>
>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>
>> is
>>
>> <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>
>> where $00C3 (U+00C3) is 0xC3 0x83 in UTF-8,
>>       $00AD (U+00AD) is 0xC2 0xAD,
>>       $0081 (U+0081) is 0xC2 0x81.
>>
>> WDYT?
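
For illustration, a minimal Python sketch of the proposal above: decode each $XXXX escape to a code point, then percent-encode its UTF-8 bytes. The helper name is hypothetical, and the four-hex-digit escape pattern is an assumption based on the examples in this thread, not the extraction framework's actual code.

```python
import re
from urllib.parse import quote

def decode_mql_key(key: str) -> str:
    """Replace each $XXXX escape with the code point U+XXXX (assumed format)."""
    return re.sub(r"\$([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), key)

key = "Marl$00C3$00ADn$002C_$00C3$0081vila"
decoded = decode_mql_key(key)

# quote() encodes the string as UTF-8 and percent-encodes every byte that is
# not an unreserved character; the comma is kept literal via safe=",".
uri = "http://dbpedia.org/resource/" + quote(decoded, safe=",")
print(uri)
# http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
```

This reproduces the URI quoted above. It makes the URI syntactically valid, but it does not repair the key: the encoded characters are still the mis-decoded ones.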
>
> I prefer the "garbage in, garbage out" style. The freebase keys are

PS: or in this case, "garbage in, nothing out"

> broken. We could try to fix them, but we would have to use several
> different heuristics: with percent encoding, we could "fix" the keys
> that are UTF-8 encoded, but not the ones that are Windows-encoded. To
> fix the keys containing whitespace, we would first have to UTF-8-encode
> the Unicode code point, then percent-encode the UTF-8 bytes... it's a mess. And
> anyway, we try to move towards IRIs, not URIs, and IRIs wouldn't
> contain percent-encodings for these characters.
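
To make the "several different heuristics" concrete, here is a rough Python sketch of what guessing the intended characters could look like. Everything in it is an assumption for illustration: `guess_repair` is a hypothetical helper, and the order of encodings tried (UTF-8 first, then Windows-1252) is just one possible choice.

```python
import re

def decode_mql_key(key: str) -> str:
    """Replace each $XXXX escape with the code point U+XXXX (assumed format)."""
    return re.sub(r"\$([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), key)

def guess_repair(decoded: str) -> str:
    """Hypothetical heuristic: treat code points below U+0100 as raw bytes
    and try to reinterpret them as UTF-8, then as Windows-1252."""
    try:
        raw = decoded.encode("latin-1")
    except UnicodeEncodeError:
        # Contains code points above U+00FF, so it was never a byte string.
        return decoded
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return decoded

for key in ("Marl$00C3$00ADn$002C_$00C3$0081vila",
            "Bene$009A_decrees",
            "Switzerland$2003"):
    print(repr(guess_repair(decode_mql_key(key))))
```

The first two keys come out as "Marlín,_Ávila" and "Beneš_decrees", while the whitespace key is returned unchanged and would still need separate handling. A heuristic like this can also "repair" keys that were correct to begin with, which is part of why it gets messy.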
>
> How many keys are affected anyway? I think we generate several million
> freebase links, so even if 100,000 freebase keys are broken, it's not
> a big problem.
>
> JC
>
>>
>> Cheers
>> Andrea
>>
>> 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>>
>>> Ok, I got it. It has nothing to do with your platform. These are actually
>>> wrong URIs. There's not much we can do about it. I don't know where Freebase
>>> got them from, but I assume they may actually be wrong in Wikipedia.
>>>
>>> Examples:
>>>
>>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>> C3 AD and C3 81 are UTF-8 encodings (of í and Á), but Freebase says [1] that
>>> the numbers should be plain Unicode code points, not UTF-8 bytes. U+0081 is
>>> a control character that is not allowed in URIs, so we generate an invalid URI.
>>>
>>> Bene$009A_decrees
>>> 9A is the Windows-1252 encoding for š, but 9A is invalid in Unicode (U+009A is a control character).
>>>
>>> Switzerland$2003
>>> 2003, 2029 etc. are valid Unicode code points, but they are whitespace
>>> characters, which are invalid in URIs.
>>>
>>> In a nutshell: all these characters are invalid in URIs, and it's not our
>>> fault. I'll pull your changes in a moment.
>>>
>>> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>>
>>> —
>>> Reply to this email directly or view it on GitHub.
>>
>>
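
As a rough illustration of the three example keys discussed in this thread, the following Python sketch decodes each key and lists the characters that make the resulting URI invalid. The function names are hypothetical, and the criterion used here (Unicode control and separator categories) is a simplifying assumption, not the validation the extraction framework actually performs.

```python
import re
import unicodedata

def decode_mql_key(key: str) -> str:
    """Replace each $XXXX escape with the code point U+XXXX (assumed format)."""
    return re.sub(r"\$([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), key)

def problem_chars(decoded: str) -> list:
    """Return (code point, category) pairs for control characters (Cc)
    and separators (Zs, Zl, Zp) found in the decoded key."""
    problems = []
    for ch in decoded:
        category = unicodedata.category(ch)
        if category == "Cc" or category.startswith("Z"):
            problems.append((f"U+{ord(ch):04X}", category))
    return problems

for key in ("Marl$00C3$00ADn$002C_$00C3$0081vila",
            "Bene$009A_decrees",
            "Switzerland$2003"):
    print(key, problem_chars(decode_mql_key(key)))
```

With "garbage in, nothing out", a key for which this list is non-empty would simply produce no link.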

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
