Hi,

Maybe the only thing that can be done is to notify the Freebase discussion
list about this problem.
I agree with Jona that the number of problematic references is not relevant.

Cheers
Andrea

2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>

>
> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
> >
> > Can someone point to the part of the discussion which talks about what
> > the problem is?  This thread seems to start in mid-stream...
>
> That's right. Sorry. The start of the thread is in the middle of this page:
>
> https://github.com/dbpedia/extraction-framework/pull/25
>
> >
> > Freebase's MQL key encoding
> > (http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely private
> > encoding which shouldn't have any effect on external
> > URIs/IRIs/references/etc.
>
> That's correct, and that's how the Scala script has always worked: it
> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
> problems arise because some MQL keys contain invalid escapes (UTF-8 and
> Windows-1252 bytes instead of Unicode code points), and some others contain
> whitespace like U+2003 that is invalid even in IRIs.
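
To make the two kinds of bad keys concrete, here is a minimal Python sketch (not the Scala script itself). It assumes an MQL escape is '$' followed by four hex digits, as the Freebase wiki page describes; the function names are hypothetical and the checks are only a heuristic.

```python
import re
import unicodedata

# Assumption: one MQL escape is '$' followed by exactly four hex digits.
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def unescape_mql_key(key: str) -> str:
    """Replace each $XXXX escape with the code point U+XXXX."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def suspicious_code_points(title: str) -> list:
    """List code points that hint at a bad escape or an IRI-invalid character."""
    found = []
    for ch in title:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            # C1 controls: most likely a raw UTF-8 or Windows-1252 byte.
            found.append(f"U+{cp:04X} (C1 control)")
        elif unicodedata.category(ch) in ("Zs", "Zl", "Zp"):
            # Space, line and paragraph separators such as U+2003 or U+2029.
            found.append(f"U+{cp:04X} (whitespace)")
    return found
```

Running suspicious_code_points(unescape_mql_key(key)) on the three example keys quoted further down flags U+0081, U+009A and U+2003 respectively.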
>
> I would guess though that it's not a big problem because the affected keys
> are 1. not many, i.e. <1% and 2. not relevant anyway because they do not
> represent valid, current, non-redirect Wikipedia page titles. That's just a
> guess though, based on only a very cursory look at a few bad keys.
>
> I don't remember if these problems also came up when I ran the script on
> the old Freebase dump format.
>
> JC
>
> >
> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt
> > <j...@sahnwaldt.de> wrote:
> >>
> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
> >> >
> >> > Hi Jona,
> >> >
> >> > thanks for merging the pull request!
> >> >
> >> > Anyway, couldn't we use percent encoding for Unicode code points which
> >> > are not allowed in N-Triples (namely those outside the [#x20,#x7E]
> >> > range)? In this case we should get the UTF-8 bytes and percent-encode
> >> > them.
> >> >
> >> > For example, as far as I can see
> >> >
> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
> >> >
> >> > is
> >> >
> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
> >> >
> >> > where \00C3 is 0xC3 0x83
> >> >          \00AD is 0xC2 0xAD
> >> >          \0081 is 0xC2 0x81
> >>
> >> Oh, by the way, it would be
> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's the
> >> UTF-8-percent-encoding for Marlín,_Ávila.
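
A small Python sketch of the difference between the two URIs above. The repair step is only one possible heuristic and an assumption on my part, not something the DBpedia script does: it reinterprets code points up to U+00FF as raw bytes and decodes them as UTF-8. The function names are hypothetical.

```python
from urllib.parse import quote

PREFIX = "http://dbpedia.org/resource/"


def to_resource_uri(title: str) -> str:
    """Percent-encode the UTF-8 bytes of a title, leaving ',' and '_' as-is."""
    return PREFIX + quote(title, safe=",")


def repair_utf8_bytes(title: str) -> str:
    """Heuristic: treat code points <= U+00FF as bytes and decode as UTF-8.

    Returns the title unchanged when that reinterpretation is not possible.
    """
    try:
        return title.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return title


# What the key Marl$00C3$00ADn$002C_$00C3$0081vila unescapes to.
bad_title = "Marl\u00c3\u00adn,_\u00c3\u0081vila"

print(to_resource_uri(bad_title))
# http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
print(to_resource_uri(repair_utf8_bytes(bad_title)))
# http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila
```

Titles that are already correct, such as one containing a single é (U+00E9), do not form valid UTF-8 when read as bytes, so the heuristic leaves them alone.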
> >>
> >> The weird thing is that these Wikipedia page titles in the Freebase
> >> dump contain UTF-8-encoded characters when they should contain no
> >> encoding at all, just plain Unicode code points. (Of course, the
> >> characters and code points are also dollar-escaped as usual for
> >> Freebase, but that's not a problem.)
> >>
> >>
> >> JC
> >>
> >> >
> >> > WDYT?
> >> >
> >> > Cheers
> >> > Andrea
> >> >
> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
> >> >>
> >> >> Ok, I got it. It has nothing to do with your platform. These are
> >> >> actually wrong URIs. There's not much we can do about it. I don't know
> >> >> where Freebase got them from, but I assume they may actually be wrong
> >> >> in Wikipedia.
> >> >>
> >> >> Examples:
> >> >>
> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that the
> >> >> numbers should be plain Unicode code points, not UTF-8 bytes. U+0081 is
> >> >> a C1 control character, not a letter, so we generate an invalid URI.
> >> >>
> >> >> Bene$009A_decrees
> >> >> 9A is the Windows-1252 encoding for š, but in Unicode U+009A is a C1
> >> >> control character, which is invalid in URIs.
> >> >>
> >> >> Switzerland$2003
> >> >> 2003, 2029 etc. are valid Unicode code points, but they are whitespace
> >> >> characters that are invalid in URIs.
> >> >>
> >> >> In a nutshell: all these characters are invalid in URIs, and it's not
> >> >> our fault. I'll pull your changes in a moment.
> >> >>
> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
> >> >>
> >> >
> >> >
> >>
> >>
> ------------------------------------------------------------------------------
> >> Everyone hates slow websites. So do we.
> >> Make your web apps faster with AppDynamics
> >> Download AppDynamics Lite for free today:
> >> http://p.sf.net/sfu/appdyn_d2d_mar
> >> _______________________________________________
> >> Dbpedia-discussion mailing list
> >> Dbpedia-discussion@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
> >
> >
>
>
