On 25 March 2013 15:00, Tom Morris <tfmor...@gmail.com> wrote:
> Another approach might be to use the recently introduced Topic Equivalent
> Webpage property:
>
> ns:m.09q3rp ns:common.topic.topic_equivalent_webpage <http://pt.wikipedia.org/wiki/Marlín>.
> ns:m.09q3rp ns:common.topic.topic_equivalent_webpage <http://es.wikipedia.org/wiki/Marlín_(Ávila)>.
> ns:m.09q3rp ns:common.topic.topic_equivalent_webpage <http://en.wikipedia.org/wiki/Marlín>.
> ns:m.09q3rp ns:common.topic.topic_equivalent_webpage <http://it.wikipedia.org/wiki/Marlín>.
>
> It appears to be a single canonical alpha link for each language Wikipedia
> with the MQL escaping undone and the redirects resolved.
Sounds good! I think we would only have to change a few lines in our
script to use these instead.

>
> Tom
>
> On Mon, Mar 25, 2013 at 9:18 AM, Tom Morris <tfmor...@gmail.com> wrote:
>>
>> I wouldn't claim that Freebase is bug-free, but that's quite an old and
>> simple algorithm, so unless they're triples from very early in its life
>> (say, 2007), I'd guess that bad input data from Wikipedia is more likely
>> than a problem with the transformation.
>>
>> It might help to give a little background on how Freebase deals with these
>> links. The canonical link uses the article number (in the namespace
>> /wikipedia/en_id), but the alpha title (MQL key escaped) *and all redirects*
>> are also stored (namespace /wikipedia/en). Additionally, the same
>> information has recently been added for a number of the other language
>> Wikipedias.
>>
>> You can see them all here for the example that Andrea mentioned:
>>
>> https://www.freebase.com/m/09q3rp?keys
>>
>> Outbound links from Freebase to Wikipedia are made using the article
>> number, so that's really the most important link. The wisdom of including
>> redirects is debatable, I think. Sometimes they're good alternate names,
>> but other times they represent misspellings, related concepts, etc.
>>
>> If DBpedia has the Wikipedia article number, I'd suggest creating the
>> links based on those. If not, I'd suggest using the redirect file to
>> canonicalize on a single "best" link.
>>
>> Tom
>>
>>
>> On Mon, Mar 25, 2013 at 6:41 AM, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> it looks like there are actually some pages in Wikipedia which contain
>>> wrong data, which is where the pages originate from in Freebase, e.g.
>>>
>>> http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
>>>
>>> This page was deleted on Jan 21, and it actually led to the
>>> Freebase key
>>>
>>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>
>>> since UTF-8 0xC3 0x83 -> Unicode U+00C3, etc.
>>>
>>> Cheers
>>> Andrea
>>>
>>>
>>> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>>>>
>>>> Hi,
>>>>
>>>> Maybe the only thing that can be done is to notify the freebase
>>>> discussion list about this problem.
>>>> Agree with Jona that the number of problematic references is not
>>>> relevant.
>>>>
>>>> Cheers
>>>> Andrea
>>>>
>>>>
>>>> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>>>
>>>>>
>>>>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>>>>> >
>>>>> > Can someone point to the part of the discussion which talks about
>>>>> > what the problem is? This thread seems to start in mid-stream...
>>>>>
>>>>> That's right. Sorry. The start of the thread is in the middle of this
>>>>> page:
>>>>>
>>>>> https://github.com/dbpedia/extraction-framework/pull/25
>>>>>
>>>>> >
>>>>> > Freebase's MQL key encoding
>>>>> > (http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely private
>>>>> > encoding which shouldn't have any effect on external
>>>>> > URIs/IRIs/references/etc
>>>>>
>>>>> That's correct, and that's how the Scala script has always worked: it
>>>>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>>>>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>>>>> Windows-1252 bytes instead of Unicode code points), and some others
>>>>> contain whitespace like U+2003 that is invalid even in IRIs.
>>>>>
>>>>> I would guess though that it's not a big problem because the affected
>>>>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they do
>>>>> not represent valid, current, non-redirect Wikipedia page titles. That's
>>>>> just a guess though, based on only a very cursory look at a few bad keys.
>>>>>
>>>>> I don't remember if these problems also came up when I ran the script
>>>>> on the old freebase dump format.
>>>>>
>>>>> JC
>>>>>
>>>>> >
>>>>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt
>>>>> > <j...@sahnwaldt.de> wrote:
>>>>> >>
>>>>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>>> >> >
>>>>> >> > Hi Jona,
>>>>> >> >
>>>>> >> > thanks for merging the pull request!
>>>>> >> >
>>>>> >> > Anyway, couldn't we use percent encoding for Unicode code points
>>>>> >> > which are not allowed in N-Triples (namely those outside the
>>>>> >> > [#x20,#x7E] range)?
>>>>> >> > In this case we should get UTF-8 bytes and percent encode them.
>>>>> >> >
>>>>> >> > For example, as far as I can see
>>>>> >> >
>>>>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>> >> >
>>>>> >> > is
>>>>> >> >
>>>>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>>>> >> >
>>>>> >> > where \00C3 is 0xC3 0x83
>>>>> >> >       \00AD is 0xC2 0xAD
>>>>> >> >       \0081 is 0xC2 0x81
>>>>> >>
>>>>> >> Oh, by the way, it would be
>>>>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's the
>>>>> >> UTF-8-percent-encoding for Marlín,_Ávila.
>>>>> >>
>>>>> >> The weird thing is that these Wikipedia page titles in Freebase
>>>>> >> contain UTF-8-encoded characters when they should contain no encoding
>>>>> >> at all, just plain Unicode code points. (Of course, the characters and
>>>>> >> code points are also dollar-escaped as usual for Freebase, but that's
>>>>> >> not a problem.)
>>>>> >>
>>>>> >>
>>>>> >> JC
>>>>> >>
>>>>> >> >
>>>>> >> > WDYT?
>>>>> >> >
>>>>> >> > Cheers
>>>>> >> > Andrea
>>>>> >> >
>>>>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>>>> >> >>
>>>>> >> >> Ok, I got it. It has nothing to do with your platform. These are
>>>>> >> >> actually wrong URIs. There's not much we can do about it. I don't
>>>>> >> >> know where Freebase got them from, but I assume they may actually
>>>>> >> >> be wrong in Wikipedia.
>>>>> >> >>
>>>>> >> >> Examples:
>>>>> >> >>
>>>>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that the
>>>>> >> >> numbers should be plain Unicode code points, not UTF-8 bytes. 81 is an
>>>>> >> >> invalid code point, so we generate an invalid URI.
>>>>> >> >>
>>>>> >> >> Bene$009A_decrees
>>>>> >> >> 9A is the Windows-1252 encoding for š, but 9A is invalid in Unicode.
>>>>> >> >>
>>>>> >> >> Switzerland$2003
>>>>> >> >> 2003, 2029 etc. are valid Unicode code points, but for whitespace
>>>>> >> >> characters that are invalid in URIs.
>>>>> >> >>
>>>>> >> >> In a nutshell: all these characters are invalid in URIs, and it's not
>>>>> >> >> our fault. I'll pull your changes in a moment.
>>>>> >> >>
>>>>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>>>> >> >
>>>>> >> >
>>>>> >>
>>>>> >>
>>>>> >> ------------------------------------------------------------------------------
>>>>> >> Everyone hates slow websites. So do we.
>>>>> >> Make your web apps faster with AppDynamics
>>>>> >> Download AppDynamics Lite for free today:
>>>>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>>>>> >> _______________________________________________
>>>>> >> Dbpedia-discussion mailing list
>>>>> >> Dbpedia-discussion@lists.sourceforge.net
>>>>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
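
The three kinds of bad key discussed in this thread can be reproduced with a short script. What follows is only a minimal Python sketch, not the Scala script mentioned in the thread. It assumes that MQL escapes always have the form `$` plus four hex digits, as in the examples quoted here. The helper names (`unescape_mql_key`, `repair_utf8_bytes`, `contains_whitespace`, `to_dbpedia_iri`) are hypothetical. The repair step is a heuristic, and the percent-encoding is simplified and may not match DBpedia's actual IRI rules.

```python
import re
from typing import Optional
from urllib.parse import quote

# Assumption: an MQL escape is "$" followed by exactly four hex digits.
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def unescape_mql_key(key: str) -> str:
    """Replace each $XXXX escape with the character at code point 0xXXXX."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def repair_utf8_bytes(title: str) -> Optional[str]:
    """Heuristic: if every character is below U+0100 and the characters,
    read as bytes, form valid UTF-8, return the decoded title.
    Return None when the title cannot or need not be repaired."""
    try:
        fixed = title.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
    return fixed if fixed != title else None


def contains_whitespace(title: str) -> bool:
    """Wikipedia titles in keys use "_" for spaces, so any whitespace
    character (for example U+2003) marks a suspicious key."""
    return any(ch.isspace() for ch in title)


def to_dbpedia_iri(title: str) -> str:
    """Simplified: percent-encode the UTF-8 bytes of the title."""
    return "http://dbpedia.org/resource/" + quote(title, safe="/,")


if __name__ == "__main__":
    raw = unescape_mql_key("Marl$00C3$00ADn$002C_$00C3$0081vila")
    print(to_dbpedia_iri(raw))                     # double-encoded form
    print(to_dbpedia_iri(repair_utf8_bytes(raw)))  # repaired form
    print(repair_utf8_bytes(unescape_mql_key("Bene$009A_decrees")))
    print(contains_whitespace(unescape_mql_key("Switzerland$2003")))
```

For the Marlín key, the unrepaired title yields the `%C3%83%C2%AD` form Andrea quoted, and the repaired title yields the `%C3%AD` form Jona gave. The `Bene$009A_decrees` key is not valid UTF-8 when read as bytes, so the heuristic returns `None`; a Windows-1252 fallback would be a separate step.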