On 25 March 2013 14:18, Tom Morris <tfmor...@gmail.com> wrote:
> I wouldn't claim that Freebase is bug-free, but that's quite an old and
> simple algorithm, so unless they're triples from very early in its life
> (say, 2007), I'd guess that bad input data from Wikipedia is more likely
> than a problem with the transformation.
>
> It might help to give a little background on how Freebase deals with these
> links. The canonical link uses the article number (in the namespace
> /wikipedia/en_id), but the alpha title (MQL key escaped) *and all redirects*
> are also stored (namespace /wikipedia/en). Additionally, the same
> information has recently been added for a number of the other language
> Wikipedias.
>
> You can see them all here for the example that Andrea mentioned:
>
> https://www.freebase.com/m/09q3rp?keys
>
> Outbound links from Freebase to Wikipedia are made using the article number,
> so that's really the most important link. The wisdom of including redirects
> is debatable, I think. Sometimes they're good alternate names, but other
> times they represent misspellings, related concepts, etc.
>
> If DBpedia has the Wikipedia article number, I'd suggest creating the links
> based on those. If not, I'd suggest using the redirect file to canonicalize
> on a single "best" link.
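
To make the key format concrete, here is a minimal sketch of what goes wrong with the keys discussed in this thread. It is written in Python purely for illustration (our actual script is in Scala and is not reproduced here), and the function names are made up. It assumes that an MQL key escape has the form $XXXX with four uppercase hex digits naming a code point, as in the example keys quoted in this thread.

```python
import re
from urllib.parse import quote

# Assumed escape format: "$" followed by four uppercase hex digits.
_ESCAPE = re.compile(r"\$([0-9A-F]{4})")


def unescape_mql_key(key):
    """Replace each $XXXX escape with the code point U+XXXX."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def repair_utf8_mojibake(title):
    """Return the repaired title if its code points are really UTF-8 bytes.

    Returns None when the title contains code points above U+00FF, when the
    byte sequence is not valid UTF-8, or when nothing would change.
    """
    try:
        repaired = title.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
    return repaired if repaired != title else None


def to_resource_uri(title):
    """Percent-encode the UTF-8 bytes of a title (comma kept readable)."""
    return "http://dbpedia.org/resource/" + quote(title, safe=",")


key = "Marl$00C3$00ADn$002C_$00C3$0081vila"
raw = unescape_mql_key(key)
print(to_resource_uri(raw))                        # the double-encoded URI
print(to_resource_uri(repair_utf8_mojibake(raw)))  # the URI for the real title
```

Note that the repair step is only a heuristic: a legitimate title that happens to consist of Latin-1 characters forming a valid UTF-8 byte sequence would be "repaired" wrongly, so this is a way to inspect the bad keys, not something to apply blindly.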

Our script works like this:

- load all Wikipedia page titles in the main namespace into a set
  (i.e. no categories, templates, etc.)
- subtract from that set all titles that some other DBpedia code
  (which probably has 98-99% precision) recognized as redirect or
  disambiguation pages
- create links to Freebase only for titles that are in the result set,
  i.e. that are very likely content pages

Before we can look up a Freebase title in the result set, we generate
the equivalent DBpedia IRI for it. This transformation fails for the
Freebase keys we're discussing here, but I surmise this only happens
for titles that neither Freebase, DBpedia nor Wikipedia really use, so
it's not a real problem.

Cheers,
JC

>
> Tom
>
>
> On Mon, Mar 25, 2013 at 6:41 AM, Andrea Di Menna <ninn...@gmail.com> wrote:
>>
>> Hi all,
>>
>> it looks like there are actually some pages in Wikipedia which contain
>> wrong data, which is where the pages in Freebase originate from, e.g.
>>
>> http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
>>
>> This page was deleted on Jan 21, and this actually led to the
>> Freebase key
>>
>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>
>> since UTF-8 0xC3 0x83 -> Unicode U+00C3, etc.
>>
>> Cheers
>> Andrea
>>
>>
>> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>>>
>>> Hi,
>>>
>>> Maybe the only thing that can be done is to notify the Freebase
>>> discussion list about this problem.
>>> Agree with Jona that the number of problematic references is not
>>> relevant.
>>>
>>> Cheers
>>> Andrea
>>>
>>>
>>> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>>
>>>>
>>>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>>>> >
>>>> > Can someone point to the part of the discussion which talks about what
>>>> > the problem is? This thread seems to start in mid-stream...
>>>>
>>>> That's right. Sorry.
>>>> The start of the thread is in the middle of this page:
>>>>
>>>> https://github.com/dbpedia/extraction-framework/pull/25
>>>>
>>>> >
>>>> > Freebase's MQL key encoding
>>>> > (http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely private
>>>> > encoding which shouldn't have any effect on external
>>>> > URIs/IRIs/references/etc
>>>>
>>>> That's correct, and that's how the Scala script has always worked: it
>>>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>>>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>>>> Windows-1252 bytes instead of Unicode code points), and some others contain
>>>> whitespace like U+2003 that is invalid even in IRIs.
>>>>
>>>> I would guess though that it's not a big problem because the affected
>>>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they do
>>>> not represent valid, current, non-redirect Wikipedia page titles. That's
>>>> just a guess though, based on only a very cursory look at a few bad keys.
>>>>
>>>> I don't remember if these problems also came up when I ran the script on
>>>> the old Freebase dump format.
>>>>
>>>> JC
>>>>
>>>> >
>>>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt
>>>> > <j...@sahnwaldt.de> wrote:
>>>> >>
>>>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>> >> >
>>>> >> > Hi Jona,
>>>> >> >
>>>> >> > thanks for merging the pull request!
>>>> >> >
>>>> >> > Anyway, couldn't we use percent encoding for Unicode code points
>>>> >> > which are not allowed in N-Triples (namely those outside the
>>>> >> > [#x20,#x7E] range)?
>>>> >> > In this case we should get the UTF-8 bytes and percent-encode them.
>>>> >> >
>>>> >> > For example, as far as I can see
>>>> >> >
>>>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>>>> >> >
>>>> >> > is
>>>> >> >
>>>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>>> >> >
>>>> >> > where \00C3 is 0xC3 0x83
>>>> >> > \00AD is 0xC2 0xAD
>>>> >> > \0081 is 0xC2 0x81
>>>> >>
>>>> >> Oh, by the way, it would be
>>>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's
>>>> >> the UTF-8 percent-encoding for Marlín,_Ávila.
>>>> >>
>>>> >> The weird thing is that these Wikipedia page titles in Freebase
>>>> >> contain UTF-8-encoded characters when they should contain no encoding
>>>> >> at all, just plain Unicode code points. (Of course, the characters
>>>> >> and code points are also dollar-escaped as usual for Freebase, but
>>>> >> that's not a problem.)
>>>> >>
>>>> >>
>>>> >> JC
>>>> >>
>>>> >> >
>>>> >> > WDYT?
>>>> >> >
>>>> >> > Cheers
>>>> >> > Andrea
>>>> >> >
>>>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>>> >> >>
>>>> >> >> Ok, I got it. It has nothing to do with your platform. These are
>>>> >> >> actually wrong URIs. There's not much we can do about it. I don't
>>>> >> >> know where Freebase got them from, but I assume they may actually
>>>> >> >> be wrong in Wikipedia.
>>>> >> >>
>>>> >> >> Examples:
>>>> >> >>
>>>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that
>>>> >> >> the numbers should be plain Unicode code points, not UTF-8 bytes.
>>>> >> >> U+0081 is a C1 control character, so we generate an invalid URI.
>>>> >> >>
>>>> >> >> Bene$009A_decrees
>>>> >> >> 9A is the Windows-1252 encoding for š, but U+009A is a C1 control
>>>> >> >> character in Unicode.
>>>> >> >>
>>>> >> >> Switzerland$2003
>>>> >> >> 2003, 2029 etc.
>>>> >> >> are valid Unicode code points, but for whitespace
>>>> >> >> characters that are invalid in URIs.
>>>> >> >>
>>>> >> >> In a nutshell: all these characters are invalid in URIs, and it's
>>>> >> >> not our fault. I'll pull your changes in a moment.
>>>> >> >>
>>>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>>> >> >>
>>>> >> >
>>>> >>
>>>> >> ------------------------------------------------------------------------------
>>>> >> Everyone hates slow websites. So do we.
>>>> >> Make your web apps faster with AppDynamics
>>>> >> Download AppDynamics Lite for free today:
>>>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>>>> >> _______________________________________________
>>>> >> Dbpedia-discussion mailing list
>>>> >> Dbpedia-discussion@lists.sourceforge.net
>>>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
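
For reference, the filtering steps listed at the top of this message (all main-namespace titles, minus redirects and disambiguations, then a membership check per Freebase title) can be sketched roughly as follows. This is Python for illustration only, not the Scala script itself; the function names, the identifiers and the toy data are all invented for the example.

```python
def content_titles(main_namespace_titles, redirect_titles, disambiguation_titles):
    """Titles that are very likely content pages."""
    return (set(main_namespace_titles)
            - set(redirect_titles)
            - set(disambiguation_titles))


def links_for(freebase_pairs, allowed_titles):
    """Keep only (id, title) pairs whose title is in the allowed set."""
    return [(fid, title) for fid, title in freebase_pairs
            if title in allowed_titles]


# Toy data, made up for this example.
titles = {"Berlin", "Berlin_(disambiguation)", "Berlin,_Germany"}
redirects = {"Berlin,_Germany"}
disambiguations = {"Berlin_(disambiguation)"}

allowed = content_titles(titles, redirects, disambiguations)

pairs = [
    ("example-1", "Berlin"),
    ("example-2", "Berlin,_Germany"),                     # redirect: dropped
    ("example-3", "Marl\u00c3\u00adn,_\u00c3\u0081vila"),  # bad key: not a known title
]
print(links_for(pairs, allowed))  # [('example-1', 'Berlin')]
```

The point of the sketch is that a badly encoded key never needs special handling: its unescaped title is simply not in the set of known content pages, so no link is created for it.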