On 25 March 2013 14:18, Tom Morris <tfmor...@gmail.com> wrote:
> I wouldn't claim that Freebase is bug-free, but that's quite an old and
> simple algorithm, so unless they're triples from very early in its life
> (say, 2007), I'd guess that bad input data from Wikipedia is more likely
> than a problem with the transformation.
>
> It might help to give a little background on how Freebase deals with these
> links.  The canonical link uses the article number (in the namespace
> /wikipedia/en_id), but the alpha title (MQL key escaped) *and all redirects*
> are also stored (namespace /wikipedia/en).  Additionally, the same
> information has recently been added for a number of the other-language
> Wikipedias.
>
> You can see them all here for the example that Andrea mentioned:
>
>   https://www.freebase.com/m/09q3rp?keys
>
> Outbound links from Freebase to Wikipedia are made using the article number,
> so that's really the most important link.  The wisdom of including redirects
> is debatable, I think.  Sometimes they're good alternate names, but other
> times they represent misspellings, related concepts, etc.
>
> If DBpedia has the Wikipedia article number, I'd suggest creating the links
> based on those.  If not, I'd suggest using the redirect file to canonicalize
> on a single "best" link.

Our script works like this:
- load all wikipedia page titles in the main namespace into a set
(i.e. no categories, templates, etc.)
- subtract from that set all titles that some other DBpedia code
(which probably has 98-99% precision) recognized as redirect or
disambiguation pages
- create links to Freebase only for titles that are in the result set,
i.e. that are very likely content pages
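
A minimal Python sketch of the three steps above, for illustration only. The
actual script is written in Scala; the function names, the sample titles and
the Freebase ids below are hypothetical and not taken from it.

```python
def content_titles(main_titles, redirects, disambiguations):
    """Titles in the main namespace that are neither redirects nor disambiguation pages."""
    return set(main_titles) - set(redirects) - set(disambiguations)


def freebase_links(candidates, allowed_titles):
    """Keep only the Freebase entries whose Wikipedia title is a likely content page.

    candidates maps a Freebase id to the Wikipedia title derived from its key.
    """
    return {mid: title for mid, title in candidates.items() if title in allowed_titles}


# Hypothetical sample data, not taken from any dump.
main_titles = {"Berlin", "Berlin_(disambiguation)", "Berlin,_Germany", "Switzerland"}
redirects = {"Berlin,_Germany"}
disambiguations = {"Berlin_(disambiguation)"}

content = content_titles(main_titles, redirects, disambiguations)

candidates = {
    "/m/hypothetical1": "Berlin",           # content page: linked
    "/m/hypothetical2": "Berlin,_Germany",  # redirect: dropped
    "/m/hypothetical3": "No_such_page",     # unknown title: dropped
}
links = freebase_links(candidates, content)
print(links)
```

Because the lookup is a plain set membership test, any Freebase title that
does not match a known content page is silently dropped rather than linked.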

Before we can look up a Freebase title in the result set, we generate
the equivalent DBpedia IRI for it. This transformation fails for the
Freebase keys we're discussing here, but I surmise this only happens
for titles that neither Freebase, DBpedia nor Wikipedia really use, so
it's not a real problem.
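
To make the failure concrete, here is a rough Python sketch (again, not the
Scala code) that unescapes the keys quoted further down in this thread and
reports the characters that cause trouble. It assumes that an MQL escape is a
dollar sign followed by exactly four hex digits, which matches the examples in
the thread. The helper names and the choice of Unicode categories to flag are
assumptions made for this illustration.

```python
import re
import unicodedata
from urllib.parse import quote

# Assumption: an MQL escape is "$" followed by exactly four hex digits.
MQL_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def unescape_mql_key(key):
    """Replace each $XXXX escape with the code point U+XXXX."""
    return MQL_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


def problem_chars(title):
    """Return control characters (Cc) and Unicode separators (Zs, Zl, Zp) in title."""
    return [c for c in title if unicodedata.category(c) in ("Cc", "Zs", "Zl", "Zp")]


def percent_encoded_uri(title):
    """Hypothetical helper: UTF-8 percent-encode a title, leaving commas as they are."""
    return "http://dbpedia.org/resource/" + quote(title, safe=",")


for key in ("Marl$00C3$00ADn$002C_$00C3$0081vila",
            "Bene$009A_decrees",
            "Switzerland$2003"):
    title = unescape_mql_key(key)
    bad = ["U+%04X" % ord(c) for c in problem_chars(title)]
    print(key, "->", bad)
```

For the first key, the unescaped title still contains the UTF-8 bytes as
separate code points. If one assumes the bytes were misread as Latin-1, then
`title.encode("latin-1").decode("utf-8")` recovers Marlín,_Ávila, but that
repair would not work for the Windows-1252 and whitespace cases.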

Cheers,
JC


>
> Tom
>
>
>
> On Mon, Mar 25, 2013 at 6:41 AM, Andrea Di Menna <ninn...@gmail.com> wrote:
>>
>> Hi all,
>>
>> it looks like there are actually some pages in Wikipedia which contain
>> wrong data, which is where the pages originate from in Freebase, e.g.
>>
>> http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
>>
>> This page was deleted on Jan 21, and this actually led to the
>> Freebase key
>>
>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>
>> since UTF-8 0xC3 0x83 -> Unicode U+00C3, etc.
>>
>> Cheers
>> Andrea
>>
>>
>> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>>>
>>> Hi,
>>>
>>> Maybe the only thing that can be done is to notify the freebase
>>> discussion list about this problem.
>>> Agree with Jona that the number of problematic references is not
>>> relevant.
>>>
>>> Cheers
>>> Andrea
>>>
>>>
>>> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>>
>>>>
>>>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>>>> >
>>>> > Can someone point to the part of the discussion which talks about what
>>>> > the problem is?  This thread seems to start in mid-stream...
>>>>
>>>> That's right. Sorry. The start of the thread is in the middle of this
>>>> page:
>>>>
>>>> https://github.com/dbpedia/extraction-framework/pull/25
>>>>
>>>> >
>>>> > Freebase's MQL key encoding
>>>> > (http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely private
>>>> > encoding which shouldn't have any effect on external
>>>> > URIs/IRIs/references/etc
>>>>
>>>> That's correct, and that's how the Scala script has always worked: it
>>>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>>>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>>>> Windows-1252 bytes instead of Unicode code points), and some others contain
>>>> whitespace like U+2003 that is invalid even in IRIs.
>>>>
>>>> I would guess though that it's not a big problem because the affected
>>>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they do
>>>> not represent valid, current, non-redirect Wikipedia page titles. That's
>>>> just a guess though, based on only a very cursory look at a few bad keys.
>>>>
>>>> I don't remember if these problems also came up when I ran the script on
>>>> the old freebase dump format.
>>>>
>>>> JC
>>>>
>>>> >
>>>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt
>>>> > <j...@sahnwaldt.de> wrote:
>>>> >>
>>>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>> >> >
>>>> >> > Hi Jona,
>>>> >> >
>>>> >> > thanks for merging the pull request!
>>>> >> >
>>>> >> > Anyway, couldn't we use percent-encoding for Unicode code points
>>>> >> > which are not allowed in N-Triples (namely those outside the
>>>> >> > [#x20,#x7E] range)?
>>>> >> > In this case we should get the UTF-8 bytes and percent-encode them.
>>>> >> >
>>>> >> > For example, as far as I can see
>>>> >> >
>>>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>>>> >> >
>>>> >> > is
>>>> >> >
>>>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>>> >> >
>>>> >> > where \00C3 is 0xC3 0x83
>>>> >> >          \00AD is 0xC2 0xAD
>>>> >> >          \0081 is 0xC2 0x81
>>>> >>
>>>> >> Oh, by the way, it would be
>>>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's the
>>>> >> UTF-8 percent-encoding for Marlín,_Ávila.
>>>> >>
>>>> >> The weird thing is that these Wikipedia page titles in Freebase
>>>> >> contain UTF-8-encoded characters when they should contain no encoding
>>>> >> at all, just plain Unicode code points. (Of course, the characters and
>>>> >> code points are also dollar-escaped as usual for Freebase, but that's
>>>> >> not a problem.)
>>>> >>
>>>> >>
>>>> >> JC
>>>> >>
>>>> >> >
>>>> >> > WDYT?
>>>> >> >
>>>> >> > Cheers
>>>> >> > Andrea
>>>> >> >
>>>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>>> >> >>
>>>> >> >> Ok, I got it. It has nothing to do with your platform. These are
>>>> >> >> actually wrong URIs. There's not much we can do about it. I don't
>>>> >> >> know where Freebase got them from, but I assume they may actually
>>>> >> >> be wrong in Wikipedia.
>>>> >> >>
>>>> >> >> Examples:
>>>> >> >>
>>>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that the
>>>> >> >> numbers should be plain Unicode code points, not UTF-8 bytes. U+0081
>>>> >> >> is a control character, so we generate an invalid URI.
>>>> >> >>
>>>> >> >> Bene$009A_decrees
>>>> >> >> 9A is the Windows-1252 encoding for š, but U+009A is a control
>>>> >> >> character in Unicode.
>>>> >> >>
>>>> >> >> Switzerland$2003
>>>> >> >> 2003, 2029 etc. are valid Unicode code points, but they are
>>>> >> >> whitespace characters that are invalid in URIs.
>>>> >> >> In a nutshell: all these characters are invalid in URIs, and it's
>>>> >> >> not our fault. I'll pull your changes in a moment.
>>>> >> >>
>>>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >>
>>>> >>
>>>> >> ------------------------------------------------------------------------------
>>>> >> Everyone hates slow websites. So do we.
>>>> >> Make your web apps faster with AppDynamics
>>>> >> Download AppDynamics Lite for free today:
>>>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>>>> >> _______________________________________________
>>>> >> Dbpedia-discussion mailing list
>>>> >> Dbpedia-discussion@lists.sourceforge.net
>>>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
