Hi Andrea,

Wikipedia page ids (URL parameter curid) are more stable than page
titles, and according to Tom, Freebase uses them as the main links to
Wikipedia, but DBpedia still uses the current page title as the
canonical resource IRI, so the DBpedia-to-Freebase linkset has to use
the page title. I assume Freebase also uses the current page title in
triples like

ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
<http://en.wikipedia.org/wiki/Marlín>.

so I think we should simply use these lines. Of course, this will fail
for the few Wikipedia page titles that changed between the time
Freebase generated its links and the time DBpedia extracted its data,
but that's no big deal. We have bigger fish to fry. :-)
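
A rough sketch of what "simply use these lines" could look like, in
Python rather than our Scala script. Assumptions on my side: each triple
sits on one whitespace-separated line in the dump, the ns: prefix expands
to http://rdf.freebase.com/ns/, and the linkset uses owl:sameAs. The
function name same_as_link is made up for illustration.

```python
import re

# Assumption: one triple per line, whitespace-separated, shaped like the
# topic_equivalent_webpage example quoted above.
LINE = re.compile(
    r"^ns:(m\.\S+)\s+ns:common\.topic\.topic_equivalent_webpage\s+"
    r"<http://en\.wikipedia\.org/wiki/([^>]+)>\s*\.\s*$"
)

FREEBASE_NS = "http://rdf.freebase.com/ns/"  # assumed expansion of the ns: prefix
DBPEDIA_NS = "http://dbpedia.org/resource/"
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # assumed link predicate


def same_as_link(line):
    """Return a DBpedia-to-Freebase link triple, or None for any other line.

    Only English Wikipedia pages are matched; the page title is copied
    unchanged into the DBpedia IRI.
    """
    m = LINE.match(line)
    if m is None:
        return None
    mid, title = m.groups()
    return f"<{DBPEDIA_NS}{title}> <{SAME_AS}> <{FREEBASE_NS}{mid}> ."
```

Lines for other language editions (pt, es, it) fall through and return
None, so only the en.wikipedia.org line of each topic yields a link.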

To make our Freebase script use the article id, you'd have to load
page_ids_en.nt.bz2, build a map from ids to titles, look for ids in
the Freebase dumps, map them to titles... doable, but a lot of work.
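
For what it's worth, the id-based route might look roughly like this in
Python. It is only a sketch: I'm assuming page_ids_en.nt.bz2 contains
lines of the form
<http://dbpedia.org/resource/TITLE> <http://dbpedia.org/ontology/wikiPageID> "ID"^^<...> .
and I leave out how the (mid, page id) pairs are pulled from the Freebase
dump, since that depends on the dump format. All function names are
hypothetical.

```python
import bz2
import re

# Assumed line shape in page_ids_en.nt.bz2 (not confirmed in this thread).
PAGE_ID = re.compile(
    r'^<http://dbpedia\.org/resource/([^>]+)>\s+'
    r'<http://dbpedia\.org/ontology/wikiPageID>\s+"(\d+)"'
)


def read_id_to_title(lines):
    """Build a map from Wikipedia page id to DBpedia resource title."""
    id_to_title = {}
    for line in lines:
        m = PAGE_ID.match(line)
        if m:
            title, page_id = m.groups()
            id_to_title[int(page_id)] = title
    return id_to_title


def load_id_to_title(path):
    """Read the bzip2-compressed N-Triples file line by line."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        return read_id_to_title(f)


def titles_for(freebase_ids, id_to_title):
    """Map (mid, page id) pairs to (mid, title), skipping unknown ids."""
    return [
        (mid, id_to_title[page_id])
        for mid, page_id in freebase_ids
        if page_id in id_to_title
    ]
```

The whole map is held in memory, which is the "a lot of work" part for a
full English dump; ids missing from the DBpedia file are silently dropped.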

Cheers,
JC


On 25 March 2013 15:42, Andrea Di Menna <ninn...@gmail.com> wrote:
> Sorry,
>
> wrong information.
> We should use Page Ids
> (http://downloads.dbpedia.org/3.8/en/page_ids_en.nt.bz2)
>
> I am going to try something.
>
>
> Cheers
> Andrea
>
> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>>
>> Hi,
>>
>> we have article numeric ids in the quads file (as oldid parameter).
>> Jona, do you think this is worth giving a try?
>>
>> Regards
>> Andrea
>>
>>
>> 2013/3/25 Tom Morris <tfmor...@gmail.com>
>>>
>>> Another approach might be to use the recently introduced Topic Equivalent
>>> Webpage property:
>>>
>>> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
>>> <http://pt.wikipedia.org/wiki/Marlín>.
>>> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
>>> <http://es.wikipedia.org/wiki/Marlín_(Ávila)>.
>>> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
>>> <http://en.wikipedia.org/wiki/Marlín>.
>>> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage
>>> <http://it.wikipedia.org/wiki/Marlín>.
>>>
>>> It appears to be a single canonical alpha link for each language
>>> Wikipedia with the MQL escaping undone and the redirects resolved.
>>>
>>> Tom
>>>
>>> On Mon, Mar 25, 2013 at 9:18 AM, Tom Morris <tfmor...@gmail.com> wrote:
>>>>
>>>> I wouldn't claim that Freebase is bug-free, but that's quite an old and
>>>> simple algorithm, so unless they're triples from very early in its life
>>>> (say, 2007), I'd guess that bad input data from Wikipedia is more likely
>>>> than a problem with the transformation.
>>>>
>>>> It might help to give a little background on how Freebase deals with
>>>> these links.  The canonical link uses the article number (in the namespace
>>>> /wikipedia/en_id), but the alpha title (MQL key escaped) *and all
>>>> redirects* are also stored (namespace /wikipedia/en).  Additionally, the
>>>> same information has recently been added for a number of the other
>>>> language Wikipedias.
>>>>
>>>> You can see them all here for the example that Andrea mentioned:
>>>>
>>>>   https://www.freebase.com/m/09q3rp?keys
>>>>
>>>> Outbound links from Freebase to Wikipedia are made using the article
>>>> number, so that's really the most important link.  The wisdom of including
>>>> redirects is debatable, I think.  Sometimes they're good alternate names,
>>>> but other times they represent misspellings, related concepts, etc.
>>>>
>>>> If DBpedia has the Wikipedia article number, I'd suggest creating the
>>>> links based on those.  If not, I'd suggest using the redirect file to
>>>> canonicalize on a single "best" link.
>>>>
>>>> Tom
>>>>
>>>>
>>>>
>>>> On Mon, Mar 25, 2013 at 6:41 AM, Andrea Di Menna <ninn...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> it looks like there are actually some pages in Wikipedia which contain
>>>>> wrong data, which is where the pages originate from in Freebase, e.g.
>>>>>
>>>>> http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
>>>>>
>>>>> This page was deleted on Jan 21, and it actually led to the
>>>>> Freebase key
>>>>>
>>>>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>>
>>>>> since UTF-8 0xC3 0x83 -> Unicode U+00C3, etc.
>>>>>
>>>>> Cheers
>>>>> Andrea
>>>>>
>>>>>
>>>>> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Maybe the only thing that can be done is to notify the freebase
>>>>>> discussion list about this problem.
>>>>>> Agree with Jona that the number of problematic references is not
>>>>>> relevant.
>>>>>>
>>>>>> Cheers
>>>>>> Andrea
>>>>>>
>>>>>>
>>>>>> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>>>>>
>>>>>>>
>>>>>>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Can someone point to the part of the discussion which talks about
>>>>>>> > what the problem is?  This thread seems to start in mid-stream...
>>>>>>>
>>>>>>> That's right. Sorry. The start of the thread is in the middle of this
>>>>>>> page:
>>>>>>>
>>>>>>> https://github.com/dbpedia/extraction-framework/pull/25
>>>>>>>
>>>>>>> >
>>>>>>> > Freebase's MQL key encoding
>>>>>>> > (http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely
>>>>>>> > private encoding which shouldn't have any effect on external
>>>>>>> > URIs/IRIs/references/etc.
>>>>>>>
>>>>>>> That's correct, and that's how the Scala script has always worked: it
>>>>>>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>>>>>>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>>>>>>> Windows-1252 bytes instead of Unicode code points), and some others
>>>>>>> contain whitespace like U+2003 that is invalid even in IRIs.
>>>>>>>
>>>>>>> I would guess though that it's not a big problem because the affected
>>>>>>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they
>>>>>>> do not represent valid, current, non-redirect Wikipedia page titles.
>>>>>>> That's just a guess though, based on only a very cursory look at a few
>>>>>>> bad keys.
>>>>>>>
>>>>>>> I don't remember if these problems also came up when I ran the script
>>>>>>> on the old freebase dump format.
>>>>>>>
>>>>>>> JC
>>>>>>>
>>>>>>> >
>>>>>>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt
>>>>>>> > <j...@sahnwaldt.de> wrote:
>>>>>>> >>
>>>>>>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>>>>> >> >
>>>>>>> >> > Hi Jona,
>>>>>>> >> >
>>>>>>> >> > thanks for merging the pull request!
>>>>>>> >> >
>>>>>>> >> > Anyway, couldn't we use percent encoding for Unicode code points
>>>>>>> >> > which are not allowed in N-Triples (namely those outside the
>>>>>>> >> > [#x20,#x7E] range)? In this case we should get the UTF-8 bytes
>>>>>>> >> > and percent-encode them.
>>>>>>> >> >
>>>>>>> >> > For example, as far as I can see
>>>>>>> >> >
>>>>>>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>>>> >> >
>>>>>>> >> > is
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>>>>>> >> >
>>>>>>> >> > where \00C3 is 0xC3 0x83
>>>>>>> >> >          \00AD is 0xC2 0xAD
>>>>>>> >> >          \0081 is 0xC2 0x81
>>>>>>> >>
>>>>>>> >> Oh, by the way, it would be
>>>>>>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's
>>>>>>> >> the UTF-8 percent-encoding for Marlín,_Ávila.
>>>>>>> >>
>>>>>>> >> The weird thing is that these Wikipedia page titles in Freebase
>>>>>>> >> contain UTF-8-encoded characters when they should contain no
>>>>>>> >> encoding at all, just plain Unicode code points. (Of course, the
>>>>>>> >> characters and code points are also dollar-escaped as usual for
>>>>>>> >> Freebase, but that's not a problem.)
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> JC
>>>>>>> >>
>>>>>>> >> >
>>>>>>> >> > WDYT?
>>>>>>> >> >
>>>>>>> >> > Cheers
>>>>>>> >> > Andrea
>>>>>>> >> >
>>>>>>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>>>>>> >> >>
>>>>>>> >> >> Ok, I got it. It has nothing to do with your platform. These
>>>>>>> >> >> are actually wrong URIs. There's not much we can do about it. I
>>>>>>> >> >> don't know where Freebase got them from, but I assume they may
>>>>>>> >> >> actually be wrong in Wikipedia.
>>>>>>> >> >>
>>>>>>> >> >> Examples:
>>>>>>> >> >>
>>>>>>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>>>> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that
>>>>>>> >> >> the numbers should be plain Unicode code points, not UTF-8 bytes.
>>>>>>> >> >> U+0081 is a control character that is not allowed in URIs, so we
>>>>>>> >> >> generate an invalid URI.
>>>>>>> >> >>
>>>>>>> >> >> Bene$009A_decrees
>>>>>>> >> >> 9A is the Windows-1252 encoding for š, but U+009A is a control
>>>>>>> >> >> character in Unicode.
>>>>>>> >> >>
>>>>>>> >> >> Switzerland$2003
>>>>>>> >> >> 2003, 2029 etc. are valid Unicode code points, but for
>>>>>>> >> >> whitespace characters that are invalid in URIs.
>>>>>>> >> >>
>>>>>>> >> >> In a nutshell: all these characters are invalid in URIs, and
>>>>>>> >> >> it's not our fault. I'll pull your changes in a moment.
>>>>>>> >> >>
>>>>>>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>>>>>> >> >>
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> ------------------------------------------------------------------------------
>>>>>>> >> Everyone hates slow websites. So do we.
>>>>>>> >> Make your web apps faster with AppDynamics
>>>>>>> >> Download AppDynamics Lite for free today:
>>>>>>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>>>>>>> >> _______________________________________________
>>>>>>> >> Dbpedia-discussion mailing list
>>>>>>> >> Dbpedia-discussion@lists.sourceforge.net
>>>>>>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
