Sorry, that was wrong information.

We should use Page Ids
(http://downloads.dbpedia.org/3.8/en/page_ids_en.nt.bz2).

I am going to try something.
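
A minimal sketch of the idea, in Python rather than the Scala of the extraction framework. It assumes that each line of page_ids_en.nt has the shape `<resource IRI> <.../wikiPageID> "number"^^<...> .` (not checked against the 3.8 dump) and that the Freebase side provides (mid, article number) pairs taken from /wikipedia/en_id. The function names and the sample data are made up for illustration.

```python
import re

# Assumed line shape of page_ids_en.nt (an assumption, not verified):
# <http://dbpedia.org/resource/X> <http://dbpedia.org/ontology/wikiPageID> "123"^^<...#integer> .
PAGE_ID_LINE = re.compile(r'^<([^>]+)>\s+<[^>]*wikiPageID>\s+"(\d+)"')

def load_page_ids(lines):
    """Map Wikipedia article number -> DBpedia resource IRI."""
    ids = {}
    for line in lines:
        m = PAGE_ID_LINE.match(line)
        if m:
            ids[int(m.group(2))] = m.group(1)
    return ids

def link_by_page_id(freebase_ids, page_ids):
    """Yield (mid, DBpedia IRI) for every article number that DBpedia knows."""
    for mid, article_id in freebase_ids:
        iri = page_ids.get(article_id)
        if iri is not None:
            yield mid, iri

# Hypothetical sample data, only to show the join.
sample = [
    '<http://dbpedia.org/resource/Example> '
    '<http://dbpedia.org/ontology/wikiPageID> '
    '"42"^^<http://www.w3.org/2001/XMLSchema#integer> .',
]
page_ids = load_page_ids(sample)
links = list(link_by_page_id([("m.hypothetical", 42), ("m.unknown", 7)], page_ids))
print(links)
```

For the real dump the lines could be streamed with `bz2.open(path, "rt", encoding="utf-8")` instead of the in-memory sample.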

Cheers
Andrea

2013/3/25 Andrea Di Menna <ninn...@gmail.com>

> Hi,
>
> we have article numeric ids in the quads file (as oldid parameter).
> Jona, do you think this is worth giving a try?
>
> Regards
> Andrea
>
>
> 2013/3/25 Tom Morris <tfmor...@gmail.com>
>
>> Another approach might be to use the recently introduced Topic Equivalent
>> Webpage property:
>>
>> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage        <http://pt.wikipedia.org/wiki/Marlín>.
>> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage        <http://es.wikipedia.org/wiki/Marlín_(Ávila)>.
>> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage        <http://en.wikipedia.org/wiki/Marlín>.
>> ns:m.09q3rp     ns:common.topic.topic_equivalent_webpage        <http://it.wikipedia.org/wiki/Marlín>.
>>
>> It appears to be a single canonical alpha link for each language
>> Wikipedia with the MQL escaping undone and the redirects resolved.
>>
>> Tom
>>
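A rough sketch of how the topic_equivalent_webpage triples quoted above could be read, assuming one triple per line with the URL on the same line as the predicate (the line breaks in the mail are only wrapping). The regular expression and the name `wikipedia_links` are illustrative assumptions, not part of any Freebase or DBpedia tool.

```python
import re

# Matches only Wikipedia targets; any other equivalent webpage is skipped.
LINE = re.compile(
    r"^ns:(?P<mid>m\.[0-9a-z_]+)\s+"
    r"ns:common\.topic\.topic_equivalent_webpage\s+"
    r"<http://(?P<lang>[a-z\-]+)\.wikipedia\.org/wiki/(?P<title>[^>]+)>\s*\.$"
)

def wikipedia_links(lines):
    """Yield (mid, language, title) for each Wikipedia topic_equivalent_webpage triple."""
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            yield m.group("mid"), m.group("lang"), m.group("title")

sample = [
    "ns:m.09q3rp\tns:common.topic.topic_equivalent_webpage\t<http://es.wikipedia.org/wiki/Marlín_(Ávila)>.",
    "ns:m.09q3rp\tns:common.topic.topic_equivalent_webpage\t<http://en.wikipedia.org/wiki/Marlín>.",
    # Hypothetical non-Wikipedia line, to show that it is ignored.
    "ns:m.09q3rp\tns:common.topic.topic_equivalent_webpage\t<http://example.org/other>.",
]
links = list(wikipedia_links(sample))
print(links)
```

Since the titles are already unescaped and the redirects resolved, the title part could be used directly when building the DBpedia IRI.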
>> On Mon, Mar 25, 2013 at 9:18 AM, Tom Morris <tfmor...@gmail.com> wrote:
>>
>>> I wouldn't claim that Freebase is bug-free, but that's quite an old and
>>> simple algorithm, so unless they're triples from very early in its life
>>> (say, 2007), I'd guess that bad input data from Wikipedia is more likely
>>> than a problem with the transformation.
>>>
>>> It might help to give a little background on how Freebase deals with
>>> these links.  The canonical link uses the article number (in the namespace
>>> /wikipedia/en_id), but the alpha title (MQL key escaped) *and all
>>> redirects* are also stored (namespace /wikipedia/en).  Additionally, the
>>> same information has recently been added for a number of the other
>>> language Wikipedias.
>>>
>>> You can see them all here for the example that Andrea mentioned:
>>>
>>>   https://www.freebase.com/m/09q3rp?keys
>>>
>>> Outbound links from Freebase to Wikipedia are made using the article
>>> number, so that's really the most important link.  The wisdom of including
>>> redirects is debatable, I think.  Sometimes they're good alternate names,
>>> but other times they represent misspellings, related concepts, etc.
>>>
>>> If DBpedia has the Wikipedia article number, I'd suggest creating the
>>> links based on those.  If not, I'd suggest using the redirect file to
>>> canonicalize on a single "best" link.
>>>
>>> Tom
>>>
>>>
>>>
>>> On Mon, Mar 25, 2013 at 6:41 AM, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> it looks like there are actually some pages in Wikipedia which contain
>>>> wrong data, which is where the pages originate from in Freebase, e.g.
>>>>
>>>> http://en.wikipedia.org/wiki/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
>>>>
>>>> This page was deleted on Jan 21, and it is what actually led to the
>>>> Freebase key
>>>>
>>>> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>
>>>> since UTF-8 0xC3 0x83 -> Unicode U+00C3, etc.
>>>>
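To make the byte arithmetic concrete, here is a small sketch that reproduces the bad key from the correct title. It assumes the damage comes from UTF-8 bytes being read as if each byte were one code point (simulated here with a Latin-1 decode), and `mql_escape` is a simplified stand-in for the real MQL key escaping, which has more rules.

```python
def mql_escape(title):
    """Escape everything except ASCII letters, digits and '_' as $XXXX (simplified)."""
    return "".join(
        ch if (ch.isascii() and (ch.isalnum() or ch == "_")) else "$%04X" % ord(ch)
        for ch in title
    )

title = "Marlín,_Ávila"

# Simulated bug: each UTF-8 byte is treated as a separate code point.
mojibake = title.encode("utf-8").decode("latin-1")

good_key = mql_escape(title)
bad_key = mql_escape(mojibake)
print(good_key)  # Marl$00EDn$002C_$00C1vila
print(bad_key)   # Marl$00C3$00ADn$002C_$00C3$0081vila
```

The second key is exactly the one quoted above, which supports the guess that the title was already double-encoded on the Wikipedia side.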
>>>> Cheers
>>>> Andrea
>>>>
>>>>
>>>> 2013/3/25 Andrea Di Menna <ninn...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> Maybe the only thing that can be done is to notify the freebase
>>>>> discussion list about this problem.
>>>>> Agree with Jona that the number of problematic references is not
>>>>> relevant.
>>>>>
>>>>> Cheers
>>>>> Andrea
>>>>>
>>>>>
>>>>> 2013/3/25 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>>>
>>>>>>
>>>>>> On Mar 25, 2013 3:32 AM, "Tom Morris" <tfmor...@gmail.com> wrote:
>>>>>> >
>>>>>> > Can someone point to the part of the discussion which talks about
>>>>>> what the problem is?  This thread seems to start in mid-stream...
>>>>>>
>>>>>> That's right. Sorry. The start of the thread is in the middle of this
>>>>>> page:
>>>>>>
>>>>>> https://github.com/dbpedia/extraction-framework/pull/25
>>>>>>
>>>>>> >
>>>>>> > Freebase's MQL key encoding (
>>>>>> http://wiki.freebase.com/wiki/MQL_key_escaping) is a completely
>>>>>> private encoding which shouldn't have any effect on external
>>>>>> URIs/IRIs/references/etc
>>>>>>
>>>>>> That's correct, and that's how the Scala script has always worked: it
>>>>>> unescapes the MQL keys and uses the result to form DBpedia IRIs. The
>>>>>> problems arise because some MQL keys contain invalid escapes (UTF-8 and
>>>>>> Windows-1252 bytes instead of Unicode code points), and some others
>>>>>> contain whitespace like U+2003 that is invalid even in IRIs.
>>>>>>
>>>>>> I would guess though that it's not a big problem because the affected
>>>>>> keys are 1. not many, i.e. <1% and 2. not relevant anyway because they do
>>>>>> not represent valid, current, non-redirect Wikipedia page titles. That's
>>>>>> just a guess though, based on only a very cursory look at a few bad keys.
>>>>>>
>>>>>> I don't remember if these problems also came up when I ran the script
>>>>>> on the old freebase dump format.
>>>>>>
>>>>>> JC
>>>>>>
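A minimal sketch of the unescaping step and of a check for the problem cases described above. This is Python, not the actual Scala script, and it assumes escapes are always `$` followed by four uppercase hex digits. The function names and the choice of what counts as suspicious (control characters and Unicode whitespace) are illustrative assumptions.

```python
import re

ESCAPE = re.compile(r"\$([0-9A-F]{4})")

def unescape_mql_key(key):
    """Replace each $XXXX escape with the code point U+XXXX."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)

def suspicious_chars(title):
    """Return the characters that make a title unusable in an IRI."""
    bad = []
    for ch in title:
        cp = ord(ch)
        if cp < 0x20 or 0x7F <= cp <= 0x9F:
            bad.append(ch)  # C0/C1 controls and DEL, e.g. U+0081, U+009A
        elif ch.isspace():
            bad.append(ch)  # Unicode whitespace, e.g. U+2003, U+2029
    return bad

keys = [
    "Marl$00EDn$002C_$00C1vila",
    "Marl$00C3$00ADn$002C_$00C3$0081vila",
    "Bene$009A_decrees",
    "Switzerland$2003",
]
report = {key: suspicious_chars(unescape_mql_key(key)) for key in keys}
for key, bad in report.items():
    print(key, [hex(ord(ch)) for ch in bad])
```

Keys with a non-empty list could be skipped or logged instead of being turned into IRIs.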
>>>>>> >
>>>>>> > On Sun, Mar 24, 2013 at 9:44 PM, Jona Christopher Sahnwaldt <
>>>>>> j...@sahnwaldt.de> wrote:
>>>>>> >>
>>>>>> >> On 22 March 2013 23:21, Andrea Di Menna <ninn...@gmail.com> wrote:
>>>>>> >> >
>>>>>> >> > Hi Jona,
>>>>>> >> >
>>>>>> >> > thanks for merging the pull request!
>>>>>> >> >
>>>>>> >> > Anyway, couldn't we use percent encoding for Unicode code points
>>>>>> >> > which are not allowed in N-Triples (namely those outside the
>>>>>> >> > [#x20,#x7E] range)? In this case we should get UTF-8 bytes and
>>>>>> >> > percent encode them.
>>>>>> >> >
>>>>>> >> > For example, as far as I can see
>>>>>> >> >
>>>>>> >> > Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>>> >> >
>>>>>> >> > is
>>>>>> >> >
>>>>>> >> > <http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila>
>>>>>> >> >
>>>>>> >> > where \00C3 is 0xC3 0x83
>>>>>> >> >          \00AD is 0xC2 0xAD
>>>>>> >> >          \0081 is 0xC2 0x81
>>>>>> >>
>>>>>> >> Oh, by the way, it would be
>>>>>> >> http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila because that's
>>>>>> >> the UTF-8-percent-encoding for Marlín,_Ávila.
>>>>>> >>
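The two IRIs in this exchange can be compared with a short sketch using `urllib.parse.quote`, which encodes to UTF-8 before percent-encoding. The helper name `dbpedia_iri` and the choice to leave the comma unescaped are assumptions for illustration, and the extraction framework's own escaping rules may differ.

```python
from urllib.parse import quote

BASE = "http://dbpedia.org/resource/"

def dbpedia_iri(title):
    """Percent-encode a Wikipedia title as UTF-8, keeping ',' readable."""
    return BASE + quote(title, safe=",")

title = "Marlín,_Ávila"

# The title as it appears after unescaping the bad Freebase key:
# UTF-8 bytes that were stored as individual code points.
double_encoded = title.encode("utf-8").decode("latin-1")

correct = dbpedia_iri(title)
wrong = dbpedia_iri(double_encoded)
print(correct)  # http://dbpedia.org/resource/Marl%C3%ADn,_%C3%81vila
print(wrong)    # http://dbpedia.org/resource/Marl%C3%83%C2%ADn,_%C3%83%C2%81vila
```

Percent-encoding therefore yields a syntactically valid IRI in both cases, but only the first one names the real page.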
>>>>>> >> The weird thing is that these Wikipedia page titles in Freebase
>>>>>> >> contain UTF-8-encoded characters when they should contain no
>>>>>> >> encoding at all, just plain Unicode code points. (Of course, the
>>>>>> >> characters and code points are also dollar-escaped as usual for
>>>>>> >> Freebase, but that's not a problem.)
>>>>>> >>
>>>>>> >>
>>>>>> >> JC
>>>>>> >>
>>>>>> >> >
>>>>>> >> > WDYT?
>>>>>> >> >
>>>>>> >> > Cheers
>>>>>> >> > Andrea
>>>>>> >> >
>>>>>> >> > 2013/3/22 Christopher Sahnwaldt <notificati...@github.com>
>>>>>> >> >>
>>>>>> >> >> Ok, I got it. It has nothing to do with your platform. These are
>>>>>> >> >> actually wrong URIs. There's not much we can do about it. I don't
>>>>>> >> >> know where Freebase got them from, but I assume they may actually
>>>>>> >> >> be wrong in Wikipedia.
>>>>>> >> >>
>>>>>> >> >> Examples:
>>>>>> >> >>
>>>>>> >> >> Marl$00C3$00ADn$002C_$00C3$0081vila
>>>>>> >> >> C3 AD and C3 81 are UTF-8 encodings, but Freebase says [1] that the
>>>>>> >> >> numbers should be plain Unicode code points, not UTF-8 bytes. U+0081
>>>>>> >> >> is a control character that is not allowed in URIs, so we generate
>>>>>> >> >> an invalid URI.
>>>>>> >> >>
>>>>>> >> >> Bene$009A_decrees
>>>>>> >> >> 9A is the Windows-1252 encoding for š, but U+009A is a control
>>>>>> >> >> character in Unicode.
>>>>>> >> >>
>>>>>> >> >> Switzerland$2003
>>>>>> >> >> 2003, 2029 etc. are valid Unicode code points, but for whitespace
>>>>>> >> >> characters that are invalid in URIs.
>>>>>> >> >>
>>>>>> >> >> In a nutshell: all these characters are invalid in URIs, and it's
>>>>>> >> >> not our fault. I'll pull your changes in a moment.
>>>>>> >> >>
>>>>>> >> >> [1] http://wiki.freebase.com/wiki/MQL_key_escaping
>>>>>> >> >>
>>>>>> >> >
>>>>>> >> >
>>>>>> >>
>>>>>> >>
>>>>>> ------------------------------------------------------------------------------
>>>>>> >> Everyone hates slow websites. So do we.
>>>>>> >> Make your web apps faster with AppDynamics
>>>>>> >> Download AppDynamics Lite for free today:
>>>>>> >> http://p.sf.net/sfu/appdyn_d2d_mar
>>>>>> >> _______________________________________________
>>>>>> >> Dbpedia-discussion mailing list
>>>>>> >> Dbpedia-discussion@lists.sourceforge.net
>>>>>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>