Re: [Dbpedia-discussion] Pagelinks dataset

Paul Houle Wed, 04 Dec 2013 08:52:35 -0800

I think I could get this data out of some API, but there are great
HTML5 parsing libraries now, so a link extractor from HTML can be
built as quickly as an API client.
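
A minimal sketch of such a link extractor, for illustration only. It
uses Python's standard-library html.parser (a lenient tag parser, not
a full HTML5 library) and assumes the book links appear as ordinary
<a href> attributes whose path starts with /wiki/Special:BookSources/.
The class name, function name and sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

# Path prefix of the book-source links mentioned in this thread.
BOOK_SOURCES = "/wiki/Special:BookSources/"


class BookSourceLinks(HTMLParser):
    """Collect the ISBN part of every link to Special:BookSources."""

    def __init__(self):
        super().__init__()
        self.isbns = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Works for both relative and absolute hrefs.
        path = unquote(urlparse(href).path)
        if path.startswith(BOOK_SOURCES):
            self.isbns.append(path[len(BOOK_SOURCES):])


def extract_isbns(html):
    """Return the ISBNs linked from one page of rendered HTML."""
    parser = BookSourceLinks()
    parser.feed(html)
    parser.close()
    return parser.isbns


# Hypothetical fragment of rendered page HTML.
sample = (
    '<p><a href="/wiki/Special:BookSources/978-0-936389-27-1">ISBN</a> '
    '<a href="/wiki/Some_Page">other link</a></p>'
)
print(extract_isbns(sample))
```

Pairing each returned ISBN with the title of the page it was found on
gives the "page A cites book x" relation discussed below.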


There are two big advantages of looking at links in HTML: (i) you can
use the same software to analyze multiple sites, and (ii) the HTML
output is often the most tested output of a system. The second point
matters particularly for Wikipedia markup, which has no formal
specification: the editors aren't concerned with whether the markup
is clean, but they will fix problems that cause the HTML to look
wrong.

Another advantage of HTML is that you can work from a static dump
file,  or run a web crawler against the real Wikipedia or against a
local copy of Wikipedia loaded from the database dump files.



On Tue, Dec 3, 2013 at 2:30 PM, Andrea Di Menna <ninn...@gmail.com> wrote:
> I guess Paul wanted to know which book is cited by one wikipedia page (e.g.
> page A cites book x).
> If I am not wrong by asking template transclusions you only get the first
> part of the triple (page A).
>
> Paul, your use case is interesting.
> At the moment we are not dealing with the {{cite}} template nor {{cite
> book}} etc.
> We are looking into extensions which could support similar use cases anyway.
>
> Also please note that at the moment the framework does not handle references
> either (i.e. what is inside <ref></ref>) when using the SimpleWikiParser [1]
> From a quick exploration I see this template is used mainly for references.
>
> What do you exactly mean when you talk about "Wikipedia HTML"? Do you refer
> to HTML dumps of the whole wikipedia?
>
> Cheers
> Andrea
>
> [1]
> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/simple/SimpleWikiParser.scala#L172
>
>
> 2013/12/3 Tom Morris <tfmor...@gmail.com>
>>
>> On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle <ontolo...@gmail.com> wrote:
>>>
>>> Something I found out recently is that the page links don't capture
>>> links that are generated by macros,  in particular almost all of the
>>> links to pages like
>>>
>>> http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1
>>>
>>> don't show up because they are generated by the {cite} macro.  These
>>> can be easily extracted from the Wikipedia HTML of course,
>>
>>
>> That's good to know, but couldn't you get this directly from the Wikimedia
>> API without resorting to HTML parsing by asking for template calls to
>> http://en.wikipedia.org/wiki/Template:Cite ?
>>
>> Tom
>
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontol...@gmail.com

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
