Re: [Dbpedia-discussion] Pagelinks dataset

Andrea Di Menna Thu, 05 Dec 2013 01:25:06 -0800

2013/12/4 Paul Houle <ontolo...@gmail.com>

> I think I could get this data out of some API,  but there are great
> HTML 5 parsing libraries now,  so a link extractor from HTML can be
> built as quickly as an API client.
>
> There are two big advantages of looking at links in HTML:  (i) you can
> use the same software to analyze multiple sites,  and (ii) the HTML
> output is often the most tested output of a system.  This matters
> particularly for Wikipedia markup,  which has no formal specification:
> the editors aren't concerned with whether the markup is clean,  but
> they will fix problems that cause the HTML to look wrong.
>
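
For concreteness, a link extractor of the kind described above could look
roughly like the sketch below. It is only an illustration: it uses Python's
standard html.parser instead of a full HTML5 library, and the names
BookSourceLinks and extract_isbns, as well as the sample markup, are made up
here and do not come from any existing tool.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

# Path fragment that marks a book-source link (taken from the URL quoted in this thread).
BOOK_SOURCES = "Special:BookSources/"


class BookSourceLinks(HTMLParser):
    """Collect ISBNs from <a href=".../Special:BookSources/<isbn>"> links."""

    def __init__(self):
        super().__init__()
        self.isbns = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href") or ""
        if BOOK_SOURCES in href:
            isbn = href.split(BOOK_SOURCES, 1)[1]
            # Drop any query string or fragment after the ISBN.
            isbn = isbn.split("?", 1)[0].split("#", 1)[0]
            self.isbns.append(unquote(isbn))


def extract_isbns(html):
    """Return the ISBNs referenced by book-source links in an HTML string."""
    parser = BookSourceLinks()
    parser.feed(html)
    parser.close()
    return parser.isbns


# Hypothetical fragment of rendered page HTML.
sample = (
    '<p>See <a href="/wiki/Special:BookSources/978-0-936389-27-1">'
    'ISBN 978-0-936389-27-1</a> and <a href="/wiki/Other">other</a>.</p>'
)
print(extract_isbns(sample))  # ['978-0-936389-27-1']
```

The same function would work on HTML coming from a dump, a crawler or a
local mirror, since it only looks at the rendered links.
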
> Another advantage of HTML is that you can work from a static dump
> file,



Where can you get such a dump from?


> or run a web crawler against the real Wikipedia


That does not seem practical.


> or against a
> local copy of Wikipedia loaded from the database dump files.
>

Pretty slow, isn't it?

Cheers!
Andrea


>
>

> On Tue, Dec 3, 2013 at 2:30 PM, Andrea Di Menna <ninn...@gmail.com> wrote:
> > I guess Paul wanted to know which book is cited by a Wikipedia page
> > (e.g. page A cites book x).
> > If I am not wrong, by asking for template transclusions you only get
> > the first part of the triple (page A).
> >
> > Paul, your use case is interesting.
> > At the moment we are not dealing with the {{cite}} template nor
> > {{cite book}} etc.
> > We are looking into extensions which could support similar use cases
> > anyway.
> >
> > Also please note that at the moment the framework does not handle
> > references either (i.e. what is inside <ref></ref>) when using the
> > SimpleWikiParser [1].
> > From a quick exploration I see this template is used mainly for
> > references.
> >
> > What exactly do you mean when you talk about "Wikipedia HTML"? Do you
> > refer to HTML dumps of the whole Wikipedia?
> >
> > Cheers
> > Andrea
> >
> > [1]
> > https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/simple/SimpleWikiParser.scala#L172
> >
> >
> > 2013/12/3 Tom Morris <tfmor...@gmail.com>
> >>
> >> On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle <ontolo...@gmail.com> wrote:
> >>>
> >>> Something I found out recently is that the page links don't capture
> >>> links that are generated by macros,  in particular almost all of the
> >>> links to pages like
> >>>
> >>> http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1
> >>>
> >>> don't show up because they are generated by the {{cite}} macro.  These
> >>> can be easily extracted from the Wikipedia HTML of course,
> >>
> >>
> >> That's good to know, but couldn't you get this directly from the
> >> Wikimedia API without resorting to HTML parsing by asking for template
> >> calls to http://en.wikipedia.org/wiki/Template:Cite ?
> >>
> >> Tom
> >
> >
>
>
>
> --
> Paul Houle
> Expert on Freebase, DBpedia, Hadoop and RDF
> (607) 539 6254    paul.houle on Skype   ontol...@gmail.com
>


