The "DBpedia Way" of extracting the citations probably would be to
build something that treats the citations the way infoboxes are
treated.

It's one way of doing things, and it has its own integrity, but it's
not the way I do things. (DBpedia does it this way about as well as it
can be done, so why try to beat it?)

A few years back I wrote a very elaborate Wikipedia markup parser in
.NET. It used a recursive descent parser and lots and lots of
heuristics to deal with special cases. Its purpose was to accurately
parse author and licensing metadata from Wikimedia Commons when
ingesting images into Ookaboo. I had to handle all those special cases
because Wikipedia markup doesn't have a formal spec.
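
To give a sense of what that kind of parser has to do, here is a
minimal sketch in Python (not the original .NET code) of the
nested-template scanning at the core of it; it ignores all the special
cases (tables, <nowiki>, unbalanced braces and so on) that made the
real thing painful:

    # Minimal sketch, not the original parser: find top-level {{...}}
    # spans in wikitext, tracking nesting depth as we scan.
    def extract_templates(wikitext):
        """Return the top-level {{...}} spans in a piece of wikitext."""
        templates = []
        i = 0
        while i < len(wikitext) - 1:
            if wikitext[i:i + 2] == "{{":
                depth, j = 1, i + 2
                while j < len(wikitext) and depth > 0:
                    if wikitext[j:j + 2] == "{{":
                        depth, j = depth + 1, j + 2
                    elif wikitext[j:j + 2] == "}}":
                        depth, j = depth - 1, j + 2
                    else:
                        j += 1
                templates.append(wikitext[i:j])
                i = j
            else:
                i += 1
        return templates

    # prints ['{{cite book|title=Example|isbn=978-0-936389-27-1}}']
    print(extract_templates(
        "See {{cite book|title=Example|isbn=978-0-936389-27-1}} for details."))

The real parser then piled heuristic after heuristic on top of
something like this to cope with how the markup is actually used.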

I quickly ran into a diminishing-returns situation where I had to work
harder and harder to improve recall, for smaller and smaller gains.

I later wrote a very simple parser for Flickr which just parsed the
HTML and took advantage of the "cool URIs" that Flickr publishes.
Today I think of it as pretending that the Linked Data revolution has
already arrived, because if you look at the link graph of Flickr,
there is a subset of it which isn't very different from the link graph
of Ookaboo.
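
To make the contrast concrete: the "simple" approach is just to parse
the rendered HTML with an off-the-shelf library and keep the anchors
whose URLs match the site's stable, guessable patterns. A rough sketch
in Python with BeautifulSoup (not my actual parser, and the Flickr
photo-page pattern below is only an illustration):

    # Sketch of the "just parse the HTML and trust the cool URIs" idea.
    import re
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    # e.g. http://www.flickr.com/photos/<user>/<numeric photo id>
    PHOTO_PAGE = re.compile(r"https?://(www\.)?flickr\.com/photos/[^/]+/\d+")

    def extract_cool_links(page_url):
        """Fetch a page, return outgoing links that look like photo pages."""
        html = urlopen(page_url).read()
        soup = BeautifulSoup(html, "html.parser")  # or "html5lib"
        return sorted({a["href"] for a in soup.find_all("a", href=True)
                       if PHOTO_PAGE.match(a["href"])})

Pointing the same thing at another site is mostly a matter of swapping
in a different URL pattern, which is why the change described below
only took a few minutes.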

Anyway, I needed to pull some stuff out of Wikimedia Commons, and it
took me 20 minutes to modify the Flickr parser to work for Commons and
get at least 80% of the recall that the old parser got.

On Thu, Dec 5, 2013 at 10:29 AM, Andrea Di Menna <ninn...@gmail.com> wrote:
> @Paul,
>
> unfortunately HTML Wikipedia dumps are not released anymore (the ones
> that exist are old static dumps, as you said).
> This is a problem for a project like DBpedia, as you can easily understand.
>
> Moreover, I did not mean that it is not possible to crawl Wikipedia
> instances or load a dump into a private MediaWiki instance (the latter
> is what happens when abstracts are extracted); I am just saying that
> this is probably not practical for a project like DBpedia, which
> extracts data from multiple Wikipedias.
>
> Cheers
> Andrea
>
>
> 2013/12/5 Paul Houle <ontolo...@gmail.com>
>>
>> @Andrea,
>>
>>         there are old static dumps available,  but I can say that running
>> the web crawler is not at all difficult.  I got a list of topics by looking
>> at the ?s for DBpedia descriptions and then wrote a very simple
>> single-threaded crawler that took a few days to run on a micro instance in
>> AWS.
>>
>>        The main key to writing a successful web crawler is keeping it
>> simple.
>>
>> On Dec 5, 2013 4:23 AM, "Andrea Di Menna" <ninn...@gmail.com> wrote:
>> >
>> > 2013/12/4 Paul Houle <ontolo...@gmail.com>
>> >>
>> >> I think I could get this data out of some API, but there are great
>> >> HTML5 parsing libraries now, so a link extractor from HTML can be
>> >> built as quickly as an API client.
>> >>
>> >> There are two big advantages of looking at links in HTML: (i) you
>> >> can use the same software to analyze multiple sites, and (ii) the
>> >> HTML output is often the most tested output of a system. The latter
>> >> point matters particularly for Wikipedia markup, which has no formal
>> >> specification: the editors aren't concerned with whether the markup
>> >> is clean, but they will fix problems that make the HTML look wrong.
>> >>
>> >> Another advantage of HTML is that you can work from a static dump
>> >> file,
>> >
>> >
>> > Where can you get such dump from?
>> >
>> >>
>> >> or run a web crawler against the real Wikipedia
>> >
>> >
>> > Seems not practical
>> >
>> >>
>> >> or against a
>> >> local copy of Wikipedia loaded from the database dump files.
>> >
>> >
>> > Pretty slow, isn't it?
>> >
>> > Cheers!
>> > Andrea
>> >
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Dec 3, 2013 at 2:30 PM, Andrea Di Menna <ninn...@gmail.com>
>> >> wrote:
>> >> > I guess Paul wanted to know which book is cited by a given
>> >> > Wikipedia page (e.g. page A cites book x).
>> >> > If I am not wrong, by asking for template transclusions you only
>> >> > get the first part of the triple (page A).
>> >> >
>> >> > Paul, your use case is interesting.
>> >> > At the moment we are not dealing with the {{cite}} template, the
>> >> > {{cite book}} template, etc.
>> >> > We are looking into extensions which could support similar use
>> >> > cases anyway.
>> >> >
>> >> > Also please note that at the moment the framework does not handle
>> >> > references either (i.e. what is inside <ref></ref>) when using the
>> >> > SimpleWikiParser [1].
>> >> > From a quick exploration I see this template is used mainly for
>> >> > references.
>> >> >
>> >> > What exactly do you mean when you talk about "Wikipedia HTML"? Do
>> >> > you refer to HTML dumps of the whole of Wikipedia?
>> >> >
>> >> > Cheers
>> >> > Andrea
>> >> >
>> >> > [1]
>> >> >
>> >> > https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/simple/SimpleWikiParser.scala#L172
>> >> >
>> >> >
>> >> > 2013/12/3 Tom Morris <tfmor...@gmail.com>
>> >> >>
>> >> >> On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle <ontolo...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Something I found out recently is that the page links don't
>> >> >>> capture links that are generated by macros; in particular, almost
>> >> >>> all of the links to pages like
>> >> >>>
>> >> >>> http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1
>> >> >>>
>> >> >>> don't show up because they are generated by the {cite} macro.
>> >> >>> These can be easily extracted from the Wikipedia HTML, of course,
>> >> >>
>> >> >>
>> >> >> That's good to know, but couldn't you get this directly from the
>> >> >> Wikimedia
>> >> >> API without resorting to HTML parsing by asking for template calls
>> >> >> to
>> >> >> http://en.wikipedia.org/wiki/Template:Cite ?
>> >> >>
>> >> >> Tom
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Paul Houle
>> >> Expert on Freebase, DBpedia, Hadoop and RDF
>> >> (607) 539 6254    paul.houle on Skype   ontol...@gmail.com
>> >
>> >
>
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontol...@gmail.com
