https://bugzilla.wikimedia.org/show_bug.cgi?id=62209

            Bug ID: 62209
           Summary: feature request: Text extraction from custom wiki
                    markup
           Product: MediaWiki extensions
           Version: unspecified
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: Unprioritized
         Component: TextExtracts
          Assignee: wikibugs-l@lists.wikimedia.org
          Reporter: jimk...@gmail.com
                CC: maxsem.w...@gmail.com
       Web browser: ---
   Mobile Platform: ---

Hi,

This is a very interesting project for DBpedia [1]. We already extract
abstracts from articles (e.g. [2]), but so far we do it by hacking the
MediaWiki core [3].

Looking at the code, I noticed that to get section 0 you parse the whole
page. That is a very expensive operation for us. We usually already have the
wiki markup we want to extract from, and use this API call to process it:

api.php?format=xml&action=parse&prop=text&title=[...]&text=[...]

Our hacked MediaWiki engine then returns clean text. As you probably guessed,
title is used to resolve self-references such as {{PAGENAME}}, and text is the
part of the page markup we want plain text from.
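To illustrate, the call above can be sketched as follows. This is only a
sketch of how we assemble the request on our side; build_parse_request is a
hypothetical helper, and the title and wikitext values are made-up examples:

```python
from urllib.parse import urlencode

def build_parse_request(api_url: str, title: str, wikitext: str) -> str:
    """Build an action=parse request that parses arbitrary wikitext in the
    context of `title`, so that {{PAGENAME}} and similar self-references
    resolve correctly."""
    params = {
        "format": "xml",
        "action": "parse",
        "prop": "text",
        "title": title,     # page context for self-references
        "text": wikitext,   # the markup fragment to parse
    }
    return api_url + "?" + urlencode(params)

# Hypothetical example: parse a lead-section fragment of [[Berlin]]
url = build_parse_request(
    "https://en.wikipedia.org/w/api.php",
    "Berlin",
    "'''Berlin''' is the capital of [[Germany]].",
)
```

The point is that only the fragment we care about is sent for parsing, not
the whole page.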

So, to get to the point: is this feasible in your extension? With some
guidance from your side, we can also work on it.

[1] http://dbpedia.org
[2] http://dbpedia.org/page/Berlin
[3]
https://github.com/dbpedia/extraction-framework/wiki/Dbpedia-Abstract-Extraction-step-by-step-guide#wiki-prepare-mediawiki---configuration-and-settings

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
