https://bugzilla.wikimedia.org/show_bug.cgi?id=62209
Bug ID: 62209
Summary: feature request: Text extraction from custom wiki markup
Product: MediaWiki extensions
Version: unspecified
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: enhancement
Priority: Unprioritized
Component: TextExtracts
Assignee: wikibugs-l@lists.wikimedia.org
Reporter: jimk...@gmail.com
CC: maxsem.w...@gmail.com
Web browser: ---
Mobile Platform: ---

Hi,

This is a very interesting project for DBpedia [1]. We already extract abstracts from articles (e.g. [2]), but up to now we have had to hack the MediaWiki core to do it [3]. Looking at the code, I noticed that to get section 0 you parse the whole page. That is a very expensive operation for us.

We usually take the part of the wiki markup that we want to extract and use this API call to render it:

api.php?format=xml&action=parse&prop=text&title=[...]&text=[...]

Our hacked MediaWiki engine then returns it as clean text. As you can probably guess, the title is used to resolve self references such as {{PAGENAME}}, and the text is the part of the page markup we want to extract text from.

So, to get to the point: is this feasible in your extension? With some guidance from your side, we can also work on this.

[1] http://dbpedia.org
[2] http://dbpedia.org/page/Berlin
[3] https://github.com/dbpedia/extraction-framework/wiki/Dbpedia-Abstract-Extraction-step-by-step-guide#wiki-prepare-mediawiki---configuration-and-settings

_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
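For illustration, here is a minimal sketch of how the action=parse call described above could be constructed. The endpoint URL, helper name, and example values are placeholders of mine, not part of the report; the parameter names (format, action, prop, title, text) are taken from the API call quoted in the report.

```python
from urllib.parse import urlencode

def build_parse_url(api_base, title, wikitext):
    """Build an api.php?action=parse URL that renders a wikitext
    fragment, passing a title so self references like {{PAGENAME}}
    resolve against the right page."""
    params = {
        "format": "xml",
        "action": "parse",
        "prop": "text",
        "title": title,    # used to resolve {{PAGENAME}} and similar
        "text": wikitext,  # the markup fragment to render
    }
    return api_base + "?" + urlencode(params)

# Hypothetical usage against a wiki's api.php endpoint:
url = build_parse_url(
    "https://example.org/w/api.php",
    "Berlin",
    "'''Berlin''' is the capital of Germany.",
)
print(url)
```

Fetching this URL would return the rendered HTML for just the supplied fragment, rather than parsing the whole page, which is the cost the reporter is trying to avoid.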