Re: [Wikitech-l] [mwdumper] new maintainer?
mwDumper is also essential for anyone willing to replicate a wiki locally, for any purpose. There are alternatives such as xml2sql or importDump.php, but mwDumper is the most efficient in terms of correctness and completeness, and sometimes speed.

bilal
--
Verily, with hardship comes ease.

On Fri, Feb 12, 2010 at 8:46 AM, emman...@engelhart.org wrote:

On Fri 12/02/10 14:24, Christensen, Courtney christens...@battelle.org wrote:

We use the DumpHTML extension (http://www.mediawiki.org/wiki/Extension:DumpHTML) to make static copies of our wikis. It used to be a maintenance script. Maybe that would work for you?

The DumpHTML extension is something else... it is a tool to get a static HTML version of MediaWiki articles. If you mean http://static.wikipedia.org/... that is also another topic, because those pages are not our content, only a non-customizable view of our content (I can't do anything with it). Our content is the wiki code and the files (images, etc.), and that is what currently does not seem to be fully reusable.

Emmanuel

PS: DumpHTML also seems to be unmaintained currently... have a look at the bug reports.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] importing enwiki into local database
I am still able to import the dumps using the old mwDumper (modified to fix the contributor field), and xml2sql also works and is quite fast. importDump.php, I think, continues after it breaks.

bilal
--
Verily, with hardship comes ease.

On Thu, Feb 4, 2010 at 9:24 PM, Chad innocentkil...@gmail.com wrote:

On Thu, Feb 4, 2010 at 9:12 PM, Eric Sun e...@cs.stanford.edu wrote:

Hi, I saw this thread back in October where someone was having trouble importing the English Wikipedia XML dump: http://lists.wikimedia.org/pipermail/wikitech-l/2009-October/045594.html
The thread seemed to end without resolution, and the tools still seem to be broken, so has anyone found a solution in the meantime? I'm using mediawiki-1.15.1 and attempting to import enwiki-20100130-pages-articles.xml.bz2. None of these options seem to work:

1) importDump.php fails, repeatedly spewing: Warning: xml_parse(): Unable to call handler in_() in ./includes/Import.php on line 437
2) xml2sql (http://meta.wikimedia.org/wiki/Xml2sql) fails with: xml2sql: parsing aborted at line 33 pos 16. Due to the new redirect tag introduced in the new dumps?
3) mwdumper (http://www.mediawiki.org/wiki/MWDumper): the current XML is schema v0.4, but the documentation says it's for 0.3
4) mwimport (http://meta.wikimedia.org/wiki/Data_dumps/mwimport) fails immediately: siteinfo: untested generator 'MediaWiki 1.16alpha-wmf', expect trouble ahead / page: expected closing tag in line 35

Any tips? Thanks!
Eric

Most of these errors are caused by the new(ish) redirect tag within page elements. 0.4 is the correct version of the schema, but unfortunately the schema was updated and dumps were produced with it before the changes made it into a release. 1.15.1 cannot import pages with the redirect tag; we should probably backport that. That, and we should rewrite the importers not to barf terribly when they encounter an unknown element.

-Chad
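Chad's point about importers barfing on unknown elements can be sketched: a streaming parser that simply ignores tags it does not recognize survives the new redirect element instead of aborting. A minimal, hypothetical Python sketch (this is not any of the actual tools; the function name is mine, and only the title/text/page tags from the dump schema are handled):

```python
import xml.etree.ElementTree as ET

def iter_pages(xml_stream):
    """Yield (title, text) pairs from a pages-articles XML stream,
    ignoring any elements the importer does not know about."""
    title, text = None, None
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip XML namespace, if any
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # free memory as we stream through a huge dump
        # unknown tags such as <redirect/> fall through harmlessly

# usage: for title, text in iter_pages(bz2.open("pages-articles.xml.bz2")): ...
```

The same skip-what-you-don't-know rule is what lets an old importer keep working when the schema grows a new element.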
Re: [Wikitech-l] download wikipedia database dump
If you can download the whole file to your PC, you can import just a portion of it and stop the import after some time; mwDumper reports imported pages in increments of 1000. If you do not have enough bandwidth to download the whole thing, you can use the Special:Export feature (http://en.wikipedia.org/wiki/Special:Export) on the English Wikipedia and select just the pages you need.

bilal
--
Verily, with hardship comes ease.

On Mon, Jan 11, 2010 at 11:10 AM, OrzzrO orzvs...@gmail.com wrote:

Hi, I want to download the English-language Wikipedia database dump, but the whole dump is 10.1 GB, which is too large for me. In fact, I only need a part of the database, and any part is fine. Can I download a small database that is a subset of the whole dump? Thanks for your time and help! Best wishes!
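The Special:Export request mentioned above can be built programmatically. A hedged Python sketch: `pages` (newline-separated titles), `curonly`, and `action=submit` are the form fields the export page uses, and the helper name is my own invention:

```python
from urllib.parse import urlencode

EXPORT_URL = "https://en.wikipedia.org/wiki/Special:Export"

def build_export_request(titles, include_history=False):
    """Return (url, form_body) for fetching the given pages via Special:Export.
    curonly=1 requests only the latest revision, keeping the download small."""
    data = {
        "pages": "\n".join(titles),
        "action": "submit",
    }
    if not include_history:
        data["curonly"] = "1"
    return EXPORT_URL, urlencode(data)

# usage (e.g. with urllib.request):
#   url, body = build_export_request(["Albert Einstein", "Physics"])
#   xml = urllib.request.urlopen(url, data=body.encode()).read()
```

The returned XML is in the same schema as the full dumps, so the usual import tools accept it.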
Re: [Wikitech-l] downloading wikipedia database dumps
I think having access to them in the Commons repository is much easier to handle, and a subset should be good enough. Working with 11 TB of images requires huge research capabilities. Maybe a special API, or advanced API functions, would give people enough access while saving the bandwidth and the hassle of handling this behemoth collection.

bilal
--
Verily, with hardship comes ease.

On Fri, Jan 8, 2010 at 1:57 PM, Tomasz Finc tf...@wikimedia.org wrote:

William Pietri wrote:

On 01/07/2010 01:40 AM, Jamie Morken wrote:

I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available via a Wikipedia BitTorrent tracker, so that people would be able to download the Wikipedia backups including the images (which currently they can't do), and so that Wikipedia's bandwidth costs would be reduced. [...]

Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.

No, bandwidth is not really the problem here. I think the core issue is having bulk access to images. There have been a number of these requests in the past, and after talking back and forth it has usually turned out that a smaller subset of the data works just as well. A good example of this was the Deutsche Fotothek archive made late last year:

http://download.wikipedia.org/images/Deutsche_Fotothek.tar (11 GB)

This provided an easily retrievable, high-quality subset of our image data which researchers could use. Now, if we were to snapshot image data and store it per project, the amount of duplicate image data would become significant, because we re-use a ton of image data between projects, and rightfully so. If instead we package all of Commons into a tarball, we get roughly 6 TB of image data, which after numerous conversations has proven to be a bit more than most people want to process.

So what does everyone think of going down the collections route? If we provide enough different and up-to-date ones, we could easily give people a large but manageable amount of data to work with. If there is already a page for this, please feel free to point me to it; otherwise I'll create one.

--tomasz
Re: [Wikitech-l] downloading wikipedia database dumps
I have been using the dumps for a few months, and I think this kind of dump is much better than a torrent. Yes, bandwidth can be saved, but I do not think the cost of bandwidth is higher than the cost of maintaining the torrents. If people are not hosting the files, the value of torrents is limited. I think regular mirroring is much better, but it all depends on the willingness of people to host the files.

bilal
--
Verily, with hardship comes ease.

On Thu, Jan 7, 2010 at 11:30 AM, Platonides platoni...@gmail.com wrote:

Jamie Morken wrote:

Hi, I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available via a Wikipedia BitTorrent tracker, so that people would be able to download the Wikipedia backups including the images (which currently they can't do), and so that Wikipedia's bandwidth costs would be reduced. I think it is important that Wikipedia can be downloaded for offline use, now and in the future.

best regards, Jamie Morken

This has been tried before (when the dumps were smaller). How many people do you think will have the necessary space and be willing to download it?
[Wikitech-l] {{Encyclopédie recherche}}
Greetings, this template is not being parsed on my French local wiki. Any hints on that? I did several searches on Google but could not find the problem.

bilal
--
Verily, with hardship comes ease.
Re: [Wikitech-l] Importing English Wikipeida XML Dumps into MediaWiki
I have used xml2sql, mwDumper, importDump.php, and the Python script (xray) to import dumps. The two fastest are xml2sql and the Python script; the best results come from importDump.php. mwDumper is slow, but it gives good results. I have not done any import with the new redirect tag.

bilal

On Fri, Oct 9, 2009 at 2:18 PM, O. O. olson...@yahoo.com wrote:

Andrew Krizhanovsky wrote:

Hi! I have got the same redirect problem while importing the dump of the Russian Wiktionary. :(
Best regards, Andrew Krizhanovsky.

So Andrew, do you import using importDump.php, MWDumper, or xml2sql? I am curious to know what others are using for their imports. (This is for my personal knowledge.) It seems that the "redirect /" tags are mostly blank while grepping through the English Wikipedia dump. I hope someone can fix this soon. Thanks to you guys,
O. O.

--
Verily, with hardship comes ease.
Re: [Wikitech-l] Wikipedia Google Earth layer
I think Google applications use data crawled into their own databases, and of course Google has almost all the latest updates of Wikipedia articles, with all their information, including the geo addresses.

bilal

On Fri, Oct 2, 2009 at 1:59 PM, Tei oscar.vi...@gmail.com wrote:

On Fri, Oct 2, 2009 at 6:15 PM, Roan Kattouw roan.katt...@gmail.com wrote:

2009/10/2 Tei oscar.vi...@gmail.com:

On Fri, Oct 2, 2009 at 3:37 PM, Strainu strain...@gmail.com wrote:

... I'm not sure if Wikimedia has anything to do with it, but I think I have a better chance of getting an answer here than by asking Google (the company) directly. Google (the search engine) was not really helpful on the matter.

You could always install Ethereal and spy on the traffic from your computer to the network. It probably includes some HTTP servers, and GET/POST requests you can read.

The LiveHTTPHeaders extension for Firefox will also do this job for you, and is a bit easier to install and use.

Not really; it's Google Earth we are talking about here. Since it is a standalone app, it talks directly through the network.

--
ℱin del ℳensaje.
Re: [Wikitech-l] Public repositories for research dumps
Hi Felipe, thanks for the great effort. This will save us hours of downloading and importing older dumps.

bilal

On Tue, Jun 23, 2009 at 12:26 PM, Felipe Ortega glimmer_phoe...@yahoo.es wrote:

Hello. Since just a few hours ago, a new public repository has been created to host WikiXRay database dumps, containing info extracted from public Wikipedia db dumps. The image is hosted by RedIRIS (in short, the Spanish equivalent of Kennisnet in the Netherlands).

http://sunsite.rediris.es/mirror/WKP_research
ftp://ftp.rediris.es/mirror/WKP_research

These new dumps are aimed at saving time and effort for other researchers, since they won't need to parse the complete XML dumps to extract all relevant activity metadata. We used mysqldump to create the dumps from our databases. As of today, only some of the biggest Wikipedias are available; in the following days the full set of available languages will be ready for downloading. The files will be updated regularly.

The procedure is as follows:

1. Find the research dump of your interest. Download and decompress it on your local system.
2. Create a local DB to import the information.
3. Load the dump file, using a MySQL user with insert privileges:

$ mysql -u user -p myDB < dumpfile.sql

And you're done. Final warning: 3 fields in the revision table are not reliable yet: rev_num_inlinks, rev_num_outlinks, rev_num_trans. All remaining fields/values are trustworthy (in particular rev_len, rev_num_words, and so forth).

Regards, Felipe.
Re: [Wikitech-l] We're not quite at Google's level
Sorry, I missed the point in a previous post then. The wording looked like using the downtime as a strategy.

On Fri, May 15, 2009 at 12:02 PM, Thomas Dalton thomas.dal...@gmail.com wrote:

2009/5/15 The Cunctator cuncta...@gmail.com:

No, and it's stupid. It's not like this is a covert discussion.

On Fri, May 15, 2009 at 11:45 AM, Bilal Abdul Kader bila...@gmail.com wrote:

Is it ethical?

How is it unethical? We take advantage of downtime to explain to our readers that we rely on donations to keep the site running; there is nothing dishonest about that.
Re: [Wikitech-l] Downloadable client fonts to aid language script support?
On Tue, May 5, 2009 at 9:47 AM, Nikola Smolenski smole...@eunet.yu wrote:

Brion Vibber wrote:

It might be helpful for some language wikis to link in a free font this way, when standard fonts supporting their script are often unavailable. Right now on such sites there tends to be a little English link at the top, such as 'font help', leading to a page like this telling you how to download and install a font: http://ta.wikipedia.org/wiki/Project:Font_help

Even more helpful: MediaWiki could determine on save whether a page uses a rare character, and link to appropriate fonts.

This should be pushed to the client end, I think, because even if the page uses a rare character, the decision to load the font should be the browser's, not something MediaWiki pushes. Some front-end JS can do the task well.
[Wikitech-l] Extracting pages history error
Greetings, I am trying to replicate enwiki locally, but I always get a CRC error while extracting the page history file (enwiki-latest-pages-meta-history.xml.bz2, http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history.xml.bz2). Has anybody been able to do so? I am not sure whether the error is at the source (when compressing it) or caused by the download manager at my end.

bilal
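One way to narrow down whether the CRC error comes from the source file or from a broken download is to stream-decompress the local copy and see whether it decodes cleanly (the shell equivalent is `bzip2 -t file.bz2`). A small Python sketch, with a hypothetical helper name:

```python
import bz2

def bz2_intact(path, chunk_size=1 << 20):
    """Stream-decompress a .bz2 file and report whether it decodes cleanly.
    Corrupt data raises OSError from the decompressor; a truncated
    download raises EOFError. Either way the file is unusable."""
    try:
        with bz2.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        return False
```

If a freshly re-downloaded copy (ideally via a client without size limits, checked against the published checksum) still fails, the corruption is on the server side rather than in the download manager.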
Re: [Wikitech-l] Skin JS cleanup and jQuery
There is an issue with running a foreground JS thread that is super fast and might send a lot of requests to the server. Heavy processing on the client side would take load off the server (where possible), but it might push another kind of load onto the server (in the presented example, sending emails to users). I have worked on an AJAX application that sends email through a JavaScript front end, and it turned out that the server was denying the JS requests because they went beyond the allowed limit of connections from a single host. A better approach might be to start the task on the client side and save it in a queue on the server side, for another (server-side) process to take care of later, in FIFO order.

On Wed, Apr 22, 2009 at 12:18 PM, Brion Vibber br...@wikimedia.org wrote:

Perhaps... but note that the I/O for XMLHttpRequest is asynchronous to begin with; it's really only if you're doing heavy client-side _processing_ that you're likely to benefit from a background worker thread.

On 4/17/09 6:45 PM, Marco Schuster wrote:

You mean... stuff like bots written in JavaScript, using the XML API? I could also imagine sending mails via Special:Emailuser in the background to reach multiple recipients; that's a PITA if you want to send mails to multiple users.
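The queue-and-drain pattern proposed above can be shown in miniature. This is an illustrative Python sketch (all names are mine, and a real deployment would use a persistent server-side job queue rather than an in-process thread): the request handler only enqueues work, and a single background worker drains the queue in FIFO order, so the mail server sees one connection at a time instead of a burst of parallel client requests.

```python
import queue
import threading

task_queue = queue.Queue()
sent = []  # stand-in for the real "send an email" side effect

def enqueue_email(recipient, body):
    """Called from the request handler: cheap, returns immediately."""
    task_queue.put((recipient, body))

def worker():
    """Drains the queue one item at a time, in arrival (FIFO) order."""
    while True:
        item = task_queue.get()
        if item is None:          # sentinel: shut the worker down
            break
        recipient, body = item
        sent.append(recipient)    # replace with the real SMTP call
        task_queue.task_done()

t = threading.Thread(target=worker)
t.start()
```

Because there is a single worker, delivery order matches submission order, and the per-host connection limit is never exceeded.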
Re: [Wikitech-l] Skin JS cleanup and jQuery
This would be a great idea, as the library is always updated and has a lot of features for the front end.

On Wed, Apr 22, 2009 at 12:28 PM, Brian brian.min...@colorado.edu wrote:

Many extensions are now using the Yahoo User Interface library. It would be nice if MediaWiki included it by default.
Re: [Wikitech-l] Dealing with Large Files when attempting a wikipedia database download.
I have downloaded the history dump file (~150 GB) using Firefox on XP and using wget on Ubuntu, and it works fine. I have also downloaded it using a download manager on Vista, and it is fine too. A more probable reason is file system limitations.

bilal

On Fri, Apr 10, 2009 at 3:49 PM, Finne Boonen hen...@gmail.com wrote:

http://en.wikipedia.org/wiki/Wikipedia_database has some information on how to deal with the large files.

henna

On Fri, Apr 10, 2009 at 21:43, Daniel Kinzler dan...@brightbyte.de wrote:

David Gerard schrieb:

2009/4/10 Jameson Scanlon jameson.scan...@googlemail.com:

Does anyone on the wikitech mailing list happen to know whether it would be possible for some of the larger Wikipedia database downloads (which are, say, 16 GB or so in size) to be split into parts so that they can be downloaded? For whatever reason, whenever I have attempted to download the ~14 GB files (say, from http://static.wikipedia.org/downloads/2008-06/en/), I have found that only 2 GB (presumably the first 2 GB) of what I sought to download has actually been downloaded. Is there any way around this? Could anyone possibly suggest what reasons there might be for this difficulty in downloading the material?

Downloading to a filesystem that only supports files up to 2 GB?

Also, several HTTP clients don't like files over 2 GB: the large number of bytes in the Length field causes an integer overflow (2 GB is the 31-bit limit). wget likes to die with a segmentation fault on those. I found that curl works. But of course, the file system also has to support very large files, as Gerard said. Finally: yes, it would be nice to have such dumps available in pieces of perhaps 1 GB in size.

-- daniel
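Until split dumps exist, a client that supports HTTP Range requests can fetch the file in sub-2 GB pieces and concatenate them afterwards. A sketch of the range arithmetic (the helper name is hypothetical, and the server must honor Range headers for this to work):

```python
def byte_ranges(total_size, part_size):
    """Split a download of total_size bytes into (start, end) pairs suitable
    for HTTP Range headers ("bytes=start-end", ends inclusive), so a file
    larger than a client's 2 GB limit can be fetched in pieces."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# usage with curl (which already handles >2 GB files itself):
#   for i, (s, e) in enumerate(byte_ranges(size, 1 << 30)):
#       run: curl -r {s}-{e} -o part{i} URL
#   then concatenate: cat part* > dump.xml.bz2
```

The pieces are contiguous and non-overlapping, so simple concatenation reproduces the original file, which can then be verified against the published checksum.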
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
I have a decent server that is dedicated to a Wikipedia project that depends on fresh dumps. Can this be used in any way to speed up the process of generating the dumps?

bilal

On Tue, Jan 27, 2009 at 2:24 PM, Christian Storm st...@iparadigms.com wrote:

On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:

The current enwiki database dump (http://download.wikimedia.org/enwiki/20081008/) has been crawling along since 10/15/2008.

The current dump system is not sustainable on very large wikis and is being replaced. You'll hear about it when we have the new one in place. :)

-- brion

Following up on this thread: http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html

Brion, can you offer any general timeline estimates (weeks, months, half a year)? Are there any alternatives to retrieving the article data beyond directly crawling the site? I know this is verboten, but we are in dire need of this data and don't know of any alternatives. The current estimate of end of year is too long for us to wait. Unfortunately, Wikipedia is a favored source for students to plagiarize from, which makes out-of-date content a real issue. Is there any way to help this process along? We can donate disk drives, developer time, ...? There is another possibility that we could offer, but I would need to talk with someone at the Wikimedia Foundation offline. Is there anyone I could contact? Thanks for any information and/or direction you can give.

Christian