Re: [Wikitech-l] Simple way to convert XML to HTML
On Sun, Jul 26, 2009 at 8:17 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> Anyone know how long it takes to create a static HTML dump? A month?

It would depend completely on your hardware.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Sun, Jul 26, 2009 at 8:51 PM, K. Peachey <p858sn...@yahoo.com.au> wrote:
> On Mon, Jul 27, 2009 at 10:17 AM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
>> Anyone know how long it takes to create a static HTML dump? A month?
>
> As in locally on your own systems, or for the WMF servers to create it?

WMF servers. Sorry for not clarifying.
Re: [Wikitech-l] Simple way to convert XML to HTML
* Tei <oscar.vi...@gmail.com> [Tue, 21 Jul 2009 19:42:45 +0200]:
> On Tue, Jul 21, 2009 at 7:17 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
>> ... No, I know what parsing means. Even if it takes 2 days to parse them, wouldn't it be faster than actually creating a static HTML dump the traditional way? If it is not, then what is the difficulty of making static HTML dumps? It can't be bandwidth, storage, or speed.
>
> Wikimedia works with limited resources: manpower, hardware, etc. Things get done when the resources, human and otherwise, are available. It's not only you; there are lots of people who want to download Wikipedia (sometimes periodically).
>
> There is a log somewhere of the daily work of the Wikipedia admins ( -:
> http://wikitech.wikimedia.org/view/Server_admin_log
>
> Some of the entries are even very fun, like:
> 02:11 b: CPAN sux
> 01:47 d**: I FOUND HOW TO REVIVE APACHES
> (names obscured to protect the innocent).

Speaking of compact off-line English Wikipedia, I liked the TomeRaider version: http://en.wikipedia.org/wiki/TomeRaider
I wish there were newer TR builds, because English Wikipedia grows really fast.

Dmitriy
Re: [Wikitech-l] Simple way to convert XML to HTML
On Wed, Jul 22, 2009 at 8:15 AM, Dmitriy Sintsov <ques...@rambler.ru> wrote:
> ...
> Speaking of compact off-line English Wikipedia, I liked the TomeRaider version: http://en.wikipedia.org/wiki/TomeRaider
> I wish there were newer TR builds, because English Wikipedia grows really fast.

Yes, the TomeRaider version is exactly the version I want for static HTML. Just curious, is pages-articles.xml.bz2 (http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2) like a TomeRaider version? If not, what's the difference?

And another curiosity: at http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database it says the English Wikipedia database is only 3.3GB. Did they use compression? That seems awfully small. Even if they did, that's an incredible compression ratio, similar to 7-Zip; I don't know how you can do that in an eBook format. NTFS compression only brings the size down by about 50%.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Wed, Jul 22, 2009 at 5:48 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> ...
> And another curiosity: at http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database it says the English Wikipedia database is only 3.3GB. Did they use compression? That seems awfully small. Even if they did, that's an incredible compression ratio, similar to 7-Zip; I don't know how you can do that in an eBook format. NTFS compression only brings the size down by about 50%.

At one point, Brion compressed it to 242 MB:
http://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg00358.html

You may also read this:
http://en.wikipedia.org/wiki/Solid_compression

--
ℱin del ℳensaje.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Wed, Jul 22, 2009 at 2:37 PM, Tei <oscar.vi...@gmail.com> wrote:
> At one point, Brion compressed it to 242 MB:
> http://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg00358.html
>
> You may also read this:
> http://en.wikipedia.org/wiki/Solid_compression

I have no doubt that you can compress it to 3.3GB. I'm just curious how that's possible in an eBook format. Does the 3.3GB include the skin, the proper Wikipedia formatting, etc.? I'm assuming the pages-articles.xml.bz2 XML dump includes something other than the raw articles? What else is in it?
Re: [Wikitech-l] Simple way to convert XML to HTML
On Wed, Jul 22, 2009 at 6:37 PM, Tei <oscar.vi...@gmail.com> wrote:
> At one point, Brion compressed it to 242 MB:
> http://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg00358.html

It looks like it was Platonides, not Brion, and as far as I can tell, Gregory Maxwell said his compression procedure was broken (i.e., inadvertently lossy).

On Wed, Jul 22, 2009 at 7:03 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> I have no doubt that you can compress it to 3.3GB. I'm just curious how that's possible in an eBook format.

You just use a very good compression algorithm. Why can't e-books use 7-Zip?
Re: [Wikitech-l] Simple way to convert XML to HTML
On Wed, Jul 22, 2009 at 6:53 PM, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
> ...
> You just use a very good compression algorithm. Why can't e-books use 7-Zip?

Because decompression would be so slow it would be unusable (correct me if I'm wrong). Even with an excellent compression algorithm, you can't use solid compression, otherwise decompression will be a major pain. My own testing shows that solid compression is roughly 5 times more efficient at compressing Wikipedia than normal compression.
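The difference being discussed is between compressing each page on its own and compressing many pages as one continuous stream, which is what lets solid archives exploit the redundancy shared across articles (templates, infoboxes, common phrasing). A minimal sketch of the comparison using standard Unix tools, assuming a directory articles/ of plain-text page files (the directory layout and file names here are illustrative only):

    # "Normal" per-file compression: each article gets its own compression
    # context, so redundancy shared across articles is wasted.
    for f in articles/*.txt; do
        bzip2 -k "$f"
    done
    du -ch articles/*.txt.bz2 | tail -n 1

    # Solid-style compression: pack everything into one archive first, then
    # compress the single stream, so repeated structure is shared.
    tar -cf articles.tar articles/
    bzip2 -k articles.tar
    du -h articles.tar.bz2

The trade-off mentioned above is real: with one big solid stream you cannot extract a single article without decompressing everything before it, which is why offline readers tend to compress in smaller blocks rather than as one solid archive.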
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 3:33 AM, Kwan Ting Chan <k...@ktchan.info> wrote:
> I know you want to avoid using the command line, but in this case it's really much simpler, and the only feasible choice, to search the internet / ask around for the right commands and issue them on the command line. It's only going to be one line of typing once you've got it, and you can write it down on a piece of paper or something for future reference. It's not like you have to learn the ins and outs of all the commands and their options and whatnot. (Of course, you would want to test it on a small sample to make sure the command is correct before you let it loose on the whole dump.)

In my experience, what on Unix is done with generic built-in command-line utilities can often be done on Windows using special-purpose GUIs written by third parties (often non-gratis, or nagware/adware/etc.). It's obviously a vastly inferior system for those of us who are happy using command lines, or even those accustomed to open-source GUI software, but it can work. For instance, this program provides a function to "delete files that match custom file name patterns and filters", and has a free trial version:

http://www.microsystools.com/products/multibatcher/

So it's not accurate to say it's only feasible to use a command line here.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Mon, Jul 20, 2009 at 11:33 PM, Kwan Ting Chan <k...@ktchan.info> wrote:
> Chengbin Zheng wrote:
>> Thank you for dropping by and sharing this information with us, Tomasz! It is good just knowing that it is in the queue.
>>
>> Have you considered making a version of the static HTML Wikipedia with no user talk and discussion pages, which eat up half the space (like the 5GB XML dump for the English Wikipedia)? As in the previous e-mail, it is impossible to delete millions of pages through Windows Vista's search function (I left it overnight, and it ended up eating 1.3GB of RAM and maxing out one of my cores; even deleting a single file took minutes).
>
> The Windows (and others'?) GUI wasn't really designed with what you are trying to do in mind, in terms of the number of items. You are asking it to search for all the files that match your pattern, keep the millions (?) of results in memory, and then show you a window containing the millions of items and let you do all the magic GUI operations (selecting / dragging ...), all the while keeping track of which you've selected / moved about, etc.
>
> I know you want to avoid using the command line, but in this case it's really much simpler, and the only feasible choice, to search the internet / ask around for the right commands and issue them on the command line. It's only going to be one line of typing once you've got it, and you can write it down on a piece of paper or something for future reference. It's not like you have to learn the ins and outs of all the commands and their options and whatnot. (Of course, you would want to test it on a small sample to make sure the command is correct before you let it loose on the whole dump.)
>
> KTC
> --
> Experience is a good school but the fees are high. - Heinrich Heine

Actually, I do have to learn everything. I know absolutely nothing about HTML and all that stuff (maybe I will when I take the computer science course in grade 10).

Think of it this way: you have a radioactive-decay problem where you want to find out how much mass is left after 1000 years. Obviously there is no simple algebraic way of doing it; you must set up a differential equation and solve it. There is no way to do it if your math skills are only basic algebra. This is me, and I would have to learn all of advanced algebra, functions, trigonometry, calculus, and differential equations to do it.
Re: [Wikitech-l] Simple way to convert XML to HTML
> Actually, I do have to learn everything. I know absolutely nothing about HTML and all that stuff (maybe I will when I take the computer science course in grade 10).
>
> Think of it this way: you have a radioactive-decay problem where you want to find out how much mass is left after 1000 years. Obviously there is no simple algebraic way of doing it; you must set up a differential equation and solve it. There is no way to do it if your math skills are only basic algebra. This is me, and I would have to learn all of advanced algebra, functions, trigonometry, calculus, and differential equations to do it.

If you were able to do x264 from the command line, this will be a walk in the park. I've been using the command line for years and I *much* prefer to use a GUI to do x264 transcoding.

Using the HTML exporter from the command line is fairly simple, and it is documented on the extension page: http://www.mediawiki.org/wiki/Extension:DumpHTML

V/r,

Ryan Lane
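For what it's worth, the exporter Ryan mentions is a MediaWiki maintenance script run from the wiki's installation directory once the XML dump has been imported. A minimal sketch, with the destination path and skin name as placeholders and the option names taken from the extension's documentation (they may differ between versions, so check the page linked above):

    # Run from the MediaWiki installation directory after importing the dump.
    # -d is the destination directory for the generated HTML tree,
    # -k is the skin used to render the pages.
    php extensions/DumpHTML/dumpHTML.php -d /var/www/static-wiki -k monobook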
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 9:37 AM, Lane, Ryan <ryan.l...@ocean.navo.navy.mil> wrote:
> ...
> If you were able to do x264 from the command line, this will be a walk in the park. I've been using the command line for years and I *much* prefer to use a GUI to do x264 transcoding.
>
> Using the HTML exporter from the command line is fairly simple, and it is documented on the extension page: http://www.mediawiki.org/wiki/Extension:DumpHTML

I have no idea how to install MediaWiki. This is too difficult and troublesome. Considering how much of a pain it is to use x264 from the command line, I probably don't want to try this. Truthfully there is not much to x264 on the command line, but the programs I'm seeing here are, well, complicated, to say the least.

I'm just gonna wait for Wikimedia to update the static HTML, or bother my computer science teacher, LOL.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 11:22 AM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> On a side note, if parsing the XML gets you the static HTML version of Wikipedia, why can't Wikimedia just parse it for us and save a lot of our time (parsing and learning), and use that as the static HTML dump version?

I'd assume it was a performance issue to parse all the pages for all the dumps so often. It might have just used too much CPU to be worth it at the time. Parsing some individual pages can take 20 seconds or more, and there are millions of them (although most parse much faster than that). I'm sure it could be reinstituted with some effort, though.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 1:08 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> Wouldn't parsing it be faster than actually creating that many HTMLs?

Parsing it *is* creating the HTML files. That's what "parsing" means in MediaWiki: converting wikitext to HTML. It's kind of a misnomer, admittedly.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 1:11 PM, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
> On Tue, Jul 21, 2009 at 1:08 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
>> Wouldn't parsing it be faster than actually creating that many HTMLs?
>
> Parsing it *is* creating the HTML files. That's what "parsing" means in MediaWiki: converting wikitext to HTML. It's kind of a misnomer, admittedly.

No, I know what parsing means. Even if it takes 2 days to parse them, wouldn't it be faster than actually creating a static HTML dump the traditional way? If it is not, then what is the difficulty of making static HTML dumps? It can't be bandwidth, storage, or speed.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 1:17 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> No, I know what parsing means. Even if it takes 2 days to parse them, wouldn't it be faster than actually creating a static HTML dump the traditional way?

I don't know. I can only speculate. Whatever it is, it will take some attention to set it up again, and Tomasz has said he'll do that, so that's about all there is to say.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 7:17 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> ... No, I know what parsing means. Even if it takes 2 days to parse them, wouldn't it be faster than actually creating a static HTML dump the traditional way? If it is not, then what is the difficulty of making static HTML dumps? It can't be bandwidth, storage, or speed.

Wikimedia works with limited resources: manpower, hardware, etc. Things get done when the resources, human and otherwise, are available. It's not only you; there are lots of people who want to download Wikipedia (sometimes periodically).

There is a log somewhere of the daily work of the Wikipedia admins ( -:
http://wikitech.wikimedia.org/view/Server_admin_log

Some of the entries are even very fun, like:
02:11 b: CPAN sux
01:47 d**: I FOUND HOW TO REVIVE APACHES
(names obscured to protect the innocent).

--
ℱin del ℳensaje.
Re: [Wikitech-l] Simple way to convert XML to HTML
> wouldn't it be faster than actually creating a static HTML dump the traditional way?

The content is wikitext. It has to be parsed to be turned into HTML. There isn't a more "traditional" way, because there is no other way.

Wouldn't it be possible to dump the parser cache instead of dumping XML and reparsing? All the parsing work is already done on the Wikimedia servers; why do it again on a slow desktop system?
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 1:42 PM, Tei <oscar.vi...@gmail.com> wrote:
> ...
> There is a log somewhere of the daily work of the Wikipedia admins ( -:
> http://wikitech.wikimedia.org/view/Server_admin_log
>
> Some of the entries are even very fun, like:
> 02:11 b: CPAN sux
> 01:47 d**: I FOUND HOW TO REVIVE APACHES
> (names obscured to protect the innocent).

Hehe, seeing as there's only like 10 different names on there, it's pretty easy to figure out who B and D are ;-)

-Chad
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 1:49 PM, Chad <innocentkil...@gmail.com> wrote:
> On Tue, Jul 21, 2009 at 1:42 PM, Tei <oscar.vi...@gmail.com> wrote:
>> ...
>> It's not only you; there are lots of people who want to download Wikipedia (sometimes periodically).
>> ...
>
> Hehe, seeing as there's only like 10 different names on there, it's pretty easy to figure out who B and D are ;-)
>
> -Chad

I can't imagine needing to download Wikipedia often for personal use. The amount of work (or should I say pain) involved in getting Wikipedia working... umm, I don't want to do that often. The only reason I'm doing it is that I want a copy of Wikipedia on the go. Finding Wi-Fi hotspots is hard (especially in a subway, LOL). It can save me time, as I can do research anytime I want, anywhere I want, for example on the subway.

I'm not downloading the current static HTML dump because (1) it is very outdated, and (2) it contains a LOT of useless information, hogging up half the space. Space is a big priority, as the English Wikipedia is, what, 300GB uncompressed including the junk. The next Archos PMP, releasing in September, is said to have a 500GB hard drive, but I doubt it, even though I hope so, because I would need 500GB if I'm putting Wikipedia on it (my videos are already taking 220-ish GB on my Archos 5). I'm seriously hoping the next Archos supports NTFS (its compression feature cuts the size by about half). How hard is it to get Linux to support NTFS?

Why would you download Wikipedia? The internet is so readily available, and the online version has images.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 2:20 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> ...
> I'm not downloading the current static HTML dump because (1) it is very outdated, and (2) it contains a LOT of useless information, hogging up half the space.
> ...

I downloaded the static HTML dump for another language to do a MUCH, MUCH smaller-scale test to see if it actually works. It works brilliantly. Even the search function works!! I didn't expect that to work. How does the search function work? I thought it would be like search in Windows; since everything is in RAM, website searches are instantaneous, but I'm running this from a hard drive and it is instantaneous as well.

BTW, does the pages-articles.xml.bz2 version of the XML dump include links to images, even though the images don't exist? I find those pages take up a lot of space as well.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Tue, Jul 21, 2009 at 8:20 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> ...
> Why would you download Wikipedia? The internet is so readily available, and the online version has images.

It obviously doesn't make much sense for end users. It has been discussed before, anyway:
http://www.mail-archive.com/search?q=wikitech-l+torrentl=wikitec...@lists.wikimedia.org

You typically download the whole Wikipedia because you know what you are doing and want to use it in some project (maybe creating a 700 MB CD-ROM version, or doing data mining on the delicious corpus of data that is Wikipedia). I suggest you run a few searches on that interface to find interesting messages. Try searching for "GB dump", "dump torrent", and the like.

--
ℱin del ℳensaje.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Mon, Jul 20, 2009 at 10:00 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
> It seems that reply doesn't work, so I'll send a new message.
>
> Since the static HTML Wikipedia is not updating (please update), and the XML updates like every day, the logical choice is to go with XML. Is there any way to convert XML to HTML, like the static HTML version?

Download MediaWiki, import the dump, and use your wiki to output a static HTML dump. That's the only way I know of (but I haven't ever looked into it).

> I don't have mad computer skills like most of you. I need a simple way (preferably a GUI) to convert XML to HTML.

Unlikely to exist.

> Also, what does the converted XML look like compared to the real Wikipedia? I've used BzReader to open it, and it looks TERRIBLE, without any skin or formatting. Please tell me the converted XML won't look like this and will look like the Wikipedia website.

The XML only contains the wikitext for the pages; it doesn't contain the skin or the rules to convert it to HTML. You need to run it through MediaWiki to get the HTML. Some simpler third-party tools would be able to produce some approximation of the HTML as well, but none reliably.

> If the static HTML Wikipedia does update at some time, what is your preferred method of deleting the user talk, discussion, etc. pages? I tried using Vista's search function to delete all of them with the name "user", etc., but Vista doesn't like deleting millions of files. Even deleting 1 file takes minutes (probably due to the sheer number of folders). Is there a program that can delete more efficiently? Or a program that deletes while searching (finds a page, deletes it, moves on to search for the next file)?

I don't know of efficient GUI deletion utilities on Windows, because I don't need them. Probably nor do most people on what is, after all, a development list and not a user list. (Why would developers be likely to know about GUI tools that are easy to use for non-developers? You'd want to ask people with your skill set, not people with mad computer skills.) On a Unix command line, something of this form will do what you want:

find . -iname 'User:*' -exec rm {} +
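To extend that one-liner to the other discussion namespaces under discussion, the same approach works with a few extra patterns. A sketch, assuming the static dump stores one file per page with the namespace prefix at the start of the file name; the 'User*' and 'Talk*' patterns are illustrative only, since the dump may encode the colon differently, so verify against a small sample with a dry run first:

    # Dry run: count what would be deleted before removing anything.
    find . -type f \( -iname 'User*' -o -iname 'Talk*' \) | wc -l

    # Once the count looks right, delete the matching pages.
    find . -type f \( -iname 'User*' -o -iname 'Talk*' \) -exec rm {} +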
Re: [Wikitech-l] Simple way to convert XML to HTML
. . . I should mention, also, that I believe the one in charge of dumps is Tomasz Finc. You may want to ask him about whether there are plans to resume the static HTML dumps.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Mon, Jul 20, 2009 at 6:41 PM, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
> . . . I should mention, also, that I believe the one in charge of dumps is Tomasz Finc. You may want to ask him about whether there are plans to resume the static HTML dumps.

I tried through Wikipedia mail, and I can't reach him.

How do you use MediaWiki? There are no exe files.
Re: [Wikitech-l] Simple way to convert XML to HTML
On Mon, Jul 20, 2009 at 8:52 PM, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
> On Mon, Jul 20, 2009 at 11:08 PM, Chengbin Zheng <chengbinzh...@gmail.com> wrote:
>> I tried through Wikipedia mail, and I can't reach him.
>>
>> How do you use MediaWiki? There are no exe files.
>
> Based on your posts here, I suspect this will be a difficult process for you. Even if you had experience installing and administering web apps, I don't know how reliably the dumps can be imported by third parties these days. If you're talking about the English Wikipedia, it would probably take a lot of processing time (maybe days, on a typical desktop?) for the dump to actually import, even if it's only the latest version of each page. And even after that, I don't know how easy or reliable it is to export static HTML. You will definitely, at a minimum, have to use a command line, and you will probably run into at least one difficulty that requires debugging.
>
> MediaWiki is not really designed to be installed and administered by users who are only comfortable with GUIs. You could probably install it without too much difficulty, but the documentation for importing the dumps and exporting the static HTML might not be too comprehensible. If you still want to proceed, this page has lengthy instructions on installation:
>
> http://www.mediawiki.org/wiki/Manual:Running_MediaWiki_on_Windows
>
> I haven't imported a dump anywhere in a long time, and I've never exported static HTML, so I can't really help you with those offhand.

Thank you for your answer. Yes, I think it is probably a bad idea. Maybe when I take the computer science course this year I'll get a better understanding. But I definitely don't like using command lines. Even in video encoding, which I have mastered, I prefer using a GUI (well, simply because it is FAR, FAR more convenient); even though I could use the command line, it just takes forever. It took me over a year to master x264 and Avisynth, and I don't want to do that again for this.

I guess I can just hope that the static HTML dumps do update. Meanwhile I need to look for a way to efficiently delete millions of talk and discussion files. Or better, Wikimedia could make a lite version, like the dumps, so I don't have to do it. I'm really tight on space, as I'm putting this on a portable media player (the next Archos PMP, as the Archos 5 I have only has 250GB).
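For anyone who does want to try the import route Aryeh describes, the usual path is MediaWiki's own maintenance scripts. A minimal sketch, assuming a working MediaWiki install and a downloaded pages-articles dump; the file name is the one mentioned earlier in the thread, and for a wiki the size of the English Wikipedia this really can take days, which is why people often reach for faster purpose-built importers such as mwdumper instead:

    # Run from the MediaWiki installation directory. importDump.php reads the
    # XML from stdin; rebuildrecentchanges.php refreshes derived tables that
    # the importer does not update on its own.
    bzcat enwiki-20090713-pages-articles.xml.bz2 | php maintenance/importDump.php
    php maintenance/rebuildrecentchanges.php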
Re: [Wikitech-l] Simple way to convert XML to HTML
Chengbin Zheng wrote:
> On Mon, Jul 20, 2009 at 6:41 PM, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
>> . . . I should mention, also, that I believe the one in charge of dumps is Tomasz Finc. You may want to ask him about whether there are plans to resume the static HTML dumps.
>
> I tried through Wikipedia mail, and I can't reach him.

Looks like either my mail client ate them or those mails never arrived. I've exchanged mails with Tim Starling (the original author/maintainer of static.wikipedia.org) to gauge the level of support and work required to have these running again. It certainly seems doable, but I'm not going to commit to having them in place until the full en history snapshot works. Thinking post-Wikimania 2009 (end of August) here for speccing the return of these to a more maintainable state.

--tomasz
Re: [Wikitech-l] Simple way to convert XML to HTML
On Mon, Jul 20, 2009 at 10:21 PM, Tomasz Finc <tf...@wikimedia.org> wrote:
> ...
> Looks like either my mail client ate them or those mails never arrived. I've exchanged mails with Tim Starling (the original author/maintainer of static.wikipedia.org) to gauge the level of support and work required to have these running again. It certainly seems doable, but I'm not going to commit to having them in place until the full en history snapshot works. Thinking post-Wikimania 2009 (end of August) here for speccing the return of these to a more maintainable state.
>
> --tomasz

Thank you for dropping by and sharing this information with us, Tomasz! It is good just knowing that it is in the queue.

Have you considered making a version of the static HTML Wikipedia with no user talk and discussion pages, which eat up half the space (like the 5GB XML dump for the English Wikipedia)? As in the previous e-mail, it is impossible to delete millions of pages through Windows Vista's search function (I left it overnight, and it ended up eating 1.3GB of RAM and maxing out one of my cores; even deleting a single file took minutes).