Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-26 Thread Aryeh Gregor
On Sun, Jul 26, 2009 at 8:17 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 Anyone know how long it takes to create a static HTML dump? A month?

It would depend completely on your hardware.



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-26 Thread Chengbin Zheng
On Sun, Jul 26, 2009 at 8:51 PM, K. Peachey p858sn...@yahoo.com.au wrote:

 On Mon, Jul 27, 2009 at 10:17 AM, Chengbin Zheng chengbinzh...@gmail.com wrote:
  Anyone know how long it takes to create a static HTML dump? A month?
 As in locally on your own systems or for the WMF servers to create it?



WMF servers.

Sorry for not clarifying.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-22 Thread Dmitriy Sintsov
* Tei oscar.vi...@gmail.com [Tue, 21 Jul 2009 19:42:45 +0200]:
 On Tue, Jul 21, 2009 at 7:17 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 ...
 
  No, I know what parsing means. Even if it takes 2 days to parse them,
  wouldn't it be faster than to actually create a static HTML dump the
  traditional way?
 
  If it is not, then what is the difficulty of making static HTML dumps? It
  can't be bandwidth, storage, or speed.
 

 WikiMedia work with limited resources on manpower, hardware, etc..etc...

 Things are done. When? when theres available resources, humans and of
 the other types.
 Is not only you, there are lots of people that want to download the
 wikipedia (sometimes in a periodic fashion)

 There are a log somewhere with the daily work of some wikipedia admin. ( - :
 http://wikitech.wikimedia.org/view/Server_admin_log

 Some of these are even very fun, like in:
 02:11 b: CPAN sux
 01:47 d**: I FOUND HOW TO REVIVE APACHES
 ( names obscured to protect the inocents ).

Speaking of a compact off-line English Wikipedia, I liked the TomeRaider
version:
http://en.wikipedia.org/wiki/TomeRaider
I wish there were newer TR builds, because the English Wikipedia grows
really fast.
Dmitriy



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-22 Thread Chengbin Zheng
On Wed, Jul 22, 2009 at 8:15 AM, Dmitriy Sintsov ques...@rambler.ru wrote:

 * Tei oscar.vi...@gmail.com [Tue, 21 Jul 2009 19:42:45 +0200]:
  On Tue, Jul 21, 2009 at 7:17 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
  ...
  
   No, I know what parsing means. Even if it takes 2 days to parse them,
   wouldn't it be faster than to actually create a static HTML dump the
   traditional way?
  
   If it is not, then what is the difficulty of making static HTML dumps? It
   can't be bandwidth, storage, or speed.
  
 
  WikiMedia work with limited resources on manpower, hardware, etc..etc...
 
  Things are done. When? when theres available resources, humans and of
  the other types.
  Is not only you, there are lots of people that want to download the
  wikipedia (sometimes in a periodic fashion)
 
  There are a log somewhere with the daily work of some wikipedia admin. ( - :
  http://wikitech.wikimedia.org/view/Server_admin_log
 
  Some of these are even very fun, like in:
  02:11 b: CPAN sux
  01:47 d**: I FOUND HOW TO REVIVE APACHES
  ( names obscured to protect the inocents ).
 
 Speaking of compact off-line English Wikipedia I liked the TomeRaider
 version:
 http://en.wikipedia.org/wiki/TomeRaider
 I wish there were newer TR builds, because English Wikipedia grows
 really fast.
 Dmitriy



Yes, the TomeRaider version is exactly the version I want for static
HTML.

Just curious, is pages-articles.xml.bz2
(http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2)
like a TomeRaider version? If not, what's the difference?

And another curiosity: at
http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database, it says the
English Wikipedia database is only 3.3GB. Did they use compression? That
seems awfully small. Even if they did, that's an incredible compression
ratio, similar to 7-Zip; I don't know how you can do that in an eBook format.
NTFS compression only brings the size down by about 50%.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-22 Thread Tei
On Wed, Jul 22, 2009 at 5:48 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
...

 Yes, the TombRaider version is exactly the version I want for static
 HTML.

 Just curious, is pages-articles.xml.bz2
 (http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2)
 like a TombRaider version? If not, what's the difference?

 And another curiosity, at
 http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database, it says the
 English Wikipedia database is only 3.3GB. Did they use compression? That
 seems awfully small. Even if they did, that's an incredible compression
 ratio, similar to 7-zip, I don't know how you can do that on a eBook format.
 NTFS compression only brings size down 50%.

At one point, Brion compressed it to 242 MB.

http://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg00358.html

You may also read this:
 http://en.wikipedia.org/wiki/Solid_compression


-- 
--
ℱin del ℳensaje.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-22 Thread Chengbin Zheng
On Wed, Jul 22, 2009 at 2:37 PM, Tei oscar.vi...@gmail.com wrote:

 On Wed, Jul 22, 2009 at 5:48 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 ...
 
  Yes, the TombRaider version is exactly the version I want for static
  HTML.
 
  Just curious, is pages-articles.xml.bz2
  (http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2)
  like a TombRaider version? If not, what's the difference?
 
  And another curiosity, at
  http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database, it says the
  English Wikipedia database is only 3.3GB. Did they use compression? That
  seems awfully small. Even if they did, that's an incredible compression
  ratio, similar to 7-zip, I don't know how you can do that on a eBook
 format.
  NTFS compression only brings size down 50%.

 At a point, Brion compressed it to 242 MB.

 http://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg00358.html

 You may also read this:
  http://en.wikipedia.org/wiki/Solid_compression


 --
 --
 ℱin del ℳensaje.




I have no doubt that you can compress it to 3.3GB. I'm just curious how
that's possible for an eBook format. At 3.3GB, does it include the skin, the
proper formatting of Wikipedia, etc.?

I'm assuming that the pages-articles.xml.bz2 XML dump includes something
other than just the raw articles? What else is in it?

Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-22 Thread Aryeh Gregor
On Wed, Jul 22, 2009 at 6:37 PM, Tei oscar.vi...@gmail.com wrote:
 At a point, Brion compressed it to 242 MB.

 http://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg00358.html

It looks like it was Platonides, not Brion, and as far as I can tell,
Gregory Maxwell said his compression procedure was broken (i.e.,
inadvertently lossy).

On Wed, Jul 22, 2009 at 7:03 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 I have no doubt that you can compress it to 3.3GB. I'm just curious how
 that's possible for an eBook format.

You just use a very good compression algorithm.  Why can't e-books use 7-Zip?



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-22 Thread Chengbin Zheng
On Wed, Jul 22, 2009 at 6:53 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:

 On Wed, Jul 22, 2009 at 6:37 PM, Tei oscar.vi...@gmail.com wrote:
  At a point, Brion compressed it to 242 MB.
 
  http://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg00358.html

 It looks like it was Platonides, not Brion, and as far as I can tell,
 Gregory Maxwell said his compression procedure was broken (i.e.,
 inadvertently lossy).

 On Wed, Jul 22, 2009 at 7:03 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
  I have no doubt that you can compress it to 3.3GB. I'm just curious how
  that's possible for an eBook format.

 You just use a very good compression algorithm.  Why can't e-books use
 7-Zip?



Because decompression would be so slow it would be unusable (correct me if
I'm wrong).

Even with an excellent compression algorithm, you can't use solid
compression; otherwise decompression would be a major pain. My own testing
shows that solid compression is roughly 5 times more efficient at compressing
Wikipedia than normal compression.
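To illustrate the difference being discussed here, a minimal Python sketch (purely
illustrative; the "articles/*.txt" directory is an assumed layout of plain-text
article files, not anything the dumps actually ship) comparing per-article
compression against solid compression with the standard bz2 module:

import bz2
import glob

# Assumed layout: a local directory of plain-text article files.
articles = [open(path, "rb").read() for path in glob.glob("articles/*.txt")]

# "Normal" compression: each article is compressed independently.
per_article_bytes = sum(len(bz2.compress(text)) for text in articles)

# "Solid" compression: all articles are concatenated into one stream first,
# so the compressor can reuse redundancy shared across articles.
solid_bytes = len(bz2.compress(b"".join(articles)))

print("per-article:", per_article_bytes, "bytes")
print("solid:      ", solid_bytes, "bytes")

The flip side, as noted above, is random access: with a solid archive you generally
have to decompress from the start of the stream to reach a given article, which is
why offline reader formats tend to compress in smaller independent blocks instead.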


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Aryeh Gregor
On Tue, Jul 21, 2009 at 3:33 AM, Kwan Ting Chan k...@ktchan.info wrote:
 I know you want to avoid using command line, but in this case it's really
 much simpler / only feasible choice to search the internet / ask around for
 the right commands and issue that on the command line. It's only going to be
 one line of typing once you've got it, and you can write it down on a piece
 of paper or something for future reference. It's not like you have to learn
 the ins and out of all the commands and its options and what not. (Of
 course, you would want to test it on a small sample to make sure the command
 is correct before you let it loose on the whole dump.)

In my experience, what on Unix is done with generic built-in
command-line utilities can often be done on Windows using
special-purpose GUIs written by third parties (often non-gratis, or
nagware/adware/etc.).  It's obviously a vastly inferior system for
those of us who are happy using command lines, or even who are
accustomed to using open-source GUI software, but it can work.  For
instance, this program provides a function to "delete files that match
custom file name patterns and filters", and has a free trial version:

http://www.microsystools.com/products/multibatcher/

So it's not accurate to say it's only feasible to use a command line here.



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Chengbin Zheng
On Mon, Jul 20, 2009 at 11:33 PM, Kwan Ting Chan k...@ktchan.info wrote:

 Chengbin Zheng wrote:


 Thank you for dropping by and sharing this information with us Tomasz!

 It is good just knowing that it is in the queue. Have you considered
 making
 a version of static HTML Wikipedia where there are no user talk and
 discussion pages that eating up half the space (like the 5GB XML dump for
 English Wikipedia)? As in the previous E-Mail, it is impossible to delete
 millions of pages through Windows Vista's search function (I left it
 overnight, and it ended up eating 1.3GB of RAM and maxing out one of my
 cores. Even deleting a single file took minutes).


 The Windows (and others?) GUI wasn't really designed with what you are
 trying to do in mind in terms of the number of items. You are asking it to
 search for all the files that match your pattern, keep the millions (?) of
 results in memory, and then to show you a windows containing the millions of
 items and to let you do all the magic GUI operations (selecting / dragging
 ...) all the while keeping track of which you've selected / move about etc.

 I know you want to avoid using command line, but in this case it's really
 much simpler / only feasible choice to search the internet / ask around for
 the right commands and issue that on the command line. It's only going to be
 one line of typing once you've got it, and you can write it down on a piece
 of paper or something for future reference. It's not like you have to learn
 the ins and out of all the commands and its options and what not. (Of
 course, you would want to test it on a small sample to make sure the command
 is correct before you let it loose on the whole dump.)

 KTC

 --
 Experience is a good school but the fees are high.
- Heinrich Heine



Actually, I do have to learn everything. I know absolutely nothing about
HTML and all that stuff (maybe I will when I take the computer science course
in grade 10). Think of it this way: you have a radioactive material decay
problem, where you want to find out how much mass is left after 1000 years.
Obviously there is no simple algebraic way of doing it. You must set up a
differential equation and solve it. There is no way to do it if your math
skills are only basic algebra. This is me, and I would have to learn all of
advanced algebra, functions, trigonometry, calculus, and differential
equations to do it.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Lane, Ryan
 Actually, I do have to learn everything. I know absolutely nothing about
 HTML and all the stuff (Maybe I will when I take the computer science course
 in grade 10). Think of it this way, you have a radioactive material decay
 problem, where you want to find out how much mass is left after 1000 years.
 Obviously there is no simple algebraic way of doing it. You must set up a
 differential equation and solve it. There is no way to do it if your math
 skills are only basic algebra. This is me, and I have to learn all of
 advanced algebra, functions, trigonmetry, calculus, and differential
 equation to do it.

If you were able to do x264 from the commandline, this will be a walk in the
park. I've been using the commandline for years and I *much* prefer to use a
GUI to do x264 transcoding.

Using the html exporter from the commandline is fairly simple, and it is
documented on the extension page:

http://www.mediawiki.org/wiki/Extension:DumpHTML

V/r,

Ryan Lane

Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Chengbin Zheng
On Tue, Jul 21, 2009 at 9:37 AM, Lane, Ryan ryan.l...@ocean.navo.navy.mil wrote:

  Actually, I do have to learn everything. I know absolutely
  nothing about
  HTML and all the stuff (Maybe I will when I take the computer
  science course
  in grade 10). Think of it this way, you have a radioactive
  material decay
  problem, where you want to find out how much mass is left
  after 1000 years.
  Obviously there is no simple algebraic way of doing it. You
  must set up a
  differential equation and solve it. There is no way to do it
  if your math
  skills are only basic algebra. This is me, and I have to learn all of
  advanced algebra, functions, trigonmetry, calculus, and differential
  equation to do it.

 If you were able to do x264 from the commandline, this will be a walk in the
 park. I've been using the commandline for years and I *much* prefer to use a
 GUI to do x264 transcoding.

 Using the html exporter from the commandline is fairly simple, and it is
 documented on the extension page:

 http://www.mediawiki.org/wiki/Extension:DumpHTML

 V/r,

 Ryan Lane



I have no idea how to install MediaWiki. This is too difficult and
troublesome. Considering how much pain it is to use x264 from the command line,
I probably don't want to try this. Truthfully, there is not much to x264 on the
command line. But the programs I'm seeing here are, well, complicated, to say
the least. I'm just gonna wait for Wikimedia to update the static HTML, or
bother my computer science teacher, LOL.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Aryeh Gregor
On Tue, Jul 21, 2009 at 11:22 AM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 On a side note, if parsing the XML gets you the static HTML version of
 Wikipedia, why can't Wikimedia just parse it for us and save a lot of our
 time (parsing and learning), and use that as the static HTML dump version?

I'd assume it was a performance issue to parse all the pages for all
the dumps so often.  It might have just used too much CPU to be worth
it at the time.  Parsing some individual pages can take 20 seconds or
more, and there are millions of them (although most much faster to
parse than that).  I'm sure it could be reinstituted with some effort,
though.
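For a sense of scale, a rough back-of-the-envelope sketch (the page count and the
one-second average parse time are my own assumed round numbers, not measurements):

# Assumed round numbers for illustration only.
pages = 3_000_000          # content pages in the English Wikipedia, roughly
seconds_per_page = 1.0     # assumed average parse time per page

cpu_days = pages * seconds_per_page / 86_400
print(f"~{cpu_days:.0f} CPU-days for one full static-HTML pass")  # ~35

Even spread across several machines, that kind of recurring cost for every dump
run is easy to see being deprioritized.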



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Aryeh Gregor
On Tue, Jul 21, 2009 at 1:08 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 Wouldn't parsing it be faster than actually creating that many HTMLs?

Parsing it *is* creating the HTML files.  That's what "parsing" means
in MediaWiki: converting wikitext to HTML.  It's kind of a misnomer,
admittedly.



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Chengbin Zheng
On Tue, Jul 21, 2009 at 1:11 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:

 On Tue, Jul 21, 2009 at 1:08 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
  Wouldn't parsing it be faster than actually creating that many HTMLs?

 Parsing it *is* creating the HTML files.  That's what parsing means
 in MediaWiki, converting wikitext to HTML.  It's kind of a misnomer,
 admittedly.



No, I know what parsing means. Even if it takes 2 days to parse them,
wouldn't it be faster than to actually create a static HTML dump the
traditional way?

If it is not, then what is the difficulty of making static HTML dumps? It
can't be bandwidth, storage, or speed.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Aryeh Gregor
On Tue, Jul 21, 2009 at 1:17 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 No, I know what parsing means. Even if it takes 2 days to parse them,
 wouldn't it be faster than to actually create a static HTML dump the
 traditional way?

I don't know.  I can only speculate.  Whatever it is, it will take
some attention to set it up again, and Tomasz has said he'll do that,
so that's about all there is to say.



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Tei
On Tue, Jul 21, 2009 at 7:17 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
...

 No, I know what parsing means. Even if it takes 2 days to parse them,
 wouldn't it be faster than to actually create a static HTML dump the
 traditional way?

 If it is not, then what is the difficulty of making static HTML dumps? It
 can't be bandwidth, storage, or speed.


Wikimedia works with limited resources: manpower, hardware, etc.

Things get done when there are available resources, human and otherwise.
It's not only you; there are lots of people who want to download
Wikipedia (sometimes on a periodic basis).

There is a log somewhere with the daily work of some Wikipedia admins ( - :
http://wikitech.wikimedia.org/view/Server_admin_log

Some of the entries are even quite fun, like:
02:11 b: CPAN sux
01:47 d**: I FOUND HOW TO REVIVE APACHES
(names obscured to protect the innocent).

-- 
--
ℱin del ℳensaje.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Daniel Schwen
 wouldn't it be faster than to actually create a static HTML dump the
 traditional way?
 The content is wiki-text. It has to be parsed to be turned into HTML. There
 isn't a more traditional way, because there is no other way.

Wouldn't it be possible to dump the parser cache instead of dumping
XML and reparsing? All the parsing work is already done on the
Wikimedia servers, so why do it again on a slow desktop system?



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Chad
On Tue, Jul 21, 2009 at 1:42 PM, Tei oscar.vi...@gmail.com wrote:
 On Tue, Jul 21, 2009 at 7:17 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 ...

 No, I know what parsing means. Even if it takes 2 days to parse them,
 wouldn't it be faster than to actually create a static HTML dump the
 traditional way?

 If it is not, then what is the difficulty of making static HTML dumps? It
 can't be bandwidth, storage, or speed.


 WikiMedia work with limited resources on manpower, hardware, etc..etc...

 Things are done. When? when theres available resources, humans and of
 the other types.
 Is not only you, there are lots of people that want to download the
 wikipedia (sometimes in a periodic fashion)

 There are a log somewhere with the daily work of some wikipedia admin. ( - :
 http://wikitech.wikimedia.org/view/Server_admin_log

 Some of these are even very fun, like in:
 02:11 b: CPAN sux
 01:47 d**: I FOUND HOW TO REVIVE APACHES
 ( names obscured to protect the inocents ).

 --
 --
 ℱin del ℳensaje.


Hehe, seeing as there are only 10 different names on there, it's
pretty easy to figure out who B and D are ;-)

-Chad


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Chengbin Zheng
On Tue, Jul 21, 2009 at 1:49 PM, Chad innocentkil...@gmail.com wrote:

 On Tue, Jul 21, 2009 at 1:42 PM, Tei oscar.vi...@gmail.com wrote:
  On Tue, Jul 21, 2009 at 7:17 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
  ...
 
  No, I know what parsing means. Even if it takes 2 days to parse them,
  wouldn't it be faster than to actually create a static HTML dump the
  traditional way?
 
  If it is not, then what is the difficulty of making static HTML dumps? It
  can't be bandwidth, storage, or speed.
 
 
  WikiMedia work with limited resources on manpower, hardware, etc..etc...
 
  Things are done. When? when theres available resources, humans and of
  the other types.
  Is not only you, there are lots of people that want to download the
  wikipedia (sometimes in a periodic fashion)
 
  There are a log somewhere with the daily work of some wikipedia admin. ( - :
  http://wikitech.wikimedia.org/view/Server_admin_log
 
  Some of these are even very fun, like in:
  02:11 b: CPAN sux
  01:47 d**: I FOUND HOW TO REVIVE APACHES
  ( names obscured to protect the inocents ).
 
  --
  --
  ℱin del ℳensaje.
 

 Hehe, seeing as like there's only 10 different names on there, it's
 pretty easy to figure out who B and D are ;-)

 -Chad



I can't imagine needing to download Wikipedia often for personal use.
The amount of work (or should I say pain) involved in getting Wikipedia working,
umm, I don't want to do that often.

The only reason I'm doing it is that I want a copy of Wikipedia on the go.
Finding Wi-Fi hotspots is hard (especially in a subway, LOL). It can save me
time, as I can do research anytime I want, anywhere I want, for example in
the subway. I'm not downloading the current static HTML dump because

1: It is very outdated.
2: It contains a LOT of useless information, hogging up half the space.
Space is a big priority, as the English Wikipedia is what, 300GB
uncompressed including junk? The next Archos PMP releasing in September is
said to have a 500GB hard drive, but I doubt it, even though I hope so,
because I would need 500GB if I'm putting Wikipedia on it (my videos are
already taking 220-ish GB on my Archos 5). I'm seriously hoping the next Archos
supports NTFS (the compression feature cuts size by about half). How hard is it
to get Linux to support NTFS?

Why would you download Wikipedia? Internet is so readily available, and the
online version has images.

Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Chengbin Zheng
On Tue, Jul 21, 2009 at 2:20 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:



 On Tue, Jul 21, 2009 at 1:49 PM, Chad innocentkil...@gmail.com wrote:

 On Tue, Jul 21, 2009 at 1:42 PM, Tei oscar.vi...@gmail.com wrote:
  On Tue, Jul 21, 2009 at 7:17 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
  ...
 
  No, I know what parsing means. Even if it takes 2 days to parse them,
  wouldn't it be faster than to actually create a static HTML dump the
  traditional way?
 
   If it is not, then what is the difficulty of making static HTML dumps? It
   can't be bandwidth, storage, or speed.
 
 
  WikiMedia work with limited resources on manpower, hardware, etc..etc...
 
  Things are done. When? when theres available resources, humans and of
  the other types.
  Is not only you, there are lots of people that want to download the
  wikipedia (sometimes in a periodic fashion)
 
   There are a log somewhere with the daily work of some wikipedia admin. ( - :
   http://wikitech.wikimedia.org/view/Server_admin_log
 
  Some of these are even very fun, like in:
  02:11 b: CPAN sux
  01:47 d**: I FOUND HOW TO REVIVE APACHES
  ( names obscured to protect the inocents ).
 
  --
  --
  ℱin del ℳensaje.
 

 Hehe, seeing as like there's only 10 different names on there, it's
 pretty easy to figure out who B and D are ;-)

 -Chad



 I can't imagine the need of downloading Wikipedia often for personal use.
 The amount of work (or should I say pain) involved to get Wikipedia working,
 umm, I don't want to do that often.

 The only reason I'm doing it is I want a copy of Wikipedia on the go.
 Finding Wi-Fi hotspots is hard (especially in a subway, LOL). It can save me
 time, as I can do research anytime I want, anywhere I want, for example in
 the subway. I'm not downloading the current static HTML dump because

 1: It is very outdated.
 2: It contains a LOT of useless information, hogging up half the space.
 Space is a big priority, as the English Wikipedia is what, 300GB
 uncompressed including junk. The next Archos PMP releasing in September is
 said to have a 500GB hard drive, but I doubt it, even though I hope so,
 because I would need 500GB if I'm putting Wikipedia on it (my videos are
 taking 220ish GB already on my Archos 5). Seriously hoping the next Archos
 supports NTFS (compression feature, cuts size by about half). How hard is it
 to get Linux to support NTFS?

 Why would you download Wikipedia? Internet is so readily available, and the
 online version has images.



I downloaded the static HTML dump for another language to do a MUCH MUCH
smaller-scale test to see if it actually works. It works brilliantly. Even
the search function works!! I didn't expect that to work. How does the
search function work? I thought it would be like search in Windows, but since
everything is in RAM, website searches are instantaneous. I'm running this
from a hard drive, and it is instantaneous as well.

BTW, the pages-articles.xml.bz2 version of the XML dump, does it include
links to images, even though the images themselves aren't included? I find
those pages take up a lot of space as well.

Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-21 Thread Tei
On Tue, Jul 21, 2009 at 8:20 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
..
 Why would you download Wikipedia? Internet is so readily available, and the
 online version has images.

It obviously doesn't make much sense for end users.

It has been discussed before anyway:
http://www.mail-archive.com/search?q=wikitech-l+torrentl=wikitec...@lists.wikimedia.org

You typically download the whole Wikipedia because you know what you are
doing, and want to use it in some project (maybe creating a 700 MB CD-ROM
version, or doing data mining on the delicious corpus of data that is
Wikipedia).

I suggest you run a few searches on that interface to find
interesting messages.  Try searching for "GB dump", "dump torrent", and
the like.





-- 
--
ℱin del ℳensaje.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-20 Thread Aryeh Gregor
On Mon, Jul 20, 2009 at 10:00 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
 It seems that reply doesn't work. So I'll send a new message.
 Since the static HTML Wikipedia is not updating (please update), and XML
 updates like everyday, the logical choice is to go with XML. Is there any
 way to convert XML to HTML, like the static HTML version?

Download MediaWiki, import the dump, and use your wiki to output a
static HTML dump.  That's the only way I know of (but I haven't ever
looked into it).

 I don't have mad computer skills like most of you. I need a simple way
 (preferably a GUI) to convert XML to HTML.

Unlikely to exist.

 Also, how does the converted XML
 look like compared to the real Wikipedia? I've use Bzreader to open it, and
 it looks TERRIBLE, without any skin or format organization. Please tell me
 the converted XML won't look like this, and looks like the Wikipedia
 website.

The XML only contains the wikitext for the pages; it doesn't contain
the skin or the rules to convert it to HTML.  You need to run it through
MediaWiki to get the HTML.  Some simpler third-party tools would be
able to produce some approximation of the HTML as well, but none
reliably.
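To see concretely what the pages-articles dump contains, here is a minimal Python
sketch (the file name is just a placeholder for a locally downloaded, decompressed
dump) that streams through it and prints each page title with the start of its raw
wikitext:

import xml.etree.ElementTree as ET

# Placeholder path: a locally downloaded and decompressed pages-articles dump.
DUMP = "enwiki-pages-articles.xml"

def local(tag):
    # Dump elements carry an XML namespace; compare on the local name only.
    return tag.rsplit("}", 1)[-1]

for _, elem in ET.iterparse(DUMP, events=("end",)):
    if local(elem.tag) == "page":
        title = next((e.text for e in elem.iter() if local(e.tag) == "title"), "")
        text = next((e.text for e in elem.iter() if local(e.tag) == "text"), "") or ""
        print(title, "->", text[:60].replace("\n", " "))  # raw wikitext, not HTML
        elem.clear()  # discard the processed page to keep memory use flat

The output is wikitext markup ('''bold''', [[links]], {{templates}}), which is
exactly why the dump looks nothing like the rendered site until MediaWiki's parser
has been run over it.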

 If the static HTML Wikipedia does update at some time, what are your
 preferred method of deleting the user talk, discussion, etc pages? I tried
 using Vista's search function and delete all of them with the name user,
 etc. But Vista doesn't like deleting millions of files. Even deleting 1 file
 takes minutes (probably due to the sheer number of folders). Is there like a
 program that can delete more efficiently? Or a program that deletes while
 searching (like finds a page, delete it, move on to search for the next
 file).

I don't know of efficient GUI deletion utilities on Windows, because I
don't need them.  Probably neither do most people on what is, after all, a
development list and not a user list.  (Why would developers be likely
to know about GUI tools that are easy to use for non-developers?
You'd want to ask people with your skill set, not people with mad
computer skills.)  On a Unix command line, something of this form will
do what you want:

find -iname 'User:*' -exec rm {} +
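For anyone stuck on Windows without find and rm, a rough equivalent in Python (a
sketch only: the directory path and namespace prefixes are placeholders to adapt
to however the unpacked dump actually names its files):

import os

ROOT = r"C:\wikipedia-static"              # placeholder: where the dump was unpacked
PREFIXES = ("User", "User_talk", "Talk")   # placeholder: adjust to the dump's file naming

removed = 0
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        # Delete files whose names start with one of the unwanted namespace prefixes.
        if name.startswith(PREFIXES):
            os.remove(os.path.join(dirpath, name))
            removed += 1
print(removed, "files removed")

Walking and deleting file by file avoids building a list of millions of search
results in memory, which is what made the Vista search approach collapse.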



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-20 Thread Aryeh Gregor
. . . I should mention, also, that I believe the one in charge of
dumps is Tomasz Finc.  You may want to ask him about whether there are
plans to resume the static HTML dumps.



Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-20 Thread Chengbin Zheng
On Mon, Jul 20, 2009 at 6:41 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:

 . . . I should mention, also, that I believe the one in charge of
 dumps is Tomasz Finc.  You may want to ask him about whether there are
 plans to resume the static HTML dumps.




I tried through Wikipedia mail, and I can't reach him.

How do you use MediaWiki? There are no exe files.


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-20 Thread Chengbin Zheng
On Mon, Jul 20, 2009 at 8:52 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:

 On Mon, Jul 20, 2009 at 11:08 PM, Chengbin Zheng chengbinzh...@gmail.com wrote:
  I tried through Wikipedia mail, and I can't reach him.
 
  How do you use mediawiki? There are no exe files.

 Based on your posts here, I suspect this will be a difficult process
 for you.  Even if you had experience installing and administering web
 apps, I don't know how reliably the dumps can be imported by third
 parties these days.  If you're talking about the English Wikipedia, it
 would probably take a lot of processing time (maybe days, on a typical
 desktop?) for the dump to actually import, even if it's only the
 latest version of each page.  And even after that, I don't know how
 easy or reliable it is to export static HTML.

 You will definitely, at a minimum, have to use a command line, and
 probably will run into at least one difficulty that will require
 debugging.  MediaWiki is not really designed to be installed and
 administered by users who are only comfortable with GUIs.  You could
 probably install it without too much difficulty, but the documentation
 for importing the dumps and exporting the static HTML might not be too
 comprehensible.

 If you still want to proceed, this page has lengthy instructions on
 installation:

 http://www.mediawiki.org/wiki/Manual:Running_MediaWiki_on_Windows

 I haven't imported a dump anywhere in a long time, and I've never
 exported static HTML, so I can't really help you with those offhand.




Thank you for your answer.

Yes, I think it is probably a bad idea. Maybe when I take the computer
science course this year I'll get a better understanding.

But definitely, I don't like using command lines. Even in video encoding,
which I've mastered, I prefer using a GUI (well, simply because it is FAR FAR
more convenient). Even though I could use the command line, it takes forever. It
took me over a year to master x264 and Avisynth. I don't want to do that again
for this.

I guess I can just hope that the static HTML dumps do update. Meanwhile I
need to look for a way to efficiently delete millions of talk and discussion
files. Or better, Wikimedia could make a lite version, like the dumps, so I
don't have to do it. I'm really tight on space, as I'm putting this on a
portable media player (the next Archos PMP, as the Archos 5 I have only has
250GB).


Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-20 Thread Tomasz Finc
Chengbin Zheng wrote:
 On Mon, Jul 20, 2009 at 6:41 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
 
 . . . I should mention, also, that I believe the one in charge of
 dumps is Tomasz Finc.  You may want to ask him about whether there are
 plans to resume the static HTML dumps.


 
 
 I tried through Wikipedia mail, and I can't reach him.

Looks like either my mail client ate them or those mails never arrived.

I've exchanged mails with Tim Starling (original author/maintainer) of
static.wikipedia.org to gauge the level of support and work required to
have these running again. It certainly seems doable, but I'm not going to
commit to having them in place until the full en history snapshot works.
Thinking post-Wikimania 2009 (end of August) here for speccing the
return of these to a more maintainable state.

--tomasz




Re: [Wikitech-l] Simple way to convert XML to HTML

2009-07-20 Thread Chengbin Zheng
On Mon, Jul 20, 2009 at 10:21 PM, Tomasz Finc tf...@wikimedia.org wrote:

 Chengbin Zheng wrote:
  On Mon, Jul 20, 2009 at 6:41 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
 
  . . . I should mention, also, that I believe the one in charge of
  dumps is Tomasz Finc.  You may want to ask him about whether there are
  plans to resume the static HTML dumps.
 
 
 
 
  I tried through Wikipedia mail, and I can't reach him.

 Looks like either my mail client ate them or those mails never arrived.

 I've exchanged mails with Tim Starling(original author/maintainer) of
 static.wikipedia.org to gauge the level of support and work required to
 have these running again. It certainly seems doable but I'm not going to
 commit to having them in place until the full en history snapshot works.
 Thinking post Wikimania 2009 (end of August) here for specking the
 return of these to a more maintainable state.

 --tomasz





Thank you for dropping by and sharing this information with us, Tomasz!

It is good just knowing that it is in the queue. Have you considered making
a version of the static HTML Wikipedia with no user talk and
discussion pages, which eat up half the space (like the 5GB XML dump for
the English Wikipedia)? As I said in a previous e-mail, it is impossible to delete
millions of pages through Windows Vista's search function (I left it
overnight, and it ended up eating 1.3GB of RAM and maxing out one of my
cores; even deleting a single file took minutes).