Some time ago, I posted a message asking for volunteers to create a Wikipedia CD/DVD.
Since then, I have been working on this project and have made some progress, which I will publish as soon as it works as expected.

These are the steps that I am following to process the XML databases:

The Wikipedia XML databases are huge, and we get the best results when they are downloaded directly from:
http://download.wikipedia.org/

The download address for the English Wikipedia XML database from 2010 February 3 is:
http://download.wikimedia.org/enwiki/20100130/enwiki-20100130-pages-articles.xml.bz2

The download address for the Spanish Wikipedia XML database from 2010 February 21 is:
http://download.wikimedia.org/eswiki/20100221/eswiki-20100221-pages-meta-current.xml.bz2

After downloading the compressed XML database, put it inside a folder (not in the disk root) and split it into small bz2 files using bzip2recover:
http://www.bzip.org/downloads.html
http://www.bzip.org/1.0.5/bzip2recover-105-x86-win32.exe

It is easier to deal with many small compressed files than with one humongous text file of more than 25 GB (English XML database) or 5.3 GB (Spanish XML database).

After using bzip2recover to split the English XML database, I get more than 28,000 small (~250 KB) bz2 files. The Spanish XML database yields about 6,800 small bz2 files. Each of these files holds (more or less) a 1 MB segment of the uncompressed database.

Notice that I chose a different kind of file for the Spanish XML database (pages-meta-current) than for the English one (pages-articles). That is because Wikipedia has been unable to solve a problem with its backup of the Spanish XML database:
https://bugzilla.wikimedia.org/show_bug.cgi?id=18694

Alejandro
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
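As a rough illustration of working with the small files that bzip2recover produces, here is a minimal Python sketch (not the code used in this project). It assumes the pieces sit in one folder and match the pattern rec*.bz2, which is how bzip2recover normally names its output; the function names iter_segments and count_titles are hypothetical. Because the pieces are cut at arbitrary byte positions, an XML tag can be split across two files, so the sketch works on raw bytes and carries a few bytes over from one piece to the next.

```python
import bz2
from pathlib import Path

TAG = b"<title>"


def iter_segments(folder, pattern="rec*.bz2"):
    """Yield (path, raw_bytes) for each small bz2 file, in file-name order.

    The "rec*.bz2" pattern is an assumption about how the pieces are named.
    """
    for path in sorted(Path(folder).glob(pattern)):
        with bz2.open(path, "rb") as handle:
            yield path, handle.read()


def count_titles(folder, pattern="rec*.bz2"):
    """Count <title> tags across all pieces, including tags split between files."""
    total = 0
    carry = b""  # tail of the previous piece, shorter than one tag
    for _path, data in iter_segments(folder, pattern):
        chunk = carry + data
        total += chunk.count(TAG)
        # Keep at most len(TAG) - 1 bytes, so a whole tag is never counted twice.
        carry = chunk[-(len(TAG) - 1):]
    return total
```

Calling count_titles("some_folder") on the folder that holds the pieces gives an approximate page count without ever decompressing the whole database into a single file.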
