On Mon, 16 Nov 2009, Emmanuel Engelhart wrote:

> > These indexes http://ai.cs.utsa.edu/wikipedia0.7/ seem to have been built
> > using categories.
> 
> This dump is one I built (maybe extracted from the ZIM)... but a
> little bit modified. This is a pretty interesting URL; it would be great
> to know exactly how the developer behind it did this... maybe you would
> be able to do the same.

This is his explanation:


This collection of articles is called "beta2" because they were
extracted from wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim at
tmp.kiwix.org.  My distribution has different file names for all the
articles based on their titles and has a title search capability that
only depends on Javascript.  Below is some explanation of the steps I
performed, not necessarily done in this order.

1. I extracted the articles from the zim file by downloading,
compiling and running zimDump from openzim.org.  Compiling zimDump is
nontrivial because it involves downloading and compiling other
packages in versions that will work together.

2. I created three lists with Perl scripts and manual cleaning
afterwards.  

  A. The first list was a list of all articles: zim file name, UTF8
  title, and ASCII title.

  B. The second list was a list of all zim files (articles, images and
  other files, Javascript and CSS): zim file name and target file name
  in my distribution.

  C. The third list was a list for redirecting one zim file name to
  another.  The zim dump creates a lot of empty files in the A
  subdirectory (A contains all the articles).  It turns out that each
  of them needs to be redirected to another article.  The redirects
  can be determined by downloading and running the zimReader program
  for Linux, which can be found at openzim.org.
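
The third list amounts to a map from one zim file name to another.  As
a rough illustration only (this is Python, not the Perl that was
actually used), such a map might be applied as sketched below.  The
function name resolve_redirect is hypothetical, and the handling of
chained or circular redirects is an assumption; the explanation does
not say whether either occurs.  The two sample entries are taken from
the dead article link table in step 3.

```python
def resolve_redirect(name, redirects):
    """Follow redirect entries until reaching a name with no redirect.

    Chains and cycles are handled defensively; whether the real data
    contains either is an assumption, not something the source states.
    """
    seen = set()
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


# Two redirect pairs quoted from the dead article link table.
redirects = {"A/35A": "A/D6N", "A/53Z": "A/CWO"}
print(resolve_redirect("A/35A", redirects))
```

Names with no entry in the map come back unchanged, so the same call
can be used for every link without checking first.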

There appear to be a few duplicate articles (none were deleted), which
I list below (in ASCII) for anyone who is interested:

        Abu Rayhan Biruni
        'Alawi
        Battle of Mohacs
        Beer-Lambert law
        Charismatic movement
        Elian Gonzalez affair
        Ismail Enver
        Ismet Inonu
        Istiklal Marsi
        Izmir Province
        Wikipedia:0.7/0.7geo/Leopold
        Macapa or Macapai
        Maceia or Maceio
        Nicole Vaidisova
        PRIDE Fighting Championships
        War in Afghanistan (2001-present)
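
For anyone who wants to repeat this check, here is a minimal Python
sketch of one way to derive ASCII titles and spot titles that occur
more than once.  It is not the original Perl, and how the duplicates
above were actually found is not stated.  The names ascii_title and
duplicate_titles are hypothetical.  The folding simply strips accents
after Unicode decomposition, so characters with no decomposition are
dropped and the result will not always match the ASCII titles in the
first list.

```python
import unicodedata
from collections import Counter


def ascii_title(title):
    """Strip accents by decomposing, then discarding non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", title)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def duplicate_titles(titles):
    """Return the ASCII titles that occur more than once, sorted."""
    counts = Counter(ascii_title(t) for t in titles)
    return sorted(t for t, n in counts.items() if n > 1)


sample = ["Battle of Mohács", "Battle of Mohacs", "Izmir Province"]
print(duplicate_titles(sample))
```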

3. I used a Perl script to copy all the files from the zim dump to a
staging area, modifying the links along the way.  There are many, many
dead image links (26314 in my count); I changed those links to empty
strings.  There are also some dead article links.  Most of them
correspond to dead image links, but a few of them should have been
redirected; they got added to my third list above.  Here are all the
dead article links and any appropriate redirect for anyone who is
interested.

        A/5ISM  ignore
        A/35A   A/D6N
        A/53Z   A/CWO
        A/5J03  ignore
        A/5J55  ignore
        A/9XO   A/HQO
        A/APD   A/A35
        A/D07   A/9PW
        A/F5G   A/163K
        A/PRL   ignore
        A/TKV   A/S4X
        A/TR4   ignore
        A/VBB   ignore
        A/ZM2   ignore
        A/2QE6  ignore
        A/T3B   A/NQR
        A/11B0  ignore
        A/1NN2  ignore
        A/5ISU  A/4O
        A/5JAZ  ignore
        A/5IV3  ignore
        A/5IXU  ignore
        A/102Z  ignore
        A/1QTM  ignore
        A/5IOB  ignore
        A/5J51  ignore
        A/5IP6  ignore
        A/5JBO  ignore
        A/Y91   ignore
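
The link rewriting in step 3 can be sketched roughly as follows.  This
is a Python illustration under stated assumptions, not the Perl script
that was used: it assumes links appear as double-quoted href or src
attributes, that the second list is available as a dict from zim file
name to target file name, and that the third list is a dict of
redirects.  The function name rewrite_links and the target file name
in the example are hypothetical.

```python
import re

# Assumption: links look like href="A/35A" or src="I/xyz.png".
LINK_RE = re.compile(r'(href|src)="([^"]*)"')


def rewrite_links(html, targets, redirects):
    """Map zim links to target file names; dead links become empty strings."""
    def fix(match):
        attr, link = match.group(1), match.group(2)
        # Leave external URLs and in-page anchors untouched.
        if "://" in link or link.startswith("#"):
            return match.group(0)
        link = redirects.get(link, link)   # apply a redirect, if any
        new = targets.get(link, "")        # no target means a dead link
        return f'{attr}="{new}"'
    return LINK_RE.sub(fix, html)


targets = {"A/D6N": "Some_Article.html"}   # hypothetical target name
redirects = {"A/35A": "A/D6N"}             # pair from the table above
page = '<a href="A/35A">x</a><img src="I/missing.png">'
print(rewrite_links(page, targets, redirects))
```

A regular expression is enough for a sketch like this; a real HTML
parser would be more robust against unusual quoting.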

4. In addition to changing links, I made a few other changes.  Each
article now has a search box for title search.  I took some existing
GPLed Javascript (JSE search engine) and made extensive modifications
for this application.  It only searches the titles; there is no
keyword index, and there is no text search.  The motto of the code is
"Linear Search FTW".  It is surprisingly snappy, though in hindsight,
searching 30000 titles is not a lot for a computer to do.  The results
page is functional, but otherwise not too exciting.
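
To show why a linear scan is enough at this scale, here is a minimal
Python sketch of a title-only search.  The real search box is modified
JSE Javascript; the function name search_titles and the matching rule
(every query word must appear in the title, ignoring case) are
assumptions made for illustration.

```python
def search_titles(query, titles):
    """Return titles containing every query word, scanning the list once."""
    words = query.lower().split()
    if not words:
        return []
    return [t for t in titles if all(w in t.lower() for w in words)]


titles = ["Battle of Mohacs", "Izmir Province", "Beer-Lambert law"]
print(search_titles("battle mohacs", titles))
```

Each query touches every title exactly once, so the cost grows only
linearly with the number of titles.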

I changed the titles of the index pages to something less geeky, e.g.,
"Topical Index: Wikipedia" for the topic index page on Wikipedia.  I
also fixed a number of incorrect links that pointed to the topical index
page instead of the alphabetical index page.

Enjoy,

Tom Bylander
[email protected]
written November 13, 2009
_______________________________________________
dev-l mailing list
[email protected]
https://intern.openzim.org/mailman/listinfo/dev-l
