I think this is a well-thought-out idea. I'm just going to add a few
comments on Method 1:

* Wikimedia provides page.sql.gz dumps (e.g.
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz).
This table has page_id and page_touched (the latter seems to
correspond to your "last touched"). The file is hefty at 935 MB, largely
because it carries other columns, like page_title. However, with 11
million+ pages you're probably not going to do much better than ~100 MB
anyway (figuring 28 characters per entry, like
"(1234567,'20130407202126'),", and a 30% zip ratio). See the parsing
sketch and the quick arithmetic after these bullets.

* Synchronizing the latest versions will still be time-consuming.
I'd guesstimate something like 50k changed articles per month. I'm
basing this on http://stats.wikimedia.org/EN/TablesWikipediaEN.htm,
which lists 800 new articles per day; I threw in another 800 uniquely
edited pages per day and multiplied by 30 to get a ballpark 48,000,
call it 50k. That corresponds to a monthly churn of 1%-2% of the entire
article namespace (4.1 million pages), which I think is a conservative
percentage. (The arithmetic is spelled out below.)
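
For the first bullet, here is roughly how a client could pull
(page_id, page_touched) pairs straight out of page.sql.gz without loading
it into MySQL. An untested sketch: it assumes page_id is the first field
of each row and page_touched is the 14-digit quoted timestamp, so check
it against the current page table schema (the function and regex names
are mine):

    import gzip
    import re

    # One captured pair per row: the leading page_id and the first quoted
    # 14-digit timestamp (page_touched in the current schema, I believe).
    ROW = re.compile(r"\((\d+),\d+,'(?:[^'\\]|\\.)*'.*?'(\d{14})'")

    def iter_pages(path="enwiki-latest-page.sql.gz"):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith("INSERT INTO"):
                    for page_id, touched in ROW.findall(line):
                        yield int(page_id), touched

    if __name__ == "__main__":
        snapshot = dict(iter_pages())   # page_id -> page_touched
        print(len(snapshot), "pages indexed")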
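
And the back-of-the-envelope numbers behind both bullets, with my assumed
inputs spelled out:

    pages_total   = 11000000               # rows in the page table
    bytes_per_row = 28                     # e.g. "(1234567,'20130407202126'),"
    zip_ratio     = 0.30                   # assumed compression ratio
    list_mb = pages_total * bytes_per_row * zip_ratio / 1e6   # ~92 MB

    changed_per_month = (800 + 800) * 30   # new + edited pages/day, 30 days = 48,000
    churn = changed_per_month / 4100000.0  # ~1.2% of the 4.1M article namespace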

So, assuming this number is somewhat accurate, 50,000 API calls per month
would not be trivial, especially for a user with limited internet
connectivity. That is to say nothing of Wikimedia's servers, which would
need to handle 50k calls per client around that time of the month. In
short, I think synchronizing that many pages would best be served by its
own dump.
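
Here is roughly what those fetches would look like from the client side
(a hedged sketch: the parameters are the standard action=query ones, the
function names and User-Agent string are made up, and the
50-pageids-per-request limit for non-bot clients is from memory, so please
verify it). Even batched at 50 pages per request, 50k changed pages is on
the order of a thousand requests per client, and the content transferred
is the same either way:

    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_batch(page_ids):
        """Fetch the current wikitext of up to ~50 pages in one request."""
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "revisions",
            "rvprop": "content|timestamp",
            "pageids": "|".join(str(i) for i in page_ids),
            "format": "json",
        })
        req = urllib.request.Request(
            API + "?" + params,
            headers={"User-Agent": "offline-wiki-sync-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))["query"]["pages"]

    def fetch_changed(changed_ids, batch_size=50):
        """Yield API results for all changed pages, batch_size IDs at a time."""
        for i in range(0, len(changed_ids), batch_size):
            yield fetch_batch(changed_ids[i:i + batch_size])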

Also, there may be some months where this percentage is much higher. For
example, when Wikipedia switched its interlanguage links over to Wikidata,
I assume that at least 50% of the pages were touched. Granted, this is not
a common occurrence, but as bot activity rises (Wikidata properties for
infoboxes?), it will complicate the sync accordingly.

Hope this helps and good luck with your project.



On Fri, Apr 26, 2013 at 4:27 PM, Kiran Mathew Koshy <
kiranmathewko...@gmail.com> wrote:

> Hi guys,
>
> I have an idea of my own for my GSoC project that I'd like to share with
> you. It's not a perfect one, so please forgive any mistakes.
>
> The project is related to the existing GSoC project "*Incremental Data
> dumps*", but is in no way a replacement for it.
>
>
> *Offline Wikipedia*
>
> For a long time, a lot of offline solutions for Wikipedia have sprung up on
> the internet. All of these have been unofficial solutions, and have
> limitations. A major problem is the *increasing size of the data dumps*,
> and the problem of *updating the local content*.
>
> Consider the situation in a place where internet is costly or unavailable.
> (For the purpose of discussion, let's consider a school in a 3rd world
> country.) Internet speeds are extremely slow, and accessing Wikipedia
> directly from the web is out of the question.
> Such a school would greatly benefit from an instance of Wikipedia on a
> local server. Up to this point, the school can use any of the freely
> available offline Wikipedia solutions to make a local instance. The problem
> arises when the database in the local instance becomes obsolete. The client
> is then required to download an entire new dump (approx. 10 GB in size) and
> load it into the database.
> Another problem is that most 3rd party programs *do not allow network
> access*, so a new instance of the database (approx. 40 GB) is required on
> each installation. For instance, in a school with around 50 desktops, each
> desktop would require its own 40 GB database. Plus, *updating* them becomes
> even more difficult.
>
> So here's my *idea*:
> Modify the existing MediaWiki software and add a few PHP/Python scripts
> which will automatically update the database and will run in the
> background. (Details on how the update is done are described later.)
> Initially, the modified MediaWiki will take an XML dump / SQL dump (SQL
> dump preferred) as input and will create the local instance of Wikipedia.
> Later on, the updates will be added to the database automatically by the
> script.
>
> The installation process is extremely easy: it just requires a server
> package like XAMPP and the MediaWiki bundle.
>
>
> Process of updating:
>
> There will be two methods of updating the server. Both will be implemented
> into the MediaWiki bundle. Method 2 requires the functionality of
> incremental data dumps, so it can be completed only after the functionality
> is available. Perhaps I can collaborate with the student selected for
> incremental data dumps.
>
> Method 1 (online update): A list of all pages is made and published by
> Wikipedia. This can be in an XML format. The only information in the XML
> file will be the page IDs and the last-touched date. This file will be
> downloaded by the MediaWiki bundle, and the page IDs will be compared with
> the pages of the existing local database.
>
> Case 1: A new page ID in the XML file: denotes a new page added.
> Case 2: A page which is present in the local database is not among the
> page IDs: denotes a deleted page.
> Case 3: A page in the local database has a different 'last touched'
> compared to the one in the XML file: denotes an edited page.
>
> In each case, the change is made in the local database, and if the new
> page data is required, it is obtained using the MediaWiki API.
> These offline instances of Wikipedia will only be used in cases where
> internet speeds are very low, so they *won't cause much load on the
> servers*.
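
[A quick aside on the comparison step described above: it is essentially a
set difference between the published list and the local page table. A
minimal sketch, with made-up names, assuming both sides are plain dicts of
page_id -> last_touched:

    def diff_pages(remote, local):
        """remote: the published list; local: the local wiki's page table."""
        new_ids     = remote.keys() - local.keys()       # case 1: fetch via the API
        deleted_ids = local.keys() - remote.keys()       # case 2: delete locally
        changed_ids = {pid for pid in remote.keys() & local.keys()
                       if remote[pid] != local[pid]}     # case 3: re-fetch via the API
        return new_ids, deleted_ids, changed_ids

End of aside.]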
>
> Method 2 (offline update) (requires the functionality of the existing
> project "Incremental data dumps"):
>    In this case, the incremental data dumps are downloaded by the
> user (admin) and fed to the MediaWiki installation the same way the
> original dump is fed (as a normal file), and the corresponding changes are
> made by the bundle. Since I'm not aware of the XML format used in
> incremental updates, I cannot describe it now.
>
> Advantages: An offline solution can be provided for regions where internet
> access is a scarce resource. This would greatly benefit developing nations,
> and would help in making the world's information more freely and openly
> available to everyone.
>
> All comments are welcome !
>
> PS: About me: I'm a 2nd-year undergraduate student at the Indian Institute
> of Technology, Patna. I code for fun.
> Languages: C/C++, Python, PHP, etc.
> Hobbies: CUDA programming, robotics, etc.
>
> --
> Kiran Mathew Koshy
> Electrical Engineering,
> IIT Patna,
> Patna
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
