Re: [Wikitech-l] [mwdumper] new maintainer?

2010-02-12 Thread Bilal Abdul Kader
mwDumper is also essential for anyone willing to replicate a wiki locally for
any purpose. There are alternatives such as xml2SQL or importDump.php, but
mwDumper usually gives the best combination of correctness and completeness,
and sometimes speed as well.
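
For reference, the usual invocation is roughly the following, piping the SQL
output straight into MySQL (the jar name and option spelling depend on the
build you have, so double-check against the MWDumper page):

  $ java -jar mwdumper.jar --format=sql:1.5 enwiki-pages-articles.xml.bz2 | mysql -u wikiuser -p wikidb

The page, revision and text tables should be empty before you start, otherwise
the inserts collide with existing rows.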

bilal
--
Verily, with hardship comes ease.


On Fri, Feb 12, 2010 at 8:46 AM, emman...@engelhart.org wrote:

  On Fri 12/02/10 14:24, Christensen, Courtney christens...@battelle.org
 wrote:
  We use the DumpHTML extension (
 http://www.mediawiki.org/wiki/Extension:DumpHTML) to
  make static copies of our wikis.  It used to be a maintenance script.
  Maybe that would work for you?
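
  Roughly, you check the extension out under extensions/DumpHTML and run its
  script from the command line, something like the following (-d is the
  destination directory, but the exact flags vary between versions, so check
  the extension page first):

   $ php extensions/DumpHTML/dumpHTML.php -d /var/www/static-wiki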

 The DumpHTML extension is something else... it is a tool to get a static
 HTML version of MediaWiki articles.

 If you mean http://static.wikipedia.org/... that is also a different
 topic, because those pages are not our content, only a non-customizable
 view of our content (I can't do anything with it).

 Our content is the wiki code and the files (images, etc.) ... and this is
 what does not seem to be fully reusable at the moment.

 Emmanuel

 PS: DumpHTML also seems not to be maintained currently... have a look at the
 bug reports.


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] importing enwiki into local database

2010-02-04 Thread Bilal Abdul Kader
I am still able to import the dumps using the old mwDumper (modified to fix
the contributor handling), and xml2SQL works as well and is quite fast.
importDump.php can continue after it breaks, I think.

bilal
--
Verily, with hardship comes ease.


On Thu, Feb 4, 2010 at 9:24 PM, Chad innocentkil...@gmail.com wrote:

 On Thu, Feb 4, 2010 at 9:12 PM, Eric Sun e...@cs.stanford.edu wrote:
  Hi,
 
  I saw this thread back in October where someone was having trouble
  importing the English Wikipedia XML dump:
  http://lists.wikimedia.org/pipermail/wikitech-l/2009-October/045594.html
  The thread back in October seemed to end without resolution, and the
  tools still seem to be broken, so has anyone found a solution in the
  meantime?
 
  I'm using mediawiki-1.15.1 and attempting to import
  enwiki-20100130-pages-articles.xml.bz2.
 
  None of these options seem to work:
  1) importDump.php
  fails by spewing Warning: xml_parse(): Unable to call handler in_()
  in ./includes/Import.php on line 437 repeatedly
 
  2) xml2sql (http://meta.wikimedia.org/wiki/Xml2sql):
  Fails with error:
  xml2sql: parsing aborted at line 33 pos 16.
  due to the new redirect tag introduced in the new dumps?
 
  3) mwdumper (http://www.mediawiki.org/wiki/MWDumper):
  Current XML is schema v0.4, but the documentation says that it's for 0.3
 
  4) mwimport (http://meta.wikimedia.org/wiki/Data_dumps/mwimport):
  Fails immediately:
  siteinfo: untested generator 'MediaWiki 1.16alpha-wmf', expect trouble
 ahead
  page: expected closing tag in line 35
 
  Any tips?
  Thanks!
  Eric
 
  ___
  Wikitech-l mailing list
  Wikitech-l@lists.wikimedia.org
  https://lists.wikimedia.org/mailman/listinfo/wikitech-l
 

 Most of these errors are caused by the new(ish) <redirect /> tag
 within page elements. 0.4 is the correct version of the schema,
 but unfortunately the schema was updated, and dumps were
 produced with it, before the changes made it into a release.

 1.15.1 cannot import pages with <redirect />, we should probably
 backport that. That, and we should rewrite the importers to not barf
 terribly when they encounter an unknown element.
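
 A quick way to see whether a given dump has them is to grep a little of the
 stream, e.g.:

  $ bzcat enwiki-20100130-pages-articles.xml.bz2 | grep -m 10 '<redirect'

 If that prints anything, a 1.15 importDump.php will choke on it.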

 -Chad

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] download wikipedia database dump

2010-01-11 Thread Bilal Abdul Kader
If you can download the whole file to your PC, then you can just
import a portion of it and stop the import after some time. The
mwDumper reports the imported pages in increments of 1,000.

If you do not have enough bandwidth to download the whole thing, you
can use the Special:Export feature
(http://en.wikipedia.org/wiki/Special:Export) on the English
Wikipedia and select just the pages you need.
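
For example, fetching the current text of a single article as export XML is
just (the page name here is only an example):

  $ curl -o Albert_Einstein.xml 'http://en.wikipedia.org/wiki/Special:Export/Albert_Einstein'

and the result can be fed to any of the usual import tools. The Special:Export
form also lets you list a whole batch of pages, or add them from a category,
in one go.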

bilal
--
Verily, with hardship comes ease.



On Mon, Jan 11, 2010 at 11:10 AM, OrzzrO orzvs...@gmail.com wrote:
 Hi,
      I want to download the English-language Wikipedia database dump. But
 the whole database dump is 10.1GB, which is too large for me. In fact, I
 only need a part of the database, and any part is fine for me. Can I download
 a smaller database that is a subset of the whole database dump?

    Thanks for your time and help!


    Best  Wishes!

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Bilal Abdul Kader
I think having access to them through the Commons repository is much easier to
handle. A subset should be good enough.

Working with 11 TB of images requires serious research infrastructure just to
store and process them all.

Maybe a special API, or more advanced API functions, would give people enough
access while saving both the bandwidth and the hassle of handling this
behemoth collection.
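
The standard API already gets part of the way there; for instance, something
along these lines pulls the original-file URLs for a single category
(parameters quoted from memory, so verify them against the api.php help):

  $ curl 'http://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Deutsche_Fotothek&gcmnamespace=6&prop=imageinfo&iiprop=url&format=xml'

Doing that for the whole repository, though, is exactly the kind of crawling a
proper bulk interface should replace.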

bilal
--
Verily, with hardship comes ease.


On Fri, Jan 8, 2010 at 1:57 PM, Tomasz Finc tf...@wikimedia.org wrote:

 William Pietri wrote:
  On 01/07/2010 01:40 AM, Jamie Morken wrote:
  I have a
  suggestion for wikipedia!!  I think that the database dumps including
  the image files should be made available by a wikipedia bittorrent
  tracker so that people would be able to download the wikipedia backups
  including the images (which currently they can't do) and also so that
  wikipedia's bandwidth costs would be reduced. [...]
 
 
  Is the bandwidth used really a big problem? Bandwidth is pretty cheap
  these days, and given Wikipedia's total draw, I suspect the occasional
  dump download isn't much of a problem.

 No, bandwidth is not really the problem here. I think the core issue is
 to have bulk access to images.

 There have been a number of these requests in the past and after talking
  back and forth, it has usually been the case that a smaller subset of
 the data works just as well.

 A good example of this was the Deutsche Fotothek archive made late last
 year.

 http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )

 This provided an easily retrievable high quality subset of our image
 data which researchers could use.

 Now if we were to snapshot image data and store them for a particular
 project the amount of duplicate image data would become significant.
 That's because we re-use a ton of image data between projects and
 rightfully so.

 If instead we package all of Commons into a tarball then we get roughly
 6 TB of image data, which after numerous conversations has been a bit more
 than most people want to process.

 So what does everyone think of going down the collections route?

 If we provide enough different and up to date ones then we could easily
 give people a large but manageable amount of data to work with.

 If there is a page already for this then please feel free to point me to
 it otherwise I'll create one.

 --tomasz


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-07 Thread Bilal Abdul Kader
I have been using the dumps for a few months, and I think this kind of dump is
much better than a torrent. Yes, bandwidth can be saved, but I do not think
the cost of bandwidth is higher than the cost of maintaining the
torrents.

If people do not host the files, the value of torrents is limited.

I think regular mirroring is much better but it all depends on the
willingness of people to host the files.

bilal
--
Verily, with hardship comes ease.


On Thu, Jan 7, 2010 at 11:30 AM, Platonides platoni...@gmail.com wrote:

 Jamie Morken wrote:
  Hi,
 
  I have a
  suggestion for wikipedia!!  I think that the database dumps including
  the image files should be made available by a wikipedia bittorrent
  tracker so that people would be able to download the wikipedia backups
  including the images (which currently they can't do) and also so that
  wikipedia's bandwidth costs would be reduced.  I think it is important
  that wikipedia can be downloaded for using it offline now and in the
  future for people.
 
  best regards,
  Jamie Morken

 Has been tried before (when they were smaller).
 How many people do you think will have the necessary space and be
 willing to download it?


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] {{Encyclopédie recherche}}

2009-11-17 Thread Bilal Abdul Kader
Greetings,
This template is not being parsed on my local French wiki. Any hints on
that? I did several searches on Google but could not find the problem.

bilal
--
Verily, with hardship comes ease.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Importing English Wikipeida XML Dumps into MediaWiki

2009-10-09 Thread Bilal Abdul Kader
I have used xml2sql, mwDumper, importDump.php and the Python script to import.
The two fastest are xml2sql and the Python script (xray). The best results
are from importDump.php.
mwDumper is slow but it gives good results.

I have not done any import with the new redirect tag.
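
For the record, by importDump.php I just mean the plain maintenance-script run,
decompressing on the fly:

  $ bzcat enwiki-pages-articles.xml.bz2 | php maintenance/importDump.php

It is slow, but it goes through the normal MediaWiki import code path, which is
probably why the results come out best.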

bilal


On Fri, Oct 9, 2009 at 2:18 PM, O. O. olson...@yahoo.com wrote:

 Andrew Krizhanovsky wrote:
  Hi!
 
  I have got the same redirect problem while importing the dump of
  Russian Wiktionary. :(
 
  Best regards,
  Andrew Krizhanovsky.
 

 So Andrew, do you import using importDump.php, MWDumper or xml2sql? I am
 curious to know what others are using  for their imports. (This is for
 my personal knowledge.)

 It seems that the <redirect /> tags are mostly blank while grepping
 through the English Wikipedia dump. I hope someone can fix this soon.

 Thanks to you guys,
 O. O.


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
Verily, with hardship comes ease.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Wikipedia Google Earth layer

2009-10-02 Thread Bilal Abdul Kader
I think Google applications use data crawled into their own databases, and of
course Google has nearly all the latest updates of Wikipedia articles, with all
their information including the geo data.
bilal


On Fri, Oct 2, 2009 at 1:59 PM, Tei oscar.vi...@gmail.com wrote:

 On Fri, Oct 2, 2009 at 6:15 PM, Roan Kattouw roan.katt...@gmail.com
 wrote:
  2009/10/2 Tei oscar.vi...@gmail.com:
  On Fri, Oct 2, 2009 at 3:37 PM, Strainu strain...@gmail.com wrote:
  ...
  I'm not sure if Wikimedia has anything to do with it, but I think I
  have a better chance of getting an answer here than by asking Google
  (the company) directly. Google (the search engine) was not really
  helpful on the matter.
 
  you could always install Ethereal, and spy on the traffic from your
  computer to the network. It probably includes some HTTP servers, and
  GET / POST requests you can read.
 
  The LiveHTTPHeaders extension for Firefox will also do this job for
  you, and is a bit easier to install and use.
 

 No, really. It is Google Earth we are talking about here. Since it is a
 standalone app, it talks directly over the network.

 --
 --
 ℱin del ℳensaje.

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
Verily, with hardship comes ease.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Public repositories for research dumps

2009-06-23 Thread Bilal Abdul Kader
Hi Felipe,Thanks for the great effort. This will save us hours of
downloading and importing older dumps.

bilal


On Tue, Jun 23, 2009 at 12:26 PM, Felipe Ortega glimmer_phoe...@yahoo.es wrote:


 Hello.

 Since just a few hours ago, a new public repository has been created to
 host WikiXRay database dumps, containing info extracted from public
 Wikipedia dbdumps. The image is hosted by RedIRIS (in short, the Spanish
 equivalent of Kennisnet in Netherlands).

 http://sunsite.rediris.es/mirror/WKP_research

 ftp://ftp.rediris.es/mirror/WKP_research

 These new dumps are aimed at saving other researchers time and effort,
 since they won't need to parse the complete XML dumps to extract all the
 relevant activity metadata. We used mysqldump to create the dumps from our
 databases.

 As of today, only some of the biggest Wikipedias are available. However,
  in the following days the full set of available languages will be ready for
 downloading. The files will be updated regularly.

 The procedure is as follows:

 1. Find the research dump of your interest. Download and decompress it in
 your local system.

 2. Create a local DB to import the information.

 3. Load the dump file, using a MySQL user with insert privileges:

 $ mysql -u user -ppassw myDB < dumpfile.sql

 And you're done.
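
 In other words, a complete session looks roughly like this (the file name
 below is only a placeholder; take whichever language you need from the
 mirror):

  $ wget ftp://ftp.rediris.es/mirror/WKP_research/eswiki_research.sql.gz
  $ gunzip eswiki_research.sql.gz
  $ mysql -u user -p -e 'CREATE DATABASE myDB'
  $ mysql -u user -p myDB < eswiki_research.sql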

 Final warning. 3 fields in the revision table are not reliable yet:

 rev_num_inlinks
 rev_num_outlinks
 rev_num_trans

 All remaining fields/values are trustworthy (in particular rev_len,
 rev_num_words, and so forth).

 Regards,

 Felipe.




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] We're not quite at Google's level

2009-05-15 Thread Bilal Abdul Kader
Sorry, I missed the point in a previous post then. The wording made it sound
like the downtime was being used as a strategy.



On Fri, May 15, 2009 at 12:02 PM, Thomas Dalton thomas.dal...@gmail.com wrote:

 2009/5/15 The Cunctator cuncta...@gmail.com:
  No, and it's stupid. It's not like this is a covert discussion.
 
  On Fri, May 15, 2009 at 11:45 AM, Bilal Abdul Kader bila...@gmail.com
 wrote:
 
  Is it ethical?

 How is it unethical? We take advantage of downtime to explain to our
 readers that we rely on donations to keep the site running, there is
 nothing dishonest about that.

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Downloadable client fonts to aid language script support?

2009-05-05 Thread Bilal Abdul Kader
On Tue, May 5, 2009 at 9:47 AM, Nikola Smolenski smole...@eunet.yu wrote:

 Brion Vibber wrote:
  It might be helpful for some language wikis to link in a free font this
  way, when standard fonts supporting their script are often unavailable.
  Right now on such sites there tends to be a little English link at the
  top such as 'font help' leading to a page like this telling you how to
  download and install a font:
  http://ta.wikipedia.org/wiki/Project:Font_help

 Even more helpful: MediaWiki could determine if a page uses a rare
 character upon save and link to appropriate fonts.

This should be pushed to the client end, I think: even if the page uses a rare
character, the decision to load the font should be the browser's, not something
MediaWiki pushes. Some front-end JS can handle that well.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Extracting pages history error

2009-05-03 Thread Bilal Abdul Kader
Greetings,
I am trying to replicate enwiki locally, but I keep getting a CRC error while
extracting the page history file
(enwiki-latest-pages-meta-history.xml.bz2,
http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history.xml.bz2).
Was anybody able to do so?
I am not sure if the error is at the source (when compressing it) or caused by
the download manager at my end.
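
A quick way to tell the two apart is to verify the file before attempting a
full decompress:

  $ bzip2 -tv enwiki-latest-pages-meta-history.xml.bz2
  $ md5sum enwiki-latest-pages-meta-history.xml.bz2

and compare the checksum against the md5sums file in the dump directory on
download.wikimedia.org; if the checksums match but the bzip2 test still fails,
the problem is at the source rather than in transit.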

bilal
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Skin JS cleanup and jQuery

2009-04-22 Thread Bilal Abdul Kader
There is an issue with running a foreground JS thread that is very fast and
might send a lot of requests to the server. Heavy processing on the client
side would take load off the server (where possible), but it can also push
new load onto the server (as in the presented example of sending emails
to users).

I have worked on an AJAX application that sent email from JavaScript, and it
turned out that the server was denying the JS requests because they went
beyond the allowed limit of connections from a single host.

A better approach might be to start the task on the client side and save it
in a queue on the server side for another (server-side) process to take care
of it later in FIFO order.

On Wed, Apr 22, 2009 at 12:18 PM, Brion Vibber br...@wikimedia.org wrote:


 Perhaps... but note that the i/o for XMLHTTPRequest is asynchronous to
 begin with -- it's really only if you're doing heavy client-side
 _processing_ that you're likely to benefit from a background worker thread.


On 4/17/09 6:45 PM, Marco Schuster wrote:
 You mean...stuff like bots written in Javascript, using the XML API?
 I could imagine also sending mails via Special:Emailuser in the background
 to reach multiple recipients - that's a PITA if you want to send mails to
 multiple users.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Skin JS cleanup and jQuery

2009-04-22 Thread Bilal Abdul Kader
This would be a great idea as the library is always updated and has a lot of
features for the front end.

On Wed, Apr 22, 2009 at 12:28 PM, Brian brian.min...@colorado.edu wrote:

 Many extensions are now using the Yahoo User Interface library. It would be
 nice if mediawiki included it by default.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dealing with Large Files when attempting a wikipedia database download.

2009-04-10 Thread Bilal Abdul Kader
I have downloaded the history dump file (~150 GB) using Firefox on XP and
using wget on Ubuntu, and it worked fine. I have also downloaded it using a
download manager on Vista, and it was fine as well.

A more probable cause is a file system limitation.
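
If the transfer itself keeps breaking, a resumable client helps; either of
these picks up where a partial file left off:

  $ wget -c http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history.xml.bz2
  $ curl -C - -O http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history.xml.bz2

On a FAT32 partition the file will still stop at 4 GB no matter which client
you use, so an NTFS or ext3 file system is the first thing to check.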

bilal


On Fri, Apr 10, 2009 at 3:49 PM, Finne Boonen hen...@gmail.com wrote:

 http://en.wikipedia.org/wiki/Wikipedia_database has some information
 on how to deal with the large files

 henna

 On Fri, Apr 10, 2009 at 21:43, Daniel Kinzler dan...@brightbyte.de
 wrote:
  David Gerard wrote:
  2009/4/10 Jameson Scanlon jameson.scan...@googlemail.com:
 
  Does anyone on the wikitech mailing list happen to know whether it
  would be possible for some of the larger wikipedia database downloads
  (which are, say, 16GB or so in size) to be split into parts so that
  they can be downloaded.  For whatever reason, whenever I have
  attempted to download the ~14GB files (say, from
  http://static.wikipedia.org/downloads/2008-06/en/ ), I have found that
  only 2GB (presumably, the first 2GB) of what I have sought to download
  has actually been downloaded.  Is there anyway around this?  Could
  anyone possibly suggest what possible reasons there might be for this
  difficulty in downloading the material?
 
 
  Downloading to a filesystem that only does maximum 2GB files?
 
 
  Also, several http clients don't like files over 2GB - this is because
 the large
  number of bytes in the Length field causes an integer overflow (2GB is
 the 31
  bit limit). wget likes to die with a segmentation fault on those. I found
 that
  curl works.
 
  But of course, the file system also has to support very large files, as
 Gerard said.
 
  Finally: yes, it would be nice to have such dumps available in pieces of
  perhaps 1GB in size.
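
  In the meantime, anyone who already has a complete copy can publish it in
  chunks with plain split, e.g.:

   $ split -b 1000m enwiki-latest-pages-meta-history.xml.bz2 enwiki-history.part-

  and downloaders simply cat the parts back together before decompressing.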
 
  -- daniel
 
  ___
  Wikitech-l mailing list
  Wikitech-l@lists.wikimedia.org
  https://lists.wikimedia.org/mailman/listinfo/wikitech-l
 

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Bilal Abdul Kader
I have a decent server that is dedicated to a Wikipedia project that
depends on fresh dumps. Can it be used in any way to speed up the process
of generating the dumps?

bilal


On Tue, Jan 27, 2009 at 2:24 PM, Christian Storm st...@iparadigms.com wrote:

  On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
  The current enwiki database dump (
 http://download.wikimedia.org/enwiki/20081008/
  ) has been crawling along since 10/15/2008.
  The current dump system is not sustainable on very large wikis and
  is being replaced. You'll hear about it when we have the new one in
  place. :)
  -- brion

 Following up on this thread:
 http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html

 Brion,

 Can you offer any general timeline estimates (weeks, months, 1/2
 year)?  Are there any alternatives to retrieving the article data
 beyond directly crawling
 the site?  I know this is verboten but we are in dire need of
 retrieving this data and don't know of any alternatives.  The current
 estimate of end of year is
 too long for us to wait.  Unfortunately, wikipedia is a favored source
 for students to plagiarize from which makes out of date content a real
 issue.

 Is there any way to help this process along?  We can donate disk
 drives, developer time, ...?  There is another possibility
 that we could offer but I would need to talk with someone at the
 wikimedia foundation offline.  Is there anyone I could
 contact?

 Thanks for any information and/or direction you can give.

 Christian


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l