[Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Dawson
Hello, I have a couple of MediaWiki installations on two different slices at
Slicehost. Both slices run other websites with no speed problems; however, the
MediaWiki installs themselves run like dogs!
http://wiki.medicalstudentblog.co.uk/ Any ideas what to look for, or ways to
optimise them? I still can't get over that they need a 100 MB ini_set in the
settings just to load, due to the messages or something.
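
(For what it's worth, that presumably refers to raising PHP's memory_limit in
LocalSettings.php, i.e. something like the line below; the 100M figure is just
an example value, not a recommendation.)

ini_set( 'memory_limit', '100M' );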

Thank you, Dawson
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Aryeh Gregor
On Tue, Jan 27, 2009 at 5:31 AM, Dawson costel...@gmail.com wrote:
 Hello, I have a couple of MediaWiki installations on two different slices at
 Slicehost. Both slices run other websites with no speed problems; however, the
 MediaWiki installs themselves run like dogs!
 http://wiki.medicalstudentblog.co.uk/ Any ideas what to look for, or ways to
 optimise them? I still can't get over that they need a 100 MB ini_set in the
 settings just to load, due to the messages or something.

If you haven't already, you should set up an opcode cache like APC or
XCache, and a variable cache like APC or XCache (if using one
application server) or memcached (if using multiple application
servers).  Those are essential for decent performance.  If you want
really snappy views, at least for logged-out users, you should use
Squid too, although that's probably overkill for a small site.  It
also might be useful to install wikidiff2 and use that for diffs.
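
For reference, the LocalSettings.php side of that advice might look roughly
like the sketch below; pick one of the two cache blocks, treat the memcached
address as a placeholder, and note that the last line assumes the wikidiff2
PHP extension is actually installed:

// Variable cache via APC/XCache on a single application server:
$wgMainCacheType = CACHE_ACCEL;

// ...or via memcached when there are multiple application servers:
# $wgMainCacheType = CACHE_MEMCACHED;
# $wgMemCachedServers = array( '127.0.0.1:11211' );

// Use the wikidiff2 extension for diffs:
$wgExternalDiffEngine = 'wikidiff2';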

Of course, none of this works if you don't have root access.  (Well,
maybe you could get memcached working with only shell . . .)  In that
case, I'm not sure what advice to give.

MediaWiki is a big, slow package, though.  For large sites, it has
scalability features that are almost certainly unparalleled in any
other wiki software, but it's probably not optimized as much for quick
loading on small-scale, cheap hardware.  It's mainly meant for
Wikipedia.  If you want to try digging into what's taking so long, you
can try enabling profiling:

http://www.mediawiki.org/wiki/Profiling#Profiling

If you find something that helps a lot, it would be helpful to mention it.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] hosting wikipedia

2009-01-27 Thread Stephen Dunn
Hi Folks:

I am a newbie, so I apologize if I am asking basic questions. How would I go 
about hosting Wiktionary and allowing search queries via the web using 
OpenSearch? I am having trouble finding info on how to set this up. Any 
assistance is greatly appreciated.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Thomas Dalton
2009/1/27 Stephen Dunn swd...@yahoo.com:
 Hi Folks:

 I am a newbie, so I apologize if I am asking basic questions. How would I go 
 about hosting Wiktionary and allowing search queries via the web using 
 OpenSearch? I am having trouble finding info on how to set this up. Any 
 assistance is greatly appreciated.

Why do you want to host Wiktionary? It's already hosted at
en.wiktionary.org. And do you mean Wiktionary (as you said in the body
of your email) or Wikipedia (as you said in the subject line)? Or do
you actually mean your own wiki, unrelated to either of those?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Stephen Dunn
I am working on a project to host Wiktionary on one web page and Wikipedia on 
another. So both, sorry.



- Original Message 
From: Thomas Dalton thomas.dal...@gmail.com
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Sent: Tuesday, January 27, 2009 12:43:49 PM
Subject: Re: [Wikitech-l] hosting wikipedia

2009/1/27 Stephen Dunn swd...@yahoo.com:
 Hi Folks:

 I am a newbie, so I apologize if I am asking basic questions. How would I go 
 about hosting Wiktionary and allowing search queries via the web using 
 OpenSearch? I am having trouble finding info on how to set this up. Any 
 assistance is greatly appreciated.

Why do you want to host Wiktionary? It's already hosted at
en.wiktionary.org. And do you mean Wiktionary (as you said in the body
of your email) or Wikipedia (as you said in the subject line)? Or do
you actually mean your own wiki, unrelated to either of those?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Jason Schulz
To use filecache, you need to set $wgShowIPinHeader = false;

Also, see 
http://www.mediawiki.org/wiki/User:Aaron_Schulz/How_to_make_MediaWiki_fast
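
In LocalSettings.php terms, a minimal file-cache sketch along those lines (the
cache directory is an example path that must be writable by the web server,
and $wgUseGzip is an optional extra, not something the note above requires):

$wgUseFileCache = true;
$wgFileCacheDirectory = "$IP/cache"; // example location
$wgShowIPinHeader = false;           // required for the file cache, as noted above
$wgUseGzip = true;                   // optional: store and serve gzipped copies
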
-Aaron

--
From: Dawson costel...@gmail.com
Sent: Tuesday, January 27, 2009 9:52 AM
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] MediaWiki Slow, what to look for?

 Modified config file as follows:

 $wgUseDatabaseMessages = false;
 $wgUseFileCache = true;
 $wgMainCacheType = "CACHE_ACCEL";

 I also installed xcache and eaccelerator. The improvement in speed is 
 huge.

 2009/1/27 Aryeh Gregor simetrical+wikil...@gmail.com


 On Tue, Jan 27, 2009 at 5:31 AM, Dawson costel...@gmail.com wrote:
  Hello, I have a couple of MediaWiki installations on two different slices
  at Slicehost. Both slices run other websites with no speed problems;
  however, the MediaWiki installs themselves run like dogs!
  http://wiki.medicalstudentblog.co.uk/ Any ideas what to look for, or ways
  to optimise them? I still can't get over that they need a 100 MB ini_set
  in the settings just to load, due to the messages or something.

 If you haven't already, you should set up an opcode cache like APC or
 XCache, and a variable cache like APC or XCache (if using one
 application server) or memcached (if using multiple application
 servers).  Those are essential for decent performance.  If you want
 really snappy views, at least for logged-out users, you should use
 Squid too, although that's probably overkill for a small site.  It
 also might be useful to install wikidiff2 and use that for diffs.

 Of course, none of this works if you don't have root access.  (Well,
 maybe you could get memcached working with only shell . . .)  In that
 case, I'm not sure what advice to give.

 MediaWiki is a big, slow package, though.  For large sites, it has
 scalability features that are almost certainly unparalleled in any
 other wiki software, but it's probably not optimized as much for quick
 loading on small-scale, cheap hardware.  It's mainly meant for
 Wikipedia.  If you want to try digging into what's taking so long, you
 can try enabling profiling:

 http://www.mediawiki.org/wiki/Profiling#Profiling

 If you find something that helps a lot, it would be helpful to mention 
 it.

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
 

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Stephen Dunn
Yes, website. So a web page has a search box that passes the input to 
Wiktionary, and results are provided on a results page. An example may be 
reference.com.



- Original Message 
From: Thomas Dalton thomas.dal...@gmail.com
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Sent: Tuesday, January 27, 2009 12:50:18 PM
Subject: Re: [Wikitech-l] hosting wikipedia

2009/1/27 Stephen Dunn swd...@yahoo.com:
 I am working on a project to host Wiktionary on one web page and Wikipedia on 
 another. So both, sorry.

You mean web *site*, surely? They are both far too big to fit on a
single page. I think you need to work out precisely what it is you're
trying to do before we can help you.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Thomas Dalton
2009/1/27 Stephen Dunn swd...@yahoo.com:
 Yes, website. So a web page has a search box that passes the input to 
 Wiktionary, and results are provided on a results page. An example may be 
 reference.com.

How would this differ from the search box on en.wiktionary.org? What
are you actually trying to achieve?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Tei
Maybe this is what this guy needs:

<form name="bodySearch" id="bodySearch" class="bodySearch"
action="http://en.wiktionary.org/wiki/Special:Search">
<input type="text" name="search" size="50" id="bodySearchInput" />
<input type="submit" name="go" value="Define" />
</form>

test:
http://zerror.com/unorganized/wika/test.htm

It doesn't seem that Wiktionary blocks external searches at the moment (via
the Referer header), but they may change the policy, or the required
parameters, in the future.

On Tue, Jan 27, 2009 at 7:18 PM, Stephen Dunn swd...@yahoo.com wrote:
 refer to the reference.com website and do a search

  Yes, website. So a web page has a search box that passes the input to 
  Wiktionary, and results are provided on a results page. An example may be 
  reference.com.

 How would this differ from the search box on en.wiktionary.org? What
 are you actually trying to achieve?



-- 
--
ℱin del ℳensaje.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Jeff Ferland
I'll try to weigh in with a bit of useful information, but it probably  
won't help that much.

You'll need a quite impressive machine to host even just the current  
revisions of the wiki. Expect to expend 10s to even hundreds of  
gigabytes on the database alone for Wikipedia using only the current  
versions.

There are instructions for how to load the data; they can be found by
googling "wikipedia dump".

Several others have inquired for more information about your goal, and  
I'm going to echo that. The mechanics of hosting this kind of data  
(volume, really) are highly related to the associated task.

This data used for academic research would be handled differently than
for a live website, for example.

Nobody likes to be told they can't do something, or to get a bunch of
useless responses to a request for help. Very sincerely, though, unless
you find enough information from the dump instruction pages to point you
in the right direction and are able to ask more specific questions, you
are in over your head. Your solution at that point would be to hire
somebody.

Sent from my phone,
Jeff

On Jan 27, 2009, at 12:34 PM, Stephen Dunn swd...@yahoo.com wrote:

 Hi Folks:

 I am a newbie, so I apologize if I am asking basic questions. How 
 would I go about hosting Wiktionary and allowing search queries via the 
 web using OpenSearch? I am having trouble finding info on how to set 
 this up. Any assistance is greatly appreciated.

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Christian Storm
 On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
 The current enwiki database dump 
 (http://download.wikimedia.org/enwiki/20081008/ 
 ) has been crawling along since 10/15/2008.
 The current dump system is not sustainable on very large wikis and  
 is being replaced. You'll hear about it when we have the new one in  
 place. :)
 -- brion

Following up on this thread:  
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html

Brion,

Can you offer any general timeline estimates (weeks, months, 1/2  
year)?  Are there any alternatives to retrieving the article data  
beyond directly crawling
the site?  I know this is verboten but we are in dire need of  
retrieving this data and don't know of any alternatives.  The current  
estimate of end of year is
too long for us to wait.  Unfortunately, Wikipedia is a favored source
for students to plagiarize from, which makes out-of-date content a real
issue.

Is there any way to help this process along?  We can donate disk  
drives, developer time, ...?  There is another possibility
that we could offer, but I would need to talk with someone at the
Wikimedia Foundation offline.  Is there anyone I could
contact?

Thanks for any information and/or direction you can give.

Christian


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Steve Summit
Jeff Ferland wrote:
 You'll need a quite impressive machine to host even just the current  
 revisions of the wiki. Expect to expend 10s to even hundreds of  
 gigabytes on the database alone for Wikipedia using only the current  
 versions.

No, no, no.  You're looking at it all wrong.  That's the sucker's
way of doing it.

If you're smart, you put up a simple page with a text box labeled
Wikipedia search, and whenever someone types a query into
the box and submits it, you ship the query over to the Wikimedia
servers, and then slurp back the response, and display it back
to the original submitter.  That way only Wikimedia has to worry
about all those pesky gigabyte-level database hosting requirements,
while you get all the glory.

This appears to be what the questioner is asking about.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Bilal Abdul Kader
I have a decent server that is dedicated to a Wikipedia project that
depends on the fresh dumps. Can this be used in any way to speed up the
process of generating the dumps?

bilal


On Tue, Jan 27, 2009 at 2:24 PM, Christian Storm st...@iparadigms.com wrote:

  On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
  The current enwiki database dump (
 http://download.wikimedia.org/enwiki/20081008/
  ) has been crawling along since 10/15/2008.
  The current dump system is not sustainable on very large wikis and
  is being replaced. You'll hear about it when we have the new one in
  place. :)
  -- brion

 Following up on this thread:
 http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html

 Brion,

 Can you offer any general timeline estimates (weeks, months, 1/2
 year)?  Are there any alternatives to retrieving the article data
 beyond directly crawling
 the site?  I know this is verboten but we are in dire need of
 retrieving this data and don't know of any alternatives.  The current
 estimate of end of year is
 too long for us to wait.  Unfortunately, wikipedia is a favored source
 for students to plagiarize from which makes out of date content a real
 issue.

 Is there any way to help this process along?  We can donate disk
 drives, developer time, ...?  There is another possibility
 that we could offer but I would need to talk with someone at the
 wikimedia foundation offline.  Is there anyone I could
 contact?

 Thanks for any information and/or direction you can give.

 Christian


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Make upload headings changeable

2009-01-27 Thread Chad
On Mon, Jan 26, 2009 at 12:44 PM, Ilmari Karonen nos...@vyznev.net wrote:

 Chad wrote:
  I was going to provide a specific parameter for it. That entire key sucks
  though anyway, I should probably ditch the md5()'d URL in favor of using
  the actual name. Fwiw: I've got a patch working, but I'm not quite ready
  to commit it yet. While we're at it, are we sure we want to use $wgLang
 and
  not $wgContLang? Image description pages are content, not a part of
  the interface. That being said, I would think it would be best to fetch
 the
  information using the wiki's content language.

 Well, if you actually visit the description page on Commons, you'll see
 the templates in your interface language -- that's kind of the _point_
 of the autotranslated templates.

 Then again, Commons is kind of a special case, since, being a
 multilingual project, it doesn't _have_ a real content language; in a
 technical sense its content language is English, but that's only because
 MediaWiki requires one language to be specified as a content language
 even if the actual content is multilingual.  So I can see arguments
 either way.

 What language is the shareduploadwiki-desc message shown in, anyway?
 Seems to be $wgLang, which would seem to suggest that the actual
 description should be shown in the interface language too, for consistency.

 --
 Ilmari Karonen


Should be done with a wiki's content language as of  r46372.

-Chad
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Make upload headings changeable

2009-01-27 Thread Marcus Buck
Chad wrote:
 Should be done with a wiki's content language as of  r46372.

 -Chad
Thanks! That's already a big improvement, but why content language? As I 
pointed out in response to your question, it needs to be user language 
on Meta, Incubator, Wikispecies, Beta Wikiversity, old Wikisource, and 
all the multilingual wikis of third-party users. It's not actually 
necessary on non-multilingual wikis, but it does no harm either. So why 
content language?
This could be solved with a setting in LocalSettings.php like 
"isMultilingual", but that's a separate matter, and as long as it does 
not exist, we should use user language.

Marcus Buck

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
The problem, as I understand it (and Brion may come by to correct me)
is essentially that the current dump process is designed in a way that
can't be sustained given the size of enwiki.  It really needs to be
re-engineered, which means that developer time is needed to create a
new approach to dumping.

The main target for improvement is almost certainly parallelizing the
process, so that there wouldn't be a single monolithic dump process but
rather a lot of little processes working in parallel.  That would also
ensure that if a single process gets stuck and dies, the entire dump
doesn't need to start over.


By way of observation, the dewiki's full history dumps in 26 hours
with 96% prefetched (i.e. loaded from previous dumps).  That suggests
that even starting from scratch (prefetch = 0%) it should dump in ~25
days under the current process.  enwiki is perhaps 3-6 times larger
than dewiki depending on how you do the accounting, which implies
dumping the whole thing from scratch would take ~5 months if the
process scaled linearly.  Of course it doesn't scale linearly, and we
end up with a prediction for completion that is currently 10 months
away (which amounts to a 13 month total execution).  And of course, if
there is any serious error in the next ten months the entire process
could die with no result.


Whether we want to let the current process continue to try and finish
or not, I would seriously suggest someone look into redumping the rest
of the enwiki files (i.e. logs, current pages, etc.).  I am also among
the people that care about having reasonably fresh dumps and it really
is a problem that the other dumps (e.g. stubs-meta-history) are frozen
while we wait to see if the full history dump can run to completion.

-Robert Rohde


On Tue, Jan 27, 2009 at 11:24 AM, Christian Storm st...@iparadigms.com wrote:
 On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
 The current enwiki database dump 
 (http://download.wikimedia.org/enwiki/20081008/
 ) has been crawling along since 10/15/2008.
 The current dump system is not sustainable on very large wikis and
 is being replaced. You'll hear about it when we have the new one in
 place. :)
 -- brion

 Following up on this thread:  
 http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html

 Brion,

 Can you offer any general timeline estimates (weeks, months, 1/2
 year)?  Are there any alternatives to retrieving the article data
 beyond directly crawling
 the site?  I know this is verboten but we are in dire need of
 retrieving this data and don't know of any alternatives.  The current
 estimate of end of year is
 too long for us to wait.  Unfortunately, wikipedia is a favored source
 for students to plagiarize from which makes out of date content a real
 issue.

 Is there any way to help this process along?  We can donate disk
 drives, developer time, ...?  There is another possibility
 that we could offer but I would need to talk with someone at the
 wikimedia foundation offline.  Is there anyone I could
 contact?

 Thanks for any information and/or direction you can give.

 Christian


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Thomas Dalton
 Whether we want to let the current process continue to try and finish
 or not, I would seriously suggest someone look into redumping the rest
 of the enwiki files (i.e. logs, current pages, etc.).  I am also among
 the people that care about having reasonably fresh dumps and it really
 is a problem that the other dumps (e.g. stubs-meta-history) are frozen
 while we wait to see if the full history dump can run to completion.

Even if we do let it finish, I'm not sure a dump of what Wikipedia was
like 13 months ago is much use... The way I see it, what we need is to
get a really powerful server to do the dump just once at a reasonable
speed; then we'll have a previous dump to build on, so future ones
would be more reasonable.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Brion Vibber
On 1/27/09 2:35 PM, Thomas Dalton wrote:
 The way I see it, what we need is to get a really powerful server

Nope, it's a software architecture issue. We'll restart it with the new 
arch when it's ready to go.

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber br...@wikimedia.org wrote:
 On 1/27/09 2:35 PM, Thomas Dalton wrote:
 The way I see it, what we need is to get a really powerful server

 Nope, it's a software architecture issue. We'll restart it with the new
 arch when it's ready to go.

I don't know what your timetable is, but what about doing something to
address the other aspects of the dump (logs, stubs, etc.) that are in
limbo while full history chugs along?  All the other enwiki files are
now 3 months old, and that is already enough to inconvenience some
people.

The simplest solution is just to kill the current dump job if you have
faith that a new architecture can be put in place in less than a year.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi all,

I want to crawl around 800.000 flagged revisions from the German
Wikipedia, in order to make a dump containing only flagged revisions.
For this, I obviously need to spider Wikipedia.
What are the limits (rate!) here, what UA should I use and what
caveats do I have to take care of?

Thanks,
Marco

PS: I already have a revisions list, created with the Toolserver. I
used the following query: "SELECT fp_stable, fp_page_id FROM
flaggedpages WHERE fp_reviewed=1;". Is it correct that this gives me a
list of all articles with flagged revs, with fp_stable being the revid
of the most recent flagged rev for each article?
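
For illustration, a throttled fetch of a single revision could go through the
standard api.php revisions query, roughly as sketched below; the rev ID,
User-Agent string, and pause length are placeholder assumptions, not an answer
to the rate question:

<?php
// Sketch: fetch one flagged revision by its rev ID (e.g. a value from fp_stable).
$revId = 12345678; // placeholder
$url = 'http://de.wikipedia.org/w/api.php?action=query&prop=revisions'
     . '&rvprop=content&format=xml&revids=' . $revId;

$ch = curl_init( $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
// Use a descriptive User-Agent with contact details.
curl_setopt( $ch, CURLOPT_USERAGENT,
    'FlaggedRevsCrawler/0.1 (contact: someone@example.org)' );
$xml = curl_exec( $ch );
curl_close( $ch );

// ... write $xml out to a per-revision file here ...

sleep( 1 ); // keep the request rate conservative
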
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

iD8DBQFJf5wcW6S2GapJUuQRAl8NAJ0Xs+ImyTqmoX2Vtj6k6PK9ntlS5wCeJjsl
M5kMETB3URYni5TilIOt8Fs=
=j7Og
-END PGP SIGNATURE-

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread George Herbert
On Tue, Jan 27, 2009 at 11:29 AM, Steve Summit s...@eskimo.com wrote:

 Jeff Ferland wrote:
  You'll need a quite impressive machine to host even just the current
  revisions of the wiki. Expect to expend 10s to even hundreds of
  gigabytes on the database alone for Wikipedia using only the current
  versions.

 No, no, no.  You're looking at it all wrong.  That's the sucker's
 way of doing it.

 If you're smart, you put up a simple page with a text box labeled
 Wikipedia search, and whenever someone types a query into
 the box and submits it, you ship the query over to the Wikimedia
 servers, and then slurp back the response, and display it back
 to the original submitter.  That way only Wikimedia has to worry
 about all those pesky gigabyte-level database hosting requirements,
 while you get all the glory.

 This appears to be what the questioner is asking about.


Let's AGF a bit...

Even if someone whose goal in life is not particularly Wikipedia-related links
to one of our searches from their page, all the resulting search-result links
point back into Wikipedia.

If people have a question about something, and want to look it up, does it
really matter if they go to Wikipedia's front page and click search versus
doing so in another context?

We're providing an information resource - other sites can and often do link
to our articles (quite appropriately).  Why not link to our search?

The search link should in fairness tell people what they're getting, sure,
but that's more of a website-to-end-user disclosure problem than a problem
for us.

Google switching to use our search would crush us, obviously.  As would
AOL.  But J. Random site?  Seems like an ok thing, to me.


-- 
-george william herbert
george.herb...@gmail.com
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Rolf Lampa
Marco Schuster skrev:
 I want to crawl around 800.000 flagged revisions from the German
 Wikipedia, in order to make a dump containing only flagged revisions.
[...]
 flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
 list of all articles with flagged revs, 


Don't the XML dumps contain the flag for flagged revs?

// Rolf Lampa

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Daniel Kinzler
Rolf Lampa schrieb:
 Marco Schuster skrev:
 I want to crawl around 800.000 flagged revisions from the German
 Wikipedia, in order to make a dump containing only flagged revisions.
 [...]
 flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
 list of all articles with flagged revs, 
 
 
 Don't the XML dumps contain the flag for flagged revs?
 
They don't. And that's very sad.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Platonides
Marco Schuster wrote:
 Hi all,
 
 I want to crawl around 800.000 flagged revisions from the German
 Wikipedia, in order to make a dump containing only flagged revisions.
 For this, I obviously need to spider Wikipedia.
 What are the limits (rate!) here, what UA should I use and what
 caveats do I have to take care of?
 
 Thanks,
 Marco
 
 PS: I already have a revisions list, created with the Toolserver. I
 used the following query: select fp_stable,fp_page_id from
 flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
 list of all articles with flagged revs, fp_stable being the revid of
 the most current flagged rev for this article?

Fetch them from the toolserver (there's a tool by Duesentrieb for that).
It will fetch almost all of them from the toolserver cluster, and make a
request to Wikipedia only if needed.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Wed, Jan 28, 2009 at 12:49 AM, Rolf Lampa  wrote:
 Marco Schuster skrev:
 I want to crawl around 800.000 flagged revisions from the German
 Wikipedia, in order to make a dump containing only flagged revisions.
 [...]
 flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
 list of all articles with flagged revs,


 Don't the XML dumps contain the flag for flagged revs?

The XML dumps are no use for me: way too much overhead (especially as
they are old, and I want to use single files; they're easier to process
than one huge XML file). And they don't contain flagged-revision
flags :(

Marco
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

iD8DBQFJf5/cW6S2GapJUuQRAj1KAJ9feF3ElQTQbuENa2xfDoXJE5pq5QCfYtRd
x8lfmVHMzmVOqtO39MCfieQ=
=8YJP
-END PGP SIGNATURE-

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Toolserver-l] Crawling deWP

2009-01-27 Thread Daniel Kinzler
Marco Schuster schrieb:
 Fetch them from the toolserver (there's a tool by duesentrieb for that).
 It will catch almost all of them from the toolserver cluster, and make a
 request to wikipedia only if needed.
 I highly doubt this is a legal use of the toolserver, and I'd guess that
 fetching 800k revisions would be a huge resource load.
 
 Thanks, Marco
 
 PS: CC-ing toolserver list.

It's a legal use; the only problem is that the tool I wrote for it is quite
slow. You shouldn't hit it at full speed. So it might actually be better to
query the main server cluster, since they can distribute the load more nicely.

One day I'll rewrite WikiProxy and everything will be better :)

But by then, I do hope we have revision flags in the dumps, because that would
be The Right Thing to use.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread George Herbert
On Tue, Jan 27, 2009 at 3:54 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:

 Anyway, the reason live mirrors are prohibited is not for load
 reasons.  I believe it's because if a site does nothing but stick up
 some ads and add no value, Wikimedia is going to demand a cut of the
 profit for using its trademarks and so on.  Some sites pay Wikimedia
 for live mirroring.  So the others, in principle, get blocked.


Right, but a live mirror is a very different thing than a search box link.


-- 
-george william herbert
george.herb...@gmail.com
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Aryeh Gregor
On Tue, Jan 27, 2009 at 7:37 PM, George Herbert
george.herb...@gmail.com wrote:
 Right, but a live mirror is a very different thing than a search box link.

Well, as far as I can tell, we have no idea whether the original
poster meant either of those, or perhaps something else altogether.
Obviously nobody minds a search box link, that's just a *link*.  You
can't stop people from linking to you.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Platonides
Dawson wrote:
 Modified config file as follows:
 
 $wgUseDatabaseMessages = false;
 $wgUseFileCache = true;
 $wgMainCacheType = "CACHE_ACCEL";

This should be $wgMainCacheType = CACHE_ACCEL; (the constant), not
$wgMainCacheType = "CACHE_ACCEL"; (a string).


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Jason Schulz
http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/jobs-loop/run-jobs.c?revision=22101view=markupsortby=date

As mentioned, it is just a sample script. For sites with just one 
master/slave cluster, any simple script that keeps looping to run 
maintenance/runJobs.php will do.
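
A minimal sketch of such a loop in PHP (run from the wiki's root directory; 
the --maxjobs value and the sleep interval are arbitrary examples):

<?php
// Keep the job queue drained by invoking runJobs.php over and over.
while ( true ) {
    passthru( 'php maintenance/runJobs.php --maxjobs 1000' );
    sleep( 10 ); // don't spin when the queue is empty
}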

-Aaron

--
From: Marco Schuster ma...@harddisk.is-a-geek.org
Sent: Tuesday, January 27, 2009 6:56 PM
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] MediaWiki Slow, what to look for?

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 On Tue, Jan 27, 2009 at 6:56 PM, Jason Schulz  wrote:
 Also, see
 http://www.mediawiki.org/wiki/User:Aaron_Schulz/How_to_make_MediaWiki_fast
 The shell script you mention in step 2 has some stuff in it that makes
 it unusable outside Wikimedia:
 1) lots of hard-coded paths
 2) what is /usr/local/bin/run-jobs?

 I'd put "0 0 * * * /usr/bin/php /var/www/wiki/maintenance/runJobs.php
 2>&1 > /var/log/runJobs.log" as the crontab entry in your guide, as it's a
 bit more compatible with non-Wikimedia environments ;)

 Marco
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.7 (MingW32)
 Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

 iD8DBQFJf59oW6S2GapJUuQRAvYCAJ4vWBAHSTHlJljfnnUSF7IpZlechQCcCY5A
 Zb5SMJz146sM5HalNQuA/9k=
 =Ie27
 -END PGP SIGNATURE-

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
 

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Toolserver-l] Crawling deWP

2009-01-27 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Wed, Jan 28, 2009 at 1:13 AM, Daniel Kinzler  wrote:
 Marco Schuster schrieb:
 Fetch them from the toolserver (there's a tool by duesentrieb for that).
 It will catch almost all of them from the toolserver cluster, and make a
 request to wikipedia only if needed.
 I highly doubt this is a legal use of the toolserver, and I'd guess that
 fetching 800k revisions would be a huge resource load.

 Thanks, Marco

 PS: CC-ing toolserver list.

 It's a legal use; the only problem is that the tool I wrote for it is quite
 slow. You shouldn't hit it at full speed. So it might actually be better to
 query the main server cluster, since they can distribute the load more nicely.
What is the best speed, actually? 2 requests per second? Or can I go up to 4?

 One day I'll rewrite WikiProxy and everything will be better :)
:)

 But by then, I do hope we have revision flags in the dumps, because that would
 be The Right Thing to use.
Still, using the dumps would require me to get the full history dump
because I only want flagged revisions and not current revisions
without the flag.

Marco
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

iD8DBQFJgAIpW6S2GapJUuQRAuY/AJ47eppKPbBqjz0l4HllCPolMWz9KACfRurR
Lod/wkd4ZM0ee+cPTfaO7yg=
=zB26
-END PGP SIGNATURE-

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Tei
On Wed, Jan 28, 2009 at 1:41 AM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 On Tue, Jan 27, 2009 at 7:37 PM, George Herbert
 george.herb...@gmail.com wrote:
 Right, but a live mirror is a very different thing than a search box link.

 Well, as far as I can tell, we have no idea whether the original
 poster meant either of those, or perhaps something else altogether.
 Obviously nobody minds a search box link, that's just a *link*.  You
 can't stop people from linking to you.


This code doesn't even need to use
http://en.wiktionary.org/wiki/Special:Search

<input type="text" id="word" />
<form id="form1">
<input type="button" onclick="searchWiktionary()" value="Define" />
</form>

<script>

function $(name){
  return document.getElementById(name);
}

function searchWiktionary(){
  var word = $("word").value;
  $("form1").setAttribute("action", "http://en.wiktionary.org/wiki/" + escape(word) );
  $("form1").submit();
}
</script>



-- 
--
ℱin del ℳensaje.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Tei
On Wed, Jan 28, 2009 at 8:28 AM, Tei oscar.vi...@gmail.com wrote:
 On Wed, Jan 28, 2009 at 1:41 AM, Aryeh Gregor
 simetrical+wikil...@gmail.com wrote:
 On Tue, Jan 27, 2009 at 7:37 PM, George Herbert
 george.herb...@gmail.com wrote:
 Right, but a live mirror is a very different thing than a search box link.

 Well, as far as I can tell, we have no idea whether the original
 poster meant either of those, or perhaps something else altogether.
 Obviously nobody minds a search box link, that's just a *link*.  You
 can't stop people from linking to you.


 This code doesn't even need to use
 http://en.wiktionary.org/wiki/Special:Search

 <input type="text" id="word" />
 <form id="form1">
 <input type="button" onclick="searchWiktionary()" value="Define" />
 </form>

 <script>

 function $(name){
   return document.getElementById(name);
 }

 function searchWiktionary(){
   var word = $("word").value;
   $("form1").setAttribute("action", "http://en.wiktionary.org/wiki/" + escape(word) );
   $("form1").submit();
 }
 </script>



P.S.: I know the OP was talking about OpenSearch. This snippet of code is
something different.

-- 
--
ℱin del ℳensaje.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l