Re: [Wikitech-l] Firesheep

2010-10-26 Thread MZMcBride
George Herbert wrote:
> The current WMF situation is becoming "quaint" - pros use
> secure.wikimedia.org, amateurs don't realize what they're exposing.
> By professional standards, we're not keeping up with professional
> industry expectations.  It's not nuclear bomb secrets (cough) or
> missile designs (cough) but our internal function (in terms of keeping
> more sensitive accounts private and not hacked) and our ability to
> reassure people that they're using a modern and reliable site are
> falling slowly.

I don't understand what you're saying here. Most Wikimedia content is
intended to be distributed openly and widely. Certainly serving every page
view over HTTPS makes no sense given the cost vs. benefit currently.

As Aryeh notes, even those who act in an editing role (rather than in simply
a reader role) don't generally have valuable accounts. The "pros" you're
talking about are free to use secure.wikimedia.org (which is already set up
and has been for quite some time). If there were no secure site alternative,
I think you'd have a point. As it stands, I don't see what's very quaint
about this situation.

It'd be great to one day have http://en.wikipedia.org be the same as
https://en.wikipedia.org with the only noticeable difference being the
little lock icon in your browser. But there is a finite amount of resources,
and this really isn't and shouldn't be a high priority.

If the goal is to reassure people that they're using a "modern and reliable
site," there are lot of other features that could and should be implemented
first in my view, though the goal itself seems a bit dubious in any case.

MZMcBride



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Firesheep

2010-10-26 Thread George Herbert
On Mon, Oct 25, 2010 at 11:59 PM, MZMcBride  wrote:
> George Herbert wrote:
>> The current WMF situation is becoming "quaint" - pros use
>> secure.wikimedia.org, amateurs don't realize what they're exposing.
>> By professional standards, we're not keeping up with professional
>> industry expectations.  It's not nuclear bomb secrets (cough) or
>> missile designs (cough) but our internal function (in terms of keeping
>> more sensitive accounts private and not hacked) and our ability to
>> reassure people that they're using a modern and reliable site are
>> falling slowly.
>
> I don't understand what you're saying here. Most Wikimedia content is
> intended to be distributed openly and widely. Certainly serving every page
> view over HTTPS makes no sense given the cost vs. benefit currently.
>
> As Aryeh notes, even those who act in an editing role (rather than in simply
> a reader role) don't generally have valuable accounts. The "pros" you're
> talking about are free to use secure.wikimedia.org (which is already set up
> and has been for quite some time). If there were no secure site alternative,
> I think you'd have a point. As it stands, I don't see what's very quaint
> about this situation.
>
> It'd be great to one day have http://en.wikipedia.org be the same as
> https://en.wikipedia.org with the only noticeable difference being the
> little lock icon in your browser. But there is a finite amount of resources,
> and this really isn't and shouldn't be a high priority.
>
> If the goal is to reassure people that they're using a "modern and reliable
> site," there are lot of other features that could and should be implemented
> first in my view, though the goal itself seems a bit dubious in any case.
>
> MZMcBride

I have no objection to us serving http traffic, especially as default
to logged-out users.  There's security sensitivity, and then there's
paranoia.

But I would prefer to move towards a model where logged-in users go to a
secure connection by default.  That would include making secure.wikimedia.org
a multi-system, fully redundantly supported part of the environment, or
alternately just making HTTPS work on all the front ends.

Any "login" should be protected.  The casual "eh" attitude here is
unprofessional, as it were.  The nature of the site means that this
isn't something I would rush a crash program and redirect major
resources to fix immediately, but it's not something to think of as
desirable and continue propagating for more years.


-- 
-george william herbert
george.herb...@gmail.com



Re: [Wikitech-l] Firesheep

2010-10-26 Thread Nikola Smolenski
On 10/26/2010 08:59 AM, MZMcBride wrote:
> As Aryeh notes, even those who act in an editing role (rather than in simply
> a reader role) don't generally have valuable accounts. The "pros" you're
> talking about are free to use secure.wikimedia.org (which is already set up
> and has been for quite some time). If there were no secure site alternative,
> I think you'd have a point. As it stands, I don't see what's very quaint
> about this situation.

For maximum security and minimal overhead, let the login always be 
over https. If a logged-in user is an admin or higher, use https for 
everything. Expand to all editors if easily possible.



Re: [Wikitech-l] Firesheep

2010-10-26 Thread John Vandenberg
On Tue, Oct 26, 2010 at 6:24 PM, George Herbert wrote:
>..
> But I would prefer to move towards a logged-in user by default goes to
> secure connection model.  That would include making secure a
> multi-system, fully redundantly supported part of the environment, or
> alternately just making https work on all the front ends.
>
> Any "login" should be protected.  The casual "eh" attitude here is
> unprofessional, as it were.  The nature of the site means that this
> isn't something I would rush a crash program and redirect major
> resources to fix immediately, but it's not something to think of as
> desirable and continue propagating for more years.

I agree.  Even if we still do drop users back to http after
authentication, and the cookies can be sniffed, that is preferable to
having authentication over http.

People often use the same password for many sites.

Their password may not have much value on WMF projects ('at worst they
access admin functions'), but it could be used to access their gmail
or similar.

--
John Vandenberg



Re: [Wikitech-l] Firesheep

2010-10-26 Thread Daniel Kinzler
On 26.10.2010 09:36, Nikola Smolenski wrote:
> On 10/26/2010 08:59 AM, MZMcBride wrote:
>> As Aryeh notes, even those who act in an editing role (rather than in simply
>> a reader role) don't generally have valuable accounts. The "pros" you're
>> talking about are free to use secure.wikimedia.org (which is already set up
>> and has been for quite some time). If there were no secure site alternative,
>> I think you'd have a point. As it stands, I don't see what's very quaint
>> about this situation.
> 
> For maximum security and minimal overhead, let the login always be 
> over https. If a logged-in user is an admin or higher, use https for 
> everything. Expand to all editors if easily possible.

This sounds like a sensible compromise. It protects the sensitive bits, and
doesn't cause massive load on https handling. I would very much like to see this
on the official roadmap.

By the way... where's the official road map?

-- daniel



Re: [Wikitech-l] Firesheep

2010-10-26 Thread Conrad Irwin
There is no real massive load caused by HTTPS at runtime.  There is, however,
a significant chunk of developer and sysadmin time needed to implement this
and make it work.

For now, at least, the only optimisations that should be considered are
those that make it easier all round.

Conrad

On 26 Oct 2010 08:44, "Daniel Kinzler"  wrote:

On 26.10.2010 09:36, Nikola Smolenski wrote:
> On 10/26/2010 08:59 AM, MZMcBride wrote:
>> As Aryeh ...
This sounds like a sensible compromise. It protects the sensitive bits, and
doesn't cause massive load on https handling. I would very much like to see
this
on the official roadmap.

By the way... where's the official road map?

-- daniel




Re: [Wikitech-l] InlineEditor new version (previously Sentence-Level Editing)

2010-10-26 Thread Alex Brollo
2010/10/25 Jan Paul Posma 

> Hi all,
>
> As presented last Saturday at the Hack-A-Ton, I've committed a new version
> of the InlineEditor extension. [1] This is an implementation of the
> sentence-level editing demo posted a few months ago.
>
>
Very interesting! Obviously I won't see your work until it is implemented on
Wikipedia and all the other Wikimedia Foundation projects. Please also
consider the specific needs of the sister projects, e.g. the poem extension
used by Wikisource and its <poem> tags; I guess that any sister project has
something particular to be considered from the beginning of any work on a
new editor.

Alex


Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Platonides
Robert Rohde wrote:
> Many of the things done for the statistical analysis of database dumps
> should be suitable for parallelization (e.g. break the dump into
> chunks, process the chunks in parallel and sum the results).  You
> could talk to Erik Zachte.  I don't know if his code has already been
> designed for parallel processing though.

I don't think it's a good candidate, since you are presumably using
compressed files, and decompression linearises the processing (and is most
likely the bottleneck, too).


> Another option might be to look at the methods for compressing old
> revisions (is [1] still current?).
> 
> I make heavy use of parallel processing in my professional work (not
> related to wikis), but I can't really think of any projects I have at
> hand that would be accessible and completable in a month.
> 
> -Robert Rohde
> 
> [1] http://www.mediawiki.org/wiki/Manual:CompressOld.php

It can be used, I am unsure if it is used by WMF.

Another thing that would be nice to have parallelised is the parser
tests. That would need adding cotask (coroutine) support to PHP or
similar. The most similar extension I know of is runkit, which is the
other way around: several PHP scopes instead of several threads in one
scope.




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Jyothis Edathoot
Develop a new bot framework (maybe interwiki processing to start with) for a
high-performance GPU cluster (Nvidia or AMD), similar to what BOINC-based
projects do.  Nvidia is more popular, while AMD has more cores for the same
price.

 :)


Regards,
Jyothis.

http://www.Jyothis.net

http://ml.wikipedia.org/wiki/User:Jyothis
http://meta.wikimedia.org/wiki/User:Jyothis
I am the first customer of http://www.netdotnet.com

woods are lovely dark and deep,
but i have promises to keep and
miles to go before i sleep and
lines to go before I press sleep

completion date = (start date + ((estimated effort x 3.1415926) / resources)
+ ((total coffee breaks x 0.25) / 24)) + Effort in meetings



On Sun, Oct 24, 2010 at 8:42 PM, Aryeh Gregor wrote:

> This term I'm taking a course in high-performance computing, and I have
> to pick a topic for a final project.  According to the assignment,
> "The only real requirement is that it be something in parallel."  In
> the class, we covered
>
> * Microoptimization of single-threaded code (efficient use of CPU cache,
> etc.)
> * Multithreaded programming using OpenMP
> * GPU programming using OpenCL
>
> and will probably briefly cover distributed computing over multiple
> machines with MPI.  I will have access to a high-performance cluster
> at NYU, including lots of CPU nodes and some high-end GPUs.  Unlike
> most of the other people in the class, I don't have any interesting
> science projects I'm working on, so something useful to
> MediaWiki/Wikimedia/Wikipedia is my first thought.  If anyone has any
> suggestions, please share.  (If you have non-Wikimedia-related ones,
> I'd also be interested in hearing about them offlist.)  They shouldn't
> be too ambitious, since I have to finish them in about a month, while
> doing work for three other courses and a bunch of other stuff.
>
> My first thought was to write a GPU program to crack MediaWiki
> password hashes as quickly as possible, then use what we've studied in
> class about GPU architecture to design a hash function that would be
> as slow as possible to crack on a GPU relative to its PHP execution
> speed, as Tim suggested a while back.  However, maybe there's
> something more interesting I could do.
>


Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ariel T. Glenn
On 26-10-2010, Tuesday, at 16:25 +0200, Platonides wrote:
> Robert Rohde wrote:
> > Many of the things done for the statistical analysis of database dumps
> > should be suitable for parallelization (e.g. break the dump into
> > chunks, process the chunks in parallel and sum the results).  You
> > could talk to Erik Zachte.  I don't know if his code has already been
> > designed for parallel processing though.
> 
> I don't think it's a good candidate since you are presumably using
> compressed files, and its decompression linearises it (and is most
> likely the bottleneck, too).

If one were clever (and I have some code that would enable one to be
clever), one could seek to some point in the (bzip2-compressed) file and
uncompress from there before processing.  Running a bunch of jobs each
decompressing only their small piece then becomes feasible.  I don't
have code that does this for gz or 7z; afaik these do not do compression
in discrete blocks.

Ariel
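
For the byte-aligned case this is easy to sketch: a *multistream* .bz2 file
(what pbzip2 or plain stream concatenation produces, not the single-stream
dumps Ariel describes, whose internal blocks are bit-aligned) starts each
stream on a byte boundary, so a substring search finds the split points. The
level-9 assumption and helper names below are mine, not code from the dumps
infrastructure:

```python
import bz2

BLOCK_MAGIC = b"\x31\x41\x59\x26\x53\x59"  # pi digits: bzip2 block start

def stream_offsets(data, level=9):
    """Find byte offsets of stream headers in a multistream .bz2 blob.

    Each stream starts byte-aligned with 'BZh<level>' followed by the
    block magic; blocks *within* a single stream are bit-aligned and
    need the kind of bit-level fiddling described above.
    """
    marker = b"BZh" + str(level).encode() + BLOCK_MAGIC
    offsets, pos = [], data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets

def decompress_piece(data, start, end):
    # Each stream is self-contained, so pieces can go to separate workers.
    return bz2.BZ2Decompressor().decompress(data[start:end])
```

Each worker would then get one (start, end) slice and decompress it
independently of the others.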




Re: [Wikitech-l] Firesheep

2010-10-26 Thread Aryeh Gregor
On Tue, Oct 26, 2010 at 2:23 AM, Ashar Voultoiz  wrote:
> HTTPS means full encryption, that is either :
>   - a ton of CPU cycles : those are wasted cycles for something else.
>   - SSL ASIC : costly, specially given our gets/ bandwidth levels

HTTPS uses very few CPU cycles by today's standards.  See here:

"""
In January this year (2010), Gmail switched to using HTTPS for
everything by default. Previously it had been introduced as an option,
but now all of our users use HTTPS to secure their email between their
browsers and Google, all the time. In order to do this we had to
deploy no additional machines and no special hardware. On our
production frontend machines, SSL/TLS accounts for less than 1% of the
CPU load, less than 10KB of memory per connection and less than 2% of
network overhead. Many people believe that SSL takes a lot of CPU time
and we hope the above numbers (public for the first time) will help to
dispel that.
"""
http://www.imperialviolet.org/2010/06/25/overclocking-ssl.html

On Tue, Oct 26, 2010 at 3:24 AM, George Herbert wrote:
> Any "login" should be protected.  The casual "eh" attitude here is
> unprofessional, as it were.  The nature of the site means that this
> isn't something I would rush a crash program and redirect major
> resources to fix immediately, but it's not something to think of as
> desirable and continue propagating for more years.

It's not desirable, but given limited resources, undesirable things
are inevitable.  I don't know what the sysadmins are spending their
time on, but presumably it's something that they feel takes precedence
over this.  (None has commented so far here . . .)

On Tue, Oct 26, 2010 at 3:36 AM, Nikola Smolenski  wrote:
> For maximum security and minimal overhead, let the login always be
> over https. If a logged-in user is an admin or higher, use https for
> everything. Expand to all editors if easily possible.

This is an improvement, but not an ideal solution, because a MITM
could just change the HTTPS login link to be HTTP instead, and
translate the request to HTTPS themselves so Wikimedia doesn't see the
difference.  HTTPS for everything makes more sense, ideally with
Strict-Transport-Security.
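
Strict-Transport-Security is just a response header sent over HTTPS, telling
supporting browsers to refuse plain-HTTP connections to the host from then
on, which closes the login-link downgrade hole described above. MediaWiki
itself is PHP, but the mechanism is front-end-agnostic; a minimal WSGI
middleware sketch (the max-age value is an illustrative assumption, not a
Wikimedia setting):

```python
# Sketch: append Strict-Transport-Security to every HTTPS response.
# Nothing here is Wikimedia's actual configuration.

def add_hsts(app, max_age=31536000):
    def wrapped(environ, start_response):
        def sr(status, headers, exc_info=None):
            # Only meaningful (and only honored) on HTTPS responses.
            if environ.get("wsgi.url_scheme") == "https":
                headers = headers + [
                    ("Strict-Transport-Security", "max-age=%d" % max_age)
                ]
            return start_response(status, headers, exc_info)
        return app(environ, sr)
    return wrapped
```

The same one-liner exists for any web server or framework that can set a
static response header.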


[Wikitech-l] New installer is here

2010-10-26 Thread Chad
Good afternoon,

In r75437, r75438[0][1] I moved the old installer to old-index.php
and moved the new to index.php. At this stage in the process,
I don't see us backing this out before we branch 1.17. I really
want people to test it out and report any major breakages [2].

This has been a long development process for almost 2 years
now, and I'd like to thank Max, Mark H., Jure, Jeroen, Roan
and Siebrand for their invaluable help in working on this. And
especially thanks to Tim for starting the project and providing
feedback, as always. There is a *lot* of code in includes/installer,
and I'd like to highlight some of the major changes that you'll
need to know.

Database updaters: They have been moved from the gigantic
file in maintenance/updaters.inc (patchfiles still go in the same
place though). Each supported DB type has a class that needs
to subclass DatabaseUpdater. The format's very similar, only
it's operating on methods in the classes instead of global functions.
The globals $wgExtNewTables, etc. are retained for back compat
and will be for quite some time. However, you can pass more
advanced callbacks since the LoadExtensionSchemaUpdates
hook now passes the DatabaseUpdater subclass as a param.

DB2 and MSSQL have been dropped from the installer. The
implementations are far from complete and I'm not comfortable
advertising their use yet.

Other known issues:
- Some UI quirks still exist, but work is coming here
- Postgres and Oracle are *almost* done
- Stuff listed on mw.org[2]

-Chad

[0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/75437
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/75438
[2] http://www.mediawiki.org/wiki/New-installer_issues



Re: [Wikitech-l] New installer is here

2010-10-26 Thread Erik Moeller
2010/10/26 Chad :
> Good afternoon,
>
> In r75437, r75438[0][1] I moved the old installer to old-index.php
> and moved the new to index.php. At this stage in the process,
> I don't see us backing this out before we branch 1.17. I really
> want people to test it out and report any major breakages [2].

Congratulations. :-) It looks great.

A few quick notes:

1) On the admin/site name screen at least, when both aren't supplied,
it only shows the error messages, not the form below. This may be a
general issue with the form validation.
Screenshot: http://tinypic.com/r/2po9vh0/7

2) Checkbox alignment in general is a bit off, at least in Chrome, e.g.:
http://tinypic.com/r/655n5x/7

3) for the "Extensions" section, I would suggest adding a more visible
warning: "Warning: Most extensions require additional configuration
beyond this step. Installing unreviewed extensions may expose your
wiki to security vulnerabilities." I know the Help already explains
the first point, but the simple installer may suggest to the user that
ticking a checkbox is all that's required.

4) It'd be great if we could change the design to Vector :-). In
general it could use a bit more UI love -- perhaps Brandon will have
time to take a quick look.

-- 
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate



Re: [Wikitech-l] New installer is here

2010-10-26 Thread Erik Moeller
2010/10/26 Erik Moeller :
> A few quick notes:

And, sorry for duplicating stuff from the known issues list.
-- 
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate



Re: [Wikitech-l] New installer is here

2010-10-26 Thread Brandon Harris

I am on ALL of these things, actually.  I have fixes for most of them 
pending.


On 10/26/10 10:41 AM, Erik Moeller wrote:
> 2010/10/26 Chad:
>> Good afternoon,
>>
>> In r75437, r75438[0][1] I moved the old installer to old-index.php
>> and moved the new to index.php. At this stage in the process,
>> I don't see us backing this out before we branch 1.17. I really
>> want people to test it out and report any major breakages [2].
>
> Congratulations. :-) It looks great.
>
> A few quick notes:
>
> 1) On the admin/site name screen at least, when both aren't supplied,
> it only shows the error messages, not the form below. This may be a
> general issue with the form validation.
> Screenshot: http://tinypic.com/r/2po9vh0/7
>
> 2) Checkbox alignment in general is a bit off, at least in Chrome, e.g.:
> http://tinypic.com/r/655n5x/7
>
> 3) for the "Extensions" section, I would suggest adding a more visible
> warning: "Warning: Most extensions require additional configuration
> beyond this step. Installing unreviewed extensions may expose your
> wiki to security vulnerabilities." I know the Help already explains
> the first point, but the simple installer may suggest to the user that
> ticking a checkbox is all that's required.
>
> 4) It'd be great if we could change the design to Vector :-). In
> general it could use a bit more UI love -- perhaps Brandon will have
> time to take a quick look.
>



Re: [Wikitech-l] New installer is here

2010-10-26 Thread Erik Moeller
2010/10/26 Brandon Harris :
>
>        I am on ALL of these things, actually.  I have fixes for most of them
> pending.

Awesome :-)


-- 
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate



Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Tisza Gergő
Aryeh Gregor writes:

> To clarify, the subject needs to 1) be reasonably doable in a short
> timeframe, 2) not build on top of something that's already too
> optimized.  It should probably either be a new project; or an effort
> to parallelize something that already exists, isn't parallel yet, and
> isn't too complicated.  So far I have the password-cracking thing,
> maybe dbzip2, and maybe some unspecified thing involving dumps.

Some PageRank-like metric to approximate Wikipedia article importance/quality?
Parallelizing eigenvalue calculations has a rich literature.
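
To give a sense of scale, a toy serial version of such a metric is only a
few lines; the parallel win would come from splitting the rank-propagation
step (a sparse matrix-vector product) across workers. The graph and damping
value below are illustrative assumptions, not anything Wikipedia runs:

```python
# Toy PageRank by power iteration over an out-link adjacency dict.
# Assumes every link target also appears as a key in `links`.

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        # This loop is the parallelizable part: partition `links`
        # across workers and sum the partial `new` vectors.
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```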




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Tim Starling
On 24/10/10 17:42, Aryeh Gregor wrote:
> This term I'm taking a course in high-performance computing, and I have
> to pick a topic for a final project.  According to the assignment,
> "The only real requirement is that it be something in parallel."  In
> the class, we covered
> 
> * Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
> * Multithreaded programming using OpenMP
> * GPU programming using OpenCL

I've occasionally wondered how hard it would be to parallelize a
parser. It's generally not done, despite the fact that
parsers are so slow and useful.

Some file formats can certainly be parsed in a parallel way, if you
partition them in the right way. For example, if you were parsing a
CSV file, you could partition on the line breaks. You can't do that by
scanning the whole file, which is O(N) and would defeat the purpose, but
you can seek ahead to a suitable byte position and then scan forwards
for the next line break to partition at.
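
That seek-ahead partitioning can be sketched in a few lines (the helper
names and the line-counting worker below are illustrative, not existing
MediaWiki code):

```python
import os
from multiprocessing import Pool

def chunk_spans(path, nchunks):
    """Split a line-oriented file into byte ranges that start and end
    on line breaks, without scanning the whole file."""
    size = os.path.getsize(path)
    starts = {0}
    with open(path, "rb") as f:
        for i in range(1, nchunks):
            f.seek(i * size // nchunks)  # O(1) seek, no full scan
            f.readline()                 # advance to the next line break
            if f.tell() < size:
                starts.add(f.tell())
    starts = sorted(starts)
    return list(zip(starts, starts[1:] + [size]))

def count_records(args):
    # Stand-in for real per-chunk parsing work (here: count lines).
    path, start, end = args
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).count(b"\n")

def parallel_count(path, nchunks=4):
    spans = chunk_spans(path, nchunks)
    with Pool(len(spans)) as pool:
        return sum(pool.map(count_records,
                            [(path, s, e) for s, e in spans]))
```

The same pattern generalizes to any format with cheaply findable record
boundaries, which is exactly what wikitext lacks.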

For more complex file formats, there are various approaches. Googling
tells me that this is a well-studied problem for XML.

Obviously for an assessable project, you don't want to dig yourself
into a hole too big to get out of. If you chose XML you could just
follow the previous work. JavaScript might be tractable. Attempting to
parse wikitext would be insane.

-- Tim Starling




Re: [Wikitech-l] New installer is here

2010-10-26 Thread Brion Vibber
On Tue, Oct 26, 2010 at 10:00 AM, Chad  wrote:

> This has been a long development process for almost 2 years
> now, and I'd like to thank Max, Mark H., Jure, Jeroen, Roan
> and Siebrand for their invaluable help in working on this. And
> especially thanks to Tim for starting the project and providing
> feedback, as always. There is a *lot* of code in includes/installer,
> and I'd like to highlight some of the major changes that you'll
> need to know.
>

My hat is off to you, sirs! You guys have put a lot of great work into this
-- absolutely blows away the old installer, that's for dang sure! Looks like
1.17 is going to be an awesome release... I feel like a proud grandpappy
getting the chance to see you guys' work shine... :)

-- brion


Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Robert Rohde
On Tue, Oct 26, 2010 at 8:25 AM, Ariel T. Glenn  wrote:
> On 26-10-2010, Tuesday, at 16:25 +0200, Platonides wrote:
>> Robert Rohde wrote:
>> > Many of the things done for the statistical analysis of database dumps
>> > should be suitable for parallelization (e.g. break the dump into
>> > chunks, process the chunks in parallel and sum the results).  You
>> > could talk to Erik Zachte.  I don't know if his code has already been
>> > designed for parallel processing though.
>>
>> I don't think it's a good candidate since you are presumably using
>> compressed files, and its decompression linearises it (and is most
>> likely the bottleneck, too).
>
> If one were clever (and I have some code that would enable one to be
> clever), one could seek to some point in the (bzip2-compressed) file and
> uncompress from there before processing.  Running a bunch of jobs each
> decompressing only their small piece then becomes feasible.  I don't
> have code that does this for gz or 7z; afaik these do not do compression
> in discrete blocks.

Actually the LZMA used by default in 7z can be partially parallelized
with some strong limitations:

1) The start of block N can generally only be located by finding the
end of block N-1, so files have to be read serially.
2) The ability to decompress block N may or may not depend on already
having decompressed blocks N-1, N-2, N-3, etc., depending on the
details of the data stream.

Point 2 in particular tends to lead to a lot of conflicts that
prevents parallelization.  If block N happens to be independent of
block N-1 then they can be done in parallel, but in general this will
not be the case.  The frequency of such conflicts depends a lot on the
data stream and options given to the compressor.

Last year LZMA2 was introduced in 7z with the primary intent of
improving parallelization.  It actually produces slightly worse
compression in general, but can be operated to guarantee that block N
is independent of blocks N-1 ... N-k for a specified k, meaning that
k+1 blocks can always be considered in parallel.

I believe that gzip has similar constraints to LZMA that make
parallelization problematic, but I'm not sure about that.


Getting back to Wikimedia, it appears correct that the Wikistats code
is designed to run from the compressed files (source linked from [1]).
As you suggest, one could use the properties of the .bz2 format to
parallelize that.  I would also observe that parsers tend to be
relatively slow, while decompressors tend to be relatively fast.  I
wouldn't necessarily assume that the decompressing is the only
bottleneck.  I've run analyses on dumps that took longer to execute
than it took to decompress the files.  However, they probably didn't
take that many times longer (i.e. if the process were parallelized in
2 to 4 simultaneous chunks, then the decompression would be the
primary bottleneck again).

So it is probably true that if one wants to see a large increase in
the speed of stats processing one needs to consider parallelizing both
the decompression and the stats gathering.

-Robert Rohde

[1] http://stats.wikimedia.org/index_tabbed_new.html#fragment-14


Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ángel González
Ariel T. Glenn wrote:
> If one were clever (and I have some code that would enable one to be
> clever), one could seek to some point in the (bzip2-compressed) file and
> uncompress from there before processing.  Running a bunch of jobs each
> decompressing only their small piece then becomes feasible.  I don't
> have code that does this for gz or 7z; afaik these do not do compression
> in discrete blocks.
> 
> Ariel

The bzip2recover approach?
I am not sure how much the gain will be after so much bit shuffling.
Also, I was unable to continue from a flushed point; it may not be so easy.
OTOH, if you already have an index and the blocks end at page boundaries
(which is what I was doing), it becomes trivial.
Remember that the previous block MUST continue up to the point where the
next reader started processing inside the next block. And unlike what
ttsiod said, you do encounter tags split between blocks in a normal
compression.



Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ariel T. Glenn
On 27-10-2010, Wednesday, at 00:05 +0200, Ángel González wrote:
> Ariel T. Glenn wrote:
> > If one were clever (and I have some code that would enable one to be
> > clever), one could seek to some point in the (bzip2-compressed) file and
> > uncompress from there before processing.  Running a bunch of jobs each
> > decompressing only their small piece then becomes feasible.  I don't
> > have code that does this for gz or 7z; afaik these do not do compression
> > in discrete blocks.
> > 
> > Ariel
> 
> The bzip2recover approach?
> I am not sure how much will be the gain after so much bit moving.
> Also, I was unable to continue from a flushed point, it may not be so easy.
> OTOH, if you already have an index and the blocks end at page boundaries
> (which is what I was doing), it becomes trivial.
> Remember that the previous block MUST continue up to the point where the
> next reader started processing inside the next block. And unlike what
> ttsiod said, you do encounter tags split between blocks in a normal
> compression.

I am able (using python bindings to the bzip2 library and some fiddling)
to seek to an arbitrary point, find the first block after the seek
point, and uncompress it and the following blocks in sequence.  That is
sufficient for our work when we are dealing with compressed files 250GB
in size.

We process everything by pages, so we ensure that any reader reads only
specified page ranges from the file.  This avoids overlaps.

We don't build an index; we're only talking about parallelizing 10-20
jobs at once, not all 21 million pages, so building an index would not
be worth it.

Ariel



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Parallel computing project

2010-10-26 Thread Erik Zachte
Robert Rohde:

Getting back to Wikimedia, it appears correct that the Wikistats code
is designed to run from the compressed files (source linked from [1]).
As you suggest, one could use the properties of .bz2 format to
parallelize that.  I would also observe that parsers tend to be
relatively slow, while decompressors tend to be relatively fast. 

Some additional notes:

Yes, wikistats processes compressed dumps.
Nowadays these are mostly stub dumps.
Most monthly metrics can be collected from these, with a few exceptions
like word count.

For stub dumps, decompression is the major resource hog;
for full dumps, some heavy regexps also contribute considerably.

Wikistats could benefit a lot from parallelization (although these days
dump production for the larger wikis is generally the bottleneck).
The first thing I would want to look into (some day) is running the
count scripts for several wikis in parallel.
All intermediate data are stored in CSV files, often one file for one
metric across all languages.
Decoupling the runs, with aggregation as a post-processing step, would
be simple.

Running several count threads on one machine might tax memory.
Some hashes are huge (much has been externalized, but e.g. edits per
user per namespace is still kept as a hash).

The basic structure dates from the time when a full archive dump of the
English Wikipedia was processed in minutes rather than months.
There have been a lot of optimizations, but the general setup is still
like this: every month, all counts for the past 10 years are reproduced
from scratch. Wikistats basically has no memory.
This probably sounds crazy; incremental processing has been suggested
more than once.

The main reason to keep it this way: every so often new functionality
is added to the scripts (and the occasional bug fix).
In order to have the new counts for the full history, we would need to
rerun from scratch every so often anyway.

People have asked me how the counts can change from month to month.
Same answer: counts are redone for all months, and newer dumps will
have more deletions for earlier months.
This mostly affects the last two months, though: nearly all deletions
occur within a month or two.

In the early years deletions were very rare; most were done to prevent
court orders (privacy).
Nowadays deletionism has taken hold.
Still, wikistats treats deleted content as 'should not have been there
in the first place'.
This makes our editor counts somewhat conservative, and basically skews
the activity patterns in favor of good content contributors.

Erik Zachte



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Commons ZIP file upload for admins

2010-10-26 Thread Maciej Jaros
@2010-10-26 03:45, Erik Moeller:
> 2010/10/25 Brion Vibber:
>> In all cases we have the worry that if we allow uploading those funky
>> formats, we'll either a) end up with malicious files or b) end up with lazy
>> people using and uploading non-free editing formats when we'd prefer them to
>> use freely editable formats. I'm not sure I like the idea of using admin
>> powers to control being able to upload those, though; bottlenecking content
>> reviews as a strict requirement can be problematic on its own.
> Yeah, I don't like the bottleneck approach either, but in the absence
> of better systems, it may be the best way to go as an immediate
> solution. We could do it for a list of whitelisted open formats that
> are requested by the community. And we'd see from usage which file
> types we need to prioritize proper support/security checks for.
>
>> What I'd probably like to see is a more wide-open allowal of arbitrary
>> 'source files' which can be uploaded as attachments to standalone files. We
>> could give them more limited access: download only, no inline viewing, only
>> allowed if DLs are on separate safe domain, etc.
> It seems fairly straightforward to me to say: "These free file formats
> are permitted to be uploaded. We haven't developed fully sophisticated
> security checks for them yet, so we're asking trusted users to do
> basic sanity checks until we've developed automatic checks." We can
> then prod people to convert any proprietary formats into free ones
> that are on that whitelist. And if they're free formats, I'm not sure
> why they shouldn't be first-class citizens -- as Michael mentioned,
> that makes it possible to plop in custom handlers at a later time. A
> COLLADA handler for 3D files may seem like a remote possibility, but
> it's certainly within the realm of sanity. ZIP files would have to be
> specially treated so they're only allowed if they contain only files
> in permitted formats.
>
> So, consistent with Michael's suggestion, we could define a
> 'restricted-upload' right, initially given to admins only but possibly
> expanded to other users, which would allow files from the "potentially
> insecure" list of extensions to be uploaded, and for ZIP files, would
> ensure that only accepted file types are contained within the archive.
> The resultant review bottleneck would simply be a reflection that we
> haven't gotten around to adding proper support for these file types
> yet. On the plus side, we could add restricted upload support for new
> open formats as soon as there's consensus to do so.
>
> The main downside I would see is that users might end up being
> confused why these files get uploaded. To mitigate this, we could add
> a "This file has a restricted filetype. Files of this type can
> currently only be uploaded by administrators for security reasons"
> note on file description pages.

ODS, ODT and such should be fairly easy to check, at least on a basic
level. A very basic check would be to see whether the archive contains
a "Basic" or "Scripts" folder. A bit more advanced would be to check
whether manifest.xml contains "application/binary" (to see if anyone
tried to change
default naming) and check if any file contains "https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] RT

2010-10-26 Thread a b
After the recent discussions on openness and clarity, several people
have asked what is contained within RT, and have gotten answers like
"it's staff stuff".

So what is stored in it that can't be kept on either the staff or
internal wiki (where it can stay private), or in Bugzilla for other
matters?
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Commons ZIP file upload for admins

2010-10-26 Thread John Vandenberg
On Tue, Oct 26, 2010 at 6:50 AM, Max Semenik  wrote:
> Instead of amassing social constructs around technical deficiency, I
> propose to fix bug 24230 [1] by implementing proper checking for JAR
> format. Also, we need to check all contents with antivirus and
> disallow certain types of files inside archives (such as .exe). Once
> we've taken all these precautions, I see no need to restrict ZIPs to
> any special group. Of course, this doesn't mean that we should allow
> all safe ZIPs, just several open ZIP-based file formats.

If we only want ZIPs for several formats, we should check that they
are of the expected type, _and_ that the files within the zip are in
open file formats.

e.g. Office Open XML (the MS format) can include binary files for OLE
objects and fonts (I think)

see "Table 2. Content types in a ZIP container"

http://msdn.microsoft.com/en-us/library/aa338205(office.12).aspx

OOXML can also include any other mimetype, which are registered
_within_ the zip, and linked into the main content file.

afaics, allowing only safe ZIPs to be uploaded isn't difficult.

Expand the zip, and reject any zip which contains files on
$wgFileBlacklist, and not on $wgFileExtensions + $wgZipFileExtensions.

$wgZipFileExtensions would consist of array('xml')

Then check the mimetypes of the files in the zip, against
$wgMimeTypeBlacklist (with 'application/zip' removed), again allowing
desired XML mimetypes through.

--
John Vandenberg

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] New installer is here

2010-10-26 Thread Andrew Garrett
On Wed, Oct 27, 2010 at 4:00 AM, Chad  wrote:
> In r75437, r75438[0][1] I moved the old installer to old-index.php
> and moved the new to index.php. At this stage in the process,
> I don't see us backing this out before we branch 1.17. I really
> want people to test it out and report any major breakages [2].

:D

-- 
Andrew Garrett
http://werdn.us/

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] RT

2010-10-26 Thread Ryan Lane
On Tue, Oct 26, 2010 at 4:39 PM, a b  wrote:
> After the recent discussions on openness and clarity, several people
> have asked what is contained within RT, and have gotten answers like
> "it's staff stuff".
>
> So what is stored in it that can't be kept on either the staff or
> internal wiki (where it can stay private), or in Bugzilla for other
> matters?

RT is used by the ops team to track and plan operations work. It may
contain procurement information, quotes, and other sensitive
information that can not be released to the public due to contractual
or confidentiality reasons.

Who is telling you "it's staff stuff"? I'm pretty sure all of the ops
people have been pretty clear about why we can't allow public access
to the system.

Respectfully,

Ryan Lane

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l