[Wikitech-l] Parallel computing project

2010-10-24 Thread Aryeh Gregor
This term I'm taking a course in high-performance computing, and I have
to pick a topic for a final project.  According to the assignment,
"The only real requirement is that it be something in parallel."  In
the class, we covered

* Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
* Multithreaded programming using OpenMP
* GPU programming using OpenCL

and will probably briefly cover distributed computing over multiple
machines with MPI.  I will have access to a high-performance cluster
at NYU, including lots of CPU nodes and some high-end GPUs.  Unlike
most of the other people in the class, I don't have any interesting
science projects I'm working on, so something useful to
MediaWiki/Wikimedia/Wikipedia is my first thought.  If anyone has any
suggestions, please share.  (If you have non-Wikimedia-related ones,
I'd also be interested in hearing about them offlist.)  They shouldn't
be too ambitious, since I have to finish them in about a month, while
doing work for three other courses and a bunch of other stuff.

My first thought was to write a GPU program to crack MediaWiki
password hashes as quickly as possible, then use what we've studied in
class about GPU architecture to design a hash function that would be
as slow as possible to crack on a GPU relative to its PHP execution
speed, as Tim suggested a while back.  However, maybe there's
something more interesting I could do.
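
For reference, a minimal CPU-side sketch of what such a cracker would be
attacking, assuming MediaWiki's historical salted ":B:" scheme
(md5(salt . '-' . md5(password))); the salt and password below are invented:

    import hashlib

    def mediawiki_b_hash(password, salt):
        # Historical MediaWiki ":B:" hash: md5(salt + '-' + md5(password)),
        # both digests as lowercase hex.
        inner = hashlib.md5(password.encode('utf-8')).hexdigest()
        return hashlib.md5((salt + '-' + inner).encode('utf-8')).hexdigest()

    def check_row(stored, candidate):
        # stored looks like ':B:<salt>:<hex digest>' in the user table.
        _, scheme, salt, digest = stored.split(':')
        assert scheme == 'B'
        return mediawiki_b_hash(candidate, salt) == digest

    # A GPU cracker would run mediawiki_b_hash over millions of candidates
    # per second; two MD5s per guess is cheap, which is exactly the problem
    # a GPU-hostile hash design would try to fix.
    row = ':B:1a2b3c4d:' + mediawiki_b_hash('hunter2', '1a2b3c4d')
    print(check_row(row, 'hunter2'))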



Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread Paul Houle
  On 10/24/2010 8:42 PM, Aryeh Gregor wrote:
> My first thought was to write a GPU program to crack MediaWiki
> password hashes as quickly as possible, then use what we've studied in
> class about GPU architecture to design a hash function that would be
> as slow as possible to crack on a GPU relative to its PHP execution
> speed, as Tim suggested a while back.  However, maybe there's
> something more interesting I could do.

 Boring.

     I want Wikipedia converted into facts in a representation system
that supports modal, temporal, and "microtheory" reasoning.  You know,
in the "real" world, :James_T_Kirk is a :Fictional_Character, but in
the Star Trek universe, he's a :Person.

     Of course, you'd have to pick some chunk out of that big task
that's doable.  One thing I'd like is something that extracts the
"meaning" of hyperlinks.  For instance, if we look at

http://en.wikipedia.org/wiki/Bruce_Lee

we see a link to :Wong_Jack_Man, and in DBpedia right now, this is
represented as a unidirectional hyperlink without semantics.  Now, a
smarter system could say

:Bruce_Lee :Had_A_Fight_With :Wong_Jack_Man.

Although Wikipedia is a relatively difficult text to work with using
typical BOW and NLP methods, it's got enough semantic structure that
hybrid semantic-BOW/NLP methods ought to be able to work miracles.  I
think that the way hyperlinks are used in text could be used to learn
templates for detecting named-entity references.  I think it also ought
to be possible to build linguistic models for classification.  For
instance, if you're having trouble telling your Jaguars apart,

http://en.wikipedia.org/wiki/Jaguar_(disambiguation)

and related documents might help you make a filter that can tell the
difference between "jaguar the cat" and "jaguar the car".
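
A toy sketch of the first step of such a hybrid approach: pull every
wikilink out of an article's wikitext together with the sentence it sits
in, giving (source, context, target) candidates that a classifier could
later label with a relation such as :Had_A_Fight_With.  The sample text
and the crude sentence splitting are just placeholders:

    import re

    WIKILINK = re.compile(r'\[\[([^|\]]+)(?:\|[^\]]+)?\]\]')

    def link_contexts(title, wikitext):
        # Yield (source_page, sentence, target_page) for every wikilink.
        # Real dumps would need real sentence segmentation and NLP.
        for sentence in re.split(r'(?<=[.!?])\s+', wikitext):
            for m in WIKILINK.finditer(sentence):
                target = m.group(1).strip().replace(' ', '_')
                yield (title, sentence, target)

    sample = ("Lee responded to the challenge and fought [[Wong Jack Man]] "
              "in Oakland.  He later starred in [[Enter the Dragon]].")
    for triple in link_contexts('Bruce_Lee', sample):
        print(triple)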



Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread Platonides
Make the best dump compressor ever? :)

The page http://www.mediawiki.org/wiki/Dbzip2 is worth looking at just
for the available options. Continuing dbzip2 is the first idea but not
the only one. I'm sure many things can be dug up from there.
Also worth noting, Ariel has been doing the last en dumps in page batches.
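
As a rough illustration of the dbzip2 idea (not its actual
implementation): bzip2 streams can be concatenated and still decompress
as one file, so a compressor can split the input, compress the pieces in
a process pool, and write them back out in order.  A minimal sketch,
with an invented input filename:

    import bz2
    from multiprocessing import Pool

    CHUNK = 8 * 1024 * 1024   # 8 MB per worker; tune for the hardware

    def compress_chunk(data):
        # Each chunk becomes a complete, independent bzip2 stream.
        return bz2.compress(data, 9)

    def parallel_bzip2(src, dst, workers=8):
        with open(src, 'rb') as f, open(dst, 'wb') as out, \
                Pool(workers) as pool:
            chunks = iter(lambda: f.read(CHUNK), b'')
            # imap preserves order, so the concatenated streams
            # decompress back to the original byte sequence.
            for block in pool.imap(compress_chunk, chunks):
                out.write(block)

    if __name__ == '__main__':
        parallel_bzip2('pages-articles.xml', 'pages-articles.xml.bz2')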




Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread Aryeh Gregor
On Mon, Oct 25, 2010 at 12:38 PM, Paul Houle  wrote:
>     I want Wikipedia converted into facts in a representation system
> that supports modal,  temporal,  and "microtheory" reasoning.  You
> know,  in the "real" world,  :James_T_Kirk is a :Fictional_Character,
> but in the Star Trek universe,  he's a :Person.

This sounds like it would take far more work to actually write the
program in the first place than to parallelize it.

On Mon, Oct 25, 2010 at 4:47 PM, Platonides  wrote:
> Make the best dump compressor ever? :)
>
> The page http://www.mediawiki.org/wiki/Dbzip2 is worth looking at just
> for the available options. Continuing dbzip2 is the first idea but not
> the only one. I'm sure many things can be dig from there.
> Also worth noting, Ariel has been doing the last en dumps in page batches.

Possible.  It looks like dbzip2 has had a lot of optimization put into
it already, so I don't know that there would be much low-hanging fruit
for me to get.  I'm not sure if that would be acceptable for a final
project ("do lots of benchmarking and probably not end up improving it
much at all").  Whereas rewriting it to use the GPU sounds like a
suspiciously large project . . . plus I'm not sure GPUs would even be
suited for it.


Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread Ariel T. Glenn
On Mon, 25-10-2010 at 22:47 +0200, Platonides wrote:
> Make the best dump compressor ever? :)
> 
> The page http://www.mediawiki.org/wiki/Dbzip2 is worth looking at just
> for the available options. Continuing dbzip2 is the first idea but not
> the only one. I'm sure many things can be dig from there.
> Also worth noting, Ariel has been doing the last en dumps in page batches.
> 
The last dump.  (Just to be accurate.) As soon as I can get the "split
by number of revisions" stuff tested I will kick off another round.
Hopefully next week.  

I am following this discussion and happy to bat around ideas if there's
something that might be appropriate for your course, Aryeh.

Ariel




Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread Aryeh Gregor
On Mon, Oct 25, 2010 at 6:09 PM, Ariel T. Glenn  wrote:
> I am following this discussion and happy to bat around ideas if there's
> something that might be appropriate for your course, Aryeh.

Well, is there?  Anything that needs to be parallelized, or that's
already parallelized but needs significant optimization work (e.g.,
scales much worse than linearly)?



Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread George Herbert
On Mon, Oct 25, 2010 at 4:04 PM, Aryeh Gregor
 wrote:
> On Mon, Oct 25, 2010 at 6:09 PM, Ariel T. Glenn  wrote:
>> I am following this discussion and happy to bat around ideas if there's
>> something that might be appropriate for your course, Aryeh.
>
> Well, is there?  Anything that needs to be parallelized, or that's
> already parallelized but needs significant optimization work (e.g.,
> scales much worse than linearly)?

(suppressing grumbles about the diff engine for edit conflicts, which is
probably an algorithm rather than a speed issue)

I suspect that the MySQL engine is the one place where parallelism is
most applicable, but I am not a MySQL internals guru.  However, there is
one associated with the project, and I believe he is on the list.  Domas?


-- 
-george william herbert
george.herb...@gmail.com



Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread Aryeh Gregor
On Mon, Oct 25, 2010 at 7:15 PM, George Herbert
 wrote:
> (suppressing grumbles about diff engine for edit conflicts, probably
> an algorithm rather than speed issue)

Diffs are fast enough if you use wikidiff2, no?

> I suspect that the MySQL engine is the one place where parallelism is
> most applicable, but I am not a MySQL internals guru.  However, there
> is one associated with the project, and I believe on the list.  Domas?

Yes, parallelism is very applicable to databases, which is why they've
already had tons of effort invested by people who know a lot more
about the topic than me, so I probably can't do much there.


To clarify, the subject needs to 1) be reasonably doable in a short
timeframe, 2) not build on top of something that's already too
optimized.  It should probably either be a new project, or an effort
to parallelize something that already exists, isn't parallel yet, and
isn't too complicated.  So far I have the password-cracking thing,
maybe dbzip2, and maybe some unspecified thing involving dumps.


Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread George Herbert
On Mon, Oct 25, 2010 at 4:24 PM, Aryeh Gregor
 wrote:
> On Mon, Oct 25, 2010 at 7:15 PM, George Herbert
>  wrote:
>> (suppressing grumbles about diff engine for edit conflicts, probably
>> an algorithm rather than speed issue)
>
> Diffs are fast enough if you use wikidiff2, no?

Yeah; what I really want is a diff algorithm that properly resolves
more of the current edit-conflict problems.  It's not a speed issue,
it's smarts in the algorithm.

>> I suspect that the MySQL engine is the one place where parallelism is
>> most applicable, but I am not a MySQL internals guru.  However, there
>> is one associated with the project, and I believe on the list.  Domas?
>
> Yes, parallelism is very applicable to databases, which is why they've
> already had tons of effort invested by people who know a lot more
> about the topic than me, so I probably can't do much there.
>
> To clarify, the subject needs to 1) be reasonably doable in a short
> timeframe, 2) not build on top of something that's already too
> optimized.  It should probably either be a new project; or an effort
> to parallelize something that already exists, isn't parallel yet, and
> isn't too complicated.  So far I have the password-cracking thing,
> maybe dbzip2, and maybe some unspecified thing involving dumps.

I understand the context of the class and its scope.  I just couldn't
think of an MW-related topic, other than those, that would seem to need
serious parallelism work.

Good luck.


-- 
-george william herbert
george.herb...@gmail.com



Re: [Wikitech-l] Parallel computing project

2010-10-25 Thread Robert Rohde
Many of the things done for the statistical analysis of database dumps
should be suitable for parallelization (e.g. break the dump into
chunks, process the chunks in parallel and sum the results).  You
could talk to Erik Zachte.  I don't know if his code has already been
designed for parallel processing though.
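
A schematic of that split/process/sum pattern, assuming the dump has
already been cut into plain-XML chunk files (the per-chunk counts are
just stand-ins for whatever Wikistats actually measures):

    import glob
    from collections import Counter
    from multiprocessing import Pool

    def count_chunk(path):
        # Map step: crude per-chunk statistics over the XML text.
        counts = Counter()
        with open(path, encoding='utf-8') as f:
            for line in f:
                counts['pages'] += line.count('<page>')
                counts['revisions'] += line.count('<revision>')
        return counts

    def dump_stats(pattern='enwiki-chunk-*.xml', workers=16):
        with Pool(workers) as pool:
            partials = pool.map(count_chunk, sorted(glob.glob(pattern)))
        total = Counter()              # Reduce step: sum the partial results.
        for c in partials:
            total.update(c)
        return total

    if __name__ == '__main__':
        print(dump_stats())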

Another option might be to look at the methods for compressing old
revisions (is [1] still current?).

I make heavy use of parallel processing in my professional work (not
related to wikis), but I can't really think of any projects I have at
hand that would be accessible and completable in a month.

-Robert Rohde

[1] http://www.mediawiki.org/wiki/Manual:CompressOld.php




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Platonides
Robert Rohde wrote:
> Many of the things done for the statistical analysis of database dumps
> should be suitable for parallelization (e.g. break the dump into
> chunks, process the chunks in parallel and sum the results).  You
> could talk to Erik Zachte.  I don't know if his code has already been
> designed for parallel processing though.

I don't think it's a good candidate, since you are presumably working
from compressed files, and decompression forces the processing to be
sequential (and is most likely the bottleneck, too).


> Another option might be to look at the methods for compressing old
> revisions (is [1] still current?).
> 
> I make heavy use of parallel processing in my professional work (not
> related to wikis), but I can't really think of any projects I have at
> hand that would be accessible and completable in a month.
> 
> -Robert Rohde
> 
> [1] http://www.mediawiki.org/wiki/Manual:CompressOld.php

It can be used; I am unsure whether it is used by WMF.

Another thing that would be nice to have parallelised is the parser
tests.  That would need adding coroutine/task support to PHP, or
something similar.  The most similar extension I know of is runkit,
which is the other way around: several PHP scopes instead of several
threads in one scope.
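
PHP itself has no threads to offer here, but an external harness can get
most of the benefit today by batching the test cases across worker
processes.  A hypothetical driver in Python (the per-group test files
and the --file option are assumptions; check the actual parserTests.php
before relying on them):

    import subprocess
    from multiprocessing.dummy import Pool   # threads; they just wait on php

    # Hypothetical: parserTests.txt split into N per-group files beforehand.
    GROUPS = ['parserTests-part0.txt', 'parserTests-part1.txt',
              'parserTests-part2.txt', 'parserTests-part3.txt']

    def run_group(path):
        # Assumes parserTests.php accepts --file=<tests file>.
        return subprocess.call(['php', 'maintenance/parserTests.php',
                                '--file=' + path])

    if __name__ == '__main__':
        with Pool(len(GROUPS)) as pool:
            results = pool.map(run_group, GROUPS)
        print('all groups passed' if not any(results) else 'some group failed')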




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Jyothis Edathoot
Develop a new bot framework (maybe interwiki processing to start with)
for a high-performance GPU cluster (NVIDIA or AMD), similar to what
BOINC-based projects do.  NVIDIA is more popular, while AMD has more
cores for the same price.

 :)


Regards,
Jyothis.

http://www.Jyothis.net

http://ml.wikipedia.org/wiki/User:Jyothis
http://meta.wikimedia.org/wiki/User:Jyothis
I am the first customer of http://www.netdotnet.com

woods are lovely dark and deep,
but i have promises to keep and
miles to go before i sleep and
lines to go before I press sleep

completion date = (start date + ((estimated effort x 3.1415926) / resources)
+ ((total coffee breaks x 0.25) / 24)) + Effort in meetings





Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ariel T. Glenn
On Tue, 26-10-2010 at 16:25 +0200, Platonides wrote:
> Robert Rohde wrote:
> > Many of the things done for the statistical analysis of database dumps
> > should be suitable for parallelization (e.g. break the dump into
> > chunks, process the chunks in parallel and sum the results).  You
> > could talk to Erik Zachte.  I don't know if his code has already been
> > designed for parallel processing though.
> 
> I don't think it's a good candidate since you are presumably using
> compressed files, and its decompression linearises it (and is most
> likely the bottleneck, too).

If one were clever (and I have some code that would enable one to be
clever), one could seek to some point in the (bzip2-compressed) file and
uncompress from there before processing.  Running a bunch of jobs each
decompressing only their small piece then becomes feasible.  I don't
have code that does this for gz or 7z; afaik these do not do compression
in discrete blocks.

Ariel
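
A simpler variant of the same idea that works with nothing but the
standard library: if the dump is written as many concatenated bzip2
streams and you keep a byte-offset index, each worker can seek to its
offset and decompress just one stream.  (This is an assumption about the
file layout, not the block-level trick Ariel describes; the filename and
offset are invented.)

    import bz2

    def read_stream_at(path, offset):
        # Decompress the single bzip2 stream that starts at byte `offset`.
        decomp = bz2.BZ2Decompressor()
        pieces = []
        with open(path, 'rb') as f:
            f.seek(offset)
            while not decomp.eof:
                raw = f.read(1 << 20)          # 1 MB of compressed input
                if not raw:
                    break
                pieces.append(decomp.decompress(raw))
        return b''.join(pieces)

    # Each worker gets an (offset, page range) pair from the index and
    # parses only its own slice, e.g.:
    # xml = read_stream_at('enwiki-pages-articles-multistream.xml.bz2',
    #                      597283921)          # invented offset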




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Tisza Gergő
Aryeh Gregor writes:

> To clarify, the subject needs to 1) be reasonably doable in a short
> timeframe, 2) not build on top of something that's already too
> optimized.  It should probably either be a new project; or an effort
> to parallelize something that already exists, isn't parallel yet, and
> isn't too complicated.  So far I have the password-cracking thing,
> maybe dbzip2, and maybe some unspecified thing involving dumps.

Some PageRank-like metric to approximate Wikipedia article importance/quality?
Parallelizing eigenvalue calculations has a rich literature.
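
The core of that is just repeated matrix-vector products over the link
graph, which is exactly the part a GPU or MPI version would split up.  A
dense toy version of the power iteration (the 4-page link graph is made
up; a real run would use a sparse matrix over millions of pages):

    import numpy as np

    def pagerank(adj, damping=0.85, iters=50):
        # Power iteration on a row-normalised link matrix.
        n = adj.shape[0]
        out_deg = adj.sum(axis=1, keepdims=True)
        out_deg[out_deg == 0] = 1            # crude handling of dangling pages
        transition = adj / out_deg
        rank = np.full(n, 1.0 / n)
        for _ in range(iters):
            # The matrix-vector product below is the parallelisable kernel.
            rank = (1 - damping) / n + damping * (transition.T @ rank)
        return rank

    # Page i links to page j where adj[i, j] == 1.
    adj = np.array([[0, 1, 1, 0],
                    [0, 0, 1, 0],
                    [1, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    print(pagerank(adj))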




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Tim Starling
On 24/10/10 17:42, Aryeh Gregor wrote:
> This term I'm taking a course in high-performance computing, and I have
> to pick a topic for a final project.  According to the assignment,
> "The only real requirement is that it be something in parallel."  In
> the class, we covered
> 
> * Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
> * Multithreaded programming using OpenMP
> * GPU programming using OpenCL

I've occasionally wondered how hard it would be to parallelize a
parser.  It's generally not done, despite the fact that parsers are so
slow and useful.

Some file formats can certainly be parsed in a parallel way, if you
partition them in the right way. For example, if you were parsing a
CSV file, you could partition on the line breaks. You can't do that by
scanning the whole file O(N) since that would defeat the purpose, but
you can seek ahead to a suitable byte position, and then scan forwards
for the next line break to partition at.
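
That seek-then-scan partitioning looks roughly like this (the "parsing"
is just a line count so the partition logic stays visible; big.csv is a
hypothetical input):

    import os
    from multiprocessing import Pool

    def partition(path, parts):
        # Return (start, end) byte ranges that each begin just after a newline.
        size = os.path.getsize(path)
        cuts = [0]
        with open(path, 'rb') as f:
            for i in range(1, parts):
                f.seek(i * size // parts)
                f.readline()                   # skip ahead to the next line break
                cuts.append(f.tell())
        cuts.append(size)
        return list(zip(cuts, cuts[1:]))

    def parse_range(args):
        path, start, end = args
        rows = 0
        with open(path, 'rb') as f:
            f.seek(start)
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                rows += 1                      # real code would CSV-parse `line`
        return rows

    if __name__ == '__main__':
        path = 'big.csv'
        ranges = partition(path, 8)
        with Pool(8) as pool:
            print(sum(pool.map(parse_range, [(path, s, e) for s, e in ranges])))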

For more complex file formats, there are various approaches. Googling
tells me that this is a well-studied problem for XML.

Obviously for an assessable project, you don't want to dig yourself
into a hole too big to get out of. If you chose XML you could just
follow the previous work. JavaScript might be tractable. Attempting to
parse wikitext would be insane.

-- Tim Starling




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Robert Rohde
On Tue, Oct 26, 2010 at 8:25 AM, Ariel T. Glenn  wrote:
> On Tue, 26-10-2010 at 16:25 +0200, Platonides wrote:
>> Robert Rohde wrote:
>> > Many of the things done for the statistical analysis of database dumps
>> > should be suitable for parallelization (e.g. break the dump into
>> > chunks, process the chunks in parallel and sum the results).  You
>> > could talk to Erik Zachte.  I don't know if his code has already been
>> > designed for parallel processing though.
>>
>> I don't think it's a good candidate since you are presumably using
>> compressed files, and its decompression linearises it (and is most
>> likely the bottleneck, too).
>
> If one were clever (and I have some code that would enable one to be
> clever), one could seek to some point in the (bzip2-compressed) file and
> uncompress from there before processing.  Running a bunch of jobs each
> decompressing only their small piece then becomes feasible.  I don't
> have code that does this for gz or 7z; afaik these do not do compression
> in discrete blocks.

Actually the LZMA used by default in 7z can be partially parallelized,
with some strong limitations:

1) The location of block N can generally only be found by locating the
end of block N-1, so files have to be read serially.
2) The ability to decompress block N may or may not depend on already
having decompressed blocks N-1, N-2, N-3, etc., depending on the
details of the data stream.

Point 2 in particular tends to lead to a lot of conflicts that
prevent parallelization.  If block N happens to be independent of
block N-1 then they can be done in parallel, but in general this will
not be the case.  The frequency of such conflicts depends a lot on the
data stream and options given to the compressor.

Last year LZMA2 was introduced in 7z with the primary intent of
improving parallelization.  It actually produces slightly worse
compression in general, but can be operated so as to guarantee that
block N is independent of blocks N-1 ... N-k for a specified k, meaning
that k+1 blocks can always be processed in parallel.

I believe that gzip has similar constraints to LZMA that make
parallelization problematic, but I'm not sure about that.


Getting back to Wikimedia, it appears correct that the Wikistats code
is designed to run from the compressed files (source linked from [1]).
As you suggest, one could use the properties of the .bz2 format to
parallelize that.  I would also observe that parsers tend to be
relatively slow, while decompressors tend to be relatively fast.  I
wouldn't necessarily assume that the decompressing is the only
bottleneck.  I've run analyses on dumps that took longer to execute
than it took to decompress the files.  However, they probably didn't
take that many times longer (i.e. if the process were parallelized in
2 to 4 simultaneous chunks, then the decompression would be the
primary bottleneck again).

So it is probably true that if one wants to see a large increase in
the speed of stats processing one needs to consider parallelizing both
the decompression and the stats gathering.

-Robert Rohde

[1] http://stats.wikimedia.org/index_tabbed_new.html#fragment-14


Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ángel González
Ariel T. Glenn wrote:
> If one were clever (and I have some code that would enable one to be
> clever), one could seek to some point in the (bzip2-compressed) file and
> uncompress from there before processing.  Running a bunch of jobs each
> decompressing only their small piece then becomes feasible.  I don't
> have code that does this for gz or 7z; afaik these do not do compression
> in discrete blocks.
> 
> Ariel

The bzip2recover approach?
I am not sure how much the gain will be after so much bit shifting.
Also, I was unable to continue from a flushed point, so it may not be so easy.
OTOH, if you already have an index and the blocks end at page boundaries
(which is what I was doing), it becomes trivial.
Remember that the previous block MUST continue up to the point where the
next reader started processing inside the next block.  And unlike what
ttsiod said, you do encounter tags split between blocks with normal
compression.



Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ariel T. Glenn
On Wed, 27-10-2010 at 00:05 +0200, Ángel González wrote:
> Ariel T. Glenn wrote:
> > If one were clever (and I have some code that would enable one to be
> > clever), one could seek to some point in the (bzip2-compressed) file and
> > uncompress from there before processing.  Running a bunch of jobs each
> > decompressing only their small piece then becomes feasible.  I don't
> > have code that does this for gz or 7z; afaik these do not do compression
> > in discrete blocks.
> > 
> > Ariel
> 
> The bzip2recover approach?
> I am not sure how much will be the gain after so much bit moving.
> Also, I was unable to continue from a flushed point, it may not be so easy.
> OTOH, if you already have an index and the blocks end at page boundaries
> (which is what I was doing), it becomes trivial.
> Remember that the previous block MUST continue up to the point where the
> next reader started processing inside the next block. And unlike what
> ttsiod said, you do encounter tags split between blocks in a normal
> compression.

I am able (using Python bindings to the bzip2 library and some fiddling)
to seek to an arbitrary point, find the first block after the seek
point, and uncompress it and the following blocks in sequence.  That is
sufficient for our work, when we are talking about compressed files
around 250 GB in size.

We process everything by pages, so we ensure that any reader reads only
specified page ranges from the file.  This avoids overlaps.

We don't build an index; we're only talking about parallelizing 10-20
jobs at once, not all 21 million pages, so building an index would not
be worth it.

Ariel




Re: [Wikitech-l] Parallel computing project

2010-10-27 Thread James Salsman
Aryeh Gregor writes:
>
> To clarify, the subject needs to 1) be reasonably doable in a short
> timeframe, 2) not build on top of something that's already too
> optimized

Integrating a subset of RTMP (e.g. the
http://code.google.com/p/rtmplite subset) into the chunk-based file
upload API -- http://www.mediawiki.org/wiki/API:Upload#Chunked_upload
-- would be an example of parallel I/O that we really need if we ever
hope to have reasonable microphone uploads for Wiktionary
pronunciation collection.  I know Flash sucks, but it sucks way less
for microphone upload than the currently nonexistent HTML5 audio upload
support, client-side Java, or any other alternative, and probably will
suck way less than any of those alternatives for years.  Soon GNU
Gnash should have microphone Speex upload on all three major
platforms, assuming the Gnash programming team doesn't starve to death
first.
