Re: mod_perl for multi-process file processing?

2015-02-02 Thread Cosimo Streppone

Alan Raetz wrote:


So I have a perl application that upon startup loads about ten perl
hashes (some of them complex) from files. This takes up a few GB of
memory and about 5 minutes. It then iterates through some cases and
reads from (never writes) these perl hashes. To process all our cases,
it takes about 3 hours (millions of cases). We would like to speed up
this process. I am thinking this is an ideal application of mod_perl
because it would allow multiple processes but share memory.


Sure, you could use mod_perl for this.
I would also consider at least these alternatives:

- use Cache::FastMmap, https://metacpan.org/pod/Cache::FastMmap
  Load up your data with a loader script, and forget about it
  (see the sketch below). Cache::FastMmap also works with mod_perl.

- use a network server, like memcached or redis to store your
  read-only data, and use a lightweight network protocol (on localhost)
  to get the data.

In both cases, reading from multiple processes will not be a problem.
The cheapest solution for the consumer part (the "cases" above)
would be to use a command like "parallel" to fire up as many copies
of your consumer script as you can afford.
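
For what it's worth, a rough sketch of the Cache::FastMmap route (the file
path, the cache size and the one-case-per-invocation consumer are my
assumptions, not anything Alan described):

# loader.pl - run once, before any consumers start
use strict;
use warnings;
use Cache::FastMmap;

my $cache = Cache::FastMmap->new(
    share_file     => '/tmp/big_data.mmap',  # assumed location
    cache_size     => '2048m',               # size generously; FastMmap evicts when full
    expire_time    => 0,                     # never expire, the data is read-only
    unlink_on_exit => 0,                     # keep the file after the loader exits
);

# build %big_hash exactly as the current loader code does, then push it in;
# complex values are serialized automatically (Storable under the hood)
my %big_hash = ( example_key => { some => 'structure' } );
$cache->set($_, $big_hash{$_}) for keys %big_hash;

# consumer.pl - run as many copies of this at once as you like
# (e.g. under GNU parallel), one case per invocation
use strict;
use warnings;
use Cache::FastMmap;

my $cache = Cache::FastMmap->new(
    share_file  => '/tmp/big_data.mmap',     # same parameters as the loader
    cache_size  => '2048m',
    expire_time => 0,
);

my $case  = shift @ARGV;
my $value = $cache->get($case);
print defined $value ? "$case: found data\n" : "$case: no data\n";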

Hope this helps,

--
Cosimo



Re: mod_perl for multi-process file processing?

2015-02-02 Thread Dr James Smith

Alan/Alexandr,

There will always be an overhead to using a webserver for this - even
using mod_perl.


Assumptions:
*   From what you are saying, there is no actual website involved, but you
    want to use mod_perl to cache data for an offline process;

*   One set of data is used once and once only per run?

Pros:
*   Make sure you load your module at startup so that each child uses the
    same memory rather than generating its own copy of the data (see the
    startup.pl sketch below);
*   If you use something like curl multi as the fetcher, you can write a
    simple parallel fetching queue to get the data - great if you have a
    multi-core box;


Cons:
*   There is an overhead to using an HTTP webserver - if you aren't going
    to gain much from the parallelization of processes above, you may find
    that writing a simple script which loops over all the data would be
    more efficient...
*   In your case each case probably takes about 10ms (or less) to process,
    so the apache/http round tripping will probably take much more time
    than the actual processing...
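
To make the first point above concrete, a minimal startup sketch - the
paths are placeholders, and it assumes the Data.pm module Alan describes
further down the thread:

# startup.pl, pulled in from httpd.conf with:  PerlRequire /path/to/startup.pl
use strict;
use warnings;

use lib '/path/to/your/lib';   # wherever Data.pm lives (placeholder)

# Loading the module here builds the big hashes once, in the parent process.
# Children forked afterwards share those pages copy-on-write, so the few GB
# are not duplicated per child - as long as nothing writes to the data
# (Perl's reference counting can still unshare some pages over time).
use Data ();

1;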


On 03/02/2015 05:02, Alexandr Evstigneev wrote:
Pre-loading is good, but what you need, I believe, is the Storable module.
If your files contain parsed data (hashes), just store them serialized. If
they contain raw data that needs to be parsed, you may pre-parse it,
serialize it, and store it as binary files.

Storable is written in C and works very fast.






Re: mod_perl for multi-process file processing?

2015-02-02 Thread Alexandr Evstigneev
Pre-loading is good, but what you need, I believe, is the Storable module.
If your files contain parsed data (hashes), just store them serialized. If
they contain raw data that needs to be parsed, you may pre-parse it,
serialize it, and store it as binary files.
Storable is written in C and works very fast.
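
For example - a sketch only, with made-up file names and a stand-in for the
real parsing code:

use strict;
use warnings;
use Storable qw(nstore retrieve);

# one-off: build the hash from the raw text files as the code already does,
# then write a binary snapshot of it to disk
my %big_hash = ( example_key => 'example_value' );   # stand-in for the real parsing
nstore(\%big_hash, 'big_hash.storable');

# every later run: skip the parsing and pull the snapshot straight back in
my $big_hashref = retrieve('big_hash.storable');
print $big_hashref->{example_key}, "\n";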




mod_perl for multi-process file processing?

2015-02-02 Thread Alan Raetz
So I have a perl application that upon startup loads about ten perl hashes
(some of them complex) from files. This takes up a few GB of memory and
about 5 minutes. It then iterates through some cases and reads from (never
writes) these perl hashes. To process all our cases, it takes about 3 hours
(millions of cases). We would like to speed up this process. I am thinking
this is an ideal application of mod_perl because it would allow multiple
processes but share memory.

The scheme would be to load the hashes on apache startup and have a master
program send requests with each case and apache children will use the
shared hashes.
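
(For illustration, the master program could be something as small as the
sketch below - the URL, the "case" parameter and the cases.txt input file
are placeholders, not something that exists yet.)

# master.pl - hypothetical driver
use strict;
use warnings;
use HTTP::Tiny;

my $http = HTTP::Tiny->new;
my @output;

open(my $cases_fh, '<', 'cases.txt') or die "cases.txt: $!";  # assumed input file
while (my $case = <$cases_fh>) {
    chomp $case;
    my $res = $http->get("http://localhost/process?case=$case");
    die "case $case failed: status $res->{status}" unless $res->{success};
    push @output, $res->{content};
}
close $cases_fh;

# concatenate the per-case responses into the final output
print @output;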

I just want to verify some of the details about variable sharing.  Would
the following setup work (oversimplified, but you get the idea…):

In a file Data.pm, which I would use() in my Apache startup.pl, I would
load the perl hashes and have hash references that would be retrieved with
class methods:

package Data;

my %big_hash;

open(FILE, "file.txt") or die "Cannot open file.txt: $!";

while ( <FILE> ) {

  … code ….

  $big_hash{ $key } = $value;
}

sub get_big_hashref {   return \%big_hash; }



And so in the apache request handler, the code would be something like:

use Data;

my $hashref = Data::get_big_hashref();

…. code to access $hashref data with request parameters…..



The idea is the HTTP request/response will contain the relevant
input/output for each case… and the master client program will collect
these and concatenate the final output from all the requests.
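
(And on the Apache side, roughly this kind of handler is what I have in
mind - the package name, the /process location and the "case" query
parameter below are placeholders, and the hash lookup stands in for the
real per-case processing.)

# In httpd.conf:
#   PerlRequire /path/to/startup.pl
#   <Location /process>
#       SetHandler perl-script
#       PerlResponseHandler MyApp::CaseHandler
#   </Location>

package MyApp::CaseHandler;

use strict;
use warnings;

use Apache2::RequestRec ();
use Apache2::RequestIO  ();
use Apache2::Const -compile => qw(OK);
use Data ();   # already loaded in startup.pl, so the hashes are shared

sub handler {
    my $r = shift;

    # e.g. GET /process?case=12345
    my ($case_id) = ($r->args // '') =~ /case=([^&;]*)/;

    my $hashref = Data::get_big_hashref();
    my $result  = defined $case_id ? $hashref->{$case_id} : undef;

    # real code would format/serialize complex values; this just echoes a scalar
    $r->content_type('text/plain');
    $r->print(defined $result ? "$result\n" : "no result for this case\n");

    return Apache2::Const::OK;
}

1;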

So any issues/suggestions with this approach? I am facing a non-trivial
task of refactoring the existing code to work in this framework, so just
wanted to get some feedback before I invest more time into this...

I am planning on using mod_perl 2.07 on a linux machine.

Thanks in advance, Alan