Re: mod_perl for multi-process file processing?

2015-02-03 Thread Patton, Billy N
Unless I’m entirely wrong, it appears that you want to use shared, threaded
memory.
This would allow you to stay out of Apache altogether.
Here is an example of using threads that I worked out using shared memory.
We took a 4-hour serial task and turned it into 5 minutes with threads.
This worked extremely well for me.  I wasn’t using large hashes, but I was
processing hundreds of files per thread, with 30k lines per file.
#!/usr/bin/env perl -w
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Deepcopy = 1;
use threads;
use threads::shared;
use constant MAX_TRIES => 5;

sub sub_threads($$$);

my $switch= undef;
my $hash  = undef;
my $gsx   = undef;
my $cnt   = 5;
my %switches= ( 'A' => { 'b' => undef , 'c' => undef, 'd' => undef },
                'B' => { 'b' => undef , 'c' => undef, 'd' => undef },
                'C' => { 'b' => undef , 'c' => undef, 'd' => undef },
                'D' => { 'b' => undef , 'c' => undef, 'd' => undef },
                'E' => { 'b' => undef , 'c' => undef, 'd' => undef },
              );
my %threads : shared  = ();

##
## create the threads
##
while (($switch,$hash) = each %switches) {
  unless (exists $threads{$switch}) {
    my %h : shared;
    $threads{$switch} = \%h;
  }
  while (($gsx, $_) = each %$hash) {
    unless (exists $threads{$switch}{$gsx}) {
      my %h : shared;
      $threads{$switch}{$gsx} = \%h;
    }
    unless (exists $threads{$switch}{$gsx}{'messages'}) {
      my @h : shared;
      $threads{$switch}->{$gsx}->{'messages'} = \@h;
    }
    $hash->{$gsx}->{'thread'} =
      threads->create(\&sub_threads, \$switch, \$gsx, \$cnt);
    $hash->{$gsx}->{'tries'}  = 1;
    $cnt += 5;
  }
}

#print Dumper \%threads;
#print Dumper \%switches;

##
## endless loop to run while threads are still running
##
$cnt = 1;
while ($cnt) {
  $cnt = 0;
  while (($switch,$hash) = each %switches) {
    while (($gsx, $_) = each %$hash) {
      if ($hash->{$gsx}->{'thread'}->is_running()) {
        $cnt = 1;
        #print "$switch->$gsx is running\n";
      } else {
        #print "$switch->$gsx is NOT running\n";
        #print "  Reason for failure : \n";
        #print ' ' . join("\n", @{$threads{$switch}->{$gsx}->{'messages'}}) . "\n";
        # restart the thread until it has been tried MAX_TRIES times
        if ($hash->{$gsx}->{'tries'} < MAX_TRIES) {
          #print "  max tries not reached for $switch->$gsx, will be trying again!\n";
          $hash->{$gsx}->{'tries'}++;
          $hash->{$gsx}->{'thread'} =
            threads->create(\&sub_threads, \$switch, \$gsx, \$cnt);
        } else {
          print "send email! $switch->$gsx\n";
        }
      }
    }
    sleep 2;
  }
}

#print Dumper \%threads;
#print Dumper \%switches;


sub sub_threads($$$) {
  my $ptr_switch  = shift;
  my $ptr_gsx     = shift;
  my $ptr_tNum    = shift;
  sleep $$ptr_tNum;
  {
    lock(%threads);
    push @{$threads{$$ptr_switch}->{$$ptr_gsx}->{'messages'}} ,
      "Leaving thread $$ptr_switch->$$ptr_gsx";
    #$threads{$$ptr_switch}->{$ptr_gsx}->{'messages'} = "Leaving thread $$ptr_switch->$$ptr_gsx";
    # lock freed at end of scope
  }
  return 0;
}

On Feb 2, 2015, at 10:11 PM, Alan Raetz alanra...@gmail.com wrote:

So I have a perl application that upon startup loads about ten perl hashes 
(some of them complex) from files. This takes up a few GB of memory and about 5 
minutes. It then iterates through some cases and reads from (never writes) 
these perl hashes. To process all our cases, it takes about 3 hours (millions 
of cases). We would like to speed up this process. I am thinking this is an 
ideal application of mod_perl because it would allow multiple processes but 
share memory.

The scheme would be to load the hashes on apache startup and have a master 
program send requests with each case and apache children will use the shared 
hashes.

I just want to verify some of the details about variable sharing.  Would the 
following setup work (oversimplified, but you get the idea…):

In a file Data.pm, which I would use() in my Apache startup.pl, I would load
the perl hashes and have hash references that would be retrieved with class
methods:

package Data;

my %big_hash;

open(FILE, "file.txt");

while ( <FILE> ) {

  … code ….

  $big_hash{ $key } = $value;
}

sub get_big_hashref {   return \%big_hash; }

snip

And so in the apache request handler, the code would be something like:

use Data;

my $hashref = Data::get_big_hashref();

…. code to access $hashref data with request parameters…..

snip

The idea is the HTTP request/response will contain the relevant input/output 
for each case… and the master client program will collect these and 
concatenate the final output from all the requests.

So any issues/suggestions with this approach? I am facing a non-trivial task of 
refactoring the existing code to work in this framework, so just wanted to get 
some feedback before I invest more time into this...

I am planning on using mod_perl 2.07 on a linux machine.

Thanks in advance, Alan

Re: mod_perl for multi-process file processing?

2015-02-03 Thread Perrin Harkins
Cache::FastMmap is a great module for sharing read/write data, but it can't
compete with the speed of loading it all into memory before forking as Alan
said he plans to do.

- Perrin

On Tue, Feb 3, 2015 at 2:05 AM, Cosimo Streppone cos...@streppone.it
wrote:

 Alan Raetz wrote:

  So I have a perl application that upon startup loads about ten perl
 hashes (some of them complex) from files. This takes up a few GB of
 memory and about 5 minutes. It then iterates through some cases and
 reads from (never writes) these perl hashes. To process all our cases,
 it takes about 3 hours (millions of cases). We would like to speed up
 this process. I am thinking this is an ideal application of mod_perl
 because it would allow multiple processes but share memory.


 Sure you could use modperl for this.
 I would also consider at least these alternatives:

 - use Cache::FastMmap, https://metacpan.org/pod/Cache::FastMmap
   Load up your data with a loader script, and forget about it.
   Cache::FastMmap also works with modperl.

 - use a network server, like memcached or redis to store your
   read-only data, and use a lightweight network protocol (on localhost)
   to get the data.

 In both cases, reading from multiple processes will not be a problem.
 The cheapest solution for the consumer part (the cases above)
 would be to use a command like parallel to fire up as many copies
 of your consumer script as you can afford.

 Hope this helps,

 --
 Cosimo




Re: mod_perl for multi-process file processing?

2015-02-03 Thread Perrin Harkins
I agree, either threads or Parallel::ForkManager, depending on your
platform and your perl, will be a lot faster than mod_perl for this. Of
course there might be other reasons to use mod_perl, e.g. it's useful to
have this available as a remote service, or you want to call this
frequently for small jobs throughout the day without needing to reload the
data.
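For illustration, a minimal Parallel::ForkManager sketch of that approach
(load_hashes(), load_case_list() and process_case() are hypothetical
placeholders, not anything from Alan's code):

#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

# Build the big read-only data once, in the parent, before forking,
# so the children share those pages copy-on-write.
my $big_hashref = load_hashes();        # hypothetical loader
my @cases       = load_case_list();     # hypothetical case list

my $pm = Parallel::ForkManager->new(8); # up to 8 worker processes

for my $case (@cases) {
    $pm->start and next;                # parent: fork a child, move on
    process_case($case, $big_hashref);  # child: read-only access to the data
    $pm->finish;                        # child exits here
}
$pm->wait_all_children;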

- Perrin

On Tue, Feb 3, 2015 at 7:42 AM, Patton, Billy N billy.pat...@h3net.com
wrote:

 [quoted message snipped]

Re: mod_perl for multi-process file processing?

2015-02-03 Thread Patton, Billy N
You will find that when you share the memory, the hashes are not copied to each
thread.
The docs are a little misleading.
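For example, a small sketch (with made-up sample data) of how threads::shared's
shared_clone() keeps one shared copy that worker threads read, instead of each
thread cloning its own:

use strict;
use warnings;
use threads;
use threads::shared;

# shared_clone() makes a deep, shared copy of the structure; threads created
# afterwards see this one shared copy rather than a private clone of it.
my $big = shared_clone({
    key1 => { a => 1, b => 2 },   # made-up sample data
    key2 => { a => 3, b => 4 },
});

my @workers = map {
    my $id = $_;
    threads->create(sub { return $big->{"key$id"}{a} });   # read-only lookups
} 1 .. 2;

print $_->join(), "\n" for @workers;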
On Feb 3, 2015, at 11:54 AM, Alan Raetz alanra...@gmail.com wrote:

Thanks for all the input. I was considering threads, but according to how I 
read the perlthrtut
(http://search.cpan.org/~rjbs/perl-5.18.4/pod/perlthrtut.pod#Threads_And_Data),
quote: "When a new Perl thread is created, all the data associated with the
current thread is copied to the new thread, and is subsequently private to that
new thread." So in my application, each thread would get the entire memory
footprint copied. So although the data is shared in terms of application 
usage, in terms of physical memory limitations, I would quickly use up the 
machine memory. If you're reading/writing files, this may not be a significant 
difference, but with the way this app is structured now, I think it may be.

One of my thoughts as far as reducing the impact of request overhead was to 
bundle requests, such that a single request is getting multiple tasks.
I will look into some of these suggestions more, thanks again.




Re: mod_perl for multi-process file processing?

2015-02-03 Thread Alan Raetz
Thanks for all the input. I was considering threads, but according to how I
read the perlthrtut (
http://search.cpan.org/~rjbs/perl-5.18.4/pod/perlthrtut.pod#Threads_And_Data),
quote: "When a new Perl thread is created, all the data associated with the
current thread is copied to the new thread, and is subsequently private to
that new thread." So in my application, each thread would get the entire
memory footprint copied. So although the data is shared in terms of
application usage, in terms of physical memory limitations, I would quickly
use up the machine memory. If you're reading/writing files, this may not be
a significant difference, but with the way this app is structured now, I
think it may be.

One of my thoughts as far as reducing the impact of request overhead was to
bundle requests, such that a single request is getting multiple tasks.
I will look into some of these suggestions more, thanks again.


Re: mod_perl for multi-process file processing?

2015-02-02 Thread Dr James Smith

Alan/Alexandr,

There will always be an overhead with using a webserver to do this - 
even using mod_perl.


Assumptions:
* From what you are saying, there is no actual website
involved, but you want to use mod_perl to cache data for an offline process;

* One set of data is used once and once only for a run?

Pros:
* Make sure you use your module in startup so that each child
uses the same memory rather than generating its own copy of the data
(a rough sketch follows below);
* If you use something like curl multi as the fetcher, you can
write a simple parallel fetching queue to get the data - great if you
have a multi-core box;


Cons:
* There is an overhead to using an HTTP-protocol webserver - if you
aren't going to gain much from the parallelization of processes above,
you may find that writing a simple script which loops over all data
would be more efficient...
* In your case we are probably looking at about 10ms (or less) per case;
the apache/http round tripping will probably take much more time than
the actual processing...
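As a rough illustration of the startup point above (the paths, the
MyApp::Handler package name and the query handling are only assumptions
layered on Alan's Data.pm sketch, not a definitive configuration):

# startup.pl, pulled in via "PerlRequire /path/to/startup.pl" in httpd.conf
use strict;
use warnings;
use lib '/path/to/lib';   # assumed location of Data.pm
use Data;                 # builds the big hashes once, in the parent,
                          # before Apache forks its children
1;

# MyApp/Handler.pm - a bare-bones mod_perl 2 response handler sketch
package MyApp::Handler;
use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::RequestIO  ();
use Apache2::Const -compile => qw(OK);
use Data;

sub handler {
    my $r       = shift;
    my $hashref = Data::get_big_hashref();   # same preloaded data in every child
    my $key     = $r->args // '';            # e.g. /lookup?some_case_key
    $r->content_type('text/plain');
    $r->print(defined $hashref->{$key} ? $hashref->{$key} : 'not found');
    return Apache2::Const::OK;
}

1;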


On 03/02/2015 05:02, Alexandr Evstigneev wrote:
Pre-loading is good, but what you need, I believe, is the Storable module.
If your files contain parsed data (hashes), just store them
serialized. If they contain raw data that needs to be parsed, you may
pre-parse it, serialize it and store it as binary files.

Storable is written in C and works very fast.


2015-02-03 7:11 GMT+03:00 Alan Raetz alanra...@gmail.com:


So I have a perl application that upon startup loads about ten
perl hashes (some of them complex) from files. This takes up a few
GB of memory and about 5 minutes. It then iterates through some
cases and reads from (never writes) these perl hashes. To process
all our cases, it takes about 3 hours (millions of cases). We
would like to speed up this process. I am thinking this is an
ideal application of mod_perl because it would allow multiple
processes but share memory.

The scheme would be to load the hashes on apache startup and have
a master program send requests with each case and apache children
will use the shared hashes.

I just want to verify some of the details about variable sharing.
Would the following setup work (oversimplified, but you get the
idea…):

In a file Data.pm, which I would use() in my Apache startup.pl, I would
load the perl hashes and have hash
references that would be retrieved with class methods:

package Data;

my %big_hash;

open(FILE, "file.txt");

while ( <FILE> ) {

  … code ….

  $big_hash{ $key } = $value;
}

sub get_big_hashref {   return \%big_hash; }

snip

And so in the apache request handler, the code would be something
like:

use Data;

my $hashref = Data::get_big_hashref();

…. code to access $hashref data with request parameters…..

snip

The idea is the HTTP request/response will contain the relevant
input/output for each case… and the master client program will
collect these and concatenate the final output from all the requests.

So any issues/suggestions with this approach? I am facing a
non-trivial task of refactoring the existing code to work in this
framework, so just wanted to get some feedback before I invest
more time into this...

I am planning on using mod_perl 2.07 on a linux machine.

Thanks in advance, Alan










Re: mod_perl for multi-process file processing?

2015-02-02 Thread Alexandr Evstigneev
Pre-loading is good, but what you need, I believe, is the Storable module. If
your files contain parsed data (hashes), just store them serialized. If
they contain raw data that needs to be parsed, you may pre-parse it, serialize
it and store it as binary files.
Storable is written in C and works very fast.
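A minimal sketch of that idea (the file names, record format and lookup key
here are just placeholders): pre-parse once and freeze the hash with Storable,
then every later run retrieves it in one call.

use strict;
use warnings;
use Storable qw(nstore retrieve);

# One-off pre-parse step: build the hash from the text file and freeze it.
my %big_hash;
open my $fh, '<', 'file.txt' or die "file.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;   # placeholder record format
    $big_hash{$key} = $value;
}
close $fh;
nstore(\%big_hash, 'big_hash.stor');

# Every later run: pull the pre-parsed structure back in one call.
my $hashref = retrieve('big_hash.stor');
print $hashref->{'some_key'} // 'not found', "\n";   # hypothetical key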


2015-02-03 7:11 GMT+03:00 Alan Raetz alanra...@gmail.com:

 So I have a perl application that upon startup loads about ten perl hashes
 (some of them complex) from files. This takes up a few GB of memory and
 about 5 minutes. It then iterates through some cases and reads from (never
 writes) these perl hashes. To process all our cases, it takes about 3 hours
 (millions of cases). We would like to speed up this process. I am thinking
 this is an ideal application of mod_perl because it would allow multiple
 processes but share memory.

 The scheme would be to load the hashes on apache startup and have a master
 program send requests with each case and apache children will use the
 shared hashes.

 I just want to verify some of the details about variable sharing.  Would
 the following setup work (oversimplified, but you get the idea…):

 In a file Data.pm, which I would use() in my Apache startup.pl, I would
 load the perl hashes and have hash references that would be retrieved with
 class methods:

 package Data;

 my %big_hash;

 open(FILE, "file.txt");

 while ( <FILE> ) {

   … code ….

   $big_hash{ $key } = $value;
 }

 sub get_big_hashref {   return \%big_hash; }

 snip

 And so in the apache request handler, the code would be something like:

 use Data;

 my $hashref = Data::get_big_hashref();

 …. code to access $hashref data with request parameters…..

 snip

 The idea is the HTTP request/response will contain the relevant
 input/output for each case… and the master client program will collect
 these and concatenate the final output from all the requests.

 So any issues/suggestions with this approach? I am facing a non-trivial
 task of refactoring the existing code to work in this framework, so just
 wanted to get some feedback before I invest more time into this...

 I am planning on using mod_perl 2.07 on a linux machine.

 Thanks in advance, Alan



Re: mod_perl for multi-process file processing?

2015-02-02 Thread Cosimo Streppone

Alan Raetz wrote:


So I have a perl application that upon startup loads about ten perl
hashes (some of them complex) from files. This takes up a few GB of
memory and about 5 minutes. It then iterates through some cases and
reads from (never writes) these perl hashes. To process all our cases,
it takes about 3 hours (millions of cases). We would like to speed up
this process. I am thinking this is an ideal application of mod_perl
because it would allow multiple processes but share memory.


Sure you could use modperl for this.
I would also consider at least these alternatives:

- use Cache::FastMmap, https://metacpan.org/pod/Cache::FastMmap
  Load up your data with a loader script, and forget about it.
  Cache::FastMmap also works with modperl. (A minimal sketch follows
  after this list.)

- use a network server, like memcached or redis to store your
  read-only data, and use a lightweight network protocol (on localhost)
  to get the data.
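A rough Cache::FastMmap sketch of the first option (the cache file path,
size, keys and sample values are illustrative assumptions, not
recommendations):

use strict;
use warnings;
use Cache::FastMmap;

# Loader script: run once to populate the mmap'ed cache file.
my $cache = Cache::FastMmap->new(
    share_file     => '/tmp/big_hash.fastmmap',   # illustrative path
    init_file      => 1,                          # start from an empty file
    cache_size     => '512m',                     # sized to hold the data
    expire_time    => 0,                          # never expire entries
    unlink_on_exit => 0,                          # leave the file for readers
);
$cache->set($_, { some => 'parsed data' }) for qw(key1 key2 key3);  # made-up data

# Consumer processes (run as many in parallel as you like) attach to the same file.
my $reader = Cache::FastMmap->new(
    share_file     => '/tmp/big_hash.fastmmap',
    init_file      => 0,
    unlink_on_exit => 0,
);
my $record = $reader->get('key1');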

In both cases, reading from multiple processes will not be a problem.
The cheapest solution for the consumer part (the cases above)
would be to use a command like parallel to fire up as many copies
of your consumer script as you can afford.

Hope this helps,

--
Cosimo