Unless I’m entirely wrong, it appears that you want to use shared threaded
memory. That would let you stay out of Apache altogether.
Here is an example of using threads with shared memory that I worked out.
We took a 4 hour serial task and turned it into 5 minutes with threads.
This worked extremely well for me. I wasn’t using large hashes, but I was
processing hundreds of files per thread, with 30k lines per file.
#!/usr/bin/env perl
use strict;
use warnings;

use Data::Dumper;
$Data::Dumper::Indent   = 1;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Deepcopy = 1;

use threads;
use threads::shared;

use constant MAX_TRIES => 5;

sub sub_threads($$$);

my $switch = undef;
my $hash   = undef;
my $gsx    = undef;
my $cnt    = 5;

my %switches = (
    'A' => { 'b' => undef, 'c' => undef, 'd' => undef },
    'B' => { 'b' => undef, 'c' => undef, 'd' => undef },
    'C' => { 'b' => undef, 'c' => undef, 'd' => undef },
    'D' => { 'b' => undef, 'c' => undef, 'd' => undef },
    'E' => { 'b' => undef, 'c' => undef, 'd' => undef },
);

# Shared structure the threads push their status messages into.
my %threads : shared = ();

######
## create the threads
######
while ( ($switch, $hash) = each %switches ) {
    unless ( exists $threads{$switch} ) {
        my %h : shared;
        $threads{$switch} = \%h;
    }
    while ( ($gsx, $_) = each %$hash ) {
        unless ( exists $threads{$switch}{$gsx} ) {
            my %h : shared;
            $threads{$switch}{$gsx} = \%h;
        }
        unless ( exists $threads{$switch}{$gsx}{'messages'} ) {
            my @h : shared;
            $threads{$switch}{$gsx}{'messages'} = \@h;
        }
        $hash->{$gsx}{'thread'} =
            threads->create( \&sub_threads, \$switch, \$gsx, \$cnt );
        $hash->{$gsx}{'tries'} = 1;
        $cnt += 5;    # stagger each thread's sleep time
    }
}

#print Dumper \%threads;
#print Dumper \%switches;

######
## poll until no threads are still running
######
$cnt = 1;
while ($cnt) {
    $cnt = 0;
    while ( ($switch, $hash) = each %switches ) {
        while ( ($gsx, $_) = each %$hash ) {
            my $thr = $hash->{$gsx}{'thread'};
            next unless defined $thr;    # already reaped, nothing to do
            if ( $thr->is_running() ) {
                $cnt = 1;
                # print "$switch->$gsx is running\n";
            }
            else {
                # print "$switch->$gsx is NOT running\n";
                # print " Reason for failure : \n";
                # print '  ' . join( "\n", @{ $threads{$switch}{$gsx}{'messages'} } ) . "\n";
                $thr->join();    # reap the finished thread
                if ( $hash->{$gsx}{'tries'} < MAX_TRIES ) {
                    # print " max tries not reached for $switch->$gsx, will be trying again!\n";
                    $hash->{$gsx}{'tries'}++;
                    $hash->{$gsx}{'thread'} =
                        threads->create( \&sub_threads, \$switch, \$gsx, \$cnt );
                    $cnt = 1;    # a fresh thread is now running
                }
                else {
                    print "send email! $switch->$gsx\n";
                    $hash->{$gsx}{'thread'} = undef;    # give up on this one
                }
            }
        }
    }
    sleep 2;
}

#print Dumper \%threads;
#print Dumper \%switches;

sub sub_threads($$$) {
    my $ptr_switch = shift;
    my $ptr_gsx    = shift;
    my $ptr_tNum   = shift;

    sleep $$ptr_tNum;    # stand-in for the real per-thread work

    {
        lock(%threads);
        push @{ $threads{$$ptr_switch}{$$ptr_gsx}{'messages'} },
            "Leaving thread $$ptr_switch->$$ptr_gsx";
        # lock freed at end of scope
    }
    return 0;
}
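For the read-only case in the original question, the same pattern boils
down to something like the sketch below: populate one shared hash in the
main thread before spawning the workers, then let every thread read from
it, locking only the shared result list. This is only a sketch; the case
list, worker count, and example data are all made up.

#!/usr/bin/env perl
use strict;
use warnings;

use threads;
use threads::shared;

my %big_hash : shared;    # one copy, visible to every thread
my @results  : shared;

# Load the hash once, before any workers exist.
$big_hash{example_key} = 'example_value';    # stand-in for the real load

my @cases     = ( 1 .. 100 );    # made-up case list
my $n_workers = 4;               # made-up worker count

my @workers;
for my $w ( 0 .. $n_workers - 1 ) {
    push @workers, threads->create( sub {
        # Each worker takes every $n_workers-th case.
        for ( my $i = $w ; $i < @cases ; $i += $n_workers ) {
            my $out = "case $cases[$i]: $big_hash{example_key}";
            lock(@results);    # only the result list needs locking
            push @results, $out;
        }
    } );
}
$_->join() for @workers;

print scalar(@results), " cases processed\n";

One caveat: threads::shared keeps shared data in a separate interpreter,
so reading %big_hash is slower than reading an ordinary hash. With a few
GB of data, that overhead is worth measuring before committing.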
On Feb 2, 2015, at 10:11 PM, Alan Raetz <[email protected]> wrote:
So I have a perl application that upon startup loads about ten perl hashes
(some of them complex) from files. This takes a few GB of memory and about 5
minutes. It then iterates through some cases and reads from (never writes to)
these perl hashes. Processing all our cases (millions of them) takes about 3
hours, and we would like to speed that up. I am thinking this is an ideal
application for mod_perl because it would allow multiple processes that share
memory.
The scheme would be to load the hashes on Apache startup and have a master
program send a request for each case; the Apache children would use the shared
hashes.
I just want to verify some of the details about variable sharing. Would the
following setup work (oversimplified, but you get the idea…):
In a file Data.pm, which I would use() in my Apache startup.pl, I would load
the perl hashes and have hash references that would be retrieved with class
methods:
package Data;

my %big_hash;

open( my $fh, '<', 'file.txt' ) or die "Cannot open file.txt: $!";
while ( <$fh> ) {
    # … code ….
    $big_hash{ $key } = $value;
}
close $fh;

sub get_big_hashref { return \%big_hash; }

1;    # a module must return a true value for use() to succeed
<snip>
And so in the Apache request handler, the code would be something like:

use Data;    # not "use Data.pm" -- use() takes the bare package name

my $hashref = Data::get_big_hashref();
…. code to access $hashref data with request parameters…..
<snip>
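The startup.pl the question refers to could be as small as the sketch
below. The PerlRequire line and the paths are assumptions about a typical
mod_perl 2 setup, not something from the original post:

# startup.pl -- pulled in once by the Apache parent via a line like
#   PerlRequire /path/to/startup.pl
# in httpd.conf (the path is an assumption; adjust to your install).
use strict;
use warnings;

use lib '/path/to/your/modules';    # assumption: wherever Data.pm lives
use Data ();    # builds the big hashes in the parent process

1;

Because the hashes are built before Apache forks its children, the children
share those memory pages copy-on-write. One caveat worth knowing: Perl's
reference counting writes to the pages it reads, so some of that sharing
erodes as the children touch the data.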
The idea is that the HTTP request/response will contain the relevant
input/output for each case… and the master client program will collect these
and concatenate the final output from all the requests.
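That master program could be little more than a loop that posts each case
to the handler and concatenates the response bodies. A minimal sketch using
HTTP::Tiny follows; the URL, parameter name, and case list are made up:

#!/usr/bin/env perl
use strict;
use warnings;

use HTTP::Tiny;

my $http = HTTP::Tiny->new;
my $url  = 'http://localhost/process';    # made-up handler URL

my @cases  = ( 1 .. 10 );                 # made-up case list
my $output = '';
for my $case (@cases) {
    my $res = $http->post_form( $url, { case => $case } );
    die "case $case failed: $res->{status} $res->{reason}"
        unless $res->{success};
    $output .= $res->{content};
}
print $output;

Note that a serial loop like this only demonstrates the plumbing; to get the
speedup, the master has to keep several requests in flight at once (for
example, by forking a handful of these clients), so that multiple Apache
children are busy simultaneously.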
So any issues/suggestions with this approach? I am facing a non-trivial task of
refactoring the existing code to work in this framework, so just wanted to get
some feedback before I invest more time into this...
I am planning on using mod_perl 2.07 on a linux machine.
Thanks in advance, Alan