Re: mod_perl for multi-process file processing?
Unless I’m entirely wrong, it appears that you want to use shared threaded memory. This would allow you to stay out of Apache altogether. Here is an example of using threads that I worked out using shared memory. We took a 4 hour serial task and turned it into 5 minutes with threads. This worked extremely well for me. I wasn’t using large hashes, but I was processing hundreds of files per thread, with 30k lines per file.

#!/usr/bin/env perl -w
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent   = 1;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Deepcopy = 1;
use threads;
use threads::shared;
use constant MAX_TRIES => 5;

sub sub_threads($$$);

my $switch = undef;
my $hash   = undef;
my $gsx    = undef;
my $cnt    = 5;
my %switches = (
  'A' => { 'b' => undef, 'c' => undef, 'd' => undef },
  'B' => { 'b' => undef, 'c' => undef, 'd' => undef },
  'C' => { 'b' => undef, 'c' => undef, 'd' => undef },
  'D' => { 'b' => undef, 'c' => undef, 'd' => undef },
  'E' => { 'b' => undef, 'c' => undef, 'd' => undef },
);
my %threads : shared = ();

##
## create the threads
##
while (($switch, $hash) = each %switches) {
  unless (exists $threads{$switch}) {
    my %h : shared;
    $threads{$switch} = \%h;
  }
  while (($gsx, $_) = each %$hash) {
    unless (exists $threads{$switch}{$gsx}) {
      my %h : shared;
      $threads{$switch}{$gsx} = \%h;
    }
    unless (exists $threads{$switch}{$gsx}{'messages'}) {
      my @h : shared;
      $threads{$switch}{$gsx}{'messages'} = \@h;
    }
    $hash->{$gsx}{'thread'} = threads->create(\&sub_threads, \$switch, \$gsx, \$cnt);
    $hash->{$gsx}{'tries'}  = 1;
    $cnt += 5;
  }
}
#print Dumper \%threads;
#print Dumper \%switches;

##
## endless loop to run while threads are still running
##
$cnt = 1;
while ($cnt) {
  $cnt = 0;
  while (($switch, $hash) = each %switches) {
    while (($gsx, $_) = each %$hash) {
      if ($hash->{$gsx}{'thread'}->is_running()) {
        $cnt = 1;
        #print "$switch-$gsx is running\n";
      } else {
        #print "$switch-$gsx is NOT running\n";
        #print "Reason for failure :\n";
        #print '  ' . join("\n", @{$threads{$switch}{$gsx}{'messages'}}) . "\n";
        if ($hash->{$gsx}{'tries'} < MAX_TRIES) {
          #print "max tries not reached for $switch-$gsx, will be trying again!\n";
          $hash->{$gsx}{'tries'}++;
          $hash->{$gsx}{'thread'} = threads->create(\&sub_threads, \$switch, \$gsx, \$cnt);
        } else {
          print "send email! $switch-$gsx\n";
        }
      }
    }
    sleep 2;
  }
}
#print Dumper \%threads;
#print Dumper \%switches;

sub sub_threads($$$) {
  my $ptr_switch = shift;
  my $ptr_gsx    = shift;
  my $ptr_tNum   = shift;
  sleep $$ptr_tNum;
  {
    lock(%threads);
    push @{$threads{$$ptr_switch}{$$ptr_gsx}{'messages'}}, "Leaving thread $$ptr_switch-$$ptr_gsx";
    #$threads{$$ptr_switch}{$$ptr_gsx}{'messages'} = "Leaving thread $$ptr_switch-$$ptr_gsx";
    # lock freed at end of scope
  }
  return 0;
}

On Feb 2, 2015, at 10:11 PM, Alan Raetz alanra...@gmail.com wrote:

So I have a perl application that upon startup loads about ten perl hashes (some of them complex) from files. This takes up a few GB of memory and about 5 minutes. It then iterates through some cases and reads from (never writes) these perl hashes. To process all our cases, it takes about 3 hours (millions of cases). We would like to speed up this process. I am thinking this is an ideal application of mod_perl because it would allow multiple processes but share memory. The scheme would be to load the hashes on apache startup and have a master program send requests with each case and apache children will use the shared hashes. I just want to verify some of the details about variable sharing.
Would the following setup work (oversimplified, but you get the idea…): In a file Data.pm, which I would use() in my Apache startup.pl, I would load the perl hashes and have hash references that would be retrieved with class methods:

package Data;
my %big_hash;
open(FILE, "file.txt");
while (<FILE>) {
    … code ….
    $big_hash{ $key } = $value;
}
sub get_big_hashref { return \%big_hash; }

snip

And so in the apache request handler, the code would be something like:

use Data;
my $hashref = Data::get_big_hashref();
…. code to access $hashref data with request parameters…..

snip

The idea is the HTTP request/response will contain the relevant input/output for each case… and the master client program will collect these and concatenate the final output from all the requests. So any issues/suggestions with this approach? I am facing a non-trivial task of refactoring the existing code to work in this framework, so just wanted to get some feedback before I invest more time into this... I am planning on using mod_perl 2.07 on a linux machine. Thanks in advance, Alan
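For the handler side of that scheme, a minimal mod_perl 2 sketch might look like the following; the package name, the "case" query parameter, and the single-lookup body are assumptions for illustration, not code from this thread (a real handler would run the per-case processing instead of one hash lookup):

package MyApp::Handler;   # hypothetical handler package, mapped in httpd.conf
use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Const -compile => qw(OK);
use Data ();              # already loaded by startup.pl, so %big_hash is built once in the parent

sub handler {
    my $r = shift;
    # crude query-string parsing; "case" is an assumed parameter name
    my %params  = map { split /=/, $_, 2 } split /[&;]/, ($r->args // '');
    my $hashref = Data::get_big_hashref();
    my $result  = $hashref->{ $params{case} } // 'NOT FOUND';
    $r->content_type('text/plain');
    $r->print("$result\n");
    return Apache2::Const::OK;
}
1;

The point is simply that Data.pm is compiled once in the parent by startup.pl, so with a prefork MPM every child reads the same copy-on-write pages of %big_hash rather than loading its own copy.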
Re: mod_perl for multi-process file processing?
Cache::FastMmap is a great module for sharing read/write data, but it can't compete with the speed of loading it all into memory before forking, as Alan said he plans to do.

- Perrin

On Tue, Feb 3, 2015 at 2:05 AM, Cosimo Streppone cos...@streppone.it wrote:

Alan Raetz wrote:

So I have a perl application that upon startup loads about ten perl hashes (some of them complex) from files. This takes up a few GB of memory and about 5 minutes. It then iterates through some cases and reads from (never writes) these perl hashes. To process all our cases, it takes about 3 hours (millions of cases). We would like to speed up this process. I am thinking this is an ideal application of mod_perl because it would allow multiple processes but share memory.

Sure you could use modperl for this. I would also consider at least these alternatives:

- use Cache::FastMmap, https://metacpan.org/pod/Cache::FastMmap
  Load up your data with a loader script, and forget about it. Cache::FastMmap also works with modperl.

- use a network server, like memcached or redis, to store your read-only data, and use a lightweight network protocol (on localhost) to get the data.

In both cases, reading from multiple processes will not be a problem. The cheapest solution for the consumer part (the "cases" above) would be to use a command like parallel to fire up as many copies of your consumer script as you can afford.

Hope this helps,

-- Cosimo
Re: mod_perl for multi-process file processing?
I agree, either threads or Parallel::ForkManager, depending on your platform and your perl, will be a lot faster than mod_perl for this. Of course there might be other reasons to use mod_perl, e.g. it's useful to have this available as a remote service, or you want to call this frequently for small jobs throughout the day without needing to reload the data.

- Perrin

On Tue, Feb 3, 2015 at 7:42 AM, Patton, Billy N billy.pat...@h3net.com wrote:

Unless I’m entirely wrong, it appears that you want to use shared threaded memory. This would allow you to stay out of Apache altogether. Here is an example of using threads that I worked out using shared memory. We took a 4 hour serial task and turned it into 5 minutes with threads. This worked extremely well for me. I wasn’t using large hashes, but I was processing hundreds of files per thread, with 30k lines per file.

snip

On Feb 2, 2015, at 10:11 PM, Alan Raetz alanra...@gmail.com wrote:

So I have a perl application that upon startup loads about ten perl hashes (some of them complex) from files. This takes up a few GB of memory and about 5 minutes. It then iterates through some cases and reads from (never writes) these perl hashes. To process all our cases, it takes about 3 hours (millions of cases). We would like to speed up this process. I am thinking this is an ideal application of mod_perl because it would allow multiple processes but share memory. The scheme would be to load the hashes on apache startup and have a master program send requests with each case and apache children will use the shared hashes. I just want to verify some of the details about variable sharing. Would the following setup work (oversimplified, but you get the idea…): In a file Data.pm, which I would use() in my Apache startup.pl, I would load the perl hashes and have hash references that would be retrieved with class methods:

package Data;
my %big_hash;
open(FILE, "file.txt");
while (<FILE>) {
    … code ….
    $big_hash{ $key } = $value;
}
sub get_big_hashref { return \%big_hash; }

snip

And so in the apache request handler, the code would be something like:
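A minimal sketch of the preload-then-fork pattern Perrin mentions, using Parallel::ForkManager from CPAN; the worker count, batch size, and the load_hashes()/process_case() stubs are placeholders, not code from the thread:

#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

my %big_hash = load_hashes();            # the slow 5-minute load happens once, before forking
my @cases    = 1 .. 1_000_000;           # placeholder for the real case list

my $pm = Parallel::ForkManager->new(8);  # e.g. one worker per core
for my $chunk (chunk_list(\@cases, 10_000)) {
    $pm->start and next;                 # parent keeps looping; child falls through
    process_case($_, \%big_hash) for @$chunk;   # children read the hash via copy-on-write
    $pm->finish;
}
$pm->wait_all_children;

sub chunk_list {                         # batch the cases so fork overhead stays small
    my ($list, $size) = @_;
    my @chunks;
    push @chunks, [ splice @$list, 0, $size ] while @$list;
    return @chunks;
}
sub load_hashes  { return (case => 'data') }    # stub for the real file parsing
sub process_case { my ($case, $h) = @_; }       # stub for the real per-case work

On Linux the forked workers share the already-loaded hash pages copy-on-write, so as long as nothing writes to them, memory use stays close to a single copy.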
Re: mod_perl for multi-process file processing?
You will find that when you share the memory the hashes are not copied to each thread. The docs are a little misleading.

On Feb 3, 2015, at 11:54 AM, Alan Raetz alanra...@gmail.com wrote:

Thanks for all the input. I was considering threads, but according to how I read perlthrtut (http://search.cpan.org/~rjbs/perl-5.18.4/pod/perlthrtut.pod#Threads_And_Data), quote: "When a new Perl thread is created, all the data associated with the current thread is copied to the new thread, and is subsequently private to that new thread." So in my application, each thread would get the entire memory footprint copied. So although the data is shared in terms of application usage, in terms of physical memory limitations, I would quickly use up the machine memory. If you're reading/writing files, this may not be a significant difference, but with the way this app is structured now, I think it may be. One of my thoughts as far as reducing the impact of request overhead was to bundle requests, such that a single request is getting multiple tasks. I will look into some of these suggestions more, thanks again.
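A toy illustration of the point, with invented data: anything placed under threads::shared lives in one shared store rather than being cloned into every new thread.

use strict;
use warnings;
use threads;
use threads::shared;

# shared_clone() deep-copies a structure into shared memory exactly once
my $big : shared = shared_clone({ case1 => 'result1', case2 => 'result2' });

my @workers = map {
    threads->create(sub {
        my $id = shift;
        return "thread $id read " . $big->{case1};   # every thread reads the same shared copy
    }, $_);
} 1 .. 4;

print $_->join(), "\n" for @workers;

The ordinary, non-shared data in the parent interpreter is still copied per thread, which is what the perlthrtut passage quoted above describes; only what goes through threads::shared escapes that copy.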
Re: mod_perl for multi-process file processing?
Thanks for all the input. I was considering threads, but according to how I read perlthrtut (http://search.cpan.org/~rjbs/perl-5.18.4/pod/perlthrtut.pod#Threads_And_Data), quote: "When a new Perl thread is created, all the data associated with the current thread is copied to the new thread, and is subsequently private to that new thread." So in my application, each thread would get the entire memory footprint copied. So although the data is shared in terms of application usage, in terms of physical memory limitations, I would quickly use up the machine memory. If you're reading/writing files, this may not be a significant difference, but with the way this app is structured now, I think it may be. One of my thoughts as far as reducing the impact of request overhead was to bundle requests, such that a single request is getting multiple tasks. I will look into some of these suggestions more, thanks again.
Re: mod_perl for multi-process file processing?
Alan/Alexandr,

There will always be an overhead with using a webserver to do this - even using mod_perl.

Assumptions:
* from what you are saying, there is no actual website involved, but you want to use mod_perl to cache data for an offline process;
* one set of data is used once and once only for a run?

Pros:
* Make sure you use your module in startup so that each child uses the same memory rather than generating its own copy of the data;
* If you use something like curl multi as the fetcher, you can write a simple parallel fetching queue to get the data - great if you have a multi-core box.

Cons:
* There is an overhead to using an HTTP webserver - if you aren't going to gain much from the parallelization of processes above, you may find that writing a simple script which loops over all the data would be more efficient...
* In your case we are probably looking at about 10ms (or less) per case - the apache/http round tripping will probably take much more time than the actual processing...

On 03/02/2015 05:02, Alexandr Evstigneev wrote:

Pre-loading is good, but what you need, I believe, is the Storable module. If your files contain parsed data (hashes), just store them serialized. If they contain raw data that needs to be parsed, you may pre-parse it, serialize it and store it as binary files. Storable is written in C and works very fast.

2015-02-03 7:11 GMT+03:00 Alan Raetz alanra...@gmail.com:

So I have a perl application that upon startup loads about ten perl hashes (some of them complex) from files. This takes up a few GB of memory and about 5 minutes. It then iterates through some cases and reads from (never writes) these perl hashes. To process all our cases, it takes about 3 hours (millions of cases). We would like to speed up this process. I am thinking this is an ideal application of mod_perl because it would allow multiple processes but share memory. The scheme would be to load the hashes on apache startup and have a master program send requests with each case and apache children will use the shared hashes. I just want to verify some of the details about variable sharing. Would the following setup work (oversimplified, but you get the idea…): In a file Data.pm, which I would use() in my Apache startup.pl, I would load the perl hashes and have hash references that would be retrieved with class methods:

package Data;
my %big_hash;
open(FILE, "file.txt");
while (<FILE>) {
    … code ….
    $big_hash{ $key } = $value;
}
sub get_big_hashref { return \%big_hash; }

snip

And so in the apache request handler, the code would be something like:

use Data;
my $hashref = Data::get_big_hashref();
…. code to access $hashref data with request parameters…..

snip

The idea is the HTTP request/response will contain the relevant input/output for each case… and the master client program will collect these and concatenate the final output from all the requests. So any issues/suggestions with this approach? I am facing a non-trivial task of refactoring the existing code to work in this framework, so just wanted to get some feedback before I invest more time into this... I am planning on using mod_perl 2.07 on a linux machine. Thanks in advance, Alan
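The "simple parallel fetching queue" idea above could be approximated in pure Perl with HTTP::Async standing in for curl multi; the URL, the "case" parameter, and the slot count below are invented for illustration:

use strict;
use warnings;
use HTTP::Async;
use HTTP::Request;

my $async = HTTP::Async->new(slots => 20);     # up to 20 requests in flight at once
$async->add(HTTP::Request->new(GET => "http://localhost/process?case=$_"))
    for 1 .. 100;                              # queue one request per case (or per batch of cases)

while (my $response = $async->wait_for_next_response) {
    print $response->decoded_content;          # collect/concatenate the per-case output
}

Batching several cases into each request, as Alan suggests later in the thread, would cut the per-request HTTP overhead this message warns about.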
Re: mod_perl for multi-process file processing?
Pre-loading is good, but what you need, I believe, is the Storable module. If your files contain parsed data (hashes), just store them serialized. If they contain raw data that needs to be parsed, you may pre-parse it, serialize it and store it as binary files. Storable is written in C and works very fast.

2015-02-03 7:11 GMT+03:00 Alan Raetz alanra...@gmail.com:

So I have a perl application that upon startup loads about ten perl hashes (some of them complex) from files. This takes up a few GB of memory and about 5 minutes. It then iterates through some cases and reads from (never writes) these perl hashes. To process all our cases, it takes about 3 hours (millions of cases). We would like to speed up this process. I am thinking this is an ideal application of mod_perl because it would allow multiple processes but share memory. The scheme would be to load the hashes on apache startup and have a master program send requests with each case and apache children will use the shared hashes. I just want to verify some of the details about variable sharing. Would the following setup work (oversimplified, but you get the idea…): In a file Data.pm, which I would use() in my Apache startup.pl, I would load the perl hashes and have hash references that would be retrieved with class methods:

package Data;
my %big_hash;
open(FILE, "file.txt");
while (<FILE>) {
    … code ….
    $big_hash{ $key } = $value;
}
sub get_big_hashref { return \%big_hash; }

snip

And so in the apache request handler, the code would be something like:

use Data;
my $hashref = Data::get_big_hashref();
…. code to access $hashref data with request parameters…..

snip

The idea is the HTTP request/response will contain the relevant input/output for each case… and the master client program will collect these and concatenate the final output from all the requests. So any issues/suggestions with this approach? I am facing a non-trivial task of refactoring the existing code to work in this framework, so just wanted to get some feedback before I invest more time into this... I am planning on using mod_perl 2.07 on a linux machine. Thanks in advance, Alan
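A minimal sketch of the Storable approach; the file names and the parse stub are made up. A one-off loader does the slow parse and freezes each hash, and every later run just thaws the binary file:

use strict;
use warnings;
use Storable qw(nstore retrieve);

# one-off loader: do the expensive parse once, then freeze the result
my %big_hash = parse_the_text_file('file.txt');   # stand-in for the real 5-minute parse
nstore \%big_hash, 'big_hash.stor';

# every consumer afterwards: load in seconds instead of minutes
my $hashref = retrieve('big_hash.stor');
print $hashref->{some_key} // 'no such key', "\n";

sub parse_the_text_file { return (some_key => 'some_value') }   # placeholder

This shortens startup but each process still holds its own copy of the thawed hash, so it does not by itself solve the memory-sharing problem Alan raises.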
Re: mod_perl for multi-process file processing?
Alan Raetz wrote:

So I have a perl application that upon startup loads about ten perl hashes (some of them complex) from files. This takes up a few GB of memory and about 5 minutes. It then iterates through some cases and reads from (never writes) these perl hashes. To process all our cases, it takes about 3 hours (millions of cases). We would like to speed up this process. I am thinking this is an ideal application of mod_perl because it would allow multiple processes but share memory.

Sure you could use modperl for this. I would also consider at least these alternatives:

- use Cache::FastMmap, https://metacpan.org/pod/Cache::FastMmap
  Load up your data with a loader script, and forget about it. Cache::FastMmap also works with modperl.

- use a network server, like memcached or redis, to store your read-only data, and use a lightweight network protocol (on localhost) to get the data.

In both cases, reading from multiple processes will not be a problem. The cheapest solution for the consumer part (the "cases" above) would be to use a command like parallel to fire up as many copies of your consumer script as you can afford.

Hope this helps,

-- Cosimo
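A rough sketch of the Cache::FastMmap route; the share file path, sizes and keys are assumptions. A loader script fills the mmap'ed file once, and any number of consumer processes can then read it without re-parsing anything:

use strict;
use warnings;
use Cache::FastMmap;

# loader script: create the shared file and populate it once
my $cache = Cache::FastMmap->new(
    share_file     => '/tmp/big_hash.cache',   # assumed path
    init_file      => 1,                       # wipe/initialise the file on creation
    unlink_on_exit => 0,                       # keep the file around after the loader exits
    cache_size     => '512m',                  # must be large enough to avoid evictions
    expire_time    => 0,                       # read-only data, never expire
);
my %big_hash = (case1 => 'result1', case2 => 'result2');   # stand-in for the parsed data
$cache->set($_, $big_hash{$_}) for keys %big_hash;

# consumer processes open the same share_file with init_file => 0
# and simply call $cache->get('case1'); values are (de)serialized via Storable.

As Perrin notes elsewhere in the thread, each get() still pays a deserialization cost, so this will not match the raw speed of a hash already sitting in copy-on-write memory.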