Re: mod_perl for multi-process file processing?
Alan/Alexandr,

There will always be an overhead to using a webserver for this - even using mod_perl.

Assumptions:
* From what you are saying, there is no actual website involved; you want to use mod_perl to cache data for an offline process.
* One set of data is used once and once only per run?

Pros:
* Make sure you load your module in startup so that each child process uses the same memory rather than generating its own copy of the data (a sketch of this follows at the end of this message).
* If you use something like curl multi as the fetcher, you can write a simple parallel fetching queue to get the data - great if you have a multi-core box (see the second sketch at the end of this message).

Cons:
* There is an overhead to driving this through an HTTP webserver - if you aren't going to gain much from the parallelization above, you may find that writing a simple script which loops over all the data is more efficient...
* In your case the per-case processing is probably about 10ms (or less), so the apache/HTTP round-tripping will probably take much more time than the actual processing...

On 03/02/2015 05:02, Alexandr Evstigneev wrote:

snip
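Something along these lines for the startup preload mentioned above - untested, the paths are made up, and it assumes your Data module populates its hashes at load time:

    # startup.pl -- loaded by Apache via: PerlRequire /path/to/startup.pl
    use strict;
    use warnings;

    use lib '/path/to/lib';   # assumed location of Data.pm
    use Data ();              # builds the big hashes once, in the parent;
                              # forked children then share those pages
                              # copy-on-write instead of re-loading them
    1;

And a minimal sketch of the parallel fetching queue - here using Parallel::ForkManager and HTTP::Tiny instead of curl multi, just to keep it short; the URL, case list, and worker count are all made up:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Parallel::ForkManager;
    use HTTP::Tiny;

    my @cases = (1 .. 1000);                    # whatever identifies a case
    my $pm    = Parallel::ForkManager->new(8);  # e.g. one worker per core

    for my $case (@cases) {
        $pm->start and next;   # parent keeps looping; child does the fetch
        my $res = HTTP::Tiny->new->get("http://localhost/process?case=$case");
        if ( $res->{success} ) {
            print $res->{content};   # children's output may interleave;
                                     # write per-case files if that matters
        }
        else {
            warn "case $case failed: $res->{status} $res->{reason}\n";
        }
        $pm->finish;               # child exits here
    }
    $pm->wait_all_children;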
mod_perl for multi-process file processing?
So I have a perl application that upon startup loads about ten perl hashes (some of them complex) from files. This takes up a few GB of memory and about 5 minutes. It then iterates through some cases and reads from (never writes to) these perl hashes. Processing all our cases (millions of them) takes about 3 hours, and we would like to speed this up.

I am thinking this is an ideal application for mod_perl, because it would allow multiple processes that share memory. The scheme would be to load the hashes on apache startup and have a master program send a request for each case; the apache children would use the shared hashes.

I just want to verify some of the details about variable sharing. Would the following setup work (oversimplified, but you get the idea)? In a file Data.pm, which I would use() in my Apache startup.pl, I would load the perl hashes and provide hash references retrieved with class methods:

    package Data;

    my %big_hash;

    open( my $fh, '<', 'file.txt' ) or die "cannot open file.txt: $!";
    while ( my $line = <$fh> ) {
        # ... code that extracts $key and $value from $line ...
        $big_hash{$key} = $value;
    }
    close $fh;

    sub get_big_hashref { return \%big_hash; }

    snip

And so in the apache request handler, the code would be something like:

    use Data ();

    my $hashref = Data::get_big_hashref();
    # ... code to access $hashref data with request parameters ...

    snip

The idea is that the HTTP request/response will contain the relevant input/output for each case, and the master client program will collect these and concatenate the final output from all the requests.

So, any issues/suggestions with this approach? I am facing a non-trivial task of refactoring the existing code to work in this framework, so I just wanted to get some feedback before I invest more time into this... I am planning on using mod_perl 2.0.7 on a linux machine.

Thanks in advance, Alan
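To be more concrete, the handler I have in mind would look roughly like this - the module name, query-string parsing, and lookup logic are all placeholders, since I haven't written it yet:

    package MyApp::Handler;

    use strict;
    use warnings;

    use Apache2::RequestRec ();
    use Apache2::RequestIO  ();
    use Apache2::Const -compile => qw(OK);

    use Data ();   # already loaded from startup.pl, so this costs nothing

    sub handler {
        my $r = shift;

        # crude query-string parsing, e.g. GET /process?case=12345
        my %args = map { split /=/, $_, 2 } split /&/, ( $r->args // '' );

        my $hashref = Data::get_big_hashref();
        my $result  = $hashref->{ $args{case} // '' } // 'no such case';

        $r->content_type('text/plain');
        $r->print($result);
        return Apache2::Const::OK;
    }

    1;

It would be wired up in httpd.conf with SetHandler perl-script and PerlResponseHandler MyApp::Handler.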
Re: mod_perl for multi-process file processing?
Pre-loading is good, but what you need, I believe, is the Storable module. If your files contain parsed data (hashes), just store them serialized. If they contain raw data that needs to be parsed, you may pre-parse it, serialize it, and store it as binary files. Storable is written in C and works very fast.

2015-02-03 7:11 GMT+03:00 Alan Raetz alanra...@gmail.com:

snip
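To illustrate the Storable approach above - a toy hash stands in for the real parsed data, and the file name is made up:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Storable qw(nstore retrieve);

    # One-off pre-parse step: build the hash once, then freeze it to disk.
    my %big_hash = ( key1 => 'value1', key2 => 'value2' );
    nstore( \%big_hash, 'big_hash.stor' );   # nstore = portable byte order

    # Every later run thaws the binary image instead of re-parsing the text:
    my $hashref = retrieve('big_hash.stor');
    print $hashref->{key1}, "\n";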
Re: mod_perl for multi-process file processing?
Alan Raetz wrote:

snip I am thinking this is an ideal application of mod_perl because it would allow multiple processes but share memory.

Sure, you could use mod_perl for this. I would also consider at least these alternatives:

- use Cache::FastMmap, https://metacpan.org/pod/Cache::FastMmap - load up your data with a loader script, and forget about it. Cache::FastMmap also works with mod_perl (a small sketch follows below);

- use a network server, like memcached or redis, to store your read-only data, and use a lightweight network protocol (on localhost) to get the data.

In both cases, reading from multiple processes will not be a problem. The cheapest solution for the consumer part (iterating over the cases above) would be to use a command like parallel to fire up as many copies of your consumer script as you can afford.

Hope this helps,

-- Cosimo
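The Cache::FastMmap route might look roughly like this - the share_file path, cache size, and keys are all made up, and cache_size must be large enough to hold the whole data set, since Cache::FastMmap is a cache and may otherwise expunge entries:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Cache::FastMmap;

    # Loader script: run once to populate the shared, mmap'ed file.
    my $cache = Cache::FastMmap->new(
        share_file => '/tmp/big_hash.mmap',   # any path all readers can see
        cache_size => '1024m',                # must fit the whole data set
        init_file  => 1,                      # create/clear the file now
    );
    $cache->set( key1 => 'value1' );          # in reality: loop over the data

    # Consumer scripts: any number of processes attach and read in parallel.
    my $reader = Cache::FastMmap->new(
        share_file => '/tmp/big_hash.mmap',
        cache_size => '1024m',
    );
    print $reader->get('key1'), "\n";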