Thanks for that! I use Perl on both Windows and Linux boxes. I played around with fork() a bit, but did not understand its advantages until you mentioned them. Truly, the drag on resources from having to spawn a bunch of processes was killing any PC I put this on. I did write my own crude limiter - when x processes have spawned, pause for y seconds - which works to limit the resource hogging, but slows down the entire process too much. I will have to look into forking on the Linux box later this week and see if I can get it to work the way I know it should - after all, I am just retrieving a webpage from each site and checking it for consistency, nothing I would equate with needing heavy power.
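
Here is a minimal sketch of the kind of capped fork() pool I have in mind to replace that pause-for-y-seconds limiter - I am assuming LWP::UserAgent for the fetch, and the URL list, the cap of 10 and the pass/fail test are placeholders rather than my real code:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my @urls         = qw( http://example.com/page1 http://example.com/page2 );  # placeholder list
my $max_children = 10;   # cap on concurrent page checks
my $running      = 0;

for my $url (@urls) {
    # At the cap? Block until one child exits before forking another.
    if ( $running >= $max_children ) {
        wait();
        $running--;
    }

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {
        # Child: execution continues right here, not at the top of the script.
        my $ua  = LWP::UserAgent->new( timeout => 30 );
        my $res = $ua->get($url);
        exit( $res->is_success ? 0 : 1 );   # report the outcome via the exit status
    }

    # Parent: fork() returned the child's pid; keep looping.
    $running++;
}

# Reap whatever is still running.
1 while wait() != -1;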
Do you know of any docs where I can check out the threading you suggest? That would be really cool, but how much harder is it to thread a program than to fork it? As I currently understand forking, it just clones the current process, so I have to check whether the process is the parent or a clone and then execute the appropriate code. Hmmm. When I fork, does the child start from the beginning of the script again, or does it pick up from the exact same place in the script where the parent spawned it (but of course in the child process)? I have had a brief look at POE as you suggested, but somewhere with tutorials on threading in Perl would be great.

The XML file comes in for IPC - I could have used anything, but XML seemed to be the way to do it. I don't know how to do *actual* IPC in Perl, so I used to have the parent read this file, spawn the children, and then wait, polling the same XML file. When the children finish their task they write flags to that XML file to say they are done, and the outcome is placed in another XML file. Once all the children are done, the parent detects this via the XML file and compiles all the children's XML info back into that same file. But I kept getting deadlocks and a lot of disk thrashing.

So my most recent attempt is for the *master* script to write flag files to a directory, one per webpage that should be checked. Another script (which I have placed on 3 PCs to split up the spawning work - a VERY primitive attempt at clustering, I know) polls this directory and, if any files are found, spawns children. The children erase their flag file when they are done. The master also polls this directory, and when it is empty it does all the collation as before. But this also gives mucho disk thrashing.

So, my last thought before getting your reply was to use a couple of DB tables to replace the XML files. Then I don't have to worry about locking and (hopefully) thrashing. I initially used XML so that I could read it with different programs and extract statistics - uptime, bandwidth used, etc. But if I have to use a DB I will. Your reply gives new life to the script, though - maybe if I thread or fork() it, I can get the resources under control! There is no fork() for Windows, though - is there something else I could use that you know of?

PS Thanks for helping me get deeper into Perl - I was stagnating for a while, but this fork() and thread stuff is quite interesting (and powerful, I hope!).

"Wiggins D'Anconia" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Lance wrote:
> > Hi,
> >
> > I am writing a Perl script (on a Pentium1 RedHat box) to monitor some
> > websites on some webservers. There are about 20 servers with 5 sites each.
> > I have been playing with running various parts of the script in parallel, to
> > try to get a performance boost, but am at a quandary. If I run the check on
> > each one sequentially, the entire script takes about 4 minutes to execute
> > (the fastest I have been able to achieve). However, if there is a problem
> > with one of the pages, the user-agent will wait (I have it timed to wait a
> > max of 30 secs). If many pages have difficulty, this will extend the
> > execution time of the script dramatically.
> >
> > The data structure is arranged like so:
> >
> > Server1 -> Store1 -> Page1 -> page-specific info
> >                   -> Page2
> >         -> Store2 -> Page4
> >         -> Store3 -> Page8
> > Server2 -> Store9 -> Page3
> >         -> Store5 -> Page1
> > etc..
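
(To make that layout concrete, this is roughly how I picture the hash of hashes in Perl - the key names and the page-specific fields below are only illustrative, not the real data:)

my %servers = (
    Server1 => {
        Store1 => {
            Page1 => { url => 'http://server1/store1/page1', timeout => 30 },
            Page2 => { url => 'http://server1/store1/page2', timeout => 30 },
        },
        Store2 => { Page4 => { url => 'http://server1/store2/page4', timeout => 30 } },
        Store3 => { Page8 => { url => 'http://server1/store3/page8', timeout => 30 } },
    },
    Server2 => {
        Store9 => { Page3 => { url => 'http://server2/store9/page3', timeout => 30 } },
        Store5 => { Page1 => { url => 'http://server2/store5/page1', timeout => 30 } },
    },
);

# Walking it down to the page level, where the check gets kicked off:
for my $server ( keys %servers ) {
    for my $store ( keys %{ $servers{$server} } ) {
        for my $page ( keys %{ $servers{$server}{$store} } ) {
            my $info = $servers{$server}{$store}{$page};
            # check_page( $info->{url}, $info->{timeout} );   # hypothetical check routine
        }
    }
}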
> > Please ignore the numbers - Page1 in Store1 of Server1 is NOT the same page
> > as Page1 in Store5 of Server2.
> >
> > The script iterates through the hash of hashes of hashes of hashes. Once it
> > gets to the page level, it runs another script to validate the webpage. It
> > is this validation, and the work of invoking another Perl instance, that takes
> > the most time. If I skip the check, the script will execute in its
> > entirety in about 15 seconds!
> >
> > I used to have the webpage-validating external script as an internal
> > subroutine, but having to wait for the sub to finish defeats the ability to
> > run the page checks in parallel. However.... when I do execute the checks
> > in parallel at the "Page" level (ie one new Perl instance is created for
> > each page to be checked) the initial script (imaginatively called
> > 'store_monitor.pm') executes and completes in about 20 seconds. But then
> > there are about 80 Perl instances in the process list - all of them
> > executing their version of the webpage validation script, 'page_check.pl'.
> > These 80 Perl instances take up a lot of resources! whew! The total
> > execution time goes from about 4 minutes to over 40! NOT exactly the speed
> > increase I was looking for.
> >
> > So I have come to the conclusion that each Perl instance requires overhead
> > to use. I *knew* this of course, I just did not expect it to create such a
> > logjam. The script executes fastest with just one Perl instance - a single
> > instance does use >90% CPU for the entire time; extras just seem to split
> > the 100% among them.
> >
> > Is there a way to get the initial Perl instance to run in parallel? The big
> > thing is to not have to wait for pages that don't respond quickly.
> >
> > I could take two runs at each page: a test run that only waits 10 seconds
> > for a valid page, and if that does not work, spawn another process that uses
> > the full waiting time. But this does run the risk of submarining the
> > 'parent' process if too many pages aren't working...
> >
> > fyi - I am currently executing page_check.pl as a Linux background process
> > (when I was testing the parallel run). I tried using fork() but it seemed
> > to have the extra overhead of keeping the values of the 'parent' process, when I
> > just wanted to run another script. So I use system( "perl page_check.pl
> > xmlfile.xml \&" ) to spawn a background process, or the same call without the
> > '\&' to wait for the script to return.
> >
> > fyi2 - I use an XML file for interprocess communication.
>
> Didn't see any other posts on the subject, and I know it has been a
> while, but I thought I would chime in.
>
> Your first problem seems to be how to get things to run in parallel,
> for which you found a number of possibilities and appear to have explored
> several of them. Naturally the issue is that one piece of code blocks
> another until it is finished, so how do you get that other code executed
> without blocking? There is the use of system/backticks to shell out
> to a separate perl process; as you found out, this has major overhead -
> first starting up the shell, then firing up the perl interpreter, compiling the
> program, and finally running it. That is where one would move to a fork/exec
> model, which you seem to have accomplished, and which is where you ran into
> the "extra overhead" of the main script. The good news is that this really
> only leads to memory usage and namespace pollution, but should be ultra
> fast.
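
(A minimal sketch of that fork/exec pattern, reusing the page_check.pl call from above - the error handling here is just illustrative:)

# Fork/exec instead of system( "perl page_check.pl xmlfile.xml \&" ):
my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child: replace the cloned process image with a fresh perl running page_check.pl.
    exec( 'perl', 'page_check.pl', 'xmlfile.xml' )
        or die "exec failed: $!";
}

# Parent: carries on immediately; reap the child later with wait()/waitpid()
# so it does not linger as a zombie.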
> On most (if not all) Unix systems forking a process is much faster
> than shelling out, because the original process is cloned and then
> executed rather than going through all of the steps mentioned earlier.
> The other nice thing is that all of it should get cleaned up when the
> fork exits. Naturally the next progression, which you didn't mention,
> would be to use a threading model, thereby eliminating the overhead of
> shelling out or forking and its memory usage, and leaving you with an
> elegant, non-blocking, reduced-IPC model. The problem is that Perl's
> threading model has always been (and to the best of my knowledge still
> is) a little iffy.
>
> The second problem you referred to, of having the 80 processes dividing
> up the time, is commonly known as "thrashing". Essentially the OS is
> taking one process, putting it in the CPU, performing some computations,
> then swapping it out for another, allowing for time sharing and
> the appearance of multi-tasking. The problem becomes the overhead of
> swapping the processes in and out of the CPU, memory, etc., so in a
> sense the OS spends more time swapping the processes than executing
> them. One way to cut down on this is to queue the processes and do
> your own management of them: for instance, keep 5-10 processes
> always running in the system so it has fewer to swap between, and when
> one finishes, load a new one into its place. That way you maximize
> the time each process spends in the CPU and minimize the context
> switching that must occur. The only real way to effectively decrease
> the total time required is to add CPUs, allowing multiple processes to
> run "simultaneously", but reducing the overhead and context switching
> should help the performance.
>
> I am not sure I understand how you are using XML for IPC, but that could
> be another issue impeding performance. XML is very good for lots of
> things; however, parsing it is very slow compared to other approaches. It
> is meant for cross-platform, cross-network standardization, etc., not
> for speed.
>
> Having said all of that, and given enough time (read: as much as I
> wanted/needed), I would probably create a controller with two
> queues, one for fetching pages and the other for doing their validation.
> Load the fetch list into the first queue, then keep a certain number of
> requests going; when one succeeds, take it off the first queue and throw
> it onto the second, which in turn maintains a certain number of
> running page validations. Both would be designed to maximize their
> processing time while reducing context switching. And since I have it on
> the brain, and it makes this type of thing so trivial (once you are over
> the learning curve), I would suggest POE (http://poe.perl.org) for
> writing the queues and executing the two main parts that are not allowed
> to block inside of a Wheel::Run.
>
> http://danconia.org
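
In case I go the threading route instead of POE, here is a rough sketch of that fixed-size-pool-plus-queue idea using Perl's threads and Thread::Queue (those two modules and LWP::UserAgent are real; the worker count, the URL list and the pass/fail test are made up):

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;

my @urls    = ( 'http://example.com/a', 'http://example.com/b' );  # made-up list
my $workers = 5;                                                   # small fixed pool, as suggested

my $queue = Thread::Queue->new();
$queue->enqueue(@urls);
$queue->enqueue( (undef) x $workers );   # one undef per worker as an end-of-work marker

my @threads = map {
    threads->create(sub {
        my $ua = LWP::UserAgent->new( timeout => 30 );
        while ( defined( my $url = $queue->dequeue() ) ) {
            my $res = $ua->get($url);
            # A real check would validate the page content here, not just the HTTP status.
            print "$url: ", ( $res->is_success ? 'ok' : $res->status_line ), "\n";
        }
    });
} 1 .. $workers;

$_->join() for @threads;

If nothing else this should sidestep the Windows question, since ithreads are available there even where a real fork() is not.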