Tom,

Thanks for taking a look at this.

Is your machine hyperthreaded?  If so, have you tried testing it with
-nodepara 4?

I'll take a look at this some more and get back to you.

- Chris
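
P.S. For anyone following along, here is a minimal sketch of the kind of polling wait loop discussed in the quoted messages below: the parent checks its children without blocking, sleeps between checks, and kills any child that exceeds a time limit. This is not the paratest.server code. It is written in Python rather than in the script's own language, and run_workers, timeout_s and poll_interval_s are hypothetical names chosen for illustration.

```python
import subprocess
import sys
import time


def run_workers(commands, timeout_s, poll_interval_s=0.1):
    """Start one child per command and poll until all exit or time out.

    Returns a dict mapping worker index to its exit code, or to None if
    the worker was killed for running longer than timeout_s.
    """
    start = time.monotonic()
    running = {i: subprocess.Popen(cmd) for i, cmd in enumerate(commands)}
    results = {}
    while running:
        for i, proc in list(running.items()):
            code = proc.poll()  # non-blocking check, unlike a blocking wait
            if code is not None:
                results[i] = code
                del running[i]
            elif time.monotonic() - start > timeout_s:
                proc.kill()
                proc.wait()  # reap the killed child so it does not linger
                results[i] = None
                del running[i]
        if running:
            # Sleeping gives up the CPU, so the children can make progress
            # even when there is one worker per available core.
            time.sleep(poll_interval_s)
    return results


if __name__ == "__main__":
    quick = [sys.executable, "-c", "pass"]
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_workers([quick, slow], timeout_s=2.0))
```

The point of polling instead of blocking is that the parent can keep re-evaluating the timeout on every pass. The cost is that the parent needs a fair share of CPU time alongside its workers.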


On Mon, Apr 7, 2014 at 2:36 PM, Tom Hildebrandt <[email protected]> wrote:

>  Hi Chris:
>
> I used -nodepara 2 to fully occupy the two CPUs on my system with a
> paratest.server run, and it terminated OK.  I did not set up the experiment
> so that a test timed out on one of the workers, nor did I investigate what
> happens when paratest.client hangs or crashes.  So my experiment just
> represents the "happy path".  However, it would appear that paratest.server
> is basically functioning correctly, even when the available CPUs are fully
> utilized.
>
> Another thing I did not check was whether paratest is trying to read the
> output log file before the worker has had a chance to open it.  Perhaps the
> server script assumes that the workers are reasonably responsive in this
> way.  If they are not, that might explain the grep:  Logs/ ... messages.  I
> have usually interpreted those messages to mean that the corresponding
> worker died.
>
> In a separate test, I just checked to see if the Perl "sleep" function does
> something lame like busy-waiting without yielding.  At least on my SLES
> system, it does not busy-wait.  So the assumption that paratest.server and
> the workers can make progress in parallel is upheld on this platform.
>
> I'm inclined to think that paratest.server is working as intended.  It
> would seem, rather, that paratest.client is not handling all error cases
> correctly.  I would look there for the fault.  (But since my mods were only
> on the server side, I have so far avoided doing so.)
>
> In order to make further progress on your problem, I think I would need to
> duplicate it.  If you wish, you can send me a patch, and I will give it a
> try on my system.
>
> THH
>  ------------------------------
> *From:* Tom Hildebrandt [[email protected]]
> *Sent:* Saturday, April 05, 2014 2:50 PM
> *To:* Chris Wailes
> *Cc:* [email protected]
>
> *Subject:* Re: [Chapel-developers] Paratest and TooManyThreads.chpl
>
>   Hi Chris:
>
>  I did draw attention to my change that removed the "wait;" statement
> from the loop in paratest.server that waits for all child processes to
> complete.  That, combined with your observation that you are unable to
> create more threads, points at the problem: there are not enough physical
> threads to go around; at least one of them is dying of starvation.  There
> ought to be more than enough threads to go around, so perhaps the problem
> also involves mismatched priorities.
>
>  As it stands, paratest.server expects there to be at least w+1 threads
> available (for w workers and paratest.server itself) and for scheduling
> among those threads to be reasonably fair.
>
>  I have assumed all along that the call to sleep() in that wait loop
> yields to waiting threads.  If not, then we might need to find a different
> way to pass the time between checking for updates.  I have not examined the
> paratest.client script to see if there are potential gotchas there.
>
>  I'll play with this a bit, and see if I can duplicate your problem on my
> workstation.
>
>  THH
>  ------------------------------
> *From:* Chris Wailes [[email protected]]
> *Sent:* Friday, April 04, 2014 3:43 PM
> *To:* Brad Chamberlain
> *Cc:* Tom Hildebrandt; Lydia Duncan;
> [email protected]
> *Subject:* Re: [Chapel-developers] Paratest and TooManyThreads.chpl
>
>   After bisecting the commit log I found that commit 22715 is responsible
> for this issue.  Oddly, before the script actually exits my system becomes
> unable to create new threads and grep says it is unable to find log files.
>
>  - Chris
>
>
> On Mon, Mar 31, 2014 at 7:03 PM, Brad Chamberlain <[email protected]> wrote:
>
>>
>> I don't have any insights, but will note that in our use cases, we tend
>> not to use paratest to oversubscribe testing on a single machine; rather we
>> farm out across multiple machines; so there may be some race/conflict which
>> only shows up in that situation?
>>
>> Assuming any issue is in the paratest servers themselves, it shouldn't
>> take you long to do the binary search -- I think there have only been five
>> changes to it since Jan.
>>
>> -Brad
>>
>>
>>
>> On Mon, 31 Mar 2014, Tom Hildebrandt wrote:
>>
>>  Hi Chris:
>>> The other change that I made in paratest.server was to remove the "wait"
>>> command on line 172 or thereabouts, so the timeout time is updated each
>>> second.  I can't really see how this would cause the error messages you're
>>> seeing.  On the other hand, I have never tested by forking a number of
>>> children equal to the number of processors available. I'll give that a try
>>> (most likely this evening).
>>>
>>> Tom H.
>>>
>>>  
>>> _____________________________________________________________________________
>>>
>>>
>>> From: Chris Wailes [[email protected]]
>>> Sent: Monday, March 31, 2014 3:34 PM
>>> To: Tom Hildebrandt
>>> Cc: Brad Chamberlain; Lydia Duncan; [email protected]
>>> Subject: Re: [Chapel-developers] Paratest and TooManyThreads.chpl
>>>
>>> I've been playing with this for a couple of days now, and even with skipif
>>> files for what I thought were the offending directories I end up getting
>>> the following output
>>> (https://gist.github.com/chriswailes/a1b0c4d8df4eb983607c) before the
>>> paratest.server script fails.  Running start_test works just fine, but if
>>> I try to run even 4 tests at once on my quad-core, hyperthreaded machine,
>>> I get these error messages.
>>>
>>> I haven't been as diligent with my rebasing as I should have been, so the
>>> last time I know the mainline's version of the scripts worked was on
>>> January 29th.  Does anyone know what might have changed since then to have
>>> caused this problem?  Before I was able to run 10 tests at a time on this
>>> same machine.  I'm about to head home now, but tomorrow I'll run a binary
>>> search on the commit history to try and pin down the commit that caused
>>> this to stop working.
>>>
>>> - Chris
>>>
>>>
>>> On Fri, Mar 28, 2014 at 12:32 PM, Tom Hildebrandt <[email protected]>
>>> wrote:
>>>       That is correct.
>>>       Note also that the .skipif file that skips a directory and its
>>>       descendants is a sibling of the directory to be skipped, whereas
>>>       the directory-wide SKIPIF file resides within the directory it
>>>       affects.  Compare
>>>         test/chpldoc         <-- Skip testing here and in all descendants
>>>         test/chpldoc.skipif  <-- if this script tests true.
>>>       vs.
>>>         test/distributions/deitz/SKIPIF <-- Skip testing in the
>>>       containing directory (only) if this script tests true.
>>>
>>>       THH
>>>       ________________________________________
>>>       From: Brad Chamberlain [[email protected]]
>>>       Sent: Friday, March 28, 2014 6:37 AM
>>>       To: Lydia Duncan; [email protected]
>>>       Subject: Re: [Chapel-developers] Paratest and TooManyThreads.chpl
>>>
>>>       IIRC, a difference between the two approaches is that putting it in
>>>       the parent skips all recursive traversal below that directory as
>>>       well, whereas putting it within the directory just skips that
>>>       directory, but not its children?
>>>
>>>       -Brad
>>>
>>>       ________________________________________
>>>       From: Lydia Duncan [[email protected]]
>>>       Sent: Thursday, March 27, 2014 2:57 PM
>>>       To: [email protected]
>>>       Subject: Re: [Chapel-developers] Paratest and TooManyThreads.chpl
>>>
>>>       On 03/27/2014 02:53 PM, Chris Wailes wrote:
>>>       > Do skipif files work for directories?
>>>       Yup!  You can either make a SKIPIF within the directory, or make a
>>>       <dirname>.skipif file in its parent directory.
>>>
>>>       Lydia
>>>
>>> ------------------------------------------------------------
>>> ----------------
>>>       --
>>>       _______________________________________________
>>>       Chapel-developers mailing list
>>>       [email protected]
>>>       https://lists.sourceforge.net/lists/listinfo/chapel-developers
>>>
>
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
