Trial correction, this part of boinc_api.cpp, line 778:

    // various platforms have problems shutting down a process
    // while other threads are still executing,
    // or triggering endless exit()/atexit() loops.
    //
    BOINCINFO("Exit Status: %d", status);
    fflush(NULL);

#if defined(_WIN32)
    // Halt all the threads and clean up.
    TerminateProcess(GetCurrentProcess(), status);
    // note: the above CAN return!
    Sleep(1000);
    DebugBreak();
#elif defined(__APPLE_CC__)

Becomes

    // various platforms have problems shutting down a process
    // while other threads are still executing,
    // or triggering endless exit()/atexit() loops.
    //
    BOINCINFO("Exit Status: %d", status);
    fflush(NULL);

#if defined(_WIN32)
    // JG: Buffered IO is not committed to disk on flush, so commit it;
    // add other file descriptors if needed.
    // _commit() takes a file descriptor, hence _fileno().
    _commit(_fileno(stderr));
    // Halt all the threads and clean up.
    TerminateProcess(GetCurrentProcess(), status);
    // note: the above CAN return!
    // JG: It does. It's asynchronous and system dependent; this thread
    // runs on for some time, so
    WaitForSingleObject(GetCurrentProcess(), INFINITE); // That's why you do this
    Sleep(1000);   // JG: Will never be reached
    DebugBreak();  // JG: Will never be reached
#elif defined(__APPLE_CC__)
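
For anyone who wants to try the flush-then-commit step outside boinc_api.cpp, here is a minimal portable sketch. The helper names flush_and_commit() and read_all() are made up for illustration and are not part of the BOINC API. On Windows the commit is _commit(_fileno(fp)); on other platforms fsync(fileno(fp)) is assumed to be the closest equivalent. It does not cover the TerminateProcess()/WaitForSingleObject() part.

```cpp
#include <stdio.h>
#include <string>

#if defined(_WIN32)
#include <io.h>       // _commit
#else
#include <unistd.h>   // fsync
#endif

// Hypothetical helper (not part of the BOINC API): push a stream's
// buffered data to the OS, then ask the OS to write it to disk.
// Returns true only if both steps succeed. Note that the commit step
// can fail on streams that are not regular files (e.g. a console).
bool flush_and_commit(FILE* fp) {
    if (fp == nullptr) return false;
    if (fflush(fp) != 0) return false;        // C runtime buffer -> OS
#if defined(_WIN32)
    return _commit(_fileno(fp)) == 0;         // OS cache -> disk
#else
    return fsync(fileno(fp)) == 0;            // OS cache -> disk
#endif
}

// Hypothetical helper for checking the result: read a whole file back
// through a separate handle, as a second process such as the client would.
std::string read_all(const char* path) {
    std::string out;
    FILE* fp = fopen(path, "rb");
    if (fp == nullptr) return out;
    char buf[256];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) out.append(buf, n);
    fclose(fp);
    return out;
}
```

Calling flush_and_commit() on a file that is still open for writing, then reading it with read_all(), should return everything written so far. Whether this alone cures the empty stderr.txt is untested.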


------------------------------------------------------------------------------------------------------
Jason Richard Groothuis 
bSc(compSci)

------------------------------------------------------------------------------------------------------


> From: jason_grooth...@hotmail.com
> To: r.haselgr...@btopenworld.com; da...@ssl.berkeley.edu; 
> boinc_dev@ssl.berkeley.edu
> Date: Sun, 12 Jul 2015 05:47:37 +0930
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates 
> Milkyway tasks
> 
> Comparing versions at leisure, but this same issue dates back, in slight 
> variations, to when I started crunching in 2007, with different symptoms 
> depending on build characteristics and the afflicted system (OS version, 
> #cores, and GPU or CPU).  You'd be looking for a history on the boinc_exit() 
> function.  Don't think it ever had the wait after TerminateProcess(), so IO 
> cancellations are likely, depending on timing/chance.
> 
> ------------------------------------------------------------------------------------------------------
> Jason Richard Groothuis 
> bSc(compSci)
> 
> ------------------------------------------------------------------------------------------------------
> 
> 
> Date: Sat, 11 Jul 2015 20:07:56 +0000
> From: r.haselgr...@btopenworld.com
> To: jason_grooth...@hotmail.com; da...@ssl.berkeley.edu; 
> boinc_dev@ssl.berkeley.edu
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates 
> Milkyway tasks
> 
> The Milkyway application we are mostly observing this with is 
> milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe, 
> which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal 
> signature says "API_VERSION_6.13.0"
> 
>   
> 
> 
>      On Saturday, 11 July 2015, 20:55, Jason Groothuis 
> <jason_grooth...@hotmail.com> wrote:
>      
> 
>  "Perhaps the exit process has been invoked in the Milkyway app, but not all 
> consequent OS functions have completed in time."
> 
> Correct, since the TerminateProcess() call, which is asynchronous and so 
> returns immediately without necessarily doing anything, is missing the 
> WaitForSingleObject() on the current process after it.  The process 
> resources will be cleaned up as part of OS garbage collection *sometime* 
> down the road.
> 
> Doubting the accuracy of the MSDN documentation on these functions is fine, 
> but wondering why it doesn't work as expected when you ignore it is just odd.
> 
>      On Saturday, 11 July 2015, 20:09, Jason Groothuis 
> <jason_grooth...@hotmail.com> wrote:
> 
> Not sure how much detail you'd like on the situation. (Can provide much more)
> 
> It's a result of buffered IO implemented in multithreaded C runtimes, in some 
> situations using deferred procedure calls.  Internal helper threads are being 
> killed before commits are completed.
> 
> Least desirable partial workaround (but helps):
>  - disable buffered IO by linking the application with the MS-supplied 
>    COMMODE.OBJ
> 
> Probably better, but not tested:
>  - initiate a low-level _commit() and add the missing WaitForSingleObject() 
>    after the TerminateProcess() call
> 
> Best:
>  - do a low-level _commit() and check that the file modification time has 
>    updated, then preferably use a friendly means of exit that allows 
>    DLL/thread cleanup, closing threads/processes using sentinel flags, like 
>    while(!done) instead of while(1) with kills.
> 
> ------------------------------------------------------------------------------------------------------
> Jason Richard Groothuis 
> bSc(compSci)
> ------------------------------------------------------------------------------------------------------
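
The while(!done) idea in the "Best" option can be sketched in standard C++ roughly as follows. This is only an illustration under assumptions: the Worker struct and its members are hypothetical names, not anything in BOINC or the Milkyway app, and std::thread stands in for whatever threading the application really uses.

```cpp
#include <atomic>
#include <thread>

// Hypothetical worker: loops until asked to stop, then returns normally,
// so its thread can be joined instead of being killed mid-write.
struct Worker {
    std::atomic<bool> done{false};      // sentinel flag checked by the loop
    std::atomic<long> iterations{0};    // stand-in for real work
    std::thread thread;

    void start() {
        thread = std::thread([this] {
            while (!done.load()) {      // while(!done) instead of while(1)
                ++iterations;
                std::this_thread::yield();
            }
            // Normal return: any cleanup and final IO can happen here.
        });
    }

    void stop() {
        done.store(true);               // ask the loop to finish
        if (thread.joinable()) thread.join();   // wait for a clean exit
    }
};
```

Calling stop() before the process exits lets the worker finish its current pass and return on its own, so there is nothing left running for TerminateProcess() to cut off.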
> > Date: Sat, 11 Jul 2015 11:30:32 -0700
> > From: da...@ssl.berkeley.edu
> > To: r.haselgr...@btopenworld.com; boinc_dev@ssl.berkeley.edu
> > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates 
> > Milkyway tasks
> > 
> > Richard:
> > Can you please ask him to set <task_debug> as well?
> > 
> > I have no theories about what could cause this.
> > The BOINC client learns that a job is finished when its process has exited,
> > and by that time all files are closed and locks released
> > (I'm assuming the MW@h app is single-process - is that correct?)
> > 
> > In this case, when the job finishes, the client successfully reads stderr.txt
> > (otherwise <stderr_txt> would be absent or there would be an error message)
> > but it's empty.
> > This would be the case, e.g., if the writing process hadn't exited yet
> > and its stderr buffer wasn't flushed.
> > But the process has exited.
> > 
> > Anyone have any ideas?
> > 
> > -- David
> > 
> > On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > > User Keith Myers (UID 147145 at 
> > > http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in 
> > > identifying task failures at Milkyway.
> > >
> > > At my suggestion, he installed Windows client v7.6.2, and the attached 
> > > message log extracts show the enhanced <slot_debug> output that helped 
> > > identify the CMS-dev problem.
> > >
> > > In both cases, the task under scrutiny
> > >
> > > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0, 
> > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> > >
> > > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0, 
> > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> > >
> > > was declared 'Validate error', and the <stderr_txt> section is empty. In 
> > > the special case of Milkyway@Home, these two observations are linked, 
> > > because the science result is returned in stderr, not a separate upload 
> > > file.
> > >
> > > Also in both cases, the <slot_debug> log contains
> > >
> > > [slot] failed to remove file slots/x/stderr.txt:  unlink() failed
> > >
> > > between 'handle_exited_app()' and 'Computation for task ... finished'
> > >
> > > It appears that there is a race condition, whereby BOINC tries (and fails) 
> > > to delete stderr.txt before the operating system has released the write 
> > > lock. This (I'm presuming) also explains why the file appears empty when 
> > > read off the disk for incorporation into the client_state structure in 
> > > memory, prior to reporting the completed task to the project.
> > >
> > > In order to preserve the scientific result at Milkyway (and debug and 
> > > other useful information at other projects), the client should not 
> > > initiate 'handle_exited_app()' until it has confirmed that the write lock 
> > > on stderr.txt has been released.
> > >
> > >
> > > Log 1 also shows that the additional safeguards on cleaning out slots are 
> > > working properly: if both handle_exited_app() and get_free_slot() fail to 
> > > delete the file, the next task isn't started in the not-empty slot (11), 
> > > but in slot 14 instead. And when slot 11 is tested again at the next 
> > > get_free_slot(), the delete succeeds and the now-empty slot is reused.
> > 
> > _______________________________________________
> > boinc_dev mailing list
> > boinc_dev@ssl.berkeley.edu
> > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > To unsubscribe, visit the above URL and
> > (near bottom of page) enter your email address.
> 
                                          
_______________________________________________
boinc_dev mailing list
boinc_dev@ssl.berkeley.edu
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
