Trial correction. This part of boinc_api.cpp, line 778:

    // various platforms have problems shutting down a process
    // while other threads are still executing,
    // or triggering endless exit()/atexit() loops.
    //
    BOINCINFO("Exit Status: %d", status);
    fflush(NULL);

#if defined(_WIN32)
    // Halt all the threads and clean up.
    TerminateProcess(GetCurrentProcess(), status);
    // note: the above CAN return!
    Sleep(1000);
    DebugBreak();
#elif defined(__APPLE_CC__)

Becomes:

    // various platforms have problems shutting down a process
    // while other threads are still executing,
    // or triggering endless exit()/atexit() loops.
    //
    BOINCINFO("Exit Status: %d", status);
    fflush(NULL);

#if defined(_WIN32)
    // JG: Buffered IO is not committed to disk on flush, so commit it;
    // add other file descriptors if needed.
    // _commit() (declared in <io.h>) takes a file descriptor, hence _fileno().
    _commit(_fileno(stderr));
    // Halt all the threads and clean up.
    TerminateProcess(GetCurrentProcess(), status);
    // note: the above CAN return!
    // JG: It does. It's asynchronous and system dependent, so this thread
    // keeps running for some time. That's why you do this:
    WaitForSingleObject(GetCurrentProcess(), INFINITE);
    Sleep(1000);    // JG: Will never be reached
    DebugBreak();   // JG: Will never be reached
#elif defined(__APPLE_CC__)

------------------------------------------------------------------------
Jason Richard Groothuis
bSc(compSci)
------------------------------------------------------------------------

> From: jason_grooth...@hotmail.com
> To: r.haselgr...@btopenworld.com; da...@ssl.berkeley.edu;
>     boinc_dev@ssl.berkeley.edu
> Date: Sun, 12 Jul 2015 05:47:37 +0930
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
>     Milkyway tasks
>
> Comparing versions at leisure, but this same issue dates back, in slight
> variations, to when I started crunching in 2007, with different symptoms
> depending on build characteristics and the afflicted system (OS version,
> number of cores, and GPU or CPU). You'd be looking for a history of the
> boinc_exit() function. I don't think it ever had the wait after
> TerminateProcess(), so IO cancellations are likely, depending on
> timing/chance.
>
> ------------------------------------------------------------------------
> Jason Richard Groothuis
> bSc(compSci)
> ------------------------------------------------------------------------
>
> Date: Sat, 11 Jul 2015 20:07:56 +0000
> From: r.haselgr...@btopenworld.com
> To: jason_grooth...@hotmail.com; da...@ssl.berkeley.edu;
>     boinc_dev@ssl.berkeley.edu
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
>     Milkyway tasks
>
> The Milkyway application we are mostly observing this with is
> milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe,
> which was deployed to their server on 6 Oct 2014, 20:18:34 UTC. The internal
> signature says "API_VERSION_6.13.0".
>
> On Saturday, 11 July 2015, 20:55, Jason Groothuis
> <jason_grooth...@hotmail.com> wrote:
>
> "Perhaps the exit process has been invoked in the Milkyway app, but not all
> consequent OS functions have completed in time."
>
> Correct, since the TerminateProcess() call, which is asynchronous and so
> returns immediately without necessarily doing anything, is missing the
> WaitForSingleObject() on the current process after it. The process
> resources will be cleaned up as part of OS garbage collection *sometime*
> down the road.
>
> Doubting the accuracy of the MSDN documentation on these functions is fine,
> but wondering why it doesn't work as expected when you ignore it is just odd.
>
> On Saturday, 11 July 2015, 20:09, Jason Groothuis
> <jason_grooth...@hotmail.com> wrote:
>
> Not sure how much detail you'd like on the situation (I can provide much
> more). It's a result of buffered IO implemented in multithreaded C runtimes,
> in some situations using deferred procedure calls.
> Internal helper threads are being killed before commits are completed.
>
> Least desirable partial workaround (but helps):
> - disable buffered IO by linking the application with the MS-supplied
>   COMMODE.OBJ
>
> Probably better, but not tested:
> - initiate a low-level _commit() and add the missing WaitForSingleObject()
>   after the TerminateProcess() call
>
> Best:
> - do a low-level _commit() and check that the file modification time was
>   updated, then preferably use a friendly means of exit that allows
>   DLL/thread cleanup, closing threads/processes using sentinel flags, like
>   while(!done) instead of while(1) with kills.
>
> ------------------------------------------------------------------------
> Jason Richard Groothuis
> bSc(compSci)
> ------------------------------------------------------------------------
>
> Date: Sat, 11 Jul 2015 11:30:32 -0700
> From: da...@ssl.berkeley.edu
> To: r.haselgr...@btopenworld.com; boinc_dev@ssl.berkeley.edu
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
>     Milkyway tasks
>
> Richard:
> Can you please ask him to set <task_debug> as well?
>
> I have no theories about what could cause this.
> The BOINC client learns that a job is finished when its process has exited,
> and by that time all files are closed and locks released
> (I'm assuming the MW@h app is single-process - is that correct?)
>
> In this case, when the job finishes, the client successfully reads stderr.txt
> (otherwise <stderr_txt> would be absent or there would be an error message)
> but it's empty.
> This would be the case, e.g., if the writing process hadn't exited yet
> and its stderr buffer wasn't flushed.
> But the process has exited.
>
> Anyone have any ideas?
>
> -- David
>
> On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > User Keith Myers (UID 147145 at
> > http://milkyway.cs.rpi.edu/milkyway/index.php) has
> > asked for my help in identifying task failures at Milkyway.
> >
> > At my
> > suggestion, he installed Windows client v7.6.2, and the attached message log
> > extracts show the enhanced <slot_debug> output that helped identify the CMS-dev
> > problem.
> >
> > In both cases, the task under scrutiny
> >
> > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
> >     http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> >
> > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
> >     http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> >
> > was declared 'Validate error', and the <stderr_txt> section is empty. In the
> > special case of Milkyway@Home, these two observations are linked, because the
> > science result is returned in stderr, not a separate upload file.
> >
> > Also in both cases, the <slot_debug> log contains
> >
> > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
> >
> > between 'handle_exited_app()' and 'Computation for task ... finished'
> >
> > It appears that there is a race condition, whereby BOINC tries (and fails) to
> > delete stderr.txt before the operating system has released the write lock. This
> > (I'm presuming) also explains why the file appears empty when read off the disk
> > for incorporation into the client_state structure in memory, prior to reporting
> > the completed task to the project.
> >
> > In order to preserve the scientific result at Milkyway (and debug and other
> > useful information at other projects), the client should not initiate
> > 'handle_exited_app()' until it has confirmed that the write lock on stderr.txt has
> > been released.
> >
> > Log 1 also shows that the additional safeguards on cleaning out slots are working
> > properly: if both handle_exited_app() and get_free_slot() fail to delete the file,
> > the next task isn't started in the not-empty slot (11), but in slot 14 instead.
> > And when slot 11 is tested again at the next get_free_slot(), the delete succeeds
> > and the now-empty slot is reused.

_______________________________________________
boinc_dev mailing list
boinc_dev@ssl.berkeley.edu
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
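
For the "Best" option quoted earlier in the thread (sentinel flags, while(!done) instead of while(1) with kills), a minimal sketch of a cooperative shutdown might look like the following. The Worker class and its member names are hypothetical and for illustration only; they are not taken from BOINC or the Milkyway application.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical worker (not BOINC code): the loop polls a sentinel flag
// so the thread can finish its own cleanup instead of being killed.
class Worker {
public:
    void start() {
        done_.store(false);
        thread_ = std::thread([this] { run(); });
    }
    // Ask the loop to stop, then wait until the thread has really returned.
    void stop() {
        done_.store(true);
        if (thread_.joinable()) thread_.join();
    }
    bool finished_cleanly() const { return cleaned_up_.load(); }
    ~Worker() { stop(); }

private:
    void run() {
        while (!done_.load()) {   // sentinel flag instead of while(1)
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        // Cleanup point: flush and close files here, before the thread exits.
        cleaned_up_.store(true);
    }
    std::atomic<bool> done_{false};
    std::atomic<bool> cleaned_up_{false};
    std::thread thread_;
};
```

The point of the design is that stop() only returns after the worker has passed its cleanup point, so any pending output has a chance to be written before the process exits.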