Here is a private drop: http://boinc.berkeley.edu/dl/boinc.100615.x64.zip
----- Rom

-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of Richard Haselgrove
Sent: Wednesday, June 10, 2015 3:34 AM
To: David Anderson; Jacob Klein; Seke Rob
Cc: BOINC Development
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

Rom, if you could build a private drop, I'll report what the log says.

On Wednesday, 10 June 2015, 4:28, David Anderson <[email protected]> wrote:

I added a log message that may help a bit.
I'd like to track this down, even though it's minor.
-- David

On 19-May-2015 12:15 PM, Richard Haselgrove wrote:

OK, the delay happened again, and I captured a procmon log.

Copy of the BOINC log attached (period of interest is 19:35:30 to 19:35:41): also a simple extract of ProcMon for the same period. It has to be said, boinc.exe was doing surprisingly little.

I have kept the full ~200 MB native ProcMon log, which can be re-filtered to look for anything else of interest, if you can suggest some likely targets.

On Monday, 18 May 2015, 20:57, David Anderson <[email protected]> wrote:

That looks like what's needed.
Richard, if you can repro the inter-job delay, you could try using Process Monitor to capture as much as possible from the client during that period.
-- David

On 18-May-2015 11:12 AM, Jacob Klein wrote:
> Process Monitor can be used to "watch the things a process does" (you have to set
> up correct filters, etc.)... but I'm not sure if that includes sleeps. But if the
> process is waiting on a file or something, though, it should be able to tell you.
> Worth looking into.
>
> https://technet.microsoft.com/en-us/library/bb896645.aspx
>
> Regards,
> Jacob
>
> ------------------------------------------------------------------------------------
> Date: Mon, 18 May 2015 10:41:16 -0700
> From: [email protected]
> To: [email protected]; [email protected]; [email protected]
> CC: [email protected]
> Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
>
> I looked at this and couldn't figure out the source of the 12-sec delay.
> In general, delays could happen because
> 1) the client does something that takes a long time (like copying a 5 GB file)
> 2) the client sleeps (i.e. calls boinc_sleep()).
> It does this in a few situations,
> like backing off and retrying a file system operation.
> But there's no indication that either of these is happening here.
>
> Does Windows have a way of logging the system calls that a process makes
> (like strace on Unix)?
> If so that might reveal what the client is doing during those 12 seconds.
>
> -- David
>
> On 16-May-2015 8:01 AM, Richard Haselgrove wrote:
>
> Here is the message log file for a GPUGrid task finish. The 12-second delay
> appears again between 14:26:35 and 14:26:47 - that's after the slot directory
> has been cleared, and the exiting task has changed state from 'running' to
> 'uploading'. Two new tasks have been assigned to the GPU, but their (small)
> startup files have not yet been linked to their respective slot directories.
>
> I also attach directory listings for the slot and GPUGrid project folders at
> various stages of the cleanup: the slot held 34 files totalling 44,186,727
> bytes, which doesn't sound excessive: the largest file deletion (94,783,960
> bytes) occurred several minutes later, when that file finished uploading.
>
> I'll enable similar logging and watch what happens when the next GPUGrid task
> starts up, but from memory, the disruption to BOINC is less severe at startup.
>
>
> On Tuesday, 12 May 2015, 23:29, David Anderson <[email protected]> wrote:
>
> BTW: the client isn't completely single-threaded;
> it uses a separate thread to do CPU throttling.
> It would be feasible to also use separate threads for serving GUI RPC connections,
> which would allow client to remain responsive even while
> e.g. copying thousands of files to a slot dir.
> -- David
>
> On 12-May-2015 2:40 AM, Seke Rob wrote:
> > Reminds me of the Clean Energy Project, Phase 2 and why we have app_config and
> > <max_concurrent> and a default control of allowing 1 'In Progress' on a host. This
> > project sets up in slot copying near 6700 files [symlinking proposed long ago as
> > is done on several other WCG projects for the static files]. If more than one CEP2
> > is started the machine feels at times like a snail, responsiveness of the BOINC
> > manager is poor, many a time the less powerful systems incurring error zero status
> > exits or total fail. On an 8 core observed it could take over an hour before
> > actual computing commenced [CPU time logged]. Boot cycle requires manually
> > starting of tasks one by one. Kevin Reed few years ago raised a ticket for
> > staggered starting, where the models can reach several GB and bigger in the
> > coming. At any rate, as much as these 6700 files are copied, they also then are
> > needing of deletion at completion [physical or symlink references]. The effect of
> > starting 1 CEP2 and finishing / packaging / zipping and transmitting can easily
> > lead to several minutes of there not being any computing, just whirring, for
> > minutes, just elapsed being logged. The more run the more the issue compounds,
> > with the effect of what many incur, the exit zero status series, resetting to
> > start or last checkpoint with often hours of computing time lost.
> >
> > Maybe you'd like to get in touch with your confederates at WCG [Keith Uplinger],
> > to discuss the issue further as this is now nearing a 5 year continues frustration
> > [June 2010 launch, and a huge limitation on the speed of progress on this project].
> >
> > --SekeRob.
> >
> > On 12-5-2015 1:55, David Anderson wrote:
> >> That delay looks like it's caused by deleting files or by process cleanup.
> >> Does GPUGrid make lots of (non-output) files in the slot dir?
> >>
> >> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
> >> (gui_rpc_debug not needed).
> >>
> >> -- David
> >>
> >> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
> >>> Here's another example of a case where BOINC finds that it can't walk and chew
> >>> gum at the same time. The event of interest is
> >>>
> >>> 11/05/2015 18:35:34 | GPUGRID | Computation for task e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
> >>>
> >>> Following that, there's a 12-second interval where neither heartbeats nor GUI
> >>> RPC traffic was logged: during that time, the Task tab of the Manager was
> >>> unchanging, not showing the regular update of elapsed time for running tasks.
> >>>
> >>> async_file_debug was active at the time, but found no events to log.
> >>>
> >>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
> >>> think they are generated directly in the project folder and don't need to be
> >>> copied anywhere.
> >>>
> >>> Main log as attached file only.
> >>>
> >>> I'll catch a CMS-dev log later this evening, but after that, I'll be away for a
> >>> few days and I'll have to leave the bug-chase until the weekend.
> >>>
> >>>
> >>> On Monday, 11 May 2015, 9:42, Jacob Klein <[email protected]> wrote:
> >>>
> >>> I have seen this problem before, where the UI becomes unresponsive.
> >>> If I recall, it happens when a T4T task is being set up (ie: after everything was
> >>> downloaded). For me, I don't recall the problem ever "screwing over other
> >>> tasks", though.
> >>>
> >>> Try this to reproduce it: Attach to T4T, and get a task. It may take a while
> >>> to do that download, so you can "step away" for a bit. Then, once that task
> >>> is going, abort it. Downloading the 2nd task should be instantaneous
> >>> (nothing really to download), but instantiation of that 2nd task should
> >>> cause the UI to hang (showing the "Please wait" messagebox in the manager).
> >>>
> >>> Does that help?
> >>>
> >>> > Date: Sun, 10 May 2015 23:19:24 -0700
> >>> > From: [email protected]
> >>> > To: [email protected]; [email protected]
> >>> > CC: [email protected]
> >>> > Subject: Re: [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
> >>> >
> >>> > I did some initial testing and couldn't repro this;
> >>> > the client remains responsive while copying a 5 GB file to a slot dir.
> >>> > Does anyone else see this behavior?
> >>> >
> >>> > While testing this, please set "async_file_debug" log flag.
> >>> > This says when asynchronous file operations start and end.
> >>> >
> >>> > -- David
> >>> >
> >>> > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
> >>> > > One thing that may need attention if very large files become the norm is the
> >>> > > single-threaded nature of some parts of the core client. My 1-hour CMS test has
> >>> > > just finished, and a new 24-hour test started.
> >>> > >
> >>> > > I watched this happening, and part of the process is copying a 1.33 GB initial
> >>> > > .vmi image file (downloaded previously by BOINC from CERN) from the project
> >>> > > directory to the slot directory. This took about 90 seconds: during that time, all
> >>> > > Manager updating stopped. I'm sure it's the copying process which inhibited
> >>> > > updates: I was watching the slot directory, and the .vmi image file had appeared,
> >>> > > but other essential startup files hadn't.
> >>> > >
> >>> > > When BOINC regained its ability to communicate, three running tasks had exited
> >>> > > with the dreaded (and false) 'you may need to reset the project' advice. inline
> >>> > > log follows: because my last log got mangled by my ISP's new mail interface, I'll
> >>> > > attach it as a text file as well.
> >>> > >
> >>> > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
> >>> > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
> >>> > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
> >>> > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
> >>> > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
> >>> > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of
> >>> > > sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
> >>> > >
> >>> > >
> >>> > > On Sunday, 10 May 2015, 19:59, Seke Rob <[email protected]> wrote:
> >>> > >
> >>> > > Excellent this is all fixed and tested. Interest is/was that WCG's Clean
> >>> > > Energy at some point in time was to run very large models, talk of 4-8GB IIRC.
> >>> > >
> >>> > > --SekeRob
> >>> > >
> >>> > > On May 10, 2015 20:27, Richard Haselgrove <[email protected]> wrote:
> >>> > >
> >>> > > CMS only has stock applications configured for delivery to 64-bit platforms.
> >>> > > I've made an anonymous platform configuration using the 32-bit VBox Windows
> >>> > > wrapper: it has downloaded and is running its first 1-hour task. If that
> >>> > > completes successfully (it seems to have reached the fully-operational stage),
> >>> > > I'll try a full 24-hour task, which under current operational circumstances
> >>> > > should generate a >4 GB file locally.
> >>> > >
> >>> > >
> >>> > > On Sunday, 10 May 2015, 18:28, David Anderson <[email protected]> wrote:
> >>> > >
> >>> > > NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
> >>> > > 32-bit versions of Windows have APIs (like _stat64()) for handling > 4GB files.
> >>> > > BOINC needs to use these; we fixed one place where it wasn't.
> >>> > >
> >>> > > On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek()) but is built with a
> >>> > > -D_FILE_OFFSET_BITS=64 flag that causes these functions to 64-bit size.
> >>> > > However, it's possible that BOINC has bugs involving > 4GB files on Unix too.
> >>> > > If anyone has a 32-bit Linux system, please test with the CMS project.
> >>> > >
> >>> > > -- David
> >>> > >
> >>> > > On 10-May-2015 3:58 AM, --SekeRob wrote:
> >>> > > >
> >>> > > > Just wondering, with files over 4GB and a 64 bit lib introduced, is it not a CMS
> >>> > > > project requirement to run on a 64 bit OS?
> >>> > > >
> >>> > >
> >>> > > _______________________________________________
> >>> > > boinc_alpha mailing list
> >>> > > [email protected]
> >>> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
> >>> > > To unsubscribe, visit the above URL and
> >>> > > (near bottom of page) enter your email address.
> >
> > This email has been checked for viruses by Avast antivirus software.
> > www.avast.com
>
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
