Rom, if you could build a private drop, I'll report what the log says.
On Wednesday, 10 June 2015, 4:28, David Anderson <[email protected]>
wrote:
I added a log message that may help a bit.
I'd like to track this down, even though it's minor.
-- David
On 19-May-2015 12:15 PM, Richard Haselgrove wrote:
OK, the delay happened again, and I captured a procmon log.
A copy of the BOINC log is attached (the period of interest is 19:35:30 to 19:35:41),
along with a simple extract of the ProcMon log for the same period. It has to be said,
boinc.exe was doing surprisingly little.
I have kept the full ~200 MB native ProcMon log, which can be re-filtered to
look for anything else of interest, if you can suggest some likely targets.
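
A silent stretch like this can also be located mechanically in a saved message log. The following is a minimal sketch, assuming log lines in the `DD/MM/YYYY HH:MM:SS | project | message` layout quoted elsewhere in this thread. The function name `find_gaps`, the 10-second threshold, and the sample lines are illustrative assumptions, not part of BOINC.

```python
from datetime import datetime


def find_gaps(lines, min_gap_seconds=10):
    """Return (previous_time, next_time, gap_seconds) for each silent stretch.

    Hypothetical helper: assumes every log entry starts with a
    'DD/MM/YYYY HH:MM:SS' timestamp followed by ' | '.
    """
    gaps = []
    previous = None
    for line in lines:
        stamp = line.split("|", 1)[0].strip()
        try:
            current = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # wrapped continuation line or other text without a timestamp
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta >= min_gap_seconds:
                gaps.append((previous, current, delta))
        previous = current
    return gaps


# Made-up sample lines for illustration only; they are not real client output.
sample = [
    "11/05/2015 18:35:33 | EXAMPLE | placeholder message",
    "11/05/2015 18:35:34 | EXAMPLE | placeholder message",
    "a wrapped line with no timestamp",
    "11/05/2015 18:35:46 | EXAMPLE | placeholder message",
]

for start, end, seconds in find_gaps(sample):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: {seconds:.0f} s with nothing logged")
```

Running it over the full log with the debug flags enabled would show which messages bracket the gap, which may help narrow the ProcMon filter to the right few seconds.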
On Monday, 18 May 2015, 20:57, David Anderson <[email protected]>
wrote:
That looks like what's needed.
Richard, if you can repro the inter-job delay,
you could try using Process Monitor to capture as much
as possible from the client during that period.
-- David
On 18-May-2015 11:12 AM, Jacob Klein wrote:
> Process Monitor can be used to "watch the things a process does" (you have to set
> up correct filters, etc.)... but I'm not sure if that includes sleeps. If the
> process is waiting on a file or something, though, it should be able to tell you.
> Worth looking into.
>
> https://technet.microsoft.com/en-us/library/bb896645.aspx
>
> Regards,
> Jacob
>
>
>------------------------------------------------------------------------------------
> Date: Mon, 18 May 2015 10:41:16 -0700
> From: [email protected]
> To: [email protected]; [email protected]; [email protected]
> CC: [email protected]
> Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
>
> I looked at this and couldn't figure out the source of the 12-sec delay.
> In general, delays could happen because
> 1) the client does something that takes a long time (like copying a 5 GB file)
> 2) the client sleeps (i.e. calls boinc_sleep()).
> It does this in a few situations,
> like backing off and retrying a file system operation.
> But there's no indication that either of these is happening here.
>
> Does Windows have a way of logging the system calls that a process makes
> (like strace on Unix)?
> If so that might reveal what the client is doing during those 12 seconds.
>
> -- David
>
> On 16-May-2015 8:01 AM, Richard Haselgrove wrote:
>
> Here is the message log file for a GPUGrid task finish. The 12-second delay
> appears again between 14:26:35 and 14:26:47 - that's after the slot directory
> has been cleared, and the exiting task has changed state from 'running' to
> 'uploading'. Two new tasks have been assigned to the GPU, but their (small)
> startup files have not yet been linked to their respective slot directories.
>
> I also attach directory listings for the slot and GPUGrid project folders at
> various stages of the cleanup: the slot held 34 files totalling 44,186,727
> bytes, which doesn't sound excessive: the largest file deletion (94,783,960
> bytes) occurred several minutes later, when that file finished uploading.
>
> I'll enable similar logging and watch what happens when the next GPUGrid task
> starts up, but from memory, the disruption to BOINC is less severe at startup.
>
>
>
> On Tuesday, 12 May 2015, 23:29, David Anderson <[email protected]> wrote:
>
>
>
> BTW: the client isn't completely single-threaded;
> it uses a separate thread to do CPU throttling.
> It would be feasible to also use separate threads
> for serving GUI RPC connections,
> which would allow the client to remain responsive even while
> e.g. copying thousands of files to a slot dir.
> -- David
>
> On 12-May-2015 2:40 AM, Seke Rob wrote:
> > Reminds me of the Clean Energy Project, Phase 2 and why we have app_config and
> > <max_concurrent> and a default control of allowing 1 'In Progress' on a host. This
> > project sets up in slot copying near 6700 files [symlinking proposed long ago as
> > is done on several other WCG projects for the static files]. If more than one CEP2
> > is started the machine feels at times like a snail, responsiveness of the BOINC
> > manager is poor, many a time the less powerful systems incurring error zero status
> > exits or total fail. On an 8 core observed it could take over an hour before
> > actual computing commenced [CPU time logged]. Boot cycle requires manually
> > starting of tasks one by one. Kevin Reed few years ago raised a ticket for
> > staggered starting, where the models can reach several GB and bigger in the
> > coming. At any rate, as much as these 6700 files are copied, they also then are
> > needing of deletion at completion [physical or symlink references]. The effect of
> > starting 1 CEP2 and finishing / packaging / zipping and transmitting can easily
> > lead to several minutes of there not being any computing, just whirring, for
> > minutes, just elapsed being logged. The more run the more the issue compounds,
> > with the effect of what many incur, the exit zero status series, resetting to
> > start or last checkpoint with often hours of computing time lost.
> >
> > Maybe you'd like to get in touch with your confederates at WCG [Keith Uplinger],
> > to discuss the issue further as this is now nearing a 5 year continues frustration
> > [June 2010 launch, and a huge limitation on the speed of progress on this project].
> >
> > --SekeRob.
> >
> > On 12-5-2015 1:55, David Anderson wrote:
> >> That delay looks like it's caused by deleting files or by process cleanup.
> >> Does GPUGrid make lots of (non-output) files in the slot dir?
> >>
> >> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
> >> (gui_rpc_debug not needed).
> >>
> >> -- David
> >>
> >> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
> >>> Here's another example of a case where BOINC finds that it can't walk and chew
> >>> gum at the same time. The event of interest is
> >>>
> >>> 11/05/2015 18:35:34 | GPUGRID | Computation for task e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
> >>>
> >>> Following that, there's a 12-second interval where neither heartbeats nor GUI
> >>> RPC traffic was logged: during that time, the Task tab of the Manager was
> >>> unchanging, not showing the regular update of elapsed time for running tasks.
> >>>
> >>> async_file_debug was active at the time, but found no events to log.
> >>>
> >>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
> >>> think they are generated directly in the project folder and don't need to be
> >>> copied anywhere.
> >>>
> >>> Main log as attached file only.
> >>>
> >>> I'll catch a CMS-dev log later this evening, but after that, I'll be away for a
> >>> few days and I'll have to leave the bug-chase until the weekend.
> >>>
> >>>
> >>>
> >>>
> >>> On Monday, 11 May 2015, 9:42, Jacob Klein <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>> I have seen this problem before, where the UI becomes unresponsive. If I
> >>> recall, it happens when a T4T task is being set up (ie: after everything was
> >>> downloaded). For me, I don't recall the problem ever "screwing over other
> >>> tasks", though.
> >>>
> >>> Try this to reproduce it: Attach to T4T, and get a task. It may take a while
> >>> to do that download, so you can "step away" for a bit. Then, once that task
> >>> is going, abort it. Downloading the 2nd task should be instantaneous
> >>> (nothing really to download), but instantiation of that 2nd task should
> >>> cause the UI to hang (showing the "Please wait" messagebox in the manager).
> >>>
> >>> Does that help?
> >>> > Date: Sun, 10 May 2015 23:19:24 -0700
> >>> > From: [email protected]
> >>> > To: [email protected]; [email protected]
> >>> > CC: [email protected]
> >>> > Subject: Re: [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
> >>> >
> >>> > I did some initial testing and couldn't repro this;
> >>> > the client remains responsive while copying a 5 GB file to a slot dir.
> >>> > Does anyone else see this behavior?
> >>> >
> >>> > While testing this, please set "async_file_debug" log flag.
> >>> > This says when asynchronous file operations start and end.
> >>> >
> >>> > -- David
> >>> >
> >>> > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
> >>> > > One thing that may need attention if very large files become the norm is the
> >>> > > single-threaded nature of some parts of the core client. My 1-hour CMS test has
> >>> > > just finished, and a new 24-hour test started.
> >>> > >
> >>> > >
> >>> > > I watched this happening, and part of the process is copying a 1.33 GB initial
> >>> > > .vmi image file (downloaded previously by BOINC from CERN) from the project
> >>> > > directory to the slot directory. This took about 90 seconds: during that time, all
> >>> > > Manager updating stopped. I'm sure it's the copying process which inhibited
> >>> > > updates: I was watching the slot directory, and the .vmi image file had appeared,
> >>> > > but other essential startup files hadn't.
> >>> > >
> >>> > >
> >>> > > When BOINC regained its ability to communicate, three running tasks had exited
> >>> > > with the dreaded (and false) 'you may need to reset the project' advice. Inline
> >>> > > log follows: because my last log got mangled by my ISP's new mail interface, I'll
> >>> > > attach it as a text file as well.
> >>> > >
> >>> > >
> >>> > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
> >>> > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
> >>> > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
> >>> > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
> >>> > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
> >>> > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Sunday, 10 May 2015, 19:59, Seke Rob <[email protected]> wrote:
> >>> > >
> >>> > >
> >>> > >
> >>> > > Excellent this is all fixed and tested. Interest is/was that WCG's Clean
> >>> > > Energy at some point in time was to run very large models, talk of 4-8GB IIRC.
> >>> > >
> >>> > > --SekeRob
> >>> > >
> >>> > > On May 10, 2015 20:27, Richard Haselgrove <[email protected]> wrote:
> >>> > > CMS only has stock applications configured for delivery to 64-bit platforms.
> >>> > > I've made an anonymous platform configuration using the 32-bit VBox Windows
> >>> > > wrapper: it has downloaded and is running its first 1-hour task. If that
> >>> > > completes successfully (it seems to have reached the fully-operational stage),
> >>> > > I'll try a full 24-hour task, which under current operational circumstances
> >>> > > should generate a >4 GB file locally.
> >>> > >
> >>> > >
> >>> > > On Sunday, 10 May 2015, 18:28, David Anderson <[email protected]> wrote:
> >>> > >
> >>> > >
> >>> > >
> >>> > > NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
> >>> > > 32-bit versions of Windows have APIs (like _stat64()) for handling > 4GB files.
> >>> > > BOINC needs to use these; we fixed one place where it wasn't.
> >>> > >
> >>> > > On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek()) but is
> >>> > > built with a -D_FILE_OFFSET_BITS=64 flag that causes these functions to use
> >>> > > 64-bit file sizes and offsets.
> >>> > > However, it's possible that BOINC has bugs involving > 4GB files on Unix too.
> >>> > > If anyone has a 32-bit Linux system, please test with the CMS project.
> >>> > >
> >>> > > -- David
> >>> > >
> >>> > > On 10-May-2015 3:58 AM, --SekeRob wrote:
> >>> > > >
> >>> > > > Just wondering, with files over 4GB and a 64 bit lib introduced, is it not a CMS
> >>> > > > project requirement to run on a 64 bit OS?
> >>> > > >
> >>> > > >
> >>> > >
> >>> > > _______________________________________________
> >>> > > boinc_alpha mailing list
> >>> > > [email protected]
> >>> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
> >>> > > To unsubscribe, visit the above URL and
> >>> > > (near bottom of page) enter your email address.
> >>>
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> >
> >>>
> >>>
> >>>
> >>
> >
> >
> >
> >
>
> >
>
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
>
>
>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.