OK, the delay happened again, and I captured a ProcMon log.
A copy of the BOINC log is attached (the period of interest is 19:35:30 to
19:35:41), along with a simple ProcMon extract covering the same period. It has
to be said, boinc.exe was doing surprisingly little.
I have kept the full ~200 MB native ProcMon log, which can be re-filtered to look
for anything else of interest, if you can suggest some likely targets.
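
In case it helps with picking targets, here is a minimal sketch (my own, not part of BOINC) for locating silent stretches in a saved copy of the event log, so that the ProcMon capture can be re-filtered to just those windows. It assumes each entry starts with a `DD/MM/YYYY HH:MM:SS` timestamp followed by ` | `, as in the log excerpts quoted further down; other locales would need a different format string. The name `find_gaps`, the 5-second threshold and the sample entries are illustrative assumptions only.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from the quoted log excerpts (day first).
STAMP_FORMAT = "%d/%m/%Y %H:%M:%S"


def find_gaps(lines, threshold=timedelta(seconds=5)):
    """Yield (previous, current) timestamps for consecutive log entries
    that are further apart than `threshold`."""
    previous = None
    for line in lines:
        stamp = line.split(" | ", 1)[0].strip()
        try:
            current = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # wrapped continuation line or unrelated text
        if previous is not None and current - previous > threshold:
            yield previous, current
        previous = current


# Hypothetical sample entries: only the layout mirrors the real log.
sample = [
    "11/05/2015 18:35:34 | GPUGRID | Computation for task EXAMPLE_TASK finished",
    "11/05/2015 18:35:46 | GPUGRID | Starting task ANOTHER_EXAMPLE_TASK",
    "11/05/2015 18:35:47 | GPUGRID | Started upload of EXAMPLE_FILE",
]

gaps = list(find_gaps(sample))
for start, end in gaps:
    seconds = (end - start).total_seconds()
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S} ({seconds:.0f} s)")
# prints: 18:35:34 -> 18:35:46 (12 s)
```

For a real log, pass the open file object instead of `sample`. Each reported window could then be used as a time-of-day filter in ProcMon.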
On Monday, 18 May 2015, 20:57, David Anderson <[email protected]> wrote:
That looks like what's needed.
Richard, if you can repro the inter-job delay,
you could try using Process Monitor to capture as much
as possible from the client during that period.
-- David
On 18-May-2015 11:12 AM, Jacob Klein wrote:
> Process Monitor can be used to "watch the things a process does" (you have to set
> up correct filters, etc.)... but I'm not sure if that includes sleeps. But if the
> process is waiting on a file or something, though, it should be able to tell you.
> Worth looking into.
>
> https://technet.microsoft.com/en-us/library/bb896645.aspx
>
> Regards,
> Jacob
>
>
>
> ------------------------------------------------------------------------------------
> Date: Mon, 18 May 2015 10:41:16 -0700
> From: [email protected]
> To: [email protected]; [email protected]; [email protected]
> CC: [email protected]
> Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without
> ensuring they're empty
>
> I looked at this and couldn't figure out the source of the 12-sec delay.
> In general, delays could happen because
> 1) the client does something that takes a long time (like copying a 5 GB file)
> 2) the client sleeps (i.e. calls boinc_sleep()).
> It does this in a few situations,
> like backing off and retrying a file system operation.
> But there's no indication that either of these is happening here.
>
> Does Windows have a way of logging the system calls that a process makes
> (like strace on Unix)?
> If so that might reveal what the client is doing during those 12 seconds.
>
> -- David
>
> On 16-May-2015 8:01 AM, Richard Haselgrove wrote:
>
> Here is the message log file for a GPUGrid task finish. The 12-second delay
> appears again between 14:26:35 and 14:26:47 - that's after the slot directory
> has been cleared, and the exiting task has changed state from 'running' to
> 'uploading'. Two new tasks have been assigned to the GPU, but their (small)
> startup files have not yet been linked to their respective slot directories.
>
> I also attach directory listings for the slot and GPUGrid project folders at
> various stages of the cleanup: the slot held 34 files totalling 44,186,727
> bytes, which doesn't sound excessive: the largest file deletion (94,783,960
> bytes) occurred several minutes later, when that file finished uploading.
>
> I'll enable similar logging and watch what happens when the next GPUGrid task
> starts up, but from memory, the disruption to BOINC is less severe at startup.
>
>
>
> On Tuesday, 12 May 2015, 23:29, David Anderson <[email protected]> wrote:
>
>
>
> BTW: the client isn't completely single-threaded;
> it uses a separate thread to do CPU throttling.
> It would be feasible to also use separate threads
> for serving GUI RPC connections,
> which would allow the client to remain responsive even while
> e.g. copying thousands of files to a slot dir.
> -- David
>
> On 12-May-2015 2:40 AM, Seke Rob wrote:
> > Reminds me of the Clean Energy Project, Phase 2 and why we have app_config and
> > <max_concurrent> and a default control of allowing 1 'In Progress' on a host. This
> > project sets up in slot copying near 6700 files [symlinking proposed long ago as
> > is done on several other WCG projects for the static files]. If more than one CEP2
> > is started the machine feels at times like a snail, responsiveness of the BOINC
> > manager is poor, many a time the less powerful systems incurring error zero status
> > exits or total fail. On an 8 core observed it could take over an hour before
> > actual computing commenced [CPU time logged]. Boot cycle requires manually
> > starting of tasks one by one. Kevin Reed few years ago raised a ticket for
> > staggered starting, where the models can reach several GB and bigger in the
> > coming. At any rate, as much as these 6700 files are copied, they also then are
> > needing of deletion at completion [physical or symlink references]. The effect of
> > starting 1 CEP2 and finishing / packaging / zipping and transmitting can easily
> > lead to several minutes of there not being any computing, just whirring, for
> > minutes, just elapsed being logged. The more run the more the issue compounds,
> > with the effect of what many incur, the exit zero status series, resetting to
> > start or last checkpoint with often hours of computing time lost.
> >
> > Maybe you'd like to get in touch with your confederates at WCG [Keith Uplinger],
> > to discuss the issue further as this is now nearing a 5 year continuous frustration
> > [June 2010 launch, and a huge limitation on the speed of progress on this project].
> >
> > --SekeRob.
> >
> > On 12-5-2015 1:55, David Anderson wrote:
> >> That delay looks like it's caused by deleting files or by process cleanup.
> >> Does GPUGrid make lots of (non-output) files in the slot dir?
> >>
> >> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
> >> (gui_rpc_debug not needed).
> >>
> >> -- David
> >>
> >> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
> >>> Here's another example of a case where BOINC finds that it can't walk and chew
> >>> gum at the same time. The event of interest is
> >>>
> >>> 11/05/2015 18:35:34 | GPUGRID | Computation for task
> >>> e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
> >>>
> >>> Following that, there's a 12-second interval where neither heartbeats nor GUI
> >>> RPC traffic was logged: during that time, the Task tab of the Manager was
> >>> unchanging, not showing the regular update of elapsed time for running tasks.
> >>>
> >>> async_file_debug was active at the time, but found no events to log.
> >>>
> >>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
> >>> think they are generated directly in the project folder and don't need to be
> >>> copied anywhere.
> >>>
> >>> Main log as attached file only.
> >>>
> >>> I'll catch a CMS-dev log later this evening, but after that, I'll be away for a
> >>> few days and I'll have to leave the bug-chase until the weekend.
> >>>
> >>>
> >>>
> >>>
> >>> On Monday, 11 May 2015, 9:42, Jacob Klein <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>> I have seen this problem before, where the UI becomes unresponsive. If I
> >>> recall, it happens when a T4T task is being set up (ie: after everything was
> >>> downloaded). For me, I don't recall the problem ever "screwing over other
> >>> tasks", though.
> >>>
> >>> Try this to reproduce it: Attach to T4T, and get a task. It may take a while
> >>> to do that download, so you can "step away" for a bit. Then, once that task
> >>> is going, abort it. Downloading the 2nd task should be instantaneous
> >>> (nothing really to download), but instantiation of that 2nd task should
> >>> cause the UI to hang (showing the "Please wait" messagebox in the manager).
> >>>
> >>> Does that help?
> >>> > Date: Sun, 10 May 2015 23:19:24 -0700
> >>> > From: [email protected]
> >>> > To: [email protected]; [email protected]
> >>> > CC: [email protected]
> >>> > Subject: Re: [boinc_alpha] BOINC re-using slot directories without
> >>> > ensuring they're empty
> >>> >
> >>> > I did some initial testing and couldn't repro this;
> >>> > the client remains responsive while copying a 5 GB file to a slot dir.
> >>> > Does anyone else see this behavior?
> >>> >
> >>> > While testing this, please set "async_file_debug" log flag.
> >>> > This says when asynchronous file operations start and end.
> >>> >
> >>> > -- David
> >>> >
> >>> > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
> >>> > > One thing that may need attention if very large files become the norm is the
> >>> > > single-threaded nature of some parts of the core client. My 1-hour CMS test has
> >>> > > just finished, and a new 24-hour test started.
> >>> > >
> >>> > >
> >>> > > I watched this happening, and part of the process is copying a 1.33 GB initial
> >>> > > .vmi image file (downloaded previously by BOINC from CERN) from the project
> >>> > > directory to the slot directory. This took about 90 seconds: during that time, all
> >>> > > Manager updating stopped. I'm sure it's the copying process which inhibited
> >>> > > updates: I was watching the slot directory, and the .vmi image file had appeared,
> >>> > > but other essential startup files hadn't.
> >>> > >
> >>> > >
> >>> > > When BOINC regained its ability to communicate, three running tasks had exited
> >>> > > with the dreaded (and false) 'you may need to reset the project' advice. Inline
> >>> > > log follows: because my last log got mangled by my ISP's new mail interface, I'll
> >>> > > attach it as a text file as well.
> >>> > >
> >>> > >
> >>> > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
> >>> > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
> >>> > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
> >>> > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
> >>> > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
> >>> > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
> >>> > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
> >>> > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
> >>> > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Sunday, 10 May 2015, 19:59, Seke Rob <[email protected]> wrote:
> >>> > >
> >>> > >
> >>> > >
> >>> > > Excellent this is all fixed and tested. Interest is/was that WCG's Clean
> >>> > > Energy at some point in time was to run very large models, talk of 4-8GB IIRC.
> >>> > >
> >>> > > --SekeRob
> >>> > >
> >>> > > On May 10, 2015 20:27, Richard Haselgrove <[email protected]> wrote:
> >>> > > CMS only has stock applications configured for delivery to 64-bit platforms.
> >>> > > I've made an anonymous platform configuration using the 32-bit VBox Windows
> >>> > > wrapper: it has downloaded and is running its first 1-hour task. If that
> >>> > > completes successfully (it seems to have reached the fully-operational stage),
> >>> > > I'll try a full 24-hour task, which under current operational circumstances
> >>> > > should generate a >4 GB file locally.
> >>> > >
> >>> > >
> >>> > > On Sunday, 10 May 2015, 18:28, David Anderson <[email protected]> wrote:
> >>> > >
> >>> > >
> >>> > >
> >>> > > NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
> >>> > > 32-bit versions of Windows have APIs (like _stat64()) for handling > 4GB files.
> >>> > > BOINC needs to use these; we fixed one place where it wasn't.
> >>> > >
> >>> > > On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek()) but is
> >>> > > built with a
> >>> > > -D_FILE_OFFSET_BITS=64 flag that causes these functions to use 64-bit sizes.
> >>> > > However, it's possible that BOINC has bugs involving > 4GB files on Unix too.
> >>> > > If anyone has a 32-bit Linux system, please test with the CMS project.
> >>> > >
> >>> > > -- David
> >>> > >
> >>> > > On 10-May-2015 3:58 AM, --SekeRob wrote:
> >>> > > >
> >>> > > > Just wondering, with files over 4GB and a 64 bit lib introduced, is it not a CMS
> >>> > > > project requirement to run on a 64 bit OS?
> >>> > > >
> >>> > > >
> >>> > >
> >>> > > _______________________________________________
> >>> > > boinc_alpha mailing list
> >>> > > [email protected]
> >>> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
> >>> > > To unsubscribe, visit the above URL and
> >>> > > (near bottom of page) enter your email address.
> >>>
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> >
> >>>
> >>>
> >>>
> >>
> >
> >
> >
> >
> >
> >
>
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
>
>
>