Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

Richard Haselgrove Wed, 10 Jun 2015 00:38:11 -0700

Rom, if you could build a private drop, I'll report what the log says.


     On Wednesday, 10 June 2015, 4:28, David Anderson <[email protected]> wrote:
   
 

  I added a log message that may help a bit.
 I'd like to track this down, even though it's minor.
 -- David
 
 On 19-May-2015 12:15 PM, Richard Haselgrove wrote:
  
  OK, the delay happened again, and I captured a procmon log.
  Copy of the BOINC log attached (period of interest is 19:35:30 to 19:35:41):
  also a simple extract of ProcMon for the same period. It has to be said,
  boinc.exe was doing surprisingly little.
  I have kept the full ~200 MB native ProcMon log, which can be re-filtered to
  look for anything else of interest, if you can suggest some likely targets.
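
  [Editor's sketch] Periods of interest like the one above can be found
  mechanically by scanning the event log for pauses between consecutive
  timestamps. The following is a minimal sketch, assuming the
  "DD/MM/YYYY HH:MM:SS | project | message" layout seen in the log excerpts
  in this thread (the date format may differ with locale). The function name
  find_gaps and the sample lines are made up for illustration.

```python
from datetime import datetime

def find_gaps(lines, threshold_seconds=10):
    """Return (previous_time, next_time, gap_seconds) for each pause
    between consecutive timestamped lines that exceeds the threshold."""
    gaps = []
    previous = None
    for line in lines:
        stamp = line.split(" | ", 1)[0].strip()
        try:
            current = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # continuation line or unrecognised format: skip it
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > threshold_seconds:
                gaps.append((previous, current, gap))
        previous = current
    return gaps

# Hypothetical sample lines, not taken from a real log.
sample = [
    "11/05/2015 18:35:34 | GPUGRID | Computation for task example_task_0 finished",
    "11/05/2015 18:35:46 | GPUGRID | Starting task example_task_1",
    "11/05/2015 18:35:47 | GPUGRID | Started upload of example_task_0_0",
]
for start, end, seconds in find_gaps(sample):
    print(f"{seconds:.0f} s gap between {start:%H:%M:%S} and {end:%H:%M:%S}")
# prints: 12 s gap between 18:35:34 and 18:35:46
```

  The threshold is a guess; with heartbeat or GUI RPC logging enabled, a much
  smaller value would probably be appropriate.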
 
 
       On Monday, 18 May 2015, 20:57, David Anderson <[email protected]> wrote:
   
 
 
 That looks like what's needed.
 Richard, if you can repro the inter-job delay,
 you could try using Process Monitor to capture as much
 as possible from the client during that period.
 -- David
 
 On 18-May-2015 11:12 AM, Jacob Klein wrote:
 > Process Monitor can be used to "watch the things a process does" (you have to
 > set up correct filters, etc.)... but I'm not sure if that includes sleeps. If
 > the process is waiting on a file or something, though, it should be able to
 > tell you. Worth looking into.
 >
 > https://technet.microsoft.com/en-us/library/bb896645.aspx
 >
 > Regards,
 > Jacob
 >
 >
 >------------------------------------------------------------------------------------
 > Date: Mon, 18 May 2015 10:41:16 -0700
 > From: [email protected]
 > To: [email protected]; [email protected]; [email protected]
 > CC: [email protected]
 > Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories
 > without ensuring they're empty
 >
 > I looked at this and couldn't figure out the source of the 12-sec delay.
 > In general, delays could happen because
 > 1) the client does something that takes a long time (like copying a 5 GB file)
 > 2) the client sleeps (i.e. calls boinc_sleep()).
 >    It does this in a few situations,
 >    like backing off and retrying a file system operation.
 > But there's no indication that either of these is happening here.
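
 [Editor's sketch] Both causes listed above (a long-running operation or a
 sleep) show up as wall-clock time spent inside one step of a polling loop.
 The following is a minimal illustration of timing each step and recording
 the slow ones. It is not how the BOINC client is instrumented; the names
 timed and slow_operations are hypothetical.

```python
import time
from contextlib import contextmanager

slow_operations = []  # (label, elapsed_seconds) for steps over the threshold

@contextmanager
def timed(label, threshold_seconds=1.0, clock=time.monotonic):
    """Time the enclosed block and record it if it ran longer than the threshold."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed > threshold_seconds:
            slow_operations.append((label, elapsed))

# A slow step (the sleep stands in for a long file operation or a back-off sleep).
with timed("clean slot directory", threshold_seconds=0.05):
    time.sleep(0.1)

# A fast step: nothing is recorded.
with timed("poll", threshold_seconds=0.05):
    pass

for label, elapsed in slow_operations:
    print(f"{label} took {elapsed:.2f} s")
```

 A monotonic clock is used so that system clock adjustments cannot produce
 false stalls.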
 >
 > Does Windows have a way of logging the system calls that a process makes
 > (like strace on Unix)?
 > If so that might reveal what the client is doing during those 12 seconds.
 >
 > -- David
 >
 > On 16-May-2015 8:01 AM, Richard Haselgrove wrote:
 >
 >    Here is the message log file for a GPUGrid task finish. The 12-second delay
 >    appears again between 14:26:35 and 14:26:47 - that's after the slot directory
 >    has been cleared, and the exiting task has changed state from 'running' to
 >    'uploading'. Two new tasks have been assigned to the GPU, but their (small)
 >    startup files have not yet been linked to their respective slot directories.
 >
 >    I also attach directory listings for the slot and GPUGrid project folders at
 >    various stages of the cleanup: the slot held 34 files totalling 44,186,727
 >    bytes, which doesn't sound excessive: the largest file deletion (94,783,960
 >    bytes) occurred several minutes later, when that file finished uploading.
 >
 >    I'll enable similar logging and watch what happens when the next GPUGrid task
 >    starts up, but from memory, the disruption to BOINC is less severe at startup.
 >
 >
 >
 >    On Tuesday, 12 May 2015, 23:29, David Anderson <[email protected]> wrote:
 >
 >
 >
 >        BTW: the client isn't completely single-threaded;
 >        it uses a separate thread to do CPU throttling.
 >        It would be feasible to also use separate threads
 >        for serving GUI RPC connections,
 >        which would allow client to remain responsive even while
 >        e.g. copying thousands of files to a slot dir.
 >        -- David
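
 [Editor's sketch] The idea in the message above, moving slow work to a
 separate thread so the main loop can keep serving requests, can be
 illustrated as follows. This is a toy under stated assumptions, not the
 client's design: copy_in_background is a hypothetical helper, and the "main
 loop" here only counts ticks where a real client would service GUI RPCs and
 heartbeats.

```python
import os
import shutil
import tempfile
import threading

def copy_in_background(src, dst):
    """Start copying src to dst on a worker thread.
    Returns an Event that is set when the copy ends (successfully or not)."""
    done = threading.Event()

    def worker():
        try:
            shutil.copyfile(src, dst)
        finally:
            done.set()  # a real implementation would also report errors

    threading.Thread(target=worker, daemon=True).start()
    return done

with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "input.bin")
    dst = os.path.join(workdir, "slot_copy.bin")
    with open(src, "wb") as f:
        f.write(b"\0" * (8 * 1024 * 1024))  # 8 MiB stand-in for a large input file

    done = copy_in_background(src, dst)
    ticks = 0
    while not done.wait(timeout=0.01):
        ticks += 1  # the main loop stays free to do other work here
    copied_size = os.path.getsize(dst)

print(copied_size, "bytes copied while the main loop ticked", ticks, "times")
```

 The tick count depends on disk speed and may well be zero for a file this
 small; the point is only that the loop is never blocked for the duration of
 the copy.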
 >
 >        On 12-May-2015 2:40 AM, Seke Rob wrote:
 >        > Reminds me of the Clean Energy Project, Phase 2 (CEP2), and why we have
 >        > app_config and <max_concurrent>, plus a default limit of 1 'In Progress'
 >        > task per host. During slot setup this project copies nearly 6700 files
 >        > [symlinking the static files, as is done on several other WCG projects,
 >        > was proposed long ago]. If more than one CEP2 task is started, the machine
 >        > at times feels like a snail and the BOINC Manager responds poorly; many a
 >        > time the less powerful systems incur exit-with-zero-status errors or fail
 >        > totally. On an 8-core machine I observed that it could take over an hour
 >        > before actual computing commenced [CPU time logged]. After a boot, tasks
 >        > have to be started manually, one by one. A few years ago Kevin Reed raised
 >        > a ticket for staggered starting, since the models can reach several GB
 >        > and bigger ones are coming. At any rate, the 6700 files that are copied
 >        > also need deleting at completion [physical files or symlink references].
 >        > Starting 1 CEP2 task and finishing / packaging / zipping and transmitting
 >        > can easily lead to several minutes with no computing at all, just
 >        > whirring, with only elapsed time being logged. The more tasks run, the
 >        > more the issue compounds, ending in what many users incur: a series of
 >        > exit-zero-status events, each resetting a task to its start or last
 >        > checkpoint, often with hours of computing time lost.
 >        >
 >        > Maybe you'd like to get in touch with your confederates at WCG [Keith
 >        > Uplinger] to discuss the issue further, as this is now nearing 5 years of
 >        > continuous frustration [June 2010 launch, and a huge limitation on the
 >        > speed of progress on this project].
 >        >
 >        > --SekeRob.
 >        >
 >        > On 12-5-2015 1:55, David Anderson wrote:
 >        >> That delay looks like it's caused by deleting files or by process cleanup.
 >        >> Does GPUGrid make lots of (non-output) files in the slot dir?
 >        >>
 >        >> Please try to repro it with slot_debug, task_debug, and heartbeat_debug set
 >        >> (gui_rpc_debug not needed).
 >        >>
 >        >> -- David
 >        >>
 >        >> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
 >        >>> Here's another example of a case where BOINC finds that it can't walk and chew
 >        >>> gum at the same time. The event of interest is
 >        >>>
 >        >>> 11/05/2015 18:35:34 | GPUGRID | Computation for task
 >        >>> e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
 >        >>>
 >        >>> Following that, there's a 12-second interval where neither heartbeats nor GUI
 >        >>> RPC traffic was logged: during that time, the Task tab of the Manager was
 >        >>> unchanging, not showing the regular update of elapsed time for running tasks.
 >        >>>
 >        >>> async_file_debug was active at the time, but found no events to log.
 >        >>>
 >        >>> These particular GPUGrid tasks generate around 90 MB of upload files, but I
 >        >>> think they are generated directly in the project folder and don't need to be
 >        >>> copied anywhere.
 >        >>>
 >        >>> Main log as attached file only.
 >        >>>
 >        >>> I'll catch a CMS-dev log later this evening, but after that, I'll be away for a
 >        >>> few days and I'll have to leave the bug-chase until the weekend.
 >        >>>
 >        >>>
 >        >>>
 >        >>>
 >        >>> On Monday, 11 May 2015, 9:42, Jacob Klein <[email protected]> wrote:
 >        >>>
 >        >>>
 >        >>>
 >        >>>    I have seen this problem before, where the UI becomes unresponsive. If I
 >        >>>    recall, it happens when a T4T task is being set up (ie: after everything was
 >        >>>    downloaded). For me, I don't recall the problem ever "screwing over other
 >        >>>    tasks", though.
 >        >>>
 >        >>>    Try this to reproduce it: Attach to T4T, and get a task. It may take a while
 >        >>>    to do that download, so you can "step away" for a bit. Then, once that task
 >        >>>    is going, abort it. Downloading the 2nd task should be instantaneous
 >        >>>    (nothing really to download), but instantiation of that 2nd task should
 >        >>>    cause the UI to hang (showing the "Please wait" messagebox in the manager).
 >        >>>
 >        >>>    Does that help?
 >        >>>    > Date: Sun, 10 May 2015 23:19:24 -0700
 >        >>>    > From: [email protected]
 >        >>>    > To: [email protected]; [email protected]
 >        >>>    > CC: [email protected]
 >        >>>    > Subject: Re: [boinc_alpha] BOINC re-using slot directories without
 >        >>>    > ensuring they're empty
 >        >>>    >
 >        >>>    > I did some initial testing and couldn't repro this;
 >        >>>    > the client remains responsive while copying a 5 GB file to a slot dir.
 >        >>>    > Does anyone else see this behavior?
 >        >>>    >
 >        >>>    > While testing this, please set "async_file_debug" log flag.
 >        >>>    > This says when asynchronous file operations start and end.
 >        >>>    >
 >        >>>    > -- David
 >        >>>    >
 >        >>>    > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
 >        >>>    > > One thing that may need attention if very large files become the norm is the
 >        >>>    > > single-threaded nature of some parts of the core client. My 1-hour CMS test has
 >        >>>    > > just finished, and a new 24-hour test started.
 >        >>>    > >
 >        >>>    > >
 >        >>>    > > I watched this happening, and part of the process is copying a 1.33 GB initial
 >        >>>    > > .vmi image file (downloaded previously by BOINC from CERN) from the project
 >        >>>    > > directory to the slot directory. This took about 90 seconds: during that time, all
 >        >>>    > > Manager updating stopped. I'm sure it's the copying process which inhibited
 >        >>>    > > updates: I was watching the slot directory, and the .vmi image file had appeared,
 >        >>>    > > but other essential startup files hadn't.
 >        >>>    > >
 >        >>>    > >
 >        >>>    > > When BOINC regained its ability to communicate, three running tasks had exited
 >        >>>    > > with the dreaded (and false) 'you may need to reset the project' advice.
 >        >>>    > > Inline log follows: because my last log got mangled by my ISP's new mail
 >        >>>    > > interface, I'll attach it as a text file as well.
 >        >>>    > >
 >        >>>    > >
 >        >>>    > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
 >        >>>    > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
 >        >>>    > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
 >        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
 >        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
 >        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
 >        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
 >        >>>    > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
 >        >>>    > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
 >        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
 >        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
 >        >>>    > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
 >        >>>    > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
 >        >>>    > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > > On Sunday, 10 May 2015, 19:59, Seke Rob <[email protected]> wrote:
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >    Excellent this is all fixed and tested. Interest is/was that WCG's Clean
 >        >>>    > >    Energy at some point in time was to run very large models, talk of 4-8GB IIRC.
 >        >>>    > >
 >        >>>    > >    --SekeRob
 >        >>>    > >
 >        >>>    > >    On May 10, 2015 20:27, Richard Haselgrove <[email protected]> wrote:
 >        >>>    > >    CMS only has stock applications configured for delivery to 64-bit platforms.
 >        >>>    > >    I've made an anonymous platform configuration using the 32-bit VBox Windows
 >        >>>    > >    wrapper: it has downloaded and is running its first 1-hour task. If that
 >        >>>    > >    completes successfully (it seems to have reached the fully-operational stage),
 >        >>>    > >    I'll try a full 24-hour task, which under current operational circumstances
 >        >>>    > >    should generate a >4 GB file locally.
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >        On Sunday, 10 May 2015, 18:28, David Anderson <[email protected]> wrote:
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >    NTFS handles > 4GB files, even if the hardware and/or OS is only 32-bit.
 >        >>>    > >    32-bit versions of Windows have APIs (like _stat64()) for handling > 4GB files.
 >        >>>    > >    BOINC needs to use these; we fixed one place where it wasn't.
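
 [Editor's sketch] Why 64-bit APIs matter for files over 4 GB: a byte count
 stored in a signed 32-bit field wraps around. The arithmetic below shows the
 effect in general terms; it does not reproduce any particular BOINC bug, and
 as_int32 is a hypothetical helper.

```python
def as_int32(value):
    """Reinterpret an integer as a signed 32-bit value, as a 32-bit size field would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

GIB = 1024 ** 3

print(as_int32(1 * GIB))  # 1073741824: fits, reported correctly
print(as_int32(3 * GIB))  # -1073741824: wraps to a negative size
print(as_int32(5 * GIB))  # 1073741824: a 5 GiB file looks like 1 GiB
```

 A 64-bit size field holds all of these values unchanged, which is the point
 of using the 64-bit variants of the file APIs.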
 >        >>>    > >
 >        >>>    > >    On Unix (Linux and Mac), BOINC uses the regular APIs (like lseek()) but is
 >        >>>    > >    built with a -D_FILE_OFFSET_BITS=64 flag that causes these functions to
 >        >>>    > >    use 64-bit file sizes and offsets.
 >        >>>    > >    However, it's possible that BOINC has bugs involving > 4GB files on Unix too.
 >        >>>    > >    If anyone has a 32-bit Linux system, please test with the CMS project.
 >        >>>    > >
 >        >>>    > >    -- David
 >        >>>    > >
 >        >>>    > >    On 10-May-2015 3:58 AM, --SekeRob wrote:
 >        >>>    > >    >
 >        >>>    > >    > Just wondering, with files over 4GB and a 64 bit lib introduced, is it not a CMS
 >        >>>    > >    > project requirement to run on a 64 bit OS?
 >        >>>    > >    >
 >        >>>    > >    >
 >        >>>    > >
 >        >>>    > > _______________________________________________
 >        >>>    > >    boinc_alpha mailing list
 >        >>>    > > [email protected]
 >        >>>    > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
 >        >>>    > >    To unsubscribe, visit the above URL and
 >        >>>    > >    (near bottom of page) enter your email address.
 >        >>>
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    >
 >        >>>
 >        >>>
 >        >>>
 >        >>
 >        >
 >        >
 >        >
 >        >
 >        >
 >        >
 >
 >        _______________________________________________
 >        boinc_dev mailing list
 >        [email protected]
 >        http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
 >
 >        To unsubscribe, visit the above URL and
 >        (near bottom of page) enter your email address.
 >
 >
 >
 