Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

Rom Walton Wed, 10 Jun 2015 10:17:04 -0700

Here is a private drop:
http://boinc.berkeley.edu/dl/boinc.100615.x64.zip


----- Rom

-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of 
Richard Haselgrove
Sent: Wednesday, June 10, 2015 3:34 AM
To: David Anderson; Jacob Klein; Seke Rob
Cc: BOINC Development
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without 
ensuring they're empty

Rom, if you could build a private drop, I'll report what the log says. 


     On Wednesday, 10 June 2015, 4:28, David Anderson <[email protected]> 
wrote:
   
 

  I added a log message that may help a bit.
 I'd like to track this down, even though it's minor.
 -- David
 
 On 19-May-2015 12:15 PM, Richard Haselgrove wrote:
  
  OK, the delay happened again, and I captured a procmon log. 
  Copy of the BOINC log attached (period of interest is 19:35:30 to 19:35:41): 
also a simple extract of ProcMon for the same period. It has to be said, 
boinc.exe was doing surprisingly little. 
  I have kept the full ~200 MB native ProcMon log, which can be re-filtered to 
look for anything else of interest, if you can suggest some likely targets. 
 
 
       On Monday, 18 May 2015, 20:57, David Anderson <[email protected]> 
wrote:
   
 
 
 That looks like what's needed.
 Richard, if you can repro the inter-job delay, you could try using Process 
Monitor to capture as much as possible from the client during that period.
 -- David
 
 On 18-May-2015 11:12 AM, Jacob Klein wrote:
 > Process Monitor can be used to "watch the things a process does" (you have
 > to set up correct filters, etc.)... but I'm not sure if that includes
 > sleeps. But if the process is waiting on a file or something, though, it
 > should be able to tell you.
 > Worth looking into.
 >
 > https://technet.microsoft.com/en-us/library/bb896645.aspx
 >
 > Regards,
 > Jacob
 >
 >
 >------------------------------------------------------------------------------------
 > Date: Mon, 18 May 2015 10:41:16 -0700
 > From: [email protected]
 > To: [email protected]; [email protected]; [email protected]
 > CC: [email protected]
 > Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories
 > without ensuring they're empty
 >
 > I looked at this and couldn't figure out the source of the 12-sec delay.
 > In general, delays could happen because
 > 1) the client does something that takes a long time (like copying a 5 GB file)
 > 2) the client sleeps (i.e. calls boinc_sleep()).
 >    It does this in a few situations,
 >    like backing off and retrying a file system operation.
 > But there's no indication that either of these is happening here.
 >
 > Does Windows have a way of logging the system calls that a process makes
 > (like strace on Unix)?
 > If so that might reveal what the client is doing during those 12 seconds.
 >
 > -- David
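The "back off and retry a file system operation" pattern mentioned above can be sketched as follows. This is a minimal illustration in Python, not BOINC's own code: `retry_with_backoff` and its parameters are hypothetical names, and the doubling delay is an assumption. It returns the total time spent sleeping, which is the kind of figure a log message would need in order to explain an otherwise silent pause.

```python
import time


def retry_with_backoff(operation, attempts=5, initial_delay=0.1, sleep=time.sleep):
    """Call operation(); on OSError, sleep and retry with a doubling delay.

    Returns the total number of seconds spent sleeping, so the caller
    can log how long the retries stalled it.
    """
    delay = initial_delay
    slept = 0.0
    for attempt in range(attempts):
        try:
            operation()
            return slept
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(delay)
            slept += delay
            delay *= 2  # assumed policy: double the wait each time
    return slept
```

A caller might wrap a deletion as `retry_with_backoff(lambda: os.remove(path))` and log the returned value whenever it is non-zero.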
 >
 > On 16-May-2015 8:01 AM, Richard Haselgrove wrote:
 >
 >    Here is the message log file for a GPUGrid task finish. The 12-second
 >    delay appears again between 14:26:35 and 14:26:47 - that's after the
 >    slot directory has been cleared, and the exiting task has changed state
 >    from 'running' to 'uploading'. Two new tasks have been assigned to the
 >    GPU, but their (small) startup files have not yet been linked to their
 >    respective slot directories.
 >
 >    I also attach directory listings for the slot and GPUGrid project folders
 >    at various stages of the cleanup: the slot held 34 files totalling
 >    44,186,727 bytes, which doesn't sound excessive: the largest file
 >    deletion (94,783,960 bytes) occurred several minutes later, when that
 >    file finished uploading.
 >
 >    I'll enable similar logging and watch what happens when the next GPUGrid
 >    task starts up, but from memory, the disruption to BOINC is less severe
 >    at startup.
 >
 >
 >
 >    On Tuesday, 12 May 2015, 23:29, David Anderson <[email protected]> wrote:
 >
 >
 >
 >        BTW: the client isn't completely single-threaded;
 >        it uses a separate thread to do CPU throttling.
 >        It would be feasible to also use separate threads
 >        for serving GUI RPC connections,
 >        which would allow the client to remain responsive even while
 >        e.g. copying thousands of files to a slot dir.
 >        -- David
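The idea of keeping the client responsive while a long copy runs can be illustrated with a small Python sketch. This is not how the BOINC client is structured; `copy_tree_in_background` and `run_main_loop_while` are hypothetical names, and the "main loop" here only counts how often it got a chance to run while the copy was in progress.

```python
import shutil
import threading
from pathlib import Path


def copy_tree_in_background(src: Path, dst: Path) -> threading.Thread:
    """Start copying src into dst on a worker thread and return the thread."""
    worker = threading.Thread(
        target=shutil.copytree,
        args=(src, dst),
        kwargs={"dirs_exist_ok": True},  # requires Python 3.8+
    )
    worker.start()
    return worker


def run_main_loop_while(worker: threading.Thread, poll_interval: float = 0.01) -> int:
    """Stand-in for a main loop: keep ticking until the copy has finished.

    Returns the number of ticks, i.e. how many times the loop was free
    to do other work (serve RPCs, send heartbeats) during the copy.
    """
    ticks = 0
    while worker.is_alive():
        ticks += 1
        worker.join(poll_interval)  # wait briefly, then loop again
    worker.join()
    return ticks
```

With the copy on its own thread, the loop keeps running regardless of how many files are being copied; a single-threaded design would instead block until the copy completed.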
 >
 >        On 12-May-2015 2:40 AM, Seke Rob wrote:
 >        > Reminds me of the Clean Energy Project, Phase 2 and why we have
 >        > app_config and <max_concurrent> and a default control of allowing 1
 >        > 'In Progress' on a host. This project sets up in slot copying near
 >        > 6700 files [symlinking proposed long ago as is done on several other
 >        > WCG projects for the static files]. If more than one CEP2 is started
 >        > the machine feels at times like a snail, responsiveness of the BOINC
 >        > manager is poor, many a time the less powerful systems incurring
 >        > error zero status exits or total fail. On an 8 core observed it could
 >        > take over an hour before actual computing commenced [CPU time
 >        > logged]. Boot cycle requires manually starting of tasks one by one.
 >        > Kevin Reed a few years ago raised a ticket for staggered starting,
 >        > where the models can reach several GB and bigger in the coming. At
 >        > any rate, as much as these 6700 files are copied, they also then are
 >        > needing of deletion at completion [physical or symlink references].
 >        > The effect of starting 1 CEP2 and finishing / packaging / zipping and
 >        > transmitting can easily lead to several minutes of there not being
 >        > any computing, just whirring, for minutes, just elapsed being logged.
 >        > The more run the more the issue compounds, with the effect of what
 >        > many incur, the exit zero status series, resetting to start or last
 >        > checkpoint with often hours of computing time lost.
 >        >
 >        > Maybe you'd like to get in touch with your confederates at WCG [Keith
 >        > Uplinger], to discuss the issue further as this is now nearing a 5
 >        > year continuous frustration [June 2010 launch, and a huge limitation
 >        > on the speed of progress on this project].
 >        >
 >        > --SekeRob.
 >        >
 >        > On 12-5-2015 1:55, David Anderson wrote:
 >        >> That delay looks like it's caused by deleting files or by process
 >        >> cleanup.
 >        >> Does GPUGrid make lots of (non-output) files in the slot dir?
 >        >>
 >        >> Please try to repro it with slot_debug, task_debug, and
 >        >> heartbeat_debug set (gui_rpc_debug not needed).
 >        >>
 >        >> -- David
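For readers who want to set the three log flags requested above, here is a small Python sketch that generates the corresponding configuration text. It assumes the usual cc_config.xml layout, with flags as child elements of `<log_flags>` set to 1; `build_cc_config` is a hypothetical helper, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text that turns on the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Assumed convention: a flag is enabled by setting its element to 1.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


print(build_cc_config(["slot_debug", "task_debug", "heartbeat_debug"]))
```

The resulting text would go into cc_config.xml in the BOINC data directory, after which the client needs to re-read its config files for the flags to take effect.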
 >        >>
 >        >> On 11-May-2015 10:54 AM, Richard Haselgrove wrote:
 >        >>> Here's another example of a case where BOINC finds that it can't
 >        >>> walk and chew gum at the same time. The event of interest is
 >        >>>
 >        >>> 11/05/2015 18:35:34 | GPUGRID | Computation for task
 >        >>> e10s9_e7s6f4-GERARD_FXCXCL12_LIG_6282622-0-1-RND7898_0 finished
 >        >>>
 >        >>> Following that, there's a 12-second interval where neither
 >        >>> heartbeats nor GUI RPC traffic was logged: during that time, the
 >        >>> Task tab of the Manager was unchanging, not showing the regular
 >        >>> update of elapsed time for running tasks.
 >        >>>
 >        >>> async_file_debug was active at the time, but found no events to
 >        >>> log.
 >        >>>
 >        >>> These particular GPUGrid tasks generate around 90 MB of upload
 >        >>> files, but I think they are generated directly in the project
 >        >>> folder and don't need to be copied anywhere.
 >        >>>
 >        >>> Main log as attached file only.
 >        >>>
 >        >>> I'll catch a CMS-dev log later this evening, but after that, I'll
 >        >>> be away for a few days and I'll have to leave the bug-chase until
 >        >>> the weekend.
 >        >>>
 >        >>>
 >        >>>
 >        >>>
 >        >>> On Monday, 11 May 2015, 9:42, Jacob Klein <[email protected]> wrote:
 >        >>>
 >        >>>
 >        >>>
 >        >>>    I have seen this problem before, where the UI becomes
 >        >>>    unresponsive. If I recall, it happens when a T4T task is being
 >        >>>    set up (ie: after everything was downloaded). For me, I don't
 >        >>>    recall the problem ever "screwing over other tasks", though.
 >        >>>
 >        >>>    Try this to reproduce it: Attach to T4T, and get a task. It may
 >        >>>    take a while to do that download, so you can "step away" for a
 >        >>>    bit. Then, once that task is going, abort it. Downloading the
 >        >>>    2nd task should be instantaneous (nothing really to download),
 >        >>>    but instantiation of that 2nd task should cause the UI to hang
 >        >>>    (showing the "Please wait" messagebox in the manager).
 >        >>>
 >        >>>    Does that help?
 >        >>>    > Date: Sun, 10 May 2015 23:19:24 -0700
 >        >>>    > From: [email protected]
 >        >>>    > To: [email protected]; [email protected]
 >        >>>    > CC: [email protected]
 >        >>>    > Subject: Re: [boinc_alpha] BOINC re-using slot directories
 >        >>>    > without ensuring they're empty
 >        >>>    >
 >        >>>    > I did some initial testing and couldn't repro this;
 >        >>>    > the client remains responsive while copying a 5 GB file to a
 >        >>>    > slot dir.
 >        >>>    > Does anyone else see this behavior?
 >        >>>    >
 >        >>>    > While testing this, please set "async_file_debug" log flag.
 >        >>>    > This says when asynchronous file operations start and end.
 >        >>>    >
 >        >>>    > -- David
 >        >>>    >
 >        >>>    > On 10-May-2015 12:31 PM, Richard Haselgrove wrote:
 >        >>>    > > One thing that may need attention if very large files become
 >        >>>    > > the norm is the single-threaded nature of some parts of the
 >        >>>    > > core client. My 1-hour CMS test has just finished, and a new
 >        >>>    > > 24-hour test started.
 >        >>>    > >
 >        >>>    > >
 >        >>>    > > I watched this happening, and part of the process is copying a
 >        >>>    > > 1.33 GB initial .vmi image file (downloaded previously by
 >        >>>    > > BOINC from CERN) from the project directory to the slot
 >        >>>    > > directory. This took about 90 seconds: during that time, all
 >        >>>    > > Manager updating stopped. I'm sure it's the copying process
 >        >>>    > > which inhibited updates: I was watching the slot directory,
 >        >>>    > > and the .vmi image file had appeared, but other essential
 >        >>>    > > startup files hadn't.
 >        >>>    > >
 >        >>>    > >
 >        >>>    > > When BOINC regained its ability to communicate, three running
 >        >>>    > > tasks had exited with the dreaded (and false) 'you may need to
 >        >>>    > > reset the project' advice. Inline log follows: because my last
 >        >>>    > > log got mangled by my ISP's new mail interface, I'll attach it
 >        >>>    > > as a text file as well.
 >        >>>    > >
 >        >>>    > >
 >        >>>    > > 10/05/2015 20:12:56 | LHC@home 1.0 | Computation for task sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1 finished
 >        >>>    > > 10/05/2015 20:12:56 | CMS-dev | Starting task CMS_31107_1427806626.783437_0
 >        >>>    > > 10/05/2015 20:12:56 | CMS-dev | [cpu_sched] Starting task CMS_31107_1427806626.783437_0 using CMS version 4615 (vbox64) in slot 7
 >        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | Task hadam3p_anz_e3g7_2013_1_009760406_0 exited with zero status but no 'finished' file
 >        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | If this happens repeatedly you may need to reset the project.
 >        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | Task wu_sf3_DS-10x271_Grp503196of682667_0 exited with zero status but no 'finished' file
 >        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | If this happens repeatedly you may need to reset the project.
 >        >>>    > > 10/05/2015 20:14:25 | SETI@home | Task 05jl12ab.3911.10292.438086664199.12.207_1 exited with zero status but no 'finished' file
 >        >>>    > > 10/05/2015 20:14:25 | SETI@home | If this happens repeatedly you may need to reset the project.
 >        >>>    > > 10/05/2015 20:14:25 | climateprediction.net | [cpu_sched] Restarting task hadam3p_anz_e3g7_2013_1_009760406_0 using hadam3p_anz version 610 in slot 5
 >        >>>    > > 10/05/2015 20:14:25 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp503196of682667_0 using GetDecics version 200 in slot 0
 >        >>>    > > 10/05/2015 20:14:25 | SETI@home | [cpu_sched] Restarting task 05jl12ab.3911.10292.438086664199.12.207_1 using setiathome_v7 version 700 (cuda42) in slot 2
 >        >>>    > > 10/05/2015 20:14:27 | LHC@home 1.0 | Started upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
 >        >>>    > > 10/05/2015 20:14:30 | LHC@home 1.0 | Finished upload of sd_FCChh_bs25_beta30_xing120_int1.0_emit2.0_tunex117.216_tuney118.226_6D_V4__1__s__118.31_117.32__4.1_4.2__6__20_1_sixvf_boinc701_1_0
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > > On Sunday, 10 May 2015, 19:59, Seke Rob <[email protected]> wrote:
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >    Excellent this is all fixed and tested. Interest is/was that
 >        >>>    > >    WCG's Clean Energy at some point in time was to run very
 >        >>>    > >    large models, talk of 4-8GB IIRC.
 >        >>>    > >
 >        >>>    > >    --SekeRob
 >        >>>    > >
 >        >>>    > >    On May 10, 2015 20:27, Richard Haselgrove <[email protected]> wrote:
 >        >>>    > >    CMS only has stock applications configured for delivery to
 >        >>>    > >    64-bit platforms.
 >        >>>    > >    I've made an anonymous platform configuration using the
 >        >>>    > >    32-bit VBox Windows wrapper: it has downloaded and is running
 >        >>>    > >    its first 1-hour task. If that completes successfully (it
 >        >>>    > >    seems to have reached the fully-operational stage), I'll try
 >        >>>    > >    a full 24-hour task, which under current operational
 >        >>>    > >    circumstances should generate a >4 GB file locally.
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >        On Sunday, 10 May 2015, 18:28, David Anderson <[email protected]> wrote:
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >    NTFS handles > 4GB files, even if the hardware and/or OS is
 >        >>>    > >    only 32-bit.
 >        >>>    > >    32-bit versions of Windows have APIs (like _stat64()) for
 >        >>>    > >    handling > 4GB files.
 >        >>>    > >    BOINC needs to use these; we fixed one place where it wasn't.
 >        >>>    > >
 >        >>>    > >    On Unix (Linux and Mac), BOINC uses the regular APIs (like
 >        >>>    > >    lseek()) but is built with a -D_FILE_OFFSET_BITS=64 flag
 >        >>>    > >    that causes these functions to use 64-bit sizes.
 >        >>>    > >    However, it's possible that BOINC has bugs involving > 4GB
 >        >>>    > >    files on Unix too.
 >        >>>    > >    If anyone has a 32-bit Linux system, please test with the
 >        >>>    > >    CMS project.
 >        >>>    > >
 >        >>>    > >    -- David
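To see why the 64-bit variants matter, the following Python sketch simulates what happens when a file size is squeezed into a 32-bit signed field, as an API without 64-bit offsets might do. It is only an arithmetic illustration: `as_signed_32` and `fits_in_signed` are hypothetical helpers and do not call `_stat64()` or `lseek()`.

```python
GIB = 1024 ** 3


def as_signed_32(value: int) -> int:
    """Truncate value to 32 bits and reinterpret it as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 2 ** 32 if value >= 2 ** 31 else value


def fits_in_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed integer of the given width."""
    return -(2 ** (bits - 1)) <= value < 2 ** (bits - 1)


for size in (1 * GIB, 3 * GIB, 5 * GIB):
    print(size, "->", as_signed_32(size), "| fits in 64 bits:", fits_in_signed(size, 64))
```

A 5 GiB size truncated this way reads back as 1 GiB, and a 3 GiB size comes out negative, so code that relies on a 32-bit size or offset can silently misjudge large files rather than fail outright.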
 >        >>>    > >
 >        >>>    > >    On 10-May-2015 3:58 AM, --SekeRob wrote:
 >        >>>    > >    >
 >        >>>    > >    > Just wondering, with files over 4GB and a 64 bit lib
 >        >>>    > >    > introduced, is it not a CMS project requirement to run on
 >        >>>    > >    > a 64 bit OS?
 >        >>>    > >    >
 >        >>>    > >    >
 >        >>>    > >
 >        >>>    > > _______________________________________________
 >        >>>    > >    boinc_alpha mailing list
 >        >>>    > >    [email protected]
 >        >>>    > >    http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha
 >        >>>    > >    To unsubscribe, visit the above URL and
 >        >>>    > >    (near bottom of page) enter your email address.
 >        >>>
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    > >
 >        >>>    >
 >        >>>
 >        >>>
 >        >>>
 >        >>
 >        >
 >        >
 >        >
 >        >
 >       
 >        _______________________________________________
 >        boinc_dev mailing list
 >        [email protected]
 >        http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
 >        To unsubscribe, visit the above URL and
 >        (near bottom of page) enter your email address.
 >
 >
 >
 
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
