User Keith Myers (UID 147145 at http://milkyway.cs.rpi.edu/milkyway/index.php)
has asked for my help in identifying task failures at Milkyway.
At my suggestion, he installed Windows client v7.6.2, and the attached message
log extracts show the enhanced <slot_debug> output that helped identify the
CMS-dev problem.
In both cases, the task under scrutiny
(1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
(2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
was declared 'Validate error', and the <stderr_txt> section is empty. In the
special case of Milkyway@Home, these two observations are linked, because the
science result is returned in stderr, not in a separate upload file.
Also in both cases, the <slot_debug> log contains
[slot] failed to remove file slots/x/stderr.txt: unlink() failed
between 'handle_exited_app()' and 'Computation for task ... finished'.
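
For anyone wanting to check their own message logs for the same signature, here
is a minimal sketch in Python. It assumes the log has been saved with one entry
per line (not wrapped as in this email), and the function name and regular
expressions are my own illustration, not anything taken from the BOINC source.

```python
import re

# Patterns for the two [slot] messages of interest (illustrative only).
CLEAN_RE = re.compile(r"\[slot\] cleaning out (slots/\d+): (\w+)\(\)")
FAIL_RE = re.compile(r"\[slot\] failed to remove file (slots/\d+)/stderr\.txt")


def find_stderr_unlink_failures(lines):
    """Return slot dirs whose stderr.txt could not be removed while
    handle_exited_app() was cleaning out that slot."""
    failures = []
    current = None  # (slot_dir, caller) from the latest "cleaning out" entry
    for line in lines:
        m = CLEAN_RE.search(line)
        if m:
            current = (m.group(1), m.group(2))
            continue
        m = FAIL_RE.search(line)
        if m and current == (m.group(1), "handle_exited_app"):
            failures.append(m.group(1))
    return failures
```

Failures logged during get_free_slot() are deliberately ignored here, since only
the handle_exited_app() failure coincides with the empty <stderr_txt>.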
It appears that there is a race condition, whereby BOINC tries (and fails) to
delete stderr.txt before the operating system has released the write lock. This
(I'm presuming) also explains why the file appears empty when read off the disk
for incorporation into the client_state structure in memory, prior to reporting
the completed task to the project.
In order to preserve the scientific result at Milkyway (and debug and other
useful information at other projects), the client should not initiate
'handle_exited_app()' until it has confirmed that the write lock on stderr.txt
has been released.
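
To make the suggestion concrete, here is a rough sketch of the "wait until
released" idea in Python. It is not BOINC code: the client is written in C++,
and wait_for_release(), the is_locked probe, and the timeout values are all
hypothetical names and numbers chosen for illustration. How the client would
actually test the lock on Windows is left open.

```python
import time


def wait_for_release(is_locked, timeout=2.0, poll=0.05,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_locked() until it returns False or `timeout` seconds pass.

    Returns True if the file was released, False if we gave up.
    `is_locked` is a caller-supplied probe (hypothetical), e.g. one that
    tries to open stderr.txt and reports whether the OS refused.
    """
    deadline = clock() + timeout
    while is_locked():
        if clock() >= deadline:
            return False
        sleep(poll)
    return True
```

The bounded timeout is a design choice: the client should not hang forever if
the lock is never released, so the caller still has to decide what to do when
the function returns False.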
Log 1 also shows that the additional safeguards on cleaning out slots are
working properly: if both handle_exited_app() and get_free_slot() fail to
delete the file, the next task isn't started in the not-empty slot (11), but in
slot 14 instead. And when slot 11 is tested again at the next get_free_slot(),
the delete succeeds and the now-empty slot is reused.
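
The fallback behaviour described above can be pictured roughly as follows. This
is only a sketch of what the log suggests, written in Python with hypothetical
names (in_use, clean_out, max_slots); it is not the client's actual
get_free_slot() implementation.

```python
def get_free_slot(in_use, clean_out, max_slots=64):
    """Return the lowest-numbered slot that is idle and can be emptied.

    in_use:    set of slot numbers currently running a task
    clean_out: callable(slot) -> True if the slot directory is now empty
    Returns None if no slot qualifies.
    """
    for slot in range(max_slots):
        if slot in in_use:
            continue
        if clean_out(slot):
            return slot
        # Slot could not be emptied (e.g. stderr.txt still locked):
        # skip it for now and try the next one.
    return None
```

Because a slot that fails to empty is only skipped, not retired, it is tested
again on the next call, which matches slot 11 being reused once its stderr.txt
could finally be deleted.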

Log 1:
7/8/2015 3:55:15 PM | Milkyway@Home | [slot] assigning slot 11 to
de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0
7/8/2015 3:55:15 PM | | [slot] removed file slots/11/init_data.xml
7/8/2015 3:55:15 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
to
slots/11/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
7/8/2015 3:55:15 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/parameters-15-3s-sim-fast.txt to
slots/11/astronomy_parameters.txt
7/8/2015 3:55:15 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/stars-15-sim-1Jun1.txt to
slots/11/stars.txt
7/8/2015 3:55:15 PM | | [slot] removed file slots/11/boinc_temporary_exit
7/8/2015 3:55:15 PM | Milkyway@Home | Starting task
de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0
7/8/2015 3:55:15 PM | Milkyway@Home | [cpu_sched] Starting task
de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0 using
milkyway_separation__modified_fit version 136 (opencl_nvidia_101) in slot 11
7/8/2015 3:55:16 PM | Milkyway@Home | Sending scheduler request: To fetch work.
7/8/2015 3:55:16 PM | Milkyway@Home | Reporting 1 completed tasks
7/8/2015 3:55:16 PM | Milkyway@Home | Requesting new tasks for NVIDIA GPU
7/8/2015 3:55:18 PM | Milkyway@Home | Scheduler request completed: got 1 new
tasks
7/8/2015 3:55:18 PM | | [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_10653037_0_0
7/8/2015 3:55:18 PM | | [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_10653037_0_0.gz
7/8/2015 3:55:18 PM | | [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_10653037_0_0.gzt
7/8/2015 3:55:26 PM | | [slot] cleaning out slots/2: handle_exited_app()
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/boinc_finish_called
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/boinc_task_state.xml
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/cudart32_50_35.dll
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/cufft32_50_35.dll
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/init_data.xml
7/8/2015 3:55:26 PM | | [slot] removed file
slots/2/Lunatics_x41zc_win32_cuda50.exe
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/mbcuda.cfg
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/result.sah
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/state.sah
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/stderr.txt
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/work_unit.sah
7/8/2015 3:55:26 PM | | [slot] cleaning out slots/2: get_free_slot()
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/init_data.xml
7/8/2015 3:55:26 PM | | [slot] removed file slots/2/boinc_temporary_exit
7/8/2015 3:55:31 PM | | [slot] removed file
projects/setiathome.berkeley.edu/30ja15ab.5711.328466.438086664199.12.107_1_0
7/8/2015 3:55:31 PM | | [slot] removed file
projects/setiathome.berkeley.edu/30ja15ab.5711.328466.438086664199.12.107_1_0.gz
7/8/2015 3:55:31 PM | | [slot] removed file
projects/setiathome.berkeley.edu/30ja15ab.5711.328466.438086664199.12.107_1_0.gzt
7/8/2015 3:56:05 PM | | [slot] cleaning out slots/11: handle_exited_app()
7/8/2015 3:56:05 PM | | [slot] removed file slots/11/astronomy_parameters.txt
7/8/2015 3:56:05 PM | | [slot] removed file slots/11/boinc_finish_called
7/8/2015 3:56:05 PM | | [slot] removed file slots/11/init_data.xml
7/8/2015 3:56:05 PM | | [slot] removed file
slots/11/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
7/8/2015 3:56:05 PM | | [slot] removed file slots/11/separation_checkpoint
7/8/2015 3:56:05 PM | | [slot] removed file slots/11/stars.txt
7/8/2015 3:56:05 PM | | [slot] failed to remove file slots/11/stderr.txt:
unlink() failed
7/8/2015 3:56:05 PM | Milkyway@Home | Computation for task
de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0 finished
7/8/2015 3:56:05 PM | | [slot] cleaning out slots/11: get_free_slot()
7/8/2015 3:56:05 PM | | [slot] failed to remove file slots/11/stderr.txt:
unlink() failed
7/8/2015 3:56:05 PM | Milkyway@Home | [slot] failed to clean out dir: unlink()
failed
7/8/2015 3:56:05 PM | Milkyway@Home | [slot] assigning slot 14 to
de_fast_15_3s_136_sim1Jun1_1_1434554402_7777240_0
7/8/2015 3:56:05 PM | | [slot] removed file slots/14/init_data.xml
7/8/2015 3:56:05 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
to
slots/14/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
7/8/2015 3:56:05 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/parameters-15-3s-sim-fast.txt to
slots/14/astronomy_parameters.txt
7/8/2015 3:56:05 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/stars-15-sim-1Jun1.txt to
slots/14/stars.txt
7/8/2015 3:56:05 PM | | [slot] removed file slots/14/boinc_temporary_exit
7/8/2015 3:56:05 PM | Milkyway@Home | Starting task
de_fast_15_3s_136_sim1Jun1_1_1434554402_7777240_0
7/8/2015 3:56:05 PM | Milkyway@Home | [cpu_sched] Starting task
de_fast_15_3s_136_sim1Jun1_1_1434554402_7777240_0 using
milkyway_separation__modified_fit version 136 (opencl_nvidia_101) in slot 14
7/8/2015 3:56:23 PM | Milkyway@Home | Sending scheduler request: To fetch work.
7/8/2015 3:56:23 PM | Milkyway@Home | Reporting 1 completed tasks
7/8/2015 3:56:23 PM | Milkyway@Home | Requesting new tasks for NVIDIA GPU
7/8/2015 3:56:25 PM | Milkyway@Home | Scheduler request completed: got 1 new
tasks
7/8/2015 3:56:25 PM | | [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_modfit_fast_15_3s_136_sim1Jun1_2_1434554402_7852751_0_0
7/8/2015 3:56:25 PM | | [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_modfit_fast_15_3s_136_sim1Jun1_2_1434554402_7852751_0_0.gz
7/8/2015 3:56:25 PM | | [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_modfit_fast_15_3s_136_sim1Jun1_2_1434554402_7852751_0_0.gzt
7/8/2015 3:56:55 PM | | [slot] cleaning out slots/14: handle_exited_app()
7/8/2015 3:56:55 PM | | [slot] removed file slots/14/astronomy_parameters.txt
7/8/2015 3:56:55 PM | | [slot] removed file slots/14/boinc_finish_called
7/8/2015 3:56:55 PM | | [slot] removed file slots/14/init_data.xml
7/8/2015 3:56:55 PM | | [slot] removed file
slots/14/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
7/8/2015 3:56:55 PM | | [slot] removed file slots/14/separation_checkpoint
7/8/2015 3:56:55 PM | | [slot] removed file slots/14/stars.txt
7/8/2015 3:56:55 PM | | [slot] removed file slots/14/stderr.txt
7/8/2015 3:56:55 PM | Milkyway@Home | Computation for task
de_fast_15_3s_136_sim1Jun1_1_1434554402_7777240_0 finished
7/8/2015 3:56:55 PM | | [slot] cleaning out slots/11: get_free_slot()
7/8/2015 3:56:55 PM | | [slot] removed file slots/11/stderr.txt
7/8/2015 3:56:55 PM | Milkyway@Home | [slot] assigning slot 11 to
de_80_DR8_Rev_8_5_00004_1434551187_10549411_1
7/8/2015 3:56:55 PM | | [slot] removed file slots/11/init_data.xml
7/8/2015 3:56:55 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe
to slots/11/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe
7/8/2015 3:56:55 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/80_rev_8_5.prmtrs to
slots/11/astronomy_parameters.txt
7/8/2015 3:56:55 PM | Milkyway@Home | [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/80_Rev_8_3.stars to
slots/11/stars.txt
7/8/2015 3:56:55 PM | | [slot] removed file slots/11/boinc_temporary_exit
7/8/2015 3:56:55 PM | Milkyway@Home | Starting task
de_80_DR8_Rev_8_5_00004_1434551187_10549411_1
7/8/2015 3:56:55 PM | Milkyway@Home | [cpu_sched] Starting task
de_80_DR8_Rev_8_5_00004_1434551187_10549411_1 using milkyway version 102
(opencl_nvidia) in slot 11

Log 2:
9964 Milkyway@Home 7/8/2015 5:40:58 PM [slot] assigning slot 5 to
ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0
9965 7/8/2015 5:40:58 PM [slot] removed file
slots/5/init_data.xml
9966 Milkyway@Home 7/8/2015 5:40:58 PM [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
to
slots/5/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
9967 Milkyway@Home 7/8/2015 5:40:58 PM [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/parameters-15-3s-sim-fast.txt to
slots/5/astronomy_parameters.txt
9968 Milkyway@Home 7/8/2015 5:40:58 PM [slot] linked
../../projects/milkyway.cs.rpi.edu_milkyway/stars-15-sim-1Jun1.txt to
slots/5/stars.txt
9969 7/8/2015 5:40:58 PM [slot] removed file
slots/5/boinc_temporary_exit
9970 Milkyway@Home 7/8/2015 5:40:58 PM Starting task
ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0
9971 Milkyway@Home 7/8/2015 5:40:58 PM [cpu_sched] Starting task
ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0 using
milkyway_separation__modified_fit version 136 (opencl_nvidia_101) in slot 5
9972 Milkyway@Home 7/8/2015 5:41:16 PM Sending scheduler request: To
fetch work.
9973 Milkyway@Home 7/8/2015 5:41:16 PM Reporting 1 completed tasks
9974 Milkyway@Home 7/8/2015 5:41:16 PM Requesting new tasks for NVIDIA
GPU
9975 Milkyway@Home 7/8/2015 5:41:18 PM Scheduler request completed:
got 1 new tasks
9976 7/8/2015 5:41:18 PM [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_10687133_0_0
9977 7/8/2015 5:41:18 PM [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_10687133_0_0.gz
9978 7/8/2015 5:41:18 PM [slot] removed file
projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_10687133_0_0.gzt
9979 7/8/2015 5:41:31 PM [slot] cleaning out slots/6:
handle_exited_app()
9980 7/8/2015 5:41:31 PM [slot] removed file
slots/6/AKv8c_r2549_winx86-64_SSE42xjfs.exe
9981 7/8/2015 5:41:31 PM [slot] removed file
slots/6/boinc_finish_called
9982 7/8/2015 5:41:31 PM [slot] removed file
slots/6/boinc_task_state.xml
9983 7/8/2015 5:41:31 PM [slot] removed file
slots/6/init_data.xml
9984 7/8/2015 5:41:31 PM [slot] removed file
slots/6/libfftw3f-3-3-4_x64.dll
9985 7/8/2015 5:41:31 PM [slot] removed file
slots/6/mb_cmdline.txt
9986 7/8/2015 5:41:31 PM [slot] removed file
slots/6/result.sah
9987 7/8/2015 5:41:31 PM [slot] removed file
slots/6/state.sah
9988 7/8/2015 5:41:31 PM [slot] removed file
slots/6/stderr.txt
9989 7/8/2015 5:41:31 PM [slot] removed file
slots/6/work_unit.sah
9991 7/8/2015 5:41:31 PM [slot] cleaning out slots/6:
get_free_slot()
9993 7/8/2015 5:41:31 PM [slot] removed file
slots/6/init_data.xml
9999 7/8/2015 5:41:31 PM [slot] removed file
slots/6/boinc_temporary_exit
10004 7/8/2015 5:41:36 PM [slot] removed file
projects/setiathome.berkeley.edu/07ja15aa.31640.1906454.438086664205.12.59.vlar_0_0
10005 7/8/2015 5:41:36 PM [slot] removed file
projects/setiathome.berkeley.edu/07ja15aa.31640.1906454.438086664205.12.59.vlar_0_0.gz
10006 7/8/2015 5:41:36 PM [slot] removed file
projects/setiathome.berkeley.edu/07ja15aa.31640.1906454.438086664205.12.59.vlar_0_0.gzt
10007 7/8/2015 5:41:44 PM [slot] cleaning out slots/3:
handle_exited_app()
10008 7/8/2015 5:41:44 PM [slot] removed file
slots/3/boinc_finish_called
10009 7/8/2015 5:41:44 PM [slot] removed file
slots/3/boinc_task_state.xml
10010 7/8/2015 5:41:44 PM [slot] removed file
slots/3/cudart32_50_35.dll
10011 7/8/2015 5:41:44 PM [slot] removed file
slots/3/cufft32_50_35.dll
10012 7/8/2015 5:41:44 PM [slot] removed file
slots/3/init_data.xml
10013 7/8/2015 5:41:44 PM [slot] removed file
slots/3/Lunatics_x41zc_win32_cuda50.exe
10014 7/8/2015 5:41:44 PM [slot] removed file
slots/3/mbcuda.cfg
10015 7/8/2015 5:41:44 PM [slot] removed file
slots/3/result.sah
10016 7/8/2015 5:41:44 PM [slot] removed file
slots/3/state.sah
10017 7/8/2015 5:41:44 PM [slot] removed file
slots/3/stderr.txt
10018 7/8/2015 5:41:44 PM [slot] removed file
slots/3/work_unit.sah
10020 7/8/2015 5:41:44 PM [slot] cleaning out slots/3:
get_free_slot()
10022 7/8/2015 5:41:44 PM [slot] removed file
slots/3/init_data.xml
10029 7/8/2015 5:41:44 PM [slot] removed file
slots/3/boinc_temporary_exit
10033 7/8/2015 5:41:48 PM [slot] cleaning out slots/5:
handle_exited_app()
10034 7/8/2015 5:41:48 PM [slot] removed file
slots/5/astronomy_parameters.txt
10035 7/8/2015 5:41:48 PM [slot] removed file
slots/5/boinc_finish_called
10036 7/8/2015 5:41:48 PM [slot] removed file
slots/5/init_data.xml
10037 7/8/2015 5:41:48 PM [slot] removed file
slots/5/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
10038 7/8/2015 5:41:48 PM [slot] removed file
slots/5/separation_checkpoint
10039 7/8/2015 5:41:48 PM [slot] removed file
slots/5/stars.txt
10040 7/8/2015 5:41:48 PM [slot] failed to remove file
slots/5/stderr.txt: unlink() failed
10042 7/8/2015 5:41:48 PM [slot] removed file
projects/setiathome.berkeley.edu/30no14ab.7228.271858.438086664199.12.232_1_0
10043 7/8/2015 5:41:48 PM [slot] removed file
projects/setiathome.berkeley.edu/30no14ab.7228.271858.438086664199.12.232_1_0.gz
10044 7/8/2015 5:41:48 PM [slot] removed file
projects/setiathome.berkeley.edu/30no14ab.7228.271858.438086664199.12.232_1_0.gzt
10045 Milkyway@Home 7/8/2015 5:41:48 PM Computation for task
ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0 finished
10046 7/8/2015 5:41:48 PM [slot] cleaning out slots/5:
get_free_slot()
10047 7/8/2015 5:41:48 PM [slot] removed file
slots/5/stderr.txt
10048 Milkyway@Home 7/8/2015 5:41:48 PM [slot] assigning slot 5 to
ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_7816077_0
10049 7/8/2015 5:41:48 PM [slot] removed file
slots/5/init_data.xml