Re: [hwloc-users] possible concurrency issue with reading /proc data on Linux

Brice Goglin Sat, 21 Apr 2012 17:26:28 -0400

On 21/04/2012 23:08, Vlad wrote:

Greetings,
I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible concurrency issue not covered by the "Thread Safety" guidelines:
- I start a small number (4) of threads, each of which does some work and periodically executes hwloc_get_last_cpu_location() with HWLOC_CPUBIND_PROCESS
- occasionally, one or two of those threads will see the call fail with ENOSYS (even though the same call has already executed successfully a number of times)
These errors are transient and seem to occur only when some of the threads in the group are terminating. I've skimmed through the implementation in topology-linux.c and it seems plausible to me that the errors could be caused by failure to read /proc state "atomically" in the presence of concurrent thread starts/exits.
Of course, the latter is hard (impossible?) to do because the state always changes and a snapshot can only be obtained with a single read() (which in turn would require knowing how many thread entries to expect in advance). However, returning ENOSYS in such cases does not seem intended but rather a flaw in the retry logic. Similar issues may be present with other API methods that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().

Can you try the attached patch? It doesn't abort the loop immediately on per-tid errors anymore. This may work better when threads disappear. I don't remember if the retry logic was written while thinking about adding threads only or about adding and removing threads.


If the patch doesn't help, can you send your code to help debug things?
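
As a rough illustration of the behaviour the patch aims for (a standalone sketch, not the hwloc code itself): per-tid failures are counted instead of aborting the loop, and the call fails only if every tid failed. The names foreach_tid_tolerant, tid_cb_t and demo_cb are hypothetical and exist only for this example; treating an empty tid list as success is also a choice of the sketch, not something the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*tid_cb_t)(pid_t tid, void *data);

/* Apply cb to every tid. A failure on one tid (for instance because the
 * thread exited after the list was read) is counted, not fatal.
 * Returns 0 if at least one call succeeded or the list is empty,
 * -1 if every call failed, with errno taken from the last failure. */
static int foreach_tid_tolerant(const pid_t *tids, unsigned nr,
                                tid_cb_t cb, void *data)
{
    unsigned i, failed = 0;
    int last_errno = 0;

    assert(cb != NULL);
    for (i = 0; i < nr; i++) {
        if (cb(tids[i], data) < 0) {
            failed++;
            last_errno = errno;
        }
    }
    if (nr > 0 && failed == nr) {
        errno = last_errno;
        return -1;
    }
    return 0;
}

/* Demo callback (assumption for the example): the tid pointed to by
 * data is treated as a thread that has already exited. */
static int demo_cb(pid_t tid, void *data)
{
    pid_t gone = *(pid_t *)data;

    if (tid == gone) {
        errno = ESRCH;
        return -1;
    }
    return 0;
}
```

With tids {10, 11, 12} and 11 marked as gone, the call returns 0 because two of the three callbacks succeed. With a list containing only 11, it returns -1 and errno is ESRCH.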

An alternative explanation could be that the retry logic is correct but the implementation relies on readdir(), which is documented to not be thread-safe: http://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html

I don't think this can happen. Your threads should not be accessing the same DIR stream here.
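
To make the point concrete, here is a minimal sketch (not taken from hwloc) of the pattern that keeps readdir() safe: each caller opens its own DIR stream with opendir() and never shares it with another thread. As far as I know, glibc only has a problem when several threads call readdir() on the same stream. The helper name count_dir_entries is hypothetical.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Count the entries of path, skipping "." and "..".
 * Returns -1 if opendir() fails (errno is set by opendir()).
 * The DIR stream is local to this call, so concurrent callers in
 * different threads each iterate over their own stream. */
static int count_dir_entries(const char *path)
{
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(path != NULL);
    dir = opendir(path);
    if (!dir)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        count++;
    }
    closedir(dir);
    return count;
}
```

On Linux, calling count_dir_entries("/proc/self/task") from several threads at once gives each thread a private stream. The counts may still differ between calls if threads start or exit in between, which is the race discussed above rather than a readdir() problem.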


Thanks
Brice

diff --git a/src/topology-linux.c b/src/topology-linux.c
index e1f46cb..99a6381 100644
--- a/src/topology-linux.c
+++ b/src/topology-linux.c
@@ -475,7 +475,7 @@ hwloc_linux_foreach_proc_tid(hwloc_topology_t topology,
   char taskdir_path[128];
   DIR *taskdir;
   pid_t *tids, *newtids;
-  unsigned i, nr, newnr;
+  unsigned i, nr, newnr, failed;
   int err;

   if (pid)
@@ -497,11 +497,17 @@ hwloc_linux_foreach_proc_tid(hwloc_topology_t topology,

  retry:
   /* apply the callback to all threads */
+  failed=0;
   for(i=0; i<nr; i++) {
     err = cb(topology, tids[i], data, i);
     if (err < 0)
-      goto out_with_tids;
+      failed++;
   }
+  /* some may fail (if threads disappear), but some should succeed.
+   * if all failed, abort with the last errno.
+   */
+  if (failed == nr)
+    goto out_with_tids;

   /* re-read the list of thread and retry if it changed in the meantime */
   err = hwloc_linux_get_proc_tids(taskdir, &newnr, &newtids);
