Hi Jini,

I have a comment for your webrev.02 .


You added process_doesnt_exist() to check process / thread liveness from 
/proc/<PID>, but it is not enough.
Information of threads (LWP) will be stored in /proc/<PID>/task/<LWPID>.
So you should check /proc/<PID>/task/status for threads.


Thanks,

Yasumasa


On 2018/12/12 21:15, Jini George wrote:
Thank you very much for looking into this, JC!

I have a revised webrev addressing your comments at:

http://cr.openjdk.java.net/~jgeorge/8202884/webrev.02/index.html

Requesting one more review for this. My comments inline:

On 12/12/2018 2:53 AM, JC Beyler wrote:
Hi Jini,

I saw a few nits:
http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.h.udiff.html
  ? -> The comments are in the third person normally it seems so it would
become (I also removed the s from threads):

+// deletes a thread from the thread list
Done.

http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c.udiff.html
  ? -> You added two empty lines it seems that could be removed
Done.


http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c.udiff.html
  ? -> Is there a real reason to have both enums? We could have a single
enum it seems and not lose too much

You are right. I have done away with the WAITPID* enum.

  ? -> you have a switch "
  ? ? ? ?switch (errno) {"
  ? ? ? ? -> Where really you could simplify the reading by moving the
EINTR case outside with its continue
  ? ? ? ? -> The switch could then remain as it was (though you move
print_debug to print_error)
  ? ? ? ? -> and just return in each cases
I have changed this to:

206     } else {
207       switch (errno) {
208         case EINTR:
209           continue;
210           break;
211         case ECHILD:
212           print_debug("waitpid() failed. Child process pid (%d) does
not exist \n", pid);
213           return ATTACH_THREAD_DEAD;
214         case EINVAL:
215           print_error("waitpid() failed. Invalid options argument.\n");
216           return ATTACH_FAIL;
217         default:
218           print_error("waitpid() failed. Unexpected error %d\n", errno);
219           return ATTACH_FAIL;
220       }
221     } // else



  ? ?->?if (strncmp (buf, "State:", 6) == 0) {
  ? ? ? -> You use sizeof("State:") right below; perhaps you could just
use "? const char const state[] = "State:";" and use sizeof(state) and
for the string, it seems less error prone

  ? -> A minor "bug" is here:
+? ? ? state = buf + sizeof ("State:");
  ? ? ? ? -> You did a strncmp above but that only assures the start of
the string is "State:", technically the character after the ':' is the
but it could only be that; sizeof("State:") is 7 and not 6. So you miss
one character when you are skipping spaces
  ? ? ? ? -> It was probably ok because you always had at least one
space, ie "State: "

Thanks! I have made some changes here to use a const char string and a
variable to store the calculated length using strlen(). And I am using
isspace() now to skip spaces since tabs could also be used as a delimiter.

  ? -> Extra space here before the '(': "sizeof (buf)"
Done.

Finally your return sequence for that method could be simplified to:

+? if (!found_state) {
+? ? print_error(" Could not find the State: string in the status file
for pid %d\n", pid);
+? }
+? fclose (fp);
+? return !found_state;

I have modified this to:

257   if (!found_state) {
258     // Assuming the thread exists.
259     print_error("Could not find the 'State:' string in the
/proc/%d/status file\n", pid);
260   }
261   fclose (fp);
262   return false;

Thank you,
Jini.


Thanks!
Jc

On Tue, Dec 11, 2018 at 9:30 AM Jini George <jini.george at oracle.com
<mailto:jini.george at oracle.com>> wrote:

     Hello !

     Requesting reviews for:

     https://bugs.openjdk.java.net/browse/JDK-8202884
     Webrev: http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/index.html

     Details:
     For attaching to the threads in a process, we first go ahead and do a
     ptrace attach to the main thread. Later, we use the libthread_db
     library
     (or, in the case of being within a container, iterate through the
     /proc/<pid>/task files) to discover the threads of the process, and add
     them to the threads list (within SA) for this process. Once, we have
     discovered all the threads and added these to the list of threads, we
     then invoke ptrace attach individually on all these threads to
     attach to
     these. When we deal with an application where the threads are exiting
     continuously, some of these threads might not exist by the time we try
     to ptrace attach to these threads. The proposed fix includes the
     following modifications to solve this.
      ? 1. Check the state of the threads in the thread_db callback routine,
     and skip if the state of the thread is TD_THR_UNKNOWN or TD_THR_ZOMBIE.
     SA does not try to ptrace attach to these threads and does not include
     these threads in the threads list.
      ? 2. While ptrace attaching to the thread, if ptrace(PTRACE_ATTACH,
     ...)
     fails with either ESCRH or EPERM, check the state of the thread by
     checking if the /proc/<pid>/status file corresponding to that thread
     exists and if so, reading in the 'State:' line of that file. Skip
     attaching to this thread and delete this thread from the SA list of
     threads, if the thread is dead (State: X) or is a zombie (State: Z).
      ?From the /proc man page, "Current state of the process. One of "R
     (running)", "S (sleeping)", "D (disk sleep)", "T (stopped)", "T
     (tracing
     stop)", "Z (zombie)", or "X (dead)"."
      ? 3. If waitpid() on the thread is a failure, again skip this thread
     (delete this from SA's list of threads) instead of bailing out if the
     thread has exited or terminated.

     Testing:
     1. Tested by attaching and detaching multiple times to a test program
     spawning numerous short lived threads.
     2. The SA tests (under test/hotspot/jtreg/serviceability/sa) passed
     with
     100 repeats on Mach5.
     3. No new failures and no occurrences of JDK-8202884 seen with testing
     the SA tests (tiers 1 to 5) on Mach5.

     More details in the bug comments section.

     Thank you,
     Jini.



--

Thanks,
Jc

Reply via email to