Hi Tim,
Using this "flag file" gives a finer control (e.g., can be set on a single
node) than a System Option.
Also, with a system option one would need to start another session to turn the
option OFF (and would the looping thread then see the option change, or is it
using a cached value?).
This feature is for use by developers (or maybe support), so whichever is
easier for us works.
Boaz
________________________________
From: Timothy Farkas <[email protected]>
Sent: Thursday, September 21, 2017 6:31:22 PM
To: [email protected]
Subject: Re: Added "spinner" code to allow debugging of failure cause
Hi Boaz,
Would it be possible to implement this as a system option, so that there is a
uniform way of toggling these features?
Thanks,
Tim
________________________________
From: Boaz Ben-Zvi <[email protected]>
Sent: Wednesday, September 20, 2017 5:23:43 PM
To: [email protected]
Subject: Added "spinner" code to allow debugging of failure cause
FYI and for feedback:
As part of Pull Request #938 I added "spinner" code to the build() method of
the UserException class. When this method is called (i.e., just before a
failure is reported to the user), that code can go into a looping spin instead
of continuing to termination.
This can be useful when investigating the original failure: it allows attaching
a debugger, using jstack to see the stacks at this point of execution, or
checking external state (like the condition of the spill files at that moment).
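In outline, the hook behaves as sketched below. This is a minimal,
self-contained illustration, not the actual code from PR #938; the class and
method names are made up, and only the flag-file path comes from the
description here:

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class SpinHook {
  private static final File SPIN_FLAG = new File("/tmp/drill/spin");

  // Call just before reporting a failure. If the flag file exists,
  // record this process and the error, then spin until the flag is removed.
  public static void spinIfRequested(String errorMessage) {
    if (!SPIN_FLAG.exists()) {
      return; // feature is OFF -- continue to normal failure reporting
    }
    File info = null;
    try {
      // Creates a name like /tmp/drill/spin4148663301172491613.tmp
      info = File.createTempFile("spin", ".tmp", SPIN_FLAG.getParentFile());
      try (FileWriter w = new FileWriter(info)) {
        // getName() returns "<pid>@<hostname>" on common JVMs
        w.write("Spinning process: "
            + ManagementFactory.getRuntimeMXBean().getName() + "\n");
        w.write("Error cause: " + errorMessage + "\n");
      }
    } catch (IOException e) {
      // best effort only -- spin even if the info file could not be written
    }
    // Loop: sleep for a second, then recheck the flag file
    while (SPIN_FLAG.exists()) {
      try {
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    if (info != null) {
      info.delete(); // released -- clean up the temp file
    }
  }
}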
To turn this feature ON, create an (empty) flag file named /tmp/drill/spin on
every node where the spinning should take place (e.g., use
"clush -a touch /tmp/drill/spin" to set it across the whole cluster).
Once a thread hits this code, it checks for the existence of the spin file; if
the file exists, the thread creates a temp file named something like
/tmp/drill/spin4148663301172491613.tmp, which contains its process ID (e.g., to
allow running jstack) and the error message, like:
~ 5 > cat /tmp/drill/spin5273075865809469794.tmp
Spinning process: [email protected]
Error cause: SYSTEM ERROR: CannotPlanException: Node
[rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]] could not be implemented; planner
state:
Root: rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]
. . . . . . .
~ 6 > jstack 16966
Picked up JAVA_TOOL_OPTIONS: -ea
2017-09-20 17:15:21
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode):
"Attach Listener" #91 daemon prio=9 os_prio=31 tid=0x00007fdd8830b000
nid=0x4f07 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"263cfbd5-329d-b9fb-d96e-392e4fe0be4d:foreman" #53 daemon prio=10 os_prio=31
tid=0x00007fdd8823a000 nid=0x7203 waiting on condition [0x0000700002224000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:570)
. . . . . . . .
The spinning thread then loops: it sleeps for a second and then rechecks the
flag file. To turn this feature OFF and release the spinning threads, delete
those empty spin files (e.g., use "clush -a rm /tmp/drill/spin"); this also
cleans up the relevant temp files.
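Putting it all together, a typical session might look like this (the PID is
taken from the jstack example above; the exact temp-file names will differ):

~ 1 > clush -a touch /tmp/drill/spin    # arm the spinner on every node
      ... reproduce the failure; the failing thread now spins ...
~ 2 > cat /tmp/drill/spin*.tmp          # find the process ID and the error cause
~ 3 > jstack 16966 > /tmp/stacks.txt    # capture the stacks while the thread spins
~ 4 > clush -a rm /tmp/drill/spin       # release the threads (also cleans the temp files)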
Hope this is useful, and I welcome any feedback or suggestions.
Boaz