Hi Tim,
Using this "flag file" gives a finer control (e.g., can be set on a single
node) than a System Option.
Also, with a system option one would need to start another session to turn the
option OFF (and would the looping thread then see the option change, or is it
using a cached value?).
This feature is for use by developers (or maybe support), so whichever is
easier for us works.
Boaz
________________________________
From: Timothy Farkas <[email protected]>
Sent: Thursday, September 21, 2017 6:31:22 PM
To: [email protected]
Subject: Re: Added "spinner" code to allow debugging of failure cause
Hi Boaz,
Would it be possible to implement this as a system option, so that there is a
uniform way of toggling these features?
Thanks,
Tim
________________________________
From: Boaz Ben-Zvi <[email protected]>
Sent: Wednesday, September 20, 2017 5:23:43 PM
To: [email protected]
Subject: Added "spinner" code to allow debugging of failure cause
FYI and for feedback:
As part of Pull Request #938 I added "spinner" code to the build() method of
the UserException class. When this method is called (i.e., just before a
failure is reported to the user), that code can go into a looping spin instead
of continuing to termination.
This can be useful when investigating the original failure: it allows attaching
a debugger, using jstack to see the stacks at this point of execution, or
checking external state (like the condition of the spill files at that moment).
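In outline, the hook behaves as sketched below. This is a minimal,
self-contained illustration, not the actual code from PR #938; the class and
method names are made up, and only the flag-file path comes from the
description here:

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class SpinHook {
  private static final File SPIN_FLAG = new File("/tmp/drill/spin");

  // Call just before reporting a failure. If the flag file exists,
  // record this process and the error, then spin until the flag is removed.
  public static void spinIfRequested(String errorMessage) {
    if (!SPIN_FLAG.exists()) {
      return; // feature is OFF -- continue to normal failure reporting
    }
    File info = null;
    try {
      // Creates a name like /tmp/drill/spin4148663301172491613.tmp
      info = File.createTempFile("spin", ".tmp", SPIN_FLAG.getParentFile());
      try (FileWriter w = new FileWriter(info)) {
        // getName() returns "<pid>@<hostname>" on common JVMs
        w.write("Spinning process: "
            + ManagementFactory.getRuntimeMXBean().getName() + "\n");
        w.write("Error cause: " + errorMessage + "\n");
      }
    } catch (IOException e) {
      // best effort only -- spin even if the info file could not be written
    }
    // Loop: sleep for a second, then recheck the flag file
    while (SPIN_FLAG.exists()) {
      try {
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    if (info != null) {
      info.delete(); // released -- clean up the temp file
    }
  }
}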
To turn this feature ON, create an (empty) flag file named /tmp/drill/spin on
every node where the spinning should take place (e.g., use
"clush -a touch /tmp/drill/spin" to set it across the whole cluster).
Once a thread hits this code, it checks for the existence of the spin file; if
the file exists, the thread creates a temp file named something like
/tmp/drill/spin4148663301172491613.tmp, which contains its process ID (e.g., to
allow running jstack) and the error message, like:
~ 5 > cat /tmp/drill/spin5273075865809469794.tmp
Spinning process: [email protected]
Error cause: SYSTEM ERROR: CannotPlanException: Node
[rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]] could not be implemented; planner
state:
Root: rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]
. . . . . . .
~ 6 > jstack 16966
Picked up JAVA_TOOL_OPTIONS: -ea
2017-09-20 17:15:21
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode):
"Attach Listener" #91 daemon prio=9 os_prio=31 tid=0x00007fdd8830b000
nid=0x4f07 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"263cfbd5-329d-b9fb-d96e-392e4fe0be4d:foreman" #53 daemon prio=10 os_prio=31
tid=0x00007fdd8823a000 nid=0x7203 waiting on condition [0x0000700002224000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:570)
. . . . . . . .
The spinning thread then loops: it sleeps for a second and then rechecks the
flag file. To turn this feature OFF and release the spinning threads, delete
those empty spin files (e.g., use "clush -a rm /tmp/drill/spin"); this also
cleans up the relevant temp files.
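Putting it all together, a typical session might look like this (the PID is
taken from the jstack example above; the exact temp-file names will differ):

~ 1 > clush -a touch /tmp/drill/spin    # arm the spinner on every node
      ... reproduce the failure; the failing thread now spins ...
~ 2 > cat /tmp/drill/spin*.tmp          # find the process ID and the error cause
~ 3 > jstack 16966 > /tmp/stacks.txt    # capture the stacks while the thread spins
~ 4 > clush -a rm /tmp/drill/spin       # release the threads (also cleans the temp files)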
Hope this is useful, and I welcome any feedback or suggestions.
Boaz