[jira] [Commented] (DRILL-4330) Long running SQL query hangs once Foreman node is killed
[ https://issues.apache.org/jira/browse/DRILL-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197904#comment-15197904 ] Khurram Faraaz commented on DRILL-4330: --- Problem is reproducible on Drill 1.6.0, JDK 7 and git commit ID : 64ab0a8ec9d98bf96f4d69274dddc180b8efe263 > Long running SQL query hangs once Foreman node is killed > > > Key: DRILL-4330 > URL: https://issues.apache.org/jira/browse/DRILL-4330 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Flow >Affects Versions: 1.4.0 > Environment: 4 node cluster CentOS >Reporter: Khurram Faraaz >Assignee: Sudheesh Katkam > Attachments: drillbit.out > > > Summary : Once Foreman node Drillbit is killed, long running query just hangs > and no profile information is written to Web UI. That long running query was > issued from the Foreman node. > MapR Drill 1.4.0 GA > MapR FS 5.0.0 GA > JDK8 > 4 node CentOS cluster > ./sqlline -u "jdbc:drill:schema=dfs.tmp -n mapr -p mapr" > Issue a long running select query over JSON data > Immediately kill the Drillbit on Foreman node (ps -eaf | grep Drillbit), kill > -9 PID > The long running query hangs on sqlline prompt, there are no > messages/errors/Exceptions reported on sqlline prompt. > On the Web UI there is no profile reported for the long running query that > was running on the Drillbit that was killed. > Question (1) : Why was there no profile reported/written on the Web UI for > that long running query ? In a real production scenario user will not know > what query was under execution at the point when Foreman went down. > Question (2) : Why does the long running query not terminate, once the > foreman was killed ? from the drillbit.log snippet we do not see any > CANCELED/TERMINATED message for that query, why ? > Snippet from drillbit.log on the foreman node. > {noformat} > 2016-02-01 10:59:20,917 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.drill.exec.work.foreman.Foreman - Query text for query id > 2950c576-b2d2-5bc3-e9b5-ff4414d088c0: select * from `twoKeyJsn.json` > 2016-02-01 10:59:21,067 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.create() took 1 ms > 2016-02-01 10:59:21,068 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,068 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,155 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,250 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 > using 1 threads. Time: 90ms total, 90.891938ms avg, 90ms max. > 2016-02-01 10:59:21,250 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 > using 1 threads. Earliest start: 18.28 μs, Latest start: 18.28 μs, > Average start: 18.28 μs . > 2016-02-01 10:59:21,448 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:frag:0:0] INFO > o.a.d.e.w.fragment.FragmentExecutor - > 2950c576-b2d2-5bc3-e9b5-ff4414d088c0:0:0: State change requested > AWAITING_ALLOCATION --> RUNNING > 2016-02-01 10:59:21,448 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:frag:0:0] INFO > o.a.d.e.w.f.FragmentStatusReporter - > 2950c576-b2d2-5bc3-e9b5-ff4414d088c0:0:0: State to report: RUNNING > {noformat} > Doing kill -3 PID on the non foreman node for the Drillbit process gives us > stack trace in drillbit.out > {noformat} > 2016-02-01 11:03:31 > Full thread dump OpenJDK 64-Bit Server VM (25.65-b01 mixed mode): > "qtp801808302-129" #129 prio=5 os_prio=0
[jira] [Commented] (DRILL-4330) Long running SQL query hangs once Foreman node is killed
[ https://issues.apache.org/jira/browse/DRILL-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126577#comment-15126577 ] Khurram Faraaz commented on DRILL-4330: --- Adding Hakim's comment from email thread on this JIRA. I don't see any fragment running which means the query did cancel itself properly, yet some threads are blocked. Please report this in the JIRA too. > Long running SQL query hangs once Foreman node is killed > > > Key: DRILL-4330 > URL: https://issues.apache.org/jira/browse/DRILL-4330 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Flow >Affects Versions: 1.4.0 > Environment: 4 node cluster CentOS >Reporter: Khurram Faraaz > Attachments: drillbit.out > > > Summary : Once Foreman node Drillbit is killed, long running query just hangs > and no profile information is written to Web UI. That long running query was > issued from the Foreman node. > MapR Drill 1.4.0 GA > MapR FS 5.0.0 GA > JDK8 > 4 node CentOS cluster > ./sqlline -u "jdbc:drill:schema=dfs.tmp -n mapr -p mapr" > Issue a long running select query over JSON data > Immediately kill the Drillbit on Foreman node (ps -eaf | grep Drillbit), kill > -9 PID > The long running query hangs on sqlline prompt, there are no > messages/errors/Exceptions reported on sqlline prompt. > On the Web UI there is no profile reported for the long running query that > was running on the Drillbit that was killed. > Question (1) : Why was there no profile reported/written on the Web UI for > that long running query ? In a real production scenario user will not know > what query was under execution at the point when Foreman went down. > Question (2) : Why does the long running query not terminate, once the > foreman was killed ? from the drillbit.log snippet we do not see any > CANCELED/TERMINATED message for that query, why ? > Snippet from drillbit.log on the foreman node. > {noformat} > 2016-02-01 10:59:20,917 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.drill.exec.work.foreman.Foreman - Query text for query id > 2950c576-b2d2-5bc3-e9b5-ff4414d088c0: select * from `twoKeyJsn.json` > 2016-02-01 10:59:21,067 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.create() took 1 ms > 2016-02-01 10:59:21,068 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,068 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,155 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, > numFiles: 1 > 2016-02-01 10:59:21,250 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 > using 1 threads. Time: 90ms total, 90.891938ms avg, 90ms max. > 2016-02-01 10:59:21,250 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO > o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 > using 1 threads. Earliest start: 18.28 μs, Latest start: 18.28 μs, > Average start: 18.28 μs . > 2016-02-01 10:59:21,448 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:frag:0:0] INFO > o.a.d.e.w.fragment.FragmentExecutor - > 2950c576-b2d2-5bc3-e9b5-ff4414d088c0:0:0: State change requested > AWAITING_ALLOCATION --> RUNNING > 2016-02-01 10:59:21,448 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:frag:0:0] INFO > o.a.d.e.w.f.FragmentStatusReporter - > 2950c576-b2d2-5bc3-e9b5-ff4414d088c0:0:0: State to report: RUNNING > {noformat} > Doing kill -3 PID on the non foreman node for the Drillbit process gives us > stack trace in drillbit.out > {noformat} > 2016-02-01 11:03:31 > Full thread dump OpenJDK 64-Bit Server VM (25.65-b01 mixed