[jira] [Commented] (DRILL-4330) Long running SQL query hangs once Foreman node is killed

2016-03-20 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197904#comment-15197904
 ] 

Khurram Faraaz commented on DRILL-4330:
---

Problem is reproducible on Drill 1.6.0, JDK 7 and git commit ID : 
64ab0a8ec9d98bf96f4d69274dddc180b8efe263

> Long running SQL query hangs once Foreman node is killed
> 
>
> Key: DRILL-4330
> URL: https://issues.apache.org/jira/browse/DRILL-4330
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.4.0
> Environment: 4 node cluster CentOS
>Reporter: Khurram Faraaz
>Assignee: Sudheesh Katkam
> Attachments: drillbit.out
>
>
> Summary : Once Foreman node Drillbit is killed, long running query just hangs 
> and no profile information is written to Web UI. That long running query was 
> issued from the Foreman node.
> MapR Drill 1.4.0 GA
> MapR FS 5.0.0 GA
> JDK8
> 4 node CentOS cluster
> ./sqlline -u "jdbc:drill:schema=dfs.tmp -n mapr -p mapr"
> Issue a long running select query over JSON data
> Immediately kill the Drillbit on Foreman node (ps -eaf | grep Drillbit), kill 
> -9 PID
> The long running query hangs on sqlline prompt, there are no 
> messages/errors/Exceptions reported on sqlline prompt.
> On the Web UI there is no profile reported for the long running query that 
> was running on the Drillbit that was killed.
> Question (1) : Why was there no profile reported/written on the Web UI for 
> that long running query ? In a real production scenario user will not know 
> what query was under execution at the point when Foreman went down. 
> Question (2) : Why does the long running query not terminate, once the 
> foreman was killed ? from the drillbit.log snippet we do not see any 
> CANCELED/TERMINATED message for that query, why ?
> Snippet from drillbit.log on the foreman node. 
> {noformat}
> 2016-02-01 10:59:20,917 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 2950c576-b2d2-5bc3-e9b5-ff4414d088c0: select * from `twoKeyJsn.json`
> 2016-02-01 10:59:21,067 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.create() took 1 ms
> 2016-02-01 10:59:21,068 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,068 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,155 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,250 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 
> using 1 threads. Time: 90ms total, 90.891938ms avg, 90ms max.
> 2016-02-01 10:59:21,250 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 
> using 1 threads. Earliest start: 18.28 μs, Latest start: 18.28 μs, 
> Average start: 18.28 μs .
> 2016-02-01 10:59:21,448 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:frag:0:0] INFO  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 2950c576-b2d2-5bc3-e9b5-ff4414d088c0:0:0: State change requested 
> AWAITING_ALLOCATION --> RUNNING
> 2016-02-01 10:59:21,448 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:frag:0:0] INFO  
> o.a.d.e.w.f.FragmentStatusReporter - 
> 2950c576-b2d2-5bc3-e9b5-ff4414d088c0:0:0: State to report: RUNNING
> {noformat}
> Doing kill -3 PID on the non foreman node for the Drillbit process gives us 
> stack trace in drillbit.out
> {noformat}
> 2016-02-01 11:03:31
> Full thread dump OpenJDK 64-Bit Server VM (25.65-b01 mixed mode):
> "qtp801808302-129" #129 prio=5 os_prio=0 

[jira] [Commented] (DRILL-4330) Long running SQL query hangs once Foreman node is killed

2016-02-01 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126577#comment-15126577
 ] 

Khurram Faraaz commented on DRILL-4330:
---

Adding Hakim's comment from email thread on this JIRA.


I don't see any fragment running which means the query did cancel itself 
properly, yet some threads are blocked. Please report this in the JIRA too.


> Long running SQL query hangs once Foreman node is killed
> 
>
> Key: DRILL-4330
> URL: https://issues.apache.org/jira/browse/DRILL-4330
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.4.0
> Environment: 4 node cluster CentOS
>Reporter: Khurram Faraaz
> Attachments: drillbit.out
>
>
> Summary : Once Foreman node Drillbit is killed, long running query just hangs 
> and no profile information is written to Web UI. That long running query was 
> issued from the Foreman node.
> MapR Drill 1.4.0 GA
> MapR FS 5.0.0 GA
> JDK8
> 4 node CentOS cluster
> ./sqlline -u "jdbc:drill:schema=dfs.tmp -n mapr -p mapr"
> Issue a long running select query over JSON data
> Immediately kill the Drillbit on Foreman node (ps -eaf | grep Drillbit), kill 
> -9 PID
> The long running query hangs on sqlline prompt, there are no 
> messages/errors/Exceptions reported on sqlline prompt.
> On the Web UI there is no profile reported for the long running query that 
> was running on the Drillbit that was killed.
> Question (1) : Why was there no profile reported/written on the Web UI for 
> that long running query ? In a real production scenario user will not know 
> what query was under execution at the point when Foreman went down. 
> Question (2) : Why does the long running query not terminate, once the 
> foreman was killed ? from the drillbit.log snippet we do not see any 
> CANCELED/TERMINATED message for that query, why ?
> Snippet from drillbit.log on the foreman node. 
> {noformat}
> 2016-02-01 10:59:20,917 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 2950c576-b2d2-5bc3-e9b5-ff4414d088c0: select * from `twoKeyJsn.json`
> 2016-02-01 10:59:21,067 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.create() took 1 ms
> 2016-02-01 10:59:21,068 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,068 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,069 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,155 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
> numFiles: 1
> 2016-02-01 10:59:21,250 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 
> using 1 threads. Time: 90ms total, 90.891938ms avg, 90ms max.
> 2016-02-01 10:59:21,250 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:foreman] INFO  
> o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 
> using 1 threads. Earliest start: 18.28 μs, Latest start: 18.28 μs, 
> Average start: 18.28 μs .
> 2016-02-01 10:59:21,448 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:frag:0:0] INFO  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 2950c576-b2d2-5bc3-e9b5-ff4414d088c0:0:0: State change requested 
> AWAITING_ALLOCATION --> RUNNING
> 2016-02-01 10:59:21,448 [2950c576-b2d2-5bc3-e9b5-ff4414d088c0:frag:0:0] INFO  
> o.a.d.e.w.f.FragmentStatusReporter - 
> 2950c576-b2d2-5bc3-e9b5-ff4414d088c0:0:0: State to report: RUNNING
> {noformat}
> Doing kill -3 PID on the non foreman node for the Drillbit process gives us 
> stack trace in drillbit.out
> {noformat}
> 2016-02-01 11:03:31
> Full thread dump OpenJDK 64-Bit Server VM (25.65-b01 mixed