I have another test case: a query against a table, filtered on a range of dates and a customer key, that SUMs 38 columns. The returned record set includes all 42 columns in the table - not a good design for Parquet files or any RDBMS, but a modeling problem that is not yet fully in my control (the application needs some changes).

Simply selecting all the columns from the Parquet files with that filter returns data to the client in about 3 seconds, but SUMming all 38 measure columns left the query still running at the client 22 hours later.
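Schematically, the query has roughly this shape (column names, paths, and literals here are illustrative placeholders, not the real schema):

~~~
SELECT date_key,
       customer_key,
       SUM(bytes_1250) AS bytes_1250   -- repeated for each of the 38 measure columns
FROM dfs.`/data/parquet/usage`
WHERE date_key BETWEEN DATE '2015-01-01' AND DATE '2015-01-31'
  AND customer_key = 12345
GROUP BY date_key, customer_key;
~~~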
However, the query profile shows no fragments with a Max Runtime of more than 2h20m, much like the "stuck CTAS" I had before. Learning from that case, I looked at the node hosting the one fragment that did not finish. Could this be a communication failure between nodes that is not being signaled to the client?
~~~
Major Fragment: 02-xx-xx

Minor Fragment ID | Host Name | Start  | End   | Runtime | Max Records | Max Batches | Last Update | Last Progress | Peak Memory | State
02-00-xx          | es06      | 1.011s | 2h20m | 2h20m   | 0           | 1           | 02:35:43    | 02:35:43      | 2MB         | CANCELLED
02-01-xx          | es08      | 0.999s | 4m33s | 4m32s   | 0           | 1           | 01:19:52    | 01:19:52      | 2MB         | FINISHED
02-02-xx          | es07      | 1.010s | 2m16s | 2m15s   | 0           | 1           | 01:17:34    | 01:17:34      | 2MB         | FINISHED
02-03-xx          | es05      | 1.009s | 2m56s | 2m55s   | 0           | 1           | 01:18:14    | 01:18:14      | 2MB         | FINISHED
~~~
~~~
2015-05-29 05:23:07,822 [UserServer-1] INFO o.a.drill.exec.work.foreman.Foreman - Failure while trying communicate query result to initiating client. This would happen if a client is disconnected before response notice can be sent.
org.apache.drill.exec.rpc.ChannelClosedException: null
    at org.apache.drill.exec.rpc.CoordinationQueue$RpcListener.operationComplete(CoordinationQueue.java:89) [drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
    at org.apache.drill.exec.rpc.CoordinationQueue$RpcListener.operationComplete(CoordinationQueue.java:67) [drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:788) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:689) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1114) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:705) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:980) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1032) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:965) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254) [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
2015-05-29 05:23:07,822 [UserServer-1] INFO o.a.drill.exec.work.foreman.Foreman - State change requested. CANCELED --> FAILED
org.apache.drill.exec.rpc.ChannelClosedException: null
    at org.apache.drill.exec.rpc.CoordinationQueue$RpcListener.operationComplete(CoordinationQueue.java:89) [drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
    at org.apache.drill.exec.rpc.CoordinationQueue$RpcListener.operationComplete(CoordinationQueue.java:67) [drill-java-exec-1.0.0-rebuffed.jar:1.0.0]
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:788) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:689) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1114) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:705) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:980) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1032) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:965) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254) [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-common-4.0.27.Final.jar:4.0.27.Final]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
~~~
On 28 May 2015, at 16:43, Mehant Baid wrote:
I think the problem might be related to a single laggard; it looks like we are waiting for one minor fragment to complete. Based on the output you provided, it looks like fragment 1_1 hasn't completed. You might want to find out where that fragment was scheduled and what is going on on that node. It might also be useful to look at the profile for that minor fragment to see how much data has been processed.
Thanks
Mehant
On 5/28/15 10:57 AM, Matt wrote:
Did you check the log files for any errors?
No messages related to this query contain errors or warnings, nor anything mentioning memory or heap. Querying now to determine what is missing in the Parquet destination.
drillbit.out on the master shows no error messages, and what looks
like the last relevant line is:
~~~
May 27, 2015 6:43:50 PM INFO:
parquet.hadoop.ColumnChunkPageWriteStore: written 2,258,263B for
[bytes_1250] INT64: 3,069,414 values, 24,555,504B raw, 2,257,112B
comp, 24 pages, encodings: [RLE, PLAIN, BIT_PACKED]
May 27, 2015 6:43:51 PM INFO: parquet.haMay 28, 2015 5:13:42 PM
org.apache.calcite.sql.validate.SqlValidatorException <init>
~~~
The final lines in drillbit.log (which appears to use a different time format) that contain the profile ID:
~~~
2015-05-27 18:39:49,980
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20: State change requested
from RUNNING --> FINISHED for
2015-05-27 18:39:49,981
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:20] INFO
o.a.d.e.w.f.AbstractStatusReporter - State changed for
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:20. New state: FINISHED
2015-05-27 18:40:05,650
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12: State change requested
from RUNNING --> FINISHED for
2015-05-27 18:40:05,650
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:12] INFO
o.a.d.e.w.f.AbstractStatusReporter - State changed for
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:12. New state: FINISHED
2015-05-27 18:41:57,444
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16: State change requested
from RUNNING --> FINISHED for
2015-05-27 18:41:57,444
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:16] INFO
o.a.d.e.w.f.AbstractStatusReporter - State changed for
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:16. New state: FINISHED
2015-05-27 18:43:25,005
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8: State change requested from
RUNNING --> FINISHED for
2015-05-27 18:43:25,005
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:8] INFO
o.a.d.e.w.f.AbstractStatusReporter - State changed for
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:8. New state: FINISHED
2015-05-27 18:43:54,539
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0: State change requested from
RUNNING --> FINISHED for
2015-05-27 18:43:54,540
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:0] INFO
o.a.d.e.w.f.AbstractStatusReporter - State changed for
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:0. New state: FINISHED
2015-05-27 18:43:59,947
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4: State change requested from
RUNNING --> FINISHED for
2015-05-27 18:43:59,947
[2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:frag:1:4] INFO
o.a.d.e.w.f.AbstractStatusReporter - State changed for
2a9a10ec-6f96-5dc5-54fc-dc5295a77e42:1:4. New state: FINISHED
~~~
On 28 May 2015, at 13:42, Andries Engelbrecht wrote:
It should execute multi-threaded; I need to check on text files.
Did you check the log files for any errors?
On May 28, 2015, at 10:36 AM, Matt <bsg...@gmail.com> wrote:
The time seems pretty long for that file size. What type of file
is it?
Tab-delimited UTF-8 text.
I left the query to run overnight to see if it would complete, but
24 hours for an import like this would indeed be too long.
Is the CTAS running single threaded?
In the first hour, with this being the only client connected to the
cluster, I observed activity on all 4 nodes.
Is multi-threaded query execution the default? I would not have deliberately changed anything to force single-threaded execution.
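If it helps, one quick way to verify the parallelism settings from sqlline (assuming the standard option names) would be:

~~~
SELECT * FROM sys.options WHERE name LIKE 'planner.width%';
~~~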
On 28 May 2015, at 13:06, Andries Engelbrecht wrote:
The time seems pretty long for that file size. What type of file
is it?
Is the CTAS running single threaded?
—Andries
On May 28, 2015, at 9:37 AM, Matt <bsg...@gmail.com> wrote:
How large is the data set you are working with, and your
cluster/nodes?
Just testing with that single 44GB source file currently, and my test cluster consists of 4 nodes, each with 8 CPU cores, 32GB RAM, and a 6TB Ext4 volume (RAID-10).

Drill defaults left as they come in v1.0. I will be adjusting memory and retrying the CTAS.

I know I can / should assign individual disks to HDFS, but as a test cluster there are apps that expect data volumes to work on. A dedicated Hadoop production cluster would have a disk layout specific to the task.
On 28 May 2015, at 12:26, Andries Engelbrecht wrote:
Just check the drillbit.log and drillbit.out files in the log
directory.
Before adjusting memory, see if that is an issue first. It was
for me, but as Jason mentioned there can be other causes as
well.
You adjust memory allocation in the drill-env.sh files and then have to restart the drillbits.
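Before editing those files, a quick way to confirm the current per-drillbit heap and direct memory limits (if your build exposes the sys.memory system table) is:

~~~
SELECT * FROM sys.memory;
~~~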
How large is the data set you are working with, and your
cluster/nodes?
—Andries
On May 28, 2015, at 9:17 AM, Matt <bsg...@gmail.com> wrote:
To make sure I am adjusting the correct config: these are the heap parameters within the Drill config path, not for Hadoop or ZooKeeper?
On May 28, 2015, at 12:08 PM, Jason Altekruse
<altekruseja...@gmail.com> wrote:
There should be no upper limit on the size of the tables you can create with Drill. Be advised that Drill currently operates entirely optimistically with regard to available resources: if a network connection between two drillbits fails during a query, we will not currently re-schedule the work to make use of the remaining nodes and network connections that are still live. While we have had a good amount of success using Drill for data conversion, be aware that these conditions could cause long-running queries to fail.

That being said, it isn't the only possible cause for such a failure. In the case of a network failure we would expect to see a message returned to you that part of the query was unsuccessful and that it had been cancelled. Andries has a good suggestion regarding checking the heap memory; this should also be detected and reported back to you at the CLI, but we may be failing to propagate the error back to the head node for the query. I believe writing Parquet may still be the most heap-intensive operation in Drill, despite our efforts to refactor the write path to use direct memory instead of on-heap for the large buffers needed in the process of creating Parquet files.
On Thu, May 28, 2015 at 8:43 AM, Matt <bsg...@gmail.com>
wrote:
Is 300MM records too much to do in a single CTAS statement?
After almost 23 hours I killed the query (^c) and it
returned:
~~~
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 1_20 | 13568824 |
| 1_15 | 12411822 |
| 1_7 | 12470329 |
| 1_12 | 13693867 |
| 1_5 | 13292136 |
| 1_18 | 13874321 |
| 1_16 | 13303094 |
| 1_9 | 13639049 |
| 1_10 | 13698380 |
| 1_22 | 13501073 |
| 1_8 | 13533736 |
| 1_2 | 13549402 |
| 1_21 | 13665183 |
| 1_0 | 13544745 |
| 1_4 | 13532957 |
| 1_19 | 12767473 |
| 1_17 | 13670687 |
| 1_13 | 13469515 |
| 1_23 | 12517632 |
| 1_6 | 13634338 |
| 1_14 | 13611322 |
| 1_3 | 13061900 |
| 1_11 | 12760978 |
+-----------+----------------------------+
23 rows selected (82294.854 seconds)
~~~
The sum of those record counts is 306,772,763, which is close to the 320,843,454 in the source file (the difference is roughly one fragment's worth of output):
~~~
0: jdbc:drill:zk=es05:2181> select count(*) FROM
root.`sample_201501.dat`;
+------------+
| EXPR$0 |
+------------+
| 320843454 |
+------------+
1 row selected (384.665 seconds)
~~~
It represents one month of data (4 key columns and 38 numeric measure columns), which could also be partitioned daily. The test here was to create monthly Parquet files to see how the min/max stats on Parquet chunks help with range select performance.

Instead of a small number of large monthly RDBMS tables, I am attempting to determine how many Parquet files should be used with Drill / HDFS.
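For context, the CTAS being tested is roughly of this shape (the target workspace and column names are illustrative; the real statement lists all 42 columns):

~~~
CREATE TABLE dfs.tmp.`sample_201501` AS
SELECT CAST(columns[0] AS DATE)   AS date_key,      -- one of the 4 key columns
       CAST(columns[1] AS BIGINT) AS customer_key,
       CAST(columns[4] AS BIGINT) AS bytes_1250     -- one of the 38 measure columns
FROM root.`sample_201501.dat`;
~~~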
On 27 May 2015, at 15:17, Matt wrote:
Attempting to create a Parquet-backed table with a CTAS from a 44GB tab-delimited file in HDFS. The process seemed to be running, as CPU and IO were seen on all 4 nodes in this cluster, and .parquet files were being created in the expected path.

In the last two hours or so, however, all nodes show near zero CPU or IO, and the Last Modified dates on the .parquet files have not changed. The same time delay is shown in the Last Progress column in the active fragment profile.

What approach can I take to determine what is happening (or not)?