Bumping memory to:

DRILL_MAX_DIRECT_MEMORY="16G"
DRILL_HEAP="8G"

The 44GB file imported successfully in 25 minutes, which is acceptable on this hardware.

I don't know whether the default memory settings were to blame or not.


On 28 May 2015, at 14:22, Andries Engelbrecht wrote:

That is the Drill direct memory per node.

DRILL_HEAP is for the heap size per node.

More info here
http://drill.apache.org/docs/configuring-drill-memory/
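
If your build exposes the sys.memory system table (an assumption worth checking; older releases may not have it), the actual per-node allocation can also be confirmed from sqlline:

~~~
-- One row per drillbit; the heap and direct figures are per node,
-- corresponding to DRILL_HEAP and DRILL_MAX_DIRECT_MEMORY.
SELECT * FROM sys.memory;
~~~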


—Andries

On May 28, 2015, at 11:09 AM, Matt <bsg...@gmail.com> wrote:

Referencing http://drill.apache.org/docs/configuring-drill-memory/

Is DRILL_MAX_DIRECT_MEMORY the limit for each node, or the cluster?

The root page on a drillbit at port 8047 lists four nodes, with the 16G Maximum Direct Memory equal to DRILL_MAX_DIRECT_MEMORY, so I am uncertain whether that is a node or a cluster limit.


On 28 May 2015, at 12:23, Jason Altekruse wrote:

That is correct. I suppose it is possible that HDFS might run out of heap, but I'm guessing that is unlikely to be the cause of the failure you are seeing. We should not be taxing ZooKeeper enough to cause any issues there.

On Thu, May 28, 2015 at 9:17 AM, Matt <bsg...@gmail.com> wrote:

To make sure I am adjusting the correct config: these are heap parameters within the Drill config path, not for Hadoop or ZooKeeper?


On May 28, 2015, at 12:08 PM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

There should be no upper limit on the size of the tables you can create with Drill. Be advised that Drill currently operates entirely optimistically with regard to available resources. If a network connection between two drillbits fails during a query, we will not currently re-schedule the work to make use of the remaining nodes and network connections that are still live. While we have had a good amount of success using Drill for data conversion, be aware that these conditions could cause long-running queries to fail.

That being said, it isn't the only possible cause for such a failure. In the case of a network failure we would expect to see a message returned to you that part of the query was unsuccessful and that it had been cancelled. Andries has a good suggestion regarding checking the heap memory; this should also be detected and reported back to you at the CLI, but we may be failing to propagate the error back to the head node for the query. I believe writing Parquet may still be the most heap-intensive operation in Drill, despite our efforts to refactor the write path to use direct memory instead of on-heap allocation for the large buffers needed in the process of creating Parquet files.

On Thu, May 28, 2015 at 8:43 AM, Matt <bsg...@gmail.com> wrote:

Is 300MM records too much to do in a single CTAS statement?
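
For context, the CTAS was of this general shape (a sketch, not the exact statement: the dfs.tmp target workspace and the column names are placeholders, and the tab-delimited fields are assumed to come through Drill's columns[] array):

~~~
-- Sketch only: an assumed date key, three other keys, and the first of 38 measures.
CREATE TABLE dfs.tmp.`sample_201501_parquet` AS
SELECT
  CAST(columns[0] AS DATE)    AS event_date,  -- assumed date key
  columns[1]                  AS key2,
  columns[2]                  AS key3,
  columns[3]                  AS key4,
  CAST(columns[4] AS DOUBLE)  AS measure1
  -- ... the remaining 37 measure columns cast the same way
FROM root.`sample_201501.dat`;
~~~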

After almost 23 hours I killed the query (^c) and it returned:

~~~
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_20      | 13568824                   |
| 1_15      | 12411822                   |
| 1_7       | 12470329                   |
| 1_12      | 13693867                   |
| 1_5       | 13292136                   |
| 1_18      | 13874321                   |
| 1_16      | 13303094                   |
| 1_9       | 13639049                   |
| 1_10      | 13698380                   |
| 1_22      | 13501073                   |
| 1_8       | 13533736                   |
| 1_2       | 13549402                   |
| 1_21      | 13665183                   |
| 1_0       | 13544745                   |
| 1_4       | 13532957                   |
| 1_19      | 12767473                   |
| 1_17      | 13670687                   |
| 1_13      | 13469515                   |
| 1_23      | 12517632                   |
| 1_6       | 13634338                   |
| 1_14      | 13611322                   |
| 1_3       | 13061900                   |
| 1_11      | 12760978                   |
+-----------+----------------------------+
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is 306,772,763, which is close to the 320,843,454 rows in the source file:

~~~
0: jdbc:drill:zk=es05:2181> select count(*)  FROM
root.`sample_201501.dat`;
+------------+
|   EXPR$0   |
+------------+
| 320843454  |
+------------+
1 row selected (384.665 seconds)
~~~


It represents one month of data, with 4 key columns and 38 numeric measure columns, and could also be partitioned daily. The test here was to create monthly Parquet files to see how the min/max stats on Parquet row groups help with range select performance.
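
The kind of range select the test targets looks like the sketch below (hedged: the table path and the event_date key are the same assumptions as in the CTAS sketch above, and whether Drill can actually skip non-matching Parquet row groups from those stats is exactly what the test is meant to measure):

~~~
-- A one-week slice out of the monthly Parquet table.
SELECT key2, SUM(measure1) AS total
FROM dfs.tmp.`sample_201501_parquet`
WHERE event_date >= DATE '2015-01-10'
  AND event_date <  DATE '2015-01-17'
GROUP BY key2;
~~~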

Instead of a small number of large monthly RDBMS tables, I am attempting to determine how many Parquet files should be used with Drill / HDFS.
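
One way to compare the daily granularity is simply one CTAS per day into a subdirectory of a common parent, so the whole month can still be queried through the parent path (a sketch under the same assumptions as above; whether the writable workspace accepts a nested table name like this is worth verifying):

~~~
-- Sketch: one Parquet table per day under a common parent directory.
CREATE TABLE dfs.tmp.`sample_201501/2015-01-01` AS
SELECT
  CAST(columns[0] AS DATE)    AS event_date,
  columns[1]                  AS key2,
  CAST(columns[4] AS DOUBLE)  AS measure1
  -- ... same full column list and casts as the monthly sketch
FROM root.`sample_201501.dat`
WHERE CAST(columns[0] AS DATE) = DATE '2015-01-01';

-- The month could then be read through the parent directory, e.g.:
-- SELECT COUNT(*) FROM dfs.tmp.`sample_201501`;
~~~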




On 27 May 2015, at 15:17, Matt wrote:

Attempting to create a Parquet-backed table with a CTAS from a 44GB tab-delimited file in HDFS. The process seemed to be running, as CPU and IO were seen on all 4 nodes in this cluster, and .parquet files were being created in the expected path.

However, in the last two hours or so, all nodes have shown near-zero CPU and IO, and the Last Modified dates on the .parquet files have not changed. The same time delay shows in the Last Progress column of the active fragment profile.

What approach can I take to determine what is happening (or not)?

