How large is the data set you are working with, and your cluster/nodes?

Just testing with that single 44GB source file currently, and my test cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, a 6TB Ext4 volume (RAID-10).

Drill settings are left at the v1.0 defaults. I will be adjusting memory and retrying the CTAS.

I know I can / should assign individual disks to HDFS, but since this is a test cluster, there are apps that expect data volumes to work on. A dedicated Hadoop production cluster would have a disk layout specific to the task.


On 28 May 2015, at 12:26, Andries Engelbrecht wrote:

Just check the drillbit.log and drillbit.out files in the log directory. Before adjusting memory, see if that is an issue first. It was for me, but as Jason mentioned there can be other causes as well.
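
As a rough illustration of that log check, here is a minimal Python sketch that scans the two files for out-of-memory messages. The log directory passed in is a placeholder, and matching on the substring `OutOfMemory` is an assumption: it covers the JVM's `java.lang.OutOfMemoryError`, but Drill may word its own memory errors differently.

```python
from pathlib import Path


def find_oom_lines(log_text: str) -> list[str]:
    """Return every line that mentions an out-of-memory condition."""
    # Assumption: memory failures contain the substring "OutOfMemory",
    # as the JVM's java.lang.OutOfMemoryError does.
    return [line for line in log_text.splitlines() if "OutOfMemory" in line]


def scan_logs(log_dir: str) -> dict[str, list[str]]:
    """Scan drillbit.log and drillbit.out in log_dir; map file name to hits."""
    hits: dict[str, list[str]] = {}
    for name in ("drillbit.log", "drillbit.out"):
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # skip files that do not exist on this node
        found = find_oom_lines(path.read_text(errors="replace"))
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    # Hypothetical location; use the log directory of your own install.
    for name, lines in scan_logs("/path/to/drill/log").items():
        print(f"{name}: {len(lines)} matching line(s)")
        for line in lines[:5]:
            print("   ", line)
```

An empty result only means that no such message was logged on that node, so the check would need to be run on every drillbit host.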

You adjust memory allocation in the drill-env.sh files, and then have to restart the drillbits.

How large is the data set you are working with, and your cluster/nodes?

—Andries


On May 28, 2015, at 9:17 AM, Matt <bsg...@gmail.com> wrote:

To make sure I am adjusting the correct config, these are heap parameters within the Drill configure path, not for Hadoop or Zookeeper?


On May 28, 2015, at 12:08 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:

There should be no upper limit on the size of the tables you can create with Drill. Be advised that Drill does currently operate entirely optimistically in regards to available resources. If a network connection between two drillbits fails during a query, we will not currently re-schedule the work to make use of remaining nodes and network connections that are still live. While we have had a good amount of success using Drill for data conversion, be aware that these conditions could cause long running queries to fail.

That being said, it isn't the only possible cause for such a failure. In the case of a network failure we would expect to see a message returned to you that part of the query was unsuccessful and that it had been cancelled. Andries has a good suggestion in regards to checking the heap memory. A heap problem should also be detected and reported back to you at the CLI, but we may be failing to propagate the error back to the head node for the query. I believe writing parquet may still be the most heap-intensive operation in Drill, despite our efforts to refactor the write path to use direct memory instead of on-heap for large buffers needed in the process of creating parquet files.

On Thu, May 28, 2015 at 8:43 AM, Matt <bsg...@gmail.com> wrote:

Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_20      | 13568824                   |
| 1_15      | 12411822                   |
| 1_7       | 12470329                   |
| 1_12      | 13693867                   |
| 1_5       | 13292136                   |
| 1_18      | 13874321                   |
| 1_16      | 13303094                   |
| 1_9       | 13639049                   |
| 1_10      | 13698380                   |
| 1_22      | 13501073                   |
| 1_8       | 13533736                   |
| 1_2       | 13549402                   |
| 1_21      | 13665183                   |
| 1_0       | 13544745                   |
| 1_4       | 13532957                   |
| 1_19      | 12767473                   |
| 1_17      | 13670687                   |
| 1_13      | 13469515                   |
| 1_23      | 12517632                   |
| 1_6       | 13634338                   |
| 1_14      | 13611322                   |
| 1_3       | 13061900                   |
| 1_11      | 12760978                   |
+-----------+----------------------------+
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is 306,772,763 which is close to the
320,843,454 in the source file:

~~~
0: jdbc:drill:zk=es05:2181> select count(*) FROM root.`sample_201501.dat`;
+------------+
|   EXPR$0   |
+------------+
| 320843454  |
+------------+
1 row selected (384.665 seconds)
~~~
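
The arithmetic behind that comparison can be reproduced with a short Python sketch. The counts, source row count, and elapsed time are copied from the two outputs above. The check for absent fragment ids assumes that minor fragment ids are numbered contiguously from 0, which the thread does not confirm.

```python
# Records written per fragment, copied from the CTAS output above.
written = {
    "1_20": 13568824, "1_15": 12411822, "1_7": 12470329, "1_12": 13693867,
    "1_5": 13292136, "1_18": 13874321, "1_16": 13303094, "1_9": 13639049,
    "1_10": 13698380, "1_22": 13501073, "1_8": 13533736, "1_2": 13549402,
    "1_21": 13665183, "1_0": 13544745, "1_4": 13532957, "1_19": 12767473,
    "1_17": 13670687, "1_13": 13469515, "1_23": 12517632, "1_6": 13634338,
    "1_14": 13611322, "1_3": 13061900, "1_11": 12760978,
}
source_rows = 320843454   # from the count(*) query
elapsed_s = 82294.854     # reported by the cancelled CTAS

total_written = sum(written.values())
shortfall = source_rows - total_written

# Assumption: minor fragment ids run contiguously from 0 to the highest seen.
minor_ids = sorted(int(frag.split("_")[1]) for frag in written)
absent = [i for i in range(minor_ids[-1] + 1) if i not in minor_ids]

print(f"fragments reporting: {len(written)}")
print(f"records written:     {total_written:,}")
print(f"shortfall vs source: {shortfall:,}")
print(f"average rate:        {total_written / elapsed_s:,.0f} records/s")
print(f"absent fragment ids: {['1_%d' % i for i in absent]}")
```

With these figures the total is 306,772,763 records, 14,070,691 short of the source count. Under the contiguous-numbering assumption, 1_1 is the only id absent from the output.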


It represents one month of data, with 4 key columns and 38 numeric measure columns, which could also be partitioned daily. The test here was to create monthly Parquet files to see how the min/max stats on Parquet chunks help with range select performance.

Instead of a small number of large monthly RDBMS tables, I am attempting to determine how many Parquet files should be used with Drill / HDFS.




On 27 May 2015, at 15:17, Matt wrote:

Attempting to create a Parquet-backed table with a CTAS from a 44GB tab-delimited file in HDFS. The process seemed to be running, as CPU and IO were seen on all 4 nodes in this cluster, and .parquet files were being created in the expected path.

However, in the last two hours or so, all nodes show near-zero CPU or IO, and the Last Modified dates on the .parquet files have not changed. The same time delay is shown in the Last Progress column in the active fragment profile.

What approach can I take to determine what is happening (or not)?
