I have used a single CTAS to create Parquet tables with 1.5B rows. It consumed a lot of heap memory on the Drillbits and I had to increase the heap size. Check your logs to see if you are running out of heap memory.
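For reference, the Drillbit heap is raised in `conf/drill-env.sh` on each node. A sketch of the relevant settings follows; the variable names and values here are illustrative of the Drill 0.9/1.0 era, not a recommendation, so check your version's `drill-env.sh` before editing:

```shell
# conf/drill-env.sh on each Drillbit (restart Drillbits after changing).
# Values are examples only -- size them to your hardware and workload.
export DRILL_MAX_HEAP="8G"             # JVM heap used by the Drillbit
export DRILL_MAX_DIRECT_MEMORY="16G"   # off-heap (direct) memory
```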
I used a 128MB parquet block size. This was with Drill 0.9, so I'm sure 1.0 will be better in this regard.

—Andries

On May 28, 2015, at 8:43 AM, Matt <bsg...@gmail.com> wrote:

> Is 300MM records too much to do in a single CTAS statement?
>
> After almost 23 hours I killed the query (^c) and it returned:
>
> ~~~
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 1_20      | 13568824                   |
> | 1_15      | 12411822                   |
> | 1_7       | 12470329                   |
> | 1_12      | 13693867                   |
> | 1_5       | 13292136                   |
> | 1_18      | 13874321                   |
> | 1_16      | 13303094                   |
> | 1_9       | 13639049                   |
> | 1_10      | 13698380                   |
> | 1_22      | 13501073                   |
> | 1_8       | 13533736                   |
> | 1_2       | 13549402                   |
> | 1_21      | 13665183                   |
> | 1_0       | 13544745                   |
> | 1_4       | 13532957                   |
> | 1_19      | 12767473                   |
> | 1_17      | 13670687                   |
> | 1_13      | 13469515                   |
> | 1_23      | 12517632                   |
> | 1_6       | 13634338                   |
> | 1_14      | 13611322                   |
> | 1_3       | 13061900                   |
> | 1_11      | 12760978                   |
> +-----------+----------------------------+
> 23 rows selected (82294.854 seconds)
> ~~~
>
> The sum of those record counts is 306,772,763, which is close to the
> 320,843,454 in the source file:
>
> ~~~
> 0: jdbc:drill:zk=es05:2181> select count(*) FROM root.`sample_201501.dat`;
> +------------+
> |   EXPR$0   |
> +------------+
> | 320843454  |
> +------------+
> 1 row selected (384.665 seconds)
> ~~~
>
> It represents one month of data: 4 key columns and 38 numeric measure
> columns, which could also be partitioned daily. The test here was to create
> monthly Parquet files to see how the min/max stats on Parquet chunks help
> with range select performance.
>
> Instead of a small number of large monthly RDBMS tables, I am attempting to
> determine how many Parquet files should be used with Drill / HDFS.
>
> On 27 May 2015, at 15:17, Matt wrote:
>
>> Attempting to create a Parquet backed table with a CTAS from a 44GB tab
>> delimited file in HDFS.
>> The process seemed to be running, as CPU and IO were seen on all 4 nodes
>> in this cluster, and .parquet files were being created in the expected path.
>>
>> However, in the last two hours or so, all nodes have shown near-zero CPU
>> and IO, and the Last Modified dates on the .parquet files have not changed.
>> The same delay is shown in the Last Progress column in the active fragment
>> profile.
>>
>> What approach can I take to determine what is happening (or not)?
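As a quick sanity check on the numbers quoted in the thread, a short script (hypothetical, not part of the original exchange) can sum the per-fragment counts reported by the killed CTAS and compare them against the source row count:

```python
# Per-fragment record counts reported when the CTAS was killed (from the thread).
fragment_counts = {
    "1_20": 13568824, "1_15": 12411822, "1_7": 12470329, "1_12": 13693867,
    "1_5": 13292136, "1_18": 13874321, "1_16": 13303094, "1_9": 13639049,
    "1_10": 13698380, "1_22": 13501073, "1_8": 13533736, "1_2": 13549402,
    "1_21": 13665183, "1_0": 13544745, "1_4": 13532957, "1_19": 12767473,
    "1_17": 13670687, "1_13": 13469515, "1_23": 12517632, "1_6": 13634338,
    "1_14": 13611322, "1_3": 13061900, "1_11": 12760978,
}

source_rows = 320_843_454  # select count(*) on the source file

written = sum(fragment_counts.values())
missing = source_rows - written
print(f"written: {written:,}")  # 306,772,763 -- matches the figure in the thread
print(f"missing: {missing:,} ({missing / source_rows:.1%})")
```

So roughly 14M rows (about 4.4% of the source) had not yet been written when the query was cancelled, consistent with a job that was close to done but stalled.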