I did not note any memory errors or warnings in a quick scan of the logs, but to double-check: is there a specific log where such warnings would appear?
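For reference, heap exhaustion on a Drillbit generally surfaces as a JVM `OutOfMemoryError` in the Drillbit logs. A minimal sketch of the kind of scan I did, run here against a synthetic two-line log since log paths and exact message text vary by install (`$DRILL_HOME/log/drillbit.log` is an assumption):

```shell
# Sketch: scan a Drillbit log for heap exhaustion. The real location varies
# by install (commonly $DRILL_HOME/log/drillbit.log and drillbit.out); the
# log written below is synthetic, only to demonstrate the pattern match.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO  fragment 1_20 running
ERROR java.lang.OutOfMemoryError: Java heap space
EOF
grep -cE 'OutOfMemoryError|GC overhead limit exceeded' "$log"
rm -f "$log"
```

The `OutOfMemoryError: Java heap space` and `GC overhead limit exceeded` strings are standard JVM messages, so the same pattern should apply whatever the logger prefixes look like.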
> On May 28, 2015, at 12:01 PM, Andries Engelbrecht <aengelbre...@maprtech.com> wrote:
>
> I have used a single CTAS to create tables using parquet with 1.5B rows.
>
> It did consume a lot of heap memory on the Drillbits and I had to
> increase the heap size. Check your logs to see if you are running out
> of heap memory.
>
> I used 128MB parquet block size.
>
> This was with Drill 0.9, so I’m sure 1.0 will be better in this regard.
>
> —Andries
>
>
>> On May 28, 2015, at 8:43 AM, Matt <bsg...@gmail.com> wrote:
>>
>> Is 300MM records too much to do in a single CTAS statement?
>>
>> After almost 23 hours I killed the query (^c) and it returned:
>>
>> ~~~
>> +-----------+----------------------------+
>> | Fragment  | Number of records written  |
>> +-----------+----------------------------+
>> | 1_20      | 13568824                   |
>> | 1_15      | 12411822                   |
>> | 1_7       | 12470329                   |
>> | 1_12      | 13693867                   |
>> | 1_5       | 13292136                   |
>> | 1_18      | 13874321                   |
>> | 1_16      | 13303094                   |
>> | 1_9       | 13639049                   |
>> | 1_10      | 13698380                   |
>> | 1_22      | 13501073                   |
>> | 1_8       | 13533736                   |
>> | 1_2       | 13549402                   |
>> | 1_21      | 13665183                   |
>> | 1_0       | 13544745                   |
>> | 1_4       | 13532957                   |
>> | 1_19      | 12767473                   |
>> | 1_17      | 13670687                   |
>> | 1_13      | 13469515                   |
>> | 1_23      | 12517632                   |
>> | 1_6       | 13634338                   |
>> | 1_14      | 13611322                   |
>> | 1_3       | 13061900                   |
>> | 1_11      | 12760978                   |
>> +-----------+----------------------------+
>> 23 rows selected (82294.854 seconds)
>> ~~~
>>
>> The sum of those record counts is 306,772,763, which is close to the
>> 320,843,454 in the source file:
>>
>> ~~~
>> 0: jdbc:drill:zk=es05:2181> select count(*) FROM root.`sample_201501.dat`;
>> +------------+
>> | EXPR$0     |
>> +------------+
>> | 320843454  |
>> +------------+
>> 1 row selected (384.665 seconds)
>> ~~~
>>
>> It represents one month of data, 4 key columns and 38 numeric measure
>> columns, which could also be partitioned daily.
>> The test here was to create monthly Parquet files to see how the
>> min/max stats on Parquet chunks help with range select performance.
>>
>> Instead of a small number of large monthly RDBMS tables, I am
>> attempting to determine how many Parquet files should be used with
>> Drill / HDFS.
>>
>>> On 27 May 2015, at 15:17, Matt wrote:
>>>
>>> Attempting to create a Parquet-backed table with a CTAS from a 44GB
>>> tab-delimited file in HDFS. The process seemed to be running, as CPU
>>> and IO were seen on all 4 nodes in this cluster, and .parquet files
>>> were being created in the expected path.
>>>
>>> However, in the last two hours or so, all nodes show near-zero CPU or
>>> IO, and the Last Modified dates on the .parquet files have not
>>> changed. The same delay is shown in the Last Progress column in the
>>> active fragment profile.
>>>
>>> What approach can I take to determine what is happening (or not)?
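For what it's worth, the per-fragment counts in the table above can be totalled directly; this is just arithmetic on the numbers already quoted, and it also shows the shortfall against the source-file row count:

```shell
# Total the "Number of records written" column from the CTAS output above
# and compare with the 320,843,454 rows counted in the source file.
printf '%s\n' \
  13568824 12411822 12470329 13693867 13292136 13874321 13303094 13639049 \
  13698380 13501073 13533736 13549402 13665183 13544745 13532957 12767473 \
  13670687 13469515 12517632 13634338 13611322 13061900 12760978 \
  | awk '{ s += $1 } END { printf "written=%d missing=%d\n", s, 320843454 - s }'
```

This reports 306,772,763 written (matching the sum quoted in the thread) and 14,070,691 rows unaccounted for at the point the query was killed.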