Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Andries Engelbrecht Thu, 05 Nov 2015 08:44:22 -0800

You can query each directory individually to see which one is troublesome.

Also till the changes Jason is discussing is made to handle schema change 
better I would suggest that you avoid * operations as a general rule of thumb, 
unless needed.


Since Drill has to manage schema on the fly * operations are a lot more 
involved than a traditional RDBMS where schema is known and the optimizer can 
easily manage it. A simple elect count(*) on a traditional RDBMS with stats is 
a very simple operation, where in Drill it can be a challenge as the schema is 
not necessarily known ahead of time.

For this reason I normally suggest that people use select count() only on a 
single field that is a simple data type and known not to change. Typically a 
key or id field is the best to use, this way you sidestep schema challenges and 
still get the result you want.

select dir0, count(id_field) from parqtable group by dir0; will be a much 
better option.


—Andries

> On Nov 5, 2015, at 8:16 AM, John Omernik <[email protected]> wrote:
> 
> I am getting the same error as before.  I am using the same query to create
> my tables, so I am unsure why it would see the schema as different. Is
> there way to view the schema for each directory to see what Drill thinks is
> happening?
> 
> On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> [email protected]> wrote:
> 
>> Have you tried to create a view on top of parqtable, and then use the view?
>> 
>> I did a quick experiment that may help you.
>> 
>> Created a dir structure as follows (using csv files - what I had on hand)
>> orders
>> |____month1
>> |____month2
>> .
>> .
>> |____month9
>> 
>> create or replace view dfs.views.ordersdir as select dir0, * from
>> dfs.data.`/orders`;
>> 
>> 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
>> 
>> +---------+-------------------------------------------------------------------+---------+
>> |  dir0   |                              columns
>>    |  dir00  |
>> 
>> +---------+-------------------------------------------------------------------+---------+
>> | month1  | ["8000","January","2014-01-02
>> 23:06:07","15349","ri","203","26"]  | month1  |
>> | month1  | ["8001","January","2014-01-31
>> 03:40:21","16033","oh","113","23"]  | month1  |
>> 
>> +---------+-------------------------------------------------------------------+————+
>> 
>> (interesting to note dir00 in the output)
>> 
>> But the group by on the dir works as it is a column defined in the view
>> 
>> 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by
>> dir0;
>> +---------+---------+
>> |  dir0   | EXPR$1  |
>> +---------+---------+
>> | month1  | 9000    |
>> | month6  | 16000   |
>> | month9  | 15131   |
>> | month2  | 12000   |
>> | month3  | 16000   |
>> | month7  | 15573   |
>> | month5  | 16000   |
>> | month4  | 12000   |
>> | month8  | 15390   |
>> +---------+————+
>> 
>> See if that works in your case.
>> 
>> 
>> —Andries
>> 
>> 
>>> On Nov 5, 2015, at 6:48 AM, John Omernik <[email protected]> wrote:
>>> 
>>> Hey all,  to facilitate loading of some data from JSON to Parquet, I am
>>> using the the load into "day" based directories...
>>> 
>>> parqtable
>>> |
>>> |_______2015-11-01
>>> |
>>> |_______2015-11-02
>>> |
>>> |_______2015-11-03
>>> |
>>> |_______2015-11-04
>>> 
>>> That way I can do select * from `parqtable` where dir0 = '2015-11-01' and
>>> other cool tricks. It also helps my data loading.
>>> 
>>> I am using the exact same query to load each day.
>>> 
>>> CREATE TABLE `parqtable/2015-11-01' as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-01')
>>> 
>>> CREATE TABLE `parqtable/2015-11-02' as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-02')
>>> 
>>> CREATE TABLE `parqtable/2015-11-03' as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-03')
>>> 
>>> Etc
>>> 
>>> This seams to work well except for one thing:
>>> 
>>> If I want to see the count per directory, this (what I thought was
>> obvious)
>>> query:
>>> 
>>> select dir0, count(*) from `parqtable` group by dir0
>>> 
>>> fails with
>>> 
>>> Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
>> schema
>>> changes
>>> 
>>> Fragment: 2:8
>>> 
>>> 
>>> I am not sure why this would be the case, the data is loaded by the same
>>> query, I would assume the schema is the same....
>>> 
>>> Thoughts on how to troubleshoot?
>>> 
>>> Thanks!
>>> 
>>> John
>> 
>>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Reply via email to