You can query each directory individually to see which one is troublesome. Also, until the changes Jason is discussing are made to handle schema changes better, I would suggest that you avoid * operations as a general rule of thumb, unless needed.
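For example, something along these lines (the dfs workspace name and paths are illustrative here, adjust to your storage plugin configuration):

select * from dfs.`/parqtable/2015-11-01` limit 1;
select * from dfs.`/parqtable/2015-11-02` limit 1;
select * from dfs.`/parqtable/2015-11-03` limit 1;

Comparing the columns and types each directory returns should show which one differs from the rest.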
Since Drill has to manage schema on the fly, * operations are a lot more involved than in a traditional RDBMS, where the schema is known and the optimizer can easily manage it. A simple select count(*) on a traditional RDBMS with stats is a very simple operation, whereas in Drill it can be a challenge because the schema is not necessarily known ahead of time. For this reason I normally suggest that people use select count() only on a single field that is a simple data type and known not to change. Typically a key or id field is best to use; this way you sidestep schema challenges and still get the result you want.

select dir0, count(id_field) from parqtable group by dir0;

will be a much better option.

—Andries

> On Nov 5, 2015, at 8:16 AM, John Omernik <[email protected]> wrote:
>
> I am getting the same error as before. I am using the same query to create
> my tables, so I am unsure why it would see the schema as different. Is
> there a way to view the schema for each directory to see what Drill thinks
> is happening?
>
> On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> [email protected]> wrote:
>
>> Have you tried to create a view on top of parqtable, and then use the view?
>>
>> I did a quick experiment that may help you.
>>
>> Created a dir structure as follows (using csv files - what I had on hand):
>>
>> orders
>> |____month1
>> |____month2
>> .
>> .
>> |____month9
>>
>> create or replace view dfs.views.ordersdir as select dir0, * from
>> dfs.data.`/orders`;
>>
>> 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
>> +---------+-------------------------------------------------------------------+---------+
>> |  dir0   |                              columns                              |  dir00  |
>> +---------+-------------------------------------------------------------------+---------+
>> | month1  | ["8000","January","2014-01-02 23:06:07","15349","ri","203","26"]  | month1  |
>> | month1  | ["8001","January","2014-01-31 03:40:21","16033","oh","113","23"]  | month1  |
>> +---------+-------------------------------------------------------------------+---------+
>>
>> (Interesting to note dir00 in the output.)
>>
>> But the group by on the dir works, as it is a column defined in the view:
>>
>> 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by dir0;
>> +---------+---------+
>> |  dir0   | EXPR$1  |
>> +---------+---------+
>> | month1  | 9000    |
>> | month6  | 16000   |
>> | month9  | 15131   |
>> | month2  | 12000   |
>> | month3  | 16000   |
>> | month7  | 15573   |
>> | month5  | 16000   |
>> | month4  | 12000   |
>> | month8  | 15390   |
>> +---------+---------+
>>
>> See if that works in your case.
>>
>> —Andries
>>
>>
>>> On Nov 5, 2015, at 6:48 AM, John Omernik <[email protected]> wrote:
>>>
>>> Hey all, to facilitate loading of some data from JSON to Parquet, I am
>>> loading into "day"-based directories:
>>>
>>> parqtable
>>> |
>>> |_______2015-11-01
>>> |
>>> |_______2015-11-02
>>> |
>>> |_______2015-11-03
>>> |
>>> |_______2015-11-04
>>>
>>> That way I can do select * from `parqtable` where dir0 = '2015-11-01' and
>>> other cool tricks. It also helps my data loading.
>>>
>>> I am using the exact same query to load each day.
>>>
>>> CREATE TABLE `parqtable/2015-11-01` as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-01')
>>>
>>> CREATE TABLE `parqtable/2015-11-02` as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-02')
>>>
>>> CREATE TABLE `parqtable/2015-11-03` as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-03')
>>>
>>> Etc.
>>>
>>> This seems to work well except for one thing:
>>>
>>> If I want to see the count per directory, this (what I thought was obvious)
>>> query:
>>>
>>> select dir0, count(*) from `parqtable` group by dir0
>>>
>>> fails with
>>>
>>> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema
>>> changes
>>>
>>> Fragment: 2:8
>>>
>>> I am not sure why this would be the case; the data is loaded by the same
>>> query, so I would assume the schema is the same.
>>>
>>> Thoughts on how to troubleshoot?
>>>
>>> Thanks!
>>>
>>> John
>>
>>
