This works: select dir0, count(1) from table group by dir0
This does not: select dir0, count(*) from table group by dir0 Thanks for the tip on count(1) John On Thu, Nov 5, 2015 at 7:40 PM, Jacques Nadeau <[email protected]> wrote: > In general, * is more complicated. However, count(*) is actually rewritten > as count(1) and is generally the most efficient way to do a count. (Test > reader might have a bug here which makes take longer (read all data instead > of just row count). > > -- > Jacques Nadeau > CTO and Co-Founder, Dremio > > On Thu, Nov 5, 2015 at 8:43 AM, Andries Engelbrecht < > [email protected]> wrote: > > > You can query each directory individually to see which one is > troublesome. > > > > Also till the changes Jason is discussing is made to handle schema change > > better I would suggest that you avoid * operations as a general rule of > > thumb, unless needed. > > > > Since Drill has to manage schema on the fly * operations are a lot more > > involved than a traditional RDBMS where schema is known and the optimizer > > can easily manage it. A simple elect count(*) on a traditional RDBMS with > > stats is a very simple operation, where in Drill it can be a challenge as > > the schema is not necessarily known ahead of time. > > > > For this reason I normally suggest that people use select count() only on > > a single field that is a simple data type and known not to change. > > Typically a key or id field is the best to use, this way you sidestep > > schema challenges and still get the result you want. > > > > select dir0, count(id_field) from parqtable group by dir0; will be a much > > better option. > > > > > > —Andries > > > > > On Nov 5, 2015, at 8:16 AM, John Omernik <[email protected]> wrote: > > > > > > I am getting the same error as before. I am using the same query to > > create > > > my tables, so I am unsure why it would see the schema as different. Is > > > there way to view the schema for each directory to see what Drill > thinks > > is > > > happening? > > > > > > On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht < > > > [email protected]> wrote: > > > > > >> Have you tried to create a view on top of parqtable, and then use the > > view? > > >> > > >> I did a quick experiment that may help you. > > >> > > >> Created a dir structure as follows (using csv files - what I had on > > hand) > > >> orders > > >> |____month1 > > >> |____month2 > > >> . > > >> . > > >> |____month9 > > >> > > >> create or replace view dfs.views.ordersdir as select dir0, * from > > >> dfs.data.`/orders`; > > >> > > >> 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2; > > >> > > >> > > > +---------+-------------------------------------------------------------------+---------+ > > >> | dir0 | columns > > >> | dir00 | > > >> > > >> > > > +---------+-------------------------------------------------------------------+---------+ > > >> | month1 | ["8000","January","2014-01-02 > > >> 23:06:07","15349","ri","203","26"] | month1 | > > >> | month1 | ["8001","January","2014-01-31 > > >> 03:40:21","16033","oh","113","23"] | month1 | > > >> > > >> > > > +---------+-------------------------------------------------------------------+————+ > > >> > > >> (interesting to note dir00 in the output) > > >> > > >> But the group by on the dir works as it is a column defined in the > view > > >> > > >> 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group > by > > >> dir0; > > >> +---------+---------+ > > >> | dir0 | EXPR$1 | > > >> +---------+---------+ > > >> | month1 | 9000 | > > >> | month6 | 16000 | > > >> | month9 | 15131 | > > >> | month2 | 12000 | > > >> | month3 | 16000 | > > >> | month7 | 15573 | > > >> | month5 | 16000 | > > >> | month4 | 12000 | > > >> | month8 | 15390 | > > >> +---------+————+ > > >> > > >> See if that works in your case. > > >> > > >> > > >> —Andries > > >> > > >> > > >>> On Nov 5, 2015, at 6:48 AM, John Omernik <[email protected]> wrote: > > >>> > > >>> Hey all, to facilitate loading of some data from JSON to Parquet, I > am > > >>> using the the load into "day" based directories... > > >>> > > >>> parqtable > > >>> | > > >>> |_______2015-11-01 > > >>> | > > >>> |_______2015-11-02 > > >>> | > > >>> |_______2015-11-03 > > >>> | > > >>> |_______2015-11-04 > > >>> > > >>> That way I can do select * from `parqtable` where dir0 = '2015-11-01' > > and > > >>> other cool tricks. It also helps my data loading. > > >>> > > >>> I am using the exact same query to load each day. > > >>> > > >>> CREATE TABLE `parqtable/2015-11-01' as > > >>> (select field1, field2, field3, field4 from jsontable where dir0 = > > >>> '2015-11-01') > > >>> > > >>> CREATE TABLE `parqtable/2015-11-02' as > > >>> (select field1, field2, field3, field4 from jsontable where dir0 = > > >>> '2015-11-02') > > >>> > > >>> CREATE TABLE `parqtable/2015-11-03' as > > >>> (select field1, field2, field3, field4 from jsontable where dir0 = > > >>> '2015-11-03') > > >>> > > >>> Etc > > >>> > > >>> This seams to work well except for one thing: > > >>> > > >>> If I want to see the count per directory, this (what I thought was > > >> obvious) > > >>> query: > > >>> > > >>> select dir0, count(*) from `parqtable` group by dir0 > > >>> > > >>> fails with > > >>> > > >>> Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support > > >> schema > > >>> changes > > >>> > > >>> Fragment: 2:8 > > >>> > > >>> > > >>> I am not sure why this would be the case, the data is loaded by the > > same > > >>> query, I would assume the schema is the same.... > > >>> > > >>> Thoughts on how to troubleshoot? > > >>> > > >>> Thanks! > > >>> > > >>> John > > >> > > >> > > > > >
