Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

John Omernik Fri, 06 Nov 2015 07:05:03 -0800

This works:

select dir0, count(1) from table group by dir0


This does not:

select dir0, count(*) from table group by dir0

Thanks for the tip on count(1)

John


On Thu, Nov 5, 2015 at 7:40 PM, Jacques Nadeau <[email protected]> wrote:

> In general, * is more complicated. However, count(*) is actually rewritten
> as count(1) and is generally the most efficient way to do a count. (Test
> reader might have a bug here which makes take longer (read all data instead
> of just row count).
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Nov 5, 2015 at 8:43 AM, Andries Engelbrecht <
> [email protected]> wrote:
>
> > You can query each directory individually to see which one is
> troublesome.
> >
> > Also till the changes Jason is discussing is made to handle schema change
> > better I would suggest that you avoid * operations as a general rule of
> > thumb, unless needed.
> >
> > Since Drill has to manage schema on the fly * operations are a lot more
> > involved than a traditional RDBMS where schema is known and the optimizer
> > can easily manage it. A simple elect count(*) on a traditional RDBMS with
> > stats is a very simple operation, where in Drill it can be a challenge as
> > the schema is not necessarily known ahead of time.
> >
> > For this reason I normally suggest that people use select count() only on
> > a single field that is a simple data type and known not to change.
> > Typically a key or id field is the best to use, this way you sidestep
> > schema challenges and still get the result you want.
> >
> > select dir0, count(id_field) from parqtable group by dir0; will be a much
> > better option.
> >
> >
> > —Andries
> >
> > > On Nov 5, 2015, at 8:16 AM, John Omernik <[email protected]> wrote:
> > >
> > > I am getting the same error as before.  I am using the same query to
> > create
> > > my tables, so I am unsure why it would see the schema as different. Is
> > > there way to view the schema for each directory to see what Drill
> thinks
> > is
> > > happening?
> > >
> > > On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> > > [email protected]> wrote:
> > >
> > >> Have you tried to create a view on top of parqtable, and then use the
> > view?
> > >>
> > >> I did a quick experiment that may help you.
> > >>
> > >> Created a dir structure as follows (using csv files - what I had on
> > hand)
> > >> orders
> > >> |____month1
> > >> |____month2
> > >> .
> > >> .
> > >> |____month9
> > >>
> > >> create or replace view dfs.views.ordersdir as select dir0, * from
> > >> dfs.data.`/orders`;
> > >>
> > >> 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
> > >>
> > >>
> >
> +---------+-------------------------------------------------------------------+---------+
> > >> |  dir0   |                              columns
> > >>    |  dir00  |
> > >>
> > >>
> >
> +---------+-------------------------------------------------------------------+---------+
> > >> | month1  | ["8000","January","2014-01-02
> > >> 23:06:07","15349","ri","203","26"]  | month1  |
> > >> | month1  | ["8001","January","2014-01-31
> > >> 03:40:21","16033","oh","113","23"]  | month1  |
> > >>
> > >>
> >
> +---------+-------------------------------------------------------------------+————+
> > >>
> > >> (interesting to note dir00 in the output)
> > >>
> > >> But the group by on the dir works as it is a column defined in the
> view
> > >>
> > >> 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group
> by
> > >> dir0;
> > >> +---------+---------+
> > >> |  dir0   | EXPR$1  |
> > >> +---------+---------+
> > >> | month1  | 9000    |
> > >> | month6  | 16000   |
> > >> | month9  | 15131   |
> > >> | month2  | 12000   |
> > >> | month3  | 16000   |
> > >> | month7  | 15573   |
> > >> | month5  | 16000   |
> > >> | month4  | 12000   |
> > >> | month8  | 15390   |
> > >> +---------+————+
> > >>
> > >> See if that works in your case.
> > >>
> > >>
> > >> —Andries
> > >>
> > >>
> > >>> On Nov 5, 2015, at 6:48 AM, John Omernik <[email protected]> wrote:
> > >>>
> > >>> Hey all,  to facilitate loading of some data from JSON to Parquet, I
> am
> > >>> using the the load into "day" based directories...
> > >>>
> > >>> parqtable
> > >>> |
> > >>> |_______2015-11-01
> > >>> |
> > >>> |_______2015-11-02
> > >>> |
> > >>> |_______2015-11-03
> > >>> |
> > >>> |_______2015-11-04
> > >>>
> > >>> That way I can do select * from `parqtable` where dir0 = '2015-11-01'
> > and
> > >>> other cool tricks. It also helps my data loading.
> > >>>
> > >>> I am using the exact same query to load each day.
> > >>>
> > >>> CREATE TABLE `parqtable/2015-11-01' as
> > >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> > >>> '2015-11-01')
> > >>>
> > >>> CREATE TABLE `parqtable/2015-11-02' as
> > >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> > >>> '2015-11-02')
> > >>>
> > >>> CREATE TABLE `parqtable/2015-11-03' as
> > >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> > >>> '2015-11-03')
> > >>>
> > >>> Etc
> > >>>
> > >>> This seams to work well except for one thing:
> > >>>
> > >>> If I want to see the count per directory, this (what I thought was
> > >> obvious)
> > >>> query:
> > >>>
> > >>> select dir0, count(*) from `parqtable` group by dir0
> > >>>
> > >>> fails with
> > >>>
> > >>> Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
> > >> schema
> > >>> changes
> > >>>
> > >>> Fragment: 2:8
> > >>>
> > >>>
> > >>> I am not sure why this would be the case, the data is loaded by the
> > same
> > >>> query, I would assume the schema is the same....
> > >>>
> > >>> Thoughts on how to troubleshoot?
> > >>>
> > >>> Thanks!
> > >>>
> > >>> John
> > >>
> > >>
> >
> >
>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Reply via email to