Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

John Omernik Thu, 05 Nov 2015 08:41:03 -0800

Is there a way to see what Drill thinks the schema is for a group a files
(a show schema or something similar)


Thanks


On Thu, Nov 5, 2015 at 10:34 AM, Jason Altekruse <[email protected]>
wrote:

> It is possible to introduce a schema change when you are writing parquet
> from JSON files. If you read part of the JSON that only contains some
> fields, and later see a new field, after we have written a few batches of
> data to the parquet file, we will close the current file and open a new
> one. Parquet currently requires that all columns be known when you start
> writing.
>
> Unfortunately this means that when you query the directory you would see
> schema changes. This is my best guess as to why you are seeing this
> behavior. We have a number of tasks outstanding around handling schema
> change in different parts of the engine, so we will be working to remove
> this restriction. If you need a workaround today, you could try to figure
> out what the complete schema in your JSON will be and just put one record
> at the top of each file that has data in each field that could possibly
> appear in the dataset later. This will tell all of the JSON readers the
> full schema at the start. There currently is no way to provide this
> information to the reader without putting it in the JSON data itself.
>
> On Thu, Nov 5, 2015 at 8:16 AM, John Omernik <[email protected]> wrote:
>
> > I am getting the same error as before.  I am using the same query to
> create
> > my tables, so I am unsure why it would see the schema as different. Is
> > there way to view the schema for each directory to see what Drill thinks
> is
> > happening?
> >
> > On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> > [email protected]> wrote:
> >
> > > Have you tried to create a view on top of parqtable, and then use the
> > view?
> > >
> > > I did a quick experiment that may help you.
> > >
> > > Created a dir structure as follows (using csv files - what I had on
> hand)
> > > orders
> > > |____month1
> > > |____month2
> > > .
> > > .
> > > |____month9
> > >
> > > create or replace view dfs.views.ordersdir as select dir0, * from
> > > dfs.data.`/orders`;
> > >
> > > 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
> > >
> > >
> >
> +---------+-------------------------------------------------------------------+---------+
> > > |  dir0   |                              columns
> > >     |  dir00  |
> > >
> > >
> >
> +---------+-------------------------------------------------------------------+---------+
> > > | month1  | ["8000","January","2014-01-02
> > > 23:06:07","15349","ri","203","26"]  | month1  |
> > > | month1  | ["8001","January","2014-01-31
> > > 03:40:21","16033","oh","113","23"]  | month1  |
> > >
> > >
> >
> +---------+-------------------------------------------------------------------+————+
> > >
> > > (interesting to note dir00 in the output)
> > >
> > > But the group by on the dir works as it is a column defined in the view
> > >
> > > 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by
> > > dir0;
> > > +---------+---------+
> > > |  dir0   | EXPR$1  |
> > > +---------+---------+
> > > | month1  | 9000    |
> > > | month6  | 16000   |
> > > | month9  | 15131   |
> > > | month2  | 12000   |
> > > | month3  | 16000   |
> > > | month7  | 15573   |
> > > | month5  | 16000   |
> > > | month4  | 12000   |
> > > | month8  | 15390   |
> > > +---------+————+
> > >
> > > See if that works in your case.
> > >
> > >
> > > —Andries
> > >
> > >
> > > > On Nov 5, 2015, at 6:48 AM, John Omernik <[email protected]> wrote:
> > > >
> > > > Hey all,  to facilitate loading of some data from JSON to Parquet, I
> am
> > > > using the the load into "day" based directories...
> > > >
> > > > parqtable
> > > > |
> > > > |_______2015-11-01
> > > > |
> > > > |_______2015-11-02
> > > > |
> > > > |_______2015-11-03
> > > > |
> > > > |_______2015-11-04
> > > >
> > > > That way I can do select * from `parqtable` where dir0 = '2015-11-01'
> > and
> > > > other cool tricks. It also helps my data loading.
> > > >
> > > > I am using the exact same query to load each day.
> > > >
> > > > CREATE TABLE `parqtable/2015-11-01' as
> > > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > > '2015-11-01')
> > > >
> > > > CREATE TABLE `parqtable/2015-11-02' as
> > > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > > '2015-11-02')
> > > >
> > > > CREATE TABLE `parqtable/2015-11-03' as
> > > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > > '2015-11-03')
> > > >
> > > > Etc
> > > >
> > > > This seams to work well except for one thing:
> > > >
> > > > If I want to see the count per directory, this (what I thought was
> > > obvious)
> > > > query:
> > > >
> > > > select dir0, count(*) from `parqtable` group by dir0
> > > >
> > > > fails with
> > > >
> > > > Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
> > > schema
> > > > changes
> > > >
> > > > Fragment: 2:8
> > > >
> > > >
> > > > I am not sure why this would be the case, the data is loaded by the
> > same
> > > > query, I would assume the schema is the same....
> > > >
> > > > Thoughts on how to troubleshoot?
> > > >
> > > > Thanks!
> > > >
> > > > John
> > >
> > >
> >
>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Reply via email to