Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Jason Altekruse Thu, 05 Nov 2015 08:35:11 -0800

It is possible to introduce a schema change when you are writing parquet
from JSON files. If you read part of the JSON that only contains some
fields, and later see a new field, after we have written a few batches of
data to the parquet file, we will close the current file and open a new
one. Parquet currently requires that all columns be known when you start
writing.


Unfortunately this means that when you query the directory you would see
schema changes. This is my best guess as to why you are seeing this
behavior. We have a number of tasks outstanding around handling schema
change in different parts of the engine, so we will be working to remove
this restriction. If you need a workaround today, you could try to figure
out what the complete schema in your JSON will be and just put one record
at the top of each file that has data in each field that could possibly
appear in the dataset later. This will tell all of the JSON readers the
full schema at the start. There currently is no way to provide this
information to the reader without putting it in the JSON data itself.

On Thu, Nov 5, 2015 at 8:16 AM, John Omernik <[email protected]> wrote:

> I am getting the same error as before.  I am using the same query to create
> my tables, so I am unsure why it would see the schema as different. Is
> there way to view the schema for each directory to see what Drill thinks is
> happening?
>
> On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> [email protected]> wrote:
>
> > Have you tried to create a view on top of parqtable, and then use the
> view?
> >
> > I did a quick experiment that may help you.
> >
> > Created a dir structure as follows (using csv files - what I had on hand)
> > orders
> > |____month1
> > |____month2
> > .
> > .
> > |____month9
> >
> > create or replace view dfs.views.ordersdir as select dir0, * from
> > dfs.data.`/orders`;
> >
> > 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
> >
> >
> +---------+-------------------------------------------------------------------+---------+
> > |  dir0   |                              columns
> >     |  dir00  |
> >
> >
> +---------+-------------------------------------------------------------------+---------+
> > | month1  | ["8000","January","2014-01-02
> > 23:06:07","15349","ri","203","26"]  | month1  |
> > | month1  | ["8001","January","2014-01-31
> > 03:40:21","16033","oh","113","23"]  | month1  |
> >
> >
> +---------+-------------------------------------------------------------------+————+
> >
> > (interesting to note dir00 in the output)
> >
> > But the group by on the dir works as it is a column defined in the view
> >
> > 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by
> > dir0;
> > +---------+---------+
> > |  dir0   | EXPR$1  |
> > +---------+---------+
> > | month1  | 9000    |
> > | month6  | 16000   |
> > | month9  | 15131   |
> > | month2  | 12000   |
> > | month3  | 16000   |
> > | month7  | 15573   |
> > | month5  | 16000   |
> > | month4  | 12000   |
> > | month8  | 15390   |
> > +---------+————+
> >
> > See if that works in your case.
> >
> >
> > —Andries
> >
> >
> > > On Nov 5, 2015, at 6:48 AM, John Omernik <[email protected]> wrote:
> > >
> > > Hey all,  to facilitate loading of some data from JSON to Parquet, I am
> > > using the the load into "day" based directories...
> > >
> > > parqtable
> > > |
> > > |_______2015-11-01
> > > |
> > > |_______2015-11-02
> > > |
> > > |_______2015-11-03
> > > |
> > > |_______2015-11-04
> > >
> > > That way I can do select * from `parqtable` where dir0 = '2015-11-01'
> and
> > > other cool tricks. It also helps my data loading.
> > >
> > > I am using the exact same query to load each day.
> > >
> > > CREATE TABLE `parqtable/2015-11-01' as
> > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > '2015-11-01')
> > >
> > > CREATE TABLE `parqtable/2015-11-02' as
> > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > '2015-11-02')
> > >
> > > CREATE TABLE `parqtable/2015-11-03' as
> > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > '2015-11-03')
> > >
> > > Etc
> > >
> > > This seams to work well except for one thing:
> > >
> > > If I want to see the count per directory, this (what I thought was
> > obvious)
> > > query:
> > >
> > > select dir0, count(*) from `parqtable` group by dir0
> > >
> > > fails with
> > >
> > > Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
> > schema
> > > changes
> > >
> > > Fragment: 2:8
> > >
> > >
> > > I am not sure why this would be the case, the data is loaded by the
> same
> > > query, I would assume the schema is the same....
> > >
> > > Thoughts on how to troubleshoot?
> > >
> > > Thanks!
> > >
> > > John
> >
> >
>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Reply via email to