It is possible to introduce a schema change when you are writing parquet from JSON files. If you read part of the JSON that only contains some fields, and later see a new field, after we have written a few batches of data to the parquet file, we will close the current file and open a new one. Parquet currently requires that all columns be known when you start writing.
Unfortunately this means that when you query the directory you would see schema changes. This is my best guess as to why you are seeing this behavior. We have a number of tasks outstanding around handling schema change in different parts of the engine, so we will be working to remove this restriction. If you need a workaround today, you could try to figure out what the complete schema in your JSON will be and just put one record at the top of each file that has data in each field that could possibly appear in the dataset later. This will tell all of the JSON readers the full schema at the start. There currently is no way to provide this information to the reader without putting it in the JSON data itself. On Thu, Nov 5, 2015 at 8:16 AM, John Omernik <[email protected]> wrote: > I am getting the same error as before. I am using the same query to create > my tables, so I am unsure why it would see the schema as different. Is > there way to view the schema for each directory to see what Drill thinks is > happening? > > On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht < > [email protected]> wrote: > > > Have you tried to create a view on top of parqtable, and then use the > view? > > > > I did a quick experiment that may help you. > > > > Created a dir structure as follows (using csv files - what I had on hand) > > orders > > |____month1 > > |____month2 > > . > > . > > |____month9 > > > > create or replace view dfs.views.ordersdir as select dir0, * from > > dfs.data.`/orders`; > > > > 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2; > > > > > +---------+-------------------------------------------------------------------+---------+ > > | dir0 | columns > > | dir00 | > > > > > +---------+-------------------------------------------------------------------+---------+ > > | month1 | ["8000","January","2014-01-02 > > 23:06:07","15349","ri","203","26"] | month1 | > > | month1 | ["8001","January","2014-01-31 > > 03:40:21","16033","oh","113","23"] | month1 | > > > > > +---------+-------------------------------------------------------------------+————+ > > > > (interesting to note dir00 in the output) > > > > But the group by on the dir works as it is a column defined in the view > > > > 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by > > dir0; > > +---------+---------+ > > | dir0 | EXPR$1 | > > +---------+---------+ > > | month1 | 9000 | > > | month6 | 16000 | > > | month9 | 15131 | > > | month2 | 12000 | > > | month3 | 16000 | > > | month7 | 15573 | > > | month5 | 16000 | > > | month4 | 12000 | > > | month8 | 15390 | > > +---------+————+ > > > > See if that works in your case. > > > > > > —Andries > > > > > > > On Nov 5, 2015, at 6:48 AM, John Omernik <[email protected]> wrote: > > > > > > Hey all, to facilitate loading of some data from JSON to Parquet, I am > > > using the the load into "day" based directories... > > > > > > parqtable > > > | > > > |_______2015-11-01 > > > | > > > |_______2015-11-02 > > > | > > > |_______2015-11-03 > > > | > > > |_______2015-11-04 > > > > > > That way I can do select * from `parqtable` where dir0 = '2015-11-01' > and > > > other cool tricks. It also helps my data loading. > > > > > > I am using the exact same query to load each day. > > > > > > CREATE TABLE `parqtable/2015-11-01' as > > > (select field1, field2, field3, field4 from jsontable where dir0 = > > > '2015-11-01') > > > > > > CREATE TABLE `parqtable/2015-11-02' as > > > (select field1, field2, field3, field4 from jsontable where dir0 = > > > '2015-11-02') > > > > > > CREATE TABLE `parqtable/2015-11-03' as > > > (select field1, field2, field3, field4 from jsontable where dir0 = > > > '2015-11-03') > > > > > > Etc > > > > > > This seams to work well except for one thing: > > > > > > If I want to see the count per directory, this (what I thought was > > obvious) > > > query: > > > > > > select dir0, count(*) from `parqtable` group by dir0 > > > > > > fails with > > > > > > Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support > > schema > > > changes > > > > > > Fragment: 2:8 > > > > > > > > > I am not sure why this would be the case, the data is loaded by the > same > > > query, I would assume the schema is the same.... > > > > > > Thoughts on how to troubleshoot? > > > > > > Thanks! > > > > > > John > > > > >
