Is there a way to see what Drill thinks the schema is for a group a files (a show schema or something similar)
Thanks On Thu, Nov 5, 2015 at 10:34 AM, Jason Altekruse <[email protected]> wrote: > It is possible to introduce a schema change when you are writing parquet > from JSON files. If you read part of the JSON that only contains some > fields, and later see a new field, after we have written a few batches of > data to the parquet file, we will close the current file and open a new > one. Parquet currently requires that all columns be known when you start > writing. > > Unfortunately this means that when you query the directory you would see > schema changes. This is my best guess as to why you are seeing this > behavior. We have a number of tasks outstanding around handling schema > change in different parts of the engine, so we will be working to remove > this restriction. If you need a workaround today, you could try to figure > out what the complete schema in your JSON will be and just put one record > at the top of each file that has data in each field that could possibly > appear in the dataset later. This will tell all of the JSON readers the > full schema at the start. There currently is no way to provide this > information to the reader without putting it in the JSON data itself. > > On Thu, Nov 5, 2015 at 8:16 AM, John Omernik <[email protected]> wrote: > > > I am getting the same error as before. I am using the same query to > create > > my tables, so I am unsure why it would see the schema as different. Is > > there way to view the schema for each directory to see what Drill thinks > is > > happening? > > > > On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht < > > [email protected]> wrote: > > > > > Have you tried to create a view on top of parqtable, and then use the > > view? > > > > > > I did a quick experiment that may help you. > > > > > > Created a dir structure as follows (using csv files - what I had on > hand) > > > orders > > > |____month1 > > > |____month2 > > > . > > > . > > > |____month9 > > > > > > create or replace view dfs.views.ordersdir as select dir0, * from > > > dfs.data.`/orders`; > > > > > > 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2; > > > > > > > > > +---------+-------------------------------------------------------------------+---------+ > > > | dir0 | columns > > > | dir00 | > > > > > > > > > +---------+-------------------------------------------------------------------+---------+ > > > | month1 | ["8000","January","2014-01-02 > > > 23:06:07","15349","ri","203","26"] | month1 | > > > | month1 | ["8001","January","2014-01-31 > > > 03:40:21","16033","oh","113","23"] | month1 | > > > > > > > > > +---------+-------------------------------------------------------------------+————+ > > > > > > (interesting to note dir00 in the output) > > > > > > But the group by on the dir works as it is a column defined in the view > > > > > > 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by > > > dir0; > > > +---------+---------+ > > > | dir0 | EXPR$1 | > > > +---------+---------+ > > > | month1 | 9000 | > > > | month6 | 16000 | > > > | month9 | 15131 | > > > | month2 | 12000 | > > > | month3 | 16000 | > > > | month7 | 15573 | > > > | month5 | 16000 | > > > | month4 | 12000 | > > > | month8 | 15390 | > > > +---------+————+ > > > > > > See if that works in your case. > > > > > > > > > —Andries > > > > > > > > > > On Nov 5, 2015, at 6:48 AM, John Omernik <[email protected]> wrote: > > > > > > > > Hey all, to facilitate loading of some data from JSON to Parquet, I > am > > > > using the the load into "day" based directories... > > > > > > > > parqtable > > > > | > > > > |_______2015-11-01 > > > > | > > > > |_______2015-11-02 > > > > | > > > > |_______2015-11-03 > > > > | > > > > |_______2015-11-04 > > > > > > > > That way I can do select * from `parqtable` where dir0 = '2015-11-01' > > and > > > > other cool tricks. It also helps my data loading. > > > > > > > > I am using the exact same query to load each day. > > > > > > > > CREATE TABLE `parqtable/2015-11-01' as > > > > (select field1, field2, field3, field4 from jsontable where dir0 = > > > > '2015-11-01') > > > > > > > > CREATE TABLE `parqtable/2015-11-02' as > > > > (select field1, field2, field3, field4 from jsontable where dir0 = > > > > '2015-11-02') > > > > > > > > CREATE TABLE `parqtable/2015-11-03' as > > > > (select field1, field2, field3, field4 from jsontable where dir0 = > > > > '2015-11-03') > > > > > > > > Etc > > > > > > > > This seams to work well except for one thing: > > > > > > > > If I want to see the count per directory, this (what I thought was > > > obvious) > > > > query: > > > > > > > > select dir0, count(*) from `parqtable` group by dir0 > > > > > > > > fails with > > > > > > > > Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support > > > schema > > > > changes > > > > > > > > Fragment: 2:8 > > > > > > > > > > > > I am not sure why this would be the case, the data is loaded by the > > same > > > > query, I would assume the schema is the same.... > > > > > > > > Thoughts on how to troubleshoot? > > > > > > > > Thanks! > > > > > > > > John > > > > > > > > >
