Re #7 in the original post "Select table syntax can specify constant values for one or more columns:" I would have assumed the select list can have any expressions that can be evaluated on a row from the source; that includes columns, expressions on columns, or constants. It's probably not your intent, but the stated form implies that all I get are column values and constants. Which is it?
On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid <baid.meh...@gmail.com> wrote: > I wanted to start a conversation around supporting the "Insert into Table" > feature. As of 1.2 we initially want to support inserting into a table with > Parquet files. Support for Json, CSV and other sources will follow as > future enhancements. > > Aman, Jinfeng, Neeraja and I had an initial discussion about this and > Neeraja provided a good summary of our discussion (pasted below) also > stating some of the requirements for this feature. > > A ) Support Insert into a non-partitioned table > --------------------- > > Ex: INSERT INTO T1 [col1, col2, col3] SELECT col4, col5, col6 from T2 > (Source table: T2, Target table T1) > Requirements: > > 1. Target table column list specification is optional for Insert statement > 2. When specified, the column list in the Insert statement should > contain all the columns present in the target table (i.e No support > for partial insert) > 3. The column names specified for the source table do not need to match > to the target table column names. Match is performed based on ordinal. > 4. # of Source table columns specified must be same as # of target > table columns > 5. Types of specified source table columns must match to the types of > target table columns > 6. Specification of * is not allowed in the Select table syntax > 7. Select table syntax can specify constant values for one or more columns > > > B ) Support insert into a partitioned table > ---------------------- > > Ex: INSERT INTO T1 col1, col2,col3 partition by col1,col2 SELECT > col4,col,col6 from T2 > > * Target column specification is required when inserting data into an > already partitioned table > * Requirements A.3-A.7 above apply for insert into partitioned tables > as well > * A partition by clause along with one or more columns is required > * All the columns specified in partition by clause must exist in the > target column list > * Partition by columns specified do not need to match to the list of > columns that the original table partitioned with (i.e if the > original table is partitioned with col1, col2, new data during > insert can be partitioned by col3 or just with col1 or col2..) > > > Couple of open questions from the design perspective are > > 1. How do we perform validation. Validation of data types, number of > columns being inserted etc. In addition to validation we need to make sure > that when we insert into an existing tables we insert data with the > existing column names (select column list can have different names). This > poses problems around needing to know the metadata at planning time, two > approaches that have been floating around are > * DotDrill files: We can store metadata, partitioning columns and > other useful information here and we can perform validation during planning > time. However the challenges with introducing DotDrill files include > - consistency between metadata and the actual data (Nothing > preventing users to copy files directly). > - security around DotDrill files (can be dealt in the same > way we perform security checks for drill tables in hdfs) > - interface to change the DotDrill file, in the case we need > to add a column to the table or add a new partition etc. > > * Explicit Syntax/ No metadata approach: Another approach is to > avoid DotDrill files and use explicit syntax to glean as much information > as possible from the SQL statement itself. Some of the challenges with this > approach are > - Gathering metadata information: Since we have no idea what > the existing schema is we would need to perform a "mini scan" to learn the > schema at planning time to be able to perform some validation. The problem > with this approach is how do we determine how many files we need to read in > order to learn the schema? If we use a sample set and not all the files > have the same schema, > we could have non-deterministic results based on the > sample of files read. Also reading all the files and merging the schema > seems like an expensive cost to pay. > - From the user's perspective, while inserting into a > partitioned table, user will have to specify the partitioning columns again > in the Insert statement, despite having specified the partition columns in > the CTAS. > > 2. What is a reasonable assumption for a Drill table in terms of changing > schema. Having the same exact schema for all files in a table is too rigid > an assumption at this point? > > One thing to remember with DotDrill file is to also the repercussions on > Drop table, Show tables, Describe table etc. i.e. it might make it easier > to be able to support these operations. > > Thanks > Mehant >