Re: [DISCUSS] Insert into Table support
Re #7 in the original post ("Select table syntax can specify constant values for one or more columns"): I would have assumed the select list can contain any expression that can be evaluated on a row from the source; that includes columns, expressions on columns, and constants. It's probably not your intent, but the stated form implies that all I get are column values and constants. Which is it?
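To make the two readings concrete, here is a minimal sketch using the thread's T1/T2 tables and the proposed (not yet implemented) INSERT syntax; whether the second statement would be accepted is exactly the question above.

```sql
-- Narrow reading of A.7: only plain columns and constants in the select list.
INSERT INTO T1 col1, col2, col3
SELECT col4, col5, 0 FROM T2;

-- Broad reading: any expression evaluable on a source row.
INSERT INTO T1 col1, col2, col3
SELECT col4, col5 + 10, upper(col6) FROM T2;
```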
On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid baid.meh...@gmail.com wrote:
Re: [DISCUSS] Insert into Table support
@Jacques.. Since the topic of metadata caching is closely related: the main issue is how the metadata file is maintained in the presence of concurrent INSERTs or concurrent REFRESH METADATA operations. One could maintain multiple versions of the metadata file, or keep the version information inside a single metadata file. Is that what you were thinking when you mention Lucene's index versioning? (I am not familiar with Lucene's versioning.)

Aman

On Wed, Aug 5, 2015 at 12:02 PM, Chris Westin chriswesti...@gmail.com wrote:
Re: [DISCUSS] Insert into Table support
I thought I'd add my two cents based on my effort with Parquet pushdown filtering. It sounds like merging schemas is going to be pretty daunting, considering the work planned around embedded types and considering Parquet doesn't support those. Furthermore, metadata sounds like it's going to be fairly specific to each storage type. In the Parquet case, it's very beneficial to have statistics for each column in a file, so that you can easily filter out files that clearly won't match a filter (which is what I did in the pushdown implementation). This is the shortcoming of the current metadata implementation (it doesn't include that information), so you end up planning batches out to many Drillbits, often with a lot of those batches entirely skipped due to mismatching filters. I completely agree with Jacques's point about not changing the outcome of a query when the dot drill/metadata files are absent - their absence should simply make the query more expensive.
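A sketch of the pruning this enables, with a hypothetical table and column: if the metadata recorded per-file min/max values for order_total, the planner could drop every file whose range cannot satisfy the predicate, instead of shipping batches to Drillbits that immediately filter them out.

```sql
-- Hypothetical example. With per-file statistics, a Parquet file whose
-- recorded order_total range is, say, [5, 90] can be pruned at planning
-- time for this query; without statistics it must still be scanned.
SELECT order_id, order_total
FROM dfs.sales.orders
WHERE order_total > 100;
```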
On Thu, Aug 6, 2015 at 8:01 AM, Aman Sinha asi...@maprtech.com wrote:
Re: [DISCUSS] Insert into Table support
Some thoughts around concurrency for INSERTs and SELECTs. For the consistency model we could aim to provide something close to snapshot isolation: only one INSERT to a particular table is allowed at a time, and SELECT queries run concurrently while the INSERT is going on. In reality there is a finer granularity of steps.

For an INSERT operation (a rough SQL-level sketch follows this message):
1. Create a temp directory (or hidden directory) to store the files - say foo/tmpdir for the table foo.
2. Start writing files to foo/tmpdir.
3. When complete, rename foo/tmpdir to indicate that it is available for queries. --- Q: is there a need for short-term directory-level locking?
4. Call REFRESH METADATA, which merges the existing schema with the new files and collects stats on the new files.
4(a). During this process, a metadata write lock needs to be acquired.

For any query that is running concurrently with the INSERT, we should only block for step 4(a). This is a relatively short period of time compared to the overall elapsed time of the INSERT. The SELECT queries themselves have the following steps:
1. Query planning
1(a). Read the table metadata (presumably from the metadata file) by first acquiring a read lock on the metadata, reading the metadata, and releasing the lock.
1(b). Perform query optimization.
2. Query execution and returning results.

The only blocking step is 1(a).

Q: is it possible to completely avoid any locks, and what would be the consistency model that we could then claim? We are not building a transactional system, but we do need well-defined, deterministic behavior in the presence of concurrency.
Q: if we do acquire locks, what should their nature be... are they persistent, etc.?

Aman
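The flow above, seen from the SQL surface, might look like the following. The INSERT syntax is this thread's proposal (not an implemented statement), REFRESH TABLE METADATA is Drill's existing metadata-cache command, and the staging-directory steps happen inside the writer:

```sql
-- Steps 1-3 happen inside the writer: files land in foo/tmpdir and the
-- directory is renamed into place once the write completes.
INSERT INTO dfs.tmp.foo
SELECT col4, col5, col6 FROM dfs.tmp.bar;

-- Step 4: fold the new files into the cached metadata; per 4(a), the short
-- metadata write lock would be held only for the duration of this command.
REFRESH TABLE METADATA dfs.tmp.foo;
```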
On Thu, Jul 30, 2015 at 9:28 AM, Aman Sinha asi...@maprtech.com wrote:
Re: [DISCUSS] Insert into Table support
Two more observations:
1. I would think the merged schema should be deterministic and not depend on the order in which the files are scanned. How can this be enforced - timestamp order, lexicographical order? The files are read in parallel, so we have to think this through.
2. Due to the merging of schemas, a column that was of REQUIRED type in the original Parquet file may end up having NULL values after merging. If we do a CTAS of the merged schema, this column will be of OPTIONAL type in Parquet. Is this acceptable?
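A small sketch of observation 2, with hypothetical paths: suppose one Parquet file under mixed_tbl declares column x as REQUIRED while another file lacks x entirely. The merged schema must make x nullable, so a CTAS over it writes x as OPTIONAL.

```sql
-- Hypothetical: file1.parquet has x (REQUIRED) and y; file2.parquet has
-- only y. Rows from file2 surface x as NULL, so the result is nullable.
CREATE TABLE dfs.tmp.copied_tbl AS
SELECT x, y FROM dfs.tmp.mixed_tbl;   -- x is written as OPTIONAL here
```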
On Wed, Jul 29, 2015 at 7:42 PM, Jacques Nadeau jacq...@dremio.com wrote:
Re: [DISCUSS] Insert into Table support
One comment... I think this is an overreach: "The union schema is saved in a .drill file". There is no point in saving something derived in a .drill file. Maybe a cache file (extending the other work on metadata caching). The idea for .drill files is that they are supplementary to what can be acquired from the data directly.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Jul 29, 2015 at 6:58 PM, Parth Chandra par...@apache.org wrote:
Re: [DISCUSS] Insert into Table support
Notes based on the discussion in the hangout, Jacques' notes, and later discussion:

1. For Parquet files:
   - Merge schema information into a union schema.
   - The union schema is saved in a .drill file.
   - The union schema can be cached.
   - The union schema should maintain the ordinal position of fields.
2. At the time the table is read, we can check the directory timestamp and regenerate the merged schema.
3. Insert into does not need to update the merged schema. (Should it check whether an incompatible schema is being written? An incompatible schema is one that cannot be merged successfully.)
   - Suggestion 1: allow a flexible insert (issue a warning if there is going to be a schema change conflict). Queries on columns with no conflict can still run.
   - Suggestion 2: do not allow the insert unless the schema can be merged.
4. The rules for merging can be relaxed as we improve our ability to promote from one type to another.
5. The concept of a schema change is different from a merged schema. Schema change is needed for reading schemaless sources like JSON.
6. If the table's files differ in column order (or in the number of columns), an insert based on ordinal positions cannot be allowed.
7. Suggestion: go with a name-based insert (sketched below). Ordinal-based insert could be enabled via a switch (?).
8. Reuse partitioning information from 'create table' in 'insert into'. No partitioning info is needed in the 'insert into' statement.
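A sketch of what the name-based insert in point 7 might look like, with hypothetical syntax: the select-list aliases, rather than ordinal position, decide which target column each expression feeds, so column order in the select list stops mattering.

```sql
-- Hypothetical name-based matching: aliases bind to target columns by
-- name, so this select list may appear in any order.
INSERT INTO T1
SELECT col6 AS col3, col4 AS col1, col5 AS col2 FROM T2;
```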
On Mon, Jul 27, 2015 at 6:43 PM, Khurram Faraaz kfar...@maprtech.com wrote:
Re: [DISCUSS] Insert into Table support
I have a question related to insert into partitioned tables. I see that DB2, Oracle, and Postgres do not support a PARTITION BY clause in their INSERT INTO implementations. MySQL, however, does support explicit partition selection (a PARTITION clause) in its INSERT INTO implementation. I do not know what the ANSI SQL standard says about the INSERT INTO statement; we may want to keep it compliant with standard SQL syntax.
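For reference, MySQL's clause names pre-defined partitions explicitly rather than listing partitioning columns; the second statement below is this thread's proposed Drill form (hypothetical syntax), shown for contrast.

```sql
-- MySQL (5.6+): route rows into an explicitly named, pre-defined partition.
INSERT INTO t1 PARTITION (p0) VALUES (1, 'a');

-- Proposed Drill form from this thread (not implemented):
INSERT INTO T1 col1, col2, col3 PARTITION BY col1, col2
SELECT col4, col5, col6 FROM T2;
```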
On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid baid.meh...@gmail.com wrote:
[DISCUSS] Insert into Table support
I wanted to start a conversation around supporting the Insert into Table feature. As of 1.2 we initially want to support inserting into a table backed by Parquet files; support for JSON, CSV, and other sources will follow as future enhancements. Aman, Jinfeng, Neeraja and I had an initial discussion about this, and Neeraja provided a good summary of our discussion (pasted below), also stating some of the requirements for this feature.

A) Support insert into a non-partitioned table.
Ex: INSERT INTO T1 [col1, col2, col3] SELECT col4, col5, col6 FROM T2 (source table: T2, target table: T1)
Requirements:
1. The target table column list specification is optional for the Insert statement.
2. When specified, the column list in the Insert statement should contain all the columns present in the target table (i.e., no support for partial insert).
3. The column names specified for the source table do not need to match the target table column names; matching is performed by ordinal.
4. The number of source table columns specified must be the same as the number of target table columns.
5. The types of the specified source table columns must match the types of the target table columns.
6. Specification of * is not allowed in the Select table syntax.
7. The Select table syntax can specify constant values for one or more columns.

B) Support insert into a partitioned table.
Ex: INSERT INTO T1 col1, col2, col3 PARTITION BY col1, col2 SELECT col4, col5, col6 FROM T2
* Target column specification is required when inserting data into an already partitioned table.
* Requirements A.3-A.7 above apply to inserts into partitioned tables as well.
* A PARTITION BY clause with one or more columns is required.
* All the columns specified in the PARTITION BY clause must exist in the target column list.
* The PARTITION BY columns do not need to match the columns the original table was partitioned with (i.e., if the original table is partitioned by col1, col2, new data during an insert can be partitioned by col3, or by just col1 or col2).
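A few worked examples of these rules, using the T1/T2 names above and the sketched (not yet implemented) syntax:

```sql
-- A.3/A.4: names need not match; binding is by ordinal, and counts must agree.
INSERT INTO T1 col1, col2, col3
SELECT col4, col5, col6 FROM T2;            -- OK: three columns into three

-- A.2: no partial insert; every target column must be covered.
INSERT INTO T1 col1, col2
SELECT col4, col5 FROM T2;                  -- rejected: col3 is missing

-- A.6: * is not allowed in the select list.
INSERT INTO T1 SELECT * FROM T2;            -- rejected

-- A.7: constants may stand in for one or more columns.
INSERT INTO T1 SELECT col4, col5, 0 FROM T2;

-- B: PARTITION BY columns must appear in the (here mandatory) target list,
-- but need not match the partitioning chosen at CTAS time.
INSERT INTO T1 col1, col2, col3 PARTITION BY col3
SELECT col4, col5, col6 FROM T2;
```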
A couple of open questions from the design perspective:

1. How do we perform validation - validation of data types, the number of columns being inserted, etc.? In addition to validation, we need to make sure that when we insert into an existing table we insert data under the existing column names (the select column list can have different names). This poses a problem: we need to know the metadata at planning time. Two approaches have been floating around:
* DotDrill files: we can store metadata, partitioning columns, and other useful information here, and we can perform validation at planning time. The challenges with introducing DotDrill files include:
  - Consistency between the metadata and the actual data (nothing prevents users from copying files in directly).
  - Security around DotDrill files (can be handled the same way we perform security checks for Drill tables in HDFS).
  - An interface to change the DotDrill file, for when we need to add a column to the table, add a new partition, etc.
* Explicit syntax / no-metadata approach: avoid DotDrill files and use explicit syntax to glean as much information as possible from the SQL statement itself. Some challenges with this approach:
  - Gathering metadata: since we have no idea what the existing schema is, we would need to perform a mini scan to learn the schema at planning time in order to do any validation. The problem then is determining how many files we need to read to learn the schema. If we use a sample set and not all the files have the same schema, we could get non-deterministic results depending on which files were sampled; reading all the files and merging their schemas also seems like an expensive cost to pay.
  - From the user's perspective, when inserting into a partitioned table the user will have to specify the partitioning columns again in the Insert statement, despite having specified them in the CTAS.
2. What is a reasonable assumption for a Drill table in terms of changing schema? Is requiring the exact same schema for all files in a table too rigid an assumption at this point?

One more thing to remember with the DotDrill file: the repercussions for DROP TABLE, SHOW TABLES, DESCRIBE TABLE, etc. - i.e., having it might make it easier to support these operations.

Thanks
Mehant