Re: [DISCUSS] Insert into Table support

2015-08-05 Thread Chris Westin
Re #7 in the original post, "Select table syntax can specify constant values
for one or more columns":
I would have assumed the select list can have any expressions that can be
evaluated on a row from the source; that includes columns, expressions on
columns, or constants. It's probably not your intent, but the stated form
implies that all I get are column values and constants. Which is it?


Re: [DISCUSS] Insert into Table support

2015-08-05 Thread Aman Sinha
@Jacques..
Since the topic of metadata caching is closely related, the main issue is
how the metadata file is maintained in the presence of either concurrent
INSERTs or concurrent REFRESH METADATA operations. One could maintain
multiple versions of the metadata file or keep the version information
inside a single metadata file. Is that what you were thinking when you
mentioned Lucene's index versioning? (I am not familiar with Lucene's
versioning.)
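
To make the "multiple versions" alternative concrete, here is a minimal
sketch (hypothetical file naming and helper code, not Drill's actual
metadata cache): each refresh publishes a new numbered metadata file with an
atomic rename, and readers simply pick the highest published version.

  import java.io.IOException;
  import java.nio.file.*;
  import java.util.Comparator;
  import java.util.stream.Stream;

  // Hypothetical versioned metadata layout: <tableDir>/.metadata.v<N>
  class VersionedMetadata {
    private final Path tableDir;

    VersionedMetadata(Path tableDir) { this.tableDir = tableDir; }

    // Called by REFRESH METADATA (or by an INSERT when it completes).
    void publish(byte[] serializedMetadata) throws IOException {
      long next = latestVersion() + 1;
      Path tmp = tableDir.resolve(".metadata.v" + next + ".tmp");
      Files.write(tmp, serializedMetadata);
      // The atomic rename makes the new version visible in a single step.
      Files.move(tmp, tableDir.resolve(".metadata.v" + next),
          StandardCopyOption.ATOMIC_MOVE);
    }

    // Called during query planning: read whichever version is newest right now.
    byte[] readLatest() throws IOException {
      return Files.readAllBytes(tableDir.resolve(".metadata.v" + latestVersion()));
    }

    private long latestVersion() throws IOException {
      try (Stream<Path> files = Files.list(tableDir)) {
        return files.map(p -> p.getFileName().toString())
            .filter(n -> n.startsWith(".metadata.v") && !n.endsWith(".tmp"))
            .map(n -> Long.parseLong(n.substring(".metadata.v".length())))
            .max(Comparator.naturalOrder())
            .orElse(0L);
      }
    }
  }

Concurrent REFRESHes would still need coordination on who takes the next
version number, but in this scheme readers never block on a writer.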

Aman


Re: [DISCUSS] Insert into Table support

2015-08-05 Thread Adam Gilmore
I thought I'd add my two cents based on my effort with Parquet pushdown
filtering.

It sounds like merging schemas is going to be pretty daunting, considering
the work planned around embedded types and the fact that Parquet doesn't
support those.

Furthermore, metadata sounds like it's going to be fairly specific to each
storage type.  In the Parquet case, it's very beneficial to have
statistics for each column in each file so that you can easily filter out
files that clearly won't match a filter (which is what I did in the
pushdown implementation).  This is the challenge with the current metadata
implementation (it doesn't include that information), so you end up
planning batches out to many Drillbits, often with a lot of those batches
entirely skipped due to mismatching filters.
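
The kind of pruning this enables is easy to sketch once the metadata carries
per-file, per-column min/max values (hypothetical classes below, not the
current metadata cache format): the planner can drop any file whose range
provably cannot satisfy the filter, without ever opening the file.

  import java.util.*;

  // Hypothetical per-file statistics, assuming a numeric column for simplicity.
  class ColumnStats {
    final long min, max;
    ColumnStats(long min, long max) { this.min = min; this.max = max; }
  }

  class FileMetadata {
    final String path;
    final Map<String, ColumnStats> columnStats;
    FileMetadata(String path, Map<String, ColumnStats> columnStats) {
      this.path = path; this.columnStats = columnStats;
    }
  }

  class FilePruner {
    // Keep only files whose [min, max] range for `column` can contain `value`.
    // Files without stats for the column must be kept: we cannot prove they miss.
    static List<FileMetadata> pruneForEquals(List<FileMetadata> files,
                                             String column, long value) {
      List<FileMetadata> kept = new ArrayList<>();
      for (FileMetadata f : files) {
        ColumnStats s = f.columnStats.get(column);
        if (s == null || (value >= s.min && value <= s.max)) {
          kept.add(f);
        }
      }
      return kept;
    }
  }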

I completely agree with Jacques's point about not changing the outcome of
the query if the dot drill/metadata files are not present; the query simply
becomes more expensive without them.


Re: [DISCUSS] Insert into Table support

2015-08-04 Thread Aman Sinha
Some thoughts around concurrency for INSERTs and SELECTs:

For the consistency model we could aim to provide something close to
snapshot isolation: only one INSERT to a particular table is allowed at a
time, and SELECT queries happen concurrently while the INSERT is going on.
In reality, though, there is a finer granularity of steps:

 For the INSERT operation:
   1. Create a temp directory (or hidden directory) to store the files,
say foo/tmpdir for the table foo.
   2. Start writing files to foo/tmpdir.
   3. When complete, rename foo/tmpdir to indicate that it is available for
queries.  (Q: is there a need for short-term directory-level locking?)
   4. Call REFRESH METADATA, which merges the existing schema with the new
files and collects stats on the new files.
  4(a). During this process, a metadata write lock needs to be acquired.

For any query that is running concurrently with the INSERT, we should only
block for step 4(a).  This is a relatively short period of time
compared to the overall elapsed time of the INSERT.

The SELECT queries themselves have the following steps:
  1. Query planning
 1(a) Read the table metadata (presumably from the metadata file) by
first acquiring a read lock on the metadata, reading the metadata,
and releasing the lock.
 1(b) Perform query optimization.
  2. Query execution and returning results.

The only blocking step is 1(a); a rough sketch of this flow follows.
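
For illustration only, a minimal single-process sketch of steps 1-4 and 1(a)
above (hypothetical class, method, and file names, not Drill code; in a real
Drillbit cluster the metadata lock would have to be distributed, e.g. via
ZooKeeper, rather than an in-process ReadWriteLock):

  import java.io.IOException;
  import java.nio.file.*;
  import java.util.concurrent.locks.ReentrantReadWriteLock;

  // Single-process sketch of the INSERT/SELECT steps above.
  class TableWriteProtocol {
    private final ReentrantReadWriteLock metadataLock = new ReentrantReadWriteLock();
    private final Path tableDir;

    TableWriteProtocol(Path tableDir) { this.tableDir = tableDir; }

    void insert(String batchName, byte[] parquetBytes) throws IOException {
      // Steps 1-2: write the new files into a hidden temp directory under the table.
      Path tmpDir = Files.createDirectories(tableDir.resolve(".tmp_" + batchName));
      Files.write(tmpDir.resolve("0_0_0.parquet"), parquetBytes);

      // Step 3: rename so the new data becomes visible to new scans.
      Files.move(tmpDir, tableDir.resolve(batchName), StandardCopyOption.ATOMIC_MOVE);

      // Steps 4 / 4(a): merge the new files into the cached metadata under a write lock.
      metadataLock.writeLock().lock();
      try {
        refreshMetadata();   // hypothetical: re-read footers, merge schema, collect stats
      } finally {
        metadataLock.writeLock().unlock();
      }
    }

    byte[] planSelect() throws IOException {
      // Step 1(a): only the metadata read is protected; planning and execution are lock-free.
      metadataLock.readLock().lock();
      try {
        return Files.readAllBytes(tableDir.resolve(".metadata"));
      } finally {
        metadataLock.readLock().unlock();
      }
    }

    private void refreshMetadata() { /* omitted in this sketch */ }
  }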

Q: Is it possible to completely avoid any locks, and what would be the
consistency model we could claim in that case? We are not building a
transactional system, but we do need well-defined, deterministic behavior in
the presence of concurrency.
Q: If we do acquire locks, what should be their nature... are they
persistent, etc.?

Aman


Re: [DISCUSS] Insert into Table support

2015-07-30 Thread Aman Sinha
Two more observations:
1.  I would think the merged schema should be deterministic and not depend
on the order of scanning the files.  How can this be enforced? Timestamp
order, lexicographical order? ... but these files are being read in
parallel, so we have to think this through.
2.  Due to the merging of schemas, a column that was of REQUIRED type in the
original Parquet file may end up having NULL values after merging.  If we do
a CTAS of the merged schema, this column will be of OPTIONAL type in Parquet.
Is this acceptable?


Re: [DISCUSS] Insert into Table support

2015-07-29 Thread Jacques Nadeau
One comment... I think this is an overreach:

"The union schema is saved in a .drill file"

There is no point in saving something derived in a .drill file.  Maybe a
cache file (extending the other work on metadata caching).  The idea for
.drill files is that they are supplementary to what can be acquired from
the data directly.

--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: [DISCUSS] Insert into Table support

2015-07-29 Thread Parth Chandra
Notes based on the discussion in the hangout, Jacques' notes, and later
discussion:


   1. For Parquet files:
  - Merge schema information into a union schema.
  - The union schema is saved in a .drill file.
  - The union schema can be cached.
  - The union schema should maintain the ordinal position of fields.
   2. At the time the table is read, we can check the directory timestamp and
   regenerate the merged schema.
   3. Insert into does not need to update the merged schema. (Should it
   check whether an incompatible schema is being written? An incompatible
   schema is one that cannot be merged successfully.)
  - Suggestion 1: allow flexible inserts (issue a warning if there is
  going to be a schema change conflict). Queries on columns with no
  conflict can still run.
  - Suggestion 2: do not allow the insert unless the schema can be merged.
   4. Rules for merging can be relaxed as we improve our ability to promote
   from one type to another (a merge sketch follows this list).
   5. The concept of schema change is different from a merged schema. Schema
   change is needed for reading schema-less sources like JSON.
   6. If there is a different column order (or a difference in the number of
   columns) in the table files, an insert based on ordinal positions cannot
   be allowed.
   7. Suggestion: go with a name-based insert. Ordinal-based insert can
   be enabled with a switch (?).
   8. Reuse the partitioning information from 'create table' in 'insert into'.
   No partitioning info is needed in the 'insert into' statement.
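
A rough sketch of what merging per-file schemas into such a union schema
might look like (hypothetical Column class, not Drill's internal types; see
points 1, 3 and 4 above): columns keep the ordinal order in which they are
first seen, a column missing from some file or nullable anywhere becomes
OPTIONAL, and an unresolvable type conflict fails the merge.

  import java.util.*;

  // Hypothetical column descriptor used only for this sketch.
  class Column {
    final String name;
    final String type;       // e.g. "INT", "BIGINT", "VARCHAR"
    final boolean optional;   // true == nullable
    Column(String name, String type, boolean optional) {
      this.name = name; this.type = type; this.optional = optional;
    }
  }

  class UnionSchema {
    static List<Column> merge(List<List<Column>> fileSchemas) {
      LinkedHashMap<String, Column> merged = new LinkedHashMap<>();
      for (int i = 0; i < fileSchemas.size(); i++) {
        Set<String> present = new HashSet<>();
        for (Column c : fileSchemas.get(i)) {
          present.add(c.name);
          Column prev = merged.get(c.name);
          if (prev == null) {
            // A column first seen after the first file is missing from earlier files.
            merged.put(c.name, new Column(c.name, c.type, c.optional || i > 0));
          } else if (!prev.type.equals(c.type)) {
            // Point 4 would try type promotion (e.g. INT -> BIGINT) before failing.
            throw new IllegalArgumentException("Incompatible types for column " + c.name);
          } else if (c.optional && !prev.optional) {
            merged.put(c.name, new Column(c.name, c.type, true));
          }
        }
        // Columns already in the union but absent from this file become OPTIONAL too.
        for (Map.Entry<String, Column> e : merged.entrySet()) {
          if (!present.contains(e.getKey()) && !e.getValue().optional) {
            e.setValue(new Column(e.getKey(), e.getValue().type, true));
          }
        }
      }
      return new ArrayList<>(merged.values());
    }
  }

Note the REQUIRED-to-OPTIONAL promotion: a column that was REQUIRED in one
file ends up OPTIONAL in the merged schema as soon as any file lacks it.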




Re: [DISCUSS] Insert into Table support

2015-07-27 Thread Khurram Faraaz
I have a question related to inserting into partitioned tables. I see that
DB2, Oracle, and Postgres do not support a PARTITION BY clause in their
INSERT INTO implementations. MySQL, however, does provide support for a
PARTITION BY clause in its INSERT INTO implementation. I do not know what
the standard ANSI SQL specification says about the INSERT INTO statement; we
may want to keep it compliant with standard SQL syntax.

 B ) Support insert into a partitioned table
 --

 Ex: INSERT INTO T1 col1, col2, col3  partition by col1, col2 SELECT
col4, col5, col6 from T2

 * Target column specification is required when inserting data into an
   already partitioned table
 * Requirements A.3-A.7 above apply for insert into partitioned tables
   as well
 * A partition by clause along with one or more columns is required
 * All the columns specified in partition by clause must exist in the
   target column list
 * Partition by columns specified do not need to match to the list of
   columns that the original table partitioned with (i.e if the
   original table is partitioned with col1, col2,  new data during
   insert can be partitioned by col3 or just with col1 or col2..)


[DISCUSS] Insert into Table support

2015-07-27 Thread Mehant Baid
I wanted to start a conversation around supporting the Insert into
Table feature. For 1.2 we initially want to support inserting into a
table with Parquet files. Support for JSON, CSV and other sources will
follow as future enhancements.


Aman, Jinfeng, Neeraja and I had an initial discussion about this, and
Neeraja provided a good summary of our discussion (pasted below), which
also states some of the requirements for this feature.


 A ) Support Insert into a non-partitioned table
-

Ex: INSERT INTO T1 [col1, col2, col3]  SELECT col4, col5, col6 from T2
(Source table: T2, target table: T1)
Requirements:

1. The target table column list specification is optional for the Insert statement.
2. When specified, the column list in the Insert statement should
   contain all the columns present in the target table (i.e., no support
   for partial inserts).
3. The column names specified for the source table do not need to match
   the target table column names; the match is performed based on ordinal position.
4. The number of source table columns specified must be the same as the
   number of target table columns.
5. The types of the specified source table columns must match the types of
   the target table columns (a validation sketch follows this list).
6. Specification of * is not allowed in the Select table syntax.
7. The Select table syntax can specify constant values for one or more columns.
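
To make requirements 3-5 concrete, here is a minimal validation sketch
(hypothetical ColumnDef/InsertValidator types, not Drill's planner API):
source columns are matched to target columns purely by position, the counts
must agree, and the types must match pair-wise.

  import java.util.List;

  // Hypothetical column descriptor used only for this sketch.
  class ColumnDef {
    final String name;
    final String type;   // e.g. "INT", "VARCHAR"
    ColumnDef(String name, String type) { this.name = name; this.type = type; }
  }

  class InsertValidator {
    static void validate(List<ColumnDef> targetColumns, List<ColumnDef> sourceColumns) {
      // Requirement 4: same number of columns on both sides.
      if (targetColumns.size() != sourceColumns.size()) {
        throw new IllegalArgumentException("Expected " + targetColumns.size()
            + " columns but the SELECT list produced " + sourceColumns.size());
      }
      for (int i = 0; i < targetColumns.size(); i++) {
        ColumnDef target = targetColumns.get(i);
        ColumnDef source = sourceColumns.get(i);
        // Requirement 5: types must match, ordinal by ordinal.
        if (!target.type.equals(source.type)) {
          throw new IllegalArgumentException("Type mismatch at position " + (i + 1)
              + ": target " + target.name + " is " + target.type
              + ", source " + source.name + " is " + source.type);
        }
        // Requirement 3: names may differ; data is written under the target name.
      }
    }
  }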


 B ) Support insert into a partitioned table
--

Ex: INSERT INTO T1 col1, col2, col3  partition by col1, col2 SELECT 
col4, col5, col6 from T2


 * Target column specification is required when inserting data into an
   already partitioned table.
 * Requirements A.3-A.7 above apply to inserts into partitioned tables
   as well.
 * A partition by clause with one or more columns is required.
 * All the columns specified in the partition by clause must exist in the
   target column list.
 * The partition by columns specified do not need to match the list of
   columns that the original table was partitioned with (i.e., if the
   original table is partitioned with col1, col2, new data during
   insert can be partitioned by col3, or just by col1 or col2).


A couple of open questions from the design perspective:

1. How do we perform validation (of data types, number of columns being
inserted, etc.)? In addition to validation, we need to make sure that when
we insert into an existing table we insert data with the existing column
names (the select column list can have different names). This poses problems
around needing to know the metadata at planning time; two approaches that
have been floated are:
* DotDrill files: We can store metadata, partitioning columns 
and other useful information here, and we can perform validation at 
planning time. However, the challenges with introducing DotDrill files 
include:
 - consistency between the metadata and the actual data 
(nothing prevents users from copying files in directly);
 - security around DotDrill files (can be dealt with in the same 
way we perform security checks for Drill tables in HDFS);
 - an interface to change the DotDrill file, in case we 
need to add a column to the table, add a new partition, etc.


* Explicit syntax / no-metadata approach: Another approach is to 
avoid DotDrill files and use explicit syntax to glean as much 
information as possible from the SQL statement itself. Some of the 
challenges with this approach are:
 - Gathering metadata information: since we have no idea 
what the existing schema is, we would need to perform a mini scan to 
learn the schema at planning time in order to do some validation. 
The problem with this approach is how to determine how many files we 
need to read in order to learn the schema. If we use a sample set and 
not all the files have the same schema, we could get non-deterministic 
results based on the sample of files read. Also, reading all the files 
and merging the schemas seems like an expensive cost to pay.
 - From the user's perspective, while inserting into a 
partitioned table, the user will have to specify the partitioning columns 
again in the Insert statement, despite having specified the partition 
columns in the CTAS.


2. What is a reasonable assumption for a Drill table in terms of 
changing schema? Is having the exact same schema for all files in a table 
too rigid an assumption at this point?


One thing to remember with DotDrill files is the repercussions on 
Drop table, Show tables, Describe table, etc.; having the metadata might 
actually make it easier to support these operations.


Thanks
Mehant