I thought I'd add my two cents based on my work on Parquet pushdown
filtering.

It sounds like merging schemas is going to be pretty daunting, especially
given the work planned around embedded types, which Parquet doesn't
support.

Furthermore, metadata sounds like it's going to be fairly specific to each
storage type.  In the Parquet case, it's very beneficial to have statistics
for each column in the file so that you can cheaply skip files that clearly
can't match a filter (which is what I did in the pushdown implementation).
That's the shortcoming of the current metadata implementation (it doesn't
include that information), so you end up planning batches out to many
Drillbits, and a lot of those batches are skipped entirely because they
can't match the filter.
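
For concreteness, the kind of pruning I mean looks roughly like the sketch
below. This is illustrative only, not the actual pushdown code: it assumes
parquet-mr's footer API (ParquetFileReader.readFooter,
ColumnChunkMetaData.getStatistics), an INT64 column, and a simple equality
filter.

// Minimal sketch of statistics-based file pruning for a filter like "col = value".
// Assumptions (not the actual Drill implementation): parquet-mr footer API,
// an INT64 column, and that missing/empty statistics mean "keep the file".
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.LongStatistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterPruner {

  /** Returns only the files whose min/max range for `column` can contain `value`. */
  public static List<Path> prune(Configuration conf, List<Path> files,
                                 String column, long value) throws IOException {
    List<Path> kept = new ArrayList<>();
    for (Path file : files) {
      ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
      if (fileMayMatch(footer, column, value)) {
        kept.add(file);
      }
      // Files that are not kept are never planned into scan batches at all.
    }
    return kept;
  }

  private static boolean fileMayMatch(ParquetMetadata footer, String column, long value) {
    for (BlockMetaData rowGroup : footer.getBlocks()) {
      for (ColumnChunkMetaData col : rowGroup.getColumns()) {
        if (!col.getPath().toDotString().equals(column)) {
          continue;
        }
        Object stats = col.getStatistics();
        if (!(stats instanceof LongStatistics) || ((LongStatistics) stats).isEmpty()) {
          return true;                   // no usable stats: cannot rule the file out
        }
        LongStatistics s = (LongStatistics) stats;
        if (value >= s.getMin() && value <= s.getMax()) {
          return true;                   // this row group's range may contain the value
        }
      }
    }
    return false;                        // no row group's range can contain the value
  }
}

With per-file (or per-row-group) min/max available up front, the planner can
drop non-matching files before assigning batches to Drillbits, instead of
discovering the mismatch at execution time.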

I completely agree with Jacques's point that the dot drill/metadata files
should not change the outcome of a query - their absence just makes the
query more expensive.

On Thu, Aug 6, 2015 at 8:01 AM, Aman Sinha <asi...@maprtech.com> wrote:

> @Jacques..
> Since the topic of metadata caching is closely related, the main issue is
> how the metadata file is maintained in the presence of either concurrent
> INSERTs or concurrent REFRESH METADATA operations.  One could maintain
> multiple versions of the metadata file or keep the version information
> inside a single metadata file.  Is that what you were thinking when you
> mentioned Lucene's index versioning?  (I am not familiar with Lucene's
> versioning.)
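
One way such versioning could work, purely as an illustration and not
necessarily what Jacques means by Lucene-style versioning: each refresh
writes a new, immutable metadata file and publishes it with an atomic
rename, and readers always pick the newest committed version. The file
naming and directory layout below are assumptions.

// Illustrative only: versioned, immutable metadata files published by atomic rename.
// The "metadata_vN" naming scheme and single-directory layout are assumptions; a
// real implementation would also need to handle two writers racing to the same
// version (e.g. retry with the next number if the rename target already exists).
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

public class VersionedMetadataCache {

  /** Writers never modify an existing file: they write vN+1 and rename it into place. */
  public static Path writeNewVersion(Path dir, byte[] metadata) throws IOException {
    long next = latestVersion(dir).orElse(0L) + 1;
    Path tmp = dir.resolve("metadata_v" + next + ".tmp");
    Files.write(tmp, metadata);
    // Readers see either the complete new file or the previous version, never a partial file.
    return Files.move(tmp, dir.resolve("metadata_v" + next), StandardCopyOption.ATOMIC_MOVE);
  }

  /** Readers simply use the highest committed version that exists. */
  public static Optional<Long> latestVersion(Path dir) throws IOException {
    try (Stream<Path> files = Files.list(dir)) {
      return files.map(p -> p.getFileName().toString())
                  .filter(name -> name.matches("metadata_v\\d+"))
                  .map(name -> Long.parseLong(name.substring("metadata_v".length())))
                  .max(Comparator.naturalOrder());
    }
  }
}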
>
> Aman
>
> On Wed, Aug 5, 2015 at 12:02 PM, Chris Westin <chriswesti...@gmail.com>
> wrote:
>
> > Re #7 in the original post "Select table syntax can specify constant
> values
> > for one or more columns:"
> > I would have assumed the select list can have any expressions that can be
> > evaluated on a row from the source; that includes columns, expressions on
> > columns, or constants. It's probably not your intent, but the stated form
> > implies that all I get are column values and constants. Which is it?
> >
> > On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid <baid.meh...@gmail.com>
> > wrote:
> >
> > > I wanted to start a conversation around supporting the "Insert into
> > Table"
> > > feature. As of 1.2 we initially want to support inserting into a table
> > with
> > > Parquet files. Support for JSON, CSV and other sources will follow as
> > > future enhancements.
> > >
> > > Aman, Jinfeng, Neeraja and I had an initial discussion about this, and
> > > Neeraja provided a good summary of it (pasted below), which also states
> > > some of the requirements for this feature.
> > >
> > >  A ) Support Insert into a non-partitioned table
> > > ---------------------
> > >
> > > Ex: INSERT INTO T1 [col1, col2, col3] SELECT col4, col5, col6 from T2
> > > (Source table: T2, Target table: T1)
> > > Requirements:
> > >
> > > 1. The target table column list specification is optional for the
> > >    Insert statement
> > > 2. When specified, the column list in the Insert statement should
> > >    contain all the columns present in the target table (i.e. no support
> > >    for partial inserts)
> > > 3. The column names specified for the source table do not need to
> > >    match the target table column names; matching is performed by
> > >    ordinal position
> > > 4. The number of source table columns specified must be the same as the
> > >    number of target table columns
> > > 5. Types of the specified source table columns must match the types of
> > >    the target table columns (see the validation sketch after this list)
> > > 6. Specification of * is not allowed in the Select table syntax
> > > 7. Select table syntax can specify constant values for one or more
> > columns
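
To make requirements 3-5 concrete, a minimal validation sketch follows.
This is illustrative only; the TargetColumn class and the string type names
are placeholders, not Drill's actual metadata or type classes.

import java.util.List;

// Illustrative ordinal-based validation for requirements A.3-A.5.
public class InsertValidator {

  public static class TargetColumn {
    public final String name;
    public final String type;
    public TargetColumn(String name, String type) { this.name = name; this.type = type; }
  }

  public static void validate(List<TargetColumn> target, List<String> sourceColumnTypes) {
    // A.4: the SELECT must produce exactly as many columns as the target table has.
    if (target.size() != sourceColumnTypes.size()) {
      throw new IllegalArgumentException("INSERT supplies " + sourceColumnTypes.size()
          + " columns but target table has " + target.size());
    }
    // A.3 + A.5: source columns are matched to target columns by ordinal, source
    // names are ignored, and the types at each ordinal must agree.
    for (int i = 0; i < target.size(); i++) {
      if (!target.get(i).type.equalsIgnoreCase(sourceColumnTypes.get(i))) {
        throw new IllegalArgumentException("Type mismatch at ordinal " + i
            + ": target column '" + target.get(i).name + "' is " + target.get(i).type
            + " but source expression is " + sourceColumnTypes.get(i));
      }
    }
  }
}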
> > >
> > >
> > >  B ) Support insert into a partitioned table
> > > ----------------------
> > >
> > > Ex: INSERT INTO T1 col1, col2, col3 partition by col1, col2 SELECT
> > > col4, col5, col6 from T2
> > >
> > >  * Target column specification is required when inserting data into an
> > >    already partitioned table
> > >  * Requirements A.3-A.7 above apply to inserts into partitioned tables
> > >    as well
> > >  * A partition by clause with one or more columns is required
> > >  * All the columns specified in the partition by clause must exist in
> > >    the target column list
> > >  * The partition by columns specified do not need to match the columns
> > >    the original table was partitioned with (i.e. if the original table
> > >    is partitioned by col1, col2, new data during an insert can be
> > >    partitioned by col3, or just by col1 or col2)
> > >
> > >
> > > A couple of open questions from the design perspective:
> > >
> > > 1. How do we perform validation - validation of data types, the number
> > > of columns being inserted, etc.? In addition to validation, we need to
> > > make sure that when we insert into an existing table we insert data with
> > > the existing column names (the select column list can have different
> > > names). This poses the problem of needing to know the metadata at
> > > planning time; two approaches that have been floating around are
> > >         * DotDrill files: We can store metadata, partitioning columns
> > > and other useful information here, and we can perform validation at
> > > planning time (a hypothetical sketch of what such a file could hold
> > > follows below). However, the challenges with introducing DotDrill files
> > > include
> > >              - consistency between metadata and the actual data (nothing
> > > prevents users from copying files in directly)
> > >              - security around DotDrill files (can be dealt with in the
> > > same way we perform security checks for Drill tables in HDFS)
> > >              - an interface to change the DotDrill file, in case we need
> > > to add a column to the table or add a new partition, etc.
> > >
> > >         * Explicit syntax / no metadata approach: Another approach is
> > > to avoid DotDrill files and use explicit syntax to glean as much
> > > information as possible from the SQL statement itself. Some of the
> > > challenges with this approach are
> > >              - Gathering metadata information: Since we have no idea
> > > what the existing schema is, we would need to perform a "mini scan" to
> > > learn the schema at planning time in order to do some validation. The
> > > problem with this approach is determining how many files we need to read
> > > in order to learn the schema. If we use a sample set and not all the
> > > files have the same schema, we could get non-deterministic results
> > > depending on which files are sampled (see the sketch below). Reading all
> > > the files and merging their schemas, on the other hand, seems like an
> > > expensive cost to pay.
> > >              - From the user's perspective, while inserting into a
> > > partitioned table, the user will have to specify the partitioning
> > > columns again in the Insert statement, despite having already specified
> > > the partition columns in the CTAS.
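
To anchor the DotDrill idea, here is a hypothetical shape for what a .drill
metadata file could carry; the field names, JSON encoding, and use of
Jackson are assumptions for illustration, not a proposed format.

// Hypothetical contents of a per-table .drill metadata file, just to anchor the
// discussion; the field names and JSON encoding are assumptions, not a design.
import java.io.File;
import java.io.IOException;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

public class DotDrillTableMetadata {
  public int formatVersion;              // version of the .drill file format itself
  public List<ColumnInfo> columns;       // ordered target columns, used for validation
  public List<String> partitionColumns;  // columns the table was created/partitioned with

  public static class ColumnInfo {
    public String name;
    public String type;                  // e.g. "BIGINT", "VARCHAR"
  }

  /** Planning-time read: validate an INSERT against this instead of scanning data files. */
  public static DotDrillTableMetadata read(File dotDrillFile) throws IOException {
    return new ObjectMapper().readValue(dotDrillFile, DotDrillTableMetadata.class);
  }
}

A file like this would let the planner check column counts, types, and
partition columns without touching the data files, at the cost of keeping
it consistent with them.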
> > >
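
And to make the "mini scan" sampling concern concrete for the Parquet case,
here is a sketch that learns the schema from the footers of the first N
files and fails on a mismatch. The readFooter usage and the sample-size
cutoff are assumptions, and files outside the sample can still disagree,
which is exactly the non-determinism worry.

// Sketch of the "mini scan": learn the schema from a sample of Parquet footers.
// Files outside the sample may still have a different schema, which is the
// non-determinism problem described above.
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.schema.MessageType;

public class SchemaSampler {

  public static MessageType sampleSchema(Configuration conf, List<Path> files, int sampleSize)
      throws IOException {
    MessageType schema = null;
    for (Path file : files.subList(0, Math.min(sampleSize, files.size()))) {
      MessageType fileSchema = ParquetFileReader.readFooter(conf, file)
                                                .getFileMetaData().getSchema();
      if (schema == null) {
        schema = fileSchema;
      } else if (!schema.equals(fileSchema)) {
        // Sampled files already disagree; merging or failing here is a policy decision.
        throw new IllegalStateException("Schema mismatch in sampled file: " + file);
      }
    }
    return schema;   // null if the table has no files yet
  }
}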
> > > 2. What is a reasonable assumption for a Drill table in terms of
> > > changing schema? Is requiring the exact same schema for all files in a
> > > table too rigid an assumption at this point?
> > >
> > > One thing to remember with the DotDrill file approach is its
> > > repercussions on Drop table, Show tables, Describe table, etc. - i.e. it
> > > might make it easier to support these operations.
> > >
> > > Thanks
> > > Mehant
> > >
> >
>
