My key notes on INSERT INTO: this functionality should be consistent with how Drill views data and the world. It seems there are a number of missing foundational components that should be built before this could work "right".
First steps that should be done before INSERT INTO, EMBEDDED, and DROP:

- Add a null or nullable "any" type to the execution flow. Don't materialize it until necessary (this converts the appearance of a new field from the current hard schema change into a soft schema change).

- Add an INSERT INTO that works like CTAS (rough sketch below). This is the simplest way to start and defers many decisions until later. It also addresses the top two use cases: BI tool temporary tables and advanced user workflows (though it is a bit of a sharp instrument).

- Implement "thorough table identification". Right now a directory can contain multiple types of files (potentially queryable and non-queryable), and Drill has no strict way to decide what constitutes a table.

- Add support for Parquet schema reading, merging, and validation. Parquet has a schema; Drill shouldn't expose it as schemaless. This lays the groundwork for a number of types of validation around INSERT INTO, DROP, etc. (It will also require fully deconflicting implicit casting behavior between the validator and the execution layer.)

- Start planning around "dot drill" files (a.k.a. DRILL-3572). Many of the things that need to be supported to make these features work "like a database" require this.
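To make the second bullet concrete, here is a rough sketch of a CTAS-style INSERT INTO. The table name and paths are made up, and the append-new-files behavior is an assumption about the simplest semantics, not settled syntax:

    -- Existing CTAS behavior (real Drill syntax; names/paths made up):
    CREATE TABLE dfs.tmp.`events` AS
    SELECT event_id, event_time FROM dfs.`/data/raw/events`;

    -- A CTAS-style INSERT INTO (speculative syntax) would reuse the same
    -- write path but append new files into the existing table directory
    -- instead of requiring that the target not exist. There would be no
    -- reconciliation against the schema of files already present, which
    -- is what makes it a sharp instrument.
    INSERT INTO dfs.tmp.`events`
    SELECT event_id, event_time FROM dfs.`/data/raw/more_events`;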
--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Jul 29, 2015 at 11:02 AM, Parth Chandra <[email protected]> wrote:

> Attendees: Andries, Daniel, Hanifi, Jacques, Jason, Jinfeng, Khurram,
> Kristine, Mehant, Neeraja, Parth, Sudheesh (host)
>
> Minutes based on notes from Sudheesh -
>
> 1) Jacques is working on the following -
>    a) RPC changes - Sudheesh/Parth reported a regression in perf numbers,
>       which was unexpected. Tests are being rerun.
>    b) Apache log format plugin.
>    c) Support for double quotes.
>    d) Allow JSON literals.
>
> 2) Parquet filter pushdown - The patch from Adam Gilmore is awaiting
>    review. This patch will conflict with Steven's work on metadata
>    caching; metadata caching needs to go in first.
>
> 3) JDBC storage plugin - Patch from Magnus. Parth to follow up to get
>    updated code.
>
> 4) Discussion on embedded types -
>    a) Two common types of problems are being hit:
>       1) Soft schema change - Lots of initial nulls, and then a type
>          appears or the type changes to a type that can be promoted to
>          the initial type. Drill assumes the type to be nullable INT if
>          it cannot determine the type. There was discussion of using
>          nullable VARCHAR/VARBINARY instead of nullable INT. The
>          suggestion was that we need to introduce some additional types:
>          i) Introduce a LATE binding type (the type is not known).
>          ii) Introduce a NULL type - only null.
>          iii) Schema sampling to determine schema - use for fast schema.
>       2) Hard schema change - A schema change that is not transitionable.
>    b) Open questions - How do we materialize this to the user? How do
>       clients expect to handle schema change events? What does a BI tool
>       like Tableau do if a new column is introduced? What is the
>       expectation of a JDBC/ODBC application (what do the standards
>       specify, if anything)? Neeraja to follow up and specify.
>    c) Proposal to add support for embedded types, where each value
>       carries type information (covered in DRILL-3228). This requires a
>       detailed design before we begin implementation.
>
> 5) Discussion on INSERT INTO (based on Mehant's post) -
>    a) In general, the feature is expected to behave as it does in any
>       database. Complications arise when the user chooses to insert a
>       different schema or partitioning from that of the original table.
>    b) Jacques's main concern: do we want Drill to be flexible, able to
>       add columns and to leave columns unspecified while inserting, or
>       do we want it to behave like a traditional data warehouse, doing
>       ordinal matching and being strict about the number of columns
>       being inserted into the target table?
>    c) We should validate the schema where we can (e.g., Parquet);
>       however, we should start by validating metadata for queries and
>       reuse that feature in INSERT, as opposed to building it into
>       INSERT.
>    d) If we allow INSERT INTO with a different schema and then cannot
>       read the file, that would be embarrassing.
>    e) If we are trying to solve a specific BI tool use case for inserts,
>       then we should explore solving that specific use case and treat
>       the insert like CTAS today.
>
> 6) Discussion on DROP TABLE -
>    a) Strict identification of the table - Don't drop tables that Drill
>       can't query.
>    b) Fail if there is a file that does not match.
>    c) If impersonation is not enabled, then drop only Drill-owned
>       tables.
>
> More detailed notes on #5 and #6 to be posted by Jacques.
>
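To illustrate the soft schema change case in item 4(a) above, consider a made-up JSON file where a column starts out all null:

    -- rows.json (hypothetical data):
    --   {"a": null}
    --   ... many more rows with "a": null ...
    --   {"a": "hello"}
    SELECT a FROM dfs.`/tmp/rows.json`;
    -- Today the leading nulls get materialized as nullable INT, so the
    -- later VARCHAR value forces a hard schema change on everything
    -- downstream. With a NULL-only or LATE binding type, the column
    -- would stay untyped until a concrete value appears, making this a
    -- soft schema change.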
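The two options in item 5(b) could look like the following; since Drill has no INSERT INTO today, the syntax is speculative and the table names are invented:

    -- Strict, warehouse-style: ordinal matching; the SELECT list must
    -- match the target's column count and order exactly.
    INSERT INTO dfs.tmp.`target`
    SELECT col1, col2, col3 FROM dfs.tmp.`source`;

    -- Flexible: name a subset of the target's columns; the unnamed
    -- columns would be filled with nulls, which in turn depends on the
    -- nullable/NULL type work above.
    INSERT INTO dfs.tmp.`target` (col1, col3)
    SELECT col1, col3 FROM dfs.tmp.`source`;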
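And the DROP TABLE rules in item 6 applied to a hypothetical directory (again, speculative syntax and made-up layout):

    -- Suppose /tmp/mytable contains: 0_0_0.parquet, 0_0_1.parquet, notes.txt
    DROP TABLE dfs.tmp.`mytable`;
    -- Under strict identification this should fail: notes.txt is not a
    -- file Drill can query, so the directory does not strictly identify
    -- as a Drill table. And with impersonation disabled, the drop should
    -- only proceed if the files are Drill-owned.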
