My key notes on INSERT INTO: this functionality should be consistent with how Drill views data and the world. It seems there are a number of missing foundational components that should be built before this could work "right".
First steps that should be done before INSERT INTO, EMBEDDED, and DROP:

- Add a null or nullable "any" type to the execution flow. Don't materialize it until necessary (this converts the appearance of a new field from the current hard schema change into a soft schema change).

- Add an INSERT INTO that works like CTAS (rough sketch below). This is the simplest way to start and defers many decisions until later. It also addresses the top two use cases: BI tool temporary tables and advanced user workflows (though it is a bit of a sharp instrument).

- Implement "thorough table identification". Right now a directory can contain multiple types of files (potentially queryable and non-queryable), and Drill has no strict way to decide what constitutes a table.

- Add support for Parquet schema reading, merging, and validation. Parquet has a schema; Drill shouldn't expose it as schemaless. This lays the groundwork for a number of types of validation around INSERT INTO, DROP, etc. (It will also require fully deconflicting implicit casting behavior between the validator and the execution layer.)

- Start planning around "dot drill" files (a.k.a. DRILL-3572). Many of the things that need to be supported to make these features work "like a database" require this.
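To make the second bullet concrete, here is a rough sketch of a CTAS-style INSERT INTO. The table name and paths are made up, and the append-new-files behavior is an assumption about the simplest semantics, not settled syntax:

    -- Existing CTAS behavior (real Drill syntax; names/paths made up):
    CREATE TABLE dfs.tmp.`events` AS
    SELECT event_id, event_time FROM dfs.`/data/raw/events`;

    -- A CTAS-style INSERT INTO (speculative syntax) would reuse the same
    -- write path but append new files into the existing table directory
    -- instead of requiring that the target not exist. There would be no
    -- reconciliation against the schema of files already present, which
    -- is what makes it a sharp instrument.
    INSERT INTO dfs.tmp.`events`
    SELECT event_id, event_time FROM dfs.`/data/raw/more_events`;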
--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Jul 29, 2015 at 11:02 AM, Parth Chandra <[email protected]> wrote:

> Attendees: Andries, Daniel, Hanifi, Jacques, Jason, Jinfeng, Khurram,
> Kristine, Mehant, Neeraja, Parth, Sudheesh (host)
>
> Minutes based on notes from Sudheesh -
>
> 1) Jacques is working on the following -
>    a) RPC changes - Sudheesh/Parth reported a regression in perf numbers,
>       which was unexpected. Tests are being rerun.
>    b) Apache log format plugin.
>    c) Support for double quotes.
>    d) Allow JSON literals.
>
> 2) Parquet filter pushdown - The patch from Adam Gilmore is awaiting
>    review. This patch will conflict with Steven's work on metadata
>    caching; metadata caching needs to go in first.
>
> 3) JDBC storage plugin - Patch from Magnus. Parth to follow up to get
>    updated code.
>
> 4) Discussion on embedded types -
>    a) Two common types of problems are being hit:
>       1) Soft schema change - Lots of initial nulls, and then a type
>          appears or the type changes to a type that can be promoted to
>          the initial type. Drill assumes the type to be nullable INT if
>          it cannot determine the type. There was discussion of using
>          nullable VARCHAR/VARBINARY instead of nullable INT. The
>          suggestion was that we need to introduce some additional types:
>          i) Introduce a LATE binding type (the type is not known).
>          ii) Introduce a NULL type - only null.
>          iii) Schema sampling to determine schema - use for fast schema.
>       2) Hard schema change - A schema change that is not transitionable.
>    b) Open questions - How do we materialize this to the user? How do
>       clients expect to handle schema change events? What does a BI tool
>       like Tableau do if a new column is introduced? What is the
>       expectation of a JDBC/ODBC application (what do the standards
>       specify, if anything)? Neeraja to follow up and specify.
>    c) Proposal to add support for embedded types, where each value
>       carries type information (covered in DRILL-3228). This requires a
>       detailed design before we begin implementation.
>
> 5) Discussion on INSERT INTO (based on Mehant's post) -
>    a) In general, the feature is expected to behave as it does in any
>       database. Complications arise when the user chooses to insert a
>       different schema or partitioning from that of the original table.
>    b) Jacques's main concern: do we want Drill to be flexible, able to
>       add columns and to leave columns unspecified while inserting, or
>       do we want it to behave like a traditional data warehouse, doing
>       ordinal matching and being strict about the number of columns
>       being inserted into the target table?
>    c) We should validate the schema where we can (e.g., Parquet);
>       however, we should start by validating metadata for queries and
>       reuse that feature in INSERT, as opposed to building it into
>       INSERT.
>    d) If we allow INSERT INTO with a different schema and then cannot
>       read the file, that would be embarrassing.
>    e) If we are trying to solve a specific BI tool use case for inserts,
>       then we should explore solving that specific use case and treat
>       the insert like CTAS today.
>
> 6) Discussion on DROP TABLE -
>    a) Strict identification of the table - Don't drop tables that Drill
>       can't query.
>    b) Fail if there is a file that does not match.
>    c) If impersonation is not enabled, then drop only Drill-owned
>       tables.
>
> More detailed notes on #5 and #6 to be posted by Jacques.
>
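To illustrate the soft schema change case in item 4(a) above, consider a made-up JSON file where a column starts out all null:

    -- rows.json (hypothetical data):
    --   {"a": null}
    --   ... many more rows with "a": null ...
    --   {"a": "hello"}
    SELECT a FROM dfs.`/tmp/rows.json`;
    -- Today the leading nulls get materialized as nullable INT, so the
    -- later VARCHAR value forces a hard schema change on everything
    -- downstream. With a NULL-only or LATE binding type, the column
    -- would stay untyped until a concrete value appears, making this a
    -- soft schema change.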
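The two options in item 5(b) could look like the following; since Drill has no INSERT INTO today, the syntax is speculative and the table names are invented:

    -- Strict, warehouse-style: ordinal matching; the SELECT list must
    -- match the target's column count and order exactly.
    INSERT INTO dfs.tmp.`target`
    SELECT col1, col2, col3 FROM dfs.tmp.`source`;

    -- Flexible: name a subset of the target's columns; the unnamed
    -- columns would be filled with nulls, which in turn depends on the
    -- nullable/NULL type work above.
    INSERT INTO dfs.tmp.`target` (col1, col3)
    SELECT col1, col3 FROM dfs.tmp.`source`;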
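And the DROP TABLE rules in item 6 applied to a hypothetical directory (again, speculative syntax and made-up layout):

    -- Suppose /tmp/mytable contains: 0_0_0.parquet, 0_0_1.parquet, notes.txt
    DROP TABLE dfs.tmp.`mytable`;
    -- Under strict identification this should fail: notes.txt is not a
    -- file Drill can query, so the directory does not strictly identify
    -- as a Drill table. And with impersonation disabled, the drop should
    -- only proceed if the files are Drill-owned.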
