Attendees: Andries, Daniel, Hanifi, Jacques, Jason, Jinfeng, Khurram, Kristine, Mehant, Neeraja, Parth, Sudheesh (host)
Minutes based on notes from Sudheesh:

1) Jacques is working on the following:
   a) RPC changes - Sudheesh/Parth reported an unexpected regression in perf numbers. Tests are being rerun.
   b) Apache log format plugin.
   c) Support for double quotes.
   d) Allow JSON literals.

2) Parquet filter pushdown - Patch from Adam Gilmore is awaiting review. This patch will conflict with Steven's work on metadata caching; metadata caching needs to go in first.

3) JDBC storage plugin - Patch from Magnus. Parth to follow up to get updated code.

4) Discussion on embedded types:
   a) Two kinds of common problems are being hit:
      1) Soft schema change - lots of initial nulls, and then a type appears or the type changes to a type that can be promoted to the initial type. Drill assumes the type to be nullable INT if it cannot determine the type (a sketch follows these notes). There was discussion on using nullable VARCHAR/VARBINARY instead of nullable INT. The suggestion was that we need to introduce some additional types:
         i) Introduce a LATE binding type (type is not yet known).
         ii) Introduce a NULL type (only null).
         iii) Schema sampling to determine the schema, for use with fast schema.
      2) Hard schema change - a schema change that is not transitionable.
   b) Open questions - How do we materialize this to the user? How do clients expect to handle schema change events? What does a BI tool like Tableau do if a new column is introduced? What is the expectation of a JDBC/ODBC application (what do the standards specify, if anything)? Neeraja to follow up and specify.
   c) Proposal to add support for embedded types where each value carries type information (covered in DRILL-3228; a small example follows these notes). This requires a detailed design before we begin implementation.

5) Discussion on 'Insert into' (based on Mehant's post):
   a) In general, the feature is expected to behave as in any database. Complications arise when the user chooses to insert a different schema or different partitions than the original table.
   b) Jacques's main concern: do we want Drill to be flexible, able to add columns and to omit columns while inserting, or do we want it to behave like a traditional data warehouse, where we do ordinal matching and are strict about the number of columns being inserted into the target table?
   c) We should validate the schema where we can (e.g., Parquet); however, we should start by validating metadata for queries and use that feature in Insert, as opposed to building it into Insert.
   d) If we allow inserting with a different schema and then cannot read the file back, that would be embarrassing.
   e) If we are trying to solve a specific BI tool use case for inserts, then we should explore solving that specific use case and treat the insert like CTAS today (a sketch follows these notes).

6) Discussion on 'Drop table':
   a) Strict identification of the table - don't drop tables that Drill can't query (a sketch follows these notes).
   b) Fail if there is a file that does not match.
   c) If impersonation is not enabled, then drop only Drill-owned tables.

More detailed notes on #5 ('Insert into') and #6 ('Drop table') to be posted by Jacques.
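
Sketch for 4a (soft schema change). A minimal illustration of why the nullable INT default is a problem; the file name, path, and columns are made up for this example:

    /* hypothetical file dfs.tmp.`soft_schema.json`:
       {"id": 1, "comment": null}
       {"id": 2, "comment": null}
       ...thousands of null rows...
       {"id": 5000, "comment": "first string value"}        */
    SELECT id, comment
    FROM dfs.tmp.`soft_schema.json`;
    -- With only leading nulls the reader has no evidence of the real type and
    -- defaults `comment` to nullable INT, so the later VARCHAR value can
    -- surface as a schema change error downstream instead of being promoted.
    -- A LATE or NULL type, or schema sampling, would defer that guess.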
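
Sketch for 4c (embedded types, DRILL-3228). A made-up data shape that motivates per-value type information, where the same field holds different types across records:

    /* {"key": "a", "value": 100}
       {"key": "b", "value": "one hundred"}
       {"key": "c", "value": {"amount": 100, "unit": "USD"}}  */
    SELECT key, `value`
    FROM dfs.tmp.`mixed_types.json`;
    -- Today `value` must resolve to a single column type; under the proposal
    -- each value would carry its own type information instead.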
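
Sketch for 5e (treating insert like CTAS). The CTAS statement exists today; the INSERT INTO form is only proposed here, and the table names and paths are illustrative:

    -- CTAS as it works today:
    CREATE TABLE dfs.tmp.sales AS
    SELECT order_id, region, amount
    FROM dfs.`/staging/sales_june.json`;

    -- Proposed follow-on: append to the same table, with the open question
    -- being whether columns are matched by name or strictly by ordinal
    -- against the target table:
    -- INSERT INTO dfs.tmp.sales
    -- SELECT order_id, region, amount
    -- FROM dfs.`/staging/sales_july.json`;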
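
Sketch for 6a (strict identification on drop). DROP TABLE is only under discussion here; the table name is illustrative:

    -- Would succeed only if everything under the table location is data that
    -- Drill itself can query (e.g., Parquet it wrote); if an unrecognized
    -- file is present, the command fails rather than deleting anything.
    DROP TABLE dfs.tmp.sales;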