If there are any suggestion, can we take it in the JIRA. I feel, there is already JIRA for this. https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <ste...@activitystream.com> wrote: > Hi, > > Directory pruning is great. It allows us, for example, to do efficient > date-range queries even when our data is arranged in a day or week based > directory structure. > > We would like to be able to run the same query for all this data even > though the schema has changes slightly (new fields added) over time. > > For me there are two thing in this scenario that are unreasonable: > > 1. For Drill to have to get the schema for all possible files (union > based) to validate queries > - adding 100s of *irrelevant* files to the mix > > 2. For Drill to fail the query because a field is not found in the > sub-set (directory pruned sub-set) > > The current approach results in option 2 and the proposed solution results > in option 1 (As I understand it) > > We would be perfectly happy with unknown fields resulting in null as there > are many ways to deal with null values built into Drill. > > Hopefully this a) makes sense and b) is acceptable. > > Enforcing a strict schema for Avro could be an optional feature (IMO). > > Regards, > -Stefán > > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <jacq...@dremio.com> > wrote: > > > I think the main problem your hitting is that we should do a union of all > > files. In that case, as long as the field is in a single file, we're > going > > to let the field through. > > > > There is a balancing between early termination and flexibility that we > must > > provide. If someone types a field and it is guaranteed to not be in the > > data, the thinking is we should fail the query early as that is probably > a > > mistake on the user's part. If it could be a valid field, we proceed > with > > execution and null it out until we find something. That is the goal > > anyway. Clearly we have a bug here as we should never deny a possible or > > known field. > > > > I think of fields in three categories: known, possible, impossible. > > Impossible fields should fail to validate. Possible and known fields > should > > validate and execute. > > > > With regards to Ted's concern: I agree that applying a filter shouldn't > > fail a query. That means we will either have to consider the complete > union > > Schema before pruning files or consider all fields as either known or > > possible after pruning files. > > > > Stefan, if you haven't already, please open a bug that known fields are > > failing to validate in Avro and we will fix shortly. Sorry about the bug. > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <ste...@activitystream.com> > > wrote: > > > > > Well, at least I'm not alone here. > > > > > > I think it must be time to set some ground rules for these things and > > what > > > it means to support evolving schema and what is needed to eliminate > ETL. > > > > > > I trust that enforcing a strict schema "just because we think we can" > > must > > > go against the principles of such rules. > > > > > > We moved all our stuff to Avro to avoid various problems with type > > handling > > > (assuming Double on nulls etc.) and to be hit with this, after all that > > > work, is like a slap in the face with two pilchards (more here: > > > https://www.youtube.com/watch?v=IhJQp-q1Y1s) > > > > > > Regards, > > > -Stefán > > > > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <ted.dunn...@gmail.com> > > > wrote: > > > > > > > Sigh of relief is premature. Nobody has committed to carrying this > > > > interpretation forward. > > > > > > > > > > > > > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter < > > > ste...@activitystream.com > > > > > > > > > wrote: > > > > > > > > > /me sighs of relief > > > > > > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning < > ted.dunn...@gmail.com> > > > > > wrote: > > > > > > > > > > > Actually, even without multiple storage types, this could be > > > radically > > > > > > confusing. > > > > > > > > > > > > If I have many avro files that are partitioned into directories, > > then > > > > > > queries that use the partitioning to limit the files that I see > > could > > > > > > include or exclude more recent files that have added a new field. > > > > > > > > > > > > That means that a query would succeed or fail according to which > > date > > > > > range > > > > > > I use for the query. > > > > > > > > > > > > That seems pretty radically bad. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter < > > > > > ste...@activitystream.com> > > > > > > wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > This simply can not be the desired behavior! > > > > > > > > > > > > > > This prevents from using a field from a changing schema with > dir0 > > > > > > > sub-selection (directory pruning) as the altered/full schema is > > > never > > > > > > part > > > > > > > of the query and it subsequently fails. > > > > > > > > > > > > > > Drill should, IMOP, never have rules that are dependent on the > > > > > underlying > > > > > > > storage type. If the query runs with JSON and Parquet then it > > > should > > > > > work > > > > > > > for Avro as well. > > > > > > > > > > > > > > I'm hoping this strict schema validation is all just a > > > > > misunderstanding. > > > > > > > > > > > > > > Regards, > > > > > > > -Stefán > > > > > > > > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh < > kamesh.had...@gmail.com > > > > > > > > wrote: > > > > > > > > > > > > > > > For Avro files, we first construct the schema, and this > schema > > is > > > > > used > > > > > > > for > > > > > > > > validating queries. So, if there are any errors in the query > > > (like > > > > > the > > > > > > > > invalid field references) it will fail fast. As of now, for > > other > > > > > file > > > > > > > > formats, query validation (checking for invalid field > > reference) > > > > > does > > > > > > > not > > > > > > > > happen, and at run time, it constructs the schema for them > and > > > > hence > > > > > > > nulls > > > > > > > > for invalid fields. > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter < > > > > > > > ste...@activitystream.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > I'm getting the following error when querying Avro files: > > > > > > > > > > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, > > > column > > > > > 57: > > > > > > > > > Column 'some_col' not found in any table > > > > > > > > > > > > > > > > > > It's true that the field is in none of the tables I'm > > > targeting, > > > > in > > > > > > > that > > > > > > > > > particular query, but that does not mean that it is in none > > of > > > > the > > > > > > > > possible > > > > > > > > > files I could be querying. > > > > > > > > > > > > > > > > > > We use Avro to get the benefits of the schema but I never > > > > expected > > > > > > > Drill > > > > > > > > to > > > > > > > > > enforce it this way. > > > > > > > > > > > > > > > > > > Why do unresolved columns not return null? > > > > > > > > > > > > > > > > > > This makes no sense to me as I think a fundamental trade of > > > > Drill, > > > > > > when > > > > > > > > > trying to eliminate ETL, is to return null for any missing > > > > fields. > > > > > > > > > > > > > > > > > > Please advise. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > -Stefán > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Kamesh. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- Kamesh.