[ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16036141#comment-16036141 ]
Paul Rogers edited comment on DRILL-5546 at 6/5/17 7:31 AM: ------------------------------------------------------------ I believe we are saying basically the same thing. To be certain, [watch out, we're gonna try some theory|https://store.xkcd.com/products/try-science]. h3. Basics We'll need some terminology, defined in the usual way: * (a, b, c) is a set that contains a, b, c * \{a:b} is a map from a to b * \[a, b, c] is an ordered list of a, b, c, where each element has an index i = 0, 1, 2... * empty = () or {} or \[] is the empty collection (of the proper type) * null = SQL null: we don't know what the value is * ∧ = Java, C, etc. null: we know the value, and it is nothing Drill is, at its core, relational. A relation R can be defined as: {{R = (S, T)}} Where: * S is the schema * T is a set of tuples (t ~1~, t ~2~, t ~3~) (AKA a table) {{S = (C, N)}} Where: * C is the list of column schemas: {{C = \[ c ~0~, c ~1~, c ~2~, ...]}} {{c = (name, type)}} And: * N is a map from name to column index: {{N = \{name : i\} }} where i the index of column c ∈ C Drill defines the idea of _compatible_ schemas. Two schemas are compatible if we redefine the schema as a set of columns: {{S’ = (C ~0~, C ~1~, …)}} Two schemas are compatible iff the column sets are identical (same name and type for each column). This is a bit more forgiving than the traditional relational model which requires that the ordered list of column schemas be identical. Let's assume this rule as we discuss schema below. We'll also need the idea of _cardinality_: {{\|S| = n}} Says that the cardinality (number of items) in S is _n_. Later, well just use _n_ to mean a schema (or relation or whatever) that has n items. h3. Relations and Multi-Relations A relation can be: {{R : ∧}} (programming null) -- the relation simply does not exist. {{ | (0,0)}} -- the trivial relation of no columns and (by definition), no rows. {{ | (s,0)}} -- a relation with some schema s, |s| ≠ 0, and no rows {{ | (s,n)}} -- a "normal" relation with schema and n tuples (rows) of data, \|s| ≠ 0, n ≠ 0 It is helpful to remember some basic relational algebra: {{R(s,0) ⋃ R(s,n) = R(s,n)}} {{R(s,n) ⋃ R(s,m) = R(s,n+m)}} Drill works not just with relations R, but also "multi-relations" M. That is, a multi-relation (the entire result set from a single query) can defined (using semi-BNF) as: {{M : ∧}} -- undefined result set {{ | R(0,0)}} -- trivial result set {{ | R(s,0)}} - empty result set, \|s| ≠ 0 {{ | R(s,n)}} -- normal, single result set, n ≠ 0 {{ | R ~1~(s ~1~,n), R ~1~(s ~2~,m), ...}} -- multi-relation if s ~i~ ≠ s ~j~ Normally when we say multi-relation, we mean the last case: two or more relations with distinct schemas. The condition above says that to adjacent relations R ~i~ and R ~j~ must have distinct schema (or by the rules above, two relations with the same schema just collapse into a single relation with that schema. A schema can repeat, but a different schema must occur between repetitions.) h3. Multi-Relations in Drill Now let's get to Drill. Drill uses the term "schema change" to mean the transition from s ~i~ to s ~j~, s ~i~ ≠ s ~j~. The essence of the proposal here, as I understand it, is to update the implementation to fully support the first three definitions of a multi-relation (undefined, trivial and empty), assuming that Drill already supports the other two definitions (single relation and multi-relation.) In Drill, relations are broken into batches, which are, themselves, just relations. Thus a batch, B, can be: {{B : ∧}} -- undefined result set) {{ | R(0,0)}} -- trivial batch {{ | R(s,0)}} -- empty batch set, |s| ≠ 0 {{ | R(s,n)}} -- normal batch, |s| ≠ 0, n ≠ 0 And the whole result set D (for Drill) is an ordered list of batches: {{D = \[B ~1~, B ~2~, ..., B ~n~]}} Where {{B ~i~ = (s ~i~, t ~i~)}} As noted above, if adjacent batches have the same schema, then they are just sub-relations within a single larger (logical) relation. But, if the schemas differ, then the adjacent batches are the last and first of two distinct relations within a larger (logical) multi-relation. Said another way: {{D = R ~1~, R ~2~}} {{R ~i~ = B ~i,1~, B ~i,2~, ...}} To clarify, let’s visualize the schema changes as ∆~i~: {{D = \[B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t), …]}} The above sequence can describe any series of batches. As I understand the proposal, we want to put some constraints on the sequence. h3. Results of a Drill Query Let’s start by defining how should present the multi-relation to the client. The client also receives batches, but, following the rules in the proposal, we want to constrain the output: {{D ~e~ : B(0,0)}} -- Null result {{ | B(s,0)}} -- Empty result {{ | B(s,n)\+}} -- “Classic” O/JDBC result {{ | (B(s ~i~,n)\+)\+}} -- Multi-relation result (Here, the parenthesis take on their regex/BNF meaning of grouping.) That is the output of a Drill query is: * A single trivial batch (zero columns, zero rows) * A single empty batch (schema, but no data) * One or more non-empty batches For the non-empty case, we end up with two cases: * A single relation (O/JDBC compatible) * A multi-relation h3. Drill Internals Internally, we might want to allow any pattern of batches (include empty or null); Drill operators do the necessary conversion: {{D ~i~ = \[B\+, (∆, B\+)\+]}} We want to remove empty batches where possible to convert the internal multi-relation to the external one. First, let’s define an empty batch: {{empty-batch : ∧}} -- undefined result set {{ | R(0,0)}} -- trivial result set {{ | R(s,0)}} -- empty result set, |s| ≠ 0 We can define some harmless extensions to the normal rules for unions of relations: {{∧ ⋃ R = R}}, for any relation R {{R(0,0) ⋃ R = R}}, for {{R ≠ ∧}} {{R(c,0) ⋃ R = R}}, for {{R ≠ ∧, R(0,0)}} These rules define a “hierarchy of emptiness” to deal with adjacent empty batches. The result of the above rules are that we can remove empty batches from a Drill result set as long as there are adjacent non-empty batches. But, suppose the (internal to Drill) result set has only one empty batch, B ~1~, then the external (visible to client) result set , D ~e~, is: {{D ~e~ = R(0,0)}} if {{B ~1~ = ∧}} or {{B ~1~ = R(0,0)}} {{D ~e~ = R(c,0)}} if {{B ~1~ = R(c,0)}} Otherwise, according to what we said earlier, since the internal result set has at least one non-empty batch, then the external result set will be non-empty. This is true because we can (indeed, should) remove empt batches internal to Drill using the rules noted above and explained in this proposal. h3. The Drill Iterator Protocol Finally, we can put all this together. The proposal focuses on Drill’s Volcano-based iterator protocol. The iterator protocol labels batches with one of three iterator outcomes, as defined by an enum E: {{e(s,t) ∈ E}} {{E : NONE(∧)}} {{ | OK_NEW_SCHEMA(s,t)}} {{ | OK(s,t)}} That is, each iterator outcome labels a batch and so carries a schema and data set. NONE is a special case as it carries an undefined batch. This definition lets us map the iterator protocol directly to a slightly modified version of the multi-relation sequence shown earlier: {{D = \[∆ ~0~, B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆ ~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t), … ∧]}} {{OK_NEW_SCHEMA = \[∆, B]}} {{OK = \[B]}} {{NONE = ∧}} Now we can define the iterator protocol: {{protocol : (OK_NEW_SCHEMA OK\*)\* NONE}} The “fast NONE” path is simply an empty sequence of {{(OK_NEW_SCHEMA OK\*)\*}} terms. h3. Conclusion The proposal for this ticket says that operators should handle the empty batches. The above suggests *why* the operators should handle them. The best solution for *how* to handle them is so modify the "inner next" code that calls a child operator; have that code "compress out" empty batches if followed by a non-empty batch. This note suggests that we look at empty batches (in all three forms) as a natural part of Drill's multi-relation model. Rather than treating them as special cases, empty batches should be defined as a normal part of Drill's internal protocol. A separate proposal (part of DRILL-5211) will suggest that the scanner use the above rules to compress out empty batches that emerge from a variety of special cases with readers. was (Author: paul-rogers): I believe we are saying basically the same thing. To be certain, [watch out, we're gonna try some theory|https://store.xkcd.com/products/try-science]. h3. Basics We'll need some terminology, defined in the usual way: * (a, b, c) is a set that contains a, b, c * \{a:b} is a map from a to b * \[a, b, c] is an ordered list of a, b, c, where each element has an index i = 0, 1, 2... * empty = () or {} or \[] is the empty collection (of the proper type) * null = SQL null: we don't know what the value is * ∧ = Java, C, etc. null: we know the value, and it is nothing Drill is, at its core, relational. A relation R can be defined as: {{R = (S, T)}} Where: * S is the schema * T is a set of tuples (t ~1~, t ~2~, t ~3~) (AKA a table) {{S = (C, N)}} Where: * C is the list of column schemas: {{C = \[ c ~0~, c ~1~, c ~2~, ...]}} {{c = (name, type)}} And: * N is a map from name to column index: {{N = \{name : i\} }} where i the index of column c ∈ C Drill defines the idea of _compatible_ schemas. Two schemas are compatible if we redefine the schema as a set of columns: {{S’ = (C ~0~, C ~1~, …)}} Two schemas are compatible iff the column sets are identical (same name and type for each column). This is a bit more forgiving than the traditional relational model which requires that the ordered list of column schemas be identical. Let's assume this rule as we discuss schema below. We'll also need the idea of _cardinality_: {{\|S| = n}} Says that the cardinality (number of items) in S is _n_. Later, well just use _n_ to mean a schema (or relation or whatever) that has n items. h3. Relations and Multi-Relations A relation can be: {{R : ∧}} (programming null) -- the relation simply does not exist. {{ | (0,0)}} -- the trivial relation of no columns and (by definition), no rows. {{ | (s,0)}} -- a relation with some schema s, |s| ≠ 0, and no rows {{ | (s,n)}} -- a "normal" relation with schema and n tuples (rows) of data, \|s| ≠ 0, n ≠ 0 It is helpful to remember some basic relational algebra: {{R(s,0) ⋃ R(s,n) = R(s,n)}} {{R(s,n) ⋃ R(s,m) = R(s,n+m)}} Drill works not just with relations R, but also "multi-relations" M. That is, a multi-relation (the entire result set from a single query) can defined (using semi-BNF) as: {{M : ∧}} -- undefined result set {{ | R(0,0)}} -- trivial result set {{ | R(s,0)}} - empty result set, \|s| ≠ 0 {{ | R(s,n)}} -- n ≠ 0, normal, single result set {{ | R ~1~(s ~1~,n), R ~1~(s ~2~,m), ...}} multi-relation if s ~i~ ≠ s ~j~ Normally when we say multi-relation, we mean the last case: two or more relations with distinct schemas. The condition above says that to adjacent relations R ~i~ and R ~j~ must have distinct schema (or by the rules above, two relations with the same schema just collapse into a single relation with that schema. A schema can repeat, but a different schema must occur between repetitions.) h3. Multi-Relations in Drill Now let's get to Drill. Drill uses the term "schema change" to mean the transition from s ~i~ to s ~j~, s ~i~ ≠ s ~j~. The essence of the proposal here, as I understand it, is to update the implementation to fully support the first three definitions of a multi-relation (undefined, trivial and empty), assuming that Drill already supports the other two definitions (single relation and multi-relation.) In Drill, relations are broken into batches, which are, themselves, just relations. Thus a batch, B, can be: {{B : ∧}} -- undefined result set) {{ | R(0,0)}} -- trivial batch {{ | R(s,0)}} -- empty batch set, |s| ≠ 0 {{ | R(s,n)}} -- normal batch, |s| ≠ 0, n ≠ 0 And the whole result set D (for Drill) is an ordered list of batches: {{D = \[B ~1~, B ~2~, ..., B ~n~]}} Where {{B ~i~ = (s ~i~, t ~i~)}} As noted above, if adjacent batches have the same schema, then they are just sub-relations within a single larger (logical) relation. But, if the schemas differ, then the adjacent batches are the last and first of two distinct relations within a larger (logical) multi-relation. Said another way: {{D = R ~1~, R ~2~}} {{R ~i~ = B ~i,1~, B ~i,2~, ...}} To clarify, let’s visualize the schema changes as ∆~i~: {{D = \[B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t), …]}} The above sequence can describe any series of batches. As I understand the proposal, we want to put some constraints on the sequence. h3. Results of a Drill Query Let’s start by defining how should present the multi-relation to the client. The client also receives batches, but, following the rules in the proposal, we want to constrain the output: {{D ~e~ : B(0,0)}} -- Null result {{ | B(s,0)}} -- Empty result {{ | B(s,n)\+}} -- “Classic” O/JDBC result {{ | (B(s ~i~,n)\+)\+}} -- Multi-relation result (Here, the parenthesis take on their regex/BNF meaning of grouping.) That is the output of a Drill query is: * A single trivial batch (zero columns, zero rows) * A single empty batch (schema, but no data) * One or more non-empty batches For the non-empty case, we end up with two cases: * A single relation (O/JDBC compatible) * A multi-relation h3. Drill Internals Internally, we might want to allow any pattern of batches (include empty or null); Drill operators do the necessary conversion: {{D ~i~ = \[B\+, (∆, B\+)\+]}} We want to remove empty batches where possible to convert the internal multi-relation to the external one. First, let’s define an empty batch: {{empty-batch : ∧}} -- undefined result set {{ | R(0,0)}} -- trivial result set {{ | R(s,0)}} -- empty result set, |s| ≠ 0 We can define some harmless extensions to the normal rules for unions of relations: {{∧ ⋃ R = R}}, for any relation R {{R(0,0) ⋃ R = R}}, for {{R ≠ ∧}} {{R(c,0) ⋃ R = R}}, for {{R ≠ ∧, R(0,0)}} These rules define a “hierarchy of emptiness” to deal with adjacent empty batches. The result of the above rules are that we can remove empty batches from a Drill result set as long as there are adjacent non-empty batches. But, suppose the (internal to Drill) result set has only one empty batch, B ~1~, then the external (visible to client) result set , D ~e~, is: {{D ~e~ = R(0,0)}} if {{B ~1~ = ∧}} or {{B ~1~ = R(0,0)}} {{D ~e~ = R(c,0)}} if {{B ~1~ = R(c,0)}} Otherwise, according to what we said earlier, since the internal result set has at least one non-empty batch, then the external result set will be non-empty. This is true because we can (indeed, should) remove empt batches internal to Drill using the rules noted above and explained in this proposal. h3. The Drill Iterator Protocol Finally, we can put all this together. The proposal focuses on Drill’s Volcano-based iterator protocol. The iterator protocol labels batches with one of three iterator outcomes, as defined by an enum E: {{e(s,t) ∈ E}} {{E : NONE(∧)}} {{ | OK_NEW_SCHEMA(s,t)}} {{ | OK(s,t)}} That is, each iterator outcome labels a batch and so carries a schema and data set. NONE is a special case as it carries an undefined batch. This definition lets us map the iterator protocol directly to a slightly modified version of the multi-relation sequence shown earlier: {{D = \[∆ ~0~, B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆ ~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t), … ∧]}} {{OK_NEW_SCHEMA = \[∆, B]}} {{OK = \[B]}} {{NONE = ∧}} Now we can define the iterator protocol: {{protocol : (OK_NEW_SCHEMA OK\*)\* NONE}} The “fast NONE” path is simply an empty sequence of {{(OK_NEW_SCHEMA OK\*)\*}} terms. h3. Conclusion The proposal for this ticket says that operators should handle the empty batches. The above suggests *why* the operators should handle them. The best solution for *how* to handle them is so modify the "inner next" code that calls a child operator; have that code "compress out" empty batches if followed by a non-empty batch. This note suggests that we look at empty batches (in all three forms) as a natural part of Drill's multi-relation model. Rather than treating them as special cases, empty batches should be defined as a normal part of Drill's internal protocol. A separate proposal (part of DRILL-5211) will suggest that the scanner use the above rules to compress out empty batches that emerge from a variety of special cases with readers. > Schema change problems caused by empty batch > -------------------------------------------- > > Key: DRILL-5546 > URL: https://issues.apache.org/jira/browse/DRILL-5546 > Project: Apache Drill > Issue Type: Bug > Reporter: Jinfeng Ni > Assignee: Jinfeng Ni > > There have been a few JIRAs opened related to schema change failure caused by > empty batch. This JIRA is opened as an umbrella for all those related JIRAS ( > such as DRILL-4686, DRILL-4734, DRILL4476, DRILL-4255, etc). > -- This message was sent by Atlassian JIRA (v6.3.15#6346)