[jira] [Comment Edited] (DRILL-5546) Schema change problems caused by empty batch

Paul Rogers (JIRA) Mon, 05 Jun 2017 08:58:20 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16036141#comment-16036141
 ]


Paul Rogers edited comment on DRILL-5546 at 6/5/17 3:57 PM:
------------------------------------------------------------

I believe we are saying basically the same thing. To be certain, [watch out, 
we're gonna try some theory|https://store.xkcd.com/products/try-science].

h3. Basics

We'll need some terminology, defined in the usual way:

* (a, b, c) is a set that contains a, b, c
* \{a:b} is a map from a to b
* \[a, b, c] is an ordered list of a, b, c, where each element has an index i = 
0, 1, 2...
* empty = () or {} or \[] is the empty collection (of the proper type)
* null = SQL null: we don't know what the value is
* ∧ = Java, C, etc. null: we know the value, and it is nothing

Drill is, at its core, relational. A relation R can be defined as:

{{R = (S, T)}}

Where:

* S is the schema
* T is a set of tuples (t ~1~, t ~2~, t ~3~) (AKA a table)

{{S = (C, N)}}

Where:

* C is the list of column schemas:

{{C = \[ c ~0~, c ~1~, c ~2~, ...]}}
{{c = (name, type)}}

And:

* N is a map from name to column index:

{{N = \{name : i\}&nbsp;}} where i the index of column c ∈ C

Drill defines the idea of _compatible_ schemas. Two schemas are compatible if 
we redefine the schema as a set of columns:

{{S’ = (C ~0~, C ~1~, …)}}

Two schemas are compatible iff the column sets are identical (same name and 
type for each column). This is a bit more forgiving than the traditional 
relational model which requires that the ordered list of column schemas be 
identical. Let's assume this rule as we discuss schema below.

We'll also need the idea of _cardinality_:

{{\|S| = n}}

Says that the cardinality (number of items) in S is _n_. Later, well just use 
_n_ to mean a schema (or relation or whatever) that has n items.

h3. Relations and Multi-Relations

A relation can be:

{{R : ∧}} (programming null) -- the relation simply does not exist.
{{&nbsp;&nbsp;| (0,0)}} -- the trivial relation of no columns and (by 
definition), no rows.
{{&nbsp;&nbsp;| (s,0)}} -- a relation with some schema s, |s| ≠ 0, and no rows
{{&nbsp;&nbsp;| (s,n)}} -- a "normal" relation with schema and n tuples (rows) 
of data, \|s| ≠ 0, n ≠ 0

It is helpful to remember some basic relational algebra:

{{R(s,0) ⋃ R(s,n) = R(s,n)}}
{{R(s,n) ⋃ R(s,m) = R(s,n+m)}}

Drill works not just with relations R, but also "multi-relations" (to make up a 
term). That is, each schema change in Drill introduces a new relation, so that 
the whole result set from a query is a series of relations. Let's define a 
"multi-reation" M using semi-BNF as:

{{M : ∧}} -- undefined result set
{{&nbsp;&nbsp;| R(0,0)}} -- trivial result set
{{&nbsp;&nbsp;| R(s,0)}}  - empty result set, \|s| ≠ 0
{{&nbsp;&nbsp;| R(s,n)}} -- normal, single result set, n  ≠ 0
{{&nbsp;&nbsp;| R ~1~(s ~1~,n), R ~1~(s ~2~,m), ...}} -- multi-relation if s 
~i~ ≠ s ~j~

Normally when we say multi-relation, we mean the last case: two or more 
relations with distinct schemas. The condition above says that to adjacent 
relations R ~i~ and R ~j~ must have distinct schema (or by the rules above, two 
relations with the same schema just collapse into a single relation with that 
schema. A schema can repeat, but a different schema must occur between 
repetitions.)

h3. Multi-Relations in Drill

Now let's get to Drill. Drill uses the term "schema change" to mean the 
transition from s ~i~ to s ~j~, s ~i~ ≠ s ~j~.

The essence of the proposal here, as I understand it, is to update the 
implementation to fully support the first three definitions of a multi-relation 
(undefined, trivial and empty), assuming that Drill already supports the other 
two definitions (single relation and multi-relation.)

In Drill, relations are broken into batches, which are, themselves, just 
relations. Thus a batch, B, can be:

{{B : ∧}} -- undefined result set)
{{&nbsp;&nbsp;| R(0,0)}} -- trivial batch
{{&nbsp;&nbsp;| R(s,0)}} -- empty batch set, |s| ≠ 0
{{&nbsp;&nbsp;| R(s,n)}} -- normal batch, |s| ≠ 0, n  ≠ 0

And the whole result set D (for Drill) is an ordered list of batches:

{{D = \[B ~1~, B ~2~, ..., B ~n~]}}

Where

{{B ~i~ = (s ~i~, t ~i~)}}

As noted above, if adjacent batches have the same schema, then they are just 
sub-relations within a single larger (logical) relation. But, if the schemas 
differ, then the adjacent batches are the last and first of two distinct 
relations within a larger (logical) multi-relation. Said another way:

{{D = R ~1~, R ~2~}}
{{R ~i~ = B ~i,1~, B ~i,2~, ...}}

To clarify, let’s visualize the schema changes as ∆~i~:

{{D = \[B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆~1~, B ~i~(s ~1~,t), B ~i+1~(s 
~1~,t), …]}}

The above sequence can describe any series of batches. As I understand the 
proposal, we want to put some constraints on the sequence.

h3. Results of a Drill Query

Let’s start by defining how should present the multi-relation to the client. 
The client also receives batches, but, following the rules in the proposal, we 
want to constrain the output:

{{D ~e~ : B(0,0)}} -- Null result
{{&nbsp;&nbsp;| B(s,0)}} -- Empty result
{{&nbsp;&nbsp;| B(s,n)\+}} -- “Classic” O/JDBC result
{{&nbsp;&nbsp;| (B(s ~i~,n)\+)\+}} -- Multi-relation result

(Here, the parenthesis take on their regex/BNF meaning of grouping.)

That is the output of a Drill query is:

* A single trivial batch (zero columns, zero rows)
* A single empty batch (schema, but no data)
* One or more non-empty batches

For the non-empty case, we end up with two cases:

* A single relation (O/JDBC compatible)
* A multi-relation

h3. Drill Internals

Internally, we might want to allow any pattern of batches (include empty or 
null); Drill operators do the necessary conversion:

{{D ~i~ = \[B\+,  (∆, B\+)\+]}}

We want to remove empty batches where possible to convert the internal 
multi-relation to the external one. First, let’s define an empty batch:

{{empty-batch : ∧}} -- undefined result set
{{&nbsp;&nbsp;| R(0,0)}} -- trivial result set
{{&nbsp;&nbsp;| R(s,0)}} -- empty result set, |s| ≠ 0

We can define some harmless extensions to the normal rules for unions of 
relations:

{{∧ ⋃ R = R}}, for any relation R
{{R(0,0) ⋃ R = R}}, for {{R ≠ ∧}}
{{R(c,0) ⋃ R = R}}, for {{R  ≠ ∧, R(0,0)}}

These rules define a “hierarchy of emptiness” to deal with adjacent empty 
batches. The result of the above rules are that we can remove empty batches 
from a Drill result set as long as there are adjacent non-empty batches. But, 
suppose the (internal to Drill) result set has only one empty batch, B ~1~, 
then the external (visible to client) result set , D ~e~, is:

{{D ~e~ = R(0,0)}} if {{B ~1~ = ∧}} or {{B ~1~ = R(0,0)}}
{{D ~e~ = R(c,0)}} if {{B ~1~ = R(c,0)}}

Otherwise, according to what we said earlier, since the internal result set has 
at least one non-empty batch, then the external result set will be non-empty. 
This is true because we can (indeed, should) remove empt batches internal to 
Drill using the rules noted above and explained in this proposal.

h3. The Drill Iterator Protocol

Finally, we can put all this together. The proposal focuses on Drill’s 
Volcano-based iterator protocol. The iterator protocol labels batches with one 
of three iterator outcomes, as defined by an enum E:

{{e(s,t) ∈ E}}
{{E : NONE(∧)}}
{{&nbsp;&nbsp;| OK_NEW_SCHEMA(s,t)}}
{{&nbsp;&nbsp;| OK(s,t)}}

That is, each iterator outcome labels a batch and so carries a schema and data 
set. NONE is a special case as it carries an undefined batch. This definition 
lets us map the iterator protocol directly to a slightly modified version of 
the multi-relation sequence shown earlier:

{{D = \[∆ ~0~, B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆ ~1~, B ~i~(s ~1~,t), B 
~i+1~(s ~1~,t), … ∧]}}
{{OK_NEW_SCHEMA = \[∆, B]}}
{{OK = \[B]}}
{{NONE = ∧}}

Now we can define the iterator protocol:

{{protocol : (OK_NEW_SCHEMA OK\*)\* NONE}}

The “fast NONE” path is simply an empty sequence of {{(OK_NEW_SCHEMA OK\*)\*}} 
terms.

h3. Conclusion

The proposal for this ticket says that operators should handle the empty 
batches. The above suggests *why* the operators should handle them. The best 
solution for *how* to handle them is so modify the "inner next" code that calls 
a child operator; have that code "compress out" empty batches if followed by a 
non-empty batch.

This note suggests that we look at empty batches (in all three forms) as a 
natural part of Drill's multi-relation model. Rather than treating them as 
special cases, empty batches should be defined as a normal part of Drill's 
internal protocol.

A separate proposal (part of DRILL-5211) will suggest that the scanner use the 
above rules to compress out empty batches that emerge from a variety of special 
cases with readers.


was (Author: paul-rogers):
I believe we are saying basically the same thing. To be certain, [watch out, 
we're gonna try some theory|https://store.xkcd.com/products/try-science].

h3. Basics

We'll need some terminology, defined in the usual way:

* (a, b, c) is a set that contains a, b, c
* \{a:b} is a map from a to b
* \[a, b, c] is an ordered list of a, b, c, where each element has an index i = 
0, 1, 2...
* empty = () or {} or \[] is the empty collection (of the proper type)
* null = SQL null: we don't know what the value is
* ∧ = Java, C, etc. null: we know the value, and it is nothing

Drill is, at its core, relational. A relation R can be defined as:

{{R = (S, T)}}

Where:

* S is the schema
* T is a set of tuples (t ~1~, t ~2~, t ~3~) (AKA a table)

{{S = (C, N)}}

Where:

* C is the list of column schemas:

{{C = \[ c ~0~, c ~1~, c ~2~, ...]}}
{{c = (name, type)}}

And:

* N is a map from name to column index:

{{N = \{name : i\}&nbsp;}} where i the index of column c ∈ C

Drill defines the idea of _compatible_ schemas. Two schemas are compatible if 
we redefine the schema as a set of columns:

{{S’ = (C ~0~, C ~1~, …)}}

Two schemas are compatible iff the column sets are identical (same name and 
type for each column). This is a bit more forgiving than the traditional 
relational model which requires that the ordered list of column schemas be 
identical. Let's assume this rule as we discuss schema below.

We'll also need the idea of _cardinality_:

{{\|S| = n}}

Says that the cardinality (number of items) in S is _n_. Later, well just use 
_n_ to mean a schema (or relation or whatever) that has n items.

h3. Relations and Multi-Relations

A relation can be:

{{R : ∧}} (programming null) -- the relation simply does not exist.
{{&nbsp;&nbsp;| (0,0)}} -- the trivial relation of no columns and (by 
definition), no rows.
{{&nbsp;&nbsp;| (s,0)}} -- a relation with some schema s, |s| ≠ 0, and no rows
{{&nbsp;&nbsp;| (s,n)}} -- a "normal" relation with schema and n tuples (rows) 
of data, \|s| ≠ 0, n ≠ 0

It is helpful to remember some basic relational algebra:

{{R(s,0) ⋃ R(s,n) = R(s,n)}}
{{R(s,n) ⋃ R(s,m) = R(s,n+m)}}

Drill works not just with relations R, but also "multi-relations" M. That is, a 
multi-relation (the entire result set from a single query) can defined (using 
semi-BNF) as:

{{M : ∧}} -- undefined result set
{{&nbsp;&nbsp;| R(0,0)}} -- trivial result set
{{&nbsp;&nbsp;| R(s,0)}}  - empty result set, \|s| ≠ 0
{{&nbsp;&nbsp;| R(s,n)}} -- normal, single result set, n  ≠ 0
{{&nbsp;&nbsp;| R ~1~(s ~1~,n), R ~1~(s ~2~,m), ...}} -- multi-relation if s 
~i~ ≠ s ~j~

Normally when we say multi-relation, we mean the last case: two or more 
relations with distinct schemas. The condition above says that to adjacent 
relations R ~i~ and R ~j~ must have distinct schema (or by the rules above, two 
relations with the same schema just collapse into a single relation with that 
schema. A schema can repeat, but a different schema must occur between 
repetitions.)

h3. Multi-Relations in Drill

Now let's get to Drill. Drill uses the term "schema change" to mean the 
transition from s ~i~ to s ~j~, s ~i~ ≠ s ~j~.

The essence of the proposal here, as I understand it, is to update the 
implementation to fully support the first three definitions of a multi-relation 
(undefined, trivial and empty), assuming that Drill already supports the other 
two definitions (single relation and multi-relation.)

In Drill, relations are broken into batches, which are, themselves, just 
relations. Thus a batch, B, can be:

{{B : ∧}} -- undefined result set)
{{&nbsp;&nbsp;| R(0,0)}} -- trivial batch
{{&nbsp;&nbsp;| R(s,0)}} -- empty batch set, |s| ≠ 0
{{&nbsp;&nbsp;| R(s,n)}} -- normal batch, |s| ≠ 0, n  ≠ 0

And the whole result set D (for Drill) is an ordered list of batches:

{{D = \[B ~1~, B ~2~, ..., B ~n~]}}

Where

{{B ~i~ = (s ~i~, t ~i~)}}

As noted above, if adjacent batches have the same schema, then they are just 
sub-relations within a single larger (logical) relation. But, if the schemas 
differ, then the adjacent batches are the last and first of two distinct 
relations within a larger (logical) multi-relation. Said another way:

{{D = R ~1~, R ~2~}}
{{R ~i~ = B ~i,1~, B ~i,2~, ...}}

To clarify, let’s visualize the schema changes as ∆~i~:

{{D = \[B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆~1~, B ~i~(s ~1~,t), B ~i+1~(s 
~1~,t), …]}}

The above sequence can describe any series of batches. As I understand the 
proposal, we want to put some constraints on the sequence.

h3. Results of a Drill Query

Let’s start by defining how should present the multi-relation to the client. 
The client also receives batches, but, following the rules in the proposal, we 
want to constrain the output:

{{D ~e~ : B(0,0)}} -- Null result
{{&nbsp;&nbsp;| B(s,0)}} -- Empty result
{{&nbsp;&nbsp;| B(s,n)\+}} -- “Classic” O/JDBC result
{{&nbsp;&nbsp;| (B(s ~i~,n)\+)\+}} -- Multi-relation result

(Here, the parenthesis take on their regex/BNF meaning of grouping.)

That is the output of a Drill query is:

* A single trivial batch (zero columns, zero rows)
* A single empty batch (schema, but no data)
* One or more non-empty batches

For the non-empty case, we end up with two cases:

* A single relation (O/JDBC compatible)
* A multi-relation

h3. Drill Internals

Internally, we might want to allow any pattern of batches (include empty or 
null); Drill operators do the necessary conversion:

{{D ~i~ = \[B\+,  (∆, B\+)\+]}}

We want to remove empty batches where possible to convert the internal 
multi-relation to the external one. First, let’s define an empty batch:

{{empty-batch : ∧}} -- undefined result set
{{&nbsp;&nbsp;| R(0,0)}} -- trivial result set
{{&nbsp;&nbsp;| R(s,0)}} -- empty result set, |s| ≠ 0

We can define some harmless extensions to the normal rules for unions of 
relations:

{{∧ ⋃ R = R}}, for any relation R
{{R(0,0) ⋃ R = R}}, for {{R ≠ ∧}}
{{R(c,0) ⋃ R = R}}, for {{R  ≠ ∧, R(0,0)}}

These rules define a “hierarchy of emptiness” to deal with adjacent empty 
batches. The result of the above rules are that we can remove empty batches 
from a Drill result set as long as there are adjacent non-empty batches. But, 
suppose the (internal to Drill) result set has only one empty batch, B ~1~, 
then the external (visible to client) result set , D ~e~, is:

{{D ~e~ = R(0,0)}} if {{B ~1~ = ∧}} or {{B ~1~ = R(0,0)}}
{{D ~e~ = R(c,0)}} if {{B ~1~ = R(c,0)}}

Otherwise, according to what we said earlier, since the internal result set has 
at least one non-empty batch, then the external result set will be non-empty. 
This is true because we can (indeed, should) remove empt batches internal to 
Drill using the rules noted above and explained in this proposal.

h3. The Drill Iterator Protocol

Finally, we can put all this together. The proposal focuses on Drill’s 
Volcano-based iterator protocol. The iterator protocol labels batches with one 
of three iterator outcomes, as defined by an enum E:

{{e(s,t) ∈ E}}
{{E : NONE(∧)}}
{{&nbsp;&nbsp;| OK_NEW_SCHEMA(s,t)}}
{{&nbsp;&nbsp;| OK(s,t)}}

That is, each iterator outcome labels a batch and so carries a schema and data 
set. NONE is a special case as it carries an undefined batch. This definition 
lets us map the iterator protocol directly to a slightly modified version of 
the multi-relation sequence shown earlier:

{{D = \[∆ ~0~, B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆ ~1~, B ~i~(s ~1~,t), B 
~i+1~(s ~1~,t), … ∧]}}
{{OK_NEW_SCHEMA = \[∆, B]}}
{{OK = \[B]}}
{{NONE = ∧}}

Now we can define the iterator protocol:

{{protocol : (OK_NEW_SCHEMA OK\*)\* NONE}}

The “fast NONE” path is simply an empty sequence of {{(OK_NEW_SCHEMA OK\*)\*}} 
terms.

h3. Conclusion

The proposal for this ticket says that operators should handle the empty 
batches. The above suggests *why* the operators should handle them. The best 
solution for *how* to handle them is so modify the "inner next" code that calls 
a child operator; have that code "compress out" empty batches if followed by a 
non-empty batch.

This note suggests that we look at empty batches (in all three forms) as a 
natural part of Drill's multi-relation model. Rather than treating them as 
special cases, empty batches should be defined as a normal part of Drill's 
internal protocol.

A separate proposal (part of DRILL-5211) will suggest that the scanner use the 
above rules to compress out empty batches that emerge from a variety of special 
cases with readers.

> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failure caused by 
> empty batch. This JIRA is opened as an umbrella for all those related JIRAS ( 
> such as DRILL-4686, DRILL-4734, DRILL4476, DRILL-4255, etc).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Comment Edited] (DRILL-5546) Schema change problems caused by empty batch

Reply via email to