[jira] [Updated] (DRILL-5955) Revisit Union Vectors

2017-11-12 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5955:
---
Description: 
Drill supports a “Union Vector” type that allows a single column to hold values 
of multiple types. Conceptually, each column value is a (type, value) pair. For 
example, row 0 might be an Int, row 1 a Varchar and row 2 a NULL value.

The name refers to a C “union” in which the same bit of memory is used to 
represent one of a set of defined types.

Drill implements the union vector a bit like a map: as a collection of typed 
vectors. Each value is keyed by type. The result is that a union vector is more 
like a C “struct” than a C “union”: every vector takes space, but only one of 
the vectors is used for each row. For the example above, the union vector 
contains an Int vector, a Varchar vector and a type vector. For each row, 
either the Int or the Varchar is used. For NULL values, neither vector is used.

h4. Memory Footprint Concerns

The current representation, despite its name, makes very inefficient use of 
memory because it requires the sum of the storage for each included type. (That 
is, if we store 1000 rows, we need 1000 slots for integers, another 1000 for 
Varchar and yet another 1000 for the type vector.)

Drill poorly supports the union type. One operator that does support it is the 
sort. If the union type is enabled, and the sort sees a schema change, the sort 
will create a new union vector that combines the two types. The result is a 
sudden, unplanned increase in memory usage. Since the sort can buffer many 
hundreds of batches, this unplanned memory increase can cause the sort to run 
out of memory.

h4. Muddy Semantics

The union vector is closely tied with the List vector: a list vector is, 
essentially, an array of unions. (See DRILL-5958). The list type is used to 
model JSON in which a list can hold anything: another list, an object or 
scalars. For this reason, the union vector also can hold any type. And, indeed, 
it can hold a union of any of these types: a Map and an Int, or a List and a 
Map.

Drill is a relational, SQL-based tool. Work is required to bring non-relational 
structures into Drill. As discussed below, a union of scalars can be made to 
work. But, a union of structured types (lists, arrays or Maps) makes no sense.

h4. High Complexity

The union vector, as implemented is quite complex. It contains member variables 
for every other vector type (except, strangely, the decimal types.) Access to 
typed members is by type-specific methods, meaning that the client code must 
include a separate call for every type, resulting in very complex client code.

The complexity allowed the union type to be made to work, but causes this one 
type to consume a disproportionate amount of the vector and client code.

h4. Proposed Revision to Structure: The Variant Type

Given the above, we can now present the proposed changes. First let us 
recognize that a union vector need not hold structured types; there are other 
solutions as discussed in DRILL-. This leaves the union vector to hold just 
scalars.

h4. Proposed Revision to Storage

This in turn lets us adopt the [Variant 
type|https://en.wikipedia.org/wiki/Variant_type] originally introduced in 
Visual Basic. Variant “is a tagged union that can be used to represent any 
other data type”. The Variant type was designed to be compact by building on 
the idea of a tagged union in C.

{code}
struct {
  int tag; // type
  union {
int intValue;
long longValue;
…
  }
}
{code}

When implemented as a vector, the format could consume just a single 
variable-width vector with each entry of the form: {{\[type value]}}. The 
vector is simply a sequence of these (type, value) pairs.

The type is a single-byte that encodes the MinorType that describes the value. 
That is, the type byte is like the existing type vector, but stored in the same 
location as the data. The data is simply the serialized format of data. (Four 
bytes for an Int, 8 bytes for a Float8 and so on.)

Variable-width types require an extra field: the type field: {{\[type length 
value]}}. For example, a Varchar would be encoded as {{\[Varchar 27 byte0-26]}}.

A writer uses the type to drive the serialization. A reader uses the type to 
drive deserialization.

Note that the type field must include a special marker for nulls. Today, the 
union type uses 0 to indicate a null value. (Note that, in a union and variant, 
a null value is not a null of some type, both the type and value are null.) 
That form should be used in the variant representation as well. But, note that 
the 0 value in the MajorType enum is not Null but is instead Late. This is an 
unpleasant messiness that the union (and variant )encoding must handle.

An offset vector provides the location of each field, as is done with 
variable-length vectors today.

The result is huge compaction of

[jira] [Updated] (DRILL-5955) Revisit Union Vectors

2017-11-12 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5955:
---
Description: 
Drill supports a “Union Vector” type that allows a single column to hold values 
of multiple types. Conceptually, each column value is a (type, value) pair. For 
example, row 0 might be an Int, row 1 a Varchar and row 2 a NULL value.

The name refers to a C “union” in which the same bit of memory is used to 
represent one of a set of defined types.

Drill implements the union vector a bit like a map: as a collection of typed 
vectors. Each value is keyed by type. The result is that a union vector is more 
like a C “struct” than a C “union”: every vector takes space, but only one of 
the vectors is used for each row. For the example above, the union vector 
contains an Int vector, a Varchar vector and a type vector. For each row, 
either the Int or the Varchar is used. For NULL values, neither vector is used.

h4. Memory Footprint Concerns

The current representation, despite its name, makes very inefficient use of 
memory because it requires the sum of the storage for each included type. (That 
is, if we store 1000 rows, we need 1000 slots for integers, another 1000 for 
Varchar and yet another 1000 for the type vector.)

Drill poorly supports the union type. One operator that does support it is the 
sort. If the union type is enabled, and the sort sees a schema change, the sort 
will create a new union vector that combines the two types. The result is a 
sudden, unplanned increase in memory usage. Since the sort can buffer many 
hundreds of batches, this unplanned memory increase can cause the sort to run 
out of memory.

h4. Muddy Semantics

The union vector is closely tied with the List vector: a list vector is, 
essentially, an array of unions. (See DRILL-). The list type is used to 
model JSON in which a list can hold anything: another list, an object or 
scalars. For this reason, the union vector also can hold any type. And, indeed, 
it can hold a union of any of these types: a Map and an Int, or a List and a 
Map.

Drill is a relational, SQL-based tool. Work is required to bring non-relational 
structures into Drill. As discussed below, a union of scalars can be made to 
work. But, a union of structured types (lists, arrays or Maps) makes no sense.

h4. High Complexity

The union vector, as implemented is quite complex. It contains member variables 
for every other vector type (except, strangely, the decimal types.) Access to 
typed members is by type-specific methods, meaning that the client code must 
include a separate call for every type, resulting in very complex client code.

The complexity allowed the union type to be made to work, but causes this one 
type to consume a disproportionate amount of the vector and client code.

h4. Proposed Revision to Structure: The Variant Type

Given the above, we can now present the proposed changes. First let us 
recognize that a union vector need not hold structured types; there are other 
solutions as discussed in DRILL-. This leaves the union vector to hold just 
scalars.

h4. Proposed Revision to Storage

This in turn lets us adopt the [Variant 
type|https://en.wikipedia.org/wiki/Variant_type] originally introduced in 
Visual Basic. Variant “is a tagged union that can be used to represent any 
other data type”. The Variant type was designed to be compact by building on 
the idea of a tagged union in C.

{code}
struct {
  int tag; // type
  union {
int intValue;
long longValue;
…
  }
}
{code}

When implemented as a vector, the format could consume just a single 
variable-width vector with each entry of the form: {{\[type value]}}. The 
vector is simply a sequence of these (type, value) pairs.

The type is a single-byte that encodes the MinorType that describes the value. 
That is, the type byte is like the existing type vector, but stored in the same 
location as the data. The data is simply the serialized format of data. (Four 
bytes for an Int, 8 bytes for a Float8 and so on.)

Variable-width types require an extra field: the type field: {{\[type length 
value]}}. For example, a Varchar would be encoded as {{\[Varchar 27 byte0-26]}}.

A writer uses the type to drive the serialization. A reader uses the type to 
drive deserialization.

Note that the type field must include a special marker for nulls. Today, the 
union type uses 0 to indicate a null value. (Note that, in a union and variant, 
a null value is not a null of some type, both the type and value are null.) 
That form should be used in the variant representation as well. But, note that 
the 0 value in the MajorType enum is not Null but is instead Late. This is an 
unpleasant messiness that the union (and variant )encoding must handle.

An offset vector provides the location of each field, as is done with 
variable-length vectors today.

The result is huge compaction of

[jira] [Updated] (DRILL-5955) Revisit Union Vectors

2017-11-12 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5955:
---
Description: 
Drill supports a “Union Vector” type that allows a single column to hold values 
of multiple types. Conceptually, each column value is a (type, value) pair. For 
example, row 0 might be an Int, row 1 a Varchar and row 2 a NULL value.

The name refers to a C “union” in which the same bit of memory is used to 
represent one of a set of defined types.

Drill implements the union vector a bit like a map: as a collection of typed 
vectors. Each value is keyed by type. The result is that a union vector is more 
like a C “struct” than a C “union”: every vector takes space, but only one of 
the vectors is used for each row. For the example above, the union vector 
contains an Int vector, a Varchar vector and a type vector. For each row, 
either the Int or the Varchar is used. For NULL values, neither vector is used.

h4. Memory Footprint Concerns

The current representation, despite its name, makes very inefficient use of 
memory because it requires the sum of the storage for each included type. (That 
is, if we store 1000 rows, we need 1000 slots for integers, another 1000 for 
Varchar and yet another 1000 for the type vector.)

Drill poorly supports the union type. One operator that does support it is the 
sort. If the union type is enabled, and the sort sees a schema change, the sort 
will create a new union vector that combines the two types. The result is a 
sudden, unplanned increase in memory usage. Since the sort can buffer many 
hundreds of batches, this unplanned memory increase can cause the sort to run 
out of memory.

h4. Muddy Semantics

The union vector is closely tied with the List vector: a list vector is, 
essentially, an array of unions. (See DRILL-). The list type is used to 
model JSON in which a list can hold anything: another list, an object or 
scalars. For this reason, the union vector also can hold any type. And, indeed, 
it can hold a union of any of these types: a Map and an Int, or a List and a 
Map.

Drill is a relational, SQL-based tool. Work is required to bring non-relational 
structures into Drill. As discussed below, a union of scalars can be made to 
work. But, a union of structured types (lists, arrays or Maps) makes no sense.

h4. High Complexity

The union vector, as implemented is quite complex. It contains member variables 
for every other vector type (except, strangely, the decimal types.) Access to 
typed members is by type-specific methods, meaning that the client code must 
include a separate call for every type, resulting in very complex client code.

The complexity allowed the union type to be made to work, but causes this one 
type to consume a disproportionate amount of the vector and client code.

h4. Proposed Revision to Structure: The Variant Type

Given the above, we can now present the proposed changes. First let us 
recognize that a union vector need not hold structured types; there are other 
solutions as discussed in DRILL-. This leaves the union vector to hold just 
scalars.

h4. Proposed Revision to Storage

This in turn lets us adopt the [Variant 
type|https://en.wikipedia.org/wiki/Variant_type] originally introduced in 
Visual Basic. Variant “is a tagged union that can be used to represent any 
other data type”. The Variant type was designed to be compact by building on 
the idea of a tagged union in C.

{code}
struct {
  int tag; // type
  union {
int intValue;
long longValue;
…
  }
}
{code}

When implemented as a vector, the format could consume just a single 
variable-width vector with each entry of the form: {{\[type value]}}. The 
vector is simply a sequence of these (type, value) pairs.

The type is a single-byte that encodes the MinorType that describes the value. 
That is, the type byte is like the existing type vector, but stored in the same 
location as the data. The data is simply the serialized format of data. (Four 
bytes for an Int, 8 bytes for a Float8 and so on.)

Variable-width types require an extra field: the type field: {{\[type length 
value]}}. For example, a Varchar would be encoded as {{\[Varchar 27 byte0-26]}}.

A writer uses the type to drive the serialization. A reader uses the type to 
drive deserialization.

Note that the type field must include a special marker for nulls. Today, the 
union type uses 0 to indicate a null value. (Note that, in a union and variant, 
a null value is not a null of some type, both the type and value are null.) 
That form should be used in the variant representation as well. But, note that 
the 0 value in the MajorType enum is not Null but is instead Late. This is an 
unpleasant messiness that the union (and variant )encoding must handle.

An offset vector provides the location of each field, as is done with 
variable-length vectors today.

The result is huge compaction of