[ 
https://issues.apache.org/jira/browse/DRILL-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5955:
-------------------------------
    Description: 
Drill supports a “Union Vector” type that allows a single column to hold values 
of multiple types. Conceptually, each column value is a (type, value) pair. For 
example, row 0 might be an Int, row 1 a Varchar and row 2 a NULL value.

The name refers to a C “union” in which the same bit of memory is used to 
represent one of a set of defined types.

Drill implements the union vector a bit like a map: as a collection of typed 
vectors. Each value is keyed by type. The result is that a union vector is more 
like a C “struct” than a C “union”: every vector takes space, but only one of 
the vectors is used for each row. For the example above, the union vector 
contains an Int vector, a Varchar vector and a type vector. For each row, 
either the Int or the Varchar is used. For NULL values, neither vector is used.
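The struct-like layout can be sketched as follows (a hypothetical, simplified model for illustration; the field and type names are not Drill's actual implementation):

```c
#include <string.h>

/* Simplified sketch of the struct-like union vector: every typed
 * vector is allocated for all rows, but at most one holds a live
 * value for any given row. */
enum { TYPE_NULL = 0, TYPE_INT = 1, TYPE_VARCHAR = 2 };
enum { ROWS = 3 };

struct UnionVector {
    unsigned char types[ROWS];    /* type vector: one tag per row */
    int           ints[ROWS];     /* full-size Int vector         */
    const char   *varchars[ROWS]; /* full-size Varchar vector     */
};

/* Populate the example from the text: row 0 an Int, row 1 a
 * Varchar, row 2 NULL. Unused slots simply waste space. */
static void fill_example(struct UnionVector *v) {
    memset(v, 0, sizeof *v);
    v->types[0] = TYPE_INT;     v->ints[0] = 42;
    v->types[1] = TYPE_VARCHAR; v->varchars[1] = "hello";
    v->types[2] = TYPE_NULL;    /* neither typed vector is used */
}
```

Note that the `ints` and `varchars` arrays are both fully allocated regardless of how many rows actually use them, which is the source of the memory concern below.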

h4. Memory Footprint Concerns

The current representation, despite its name, makes very inefficient use of 
memory because it requires the sum of the storage for each included type. (That 
is, if we store 1000 rows, we need 1000 slots for integers, another 1000 for 
Varchar and yet another 1000 for the type vector.)
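To make the arithmetic concrete (with illustrative slot sizes, not Drill's exact allocation, which also rounds buffers to powers of two):

```c
/* Assumed sizes: 4-byte Int slots, 16-byte Varchar slots,
 * 1-byte type entries. Struct-like layout pays for every vector. */
static long union_footprint(long rows) {
    long int_vector     = rows * 4;   /* Int slots, used or not     */
    long varchar_vector = rows * 16;  /* Varchar slots, used or not */
    long type_vector    = rows * 1;   /* type entries               */
    return int_vector + varchar_vector + type_vector;
}

/* Variant sketch (described later in this issue): each row stores
 * only its own tagged value plus one shared offset vector. Assume
 * half the rows are Ints, half are 16-byte Varchars. */
static long variant_footprint(long rows) {
    long offsets         = (rows + 1) * 4;           /* offset vector      */
    long int_entries     = (rows / 2) * (1 + 4);     /* [type value]       */
    long varchar_entries = (rows / 2) * (1 + 1 + 16);/* [type length value]*/
    return offsets + int_entries + varchar_entries;
}
```

For 1000 rows, the struct-like layout costs 21,000 bytes under these assumptions, while the variant encoding costs 15,504, and the gap widens as more types join the union.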

Few Drill operators support the union type. One that does is the sort. If 
the union type is enabled, and the sort sees a schema change, the sort 
will create a new union vector that combines the two types. The result is a 
sudden, unplanned increase in memory usage. Since the sort can buffer many 
hundreds of batches, this unplanned memory increase can cause the sort to run 
out of memory.

h4. Muddy Semantics

The union vector is closely tied with the List vector: a list vector is, 
essentially, an array of unions. (See DRILL-5958). The list type is used to 
model JSON in which a list can hold anything: another list, an object or 
scalars. For this reason, the union vector also can hold any type. And, indeed, 
it can hold a union of any of these types: a Map and an Int, or a List and a 
Map.

Drill is a relational, SQL-based tool. Work is required to bring non-relational 
structures into Drill. As discussed below, a union of scalars can be made to 
work. But, a union of structured types (lists, arrays or Maps) makes no sense.

h4. High Complexity

The union vector, as implemented, is quite complex. It contains member 
variables for every other vector type (except, strangely, the decimal types.) Access to 
typed members is by type-specific methods, meaning that the client code must 
include a separate call for every type, resulting in very complex client code.

The complexity allowed the union type to be made to work, but it causes this 
one type to consume a disproportionate share of the vector and client code.

h4. Proposed Revision to Structure: The Variant Type

Given the above, we can now present the proposed changes. First let us 
recognize that a union vector need not hold structured types; there are other 
solutions as discussed in DRILL-xxxx. This leaves the union vector to hold just 
scalars.

h4. Proposed Revision to Storage

This in turn lets us adopt the [Variant 
type|https://en.wikipedia.org/wiki/Variant_type] originally introduced in 
Visual Basic. Variant “is a tagged union that can be used to represent any 
other data type”. The Variant type was designed to be compact by building on 
the idea of a tagged union in C.

{code}
struct {
  int tag; // type
  union {
    int intValue;
    long longValue;
    …
  }
}
{code}

When implemented as a vector, the format could consume just a single 
variable-width vector with each entry of the form: {{\[type value]}}. The 
vector is simply a sequence of these (type, value) pairs.

The type is a single byte that encodes the MinorType that describes the value. 
That is, the type byte is like the existing type vector, but stored in the same 
location as the data. The data is simply the serialized form of the value. 
(Four bytes for an Int, 8 bytes for a Float8 and so on.)

Variable-width types require an extra field, the length: {{\[type length 
value]}}. For example, a 27-byte Varchar would be encoded as {{\[Varchar 27 byte0-26]}}.

A writer uses the type to drive the serialization. A reader uses the type to 
drive deserialization.
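A minimal sketch of the per-entry encoding and decoding (the tag values and helper names here are hypothetical; real code would use Drill's MinorType ordinals):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical tag values; real code would use MinorType ordinals. */
enum { TAG_NULL = 0, TAG_INT = 1, TAG_VARCHAR = 2 };

/* Write one [type value] entry for an Int; returns bytes written. */
static size_t write_int(uint8_t *buf, int32_t v) {
    buf[0] = TAG_INT;
    memcpy(buf + 1, &v, sizeof v);  /* 4-byte serialized Int */
    return 1 + sizeof v;
}

/* Write one [type length value] entry for a short Varchar
 * (single-byte length, so strings under 256 bytes only). */
static size_t write_varchar(uint8_t *buf, const char *s) {
    uint8_t len = (uint8_t) strlen(s);
    buf[0] = TAG_VARCHAR;
    buf[1] = len;
    memcpy(buf + 2, s, len);
    return 2 + (size_t) len;
}

/* Read back an Int entry: the tag drives deserialization. */
static int32_t read_int(const uint8_t *buf) {
    int32_t v;
    memcpy(&v, buf + 1, sizeof v);
    return v;
}
```

A reader would dispatch on `buf[0]` to pick the right decode routine, just as the writer dispatched on the value's type to pick the encode routine.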

Note that the type field must include a special marker for nulls. Today, the 
union type uses 0 to indicate a null value. (Note that, in a union and a variant, 
a null value is not a null of some type; both the type and the value are null.) 
That form should be used in the variant representation as well. But note that 
the 0 value in the MajorType enum is not Null but is instead Late. This is an 
unpleasant messiness that the union (and variant) encoding must handle.

An offset vector provides the location of each field, as is done with 
variable-length vectors today.
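Random access then works exactly as for today's variable-length vectors; as a sketch (with hypothetical helper names):

```c
#include <stdint.h>

/* Entry i occupies data[offsets[i] .. offsets[i+1]); the first byte
 * is the type tag, the rest is the (possibly length-prefixed) value. */
static const uint8_t *entry_start(const uint8_t *data,
                                  const uint32_t *offsets, int i) {
    return data + offsets[i];
}

static uint32_t entry_len(const uint32_t *offsets, int i) {
    return offsets[i + 1] - offsets[i];
}
```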

The result is a large reduction in space requirements: from multiple vectors per 
type to just two vectors (offsets and data.)

Such a change would be daunting if clients work directly with vectors. However, 
with the introduction of the “result set loader” and “reader” abstractions, 
this change in format would be completely hidden from client code. The “result 
set” abstractions provide high level APIs that isolate clients from 
implementation, allowing changes such as this.

h4. Arrow Union Types

[Arrow|https://arrow.apache.org/docs/memory_layout.html] (see “Dense union 
type”) has retained Drill’s union vector design, which contains:

* One child array for each relative type
* Types buffer…
* Offsets buffer…

Unlike Drill, Arrow also has a “Sparse union type” that omits the offsets 
buffer if the child types are all of the same length.

The variant type is an opportunity for Drill to lead based on our extensive 
experience with vectors in production systems. Once the variant type is proven 
in production, we can offer it to Arrow as part of the Drill/Arrow integration.

h4. Backward Compatibility

While the actual code change is quite straightforward, the far larger challenge 
is backward compatibility. Drill offers both JDBC and ODBC drivers. These 
drivers make use of Drill's internal vector storage format. Thus, any change to 
the vector format will appear on the wire and must be understood by these 
clients.

Drill does not, unfortunately, provide a versioned API to deal with these 
issues. See DRILL-5957 for a proposed solution to the version negotiation 
problem.

For the union vector, let’s say the variant alternative is introduced in 
version Y. If a version X (older) client connects, the server converts the 
variant type to union format before sending to the client.

Thus, before we can change the union vector (or, for that matter, any vector), 
we must release clients that understand the version handshake protocol. Then, 
once those clients are deployed, a following server version can make the vector 
changes.

Note that this same issue will arise (only in much more complex form) if Drill 
were to adopt Arrow.

h4. Seed of a Row-Based Storage Format

Drill is a columnar engine. However, there are a few situations in which a 
row-based storage format would improve Drill performance and/or simplicity:

* JDBC and ODBC clients work with vectors today, but would prefer to work with 
rows. (The drivers contain complex code to do the column-to-row rotation on the 
client.)
* Hash exchanges route each row to a particular host based on a hash key, but 
today do that by buffering rows until gathering a large enough batch to send, 
causing severe memory pressure. Row-by-row sending would be faster and more 
memory efficient.

If the variant format were available, a simple extension would be to use the 
same encoding for a row format:

* An offset vector, indexed by column, gives the start location of each column.
* The row buffer is a sequence of (type, value) pairs (for fixed-width) or 
(type, length, value) triples (for variable-width types.)

The same encoder/decoder that handles a column of heterogeneous values could 
also handle the same structure that represents a row of such values.
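Under the same assumptions as the column sketch above (hypothetical tags and helper names, single-byte Varchar lengths), a two-column row could be laid out like this:

```c
#include <stdint.h>
#include <string.h>

enum { TAG_INT = 1, TAG_VARCHAR = 2 };

/* Encode one row of two columns into `buf`, recording each column's
 * start in `col_offsets` (an offset vector indexed by column). */
static size_t encode_row(uint8_t *buf, uint32_t *col_offsets,
                         int32_t c0, const char *c1) {
    size_t pos = 0;

    col_offsets[0] = (uint32_t) pos;   /* column 0: [type value] */
    buf[pos++] = TAG_INT;
    memcpy(buf + pos, &c0, 4);
    pos += 4;

    col_offsets[1] = (uint32_t) pos;   /* column 1: [type length value] */
    uint8_t len = (uint8_t) strlen(c1);
    buf[pos++] = TAG_VARCHAR;
    buf[pos++] = len;
    memcpy(buf + pos, c1, len);
    pos += len;

    return pos;  /* total row length in bytes */
}
```

The decoder would be the same tag-driven routine used for a variant column, applied to each column offset in turn.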

h4. SQL-level Variant Semantics

The union vector (and the proposed new “variant” vector) exists to hold a 
variety of types. However, SQL is designed to work with just a single type. 
Therefore, we must consider not just storage representation, but also query 
semantics.

A challenge is that neither JDBC nor ODBC was designed for variants, nor do 
most analytic tools know how to interpret varying data types. Indeed, since 
these APIs and tools are designed for relational data (in which the type of 
each column is known and fixed), it is the job of the query tool to 
determine the column type.

This means that, when using JDBC and ODBC, all union/variant processing must be 
done within Drill itself, with the client seeing a single, combined output 
type after some internal operation to produce that combined type.

One simple use case is to handle type schema changes within an input. For 
example, in JSON, a value might first present as an Integer, later as a Float. 
Or, a value might start small enough for a Float, but later present values that 
require a BigDecimal.

In such cases, a variant type allows Drill to hold values that correspond to 
how the JSON parser retrieved the values.

To use those values in SQL, however, the user must unify them, perhaps with a 
Cast. For example, in the mixed-number case above, the user might cast the 
column to a decimal.

h4. Alternatives to the Union/Variant Types

Here, however, we can take a step back and ask a larger question. If the 
union/variant vector exists to handle schema changes, might it be better to 
push the final schema down to the reader and interpret the data as the 
final type at read time? That is, if we could tell the JSON reader (say) that 
column “x” is a Decimal, then the reader can do the conversion, saving all the 
complexity of a union (or variant) vector and casting.

One way to do this is to “push” cast operations into the reader by providing 
the reader not just column names, but types as well. That is, 
projected columns are not just names; they are (name, type) pairs.
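The projected-column contract could be as simple as the following (names and hint values are hypothetical, for illustration only):

```c
/* A projected column carries an optional type hint; a reader that
 * receives a hint converts at read time instead of emitting a union. */
typedef enum { HINT_NONE, HINT_INT, HINT_DECIMAL } TypeHint;

typedef struct {
    const char *name;  /* e.g. "x"                             */
    TypeHint    hint;  /* e.g. HINT_DECIMAL for column "x"     */
} ProjectedColumn;
```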

The above cannot solve the {{SELECT *}} case, however, as the user has chosen 
not to specify names (let alone types.)

A more general solution is to allow the user to specify the column types as 
metadata (as is already done in all other query tools, perhaps via Hive.) Then, 
the user need not specify the types via casts in each query. Because the types 
are known at read time, {{SELECT *}} works fine. As a result, the need for a 
union/variant never arises.

Here it is worth pointing out that Drill must still be able to query data 
without a schema. But, type conflicts may appear since Drill can’t predict the 
future. The user then decides which path forward is easiest for their own use 
case: 1) live with the issue, 2) add casts to each query, 3) add casts to a 
per-file view, or 4) provide metadata that solves the problem once and for all.

Given this, are there other cases where we actually do need the union type? Do 
we have compelling use cases? If not, then the best path forward is simply to 
retire the union type in favor of the type hints described above.



> Revisit Union Vectors
> ---------------------
>
>                 Key: DRILL-5955
>                 URL: https://issues.apache.org/jira/browse/DRILL-5955
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
