Paul Rogers created DRILL-7434:
----------------------------------
Summary: TopNBatch constructs Union vector incorrectly
Key: DRILL-7434
URL: https://issues.apache.org/jira/browse/DRILL-7434
Project: Apache Drill
Issue Type: Bug
Reporter: Paul Rogers
The Union type is an "experimental" type that has never been completed. Yet, we
use it as if it works.
Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this with
the new batch validator enabled. This test creates a union vector. Here is how
the schema looks:
{noformat}
(UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])
{noformat}
This is very hard to follow because the Union vector structure is complex (and
has many issues). Let's work through it.
We are looking at the {{MaterializedField}} for the union vector. It tells us
that this Union has two types: {{FLOAT8}} and {{INT}}. All good.
The Union has a vector per type, stored in an "internal map". That map shows
up as a child: it is on the {{children}} list as {{internal}}. However, the
metadata claims that only one vector exists in that map: the {{types}} vector
(the one that tells us what type to use for each row). The vectors for
{{FLOAT8}} and {{INT}} are missing.
If, however, we use our debugger and inspect the actual contents of the
{{internal}} map, we get the following:
{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8`
(FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
{noformat}
That is, the internal map has the correct schema, but the Union vector itself
has the wrong (incomplete) schema.
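The mismatch between the two listings above can be expressed as a simple set difference. The following is only a sketch: it works on plain lists of child names copied from the two dumps, and the class and method names ({{UnionMetadataDiffSketch}}, {{missingFromMetadata}}) are hypothetical, not part of Drill.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionMetadataDiffSketch {

  /**
   * Returns the child names present in the actual internal map
   * but absent from the children reported by the Union's metadata.
   */
  static List<String> missingFromMetadata(List<String> reported, List<String> actual) {
    List<String> missing = new ArrayList<>(actual);
    missing.removeAll(reported);
    return missing;
  }

  public static void main(String[] args) {
    // Children of `internal` as reported by the Union's metadata.
    List<String> reported = Arrays.asList("types");
    // Children of `internal` as seen in the debugger.
    List<String> actual = Arrays.asList("types", "float8", "int");

    System.out.println("missing from Union metadata: "
        + missingFromMetadata(reported, actual));
  }
}
```

With the names taken from the two dumps, the difference is the {{float8}} and {{int}} member vectors.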
This is an inherent design flaw of the Union vector: it requires two copies of
the schema to be in sync. Further, {{MaterializedField}} was designed to be
immutable, but the map and Union types require mutation. If the Union simply
points to the actual Map vector {{MaterializedField}}, it will drift out of
date since the map vector creates a new schema each time we add fields; the
Union vector ends up pointing to the old one.
This is not a simple bug to fix, but the result of the bug is that the vectors
end up corrupted, as detected by the Batch Validator. In fact, the bug itself
is subtle.
The TopNBatch does pass vector validation. However, because of the incorrect
metadata, the downstream {{RemovingRecordBatch}} creates the derived Union
vector incorrectly: it fails to set the value count for the {{INT}} type.
{noformat}
Found one or more vector errors from RemovingRecordBatch
kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
{noformat}
Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}}
type vector for a Union named {{kl}}.
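The kind of check that produces this error can be sketched as a comparison of each member vector's value count against the batch row count. This is an illustration only, not the Batch Validator's code: the class and method names are hypothetical, the message format is simplified, and the {{FLOAT8}} value count of 3 is an assumption (the report only gives the {{INT}} count).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ValueCountCheckSketch {

  /**
   * Returns one message per member vector whose value count
   * differs from the batch row count.
   */
  static List<String> check(String unionName, int rowCount, Map<String, Integer> valueCounts) {
    List<String> errors = new ArrayList<>();
    for (Map.Entry<String, Integer> e : valueCounts.entrySet()) {
      int valueCount = e.getValue();
      if (valueCount != rowCount) {
        errors.add(String.format("%s-type-%s: Row count = %d, but value count = %d",
            unionName, e.getKey(), rowCount, valueCount));
      }
    }
    return errors;
  }

  public static void main(String[] args) {
    Map<String, Integer> valueCounts = new LinkedHashMap<>();
    valueCounts.put("FLOAT8", 3); // assumed: value count was set correctly
    valueCounts.put("INT", 0);    // from the report: value count never set

    for (String error : check("kl", 3, valueCounts)) {
      System.out.println(error);
    }
  }
}
```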
--
This message was sent by Atlassian Jira
(v8.3.4#803005)