[ https://issues.apache.org/jira/browse/DRILL-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers updated DRILL-5529: ------------------------------- Fix Version/s: (was: 1.11.0) Future > Fixed width, repeated vector mutators missing "fill empties" logic > ------------------------------------------------------------------ > > Key: DRILL-5529 > URL: https://issues.apache.org/jira/browse/DRILL-5529 > Project: Apache Drill > Issue Type: Bug > Affects Versions: 1.8.0 > Reporter: Paul Rogers > Assignee: Paul Rogers > Fix For: Future > > > Consider the Drill {{NullableVarCharVector}} type. This vector is composed of > three buffers (also called vectors): > * Is-set (bit) vector: contains 1 if the value is set, 0 if it is null. > * Data vector; effectively a byte array in which each value is packed one > after another. > * Offset vector, in which the entry for each row points to the first byte of > the value in the data vector. > Suppose we have the values "foo", null, "bar". Then, the vectors contain: > {code} > Is-Set: [1 0 1] > Offsets: [0 3 3 6] > Data: [f o o b a r] > {code} > (Yes, there is one more offset entry than rows.) > Suppose that the code creating the vector writes values for rows 1 and 3, but > omits 2 (it is null, which is the default). How do we get that required value > of 3 in the entry for row 2? The answer is that the logic for setting a value > keeps track of the last write position and "backfills" missing offset values: > {code} > public void setSafe(int index, ByteBuffer value, int start, int length) { > if (index > lastSet + 1) { > fillEmpties(index); > } > ... > {code} > So, when we write the value for row 3 ("bar") we back-fill the missing offset > for row 2. So far so good. > We can now generalize. We must to the same trick any time that we use a > vector that uses an offset vector. There are three other cases: > * Required variable-width vectors (where a missing value is the same as an > empty string). > * A repeated fixed-width vector. > * A repeated variable-width vector (which has *two* offset vectors). > The problem is, none of these actually provide the required code. The caller > must implement its own back-fill logic else the offset vectors become > corrupted. > Consider the required {{VarCharVector}}: > {code} > protected void set(int index, byte[] bytes, int start, int length) { > assert index >= 0; > final int currentOffset = offsetVector.getAccessor().get(index); > offsetVector.getMutator().set(index + 1, currentOffset + length); > data.setBytes(currentOffset, bytes, start, length); > } > {code} > As a result of this omission, any client which skips null values will corrupt > offset vectors. Consider an example: "try", "foo", "", "bar". We omit writing > record 2 (empty string). Desired result: > {code} > Data: [t r y f o o b a r] > Offsets: [0 3 6 6 9] > {code} > Actual result: > {code} > Data: [t r y f o o b a r] > Offsets: [0 3 6 0 9] > {code} > The result is that we compute the width of field 2 as -6, not 3. The value of > the empty field is 9, not 0. > A similar issue arrises with repeated vectors. Consider > {{RepeatedVarCharVector}}: > {code} > public void addSafe(int index, byte[] bytes, int start, int length) { > final int nextOffset = offsets.getAccessor().get(index+1); > values.getMutator().setSafe(nextOffset, bytes, start, length); > offsets.getMutator().setSafe(index+1, nextOffset+1); > } > {code} > Consider this example: (\["a", "b"], \[ ], \["d", "e"]). > Expected: > {code} > Array Offset: [0 2 2 4] > Value Offset: [0 1 2 3 4] > Data: [a b d e] > {code} > Actual: > {code} > Array Offset: [0 2 0 4] > Value Offset: [0 1 2 3 4] > Data: [a b d e] > {code} > The entry for the (unwritten) position 2 is missing. > This bug may be the root cause of several other issues found recently. > (Potentially DRILL-5470 -- need to verify.) > Two resolutions are possible: > * Require that client code write all values, backfilling empty or null values > as needed. > * Generalize the mutators to back-fill in all cases, not just a nullable var > char. > A related issue occurs when a reader fails to do a "final fill" at the end of > a batch (DRILL-5487). -- This message was sent by Atlassian JIRA (v6.4.14#64029)