paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2334853752
@ychernysh, thank you for the detailed explanation. It is impressive how
well you understand this complex code.
**Scope**: We're clear now that this PR is not trying to handle conflicting
types. Yes, in my example, I was suggesting that the person who creates the
Parquet files manage column types themselves. There is, of course, another
workaround that I did not mention: the SQL user can (I believe) insert the
needed casts. Suppose that we have a set of files where the types differ for
three columns, but we know a common type. We can coerce the types manually:
```sql
SELECT a, b, c
FROM (SELECT CAST(a AS DOUBLE), CAST(b AS VARCHAR), CAST(c AS INT) FROM myParquet)
ORDER BY a, b
```
The above should insert that PROJECT I mentioned. In an ideal world, Drill
would figure this out from the Parquet metadata. As we said, this can be left
as a future project for another time.
**Wildcard Query**: Ideally, if I have a set of Parquet files with columns `a`
and `b`, where `b` is missing from some of the files, the following two queries
should work identically:
```sql
SELECT a, b FROM (SELECT * FROM myParquet) ORDER BY a, b
SELECT a, b FROM myParquet ORDER BY a, b
```
That is, it should not matter how we learn we will scan column `b`: if it is
missing, it should have a consistent type after this fix.
The code you mentioned reflects the original schema-on-read design: the
wildcard is expanded for each row group at run time. This is one reason I was
surprised that we gather the schema at plan time. Now that it is clear that
Parquet does have the schema at plan time, we can work out the union of all
columns from all files at plan time. We can sort out the types of missing
columns. We can then tell readers that `SELECT *` expands to not just all the
columns in that particular row group, but to the union of all the columns.
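To make that concrete, here is a minimal sketch of a plan-time schema union,
assuming the planner can hand us each file's `TupleMetadata` (the helper name
and iteration are mine, not code from this PR):
```java
import org.apache.drill.exec.record.metadata.ColumnMetadata;
import org.apache.drill.exec.record.metadata.TupleMetadata;
import org.apache.drill.exec.record.metadata.TupleSchema;

// Fold every file's schema into one plan-time union. The first file to
// mention a column wins; later duplicates are skipped.
public static TupleMetadata unionOfSchemas(Iterable<TupleMetadata> fileSchemas) {
  TupleSchema union = new TupleSchema();
  for (TupleMetadata fileSchema : fileSchemas) {
    for (ColumnMetadata col : fileSchema) {
      if (union.metadata(col.name()) == null) {
        union.addColumn(col.copy());
      }
      // Conflicting types across files are out of scope for this PR;
      // a fuller version would reconcile or reject them here.
    }
  }
  return union;
}
```
Readers would then expand the wildcard against this union rather than against
their own row group's columns.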
It is clear that we've got a mess. Drill started as schema-on-read. Then, it
was found that, for Parquet (only), we need the schema at plan time. But the
folks who added that code didn't fully think through the design. The result is
a huge muddle that you are now starting to sort out.
_Suggestion_: let's leave proper wildcard expansion to another time. You are
fixing this bug for a reason: for some use case. If your use case does not use
wildcard queries, then it is safe to defer this issue until someone actually
needs a fix.
**Reading a non-nullable Parquet column into a nullable vector**: Thanks for
ensuring we set the null vector correctly. Sounds like this part is good.
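For anyone following along, the invariant being preserved is roughly this (a
sketch using Drill's vector mutators, not the PR's actual code): when copying
non-nullable Parquet data into a nullable vector, every slot's "is set" bit
must be marked along with the value.
```java
// Sketch: NullableIntVector chosen only for illustration.
NullableIntVector.Mutator mutator = nullableVector.getMutator();
for (int i = 0; i < rowCount; i++) {
  // setSafe() writes the value and flips the corresponding "is set"
  // bit; skipping it would leave the slot looking null.
  mutator.setSafe(i, values[i]);
}
mutator.setValueCount(rowCount);
```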
**Passing type information to readers**: I saw your fix. That is why I
mentioned that we now have two lists of columns given to the reader:
```java
rowGroupScan.getColumns(), // Columns as SchemaPath
...
// each parquet SubScan shares the same table schema constructed by a GroupScan
rowGroupScan.getSchema()); // Same list of columns as above, but as TupleMetadata?
```
As a reader, I have to try to understand: are the two column lists the same?
Is the order the same? Is the `TupleMetadata` version in a 1:1 relationship with
the `getColumns()` list? If not, what are the differences?
You are adding code to an existing implementation, and so you want to avoid
changing things any more than necessary. Having redundant lists is messy, but
probably the simplest fix.
_Suggestion_: Maybe just add a comment about the assumed relationship
between the two lists.
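Something along these lines, placed where the reader is created (the wording is
only a suggestion; please verify the actual relationship before committing to
it):
```java
// ASSUMED RELATIONSHIP: rowGroupScan.getColumns() and
// rowGroupScan.getSchema() describe the same projected columns.
// The SchemaPath list carries the projection as parsed; the
// TupleMetadata carries the plan-time types for those same columns.
// If the two ever diverge, the reader's column resolution breaks.
```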
**UntypedNull (Part 1)**: Thanks for the detailed explanation. I appreciate
the time you've put into fully understanding the convoluted logic.
When faced with complex legacy code, I find it helpful to ask, _what is this
trying to do_? The code itself is the ultimate truth, and we have to start
there. But, to sort out what should be happening, we have to work out the
developer's intent, and figure out if they made a mistake or omitted some
important condition.
You pointed out that we do two checks. In the [first
one](https://github.com/apache/drill/blob/11aaa3f89cb85f7ef63e6cc5416c6b2f90e8322c/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206-L208):
```java
public static TypedFieldId getFieldId(ValueVector vector, int id, SchemaPath expectedPath, boolean hyper) {
  if (!expectedPath.getRootSegment().getPath().equalsIgnoreCase(vector.getField().getName())) {
    return null;
  }
```
This code says, here is a `ValueVector` and an expected `SchemaPath`. Let's
make sure that the vector actually is for the given schema path by checking the
`MaterializedField` for the vector. In the case where the name was corrupted
with backticks, this check failed: we have a vector with the name `foo`, but
the expected path is `` `foo` `` (with the backticks retained). So, no match.
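To state the failure concretely (the literal values are only illustrative):
```java
// What the old check compared when the name carried backticks:
String vectorName = "foo";     // vector.getField().getName()
String expectedName = "`foo`"; // expectedPath.getRootSegment().getPath()
// equalsIgnoreCase() is false, so getFieldId() returned null
// for a column that actually exists.
boolean matches = expectedName.equalsIgnoreCase(vectorName);
```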
The core question is: what use case is this meant to handle? I can speculate
that there are two possible cases.
First is the top-level fields. For top level fields, the names should always
match. By the time we get here, we should have created any needed "dummy"
top-level vectors. You correctly fixed a case where the top level did not match.
I speculate that this code is meant to handle a second case: a column within
a `MAP` column. Suppose the Parquet file has a map field `m` that contains two
columns, `a` and `b`. The query, however, is asking to project `m.c` which does
not exist. Perhaps this bit of code handles that map case.
_Suggestion_: your fix is probably fine. Please check if we have a test for
the map case above. If we don't, consider adding one, just so we verify that
this PR doesn't break anything.
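Something like the following, as a hypothetical sketch (the fixture file and
test-framework names are assumptions; adapt to whatever the existing Parquet
tests use):
```java
// Assumes a ClusterTest-style fixture with queryBuilder() available.
@Test
public void testProjectMissingMapMember() throws Exception {
  // m exists in the file; m.c does not. The query should produce a
  // null-filled dummy column rather than fail at schema resolution.
  String sql = "SELECT t.m.c FROM cp.`parquet/some_map_file.parquet` t";
  queryBuilder().sql(sql).run();
}
```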
**UntypedNull (Part 2)**: Next, let's understand what _should_ happen if a
column is missing. We have four cases:
1. The column `c` exists in _none_ of the Parquet files.
2. The column `c` exists in _some_ of the Parquet files.
3. The column `m.c` (a column within a map) exists in _some_ Parquet files.
4. The column `m.c` exists in _none_ of the Parquet files.
Your fix handles case 2: we will now correctly use the type of existing
columns. So, we're good here.
Case 1 is where the Nullable `INT`/Untyped NULL question arises. I agree
with you that we should default the column type to Untyped NULL. It is odd that
we used Nullable `INT` in one case, and Untyped NULL in another case. Doing so
makes no sense to a user. Can we modify the code so we always use Untyped NULL
in this case? We would detect case 1 in the planner, and mark the corresponding
column with the Untyped Null type. (I hope `TupleSchema` handles this type! I
can't recall ever testing it.) The Parquet reader would then know to use the
Untyped Null when creating the dummy vector. This is based on what _should_
happen; the devil is in the details, so please check if this suggestion can
actually work.
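If it can, the planner-side marking might look roughly like this (a sketch,
assuming `MetadataUtils` and `Types` behave as I remember; both the mapping of
`MinorType.NULL` to `UntypedNullVector` and `TupleSchema`'s tolerance of it
need verification):
```java
import org.apache.drill.common.types.TypeProtos.MajorType;
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.common.types.Types;
import org.apache.drill.exec.record.metadata.ColumnMetadata;
import org.apache.drill.exec.record.metadata.MetadataUtils;

// Case 1: column "c" appears in no file. Mark it as Untyped NULL in
// the table schema handed to the readers, so each reader builds an
// UntypedNullVector for the dummy column instead of Nullable INT.
MajorType untypedNull = Types.optional(MinorType.NULL);
ColumnMetadata missingCol = MetadataUtils.newScalar("c", untypedNull);
tableSchema.addColumn(missingCol);
```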
Case 3 is rather special. EVF has some rather complex logic to handle
missing map columns (to any depth). Your fix relies on code in the planner to
work out the type for case 2 (top-level columns). Does that code handle nested
columns? If so, we just do the same as we do for case 2. However, if the
planner code treats all maps as a single column (does not look inside), then
maybe we just leave the existing code to do whatever it already does in this
case.
Case 4 depends on what we do in case 1 and 3. The "right" answer is for the
missing column to be of type Untyped NULL. But, this case is obscure, so we can
leave it to do whatever it currently does.
_Suggestion_: handle case 1 as described above, and just test that cases 3
and 4 still work however they used to work.