[
https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878244#comment-17878244
]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------
paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2322288590
@ychernysh, thanks for the clarification of the column name issue. I agree
that the name stored in the vector should not contain backticks.
I am unaware of any code in the Foreman that scans all the Parquet files. Is
that something new? Doing that in the Foreman places extreme load on the
Foreman, causes remote reads in Hadoop, and will be slow. Drill relies on
statistical averages to balance load: each node does about the same amount of
work. Doing a large prescan in the Foreman invalidates this assumption. The
feature is fine for Drill run as a single process on a single machine for a
single user. It will cause hotspots and resource issues on a busy, distributed
cluster.
The old, original Parquet schema cache did, in fact, read all the files in a
folder, but did so in a distributed fashion, and wrote the result to a JSON
file. (But without locking, so that two updates at the same time led to
corruption.) Are we talking about a new feature recently added in the last few
years? Or, about the original Parquet schema cache?
If this feature does exist, then we now have six ways that Drill handles
schema (Parquet cache, provided schema, Drill Metastore, HMS, on-the-fly, and
this new prescan of Parquet files). If a schema is available, then I agree with
your analysis: the planner should detect column type conflicts and missing
columns. If a column is missing in any file, then the schema sent to the
readers should mark that column OPTIONAL. Further, if some Parquet files have
the column REQUIRED and some OPTIONAL, then the merged mode should be OPTIONAL.
You know the drill (pardon the pun).
Note, however, that deciding on a common schema is a **planner** task. With
EVF, the planner would provide the schema (using the provided schema mechanism)
to the reader and EVF would "do the right thing" to ensure that the reader's
output schema matches that defined by the planner. There are many CSV file unit
tests that show this in action.
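For instance, a provided schema that forces a possibly-missing column to
OPTIONAL could be built roughly like this. A sketch from memory, using Drill's
`SchemaBuilder` and `TupleMetadata` schema metadata classes:
```java
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

public class ProvidedSchemaSketch {
  // The planner hands a schema like this to every reader; EVF then
  // reconciles whatever each file actually contains against it.
  static TupleMetadata providedSchema() {
    return new SchemaBuilder()
        .add("id", MinorType.INT)           // REQUIRED: present in all files
        .addNullable("age", MinorType.INT)  // OPTIONAL: missing in some files
        .buildSchema();
  }
}
```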
Calcite does a fine job working out the required types assuming that Drill
provides column type information in its metadata. Does the new Parquet prescan
feed into Calcite, or is it something done on the side? If on the side, then
we'll run into issues where Calcite guesses one column type (based on
operations) but the Parquet schema prescan works out some other type (based on
the Parquet files). For example, if I have a query `SELECT SQRT(foo)...`,
Calcite will guess that the column has to be numeric. But, if the Parquet
prescan sees that `foo` is `VARCHAR`, then there will be a runtime conflict
rather than the planner sorting out the mess. Again, I'm ignorant of the
prescan feature, so I don't know what might have been done there.
Thus, the task for Parquet should be to recreate what EVF does (since EVF
already fought these battles), but using the older column writers. That is,
as sketched in code below:
* The Parquet reader has to properly handle the missing column to avoid a
schema change. In Drill, "schema change" means "rebind the vectors and
regenerate any operator-specific code."
* If the first file does not have the column, create a dummy column of the
type specified by the planner.
* If the first file does have a given column, create a vector _with the type
specified by the planner_, not the type specified by Parquet.
* If a subsequent file does not have the column, _reuse_ the prior vector,
which should be of OPTIONAL mode.
Note that Parquet must already have a way to reuse columns common to two
files, else we'd get a schema change on each new file. (Sorry, my knowledge of
the old column writers is rusty; it's been several years since I last looked at
them.)
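In code form, that reuse-or-create rule amounts to something like the
following. This is a plain-Java sketch, not the actual reader code: the
`Column` class and `columnForFile` method are stand-ins for Drill's
`ValueVector` and `MajorType` machinery.
```java
import java.util.HashMap;
import java.util.Map;

public class MissingColumnSketch {

  // Stand-in for a value vector; the type is always the planner's type,
  // e.g. "INT:OPTIONAL", never the type read from a particular file.
  static class Column {
    final String name;
    final String type;
    Column(String name, String type) { this.name = name; this.type = type; }
  }

  // Vectors kept across files; reusing them avoids a schema change.
  private final Map<String, Column> vectors = new HashMap<>();

  // Called once per selected column when opening each file.
  Column columnForFile(String name, String plannerType) {
    Column prior = vectors.get(name);
    if (prior != null) {
      // Subsequent file: reuse the prior vector whether or not this file
      // has the column; it is already OPTIONAL, so nulls fill any gap.
      return prior;
    }
    // First file: create the vector with the planner-specified type. If
    // the file lacks the column, this is the null-filled "dummy" column;
    // if the file has it, we still use the planner's type, not Parquet's.
    Column created = new Column(name, plannerType);
    vectors.put(name, created);
    return created;
  }
}
```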
In fact, given that Parquet is trying to solve the same problems as EVF, one
might want to review the EVF unit tests and recreate those many cases using
Parquet. "If one is ignorant of history one is doomed to repeat it."
Thanks again for the explanations. Now that I better understand the problem
(and am aware of that new Parquet prescan feature), I can do a better code
review.
> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
> Key: DRILL-8507
> URL: https://issues.apache.org/jira/browse/DRILL-8507
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
>
> {*}NOTE{*}: I worked on this issue along with DRILL-8508. It turned out that
> that work required this bug to be fixed first, and since the two cover
> slightly different things, it was decided to report them as separate issues.
> So I'm going to link this issue as a requirement for that one and open a
> single PR for both (if that's allowed). I think a single PR makes the code
> easier to review since the issues are closely related anyway.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a
> null-filled {{NullableIntVector}} with the column's name and the correct
> value count set. The field name for the vector is derived from the
> {{SchemaPath#toExpr}} method, which always encloses the resulting string in
> backticks.
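> To illustrate the difference (a quick sketch; if memory serves,
> {{SchemaPath#getSimplePath}} is the usual factory for a simple path):
> {code:java}
> SchemaPath col = SchemaPath.getSimplePath("age");
> String quoted = col.toExpr();               // "`age`"
> String unquoted = col.getAsUnescapedPath(); // "age"
> {code}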
> h3. Problems
> This breaks field name equality checks: comparing an unquoted field name
> with a quoted one returns false where it should logically return true, and
> that leads to errors.
> For example, the errors occur when you select a column from a table where
> some Parquet files contain it and some do not. Consider a {{dfs.tmp.people}}
> table with the following Parquet files and schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change. For
> example, {{{}ORDER BY{}}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]],
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age``
> (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> The ORDER BY error clearly shows that ``age`` is an extra column here: the
> incoming schema should contain only the unquoted field so that it matches
> the previous schema.
> Another example is with the {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch
> have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> Again, "different field counts" issue is caused by an extra quoted column
> that counts as different field.
> h3. Solution
> The solution for these errors is to replace the {{SchemaPath#toExpr}} call
> with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't quote the resulting
> string. Simple enough, but note that we used
> {{SchemaPath#getAsUnescapedPath}} before DRILL-4264, where we switched to
> {{{}SchemaPath#toExpr{}}}. The author even left a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision.
> But honestly, I don't understand the motivation for it, even with the
> comment; it's not clear what "existing maps" refers to here. Maybe someone
> from the community can help, but for now I will treat the change as a
> mistake, simply because it causes the problems above.
> h3. Regressions
> The change causes some regressions detected by unit tests. The tests I
> found failing do so because the fix changes their execution flow in these
> places:
> * [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> * [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
> Before the fix, the field name equality checks returned false because a
> quoted field name was compared with an unquoted one. The failing tests
> relied on that behavior and worked only under that condition. After the fix,
> the names match and the comparison returns true, so execution falls into a
> different branch that the tests aren't prepared for. But the change
> obviously _fixes the problem_, so the tests that relied on the broken
> behavior should now be adjusted.
> Please see more technical details on each failed unit test in the linked PR.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)