[
https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878244#comment-17878244
]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------
paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2322288590
@ychernysh, thanks for the clarification of the column name issue. I agree
that the name stored in the vector should not contain backticks.
I am unaware of any code in the Foreman that scans all the Parquet files. Is
that something new? Doing that in the Foreman places extreme load on the
Foreman, causes remote reads in Hadoop, and will be slow. Drill relies on
statistical averages to balance load: each node does about the same amount of
work. Doing a large prescan in the Foreman invalidates this assumption. The
feature is fine for Drill run as a single process on a single machine for a
single user. It will cause hotspots and resource issues on a busy, distributed
cluster.
The old, original Parquet schema cache did, in fact, read all the files in a
folder, but did so in a distributed fashion, and wrote the result to a JSON
file. (But without locking, so that two updates at the same time led to
corruption.) Are we talking about a new feature recently added in the last few
years? Or, about the original Parquet schema cache?
If this feature does exist, then we now have six ways that Drill handles
schema (Parquet cache, provided schema, Drill Metastore, HMS, on-the-fly, and
this new prescan of Parquet files). If a schema is available, then I agree with
your analysis: the planner should detect column type conflicts and missing
columns. If a column is missing in any file, then the schema sent to the
readers should mark that column OPTIONAL. Further, if some Parquet files have
the column REQUIRED and some OPTIONAL, then the merged mode should be OPTIONAL.
You know the drill (pardon the pun).
Note, however, that deciding on a common schema is a **planner** task. With
EVF, the planner would provide the schema (using the provided schema mechanism)
to the reader and EVF would "do the right thing" to ensure that the reader's
output schema matches that defined by the planner. There are many CSV file unit
tests that show this in action.
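For instance, a provided schema that forces a possibly-missing column to
OPTIONAL could be built roughly like this. A sketch from memory, using Drill's
`SchemaBuilder` and `TupleMetadata` schema metadata classes:
```java
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

public class ProvidedSchemaSketch {
  // The planner hands a schema like this to every reader; EVF then
  // reconciles whatever each file actually contains against it.
  static TupleMetadata providedSchema() {
    return new SchemaBuilder()
        .add("id", MinorType.INT)           // REQUIRED: present in all files
        .addNullable("age", MinorType.INT)  // OPTIONAL: missing in some files
        .buildSchema();
  }
}
```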
Calcite does a fine job working out the required types assuming that Drill
provides column type information in its metadata. Does the new Parquet prescan
feed into Calcite, or is it something done on the side? If on the side, then
we'll run into issues where Calcite guesses one column type (based on
operations) but the Parquet schema prescan works out some other type (based on
the Parquet files). For example, if I have a query `SELECT SQRT(foo)...`,
Calcite will guess that the column has to be numeric. But, if the Parquet
prescan sees that `foo` is `VARCHAR`, then there will be a runtime conflict
rather than the planner sorting out the mess. Again, I'm ignorant of the
prescan feature, so I don't know what might have been done there.
Thus, the task for Parquet should be to recreate what EVF does (since EVF
already fought these battles), but using the older column writers. That is,
as sketched in code below:
* The Parquet reader has to properly handle the missing column to avoid a
schema change. In Drill, "schema change" means "rebind the vectors and
regenerate any operator-specific code."
* If the first file does not have the column, create a dummy column of the
type specified by the planner.
* If the first file does have a given column, create a vector _with the type
specified by the planner_, not the type specified by Parquet.
* If a subsequent file does not have the column, _reuse_ the prior vector,
which should be of OPTIONAL mode.
Note that Parquet must already have a way to reuse columns common to two
files, else we'd get a schema change on each new file. (Sorry, my knowledge of
the old column writers is rusty; it's been several years since I last looked at
them.)
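In code form, that reuse-or-create rule amounts to something like the
following. This is a plain-Java sketch, not the actual reader code: the
`Column` class and `columnForFile` method are stand-ins for Drill's
`ValueVector` and `MajorType` machinery.
```java
import java.util.HashMap;
import java.util.Map;

public class MissingColumnSketch {

  // Stand-in for a value vector; the type is always the planner's type,
  // e.g. "INT:OPTIONAL", never the type read from a particular file.
  static class Column {
    final String name;
    final String type;
    Column(String name, String type) { this.name = name; this.type = type; }
  }

  // Vectors kept across files; reusing them avoids a schema change.
  private final Map<String, Column> vectors = new HashMap<>();

  // Called once per selected column when opening each file.
  Column columnForFile(String name, String plannerType) {
    Column prior = vectors.get(name);
    if (prior != null) {
      // Subsequent file: reuse the prior vector whether or not this file
      // has the column; it is already OPTIONAL, so nulls fill any gap.
      return prior;
    }
    // First file: create the vector with the planner-specified type. If
    // the file lacks the column, this is the null-filled "dummy" column;
    // if the file has it, we still use the planner's type, not Parquet's.
    Column created = new Column(name, plannerType);
    vectors.put(name, created);
    return created;
  }
}
```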
In fact, given that Parquet is trying to solve the same problems as EVF, one
might want to review the EVF unit tests and recreate those many cases using
Parquet. "If one is ignorant of history one is doomed to repeat it."
Thanks again for the explanations. Now that I better understand the problem
(and am aware of that new Parquet prescan feature), I can do a better code
review.
> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
> Key: DRILL-8507
> URL: https://issues.apache.org/jira/browse/DRILL-8507
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
>
> {*}NOTE{*}: I worked on this issue along with DRILL-8508. It turned out that
> that work required this bug to be fixed first, and since the two cover
> slightly different things, it was decided to report them as separate issues.
> So I'm going to link this issue as a requirement for that one and open a
> single PR for both (if that's allowed). I think a single PR makes the code
> easier to review since the issues are closely related anyway.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a
> null-filled {{NullableIntVector}} with the column's name and the correct
> value count set. The field name for the vector is derived from the
> {{SchemaPath#toExpr}} method, which always encloses the resulting string in
> backticks.
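> To illustrate the difference (a quick sketch; if memory serves,
> {{SchemaPath#getSimplePath}} is the usual factory for a simple path):
> {code:java}
> SchemaPath col = SchemaPath.getSimplePath("age");
> String quoted = col.toExpr();               // "`age`"
> String unquoted = col.getAsUnescapedPath(); // "age"
> {code}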
> h3. Problems
> This breaks field name equality checks: comparing an unquoted field name
> with a quoted one returns false where it should logically return true, and
> that leads to errors.
> For example, the errors occur when you select a column from a table where
> some Parquet files contain it and some do not. Consider a {{dfs.tmp.people}}
> table with the following Parquet files and schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change. For
> example, {{{}ORDER BY{}}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]],
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age``
> (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> The ORDER BY error clearly shows that ``age`` is an extra column here: the
> incoming schema should contain only the unquoted field so that it matches
> the previous schema.
> Another example is with the {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch
> have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> Again, "different field counts" issue is caused by an extra quoted column
> that counts as different field.
> h3. Solution
> The solution for these errors is to replace the {{SchemaPath#toExpr}} call
> with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't quote the resulting
> string. Simple enough, but note that we used
> {{SchemaPath#getAsUnescapedPath}} before DRILL-4264, where we switched to
> {{{}SchemaPath#toExpr{}}}. The author even left a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision.
> But honestly, I don't understand the motivation for it, even with the
> comment; it's not clear what "existing maps" refers to here. Maybe someone
> from the community can help, but for now I will treat the change as a
> mistake, simply because it causes the problems above.
> h3. Regressions
> The change causes some regressions detected by unit tests. The tests I
> found failing do so because the fix changes their execution flow in these
> places:
> * [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> * [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
> Before the fix, the field name equality checks returned false because a
> quoted field name was compared with an unquoted one. The failing tests
> relied on that behavior and worked only under that condition. After the fix,
> the names match and the comparison returns true, so execution falls into a
> different branch that the tests aren't prepared for. But the change
> obviously _fixes the problem_, so the tests that relied on the broken
> behavior should now be adjusted.
> Please see more technical details on each failed unit test in the linked PR.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)