[
https://issues.apache.org/jira/browse/DRILL-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059284#comment-17059284
]
ASF GitHub Bot commented on DRILL-7330:
---------------------------------------
vvysotskyi commented on issue #2026: DRILL-7330: Implement metadata usage for
all format plugins
URL: https://github.com/apache/drill/pull/2026#issuecomment-599032547
@paul-rogers, this pull request enables the format plugin to gather
metadata. Metadata gathering logic was added in DRILL-7273.
Regarding the schema: when metadata is being collected, the rules are the same
as for regular select queries. Drill either tries to infer the table schema or
uses the user-provided schema.
The metadata collection logic may become clearer after reading this section of
the docs:
https://github.com/apache/drill/blob/master/docs/dev/MetastoreAnalyze.md#analyze-operators-description
or this design doc:
https://docs.google.com/document/d/14pSIzKqDltjLEEpEebwmKnsDPxyS_6jGrPOjXu6M_NM/edit?usp=sharing
In short, yes: we use a reader that reads all the data, plus downstream
operators that transform and store its statistics.
> For files that need a provided schema (CSV, say), do we apply stats to the
columns after type conversion, or are stats gathered on the raw text values?
That is, does this work use the provided schema if available?
Yes, we apply stats to the columns after schema conversion, so stats such as
min/max have correct values under the natural ordering of the converted type.
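To illustrate why the ordering matters, here is a minimal sketch (not Drill code; the column values and helper name are hypothetical) comparing min/max computed on raw text with min/max computed after conversion to the type a provided schema might declare:

```python
# Hypothetical values of one CSV column, as raw text.
raw_values = ["9", "10", "100", "25"]


def min_max(values):
    """Return (min, max) under the natural ordering of the values' type."""
    return min(values), max(values)


# Stats on raw text follow lexicographic ordering.
text_stats = min_max(raw_values)

# Stats after converting to INT, as an assumed provided schema would require.
int_stats = min_max([int(v) for v in raw_values])

print(text_stats)  # ('10', '9')
print(int_stats)   # (9, 100)
```

With text ordering, a filter such as `col > 50` could not be evaluated safely against the stored min/max; with the converted values it can.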
> How does the provided schema relate to the metadata schema?
After the provided schema is applied in the scan, Drill takes the resolved
schema for the columns and stores it in the Metastore.
> What stats will we gather for non-Parquet files? How will we use them?
Looks like there is code for partitions (have not looked in depth, so I may be
wrong). Are we using stats for partition pruning? If so, how does that differ
from the existing practice of just walking the directory tree?
We collect exactly the same stats for non-Parquet files, and we may use them
in the same way as for Parquet: pruning files when a filter on specific
columns is specified, and pruning unneeded files for limit queries. Directory
pruning still works the same way as it did before these changes (it also works
for Parquet).
I think some of the tests in `TestMetastoreWithEasyFormatPlugin` will help
show which optimizations were added.
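As a rough illustration of the two kinds of file pruning mentioned above, here is a sketch under stated assumptions. It is not Drill's planner logic: `FileStats`, `prune_for_filter`, and `prune_for_limit` are hypothetical names, and the stats are reduced to a row count plus min/max for a single column.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    # Hypothetical per-file statistics; field names are illustrative only.
    path: str
    row_count: int
    col_min: int
    col_max: int


def prune_for_filter(files, lower, upper):
    """Keep files whose [col_min, col_max] range overlaps [lower, upper]."""
    return [f for f in files if f.col_max >= lower and f.col_min <= upper]


def prune_for_limit(files, limit):
    """Keep the shortest prefix of files whose row counts cover the limit."""
    kept, rows = [], 0
    for f in files:
        if rows >= limit:
            break
        kept.append(f)
        rows += f.row_count
    return kept


files = [
    FileStats("a.csv", 100, 1, 50),
    FileStats("b.csv", 100, 51, 120),
    FileStats("c.csv", 100, 121, 300),
]

# Filter: col BETWEEN 60 AND 100 can only match rows in b.csv.
filtered = [f.path for f in prune_for_filter(files, 60, 100)]

# LIMIT 150 is satisfied by the first two files.
limited = [f.path for f in prune_for_limit(files, 150)]

print(filtered)  # ['b.csv']
print(limited)   # ['a.csv', 'b.csv']
```

In both cases the decision is made from stored statistics alone, without opening the files that are skipped.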
> Do you see any potential conflicts between your metadata model and the
above provided schema model?
Looks like there shouldn't be any conflicts.
----------------------------------------------------------------
> Implement metadata usage for text format plugin
> -----------------------------------------------
>
> Key: DRILL-7330
> URL: https://issues.apache.org/jira/browse/DRILL-7330
> Project: Apache Drill
> Issue Type: Sub-task
> Reporter: Arina Ielchiieva
> Assignee: Vova Vysotskyi
> Priority: Major
> Fix For: 1.18.0
>
>
> 1. Change the current group scan to leverage Schema from Metastore;
> 2. Use stats for enabling additional logical planning rules for text format
> plugin. It will enable such optimizations as limit, filter push and so on.
> + add possibility to pass schema through schema file (using path or table
> root), inline.
> + check for other enhancements in analyze command