[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

GitBox Mon, 17 Feb 2020 03:37:09 -0800

vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380124653


 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java
 ##########
 @@ -237,6 +238,18 @@ private IterOutcome internalNext() {
       logger.trace("currentReader.next return recordCount={}", recordCount);
      Preconditions.checkArgument(recordCount >= 0, "recordCount from RecordReader.next() should not be negative");
      boolean isNewSchema = mutator.isNewSchema();
+      // adds additional record for the case of making scan for obtaining metadata if required
+      if (implicitValues != null) {
+        String projectMetadataColumn = context.getOptions().getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;
+        if (recordCount > 0) {
+          // sets implicit value to false to signalize that some results were returned and there is no need for creating additional record
 
 Review comment:
   Thanks, updated the comment and added more details.
   
   Regarding the concept of the additional record, I will try to explain how 
Metastore collects the data in the general case; it may help to understand the 
reason for this decision.
   
   Drill Metastore may collect metadata for every file or row group, so 
aggregation calls for every column, grouped by the `fqn`, `rgi`, `dirX`... 
columns, were added.
   This approach works fine for non-empty files and row groups, but when an 
empty file is present, the Scan passes no data to the aggregation, so Metastore 
was ignoring such files.
   To solve this problem, I have added logic that returns a single record, with 
the correct values of the implicit columns, when no data was read. The 
additional implicit column helps to distinguish such records, so that info 
about row count, schema, etc. can still be collected for empty files.
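
   The effect described above can be illustrated with a minimal, self-contained 
sketch. It is not the Drill implementation: the class `EmptyFileMarkerSketch`, 
the `Row` type, and the `scan` and `rowCountPerFile` helpers are hypothetical 
names, and the marker flag only stands in for the additional implicit column. 
It shows that a group-by over scanned rows never produces a group for an empty 
file unless the scan emits one placeholder row for it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Drill.
class EmptyFileMarkerSketch {

  /** One scanned row: the file it came from (like `fqn`) and a marker flag. */
  static final class Row {
    final String fqn;
    final boolean isMarker; // true for the placeholder row of an empty file

    Row(String fqn, boolean isMarker) {
      this.fqn = fqn;
      this.isMarker = isMarker;
    }
  }

  /**
   * Simulates scanning one file. Emits one row per record; if the file is
   * empty and emitMarker is set, emits a single placeholder row instead.
   */
  static List<Row> scan(String fqn, int rowsInFile, boolean emitMarker) {
    List<Row> out = new ArrayList<>();
    for (int i = 0; i < rowsInFile; i++) {
      out.add(new Row(fqn, false));
    }
    if (rowsInFile == 0 && emitMarker) {
      out.add(new Row(fqn, true));
    }
    return out;
  }

  /** Groups rows by fqn and counts only real rows; marker rows count as 0. */
  static Map<String, Long> rowCountPerFile(List<Row> rows) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (Row r : rows) {
      counts.merge(r.fqn, r.isMarker ? 0L : 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Without the placeholder row, the empty file produces no group at all.
    List<Row> without = new ArrayList<>();
    without.addAll(scan("a.parquet", 2, false));
    without.addAll(scan("empty.parquet", 0, false));
    System.out.println(rowCountPerFile(without)); // {a.parquet=2}

    // With the placeholder row, the empty file shows up with a count of 0.
    List<Row> with = new ArrayList<>();
    with.addAll(scan("a.parquet", 2, true));
    with.addAll(scan("empty.parquet", 0, true));
    System.out.println(rowCountPerFile(with)); // {a.parquet=2, empty.parquet=0}
  }
}
```

   In this sketch the aggregation has to skip marker rows when counting, which 
is the role the comment above assigns to the additional implicit column.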

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services