[ https://issues.apache.org/jira/browse/DRILL-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149067#comment-15149067 ]
ASF GitHub Bot commented on DRILL-4287: --------------------------------------- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/376#discussion_r53056521 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -529,6 +549,36 @@ public long getRowCount() { } + + // Create and return a new file selection based on reading the metadata cache file. + // This function also initializes a few of ParquetGroupScan's fields as appropriate. + private FileSelection + initFromMetadataCache(DrillFileSystem fs, FileSelection selection) throws IOException { + FileStatus metaRootDir = selection.getFirstPath(fs); + Path metaFilePath = new Path(metaRootDir.getPath(), Metadata.METADATA_FILENAME); + + // get (and set internal field) the metadata for the directory by reading the metadata file + this.parquetTableMetadata = Metadata.readBlockMeta(fs, metaFilePath.toString()); + List<String> fileNames = Lists.newArrayList(); + for (Metadata.ParquetFileMetadata file : parquetTableMetadata.getFiles()) { + fileNames.add(file.getPath()); + } + // when creating the file selection, set the selection root in the form /a/b instead of + // file:/a/b. The reason is that the file names above have been created in the form + // /a/b/c.parquet and the format of the selection root must match that of the file names + // otherwise downstream operations such as partition pruning can break. + final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath()); + this.selectionRoot = metaRootPath.toString(); + + // Use the FileSelection constructor directly here instead of the FileSelection.create() method + // because create() changes the root to include the scheme and authority; In future, if create() + // is the preferred way to instantiate a file selection, we may need to do something different... + FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString()); --- End diff -- @jinfengni, as part of this PR I moved the expansion from ParquetFormatPlugin to ParquetGroupScan for the metadata cache but did not change the call to create FileSelection. I agree that the inconsistency could be a problem. As @adeneche points out, DRILL-4380 has the background for the change, but we really need to fix the FileSelection.create() interface, which means we should prioritize DRILL-4381. > Do lazy reading of parquet metadata cache file > ---------------------------------------------- > > Key: DRILL-4287 > URL: https://issues.apache.org/jira/browse/DRILL-4287 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization > Affects Versions: 1.4.0 > Reporter: Aman Sinha > Assignee: Jinfeng Ni > > Currently, the parquet metadata cache file is read eagerly during creation of > the DrillTable (as part of ParquetFormatMatcher.isReadable()). This is not > desirable from performance standpoint since there are scenarios where we want > to do some up-front optimizations - e.g. directory-based partition pruning > (see DRILL-2517) or potential limit 0 optimization etc. - and in such > situations it is better to do lazy reading of the metadata cache file. > This is a placeholder to perform such delayed reading since it is needed for > the aforementioned optimizations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)