Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/824#discussion_r122602595
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java
---
@@ -264,15 +275,18 @@ private ParquetTableMetadata_v3 getParquetTableMetadata(List<FileStatus> fileSta
   /**
    * Get a list of file metadata for a list of parquet files
    *
-   * @param fileStatuses
-   * @return
+   * @param parquetTableMetadata_v3 can store column schema info from all the files and row groups
+   * @param fileStatuses list of the parquet files statuses
+   * @param absolutePathInMetadata true if result metadata files should contain absolute paths, false for relative paths.
+   *                               Relative paths in the metadata are only necessary while creating meta cache files.
+   * @return list of the parquet file metadata (parquet metadata for every file)
    * @throws IOException
    */
-  private List<ParquetFileMetadata_v3> getParquetFileMetadata_v3(
-      ParquetTableMetadata_v3 parquetTableMetadata_v3, List<FileStatus> fileStatuses) throws IOException {
+  private List<ParquetFileMetadata_v3> getParquetFileMetadata_v3(ParquetTableMetadata_v3 parquetTableMetadata_v3,
+      List<FileStatus> fileStatuses, boolean absolutePathInMetadata) throws IOException {
--- End diff --
Is this really needed? Or, is it an attempt to answer my earlier concern
about compatibility?
Only newer Drill instances will create metadata. If we want relative paths,
then we should always use relative paths. No need to pass along a flag.
On the other hand, if we are saying that the root call is absolute (as seen
in the code earlier), but subdirectories are relative, then doesn't the
presence of even one absolute directory name make the whole feature invalid?
Perhaps some more background explanation in the PR comments (or even a
design spec) might shed some light on what we are trying to accomplish here.
It is very hard to reverse engineer a design from code changes alone.
Also, below, we have a method to convert relative paths to absolute in
bulk. Should we do the same here? Always gather data in absolute form, then
convert it to relative just before serializing?
I wasn't sure why we are converting paths from relative to absolute. If we
are doing that because we use absolute paths internally, then it is OK to
gather absolute paths here. Convert them to relative just before writing if
that is easier.
Here, I'm referring to the note about the "proposed alternative solution".
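To make the suggestion concrete, here is a minimal sketch of "gather absolute, convert to relative just before serializing". It uses plain java.nio.file paths rather than Hadoop's Path, and the class and method names (RelativePathSketch, toRelativePaths, toAbsolutePaths) are hypothetical; nothing here is taken from Metadata.java.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class RelativePathSketch {

  // Hypothetical helper: rewrite absolute file paths as paths relative to
  // the table root. Intended to run once, just before the metadata is written.
  static List<String> toRelativePaths(String tableRoot, List<String> absolutePaths) {
    Path root = Paths.get(tableRoot);
    List<String> relative = new ArrayList<>();
    for (String p : absolutePaths) {
      relative.add(root.relativize(Paths.get(p)).toString());
    }
    return relative;
  }

  // Hypothetical inverse: resolve relative paths against the table root.
  // Intended to run once, right after the metadata is read back.
  static List<String> toAbsolutePaths(String tableRoot, List<String> relativePaths) {
    Path root = Paths.get(tableRoot);
    List<String> absolute = new ArrayList<>();
    for (String p : relativePaths) {
      absolute.add(root.resolve(p).toString());
    }
    return absolute;
  }
}
```

With the conversion confined to the read and write boundaries, the gathering code would only ever see absolute paths, so a method like getParquetFileMetadata_v3 would not need the boolean flag.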