[ 
https://issues.apache.org/jira/browse/DRILL-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053336#comment-16053336
 ] 

ASF GitHub Bot commented on DRILL-3867:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/824#discussion_r122602595
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -264,15 +275,18 @@ private ParquetTableMetadata_v3 getParquetTableMetadata(List<FileStatus> fileSta
       /**
        * Get a list of file metadata for a list of parquet files
        *
    -   * @param fileStatuses
    -   * @return
    +   * @param parquetTableMetadata_v3 can store column schema info from all the files and row groups
    +   * @param fileStatuses list of the parquet files statuses
    +   * @param absolutePathInMetadata true if result metadata files should contain absolute paths, false for relative paths.
    +   *                               Relative paths in the metadata are only necessary while creating meta cache files.
    +   * @return list of the parquet file metadata (parquet metadata for every file)
        * @throws IOException
        */
    -  private List<ParquetFileMetadata_v3> getParquetFileMetadata_v3(
    -      ParquetTableMetadata_v3 parquetTableMetadata_v3, List<FileStatus> fileStatuses) throws IOException {
    +  private List<ParquetFileMetadata_v3> getParquetFileMetadata_v3(ParquetTableMetadata_v3 parquetTableMetadata_v3,
    +      List<FileStatus> fileStatuses, boolean absolutePathInMetadata) throws IOException {
    --- End diff --
    
    Is this really needed? Or is it an attempt to answer my earlier concern
    about compatibility?
    
    Only newer Drill instances will create metadata. If we want relative
    paths, then we should always use relative paths. No need to pass along a
    flag.
    
    On the other hand, if we are saying that the root call is absolute (as
    seen in the code earlier) but subdirectories are relative, then doesn't
    the presence of even one absolute directory name make the whole feature
    invalid?
    
    Perhaps some more background explanation in the PR comments (or even a
    design spec) would shed some light on what we are trying to accomplish
    here. It is very hard to reverse engineer a design from code changes
    alone...
    
    Also, below, we have a method to convert relative paths to absolute in
    bulk. Should we do the same here? Always gather data in absolute form,
    then convert it to relative just before serializing?
    
    I wasn't sure why we are converting paths from relative to absolute. If
    we are doing that because we use absolute paths internally, then it is OK
    to gather absolute paths here. Convert them to relative just before
    writing if that is easier.
    
    Here, I'm referring to the note about the "proposed alternative solution".
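
    To make the "gather absolute, convert to relative just before
    serializing" idea concrete, here is a minimal sketch. It is not Drill
    code: the class MetadataPathSketch and its toRelative/toAbsolute helpers
    are hypothetical names, and plain strings stand in for Hadoop Path
    objects. The paths come from the reproduction steps in the issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Drill code: all names here are invented for illustration.
class MetadataPathSketch {

  // Normalize the table root so it always ends with exactly one separator.
  private static String rootPrefix(String tableRoot) {
    return tableRoot.endsWith("/") ? tableRoot : tableRoot + "/";
  }

  // Strip the table root from an absolute path, e.g. just before serializing.
  static String toRelative(String tableRoot, String absolutePath) {
    String prefix = rootPrefix(tableRoot);
    if (!absolutePath.startsWith(prefix)) {
      throw new IllegalArgumentException(absolutePath + " is not under " + tableRoot);
    }
    return absolutePath.substring(prefix.length());
  }

  // Bulk variant: convert every gathered absolute path in one pass.
  static List<String> toRelative(String tableRoot, List<String> absolutePaths) {
    List<String> relative = new ArrayList<>();
    for (String path : absolutePaths) {
      relative.add(toRelative(tableRoot, path));
    }
    return relative;
  }

  // Re-anchor a stored relative path at the table's current root when reading.
  static String toAbsolute(String tableRoot, String relativePath) {
    return rootPrefix(tableRoot) + relativePath;
  }

  public static void main(String[] args) {
    // Path as gathered when the cache file is created.
    String stored = toRelative(
        "/drill/testdata/metadata_caching/lineitem",
        "/drill/testdata/metadata_caching/lineitem/2006/1");
    System.out.println(stored); // prints 2006/1

    // After "hadoop fs -mv", resolve against the new table location.
    System.out.println(toAbsolute("/drill/lineitem", stored)); // prints /drill/lineitem/2006/1
  }
}
```

    With this shape, the cache file would hold only "2006/1", so moving the
    table directory would not invalidate the stored entries, and no flag
    would need to be threaded through the gathering code.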


> Store relative paths in metadata file
> -------------------------------------
>
>                 Key: DRILL-3867
>                 URL: https://issues.apache.org/jira/browse/DRILL-3867
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.2.0
>            Reporter: Rahul Challapalli
>            Assignee: Vitalii Diravka
>             Fix For: Future
>
>
> git.commit.id.abbrev=cf4f745
> git.commit.time=29.09.2015 @ 23\:19\:52 UTC
> The below sequence of steps reproduces the issue
> 1. Create the cache file
> {code}
> 0: jdbc:drill:zk=10.10.103.60:5181> refresh table metadata dfs.`/drill/testdata/metadata_caching/lineitem`;
> +-------+-------------------------------------------------------------------------------------+
> |  ok   |                                       summary                                       |
> +-------+-------------------------------------------------------------------------------------+
> | true  | Successfully updated metadata for table /drill/testdata/metadata_caching/lineitem.  |
> +-------+-------------------------------------------------------------------------------------+
> 1 row selected (1.558 seconds)
> {code}
> 2. Move the directory
> {code}
> hadoop fs -mv /drill/testdata/metadata_caching/lineitem /drill/
> {code}
> 3. Now run a query on top of it
> {code}
> 0: jdbc:drill:zk=10.10.103.60:5181> select * from dfs.`/drill/lineitem` limit 1;
> Error: SYSTEM ERROR: FileNotFoundException: Requested file maprfs:///drill/testdata/metadata_caching/lineitem/2006/1 does not exist.
> [Error Id: b456d912-57a0-4690-a44b-140d4964903e on pssc-66.qa.lab:31010] (state=,code=0)
> {code}
> This is expected, given that we are storing absolute file paths in the
> cache file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
