[ 
https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1479:
---------------------------------
    Labels: pull-request-available  (was: )

> Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-1479
>                 URL: https://issues.apache.org/jira/browse/HUDI-1479
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Code Cleanup
>            Reporter: Vinoth Chandar
>            Assignee: Udit Mehrotra
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>         Attachments: image-2021-01-05-10-00-35-187.png
>
>
> *Change #1*
> {code:java}
> public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr,
>     boolean useFileListingFromMetadata, boolean verifyListings,
>     boolean assumeDatePartitioning) throws IOException {
>   if (assumeDatePartitioning) {
>     return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
>   } else {
>     HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr,
>         "/tmp/", useFileListingFromMetadata, verifyListings, false, false);
>     return tableMetadata.getAllPartitionPaths();
>   }
> }
> {code}
> The above is the current implementation, where `HoodieTableMetadata.create()` 
> always creates a `HoodieBackedTableMetadata`. Instead, we should create a 
> `FileSystemBackedTableMetadata` directly when useFileListingFromMetadata==false. 
> This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
> *Change #2*
> On master, we have the `HoodieEngineContext` abstraction, which allows for 
> parallel execution. We should consider moving it to `hudi-common` (it's 
> doable) and then rework `FileSystemBackedTableMetadata` so that it can 
> do parallelized listings using the passed-in engine context, either 
> HoodieSparkEngineContext or HoodieJavaEngineContext. 
> HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized 
> code. We should take one pass and see if that can be reworked a bit as well. 
> Food for thought: 
> https://github.com/apache/hudi/pull/2398#discussion_r550711216
>  
> *Change #3*
> There are places where we call fs.listStatus() directly. We should make them 
> go through the HoodieTable.getMetadata()... route as well. Essentially, all 
> listing should be concentrated in `FileSystemBackedTableMetadata`.
> !image-2021-01-05-10-00-35-187.png!
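
A minimal sketch of the selection logic proposed in Change #1, assuming a factory that picks the implementation from the flag instead of always building the metadata-backed one. Every type here (`PartitionLister`, `FileSystemLister`, `IndexBackedLister`, `PartitionListers`) is a hypothetical stand-in written for illustration, not the actual Hudi classes or signatures, and it uses `java.nio.file` rather than the Hadoop `FileSystem` API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for HoodieTableMetadata; not the real Hudi interface.
interface PartitionLister {
  List<String> getAllPartitionPaths();
}

// Stand-in for a file-system-backed implementation: lists child directories directly.
class FileSystemLister implements PartitionLister {
  private final Path basePath;

  FileSystemLister(Path basePath) {
    this.basePath = basePath;
  }

  @Override
  public List<String> getAllPartitionPaths() {
    try (Stream<Path> children = Files.list(basePath)) {
      return children
          .filter(Files::isDirectory)
          .map(p -> basePath.relativize(p).toString())
          .sorted()
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

// Stand-in for a metadata-backed implementation: answers from a pre-built index.
class IndexBackedLister implements PartitionLister {
  private final List<String> indexedPartitions;

  IndexBackedLister(List<String> indexedPartitions) {
    this.indexedPartitions = new ArrayList<>(indexedPartitions);
  }

  @Override
  public List<String> getAllPartitionPaths() {
    return new ArrayList<>(indexedPartitions);
  }
}

final class PartitionListers {
  private PartitionListers() {
  }

  // Only build the index-backed lister when the caller asked for it.
  static PartitionLister create(Path basePath, List<String> index,
      boolean useFileListingFromMetadata) {
    return useFileListingFromMetadata
        ? new IndexBackedLister(index)
        : new FileSystemLister(basePath);
  }
}
```

With this shape, callers that pass `false` never pay for constructing the metadata-backed variant.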

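For Change #2, the idea of listing partitions through a passed-in engine context could look roughly like the sketch below. `EngineContext`, `LocalEngineContext`, and `ParallelListing` are hypothetical names introduced here; the `map` signature is an assumption and is not claimed to match `HoodieEngineContext`. The per-partition listing function is supplied by the caller, so a Spark-backed or plain Java context could be swapped in without changing the listing code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical, simplified engine abstraction; not the real HoodieEngineContext.
interface EngineContext {
  <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
}

// Local implementation backed by a fixed-size thread pool.
class LocalEngineContext implements EngineContext {
  @Override
  public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, parallelism));
    try {
      List<Future<O>> futures = new ArrayList<>();
      for (I item : data) {
        Callable<O> task = () -> func.apply(item);
        futures.add(pool.submit(task));
      }
      List<O> results = new ArrayList<>();
      for (Future<O> future : futures) {
        results.add(future.get()); // collected in submission order
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while listing", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Listing task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}

final class ParallelListing {
  private ParallelListing() {
  }

  // Lists the files of every partition through the supplied context.
  static Map<String, List<String>> listAll(EngineContext context, List<String> partitions,
      Function<String, List<String>> listPartition, int parallelism) {
    List<List<String>> files = context.map(partitions, listPartition, parallelism);
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (int i = 0; i < partitions.size(); i++) {
      result.put(partitions.get(i), files.get(i));
    }
    return result;
  }
}
```

Keeping the listing function as a parameter also makes it possible to exercise the code against an in-memory map instead of a real file system.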


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
