[ https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-1479:
---------------------------------
    Status: Closed  (was: Patch Available)

> Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-1479
>                 URL: https://issues.apache.org/jira/browse/HUDI-1479
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Code Cleanup
>            Reporter: Vinoth Chandar
>            Assignee: Udit Mehrotra
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>         Attachments: image-2021-01-05-10-00-35-187.png
>
>
> *Change #1*
> {code:java}
> public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr,
>     boolean useFileListingFromMetadata, boolean verifyListings,
>     boolean assumeDatePartitioning) throws IOException {
>   if (assumeDatePartitioning) {
>     return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
>   } else {
>     HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/",
>         useFileListingFromMetadata, verifyListings, false, false);
>     return tableMetadata.getAllPartitionPaths();
>   }
> }
> {code}
> is the current implementation, where `HoodieTableMetadata.create()` always
> creates `HoodieBackedTableMetadata`. Instead, we should create
> `FileSystemBackedTableMetadata` when useFileListingFromMetadata==false.
> This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
>
> *Change #2*
> On master, we have the `HoodieEngineContext` abstraction, which allows for
> parallel execution. We should consider moving it to `hudi-common` (it's
> doable) and then rework `FileSystemBackedTableMetadata` so that it can
> parallelize listings using the passed-in engine context, either
> HoodieSparkEngineContext or HoodieJavaEngineContext.
> HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized
> code. We should take one pass and see if that can be redone a bit as well.
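Change #1 above amounts to picking the implementation from the flag instead of always building the metadata-table-backed one. A minimal sketch of that selection follows. All types here (`TableMetadata`, `FileSystemBacked`, `MetadataTableBacked`, `create`) are hypothetical stand-ins for illustration and do not reproduce the actual Hudi classes or signatures.

```java
import java.util.Collections;
import java.util.List;

public class MetadataSelectionSketch {

    /** Hypothetical stand-in for HoodieTableMetadata; only the method discussed here. */
    interface TableMetadata {
        List<String> getAllPartitionPaths();
    }

    /** Stand-in for an implementation that lists the file system directly. */
    static final class FileSystemBacked implements TableMetadata {
        private final String basePath;

        FileSystemBacked(String basePath) {
            this.basePath = basePath;
        }

        @Override
        public List<String> getAllPartitionPaths() {
            // A real implementation would walk basePath on the file system.
            return Collections.emptyList();
        }
    }

    /** Stand-in for an implementation that reads listings from a metadata table. */
    static final class MetadataTableBacked implements TableMetadata {
        private final String basePath;

        MetadataTableBacked(String basePath) {
            this.basePath = basePath;
        }

        @Override
        public List<String> getAllPartitionPaths() {
            // A real implementation would query the metadata table for basePath.
            return Collections.emptyList();
        }
    }

    /** Assumed factory shape: choose the implementation from the flag. */
    static TableMetadata create(String basePath, boolean useFileListingFromMetadata) {
        return useFileListingFromMetadata
            ? new MetadataTableBacked(basePath)
            : new FileSystemBacked(basePath);
    }

    public static void main(String[] args) {
        TableMetadata direct = create("/tmp/table", false);
        TableMetadata viaMetadata = create("/tmp/table", true);
        System.out.println(direct.getClass().getSimpleName());
        System.out.println(viaMetadata.getClass().getSimpleName());
    }
}
```

With a factory of this shape, callers keep a single entry point, and the cost of setting up the metadata-table-backed implementation is paid only when the flag asks for it.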
> Food for thought:
> https://github.com/apache/hudi/pull/2398#discussion_r550711216
>
> *Change #3*
> There are places where we call fs.listStatus() directly. We should make them
> go through the HoodieTable.getMetadata()... route as well. Essentially, all
> listing should be concentrated in `FileSystemBackedTableMetadata`.
> !image-2021-01-05-10-00-35-187.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)