[ 
https://issues.apache.org/jira/browse/HDFS-17871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051488#comment-18051488
 ] 

ASF GitHub Bot commented on HDFS-17871:
---------------------------------------

XorSum commented on PR #8179:
URL: https://github.com/apache/hadoop/pull/8179#issuecomment-3744010793

   > Good corner case catch. Just wonder in what case that request one path 
which is not in mount point frequently. Thanks.
   
   Thanks for the question! This issue occurs not only when paths are 
completely unmounted, but also **when querying sibling directories** that don't 
have mount points. Let me provide a concrete example:
   
   ### Example
   
   Assume the default namespace is `ns1`, and we have mount points configured 
like this:
   
   ```
   ns2  /user/hive/warehouse/default.db
   ns2  /user/hive/warehouse/default.db/aaa
   ns2  /user/hive/warehouse/default.db/bbb
   ns2  /user/hive/warehouse/default.db/ccc
   ...
   ns2  /user/hive/warehouse/default.db/xxx
   ns2  /user/hive/warehouse/default.db/yyy
   ```
   
   
   Now, consider a query for path: `/user/hive/warehouse/default.db/zzz`
   
   ### What Happens with the Original Method
   
   1. `floorEntry("/user/hive/warehouse/default.db/zzz")` returns 
`/user/hive/warehouse/default.db/yyy`
   2. Check `isParentEntry()` → false (yyy is a sibling, not a parent)
   3. Call `lowerEntry()` → returns `/user/hive/warehouse/default.db/xxx`
   4. Check `isParentEntry()` → false
   5. Keep calling `lowerEntry()` → iterates through `xxx`, `www`, ..., `ccc`, 
`bbb`, `aaa`
   6. Finally finds `/user/hive/warehouse/default.db/` (the actual parent)
   
   The original method iterates through all sibling directories (yyy → xxx → 
... → aaa) before finding the parent mount point.
   
   However, the  improved method avoids this by walking up the path hierarchy 
directly:
   `/user/hive/warehouse/default.db/zzz` → `/user/hive/warehouse/default.db`.




> Performance Issue: `findDeepest()` in RBF traverses entire tree when parent 
> path doesn't exist
> ----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17871
>                 URL: https://issues.apache.org/jira/browse/HDFS-17871
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: rbf
>            Reporter: xorsum
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: BenchmarkFindDeepest.java
>
>
> h2. Problem
> The current implementation of 
> `org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver#findDeepest`
>  method in RBF has a performance issue: when the tree doesn't contain a 
> parent path for the queried path, the `while` loop iterates through *all 
> entries* in the TreeMap, resulting in `O(N)` time complexity instead of the 
> expected `O(log N)`.
> h2. Example
> Consider a mount table containing:
>  - `/user/hive/warehouse/db1.db/table1`
>  - `/user/hive/warehouse/db1.db/table2`
>  - `/user/hive/warehouse/db2.db/table3`
> When querying a path like `/xyz/random/path/file.txt`, the algorithm will:
>  - `floorEntry("/xyz/random/path/file.txt")` returns the largest entry in the 
> tree
>  - Check `isParentEntry()`: false
>  - Call `lowerEntry()` repeatedly until reaching the first entry in the tree
>  - Worst case: Iterates through ALL entries in `O(N)` time
> h2. Proposed Solution
> We can improve the `findDeepest` method by checking the parent path directly, 
> instead of using `floorEntry`.
> I benchmarked the performance of these two implementations.
> The benchmark results show that the improved method is 50x faster than the 
> original method.
> {quote}================================================================================
> BENCHMARK COMPARISON RESULTS
> ================================================================================
> 【Original Method - findDeepest】
> Total Execution Time: 1,924,288,819 ns (1924.29 ms)
> Total Lookup Count: 15,302,856
> Avg Time Per Query: 320,607.93 ns (0.32 ms)
> Avg Lookups Per Query: 2,549.63
> 【Improved Method - findDeepestImproved】
> Total Execution Time: 38,030,989 ns (38.03 ms)
> Total Lookup Count: 24,008
> Avg Time Per Query: 6,336.39 ns (0.01 ms)
> Avg Lookups Per Query: 4.00
> 【Performance Improvement】
> Time Saved: 98.02%
> Lookup Count Reduced: 99.84%
> Speedup Factor: 50.60x
> 【Correctness Verification】
> ✓ All results match! Both methods returned identical results.
> ================================================================================
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to