JWuCines opened a new issue, #19411:
URL: https://github.com/apache/druid/issues/19411

   ### Affected Version
   
   Druid 36.0.0 (likely affects all versions using `druid-hdfs-storage` 
extension with Kerberos)
   
   ### Description
   
   When using `index_parallel` with `HdfsInputSource` on a Kerberized HDFS 
cluster where the NameNode has KMS configured, the ingestion task attempts to 
acquire a KMS delegation token even though native ingestion does not need one. 
Druid authenticates directly via the Kerberos TGT and never interacts with KMS.
   
   **Root Cause:**
   
   `HdfsInputSource.getPaths()` uses `FileInputFormat.getSplits()` solely for 
HDFS glob/path expansion. As a side effect, `FileInputFormat.listStatus()` 
calls `TokenCache.obtainTokensForNamenodes()`, which calls 
`fs.addDelegationTokens()`. This cascades into 
`KMSClientProvider.getDelegationToken()` via 
`DelegationTokenIssuer.collectDelegationTokens()`. The acquired delegation 
tokens are never actually used by Druid's native ingestion.
   
   The KMS URI is discovered from the NameNode via RPC at runtime. Client-side 
configuration overrides (setting `hadoop.security.key.provider.path` and 
`dfs.encryption.key.provider.uri` to empty strings) therefore do not prevent 
the KMS contact attempt: Hadoop treats an empty string as "not set" and falls 
back to querying the NameNode.
   
   **Proposed fix:**
   
   Replace the `FileInputFormat`/`Job` usage in `HdfsInputSource.getPaths()` 
with direct `FileSystem.globStatus()` calls. This achieves the same glob 
expansion without entering the MapReduce `TokenCache` code path at all. For 
example:
   
   ```java
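   import java.io.IOException;
   import java.util.Collection;
   import java.util.Collections;
   import java.util.LinkedHashSet;
   import java.util.List;
   import java.util.Set;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public static Collection<Path> getPaths(List<String> inputPaths, Configuration configuration)
       throws IOException {
       if (inputPaths.isEmpty()) {
           return Collections.emptySet();
       }
       // LinkedHashSet keeps discovery order and drops duplicate matches.
       Set<Path> paths = new LinkedHashSet<>();
       for (String inputPath : inputPaths) {
           Path p = new Path(inputPath);
           FileSystem fs = p.getFileSystem(configuration);
           // Expand the glob directly; no Job or TokenCache is involved.
           FileStatus[] statuses = fs.globStatus(p);
           // globStatus returns null when a non-glob path does not exist.
           if (statuses != null) {
               for (FileStatus status : statuses) {
                   // Skip zero-length entries.
                   if (status.getLen() > 0) {
                       paths.add(status.getPath());
                   }
               }
           }
       }
       return paths;
   }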
   ```
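   
   One parity detail worth checking before adopting the snippet above: 
`FileInputFormat.listStatus()` applies a default hidden-file filter that skips 
names starting with `_` or `.` (for example `_SUCCESS` or `.crc` files), and a 
plain `fs.globStatus(p)` call does not. The following is a minimal, 
stdlib-only sketch of that name check. The class and method names 
(`HiddenNameFilter`, `isVisible`, `visibleNames`) are hypothetical and not part 
of Druid or Hadoop:
   
   ```java
   import java.util.List;
   import java.util.stream.Collectors;

   // Hypothetical helper, for illustration only.
   final class HiddenNameFilter {
       private HiddenNameFilter() {}

       // Same name rule as FileInputFormat's default hidden-file filter:
       // a name starting with "_" or "." is treated as hidden.
       static boolean isVisible(String fileName) {
           return !fileName.startsWith("_") && !fileName.startsWith(".");
       }

       // Keeps only the visible names, preserving input order.
       static List<String> visibleNames(List<String> fileNames) {
           return fileNames.stream()
                   .filter(HiddenNameFilter::isVisible)
                   .collect(Collectors.toList());
       }
   }
   ```
   
   In the proposed `getPaths()`, the check could be applied as 
`isVisible(status.getPath().getName())` before adding a path. Whether Druid 
should preserve this filtering is an open question for the fix, not something 
this sketch decides.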
   
   **Call chain leading to unnecessary KMS contact:**
   
   ```
   HdfsInputSource.getPaths()
     → FileInputFormat.getSplits()
       → FileInputFormat.listStatus()
         → TokenCache.obtainTokensForNamenodes()
           → fs.addDelegationTokens()
             → DelegationTokenIssuer.collectDelegationTokens()
               → KMSClientProvider.getDelegationToken()  ← unnecessary KMS contact
   ```
   
   **Relevant source:**
   - [`HdfsInputSource.java` (getPaths method)](https://github.com/apache/druid/blob/druid-36.0.0/extensions-core/hdfs-storage/src/main/java/org/apache/druid/inputsource/hdfs/HdfsInputSource.java#L133-L165)
   - [`FileInputFormat.java` (listStatus calling TokenCache)](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L257)
   - [`TokenCache.java` (obtainTokensForNamenodes)](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

