JWuCines opened a new issue, #19411:
URL: https://github.com/apache/druid/issues/19411
### Affected Version
Druid 36.0.0 (likely affects all versions that use the `druid-hdfs-storage`
extension with Kerberos)
### Description
When using `index_parallel` with `HdfsInputSource` on a Kerberized HDFS
cluster where the NameNode has KMS configured, the ingestion task attempts to
acquire a KMS delegation token even though native ingestion does not need one.
Druid authenticates directly via the Kerberos TGT and never interacts with KMS.
**Root Cause:**
`HdfsInputSource.getPaths()` uses `FileInputFormat.getSplits()` solely for
HDFS glob/path expansion. As a side effect, `FileInputFormat.listStatus()`
calls `TokenCache.obtainTokensForNamenodes()`, which calls
`fs.addDelegationTokens()`. This cascades into
`KMSClientProvider.getDelegationToken()` via
`DelegationTokenIssuer.collectDelegationTokens()`. The acquired delegation
tokens are never actually used by Druid's native ingestion.
The KMS URI is discovered from the NameNode via RPC at runtime, so
client-side configuration overrides (`hadoop.security.key.provider.path`,
`dfs.encryption.key.provider.uri` set to empty) do not prevent the KMS contact
attempt, because Hadoop treats empty strings as "not set" and falls back to
querying the NameNode.
**Proposed fix:**
Replace the `FileInputFormat`/`Job` usage in `HdfsInputSource.getPaths()`
with direct `FileSystem.globStatus()` calls. This achieves the same glob
expansion without entering the MapReduce `TokenCache` code path at all. For
example:
```java
public static Collection<Path> getPaths(List<String> inputPaths, Configuration configuration)
    throws IOException {
  if (inputPaths.isEmpty()) {
    return Collections.emptySet();
  }
  Set<Path> paths = new LinkedHashSet<>();
  for (String inputPath : inputPaths) {
    Path p = new Path(inputPath);
    FileSystem fs = p.getFileSystem(configuration);
    FileStatus[] statuses = fs.globStatus(p);
    if (statuses != null) {
      for (FileStatus status : statuses) {
        if (status.getLen() > 0) {
          paths.add(status.getPath());
        }
      }
    }
  }
  return paths;
}
```
**Call chain leading to unnecessary KMS contact:**
```
HdfsInputSource.getPaths()
  → FileInputFormat.getSplits()
    → FileInputFormat.listStatus()
      → TokenCache.obtainTokensForNamenodes()
        → fs.addDelegationTokens()
          → DelegationTokenIssuer.collectDelegationTokens()
            → KMSClientProvider.getDelegationToken()  ← unnecessary KMS contact
```
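To make the side effect concrete, here is a toy model of the two listing strategies in plain Java. It is only a sketch: `FakeFileSystem`, `listViaInputFormat`, and `listViaGlob` are hypothetical names invented for illustration, not Hadoop or Druid classes, and the "KMS contact" is just a counter. It assumes nothing beyond the call chain described above.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the call chain. None of these types are Hadoop classes;
 * they only mimic the shape of the two code paths.
 */
public class TokenSideEffectSketch {

  /** Hypothetical stand-in for a file system that can expand globs and issue tokens. */
  static class FakeFileSystem {
    // Records every simulated delegation-token request.
    final List<String> kmsContacts = new ArrayList<>();

    /** Pretends the glob pattern matched exactly two files. */
    List<String> glob(String pattern) {
      List<String> matches = new ArrayList<>();
      matches.add(pattern.replace("*", "part-0"));
      matches.add(pattern.replace("*", "part-1"));
      return matches;
    }

    /** Models the cascade that ends in a KMS delegation-token request. */
    void addDelegationTokens() {
      kmsContacts.add("getDelegationToken");
    }
  }

  /** Models the FileInputFormat-style path: tokens are collected before listing. */
  static List<String> listViaInputFormat(FakeFileSystem fs, String pattern) {
    fs.addDelegationTokens();
    return fs.glob(pattern);
  }

  /** Models the proposed path: glob expansion only, no token collection. */
  static List<String> listViaGlob(FakeFileSystem fs, String pattern) {
    return fs.glob(pattern);
  }

  public static void main(String[] args) {
    FakeFileSystem viaInputFormat = new FakeFileSystem();
    FakeFileSystem viaGlob = new FakeFileSystem();

    List<String> a = listViaInputFormat(viaInputFormat, "/data/*.json");
    List<String> b = listViaGlob(viaGlob, "/data/*.json");

    System.out.println("same paths: " + a.equals(b));
    System.out.println("KMS contacts via input format: " + viaInputFormat.kmsContacts.size());
    System.out.println("KMS contacts via glob: " + viaGlob.kmsContacts.size());
  }
}
```

In this model both strategies return the same paths, and only the input-format path records a token request, which mirrors the behavior reported in this issue.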
**Relevant source:**
- [`HdfsInputSource.java` (`getPaths` method)](https://github.com/apache/druid/blob/druid-36.0.0/extensions-core/hdfs-storage/src/main/java/org/apache/druid/inputsource/hdfs/HdfsInputSource.java#L133-L165)
- [`FileInputFormat.java` (`listStatus` calling `TokenCache`)](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L257)
- [`TokenCache.java` (`obtainTokensForNamenodes`)](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]