rajucomp commented on issue #1588:
URL: https://github.com/apache/stormcrawler/issues/1588#issuecomment-3695155846
@rzo1 I would like to take on this issue, but I’d appreciate some
clarification on a few observations regarding the current implementation.
### Observations
1. The behavior of `addValue`, `getValues`, `remove`, and `containsKey` is
inconsistent.
2. Specifically:
-
[`getValues`](https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/Metadata.java#L125)
and
[`containsKey`](https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/Metadata.java#L136)
normalize keys to lowercase.
-
[`addValues`](https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/Metadata.java#L185)
and
[`remove`](https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/Metadata.java#L212)
do not.
3. This results in inconsistent semantics. All key-based operations should
follow the same case-handling strategy.
### Prior Art
I reviewed how Apache Nutch approaches this ([PR
#777](https://github.com/apache/nutch/pull/777)). They use a `TreeMap` with a
case-insensitive comparator. While this ensures correctness, it changes the
time complexity of operations to `O(log N)`, which may not be ideal from a
performance standpoint.
### Proposed Options
1. Use `CaseInsensitiveMap` from Apache Commons Collections:
-
https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/map/CaseInsensitiveMap.html
- This preserves `HashMap`-like performance characteristics while
enforcing consistent, case-insensitive behavior.
- This would involve replacing the underlying `HashMap` with
`CaseInsensitiveMap`.
2. Introduce a dedicated implementation:
- Extend `Metadata` and introduce a `CaseInsensitiveMetadata` class.
- Override the constructor to initialize a case-insensitive backing map.
- This avoids changing existing behavior by default while offering a
clear alternative.
### Open Questions
- Should existing usages (`new Metadata()`) be migrated to
`CaseInsensitiveMetadata`, or should the behavior of `Metadata` itself be
changed?
- Are there any compatibility concerns or downstream assumptions about case
sensitivity that should be preserved?
I’d like to align on the preferred approach before starting work on this
issue. Thanks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]