lewismc commented on PR #825:
URL: https://github.com/apache/nutch/pull/825#issuecomment-3736568735
This PR now upgrades the `index-geoip` plugin to use MaxMind GeoIP2 Java API
5.0.2, with significant architectural improvements including support for
multiple database types and in-memory caching.
## Changes
### Dependency Updates
- `geoip2`: upgraded to **5.0.2**
- `maxmind-db`: upgraded to **4.0.2**
- `jackson-datatype-jsr310`: added **2.20.1** (new transitive dependency)
### Performance Improvement — CHMCache
Database readers now use `CHMCache` (ConcurrentHashMap Cache) from the
maxmind-db library for improved lookup performance:
```java
import com.maxmind.db.CHMCache;
import com.maxmind.geoip2.DatabaseReader;

DatabaseReader reader = new DatabaseReader.Builder(db)
    .withCache(new CHMCache())
    .build();
```
This caches decoded database records in memory, reducing repeated decoding
and disk I/O and improving throughput when the same IP prefixes are queried
repeatedly during indexing.
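To illustrate the idea behind a `ConcurrentHashMap`-backed cache, here is a minimal sketch. It is not the maxmind-db implementation: the class name `OffsetCacheSketch`, the integer offset key, and the "stop storing once full" policy are assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

/** Hypothetical, illustrative cache; not the library's CHMCache. */
final class OffsetCacheSketch<V> {
    private final int capacity;
    private final ConcurrentHashMap<Integer, V> entries;

    OffsetCacheSketch(int capacity) {
        this.capacity = capacity;
        this.entries = new ConcurrentHashMap<>(capacity);
    }

    /** Returns the cached value for an offset, decoding it on a miss. */
    V get(int offset, IntFunction<V> decoder) {
        V cached = entries.get(offset);
        if (cached != null) {
            return cached;
        }
        V decoded = decoder.apply(offset);
        // Assumed policy: once the cache is full, values are returned but not stored.
        if (decoded != null && entries.size() < capacity) {
            entries.putIfAbsent(offset, decoded);
        }
        return decoded;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        AtomicInteger decodes = new AtomicInteger();
        OffsetCacheSketch<String> cache = new OffsetCacheSketch<>(2);
        IntFunction<String> decoder = offset -> {
            decodes.incrementAndGet();
            return "record@" + offset;
        };
        cache.get(10, decoder);
        cache.get(10, decoder); // second call is served from the cache
        System.out.println("decodes=" + decodes.get()); // prints decodes=1
    }
}
```

The point of the sketch is only that repeated lookups for the same key skip the decode step; the real cache's keys, capacity, and eviction behaviour may differ.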
### New Configuration Options in `conf/nutch-default.xml`
The plugin now supports multiple database types simultaneously. Configure
each by setting its file path:
| Property | Description |
|----------|-------------|
| `index.geoip.db.anonymous` | Anonymous IP database — identifies VPNs, proxies, Tor exit nodes |
| `index.geoip.db.asn` | ASN database — autonomous system number and organization |
| `index.geoip.db.city` | City database — city, subdivision, country, continent, coordinates |
| `index.geoip.db.connection` | Connection Type database — Cable/DSL, Cellular, Corporate, Satellite |
| `index.geoip.db.domain` | Domain database — second-level domain for the IP |
| `index.geoip.db.isp` | ISP database — ISP name, organization, ASN |
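As a rough sketch of how these properties could be resolved into a set of enabled databases, the example below walks the six property names and keeps those with a non-blank path. The `DatabaseType` enum shape, the `configuredPaths` helper, the sample file names, and the rule that an unset property disables its database are all assumptions; `java.util.Properties` stands in for Nutch's `Configuration`.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Properties;

final class GeoIpConfigSketch {
    /** Hypothetical enum; the PR's DatabaseType may differ in names and shape. */
    enum DatabaseType {
        ANONYMOUS("index.geoip.db.anonymous"),
        ASN("index.geoip.db.asn"),
        CITY("index.geoip.db.city"),
        CONNECTION("index.geoip.db.connection"),
        DOMAIN("index.geoip.db.domain"),
        ISP("index.geoip.db.isp");

        final String property;

        DatabaseType(String property) {
            this.property = property;
        }
    }

    /** Collects the path of every database whose property is set and non-blank. */
    static Map<DatabaseType, String> configuredPaths(Properties conf) {
        Map<DatabaseType, String> paths = new EnumMap<>(DatabaseType.class);
        for (DatabaseType type : DatabaseType.values()) {
            String path = conf.getProperty(type.property, "").trim();
            if (!path.isEmpty()) {
                paths.put(type, path);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // stand-in for Nutch's Configuration
        conf.setProperty("index.geoip.db.city", "GeoIP2-City.mmdb");
        conf.setProperty("index.geoip.db.asn", "GeoLite2-ASN.mmdb");
        // Only the ASN and CITY entries are reported as configured.
        System.out.println(configuredPaths(conf));
    }
}
```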
### MaxMind Insights Web Service Support
| Property | Description |
|----------|-------------|
| `index.geoip.insights.userid` | User ID for MaxMind Precision Insights API |
| `index.geoip.insights.licensekey` | License key for the Insights API |
### Architecture Improvements
- Refactored to support multiple databases via `EnumMap<DatabaseType, DatabaseReader>`
- Each database type is loaded independently and queried in sequence
- Proper resource cleanup via `Closeable` implementation
- Graceful error handling per-database (one failure doesn't block others)
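
The pattern described in these bullets can be sketched roughly as follows. This is not the plugin's code: `MultiDatabaseSketch`, the two-value `DatabaseType` enum, and the `Reader` interface (a stand-in for `DatabaseReader`) are hypothetical names chosen so the example runs without the GeoIP2 library.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class MultiDatabaseSketch implements Closeable {
    /** Hypothetical subset of database types. */
    enum DatabaseType { ASN, CITY }

    /** Stand-in for a GeoIP2 DatabaseReader; close() defaults to a no-op. */
    interface Reader extends Closeable {
        String lookup(String ip) throws IOException;

        @Override
        default void close() throws IOException {
        }
    }

    private final Map<DatabaseType, Reader> readers = new EnumMap<>(DatabaseType.class);

    void register(DatabaseType type, Reader reader) {
        readers.put(type, reader);
    }

    /** Queries each reader in enum order; a failing reader is logged and skipped. */
    List<String> lookupAll(String ip) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<DatabaseType, Reader> entry : readers.entrySet()) {
            try {
                results.add(entry.getKey() + "=" + entry.getValue().lookup(ip));
            } catch (IOException ex) {
                System.err.println(entry.getKey() + " lookup failed: " + ex.getMessage());
            }
        }
        return results;
    }

    /** Closes every reader, logging failures so one bad close does not skip the rest. */
    @Override
    public void close() {
        for (Map.Entry<DatabaseType, Reader> entry : readers.entrySet()) {
            try {
                entry.getValue().close();
            } catch (IOException ex) {
                System.err.println(entry.getKey() + " close failed: " + ex.getMessage());
            }
        }
        readers.clear();
    }

    public static void main(String[] args) {
        try (MultiDatabaseSketch geo = new MultiDatabaseSketch()) {
            geo.register(DatabaseType.ASN, ip -> {
                throw new IOException("simulated corrupt database");
            });
            geo.register(DatabaseType.CITY, ip -> "Berlin");
            // The ASN failure is logged; the CITY result is still returned.
            System.out.println(geo.lookupAll("192.0.2.1")); // prints [CITY=Berlin]
        }
    }
}
```

Keying the readers by enum keeps iteration order stable and lets each database be enabled, queried, and closed independently of the others.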
## Files Modified
- `src/plugin/index-geoip/` — plugin source, tests, dependencies, and config
- `build.xml` — root build configuration
- `conf/nutch-default.xml` — new GeoIP configuration properties
- `src/plugin/build.xml` — plugin build configuration
- `src/plugin/indexer-solr/schema.xml` — Solr schema field definitions