lewismc commented on PR #825:
URL: https://github.com/apache/nutch/pull/825#issuecomment-3736568735

   This PR now upgrades the `index-geoip` plugin to use MaxMind GeoIP2 Java API 
5.0.2, with significant architectural improvements including support for 
multiple database types and in-memory caching.
   
   ## Changes
   
   ### Dependency Updates
   
   - `geoip2`: upgraded to **5.0.2**
   - `maxmind-db`: upgraded to **4.0.2**
   - `jackson-datatype-jsr310`: added **2.20.1** (new transitive dependency)
   
   ### Performance Improvement — CHMCache
   
   Database readers now use `CHMCache` (ConcurrentHashMap Cache) from the 
maxmind-db library for improved lookup performance:
   
   ```java
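   import com.maxmind.db.CHMCache;
   import com.maxmind.geoip2.DatabaseReader;
   // db is the .mmdb database (a File or InputStream); attach the cache at build time
   DatabaseReader reader = new DatabaseReader.Builder(db)
       .withCache(new CHMCache())
       .build();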
   ```
   
   This caches decoded database records in memory, avoiding repeated decoding
   and improving throughput when the same IP prefixes are queried repeatedly
   during indexing.
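
To make the caching idea concrete, here is a minimal sketch of a capacity-bounded `ConcurrentHashMap` cache with no eviction. It is not MaxMind's `CHMCache` implementation; the class name `BoundedLookupCache`, the loader signature, and the capacity handling are all assumptions chosen for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for a ConcurrentHashMap-backed lookup cache (not the MaxMind class). */
final class BoundedLookupCache<K, V> {
    private final int capacity;
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();

    BoundedLookupCache(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Returns the cached value, or computes it with the loader and stores it while there is room. */
    V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            // No eviction: once the map is full, later misses are recomputed every time.
            // The size check and the put are not atomic, so the capacity is approximate
            // under contention.
            if (value != null && entries.size() < capacity) {
                entries.putIfAbsent(key, value);
            }
        }
        return value;
    }

    int size() {
        return entries.size();
    }
}
```

A caller would wrap the expensive step, for example `cache.get(key, k -> decode(k))`, where `decode` is whatever costly lookup is being memoized.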
   
   ### New Configuration Options in `conf/nutch-default.xml`
   
   The plugin now supports multiple database types simultaneously. Configure 
each by setting its file path:
   
   | Property | Description |
   |----------|-------------|
   | `index.geoip.db.anonymous` | Anonymous IP database — identifies VPNs, proxies, Tor exit nodes |
   | `index.geoip.db.asn` | ASN database — autonomous system number and organization |
   | `index.geoip.db.city` | City database — city, subdivision, country, continent, coordinates |
   | `index.geoip.db.connection` | Connection Type database — Cable/DSL, Cellular, Corporate, Satellite |
   | `index.geoip.db.domain` | Domain database — second-level domain for the IP |
   | `index.geoip.db.isp` | ISP database — ISP name, organization, ASN |
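
As a rough illustration of how these properties could drive which databases are enabled, the sketch below maps each property name from the table to an enum constant and collects the ones that have a path set. The enum name `GeoIpDatabase`, the use of a plain `Map` in place of Nutch's configuration object, and the rule that a blank value means "disabled" are assumptions, not the plugin's actual code.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical enum pairing each database type with its property name from the table. */
enum GeoIpDatabase {
    ANONYMOUS("index.geoip.db.anonymous"),
    ASN("index.geoip.db.asn"),
    CITY("index.geoip.db.city"),
    CONNECTION("index.geoip.db.connection"),
    DOMAIN("index.geoip.db.domain"),
    ISP("index.geoip.db.isp");

    final String property;

    GeoIpDatabase(String property) {
        this.property = property;
    }

    /** Collects the file path of every database whose property has a non-blank value. */
    static EnumMap<GeoIpDatabase, String> configuredPaths(Map<String, String> conf) {
        EnumMap<GeoIpDatabase, String> paths = new EnumMap<>(GeoIpDatabase.class);
        for (GeoIpDatabase db : values()) {
            String path = conf.get(db.property);
            // Assumption: an unset or blank property means that database is not used.
            if (path != null && !path.trim().isEmpty()) {
                paths.put(db, path.trim());
            }
        }
        return paths;
    }
}
```

With only `index.geoip.db.city` set, `configuredPaths` would return a single `CITY` entry, so only that database would be opened.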
   
   ### MaxMind Insights Web Service Support
   
   | Property | Description |
   |----------|-------------|
   | `index.geoip.insights.userid` | User ID for MaxMind Precision Insights API |
   | `index.geoip.insights.licensekey` | License key for the Insights API |
   
   ### Architecture Improvements
   
   - Refactored to support multiple databases via `EnumMap<DatabaseType, DatabaseReader>`
   - Each database type is loaded independently and queried in sequence
   - Proper resource cleanup via `Closeable` implementation
   - Graceful error handling per-database (one failure doesn't block others)
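
The points above can be sketched roughly as follows: one reader per database type in an `EnumMap`, each opened independently so a failure is logged and skipped, and a `close()` that releases every reader. `ReaderRegistry`, its nested `Opener` interface, and the error-reporting style are hypothetical names and choices for illustration; in the plugin the reader type would be `DatabaseReader`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical holder for one reader per database type. */
final class ReaderRegistry<R extends Closeable> implements Closeable {
    enum DatabaseType { ANONYMOUS, ASN, CITY, CONNECTION, DOMAIN, ISP }

    /** Opens a reader for a database file; stands in for building a DatabaseReader. */
    interface Opener<T> {
        T open(String path) throws IOException;
    }

    private final EnumMap<DatabaseType, R> readers = new EnumMap<>(DatabaseType.class);

    /** Loads each configured database on its own, so one bad file does not block the rest. */
    ReaderRegistry(Map<DatabaseType, String> paths, Opener<R> opener) {
        for (Map.Entry<DatabaseType, String> entry : paths.entrySet()) {
            try {
                readers.put(entry.getKey(), opener.open(entry.getValue()));
            } catch (IOException ex) {
                System.err.println("Skipping " + entry.getKey() + " database: " + ex.getMessage());
            }
        }
    }

    boolean isLoaded(DatabaseType type) {
        return readers.containsKey(type);
    }

    /** Closes every reader, continuing past individual failures and rethrowing the first one. */
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (R reader : readers.values()) {
            try {
                reader.close();
            } catch (IOException ex) {
                if (first == null) {
                    first = ex;
                } else {
                    first.addSuppressed(ex);
                }
            }
        }
        readers.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

Keeping the failure handling inside the loop is what gives the per-database isolation: a missing ASN file leaves the City reader usable.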
   
   ## Files Modified
   
   - `src/plugin/index-geoip/` — plugin source, tests, dependencies, and config
   - `build.xml` — root build configuration
   - `conf/nutch-default.xml` — new GeoIP configuration properties
   - `src/plugin/build.xml` — plugin build configuration
   - `src/plugin/indexer-solr/schema.xml` — Solr schema field definitions
   