lewismc commented on PR #825:
URL: https://github.com/apache/nutch/pull/825#issuecomment-3776075746

   The most recent updates fix a field duplication issue that could occur
when chaining multiple GeoIP databases.
   Here is an example of running `indexchecker`:
   
   ```
   ./runtime/local/bin/nutch indexchecker https://nutch.apache.org
   ...
   accuracyRadius :     1000
   isPublicProxy :      false
   countryIsoCode :     US
   cityNetworkAddress : 151.101.0.0/21
   countryNetworkAddress :      151.101.0.0/21
   countryGeoNameId :   6252001
   autonomousSystemNumber :     54113
   title :      Apache Nutch™
   content :    Apache Nutch™
   Apache Nutch™
   Apache Nutch™
   Community
   Development
   Docs
   Download
   News
   The Apache Softwa
   isHostingProvider :  false
   isTorExitNode :      false
   digest :     09f55cdd88bb9a668023f96143ec9605
   host :       nutch.apache.org
   id : https://nutch.apache.org
   isAnycast :  false
   continentCode :      NA
   isLegitimateProxy :  false
   ip : 151.101.2.132
   timeZone :   America/Chicago
   isAnonymousVpn :     false
   isResidentialProxy : false
   autonomousSystemOrganization :       FASTLY
   url :        https://nutch.apache.org
   isAnonymous :        false
   tstamp :     Tue Jan 20 20:21:34 PST 2026
   latLon :     37.751,-97.822
   countryInEuropeanUnion :     false
   continentGeoNameId : 6255149
   countryName :        United States
   continentName :      North America
   asnNetworkAddress :  151.101.0.0/16
   ```
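
To spot the kind of field duplication described above, the `indexchecker` output can be scanned for field names that are printed more than once. The following is a minimal sketch, not part of the PR: it assumes every field value is printed on its own `name : value` line, and the function name `duplicated_fields` and the sample text (with a deliberately repeated `continentCode`) are hypothetical. Continuation lines of multi-line values such as `content` are skipped, although page text that happens to look like `word : value` would be miscounted.

```python
import re
from collections import Counter

# Assumed line shape: "<fieldName> :<whitespace><value>", one line per value.
FIELD_LINE = re.compile(r"^\s*(\w+) :\s*(.*)$")


def duplicated_fields(output: str) -> dict[str, int]:
    """Return {field name: count} for fields printed more than once."""
    counts = Counter()
    for line in output.splitlines():
        match = FIELD_LINE.match(line)
        if match:  # continuation lines of multi-line values do not match
            counts[match.group(1)] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical sample, made up for illustration: continentCode appears twice.
sample = """\
countryIsoCode :     US
continentCode :      NA
content :    Apache Nutch™
Community
Development
continentCode :      NA
"""

print(duplicated_fields(sample))
```

An empty result for the real output would suggest that no field was emitted twice.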
   
   Required configuration:
   
   ```
   <property>
     <name>store.ip.address</name>
     <value>true</value>
     <description>Enables us to capture the specific IP address
     (InetSocketAddress) of the host which we connect to via the given
     protocol. Currently supported by: protocol-ftp, protocol-http,
     protocol-okhttp, protocol-htmlunit, protocol-selenium.  Note that
     the IP address is required by the plugin index-geoip and when
     writing WARC files.
     </description>
   </property>
   
   <property>
     <name>plugin.includes</name>
     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|geoip)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
     <description>Regular expression naming plugin directory names to
     include.  Any plugin not matching this expression is excluded.
     By default Nutch includes plugins to crawl HTML and various other
     document formats via HTTP/HTTPS and indexing the crawled content
     into Solr.  More plugins are available to support more indexing
     backends, to fetch ftp:// and file:// URLs, for focused crawling,
     and many other use cases.
     </description>
   </property>
   
   <property>
     <name>index.geoip.db.asn</name>
     <value>GeoLite2-ASN.mmdb</value>
     <description>
     GeoIP2/GeoLite2 ASN database file (MMDB format).
     Provides autonomous system number and organization information.
     </description>
   </property>
   
   <property>
     <name>index.geoip.db.city</name>
     <value>GeoLite2-City.mmdb</value>
     <description>
     GeoIP2/GeoLite2 City database file (MMDB format).
     Provides city, subdivision, country, continent, and location data.
     </description>
   </property>
   
   <property>
     <name>index.geoip.db.country</name>
     <value>GeoLite2-Country.mmdb</value>
     <description>
     GeoIP2/GeoLite2 Country database file (MMDB format).
     Provides country, continent, and represented country information.
     This is a lighter-weight alternative to the City database when only
     country-level information is needed.
     </description>
   </property>
   ```
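
The two dependencies stated in the descriptions above (the `plugin.includes` expression has to match `index-geoip`, and `index-geoip` needs `store.ip.address` enabled) can be sanity-checked before a crawl. This is only a sketch under assumptions: it wraps the properties in a `<configuration>` root as in `nutch-site.xml`, uses Python's `re.fullmatch` as a stand-in for the way Nutch evaluates the expression in Java, and the helper names `load_properties` and `check_geoip_config` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def load_properties(xml_text: str) -> dict[str, str]:
    """Map each <property>'s <name> to its <value>."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): prop.findtext("value", "").strip()
        for prop in root.iter("property")
    }


def check_geoip_config(props: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # Assumption: a plugin is included only if the whole directory name matches.
    if not re.fullmatch(props.get("plugin.includes", ""), "index-geoip"):
        problems.append("plugin.includes does not match index-geoip")
    if props.get("store.ip.address", "false").lower() != "true":
        problems.append("store.ip.address must be true for index-geoip")
    return problems


# Abbreviated copy of the configuration shown above.
sample = """
<configuration>
  <property><name>store.ip.address</name><value>true</value></property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|geoip)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
"""

print(check_geoip_config(load_properties(sample)))
```

The check deliberately ignores the `index.geoip.db.*` properties, since the comment does not say which of the databases are mandatory.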


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
