[ 
https://issues.apache.org/jira/browse/NUTCH-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053183#comment-18053183
 ] 

ASF GitHub Bot commented on NUTCH-3064:
---------------------------------------

lewismc commented on PR #825:
URL: https://github.com/apache/nutch/pull/825#issuecomment-3776075746

   The most recent updates address a field duplication issue that could occur
   when chaining multiple GeoIP databases.
   Here is an example of running `indexchecker`:
   
   ```
   ./runtime/local/bin/nutch indexchecker https://nutch.apache.org
   ...
   accuracyRadius :     1000
   isPublicProxy :      false
   countryIsoCode :     US
   cityNetworkAddress : 151.101.0.0/21
   countryNetworkAddress :      151.101.0.0/21
   countryGeoNameId :   6252001
   autonomousSystemNumber :     54113
   title :      Apache Nutch™
   content :    Apache Nutch™
   Apache Nutch™
   Apache Nutch™
   Community
   Development
   Docs
   Download
   News
   The Apache Softwa
   isHostingProvider :  false
   isTorExitNode :      false
   digest :     09f55cdd88bb9a668023f96143ec9605
   host :       nutch.apache.org
   id : https://nutch.apache.org
   isAnycast :  false
   continentCode :      NA
   isLegitimateProxy :  false
   ip : 151.101.2.132
   timeZone :   America/Chicago
   isAnonymousVpn :     false
   isResidentialProxy : false
   autonomousSystemOrganization :       FASTLY
   url :        https://nutch.apache.org
   isAnonymous :        false
   tstamp :     Tue Jan 20 20:21:34 PST 2026
   latLon :     37.751,-97.822
   countryInEuropeanUnion :     false
   continentGeoNameId : 6255149
   countryName :        United States
   continentName :      North America
   asnNetworkAddress :  151.101.0.0/16
   ```
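
To illustrate the duplication issue: the City and Country databases both carry country-level values, so a naive chain of lookups could add a field such as `countryIsoCode` more than once. The following is a minimal Java sketch of one way to avoid that, keeping the first value seen for each field name. The class and method names (`GeoFieldMerge`, `mergeFirstWins`) are hypothetical, plain maps stand in for the database lookups, and this is not the code from the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GeoFieldMerge {

  /**
   * Merges per-database field maps in the order given.
   * The first value recorded for a field name wins; later duplicates are ignored.
   */
  public static Map<String, String> mergeFirstWins(List<Map<String, String>> lookups) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> fields : lookups) {
      for (Map.Entry<String, String> e : fields.entrySet()) {
        merged.putIfAbsent(e.getKey(), e.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Hypothetical stand-ins for the results of a City and a Country lookup.
    Map<String, String> city = new LinkedHashMap<>();
    city.put("countryIsoCode", "US");
    city.put("cityNetworkAddress", "151.101.0.0/21");

    Map<String, String> country = new LinkedHashMap<>();
    country.put("countryIsoCode", "US");
    country.put("countryNetworkAddress", "151.101.0.0/21");

    // countryIsoCode appears once in the merged result.
    System.out.println(mergeFirstWins(List.of(city, country)));
  }
}
```

A first-wins merge is only one option; the order of the list decides which database's value is kept when two disagree.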
   
   Required configuration:
   
   ```
   <property>
     <name>store.ip.address</name>
     <value>true</value>
     <description>Enables us to capture the specific IP address
     (InetSocketAddress) of the host which we connect to via the given
     protocol. Currently supported by: protocol-ftp, protocol-http,
     protocol-okhttp, protocol-htmlunit, protocol-selenium.  Note that
     the IP address is required by the plugin index-geoip and when
     writing WARC files.
     </description>
   </property>
   
   <property>
     <name>plugin.includes</name>
     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|geoip)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
     <description>Regular expression naming plugin directory names to
     include.  Any plugin not matching this expression is excluded.
     By default Nutch includes plugins to crawl HTML and various other
     document formats via HTTP/HTTPS and indexing the crawled content
     into Solr.  More plugins are available to support more indexing
     backends, to fetch ftp:// and file:// URLs, for focused crawling,
     and many other use cases.
     </description>
   </property>
   
   <property>
     <name>index.geoip.db.asn</name>
     <value>GeoLite2-ASN.mmdb</value>
     <description>
     GeoIP2/GeoLite2 ASN database file (MMDB format).
     Provides autonomous system number and organization information.
     </description>
   </property>
   
   <property>
     <name>index.geoip.db.city</name>
     <value>GeoLite2-City.mmdb</value>
     <description>
     GeoIP2/GeoLite2 City database file (MMDB format).
     Provides city, subdivision, country, continent, and location data.
     </description>
   </property>
   
   <property>
     <name>index.geoip.db.country</name>
     <value>GeoLite2-Country.mmdb</value>
     <description>
     GeoIP2/GeoLite2 Country database file (MMDB format).
     Provides country, continent, and represented country information.
     This is a lighter-weight alternative to the City database when only
     country-level information is needed.
     </description>
   </property>
   ```
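
Because `plugin.includes` is a regular expression over plugin directory names, an edited value can be sanity-checked by testing plugin ids against it. The sketch below assumes the whole id must match the expression; the class name `PluginIncludesCheck` and its helper `isIncluded` are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {

  // The plugin.includes value from the configuration above.
  public static final String INCLUDES =
      "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|geoip)"
          + "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)";

  /** Returns true if the whole plugin id matches the includes expression. */
  public static boolean isIncluded(String includesRegex, String pluginId) {
    return Pattern.compile(includesRegex).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    // "index-more" is used here only as an id that the expression does not list.
    for (String id : new String[] {"index-geoip", "protocol-http", "index-more"}) {
      System.out.println(id + " included: " + isIncluded(INCLUDES, id));
    }
  }
}
```

If `index-geoip` does not match, the plugin is not loaded and none of the GeoIP fields appear in the `indexchecker` output.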




> Upgrade index-geoip to GeoIP2 5.0.2
> -----------------------------------
>
>                 Key: NUTCH-3064
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3064
>             Project: Nutch
>          Issue Type: Task
>          Components: index-geoip, plugin
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.22
>
>
> A recent mailing list question about the index-geoip plugin prompted me to 
> take a look at it and perform any necessary maintenance. 
> As of writing, the latest dependency can be found at 
> [https://central.sonatype.com/artifact/com.maxmind.geoip2/geoip2] at v4.2.0.
> At a minimum this ticket will accomplish the dependency update. I'll also 
> have a look at documentation and maybe provide some unit tests... which I 
> neglected to furnish last time around.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
