Hi all, I am having some issues with the "http.accept.language" property and the "urlfilter-regex" plugin. My goal is to collect only English web pages and to exclude all "wikipedia" pages.
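To make the second goal concrete, here is a minimal Python sketch of how I understand regex-urlfilter rules to be evaluated. This is an illustration under my own assumptions, not Nutch code: rules are tried in order, the first pattern found in the URL decides, and a URL that matches no rule is dropped. The function name `accepts`, the catch-all `+.` rule, and the sample URLs are made up for the example.

```python
import re

# Hypothetical rule list mirroring regex-urlfilter.txt: '-' rejects, '+' accepts.
RULES = [
    ("-", re.compile(r"^.*wikipedia.*$")),  # the exclusion rule I am trying to use
    ("+", re.compile(r".")),                # assumed catch-all accept rule
]


def accepts(url: str) -> bool:
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


if __name__ == "__main__":
    for url in ("https://en.wikipedia.org/wiki/Apache_Nutch",
                "https://nutch.apache.org/"):
        print(url, accepts(url))
```

If this reading is right, rule order matters: an accept rule placed above the exclusion rule would let Wikipedia URLs through before the exclusion is ever tried.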
1. I have added the following property to nutch-site.xml, but the results still contain many pages in other languages (zh, ca, fr, etc.). To be safe, I also made the same change in nutch-default.xml. I wonder whether I need to add a plugin in nutch-site.xml to make this work.

    <property>
      <name>http.accept.language</name>
      <value>en-us,en-gb,en</value>
      <description>Value of the "Accept-Language" request header field.
      This allows selecting non-English language as default one to retrieve.
      It is a useful setting for search engines build for certain national group.
      </description>
    </property>

2. With respect to "urlfilter-regex", I have added the following configuration to nutch-site.xml and regex-urlfilter.txt, respectively.

In nutch-site.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
    </property>

In regex-urlfilter.txt:

    -^.*wikipedia.*$

Thanks,
Yongyao

--
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University