Hi all,

I am having some issues with the "http.accept.language" property and the
"urlfilter-regex" plugin. My goal is to collect only English web pages
and to skip all "wikipedia" pages.

1. I have added the following property to nutch-site.xml, but the results
still contain lots of pages in other languages ("zh", "ca", "fr", etc.). To
be safe, I also made the same change in nutch-default.xml. I wonder whether
I need to add a plugin to nutch-site.xml to make this work.

<property>
  <name>http.accept.language</name>
  <value>en-us,en-gb,en</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national
  group.
  </description>
</property>
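
As far as I can tell from the description, this property only sets a header on the outgoing request, so the server is free to ignore it. A rough Python sketch of what I think is being sent (this is not Nutch code; the function name and the example.com URL are placeholders I made up):

```python
import urllib.request

# Same value as in the http.accept.language property above.
ACCEPT_LANGUAGE = "en-us,en-gb,en"


def build_request(url):
    """Build (but do not send) a request carrying the Accept-Language header."""
    return urllib.request.Request(url, headers={"Accept-Language": ACCEPT_LANGUAGE})


# Placeholder URL, only used to inspect the headers that would be sent.
req = build_request("http://example.com/")
print(req.header_items())
```

If that picture is right, the header is a preference and not a filter, which might explain why non-English pages still show up.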

2. For "urlfilter-regex", I have added the following plugin.includes
setting to nutch-site.xml and the rule below it to regex-urlfilter.txt.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|*urlfilter-regex*
|parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
</property>

*-^.*wikipedia.*$*
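
My understanding is that regex-urlfilter.txt is read top to bottom and that the first rule whose pattern matches decides whether a URL is kept ("+") or dropped ("-"). Here is a small Python sketch of how I picture that logic. It is not Nutch code: the function names and example URLs are made up, and I am assuming the wikipedia rule sits above the "+." catch-all that ends the default file.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Assumed ordering: the wikipedia rule comes before the catch-all.
rules = parse_rules("-^.*wikipedia.*$\n+.")
print(accepts(rules, "https://en.wikipedia.org/wiki/Earth"))
print(accepts(rules, "https://example.org/page.html"))
```

If the order were reversed, with "+." first, the wikipedia rule would never be reached in this sketch, so the position of the rule in the file may matter.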

Thanks,
Yongyao


-- 
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University
