Hello,

During a fetch, the fetcher retrieved a page but failed to parse it, with
the following exception:

// url is masked ****
Error parsing: http://*********/validCode.asp:
org.apache.nutch.parse.ParseException: parser not found for contentType=image/bmp url=http://0086jia.com/include/validCode.asp
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:81)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:349)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:194)

I've configured both regex-urlfilter.txt:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$

and suffix-urlfilter.txt:

### prohibit these
# pictures
.gif
.jpg
.jpeg
.bmp
.png
.tif
.tiff
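
For what it's worth, here is a quick standalone check of the regex rule against a URL shaped like the failing one. It is only a sketch, not Nutch code: the class name SuffixFilterCheck, the method isSkipped, and the example.com host are made up for illustration, and it assumes the filter applies the pattern with find-style matching. It suggests the rule can never match this URL, because the URL ends in .asp even though the server reports image/bmp.

```java
import java.util.regex.Pattern;

public class SuffixFilterCheck {
    // The exclusion pattern from regex-urlfilter.txt, without the leading '-'.
    private static final Pattern SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz"
        + "|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$");

    // Returns true if the URL would be excluded by the suffix pattern.
    // Assumption: the pattern is searched for anywhere in the URL (find),
    // so the trailing '$' anchors it to the end of the URL string.
    public static boolean isSkipped(String url) {
        return SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        // Stand-in for the failing URL: same path, hypothetical host.
        System.out.println(isSkipped("http://example.com/include/validCode.asp")); // false
        // Hypothetical URL with a real .bmp suffix, for comparison.
        System.out.println(isSkipped("http://example.com/include/validCode.bmp")); // true
    }
}
```

So both filters look only at the URL text, and neither one sees the Content-Type returned by the server.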

Both plugins are listed in the "plugin.includes" property in nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|urlfilter-suffix|parse-(text|html|js|zip)|query-(basic|site|url)|index-basic|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>


My crawl is run as a loop of nutch inject/generate/fetch commands.

Am I missing some property I should configure in order to avoid
fetching/crawling content types I don't want? (The same goes for xml, jpeg,
and other file types.)

Thanks!

Eyal.
