Hi,

I am trying to use the Nutch fetcher to download EXE/ZIP files from web pages.
I've removed those suffixes from the exclusion rule in regex-urlfilter.txt and
automaton-urlfilter.txt (the two files are identical):


regex-urlfilter.txt:
--------------------------------------------------------------------------------------------------------
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|iso|ISO|bin|BIN)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.

------------------------------------------------------------------------------------------------------------------
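As a sanity check that the filter itself is not what blocks the EXE, here is a
minimal Java sketch of how I understand the rules to be applied. It assumes the
rules are tried in file order, that the first pattern found anywhere in the URL
decides, and that '-' rejects while '+' accepts. The class UrlFilterCheck and
the method accepts() are my own names, not part of Nutch, the example.com URL
is made up, and only the suffix rule, the repeated-segment rule and the
catch-all are modeled.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    // Rules in file order: {sign, regex}. Assumption: first matching rule wins.
    static final String[][] RULES = {
        {"-", "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|iso|ISO|bin|BIN)$"},
        {"-", ".*(/.+?)/.*?\\1/.*?\\1/"},
        {"+", "."},
    };

    // Returns true if the first rule whose pattern occurs in the URL is a '+' rule.
    static boolean accepts(String url) {
        for (String[] rule : RULES) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0].equals("+");
            }
        }
        return false; // no rule matched at all
    }

    public static void main(String[] args) {
        // .exe is no longer in the suffix list, so this falls through to "+."
        System.out.println(accepts("http://www.xtodvd.com/apodvdcopy.exe")); // true
        // .gif is still excluded by the suffix rule
        System.out.println(accepts("http://example.com/logo.gif"));          // false
    }
}
```

If that reading is right, the EXE URL passes the URL filter, so the problem
would be somewhere after filtering.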

When I try to download an EXE file, for example:
http://www.xtodvd.com/apodvdcopy.exe

the fetch fails:
found segment crawl/segments/20070902084928
Fetching now the urls..
Fetcher: starting
Fetcher: segment: crawl/segments/20070902084928
Fetcher: threads: 1000
fetching http://www.xtodvd.com/apodvdcopy.exe
Error parsing: http://createdvd.net/apodvdcopy.exe: failed(2,200):
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/x-dosexec url=http://createdvd.net/apodvdcopy.exe
Fetcher: done

Fetching a ZIP file works, but how can I tell Nutch to save the ZIP to a
directory on the local file system? Do I need to write a plugin?

Thanks!

Eyal Edri
